# COVID-19 Risk Prediction System using Big Data Architecture

## Overview

Welcome to the **BigData-Architecture** repository! This project implements a comprehensive Big Data architecture to predict pandemic risk levels, focusing on COVID-19 data analysis. The system processes historical COVID-19 data, trains a machine learning model, and serves real-time risk predictions through an interactive dashboard, providing data-driven insights for risk assessment and decision-making.

## Architecture

The solution leverages several key Big Data technologies:

- **Storage**: Hadoop HDFS (partitioned Parquet files)
- **Processing**: Apache Spark for batch processing and machine learning
- **Streaming**: Kafka and Spark Streaming for real-time data pipelines
- **Database**: PostgreSQL for prediction storage
- **Visualization**: Streamlit dashboard and Grafana monitoring

## Features

- Conversion of CSV data to an optimized Parquet format with time-based partitioning
- Machine learning model (RandomForest) for risk classification
- Real-time data streaming pipeline with Kafka
- Interactive dashboards for risk visualization
- Geographic risk distribution with choropleth maps
- Time-series analysis of pandemic trends

## Components

### Data Processing

- `csv_to_parquet.py`: Converts raw COVID-19 CSV data to partitioned Parquet format in HDFS
- `risk_model_training.py`: Trains and saves a RandomForest classification model for risk prediction
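
For orientation, here is a minimal sketch of what the conversion step might look like. The HDFS paths, the year/month partition scheme, and the column handling are assumptions for illustration, not the repository's exact code:

```python
# Hypothetical sketch of csv_to_parquet.py -- paths, partition scheme,
# and column handling are assumptions, not the repository's exact code.
from datetime import date


def partition_values(d: date) -> dict:
    """Derive the partition column values used for time-based partitioning."""
    return {"year": d.year, "month": d.month}


def main() -> None:
    # PySpark imports are kept inside main() so the module imports cleanly
    # on machines without Spark installed.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("csv_to_parquet").getOrCreate()

    df = (
        spark.read.option("header", True)
        .option("inferSchema", True)
        .csv("hdfs:///data/pandemics/us-covid_19-2023.csv")
    )

    # Add year/month columns and write time-partitioned Parquet back to HDFS.
    (
        df.withColumn("date", F.to_date("date"))
        .withColumn("year", F.year("date"))
        .withColumn("month", F.month("date"))
        .write.mode("overwrite")
        .partitionBy("year", "month")
        .parquet("hdfs:///data/pandemics/parquet")
    )
    spark.stop()


if __name__ == "__main__":
    main()
```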

### Real-time Pipeline

- `risk_kafka_producer.py`: Reads data from HDFS and streams it to the Kafka topic `risk_data`
- `postgre_consumer.py`: Consumes the data stream, applies the ML model, and stores the predictions in PostgreSQL

### Visualization

- `streamlit_dashboard.py`: Interactive web dashboard for data exploration and visualization
- Grafana dashboards for monitoring and analytics

## Dataset

The project uses US COVID-19 data from 2023 with the following structure:

```
date, county, state, cases, deaths
```

The data is processed and augmented with risk scores and risk categories.
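
The exact risk-score formula lives in the Spark jobs and is not documented here; the snippet below is a purely illustrative stand-in showing how one row of this dataset might be augmented (the 10x death weighting and the per-100k scaling are invented assumptions, not the project's actual method):

```python
# Illustrative only: the weighting (deaths x10) and per-100k scaling are
# assumptions, not the project's actual risk formula.
import csv
import io


def augment(row: dict) -> dict:
    """Attach an assumed risk score in [0, 1] to one dataset row."""
    score = (int(row["cases"]) + 10 * int(row["deaths"])) / 100_000
    row["risk_score"] = min(score, 1.0)
    return row


sample = (
    "date,county,state,cases,deaths\n"
    "2023-01-15,Los Angeles,California,12345,150\n"
)
rows = [augment(r) for r in csv.DictReader(io.StringIO(sample))]
```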

## Getting Started

### Prerequisites

- Apache Hadoop
- Apache Spark
- Apache Kafka
- PostgreSQL
- Python 3.x with the required packages (pyspark, kafka-python, streamlit, pandas, plotly)

### Installation

1. Clone the repository and enter the project directory:
```bash
git clone https://github.com/Houssam-11/BigData-Architecture.git
cd BigData-Architecture
```

2. Set up your Hadoop environment (see the [Hadoop installation guide](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html) if Hadoop is not yet installed) and create the required HDFS directories:
```bash
hdfs dfs -mkdir -p /data/pandemics
hdfs dfs -mkdir -p /models
hdfs dfs -mkdir -p /checkpoints/pandemic_v2
```

3. Upload the COVID-19 dataset:
```bash
hdfs dfs -put us-covid_19-2023.csv /data/pandemics/
```

4. Install the Python dependencies:
```bash
pip install pyspark kafka-python streamlit pandas plotly psycopg2-binary us
```

5. Set up the PostgreSQL database:
```sql
CREATE DATABASE pandemic_db;
CREATE USER spark_user WITH PASSWORD '1234';
GRANT ALL PRIVILEGES ON DATABASE pandemic_db TO spark_user;

\c pandemic_db
CREATE TABLE risk_predictions (
    state TEXT,
    county TEXT,
    date DATE,
    risk_category INTEGER,
    predicted_risk_category INTEGER
);
GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO spark_user;
```

### Running the Pipeline

1. Process the CSV data into Parquet:
```bash
spark-submit csv_to_parquet.py
```

2. Train the risk prediction model:
```bash
spark-submit risk_model_training.py
```
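
The training script is invoked as above; the sketch below shows one plausible shape for `risk_model_training.py`, assuming `cases`/`deaths` features and a `risk_category` label. The `risk_category` helper and its thresholds are invented for illustration:

```python
# Hypothetical sketch of risk_model_training.py -- feature columns, label
# names, and thresholds are assumptions, not the actual implementation.

def risk_category(score: float) -> int:
    """Illustrative mapping of a normalized risk score to a category:
    0 = low, 1 = medium, 2 = high (threshold values are assumed)."""
    if score < 0.33:
        return 0
    if score < 0.66:
        return 1
    return 2


def main() -> None:
    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import RandomForestClassifier

    spark = SparkSession.builder.appName("risk_model_training").getOrCreate()
    df = spark.read.parquet("hdfs:///data/pandemics/parquet")

    # Assemble numeric features and fit the RandomForest classifier.
    assembler = VectorAssembler(inputCols=["cases", "deaths"],
                                outputCol="features")
    rf = RandomForestClassifier(labelCol="risk_category",
                                featuresCol="features", numTrees=100)
    model = Pipeline(stages=[assembler, rf]).fit(df)

    # Persist the fitted pipeline to HDFS for the streaming consumer to load.
    model.write().overwrite().save("hdfs:///models/risk_rf")
    spark.stop()


if __name__ == "__main__":
    main()
```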

3. Start Kafka and create the `risk_data` topic:
```bash
kafka-topics.sh --create --topic risk_data --bootstrap-server localhost:9092 --partitions 3 --replication-factor 1
```

4. Run the Kafka producer:
```bash
python risk_kafka_producer.py
```
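
Internally, the producer presumably JSON-encodes each record and publishes it to `risk_data`; here is a minimal kafka-python sketch under that assumption (the sample record is illustrative, and the real script reads its rows from HDFS):

```python
# Hypothetical sketch of risk_kafka_producer.py -- the real script streams
# rows from HDFS; the record source and field names here are assumptions.
import json


def serialize(record: dict) -> bytes:
    """Encode one record as UTF-8 JSON for the Kafka topic."""
    return json.dumps(record).encode("utf-8")


def main() -> None:
    from kafka import KafkaProducer  # kafka-python

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=serialize,
    )
    # In the real pipeline these rows come from the Parquet data in HDFS.
    sample = {"date": "2023-01-15", "county": "Los Angeles",
              "state": "California", "cases": 12345, "deaths": 150}
    producer.send("risk_data", sample)
    producer.flush()


if __name__ == "__main__":
    main()
```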

5. Run the Spark Streaming consumer:
```bash
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1,org.postgresql:postgresql:42.2.27 postgre_consumer.py
```
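
A condensed sketch of what `postgre_consumer.py` might look like, assuming the message fields match the producer and the JDBC options match the PostgreSQL setup above; everything not taken from those steps is an assumption:

```python
# Hypothetical sketch of postgre_consumer.py -- the schema, JDBC options,
# and model path are assumptions based on the setup steps above.
import json


def parse_payload(raw: bytes) -> dict:
    """Decode one Kafka message value into a record dict."""
    return json.loads(raw.decode("utf-8"))


def main() -> None:
    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StructType, StringType, IntegerType
    from pyspark.ml import PipelineModel

    spark = SparkSession.builder.appName("postgre_consumer").getOrCreate()
    schema = (StructType()
              .add("date", StringType())
              .add("county", StringType())
              .add("state", StringType())
              .add("cases", IntegerType())
              .add("deaths", IntegerType())
              .add("risk_category", IntegerType()))

    stream = (spark.readStream.format("kafka")
              .option("kafka.bootstrap.servers", "localhost:9092")
              .option("subscribe", "risk_data")
              .load()
              .select(F.from_json(F.col("value").cast("string"),
                                  schema).alias("r"))
              .select("r.*"))

    model = PipelineModel.load("hdfs:///models/risk_rf")

    def write_batch(batch_df, _epoch_id):
        # Apply the model, then append predictions to PostgreSQL over JDBC.
        scored = model.transform(batch_df)
        (scored.selectExpr("state", "county", "cast(date as date) as date",
                           "risk_category",
                           "cast(prediction as int) as predicted_risk_category")
         .write.format("jdbc")
         .option("url", "jdbc:postgresql://localhost:5432/pandemic_db")
         .option("dbtable", "risk_predictions")
         .option("user", "spark_user")
         .option("password", "1234")
         .mode("append")
         .save())

    (stream.writeStream.foreachBatch(write_batch)
     .option("checkpointLocation", "hdfs:///checkpoints/pandemic_v2")
     .start()
     .awaitTermination())


if __name__ == "__main__":
    main()
```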

6. Launch the Streamlit dashboard:
```bash
streamlit run streamlit_dashboard.py
```

The dashboard opens in your browser (Streamlit serves on `http://localhost:8501` by default) and refreshes as new predictions arrive, so you always see the latest insights.
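
As a rough sketch, `streamlit_dashboard.py` could read the predictions back from PostgreSQL and render them like this; the query, chart choices, and the state name-to-code mapping via the `us` package (from the dependency list) are assumptions:

```python
# Hypothetical sketch of streamlit_dashboard.py -- queries and chart details
# are assumptions; only the table and columns follow the schema shown above.

RISK_LABELS = {0: "Low", 1: "Medium", 2: "High"}


def label(category: int) -> str:
    """Human-readable label for a predicted risk category."""
    return RISK_LABELS.get(category, "Unknown")


def run_dashboard() -> None:
    # Heavy imports stay inside the function so the module imports cleanly.
    import pandas as pd
    import plotly.express as px
    import psycopg2
    import streamlit as st
    import us  # maps full state names to the 2-letter codes Plotly expects

    st.title("COVID-19 Risk Dashboard")
    conn = psycopg2.connect(dbname="pandemic_db", user="spark_user",
                            password="1234", host="localhost")
    df = pd.read_sql("SELECT * FROM risk_predictions", conn)
    df["risk_label"] = df["predicted_risk_category"].map(label)
    df["state_code"] = df["state"].map(lambda s: us.states.lookup(s).abbr)

    # Choropleth of the latest predicted risk level per state.
    latest = (df.sort_values("date")
                .groupby("state_code", as_index=False).last())
    st.plotly_chart(
        px.choropleth(latest, locations="state_code",
                      locationmode="USA-states",
                      color="predicted_risk_category", scope="usa")
    )


if __name__ == "__main__":
    run_dashboard()
```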

## Results

The final system provides:

- Risk classification with 96% accuracy
- Identification of high-risk pandemic zones
- Geographic visualization of risk distribution
- Time-based analysis of pandemic trends

## Future Improvements

- Integration with external data sources (weather, population density)
- Enhanced prediction models with deep learning
- Mobile application for real-time alerts
- Deployment to cloud infrastructure for scalability

## Contributing

Contributions from the community are welcome! To help improve the project:

1. Fork the repository.
2. Create a new branch for your feature or bug fix.
3. Make your changes and commit them with clear messages.
4. Push your branch to your forked repository.
5. Submit a pull request.

Please ensure that your code adheres to the project's coding standards and includes appropriate tests.

## License

This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.

Thank you for your interest in the **BigData-Architecture** project! Together, we can make a difference in understanding and mitigating pandemic risks.