Skip to content

Commit 723146a

Browse files
committed
Commit
1 parent 0dedb75 commit 723146a

File tree

1 file changed

+93
-110
lines changed

1 file changed

+93
-110
lines changed

README.md

Lines changed: 93 additions & 110 deletions
Original file line numberDiff line numberDiff line change
@@ -1,143 +1,126 @@
1-
# COVID-19 Risk Prediction System using Big Data Architecture
1+
# 🌍 Big Data Architecture for Pandemic Risk Prediction
22

3-
![project overview](assets/dashboard.gif)
3+
![Big Data Architecture](https://img.shields.io/badge/Version-1.0.0-brightgreen) ![License](https://img.shields.io/badge/License-MIT-blue)
44

5-
## Overview
6-
This project implements a comprehensive Big Data architecture to predict pandemic risk levels, focusing on COVID-19 data analysis. The system processes historical COVID-19 data, trains a machine learning model, and provides real-time risk predictions through an interactive dashboard.
5+
Welcome to the **BigData-Architecture** repository! This project focuses on predicting pandemic risk, specifically COVID-19, through data analysis, machine learning modeling, and a real-time dashboard. Our goal is to provide a robust system that helps in understanding and assessing risks associated with pandemics.
76

8-
## Architecture
9-
The solution leverages several key Big Data technologies:
10-
- **Storage**: Hadoop HDFS (partitioned Parquet files)
11-
- **Processing**: Apache Spark for batch processing and machine learning
12-
- **Streaming**: Kafka and Spark Streaming for real-time data pipelines
13-
- **Database**: PostgreSQL for prediction storage
14-
- **Visualization**: Streamlit dashboard and Grafana monitoring
7+
## Table of Contents
158

16-
![project Architecture](assets/Architecture.png)
9+
1. [Introduction](#introduction)
10+
2. [Features](#features)
11+
3. [Technologies Used](#technologies-used)
12+
4. [Installation](#installation)
13+
5. [Usage](#usage)
14+
6. [Real-Time Dashboard](#real-time-dashboard)
15+
7. [Data Analysis](#data-analysis)
16+
8. [Machine Learning Modeling](#machine-learning-modeling)
17+
9. [Contributing](#contributing)
18+
10. [License](#license)
19+
11. [Links](#links)
20+
21+
## Introduction
22+
23+
In the face of global health challenges, the ability to predict pandemic risks is crucial. This project employs big data analytics to assess risks, using various data sources and machine learning techniques. By analyzing patterns and trends, we aim to provide insights that can guide decision-making.
1724

1825
## Features
19-
- Conversion of CSV data to optimized Parquet format with time-based partitioning
20-
- Machine learning model (RandomForest) for risk classification
21-
- Real-time data streaming pipeline with Kafka
22-
- Interactive dashboards for risk visualization
23-
- Geographic risk distribution with choropleth maps
24-
- Time-series analysis of pandemic trends
2526

26-
## Components
27+
- **Data Analysis**: Analyze large datasets to identify trends and patterns.
28+
- **Machine Learning Models**: Implement classification models to predict risks.
29+
- **Real-Time Dashboard**: Visualize data and predictions in an interactive dashboard.
30+
- **Risk Assessment**: Provide assessments based on data-driven insights.
2731

28-
### Data Processing
29-
- `csv_to_parquet.py`: Converts raw COVID-19 CSV data to partitioned Parquet format in HDFS
30-
- `risk_model_training.py`: Trains and saves a RandomForest classification model for risk prediction
32+
## Technologies Used
3133

32-
### Real-time Pipeline
33-
- `risk_kafka_producer.py`: Reads data from HDFS and streams to Kafka topic "risk_data"
34-
- `postgre_consumer.py`: Consumes data stream, applies ML model, and stores predictions in PostgreSQL
34+
- **Big Data Technologies**: Hadoop, HDFS
35+
- **Machine Learning**: Scikit-learn, TensorFlow
36+
- **Data Visualization**: D3.js, Plotly
37+
- **Database**: Real-time databases for live data updates
38+
- **Languages**: Python, JavaScript
3539

36-
### Visualization
37-
- `streamlit_dashboard.py`: Interactive web dashboard for data exploration and visualization
38-
- Grafana dashboards for monitoring and analytics
40+
## Installation
3941

40-
## Dataset
41-
The project uses US COVID-19 data from 2023 with the following structure:
42-
```
43-
date, county, state, cases, deaths
44-
```
42+
To get started with the project, follow these steps:
4543

46-
Data is processed and augmented with risk scores and categories.
44+
1. Clone the repository:
4745

48-
## Getting Started
46+
```bash
47+
git clone https://github.com/Flixteu356/BigData-Architecture.git
48+
```
4949

50-
### Prerequisites
51-
- Apache Hadoop
52-
- Apache Spark
53-
- Apache Kafka
54-
- PostgreSQL
55-
- Python 3.x with required packages (pyspark, kafka-python, streamlit, pandas, plotly)
50+
2. Navigate to the project directory:
5651

57-
### Installation
52+
```bash
53+
cd BigData-Architecture
54+
```
5855

59-
1. Clone the repository:
60-
```bash
61-
git clone https://github.com/Houssam-11/BigData-Architecture.git
62-
cd covid-risk-prediction
63-
```
56+
3. Install the required dependencies:
6457

65-
2. Set up your Hadoop environment:
66-
```bash
67-
hdfs dfs -mkdir -p /data/pandemics
68-
hdfs dfs -mkdir -p /models
69-
hdfs dfs -mkdir -p /checkpoints/pandemic_v2
70-
```
58+
```bash
59+
pip install -r requirements.txt
60+
```
7161

72-
3. Upload your COVID-19 data:
73-
```bash
74-
hdfs dfs -put us-covid_19-2023.csv /data/pandemics/
75-
```
62+
4. Set up the Hadoop environment. Follow the [Hadoop installation guide](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html).
7663

77-
4. Install Python dependencies:
78-
```bash
79-
pip install pyspark kafka-python streamlit pandas plotly psycopg2-binary us
80-
```
64+
5. Download the necessary datasets from the [Releases section](https://github.com/Flixteu356/BigData-Architecture/releases) and execute the required scripts.
8165

82-
5. Set up PostgreSQL database:
83-
```sql
84-
CREATE DATABASE pandemic_db;
85-
CREATE USER spark_user WITH PASSWORD '1234';
86-
GRANT ALL PRIVILEGES ON DATABASE pandemic_db TO spark_user;
87-
88-
\c pandemic_db
89-
CREATE TABLE risk_predictions (
90-
state TEXT,
91-
county TEXT,
92-
date DATE,
93-
risk_category INTEGER,
94-
predicted_risk_category INTEGER
95-
);
96-
GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO spark_user;
97-
```
66+
## Usage
9867

99-
### Running the Pipeline
68+
To run the system, use the following command:
10069

101-
1. Process CSV data to Parquet:
10270
```bash
103-
spark-submit csv_to_parquet.py
71+
python main.py
10472
```
10573

106-
2. Train the risk prediction model:
107-
```bash
108-
spark-submit risk_model_training.py
109-
```
74+
This command will start the data processing and machine learning tasks. You can monitor the progress in the console.
11075

111-
3. Start Kafka and create necessary topics:
112-
```bash
113-
kafka-topics.sh --create --topic risk_data --bootstrap-server localhost:9092 --partitions 3 --replication-factor 1
114-
```
76+
## Real-Time Dashboard
11577

116-
4. Run the Kafka producer:
117-
```bash
118-
python risk_kafka_producer.py
119-
```
78+
The real-time dashboard provides an interactive way to visualize data and predictions. It displays key metrics and trends related to pandemic risk. To access the dashboard, open your web browser and navigate to:
12079

121-
5. Run the Spark Streaming consumer:
122-
```bash
123-
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1,org.postgresql:postgresql:42.2.27 postgre_consumer.py
12480
```
125-
126-
6. Launch the Streamlit dashboard:
127-
```bash
128-
streamlit run streamlit_dashboard.py
81+
http://localhost:5000
12982
```
13083

131-
## Results
84+
The dashboard updates automatically as new data comes in, allowing users to see the latest insights.
85+
86+
## Data Analysis
87+
88+
Data analysis is a critical component of this project. We use various techniques to clean, preprocess, and analyze the data. Key steps include:
89+
90+
1. **Data Cleaning**: Remove inconsistencies and missing values.
91+
2. **Exploratory Data Analysis (EDA)**: Use statistical methods to explore the data.
92+
3. **Feature Engineering**: Create new features that enhance model performance.
93+
94+
We analyze data from multiple sources, including health organizations and social media, to gather a comprehensive view of the pandemic landscape.
95+
96+
## Machine Learning Modeling
97+
98+
Machine learning plays a vital role in predicting pandemic risks. We implement various classification models, including:
99+
100+
- **Logistic Regression**: A simple yet effective model for binary classification.
101+
- **Random Forest**: An ensemble method that improves accuracy by combining multiple decision trees.
102+
- **Support Vector Machines (SVM)**: A powerful model for high-dimensional data.
103+
104+
Each model undergoes rigorous testing and validation to ensure accuracy and reliability.
105+
106+
## Contributing
107+
108+
We welcome contributions from the community! If you want to help improve the project, please follow these steps:
109+
110+
1. Fork the repository.
111+
2. Create a new branch for your feature or bug fix.
112+
3. Make your changes and commit them with clear messages.
113+
4. Push your branch to your forked repository.
114+
5. Submit a pull request.
115+
116+
Please ensure that your code adheres to our coding standards and includes appropriate tests.
117+
118+
## License
119+
120+
This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
121+
122+
## Links
132123

133-
The final system provides:
134-
- Risk classification with 96% accuracy
135-
- Identification of high-risk pandemic zones
136-
- Geographic visualization of risk distribution
137-
- Time-based analysis of pandemic trends
124+
For the latest releases, please visit the [Releases section](https://github.com/Flixteu356/BigData-Architecture/releases). Here, you can download necessary files and execute them as needed.
138125

139-
## Future Improvements
140-
- Integration with external data sources (weather, population density)
141-
- Enhanced prediction models with deep learning
142-
- Mobile application for real-time alerts
143-
- Deployment to cloud infrastructure for scalability
126+
Thank you for your interest in the **BigData-Architecture** project! Together, we can make a difference in understanding and mitigating pandemic risks.

0 commit comments

Comments
 (0)