A comprehensive data analytics project that analyzes pharmaceutical drug data from the OpenFDA database, focusing on drug ingredients, delivery routes, and manufacturer-specific insights with predictive modeling capabilities.
This project performs in-depth analysis of OpenFDA drug label data to extract meaningful insights about pharmaceutical products, including:
- Analysis of drug ingredient complexity over time
- Manufacturer-specific drug composition analysis (with focus on AstraZeneca)
- Delivery route analysis across all manufacturers
- Predictive modeling for future drug ingredient trends
- Drug interaction analysis
Part A: AstraZeneca Analysis
- Average number of ingredients in AstraZeneca medicines per year
- Temporal trends in drug complexity
- Statistical analysis with descriptive statistics
Part B: Cross-Manufacturer Analysis
- Average number of ingredients per year across all manufacturers
- Analysis by delivery route (oral, topical, intravenous, etc.)
- Comparative analysis of administration methods
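The per-year and per-route averages described above could be computed along these lines. This is a minimal sketch rather than the notebook's actual code: the column names year, route, and ingredient_count are assumptions, and the rows are made up for illustration.

```python
import pandas as pd

# Made-up rows; the column names are assumptions, not the notebook's schema.
df = pd.DataFrame({
    "year": [2019, 2019, 2020, 2020],
    "route": [["ORAL"], ["TOPICAL"], ["ORAL", "INTRAVENOUS"], ["ORAL"]],
    "ingredient_count": [2, 4, 3, 5],
})

# Average number of ingredients per year across all rows.
avg_per_year = df.groupby("year")["ingredient_count"].mean()

# A label can list several routes, so give each route its own row
# before averaging by route and year.
by_route = (
    df.explode("route")
      .groupby(["route", "year"])["ingredient_count"]
      .mean()
)
```

Exploding the route list first means a drug with two routes contributes to both route averages, which may or may not be the behavior wanted for a given comparison.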
Optional Advanced Analysis
- Linear regression model for predicting ingredient counts
- Drug interaction analysis for AstraZeneca products
- Text processing and pattern matching for drug interactions
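Pattern matching over the drug_interactions text could look something like the sketch below. The drug names, the sample text, and the find_mentions helper are all hypothetical; the notebook's own patterns may differ.

```python
import re

def find_mentions(text, drug_names):
    """Return the names from drug_names that appear in text as whole words."""
    found = []
    for name in drug_names:
        # \b keeps 'aspirin' from matching inside a longer word; case is ignored.
        pattern = r"\b" + re.escape(name) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(name)
    return found

# Invented label text, for illustration only.
sample = "Concomitant use with Warfarin or aspirin may increase bleeding risk."
mentions = find_mentions(sample, ["warfarin", "aspirin", "digoxin"])
```

re.escape is used so that names containing characters such as parentheses or dots are matched literally rather than interpreted as regex syntax.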
- Data Ingestion: Loads multiple OpenFDA JSON files (9 files total)
- Data Cleaning:
  - Removes invalid date formats
  - Handles missing values
  - Converts list data to strings
  - Calculates ingredient counts from product data elements
- Feature Engineering: Creates derived features like year extraction and ingredient counting
- Data Validation: Ensures data quality and consistency
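The cleaning and feature-engineering steps listed above might be sketched as follows. The YYYYMMDD date format is an assumption about effective_time, and the ingredients column and its counting rule are hypothetical, since the project's exact logic for counting ingredients from spl_product_data_elements is not shown here.

```python
import pandas as pd

# Made-up records; 'ingredients' is a hypothetical, already-parsed list column.
df = pd.DataFrame({
    "effective_time": ["20200115", "not-a-date", "20191230", None],
    "manufacturer_name": [["AstraZeneca"], ["Acme"], ["AstraZeneca"], ["Acme"]],
    "ingredients": [["a", "b"], ["c"], ["d", "e", "f"], ["g"]],
})

# Invalid or missing dates become NaT, then those rows are dropped.
df["effective_time"] = pd.to_datetime(
    df["effective_time"], format="%Y%m%d", errors="coerce"
)
df = df.dropna(subset=["effective_time"]).copy()

# Feature engineering: extract the year and count ingredients per record.
df["year"] = df["effective_time"].dt.year
df["ingredient_count"] = df["ingredients"].apply(len)

# Convert list-valued fields to plain strings.
df["manufacturer_name"] = df["manufacturer_name"].apply(", ".join)
```

Using errors="coerce" turns malformed dates into missing values instead of raising, so one dropna call handles both invalid and absent dates.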
- Python 3.7+
- Data Processing: pandas, numpy
- Visualization: matplotlib, seaborn
- Machine Learning: scikit-learn
- Data Format: JSON processing
- Text Processing: Regular expressions (re module)
# Clone the repository
git clone <repository-url>
cd openfda-drug-analysis
# Run the setup script
python setup.py

pip install pandas numpy matplotlib seaborn scikit-learn jupyter

pip install -r requirements.txt

The project expects OpenFDA drug label data files in the following format:
- drug-label-0001-of-0009.json through drug-label-0009-of-0009.json
- Files should be placed in a Data Source/ directory
Note: If you don't have the actual OpenFDA data files, you can still run the example script which uses synthetic data for demonstration purposes.
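Loading the files could look like the sketch below. It assumes each file follows the OpenFDA bulk-download layout, with the label records under a top-level results key; the load_label_files name is hypothetical and is not the project's load_openfda_data function.

```python
import json
from pathlib import Path

def load_label_files(data_dir):
    """Collect the 'results' records from every drug-label-*.json file in data_dir."""
    records = []
    # Sorting keeps the files in a stable 0001, 0002, ... order.
    for path in sorted(Path(data_dir).glob("drug-label-*.json")):
        with open(path, encoding="utf-8") as fh:
            payload = json.load(fh)
        # Assumption: the records live under the top-level "results" key.
        records.extend(payload.get("results", []))
    return records
```

For example, load_label_files("Data Source") would gather the records from all nine files into one list, which can then be passed to pandas.json_normalize to flatten nested fields such as openfda.route into columns.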
- With actual OpenFDA data:
  # Place data files in 'Data Source' directory
  jupyter notebook OpenFDA_DrugEndPointAnalysis.ipynb
- With synthetic data (for testing/demo):
  python example_usage.py
- Running tests:
  python test_openfda_analysis.py
- Total Records: 162,807 drug entries
- Date Range: 1978-2022
- Average Ingredients: ~2.9 ingredients per drug
- Correlation: Negligible negative correlation (-0.016) between year and ingredient count, i.e. essentially no linear trend over time
- Correlation matrices showing relationships between variables
- Time series plots of ingredient trends
- Distribution histograms of ingredient counts
- Pie charts showing delivery route frequencies
- Scatter plots for predictive model validation
- Model: Linear Regression
- Features: Year as predictor variable
- Target: Number of ingredients
- Performance Metrics:
  - Mean Absolute Error: ~2.07
  - Root Mean Squared Error: ~4.31
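A model of this shape can be fit and scored with scikit-learn roughly as follows. The data here is randomly generated, so the printed errors will not match the figures above; the split ratio and random seed are also assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: year as the only feature, ingredient count as target.
rng = np.random.default_rng(0)
years = rng.integers(1978, 2023, size=500)
counts = rng.poisson(3, size=500) + 1

X = years.reshape(-1, 1)  # scikit-learn expects a 2-D feature array
X_train, X_test, y_train, y_test = train_test_split(
    X, counts, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

mae = mean_absolute_error(y_test, pred)
rmse = np.sqrt(mean_squared_error(y_test, pred))
print(f"MAE: {mae:.2f}  RMSE: {rmse:.2f}")
```

Because the reported correlation between year and ingredient count is close to zero, a single-feature model like this is likely to predict little more than the overall mean, so the error metrics mostly reflect the spread of ingredient counts.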
/workspace/
├── README.md # This file
├── OpenFDA_DrugEndPointAnalysis.ipynb # Main analysis notebook
├── openfda_functions.py # Core analysis functions module
├── test_openfda_analysis.py # Unit tests for the functions
├── example_usage.py # Example script demonstrating usage
├── requirements.txt # Python dependencies
└── Data Source/ # Data directory (not included)
    ├── drug-label-0001-of-0009.json
    ├── drug-label-0002-of-0009.json
    └── ... (additional data files)
The notebook provides step-by-step analysis with clear markdown explanations for each section.
from openfda_functions import (
load_openfda_data, process_openfda_data,
analyze_ingredients_by_year, create_prediction_model
)
# Load and process data
raw_data = load_openfda_data('path/to/data')
processed_data = process_openfda_data(raw_data)
# Analyze by manufacturer
az_analysis = analyze_ingredients_by_year(processed_data, 'AstraZeneca')
# Create prediction model
model_results = create_prediction_model(processed_data)

python example_usage.py

python -m pytest test_openfda_analysis.py -v
# or
python test_openfda_analysis.py

Users can modify the analysis by:
- Changing the manufacturer filter (currently set to AstraZeneca)
- Adjusting the delivery route analysis parameters
- Modifying the prediction model features
- Customizing visualization parameters
- Primary Source: OpenFDA Drug Labels API
- Data Format: JSON files containing drug label information
- Key Fields:
  - openfda.generic_name: Drug generic names
  - spl_product_data_elements: Product composition data
  - drug_interactions: Drug interaction information
  - openfda.manufacturer_name: Manufacturer information
  - effective_time: Label effective dates (the date of each label version, not necessarily the approval date)
  - openfda.route: Administration routes
- Integration with real-time OpenFDA API
- Advanced machine learning models (Random Forest, Neural Networks)
- Interactive dashboard development
- Automated report generation
- Extended manufacturer analysis beyond AstraZeneca
- Drug safety signal detection
- Regulatory compliance analysis
- Fork the repository
- Create a feature branch
- Make your changes
- Add appropriate tests and documentation
- Submit a pull request
This project is intended for educational and research purposes. Please ensure compliance with OpenFDA data usage policies.
For questions or collaboration opportunities, please open an issue in the repository.
Note: This analysis is based on publicly available OpenFDA data and is intended for research and educational purposes only. Results should not be used for medical decision-making without proper validation and expert consultation.