Awesome ML Privacy Mitigations

A community-driven collection of privacy-preserving machine learning techniques, tools, and practical evaluations

This repository serves as a living catalog of privacy-preserving machine learning (PPML) techniques and tools. Building on the NIST Adversarial Machine Learning Taxonomy (2025), our goal is to create a comprehensive resource where practitioners can find, evaluate, and implement privacy-preserving solutions in their ML workflows.

An example scenario is included under each section to help practitioners relate each phase to a real-world issue:

We follow MedAI Healthcare Solutions, a fictional company developing an AI system to predict patient readmission risk using electronic health records (EHRs) from 50 hospitals across the country. The system will help hospitals optimize discharge planning and reduce healthcare costs. However, each phase of their ML pipeline presents unique privacy challenges that could expose sensitive patient information or violate HIPAA regulations.

About Our Team

Our team actively maintains and evaluates the repository by:

  • Testing and benchmarking each framework/tool
  • Documenting pros, cons, and integration challenges
  • Providing practical examples and use cases
  • Maintaining an evaluation website with detailed analyses
  • Keeping the collection updated with the latest PPML developments
How to Contribute

We welcome contributions from the community! Whether you're a researcher, practitioner, or enthusiast, you can help by:

  • Adding new privacy-preserving tools and frameworks
  • Sharing your experiences with existing tools
  • Contributing evaluation results
  • Suggesting improvements to our documentation
Repository Structure

Each section includes:

  1. Libraries & Tools: Practical implementations and frameworks you can use
  2. References: Research papers, tutorials, and resources for deeper understanding

The techniques covered include:

  • Data minimization and synthetic data generation
  • Local differential privacy and secure multi-party computation
  • Differentially private training and federated learning
  • Private inference and model protection
  • Privacy governance and evaluation

Contents

1. Data Collection Phase

1.1 Data Minimization

Libraries & Tools:

References:

Example Scenario:

MedAI initially collected comprehensive patient data including full medical histories, demographic information, insurance details, social security numbers, and even lifestyle data from wearable devices. While more data seemed better for model accuracy, this approach created several issues:

  • Over-collection Risk: The company stored 200+ features per patient when only 20-30 were actually predictive of readmission
  • Compliance Violation: Collecting more data than necessary violates GDPR's data minimization principle
  • Attack Surface: Extra data created more opportunities for membership inference attacks
  • Storage Costs: Unnecessary data inflated storage and processing costs

Privacy-Preserving Solution: Using tools like SHAP and Data Shapley, MedAI identified the minimum set of features needed for accurate predictions. They implemented feature selection algorithms from scikit-learn and used ML Privacy Meter to audit which features contributed most to privacy leakage.
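
A minimal sketch of this kind of SHAP-guided feature pruning is shown below. The data, model, and the choice of 28 retained features are illustrative assumptions from the scenario, not an actual pipeline; the point is ranking candidate features by mean absolute SHAP value and dropping the rest.

```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in for the EHR feature matrix: 200 candidate features, binary readmission label.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 200))
y = (X[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=2000) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# Rank features by mean absolute SHAP value and keep only the top k.
explainer = shap.TreeExplainer(model)
raw = explainer.shap_values(X_test[:200])
values = np.asarray(raw[1] if isinstance(raw, list) else raw)
if values.ndim == 3:  # some SHAP versions return (samples, features, classes)
    values = values[:, :, 1]
importance = np.abs(values).mean(axis=0)

top_k = np.argsort(importance)[::-1][:28]  # retain the 28 most predictive features
X_minimized = X[:, top_k]
print("selected feature indices:", sorted(top_k.tolist()))
```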

This reduced the feature set from 200 to 28 critical variables (age, diagnosis codes, length of stay, etc.) while maintaining 94% model accuracy and significantly reducing privacy risk.

1.2 Synthetic Data Generation

Libraries & Tools:

References:

Example Scenario:

MedAI needed to share data with research partners and train models across different hospital systems, but couldn't transfer real patient records due to HIPAA restrictions. Traditional anonymization techniques were insufficient because:

  • Re-identification Risk: Simple anonymization could be reversed using auxiliary datasets
  • Utility Loss: Heavy anonymization made data unsuitable for ML training
  • Legal Barriers: Hospitals were reluctant to share even anonymized real patient data

Privacy-Preserving Solution: MedAI implemented CTGAN to generate synthetic patient records that preserved statistical relationships without containing real patient information. They used PATE-GAN to add differential privacy guarantees and validated synthetic data quality using the SDV framework.
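
A minimal sketch of the generation step with the ctgan package is shown below, using a hypothetical toy table (the column names and sizes are invented). Adding PATE-GAN's differential-privacy guarantees and running SDV quality reports would be separate steps on top of this.

```python
import numpy as np
import pandas as pd
from ctgan import CTGAN

# Toy stand-in for a de-identified EHR extract (1,000 rows, 4 columns).
rng = np.random.default_rng(0)
real = pd.DataFrame({
    "age": rng.integers(30, 90, size=1000),
    "length_of_stay": rng.integers(1, 15, size=1000),
    "diagnosis_code": rng.choice(["I50", "E11", "I10"], size=1000),
    "readmitted": rng.integers(0, 2, size=1000),
})
discrete_columns = ["diagnosis_code", "readmitted"]

model = CTGAN(epochs=10)        # use many more epochs on real data
model.fit(real, discrete_columns)

synthetic = model.sample(1000)  # synthetic records with no one-to-one mapping to real patients
print(synthetic.head())
```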

This generated 100,000 synthetic patient records that maintained the same readmission prediction patterns as real data, enabling safe data sharing while protecting patient privacy.

2. Data Processing Phase

2.1 Local Differential Privacy (LDP)

Libraries & Tools:

References:

Example Scenario:

MedAI wanted to collect real-time health metrics from patient mobile apps to improve their readmission model, but patients were concerned about privacy. Traditional data collection would reveal:

  • Individual Health Status: Exact blood pressure, weight, and activity levels
  • Behavioral Patterns: Sleep schedules, medication adherence, lifestyle habits
  • Location Privacy: When and where health measurements were taken

Privacy-Preserving Solution: Implemented Local Differential Privacy using OpenDP and TensorFlow Privacy. Each patient's mobile app added calibrated noise to health metrics before transmission, ensuring individual readings couldn't be precisely determined while still allowing for useful aggregate statistics.
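
The core LDP idea is that noise is added on the client before anything leaves the device. The sketch below is deliberately library-agnostic (plain NumPy rather than the OpenDP or TensorFlow Privacy APIs), and the metric bounds and epsilon are illustrative assumptions.

```python
import numpy as np

def ldp_laplace(value, lower, upper, epsilon, rng):
    """Clamp a reading to [lower, upper] and add Laplace noise with
    sensitivity (upper - lower), giving epsilon-LDP for this single value."""
    clamped = min(max(value, lower), upper)
    sensitivity = upper - lower
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return clamped + noise

rng = np.random.default_rng(0)

# Client side: each app perturbs its own reading before transmission.
true_systolic_bp = 128.0
noisy_report = ldp_laplace(true_systolic_bp, lower=80, upper=200, epsilon=1.0, rng=rng)

# Server side: only noisy reports are ever seen, but averaging many of them
# still recovers a useful population estimate.
reports = [ldp_laplace(rng.normal(130, 15), 80, 200, epsilon=1.0, rng=rng) for _ in range(10_000)]
print("noisy individual report:", round(noisy_report, 1))
print("population mean estimate:", round(float(np.mean(reports)), 1))
```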

This solution allowed MedAI to collect privacy-preserving health metrics from 10,000+ patients, improving readmission prediction by 8% while ensuring no individual patient data could be reconstructed.

2.2 Secure Multi-Party Computation (SMPC)

Libraries & Tools:

References:

Example Scenario:

Three major hospital networks wanted to collaboratively train a readmission model using their combined data, but legal and competitive concerns prevented direct data sharing:

  • Competitive Advantage: Hospitals didn't want competitors seeing their patient demographics or treatment outcomes
  • Regulatory Compliance: Cross-institutional data sharing required extensive legal agreements
  • Data Sovereignty: Each hospital needed to maintain control over their data

Privacy-Preserving Solution: Used MP-SPDZ and CrypTen to implement secure multi-party computation. Hospitals could jointly train models on encrypted data without revealing individual records to each other.
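
A minimal CrypTen sketch of the idea: each hospital's tensor is secret-shared, and a joint statistic is computed without either party seeing the other's plaintext. It runs in a single process for illustration; a real deployment would run one CrypTen party per hospital, and model training would replace the toy aggregate shown here.

```python
import torch
import crypten

crypten.init()

# Stand-in feature tensors held by two different hospitals.
hospital_a = torch.tensor([[0.2, 1.0], [0.5, 0.0]])
hospital_b = torch.tensor([[0.7, 1.0], [0.1, 1.0]])

enc_a = crypten.cryptensor(hospital_a)   # secret-share each dataset
enc_b = crypten.cryptensor(hospital_b)

enc_joint_mean = (enc_a + enc_b).mean()  # computed on encrypted shares
print("joint mean:", enc_joint_mean.get_plain_text().item())
```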

Using this solution, the hospital networks successfully trained a collaborative model using 150,000 patient records across three hospital systems, achieving 15% better accuracy than individual hospital models while maintaining complete data privacy.

3. Model Training Phase

3.1 Differentially Private Training

Libraries & Tools:

References:

Example Scenario:

MedAI's initial model training process was vulnerable to privacy attacks:

  • Membership Inference: Adversaries could determine if specific patients were in the training data
  • Model Inversion: Attackers might reconstruct patient information from model parameters
  • Gradient Leakage: Model gradients could leak sensitive patient information during training

Privacy-Preserving Solution: Implemented Differentially Private SGD using Opacus for PyTorch training. Added calibrated noise to gradients during training and used TensorFlow Privacy for privacy accounting to track the total privacy budget consumed.
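
A minimal Opacus sketch of DP-SGD targeting the scenario's ε = 1.0 is shown below. The toy model, data, δ, and clipping bound are assumptions; the key pieces are per-sample gradient clipping, noise addition, and privacy accounting handled by the PrivacyEngine.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

# Toy stand-in for the minimized 28-feature training set.
X = torch.randn(2048, 28)
y = torch.randint(0, 2, (2048,)).float()
loader = DataLoader(TensorDataset(X, y), batch_size=64)

model = nn.Sequential(nn.Linear(28, 16), nn.ReLU(), nn.Linear(16, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.BCEWithLogitsLoss()

privacy_engine = PrivacyEngine()
model, optimizer, loader = privacy_engine.make_private_with_epsilon(
    module=model,
    optimizer=optimizer,
    data_loader=loader,
    epochs=5,
    target_epsilon=1.0,   # scenario's privacy budget
    target_delta=1e-5,    # illustrative delta
    max_grad_norm=1.0,    # per-sample gradient clipping bound
)

for epoch in range(5):
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = criterion(model(xb).squeeze(-1), yb)
        loss.backward()
        optimizer.step()

print("spent epsilon:", privacy_engine.get_epsilon(delta=1e-5))
```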

This produced trained models with a formal privacy guarantee (ε = 1.0 differential privacy) while maintaining 91% accuracy on the readmission prediction task.

3.2 Federated Learning

Libraries & Tools:

References:

Example Scenario:

MedAI wanted to continuously improve their model using data from all partner hospitals, but centralized training created privacy and logistical issues:

  • Data Centralization Risk: Moving all hospital data to a central location violated data governance policies
  • Bandwidth Limitations: Transferring large healthcare datasets was impractical
  • Regulatory Requirements: Some hospitals couldn't legally share raw patient data

Privacy-Preserving Solution: Deployed TensorFlow Federated framework where each hospital trained local models on their own data and only shared encrypted model updates. Used Flower for orchestrating the federated learning process across hospital networks.
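
The sketch below is a library-agnostic FedAvg loop illustrating what TensorFlow Federated and Flower orchestrate at scale: each hospital fits a local model on its own data, and only the weight vectors, never patient records, are averaged by the coordinator. The model and data are toy stand-ins.

```python
import numpy as np

def local_update(weights, X, y, lr=0.1, epochs=5):
    """One hospital's local logistic-regression-style gradient steps."""
    w = weights.copy()
    for _ in range(epochs):
        preds = 1.0 / (1.0 + np.exp(-X @ w))
        grad = X.T @ (preds - y) / len(y)
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
# Five hospitals, each holding its own (features, label) data locally.
hospitals = [(rng.normal(size=(500, 28)), rng.integers(0, 2, 500)) for _ in range(5)]

global_w = np.zeros(28)
for rnd in range(10):                                   # federated rounds
    local_ws = [local_update(global_w, X, y) for X, y in hospitals]
    global_w = np.mean(local_ws, axis=0)                # FedAvg aggregation of updates only
print("global weights after 10 rounds:", global_w[:5])
```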

This enabled continuous improvement of the global model using data from all 50 hospitals without centralizing patient records, achieving state-of-the-art readmission prediction while maintaining data locality.

4. Model Deployment Phase

4.1 Private Inference

Libraries & Tools:

References:

Example Scenario:

When hospitals queried MedAI's readmission prediction model, they had to send patient data to MedAI's servers, creating privacy risks:

  • Data in Transit: Patient information was exposed during transmission
  • Server-Side Storage: MedAI could potentially store and analyze sensitive patient queries
  • Insider Threats: MedAI employees could access real patient data during inference

Privacy-Preserving Solution: Implemented Homomorphic Encryption using TenSEAL and Microsoft SEAL. Hospitals could encrypt patient data locally and receive encrypted predictions without MedAI ever seeing the raw patient information.
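
A minimal TenSEAL sketch of encrypted inference for a linear risk score: the hospital encrypts the patient's features under CKKS and keeps the secret key, the model owner computes on ciphertext, and only the hospital can decrypt the result. The weights and features are illustrative, and a real model would involve more than a single dot product.

```python
import tenseal as ts

# Hospital side: create a CKKS context; the secret key stays local.
context = ts.context(
    ts.SCHEME_TYPE.CKKS,
    poly_modulus_degree=8192,
    coeff_mod_bit_sizes=[60, 40, 40, 60],
)
context.global_scale = 2 ** 40
context.generate_galois_keys()

patient_features = [0.62, 0.10, 0.88, 0.35]        # normalized features (illustrative)
enc_features = ts.ckks_vector(context, patient_features)

# Model-owner side: apply plaintext linear weights to the encrypted vector.
weights = [1.2, -0.4, 0.9, 0.3]
enc_score = enc_features.dot(weights)

# Hospital side: decrypt the readmission risk score locally.
print("risk score:", enc_score.decrypt()[0])
```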

Using homomorphic encryption enabled privacy-preserving predictions where hospitals could get readmission risk scores without exposing patient data to MedAI or third parties.

4.2 Model Anonymization and Protection

Libraries & Tools:

References:

Example Scenario:

MedAI needed to deploy their model to hospital systems, but the model itself contained privacy risks:

  • Model Inversion Attacks: Adversaries could extract training data patterns from model parameters
  • Membership Inference: Attackers could determine which patients were used for training
  • Intellectual Property: Model weights revealed MedAI's proprietary algorithms

Privacy-Preserving Solution: Used Knowledge Distillation and Model Compression techniques to create privacy-preserving model versions. Implemented defenses against membership inference attacks using techniques from Adversarial Robustness Toolbox.
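
A minimal PyTorch knowledge-distillation sketch: a smaller student is trained to match the softened outputs of the original teacher on a public or synthetic transfer set, so the deployed weights are one step removed from the raw training data. The architectures, temperature, and data are illustrative assumptions, and distillation alone is not a formal privacy guarantee.

```python
import torch
from torch import nn
import torch.nn.functional as F

teacher = nn.Sequential(nn.Linear(28, 64), nn.ReLU(), nn.Linear(64, 2))  # original trained model
student = nn.Sequential(nn.Linear(28, 16), nn.ReLU(), nn.Linear(16, 2))  # smaller deployed model

optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
temperature = 2.0
X = torch.randn(1024, 28)   # public or synthetic transfer set, not raw EHRs

teacher.eval()
for step in range(200):
    with torch.no_grad():
        soft_targets = F.softmax(teacher(X) / temperature, dim=1)
    student_log_probs = F.log_softmax(student(X) / temperature, dim=1)
    loss = F.kl_div(student_log_probs, soft_targets, reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```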

This enabled deployment of sanitized models that maintained prediction accuracy while resisting privacy attacks and protecting both patient data and intellectual property.

5. Privacy Governance

5.1 Privacy Budget Management

Libraries & Tools:

References:

Example Scenario:

MedAI was using differential privacy across multiple model versions and data analysis tasks, but poor privacy budget management created vulnerabilities:

  • Budget Exhaustion: Researchers were using up privacy budgets too quickly
  • Composition Attacks: Multiple queries with small epsilon values could accumulate to reveal private information
  • Inconsistent Accounting: Different teams used different privacy accounting methods

Privacy-Preserving Solution: Implemented centralized privacy budget management using Google's dp-accounting and TensorFlow Privacy Accountant. Created governance policies for privacy budget allocation across research projects.
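
The governance idea can be sketched with a simple central ledger, shown below. It uses naive composition (epsilons simply add up) purely for illustration; a real deployment would delegate the accounting to dp-accounting or the TensorFlow Privacy accountant, which give much tighter composition bounds.

```python
class PrivacyBudgetLedger:
    """Central ledger that approves or rejects queries against a global epsilon budget."""

    def __init__(self, total_epsilon: float):
        self.total_epsilon = total_epsilon
        self.spent = {}  # team -> epsilon consumed

    def request(self, team: str, epsilon: float) -> bool:
        """Approve a query only if it keeps the global budget intact (naive composition)."""
        if sum(self.spent.values()) + epsilon > self.total_epsilon:
            return False
        self.spent[team] = self.spent.get(team, 0.0) + epsilon
        return True

ledger = PrivacyBudgetLedger(total_epsilon=3.0)
print(ledger.request("readmission-research", 1.0))   # True
print(ledger.request("cost-analysis", 1.5))          # True
print(ledger.request("marketing-analytics", 1.0))    # False: would exhaust the budget
```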

This established sustainable privacy budget management that allowed long-term research while maintaining strong privacy guarantees.

5.2 Privacy Impact Evaluation

Libraries & Tools:

References:

Example Scenario:

MedAI needed to continuously assess whether their privacy protections were working effectively:

  • Unknown Vulnerabilities: New attack methods could bypass existing protections
  • Compliance Auditing: Regulators required proof that privacy protections were effective
  • Risk Assessment: Needed quantitative measures of privacy risk for business decisions

Privacy-Preserving Solution: Deployed ML Privacy Meter for continuous privacy auditing, implemented membership inference attack simulations using IMIA, and established regular privacy risk assessments.
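
A simplified loss-threshold membership-inference audit is sketched below; it is the kind of test that ML Privacy Meter automates with far more rigor (shadow models, calibrated thresholds, per-example reports). If member losses are clearly separable from non-member losses, the model is leaking. The data and model are toy stand-ins.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(4000, 28))
y = (X[:, 0] + rng.normal(scale=0.5, size=4000) > 0).astype(int)
X_train, y_train = X[:2000], y[:2000]          # members (used for training)
X_out, y_out = X[2000:], y[2000:]              # non-members (held out)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

def per_example_loss(model, X, y):
    """Cross-entropy loss of the model on each individual example."""
    probs = model.predict_proba(X)[np.arange(len(y)), y]
    return -np.log(np.clip(probs, 1e-12, 1.0))

losses = np.concatenate([per_example_loss(model, X_train, y_train),
                         per_example_loss(model, X_out, y_out)])
is_member = np.concatenate([np.ones(2000), np.zeros(2000)])

# AUC near 0.5 means the attack cannot distinguish members; near 1.0 means leakage.
auc = roc_auc_score(is_member, -losses)        # lower loss => more likely a member
print("membership-inference AUC:", round(auc, 3))
```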

This created a comprehensive privacy monitoring system that provides early warning of potential privacy breaches and demonstrates compliance to regulators.

6. Libraries & Tools

6.1 Differential Privacy

6.2 Federated Learning

6.3 Secure Computation

6.4 Synthetic Data

6.5 Privacy Evaluation

7. Tutorials & Resources

7.1 Differential Privacy Tutorials

7.2 Federated Learning Tutorials

7.3 Secure Computation Tutorials

7.4 Synthetic Data Tutorials

7.5 Privacy Evaluation Tutorials

Contribute

Contributions welcome! Read the contribution guidelines first.

License

CC0
