A community-driven collection of privacy-preserving machine learning techniques, tools, and practical evaluations
This repository serves as a living catalog of privacy-preserving machine learning (PPML) techniques and tools. Building on the NIST Adversarial Machine Learning Taxonomy (2025), our goal is to create a comprehensive resource where practitioners can find, evaluate, and implement privacy-preserving solutions in their ML workflows.
An example scenario is included under each section to help practitioners relate each phase back to a real-world issue:
We follow MedAI Healthcare Solutions, a fictional company developing an AI system to predict patient readmission risk using electronic health records (EHRs) from 50 hospitals across the country. The system will help hospitals optimize discharge planning and reduce healthcare costs. However, each phase of their ML pipeline presents unique privacy challenges that could expose sensitive patient information or violate HIPAA regulations.
About Our Team
Our team actively maintains and evaluates the repository by:
- Testing and benchmarking each framework/tool
 - Documenting pros, cons, and integration challenges
 - Providing practical examples and use cases
 - Maintaining an evaluation website with detailed analyses
 - Keeping the collection updated with the latest PPML developments
 
How to Contribute
We welcome contributions from the community! Whether you're a researcher, practitioner, or enthusiast, you can help by:
- Adding new privacy-preserving tools and frameworks
 - Sharing your experiences with existing tools
 - Contributing evaluation results
 - Suggesting improvements to our documentation
 
Repository Structure
Each section includes:
- Libraries & Tools: Practical implementations and frameworks you can use
 - References: Research papers, tutorials, and resources for deeper understanding
 
The techniques covered include:
- Data minimization and synthetic data generation
 - Local differential privacy and secure multi-party computation
 - Differentially private training and federated learning
 - Private inference and model protection
 - Privacy governance and evaluation
 
- 1. Data Collection Phase
 - 2. Data Processing Phase
 - 3. Model Training Phase
 - 4. Model Deployment Phase
 - 5. Privacy Governance
 - 6. Evaluation & Metrics
 - 7. Libraries & Tools
 - 8. Tutorials & Resources
 - Contribute
 
Libraries & Tools:
References:
- The Data Minimization Principle in Machine Learning (Ganesh et al., 2024) - Empirical exploration of data minimization and its misalignment with privacy, along with potential solutions
 - Data Minimization for GDPR Compliance in Machine Learning Models (Goldsteen et al., 2022) - Method to reduce personal data needed for ML predictions while preserving model accuracy through knowledge distillation
 - From Principle to Practice: Vertical Data Minimization for Machine Learning (Staab et al., 2023) - Comprehensive framework for implementing data minimization in machine learning with data generalization techniques
 - Data Shapley: Equitable Valuation of Data for Machine Learning (Ghorbani & Zou, 2019) - Introduces method to quantify the value of individual data points to model performance, enabling systematic data reduction
 - Algorithmic Data Minimization for ML over IoT Data Streams (Kil et al., 2024) - Framework for minimizing data collection in IoT environments while balancing utility and privacy
 - Membership Inference Attacks Against Machine Learning Models (Shokri et al., 2017) - Pioneering work on membership inference attacks that can be used to audit privacy leakage in ML models
 - Selecting critical features for data classification based on machine learning methods (Dewi et al., 2020) - Demonstrates that feature selection improves model accuracy and performance while reducing dimensionality
 
Example Scenario:
MedAI initially collected comprehensive patient data including full medical histories, demographic information, insurance details, social security numbers, and even lifestyle data from wearable devices. While more data seemed better for model accuracy, this approach created several issues:
- Over-collection Risk: The company stored 200+ features per patient when only 20-30 were actually predictive of readmission
 - Compliance Violation: Collecting more data than necessary violates GDPR's data minimization principle
 - Attack Surface: Extra data created more opportunities for membership inference attacks
 - Storage Costs: Unnecessary data inflated storage and processing costs
 
Privacy-Preserving Solution: Using tools like SHAP and Data Shapley, MedAI identified the minimum set of features needed for accurate predictions. They implemented feature selection algorithms from scikit-learn and used ML Privacy Meter to audit which features contributed most to privacy leakage.
This reduced the feature set from 200+ to 28 critical variables (age, diagnosis codes, length of stay, etc.) while maintaining 94% model accuracy and significantly reducing privacy risk.
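As a rough sketch of this kind of feature reduction (not MedAI's actual pipeline; the input file, column names, and feature budget below are hypothetical), a scikit-learn selection pass could look like this:

```python
# Hypothetical sketch: keep only the top-k most predictive features before
# training, assuming a tabular DataFrame with a binary `readmitted` label.
# Column names and the CSV path are illustrative only.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

ehr_df = pd.read_csv("ehr_features.csv")          # assumed input file
X = ehr_df.drop(columns=["readmitted"])
y = ehr_df["readmitted"]

# Rank features by importance with a simple tree ensemble, then keep only
# the strongest predictors (threshold=-inf so exactly max_features are kept).
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=200, random_state=0),
    max_features=28,          # target feature budget from the scenario
    threshold=-np.inf,
)
selector.fit(X, y)

kept_columns = X.columns[selector.get_support()]
X_minimized = X[kept_columns]
print(f"Reduced from {X.shape[1]} to {len(kept_columns)} features")
```

In practice, importance-based selection like this would be combined with attribution tools such as SHAP or Data Shapley and a privacy audit of the retained features, as described above.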
Libraries & Tools:
References:
- Synthetic Data: Revisiting the Privacy-Utility Trade-off (Sarmin et al., 2024) - Analysis of privacy-utility trade-offs between synthetic data and traditional anonymization
 - Machine Learning for Synthetic Data Generation: A Review (Zhao et al., 2023) - Comprehensive review of synthetic data generation techniques and their applications
 - Modeling Tabular Data using Conditional GAN (Xu et al., 2019) - Introduces CTGAN, designed specifically for mixed-type tabular data generation
 - Tabular and latent space synthetic data generation: a literature review (Garcia-Gasulla et al., 2023) - Review of data generation methods for tabular data
 - Synthetic data for enhanced privacy: A VAE-GAN approach against membership inference attacks (Yan et al., 2024) - Novel hybrid approach combining VAE and GAN
 - SMOTE: Synthetic Minority Over-sampling Technique (Chawla et al., 2002) - Classic approach for generating synthetic samples for minority classes
 - Empirical privacy metrics: the bad, the ugly... and the good, maybe? (Desfontaines, 2024) - Critical analysis of common empirical privacy metrics in synthetic data
 - Challenges of Using Synthetic Data Generation Methods for Tabular Microdata (Winter & Tolan, 2023) - Empirical study of trade-offs in different synthetic data generation methods
 - Privacy Auditing of Machine Learning using Membership Inference Attacks (Yaghini et al., 2021) - Framework for privacy auditing in ML models
 - PATE-GAN: Generating Synthetic Data with Differential Privacy Guarantees (Jordon et al., 2019) - Integrates differential privacy into GANs using the PATE framework
 - A Critical Review on the Use (and Misuse) of Differential Privacy in Machine Learning (Domingo-Ferrer & Soria-Comas, 2022) - Analysis of privacy in ML including synthetic data approaches
 - Protect and Extend - Using GANs for Synthetic Data Generation of Time-Series Medical Records (2024) - Application and evaluation of synthetic data in healthcare domain
 
Example Scenario:
MedAI needed to share data with research partners and train models across different hospital systems, but couldn't transfer real patient records due to HIPAA restrictions. Traditional anonymization techniques were insufficient because:
- Re-identification Risk: Simple anonymization could be reversed using auxiliary datasets
 - Utility Loss: Heavy anonymization made data unsuitable for ML training
 - Legal Barriers: Hospitals were reluctant to share even anonymized real patient data
 
Privacy-Preserving Solution: MedAI implemented CTGAN to generate synthetic patient records that preserved statistical relationships without containing real patient information. They used PATE-GAN to add differential privacy guarantees and validated synthetic data quality using the SDV framework.
This generated 100,000 synthetic patient records that maintained the same readmission prediction patterns as real data, enabling safe data sharing while protecting patient privacy.
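A minimal sketch of generating synthetic tabular records with the `ctgan` package follows (the API may vary by version; the input file, column names, and epoch count are hypothetical):

```python
# Minimal CTGAN sketch: learn the joint distribution of a de-identified
# training sample and emit synthetic rows that mimic it without copying
# individual records. Data and column names are placeholders.
import pandas as pd
from ctgan import CTGAN

real_records = pd.read_csv("deidentified_training_sample.csv")   # assumed input
discrete_columns = ["diagnosis_code", "discharge_disposition", "readmitted"]

model = CTGAN(epochs=300)
model.fit(real_records, discrete_columns)

synthetic_records = model.sample(100_000)
synthetic_records.to_csv("synthetic_patients.csv", index=False)
```

Note that CTGAN alone provides no formal privacy guarantee; that is why the scenario layers PATE-GAN-style differential privacy and quality validation (e.g., with SDV) on top.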
Libraries & Tools:
- OpenDP
 - IBM Differential Privacy Library
 - Tumult Analytics
 - Google's Differential Privacy Library
 - TensorFlow Privacy
 - Opacus
 - Microsoft SmartNoise
 
References:
- A friendly introduction to differential privacy (Desfontaines) - Accessible explanation of differential privacy concepts and fundamentals
 - Local Differential Privacy: a tutorial (Xiong et al., 2020) - Comprehensive overview of LDP theory and applications
 - RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response (Erlingsson et al., 2014) - Google's LDP system for Chrome usage statistics
 - The Algorithmic Foundations of Differential Privacy (Dwork & Roth, 2014) - Comprehensive textbook on differential privacy
 - Approximate Differential Privacy (Programming Differential Privacy) - Detailed guide to approximate DP implementation
 - Rényi Differential Privacy (Mironov, 2017) - Original paper introducing RDP
 - Gaussian Differential Privacy (Dong et al., 2022) - Framework connecting DP to hypothesis testing
 - Getting more useful results with differential privacy (Desfontaines) - Practical advice for improving utility in DP systems
 - A reading list on differential privacy (Desfontaines) - Curated list of papers and resources for learning DP
 - Rényi Differential Privacy of the Sampled Gaussian Mechanism (Mironov et al., 2019) - Analysis of privacy guarantees for subsampled data
 - On the Rényi Differential Privacy of the Shuffle Model (Wang et al., 2021) - Analysis of shuffling for privacy amplification
 - Differential Privacy: An Economic Method for Choosing Epsilon (Hsu et al., 2014) - Framework for epsilon selection based on economic principles
 - Functional Rényi Differential Privacy for Generative Modeling (Jalko et al., 2023) - Extension of RDP to functional outputs
 - Precision-based attacks and interval refining: how to break, then fix, differential privacy (Haney et al., 2022) - Analysis of vulnerabilities in DP implementations
 - Differential Privacy: A Primer for a Non-technical Audience (Wood et al., 2018) - Accessible introduction for non-technical readers
 - Using differential privacy to harness big data and preserve privacy (Brookings, 2020) - Overview of real-world applications
 
Example Scenario:
MedAI wanted to collect real-time health metrics from patient mobile apps to improve their readmission model, but patients were concerned about privacy. Traditional data collection would reveal:
- Individual Health Status: Exact blood pressure, weight, and activity levels
 - Behavioral Patterns: Sleep schedules, medication adherence, lifestyle habits
 - Location Privacy: When and where health measurements were taken
 
Privacy-Preserving Solution: Implemented Local Differential Privacy using OpenDP and TensorFlow Privacy. Each patient's mobile app added calibrated noise to health metrics before transmission, ensuring individual readings couldn't be precisely determined while still allowing for useful aggregate statistics.
This solution allowed MedAI to collect privacy-preserving health metrics from 10,000+ patients, improving readmission prediction by 8% while ensuring no individual patient data could be reconstructed.
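A conceptual sketch of the local-DP idea is shown below using a plain Laplace mechanism rather than any specific library API; the clipping bounds and epsilon are illustrative, and a production system would use a vetted library such as OpenDP:

```python
# Local differential privacy sketch for a bounded numeric reading
# (e.g., systolic blood pressure clipped to [80, 200]). Each client adds
# calibrated Laplace noise locally, so the server never sees exact values,
# yet aggregate statistics remain useful.
import numpy as np

def randomize_reading(value, lower=80.0, upper=200.0, epsilon=1.0, rng=None):
    rng = rng or np.random.default_rng()
    clipped = float(np.clip(value, lower, upper))
    sensitivity = upper - lower                  # worst-case change for one user
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return clipped + noise                       # transmitted instead of the true value

# Each client perturbs locally; the mean of noisy values still estimates
# the population mean.
true_values = np.random.default_rng(0).normal(130, 15, size=10_000)
noisy_values = [randomize_reading(v) for v in true_values]
print(np.mean(true_values), np.mean(noisy_values))
```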
Libraries & Tools:
References:
- Secure Multiparty Computation (Lindell, 2020) - Comprehensive overview of SMPC theory and applications
 
Example Scenario:
Three major hospital networks wanted to collaboratively train a readmission model using their combined data, but legal and competitive concerns prevented direct data sharing:
- Competitive Advantage: Hospitals didn't want competitors seeing their patient demographics or treatment outcomes
 - Regulatory Compliance: Cross-institutional data sharing required extensive legal agreements
 - Data Sovereignty: Each hospital needed to maintain control over their data
 
Privacy-Preserving Solution: Used MP-SPDZ and CrypTen to implement secure multi-party computation. Hospitals could jointly train models on encrypted data without revealing individual records to each other.
Using this solution, the hospital networks successfully trained a collaborative model using 150,000 patient records across three hospital systems, achieving 15% better accuracy than individual hospital models while maintaining complete data privacy.
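To illustrate the core idea behind SMPC frameworks such as MP-SPDZ and CrypTen (this is a teaching sketch, not their APIs), here is toy additive secret sharing over a prime field:

```python
# Additive secret sharing: each hospital splits its local statistic into
# random shares, distributes them across parties, and only the combination
# of all shares reveals the aggregate -- never any individual input.
import secrets

PRIME = 2**61 - 1   # field modulus

def share(secret, n_parties):
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    last = (secret - sum(shares)) % PRIME
    return shares + [last]

def reconstruct(shares):
    return sum(shares) % PRIME

# Hypothetical local readmission counts at three hospitals.
hospital_counts = [1250, 980, 1430]
all_shares = [share(c, 3) for c in hospital_counts]

# Party i sums the i-th share from every hospital; no party sees raw counts.
partial_sums = [sum(s[i] for s in all_shares) % PRIME for i in range(3)]
print(reconstruct(partial_sums))   # 3660, the joint total
```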
Libraries & Tools:
References:
- Deep Learning with Differential Privacy (Abadi et al., 2016) - Introduces DP-SGD algorithm for training deep neural networks with differential privacy
 - Differentially Private Model Publishing for Deep Learning (Yu et al., 2018) - Methods for publishing deep learning models with privacy guarantees
 
Example Scenario:
MedAI's initial model training process was vulnerable to privacy attacks:
- Membership Inference: Adversaries could determine if specific patients were in the training data
 - Model Inversion: Attackers might reconstruct patient information from model parameters
 - Gradient Leakage: Model gradients could leak sensitive patient information during training
 
Privacy-Preserving Solution: Implemented Differentially Private SGD using Opacus for PyTorch training. Added calibrated noise to gradients during training and used TensorFlow Privacy for privacy accounting to track the total privacy budget consumed.
This produced trained models with formal privacy guarantees (ε=1.0 differential privacy) while maintaining 91% accuracy on readmission prediction tasks.
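A sketch of wiring Opacus into a PyTorch training loop is shown below (API as in recent Opacus releases, to the best of our understanding; the model, data, and hyperparameters are placeholders, not MedAI's actual configuration):

```python
# DP-SGD sketch with Opacus: per-sample gradients are clipped and noised,
# and the PrivacyEngine tracks the privacy budget spent.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

features = torch.randn(1000, 28)                  # stand-in for the 28 selected features
labels = torch.randint(0, 2, (1000,)).float()
loader = DataLoader(TensorDataset(features, labels), batch_size=64)

model = nn.Sequential(nn.Linear(28, 16), nn.ReLU(), nn.Linear(16, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.BCEWithLogitsLoss()

privacy_engine = PrivacyEngine()
model, optimizer, loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=loader,
    noise_multiplier=1.1,     # noise added to clipped per-sample gradients
    max_grad_norm=1.0,        # per-sample gradient clipping bound
)

for epoch in range(5):
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x).squeeze(-1), y)
        loss.backward()
        optimizer.step()

print("epsilon spent:", privacy_engine.get_epsilon(delta=1e-5))
```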
Libraries & Tools:
References:
- Communication-Efficient Learning of Deep Networks from Decentralized Data (McMahan et al., 2017) - Introduces Federated Averaging (FedAvg) algorithm for efficient federated learning
 - Practical Secure Aggregation for Federated Learning on User-Held Data (Bonawitz et al., 2017) - Cryptographic protocol for secure aggregation in federated learning
 - Federated Learning: Strategies for Improving Communication Efficiency (Konečný et al., 2016) - Techniques for reducing communication costs in federated learning
 
Example Scenario:
MedAI wanted to continuously improve their model using data from all partner hospitals, but centralized training created privacy and logistical issues:
- Data Centralization Risk: Moving all hospital data to a central location violated data governance policies
 - Bandwidth Limitations: Transferring large healthcare datasets was impractical
 - Regulatory Requirements: Some hospitals couldn't legally share raw patient data
 
Privacy-Preserving Solution: Deployed TensorFlow Federated framework where each hospital trained local models on their own data and only shared encrypted model updates. Used Flower for orchestrating the federated learning process across hospital networks.
This method enables continuous improvement of the global model using data from 50 hospitals without centralizing patient records, achieving state-of-the-art readmission prediction while maintaining data locality.
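The aggregation rule at the heart of this setup is Federated Averaging (FedAvg, McMahan et al., 2017). The sketch below shows only the server-side weighted averaging in plain NumPy, not the TensorFlow Federated or Flower APIs; the hospital counts and model shapes are hypothetical:

```python
# FedAvg aggregation sketch: each hospital trains locally and sends only
# weight updates; the server averages them weighted by local dataset size.
import numpy as np

def fedavg(client_weights, client_sizes):
    """Weighted average of per-client model weights (lists of arrays)."""
    total = sum(client_sizes)
    n_layers = len(client_weights[0])
    return [
        sum(w[layer] * (size / total)
            for w, size in zip(client_weights, client_sizes))
        for layer in range(n_layers)
    ]

# Hypothetical: three hospitals, each with a two-layer model.
rng = np.random.default_rng(0)
clients = [[rng.normal(size=(28, 16)), rng.normal(size=(16,))] for _ in range(3)]
sizes = [60_000, 50_000, 40_000]

global_weights = fedavg(clients, sizes)
print(global_weights[0].shape, global_weights[1].shape)
```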
Libraries & Tools:
References:
- CryptoNets: Applying Neural Networks to Encrypted Data with High Throughput and Accuracy (Gilad-Bachrach et al., 2016) - Early work on running neural networks on homomorphically encrypted data
 - GAZELLE: A Low Latency Framework for Secure Neural Network Inference (Juvekar et al., 2018) - Efficient framework for secure neural network inference using homomorphic encryption
 
Example Scenario:
When hospitals queried MedAI's readmission prediction model, they had to send patient data to MedAI's servers, creating privacy risks:
- Data in Transit: Patient information was exposed during transmission
 - Server-Side Storage: MedAI could potentially store and analyze sensitive patient queries
 - Insider Threats: MedAI employees could access real patient data during inference
 
Privacy-Preserving Solution: Implemented Homomorphic Encryption using TenSEAL and Microsoft SEAL. Hospitals could encrypt patient data locally and receive encrypted predictions without MedAI ever seeing the raw patient information.
Using homomorphic encryption enabled privacy-preserving predictions where hospitals could get readmission risk scores without exposing patient data to MedAI or third parties.
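As a simplified illustration, the sketch below computes an encrypted linear risk score with TenSEAL's CKKS scheme (API as best understood; the encryption parameters, features, and weights are illustrative, and real deployments require careful parameter selection and more elaborate model evaluation):

```python
# Encrypted linear scoring sketch with TenSEAL/CKKS: the hospital encrypts
# the feature vector, the server computes on ciphertext, and only the
# hospital's key can decrypt the result.
import tenseal as ts

# Hospital side: build an encryption context and encrypt the features.
context = ts.context(
    ts.SCHEME_TYPE.CKKS,
    poly_modulus_degree=8192,
    coeff_mod_bit_sizes=[60, 40, 40, 60],
)
context.global_scale = 2**40
context.generate_galois_keys()

patient_features = [0.3, 1.2, 0.7, 0.0, 2.1]      # hypothetical normalized features
encrypted_features = ts.ckks_vector(context, patient_features)

# Server side (MedAI): compute a linear score directly on the ciphertext.
weights = [0.5, -0.2, 0.8, 0.1, 0.4]
bias = 0.05
encrypted_score = encrypted_features.dot(weights)

# Hospital side: only the key holder can decrypt; the bias is added after
# decryption here just to keep the ciphertext operations minimal.
risk_score = encrypted_score.decrypt()[0] + bias
print(risk_score)
```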
Libraries & Tools:
References:
- Distillation as a Defense to Adversarial Perturbations Against Deep Neural Networks (Papernot et al., 2016) - Uses model distillation to improve robustness and privacy
 - Membership Inference Attacks Against Machine Learning Models (Shokri et al., 2017) - Introduces membership inference attacks and defenses
 
Example Scenario:
MedAI needed to deploy their model to hospital systems, but the model itself contained privacy risks:
- Model Inversion Attacks: Adversaries could extract training data patterns from model parameters
 - Membership Inference: Attackers could determine which patients were used for training
 - Intellectual Property: Model weights revealed MedAI's proprietary algorithms
 
Privacy-Preserving Solution: Used Knowledge Distillation and Model Compression techniques to create privacy-preserving model versions. Implemented defenses against membership inference attacks using techniques from Adversarial Robustness Toolbox.
The solution enabled deployment of sanitized models that maintain prediction accuracy while preventing privacy attacks and protecting both patient data and intellectual property.
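A minimal knowledge-distillation sketch in PyTorch follows (architectures, temperature, and data are placeholders, not MedAI's deployed models): a smaller "student" is trained to match the softened outputs of the trained "teacher", so the artifact that ships never directly fits the raw training labels.

```python
# Knowledge distillation sketch: train a compact student to mimic the
# teacher's softened output distribution via KL divergence.
import torch
from torch import nn
import torch.nn.functional as F

teacher = nn.Sequential(nn.Linear(28, 64), nn.ReLU(), nn.Linear(64, 2))  # assume already trained
student = nn.Sequential(nn.Linear(28, 8), nn.ReLU(), nn.Linear(8, 2))

optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
temperature = 4.0

features = torch.randn(256, 28)     # stand-in batch of (possibly synthetic) records

teacher.eval()
for step in range(100):
    with torch.no_grad():
        soft_targets = F.softmax(teacher(features) / temperature, dim=1)
    student_log_probs = F.log_softmax(student(features) / temperature, dim=1)

    # KL divergence between softened teacher and student distributions.
    loss = F.kl_div(student_log_probs, soft_targets, reduction="batchmean") * temperature**2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```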
Libraries & Tools:
References:
- The Algorithmic Foundations of Differential Privacy (Dwork & Roth, 2014) - Comprehensive textbook on differential privacy theory and practice
 - Rényi Differential Privacy (Mironov, 2017) - Introduces Rényi differential privacy for better privacy accounting
 
Example Scenario:
MedAI was using differential privacy across multiple model versions and data analysis tasks, but poor privacy budget management created vulnerabilities:
- Budget Exhaustion: Researchers were using up privacy budgets too quickly
 - Composition Attacks: Multiple queries with small epsilon values could accumulate to reveal private information
 - Inconsistent Accounting: Different teams used different privacy accounting methods
 
Privacy-Preserving Solution: Implemented centralized privacy budget management using Google's dp-accounting and TensorFlow Privacy Accountant. Created governance policies for privacy budget allocation across research projects.
This established sustainable privacy budget management that allowed long-term research while maintaining strong privacy guarantees.
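To make the governance idea concrete, here is a simplified budget ledger using basic (sequential) composition, where the epsilons of released analyses simply add up. This is an illustration only; a real deployment would use a tighter accountant (e.g., RDP) from a library such as Google's dp-accounting:

```python
# Simplified privacy-budget ledger: deny any analysis that would push the
# cumulative epsilon past the organization-wide cap (basic composition).
class PrivacyBudgetLedger:
    def __init__(self, total_epsilon):
        self.total_epsilon = total_epsilon
        self.spent = []

    def request(self, project, epsilon):
        if sum(e for _, e in self.spent) + epsilon > self.total_epsilon:
            raise RuntimeError(f"Budget exceeded: denying {project} (epsilon={epsilon})")
        self.spent.append((project, epsilon))
        return epsilon

    def remaining(self):
        return self.total_epsilon - sum(e for _, e in self.spent)

ledger = PrivacyBudgetLedger(total_epsilon=3.0)
ledger.request("readmission-model-v2", 1.0)      # hypothetical projects
ledger.request("cohort-statistics", 0.5)
print("Remaining budget:", ledger.remaining())
```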
Libraries & Tools:
References:
- Evaluating Differentially Private Machine Learning in Practice (Jayaraman & Evans, 2019) - Empirical evaluation of privacy-utility trade-offs in DP ML
 - Machine Learning with Membership Privacy using Adversarial Regularization (Nasr et al., 2018) - Uses adversarial training to improve membership privacy
 
Example Scenario:
MedAI needed to continuously assess whether their privacy protections were working effectively:
- Unknown Vulnerabilities: New attack methods could bypass existing protections
 - Compliance Auditing: Regulators required proof that privacy protections were effective
 - Risk Assessment: Needed quantitative measures of privacy risk for business decisions
 
Privacy-Preserving Solution: Deployed ML Privacy Meter for continuous privacy auditing, implemented membership inference attack simulations using IMIA, and established regular privacy risk assessments.
This created a comprehensive privacy monitoring system that provides early warning of potential privacy breaches and demonstrates compliance to regulators.
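A basic loss-threshold membership-inference audit in the style of Shokri et al. (2017) is sketched below. It is a generic audit illustration, not the ML Privacy Meter or IMIA API; the model and data are synthetic placeholders:

```python
# Membership-inference audit sketch: if training ("member") losses are
# systematically lower than held-out ("non-member") losses, the model
# leaks membership information. AUC near 0.5 indicates little leakage.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, log_loss

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(500, 28)), rng.integers(0, 2, 500)
X_test, y_test = rng.normal(size=(500, 28)), rng.integers(0, 2, 500)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

def per_sample_loss(model, X, y):
    probs = model.predict_proba(X)
    return np.array([log_loss([yi], [pi], labels=[0, 1]) for yi, pi in zip(y, probs)])

member_losses = per_sample_loss(model, X_train, y_train)
nonmember_losses = per_sample_loss(model, X_test, y_test)

# The attack score is the negative loss: lower loss -> more likely a member.
scores = np.concatenate([-member_losses, -nonmember_losses])
membership = np.concatenate([np.ones(500), np.zeros(500)])
print("Membership inference AUC:", roc_auc_score(membership, scores))
```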
- Google's Differential Privacy Tutorial
 - OpenDP Tutorial Series
 - Opacus Tutorials
 - TensorFlow Privacy Tutorials
 - IBM Differential Privacy Library Tutorials
 
- TensorFlow Federated Tutorials
 - Flower Federated Learning Tutorials
 - PySyft Tutorials
 - FedML Tutorials
 - NVFlare Examples
 
Contributions welcome! Read the contribution guidelines first.