A community-driven collection of privacy-preserving machine learning techniques, tools, and practical evaluations
This repository serves as a living catalog of privacy-preserving machine learning (PPML) techniques and tools. Building on the NIST Adversarial Machine Learning Taxonomy (2025), our goal is to create a comprehensive resource where practitioners can find, evaluate, and implement privacy-preserving solutions in their ML workflows.
An example scenario is included under each section to help practitioners relate each phase back to a real-world issue:
We follow MedAI Healthcare Solutions, a fictional company developing an AI system to predict patient readmission risk using electronic health records (EHRs) from 50 hospitals across the country. The system will help hospitals optimize discharge planning and reduce healthcare costs. However, each phase of their ML pipeline presents unique privacy challenges that could expose sensitive patient information or violate HIPAA regulations.
About Our Team
Our team actively maintains and evaluates the repository by:
- Testing and benchmarking each framework/tool
 - Documenting pros, cons, and integration challenges
 - Providing practical examples and use cases
 - Maintaining an evaluation website with detailed analyses
 - Keeping the collection updated with the latest PPML developments
 
How to Contribute
We welcome contributions from the community! Whether you're a researcher, practitioner, or enthusiast, you can help by:
- Adding new privacy-preserving tools and frameworks
 - Sharing your experiences with existing tools
 - Contributing evaluation results
 - Suggesting improvements to our documentation
 
Repository Structure
Each section includes:
- Libraries & Tools: Practical implementations and frameworks you can use
 - References: Research papers, tutorials, and resources for deeper understanding
 
The techniques covered include:
- Data minimization and synthetic data generation
 - Local differential privacy and secure multi-party computation
 - Differentially private training and federated learning
 - Private inference and model protection
 - Privacy governance and evaluation
 
- 1. Data Collection Phase
 - 2. Data Processing Phase
 - 3. Model Training Phase
 - 4. Model Deployment Phase
 - 5. Privacy Governance
 - 6. Evaluation & Metrics
 - 7. Libraries & Tools
 - 8. Tutorials & Resources
 - Contribute
 
Libraries & Tools:
References:
- The Data Minimization Principle in Machine Learning (Ganesh et al., 2024) - Empirical exploration of data minimization and its misalignment with privacy, along with potential solutions
 - Data Minimization for GDPR Compliance in Machine Learning Models (Goldsteen et al., 2022) - Method to reduce personal data needed for ML predictions while preserving model accuracy through knowledge distillation
 - From Principle to Practice: Vertical Data Minimization for Machine Learning (Staab et al., 2023) - Comprehensive framework for implementing data minimization in machine learning with data generalization techniques
 - Data Shapley: Equitable Valuation of Data for Machine Learning (Ghorbani & Zou, 2019) - Introduces method to quantify the value of individual data points to model performance, enabling systematic data reduction
 - Algorithmic Data Minimization for ML over IoT Data Streams (Kil et al., 2024) - Framework for minimizing data collection in IoT environments while balancing utility and privacy
 - Membership Inference Attacks Against Machine Learning Models (Shokri et al., 2017) - Pioneering work on membership inference attacks that can be used to audit privacy leakage in ML models
 - Selecting critical features for data classification based on machine learning methods (Dewi et al., 2020) - Demonstrates that feature selection improves model accuracy and performance while reducing dimensionality
 
Example Scenario:
MedAI initially collected comprehensive patient data including full medical histories, demographic information, insurance details, social security numbers, and even lifestyle data from wearable devices. While more data seemed better for model accuracy, this approach created several issues:
- Over-collection Risk: The company stored 200+ features per patient when only 20-30 were actually predictive of readmission
 - Compliance Violation: Collecting more data than necessary violates GDPR's data minimization principle
 - Attack Surface: Extra data created more opportunities for membership inference attacks
 - Storage Costs: Unnecessary data inflated storage and processing costs
 
Privacy-Preserving Solution: Using tools like SHAP and Data Shapley, MedAI identified the minimum set of features needed for accurate predictions. They implemented feature selection algorithms from scikit-learn and used ML Privacy Meter to audit which features contributed most to privacy leakage.
This reduced the feature set from 200+ to 28 critical variables (age, diagnosis codes, length of stay, etc.) while maintaining 94% model accuracy and significantly reducing privacy risk.
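As a rough sketch of this kind of feature reduction (not MedAI's actual pipeline; the input file, column names, and feature budget below are hypothetical), a scikit-learn selection pass could look like this:

```python
# Hypothetical sketch: keep only the top-k most predictive features before
# training, assuming a tabular DataFrame with a binary `readmitted` label.
# Column names and the CSV path are illustrative only.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

ehr_df = pd.read_csv("ehr_features.csv")          # assumed input file
X = ehr_df.drop(columns=["readmitted"])
y = ehr_df["readmitted"]

# Rank features by importance with a simple tree ensemble, then keep only
# the strongest predictors (threshold=-inf so exactly max_features are kept).
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=200, random_state=0),
    max_features=28,          # target feature budget from the scenario
    threshold=-np.inf,
)
selector.fit(X, y)

kept_columns = X.columns[selector.get_support()]
X_minimized = X[kept_columns]
print(f"Reduced from {X.shape[1]} to {len(kept_columns)} features")
```

In practice, importance-based selection like this would be combined with attribution tools such as SHAP or Data Shapley and a privacy audit of the retained features, as described above.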
Libraries & Tools:
References:
- Synthetic Data: Revisiting the Privacy-Utility Trade-off (Sarmin et al., 2024) - Analysis of privacy-utility trade-offs between synthetic data and traditional anonymization
 - Machine Learning for Synthetic Data Generation: A Review (Zhao et al., 2023) - Comprehensive review of synthetic data generation techniques and their applications
 - Modeling Tabular Data using Conditional GAN (Xu et al., 2019) - Introduces CTGAN, designed specifically for mixed-type tabular data generation
 - Tabular and latent space synthetic data generation: a literature review (Garcia-Gasulla et al., 2023) - Review of data generation methods for tabular data
 - Synthetic data for enhanced privacy: A VAE-GAN approach against membership inference attacks (Yan et al., 2024) - Novel hybrid approach combining VAE and GAN
 - SMOTE: Synthetic Minority Over-sampling Technique (Chawla et al., 2002) - Classic approach for generating synthetic samples for minority classes
 - Empirical privacy metrics: the bad, the ugly... and the good, maybe? (Desfontaines, 2024) - Critical analysis of common empirical privacy metrics in synthetic data
 - Challenges of Using Synthetic Data Generation Methods for Tabular Microdata (Winter & Tolan, 2023) - Empirical study of trade-offs in different synthetic data generation methods
 - Privacy Auditing of Machine Learning using Membership Inference Attacks (Yaghini et al., 2021) - Framework for privacy auditing in ML models
 - PATE-GAN: Generating Synthetic Data with Differential Privacy Guarantees (Jordon et al., 2019) - Integrates differential privacy into GANs using the PATE framework
 - A Critical Review on the Use (and Misuse) of Differential Privacy in Machine Learning (Domingo-Ferrer & Soria-Comas, 2022) - Analysis of privacy in ML including synthetic data approaches
 - Protect and Extend - Using GANs for Synthetic Data Generation of Time-Series Medical Records (2024) - Application and evaluation of synthetic data in healthcare domain
 
Example Scenario:
MedAI needed to share data with research partners and train models across different hospital systems, but couldn't transfer real patient records due to HIPAA restrictions. Traditional anonymization techniques were insufficient because:
- Re-identification Risk: Simple anonymization could be reversed using auxiliary datasets
 - Utility Loss: Heavy anonymization made data unsuitable for ML training
 - Legal Barriers: Hospitals were reluctant to share even anonymized real patient data
 
Privacy-Preserving Solution: MedAI implemented CTGAN to generate synthetic patient records that preserved statistical relationships without containing real patient information. They used PATE-GAN to add differential privacy guarantees and validated synthetic data quality using the SDV framework.
This generated 100,000 synthetic patient records that maintained the same readmission prediction patterns as real data, enabling safe data sharing while protecting patient privacy.
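A minimal sketch of generating synthetic tabular records with the `ctgan` package follows (the API may vary by version; the input file, column names, and epoch count are hypothetical):

```python
# Minimal CTGAN sketch: learn the joint distribution of a de-identified
# training sample and emit synthetic rows that mimic it without copying
# individual records. Data and column names are placeholders.
import pandas as pd
from ctgan import CTGAN

real_records = pd.read_csv("deidentified_training_sample.csv")   # assumed input
discrete_columns = ["diagnosis_code", "discharge_disposition", "readmitted"]

model = CTGAN(epochs=300)
model.fit(real_records, discrete_columns)

synthetic_records = model.sample(100_000)
synthetic_records.to_csv("synthetic_patients.csv", index=False)
```

Note that CTGAN alone provides no formal privacy guarantee; that is why the scenario layers PATE-GAN-style differential privacy and quality validation (e.g., with SDV) on top.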
Libraries & Tools:
- OpenDP
 - IBM Differential Privacy Library
 - Tumult Analytics
 - Google's Differential Privacy Library
 - TensorFlow Privacy
 - Opacus
 - Microsoft SmartNoise
 
References:
- A friendly introduction to differential privacy (Desfontaines) - Accessible explanation of differential privacy concepts and fundamentals
 - Local Differential Privacy: a tutorial (Xiong et al., 2020) - Comprehensive overview of LDP theory and applications
 - RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response (Erlingsson et al., 2014) - Google's LDP system for Chrome usage statistics
 - The Algorithmic Foundations of Differential Privacy (Dwork & Roth, 2014) - Comprehensive textbook on differential privacy
 - Approximate Differential Privacy (Programming Differential Privacy) - Detailed guide to approximate DP implementation
 - Rényi Differential Privacy (Mironov, 2017) - Original paper introducing RDP
 - Gaussian Differential Privacy (Dong et al., 2022) - Framework connecting DP to hypothesis testing
 - Getting more useful results with differential privacy (Desfontaines) - Practical advice for improving utility in DP systems
 - A reading list on differential privacy (Desfontaines) - Curated list of papers and resources for learning DP
 - Rényi Differential Privacy of the Sampled Gaussian Mechanism (Mironov et al., 2019) - Analysis of privacy guarantees for subsampled data
 - On the Rényi Differential Privacy of the Shuffle Model (Wang et al., 2021) - Analysis of shuffling for privacy amplification
 - Differential Privacy: An Economic Method for Choosing Epsilon (Hsu et al., 2014) - Framework for epsilon selection based on economic principles
 - Functional Rényi Differential Privacy for Generative Modeling (Jalko et al., 2023) - Extension of RDP to functional outputs
 - Precision-based attacks and interval refining: how to break, then fix, differential privacy (Haney et al., 2022) - Analysis of vulnerabilities in DP implementations
 - Differential Privacy: A Primer for a Non-technical Audience (Wood et al., 2018) - Accessible introduction for non-technical readers
 - Using differential privacy to harness big data and preserve privacy (Brookings, 2020) - Overview of real-world applications
 
Example Scenario:
MedAI wanted to collect real-time health metrics from patient mobile apps to improve their readmission model, but patients were concerned about privacy. Traditional data collection would reveal:
- Individual Health Status: Exact blood pressure, weight, and activity levels
 - Behavioral Patterns: Sleep schedules, medication adherence, lifestyle habits
 - Location Privacy: When and where health measurements were taken
 
Privacy-Preserving Solution: Implemented Local Differential Privacy using OpenDP and TensorFlow Privacy. Each patient's mobile app added calibrated noise to health metrics before transmission, ensuring individual readings couldn't be precisely determined while still allowing for useful aggregate statistics.
This solution allowed MedAI to collect privacy-preserving health metrics from 10,000+ patients, improving readmission prediction by 8% while ensuring no individual patient data could be reconstructed.
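A conceptual sketch of the local-DP idea is shown below using a plain Laplace mechanism rather than any specific library API; the clipping bounds and epsilon are illustrative, and a production system would use a vetted library such as OpenDP:

```python
# Local differential privacy sketch for a bounded numeric reading
# (e.g., systolic blood pressure clipped to [80, 200]). Each client adds
# calibrated Laplace noise locally, so the server never sees exact values,
# yet aggregate statistics remain useful.
import numpy as np

def randomize_reading(value, lower=80.0, upper=200.0, epsilon=1.0, rng=None):
    rng = rng or np.random.default_rng()
    clipped = float(np.clip(value, lower, upper))
    sensitivity = upper - lower                  # worst-case change for one user
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return clipped + noise                       # transmitted instead of the true value

# Each client perturbs locally; the mean of noisy values still estimates
# the population mean.
true_values = np.random.default_rng(0).normal(130, 15, size=10_000)
noisy_values = [randomize_reading(v) for v in true_values]
print(np.mean(true_values), np.mean(noisy_values))
```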
Libraries & Tools:
References:
- Secure Multiparty Computation (Lindell, 2020) - Comprehensive overview of SMPC theory and applications
 
Example Scenario:
Three major hospital networks wanted to collaboratively train a readmission model using their combined data, but legal and competitive concerns prevented direct data sharing:
- Competitive Advantage: Hospitals didn't want competitors seeing their patient demographics or treatment outcomes
 - Regulatory Compliance: Cross-institutional data sharing required extensive legal agreements
 - Data Sovereignty: Each hospital needed to maintain control over their data
 
Privacy-Preserving Solution: Used MP-SPDZ and CrypTen to implement secure multi-party computation. Hospitals could jointly train models on encrypted data without revealing individual records to each other.
Using this solution, the hospital networks successfully trained a collaborative model using 150,000 patient records across three hospital systems, achieving 15% better accuracy than individual hospital models while maintaining complete data privacy.
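To illustrate the core idea behind SMPC frameworks such as MP-SPDZ and CrypTen (this is a teaching sketch, not their APIs), here is toy additive secret sharing over a prime field:

```python
# Additive secret sharing: each hospital splits its local statistic into
# random shares, distributes them across parties, and only the combination
# of all shares reveals the aggregate -- never any individual input.
import secrets

PRIME = 2**61 - 1   # field modulus

def share(secret, n_parties):
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    last = (secret - sum(shares)) % PRIME
    return shares + [last]

def reconstruct(shares):
    return sum(shares) % PRIME

# Hypothetical local readmission counts at three hospitals.
hospital_counts = [1250, 980, 1430]
all_shares = [share(c, 3) for c in hospital_counts]

# Party i sums the i-th share from every hospital; no party sees raw counts.
partial_sums = [sum(s[i] for s in all_shares) % PRIME for i in range(3)]
print(reconstruct(partial_sums))   # 3660, the joint total
```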
Libraries & Tools:
References:
- Deep Learning with Differential Privacy (Abadi et al., 2016) - Introduces DP-SGD algorithm for training deep neural networks with differential privacy
 - Differentially Private Model Publishing for Deep Learning (Yu et al., 2018) - Methods for publishing deep learning models with privacy guarantees
 
Example Scenario:
MedAI's initial model training process was vulnerable to privacy attacks:
- Membership Inference: Adversaries could determine if specific patients were in the training data
 - Model Inversion: Attackers might reconstruct patient information from model parameters
 - Gradient Leakage: Model gradients could leak sensitive patient information during training
 
Privacy-Preserving Solution: Implemented Differentially Private SGD using Opacus for PyTorch training. Added calibrated noise to gradients during training and used TensorFlow Privacy for privacy accounting to track the total privacy budget consumed.
This produced trained models with formal privacy guarantees (ε=1.0 differential privacy) while maintaining 91% accuracy on readmission prediction tasks.
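A sketch of wiring Opacus into a PyTorch training loop is shown below (API as in recent Opacus releases, to the best of our understanding; the model, data, and hyperparameters are placeholders, not MedAI's actual configuration):

```python
# DP-SGD sketch with Opacus: per-sample gradients are clipped and noised,
# and the PrivacyEngine tracks the privacy budget spent.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

features = torch.randn(1000, 28)                  # stand-in for the 28 selected features
labels = torch.randint(0, 2, (1000,)).float()
loader = DataLoader(TensorDataset(features, labels), batch_size=64)

model = nn.Sequential(nn.Linear(28, 16), nn.ReLU(), nn.Linear(16, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.BCEWithLogitsLoss()

privacy_engine = PrivacyEngine()
model, optimizer, loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=loader,
    noise_multiplier=1.1,     # noise added to clipped per-sample gradients
    max_grad_norm=1.0,        # per-sample gradient clipping bound
)

for epoch in range(5):
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x).squeeze(-1), y)
        loss.backward()
        optimizer.step()

print("epsilon spent:", privacy_engine.get_epsilon(delta=1e-5))
```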
Libraries & Tools:
References:
- Communication-Efficient Learning of Deep Networks from Decentralized Data (McMahan et al., 2017) - Introduces Federated Averaging (FedAvg) algorithm for efficient federated learning
 - Practical Secure Aggregation for Federated Learning on User-Held Data (Bonawitz et al., 2017) - Cryptographic protocol for secure aggregation in federated learning
 - Federated Learning: Strategies for Improving Communication Efficiency (Konečný et al., 2016) - Techniques for reducing communication costs in federated learning
 
Example Scenario:
MedAI wanted to continuously improve their model using data from all partner hospitals, but centralized training created privacy and logistical issues:
- Data Centralization Risk: Moving all hospital data to a central location violated data governance policies
 - Bandwidth Limitations: Transferring large healthcare datasets was impractical
 - Regulatory Requirements: Some hospitals couldn't legally share raw patient data
 
Privacy-Preserving Solution: Deployed TensorFlow Federated framework where each hospital trained local models on their own data and only shared encrypted model updates. Used Flower for orchestrating the federated learning process across hospital networks.
This method enables continuous improvement of the global model using data from 50 hospitals without centralizing patient records, achieving state-of-the-art readmission prediction while maintaining data locality.
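The aggregation rule at the heart of this setup is Federated Averaging (FedAvg, McMahan et al., 2017). The sketch below shows only the server-side weighted averaging in plain NumPy, not the TensorFlow Federated or Flower APIs; the hospital counts and model shapes are hypothetical:

```python
# FedAvg aggregation sketch: each hospital trains locally and sends only
# weight updates; the server averages them weighted by local dataset size.
import numpy as np

def fedavg(client_weights, client_sizes):
    """Weighted average of per-client model weights (lists of arrays)."""
    total = sum(client_sizes)
    n_layers = len(client_weights[0])
    return [
        sum(w[layer] * (size / total)
            for w, size in zip(client_weights, client_sizes))
        for layer in range(n_layers)
    ]

# Hypothetical: three hospitals, each with a two-layer model.
rng = np.random.default_rng(0)
clients = [[rng.normal(size=(28, 16)), rng.normal(size=(16,))] for _ in range(3)]
sizes = [60_000, 50_000, 40_000]

global_weights = fedavg(clients, sizes)
print(global_weights[0].shape, global_weights[1].shape)
```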
Libraries & Tools:
References:
- CryptoNets: Applying Neural Networks to Encrypted Data with High Throughput and Accuracy (Gilad-Bachrach et al., 2016) - Early work on running neural networks on homomorphically encrypted data
 - GAZELLE: A Low Latency Framework for Secure Neural Network Inference (Juvekar et al., 2018) - Efficient framework for secure neural network inference using homomorphic encryption
 
Example Scenario:
When hospitals queried MedAI's readmission prediction model, they had to send patient data to MedAI's servers, creating privacy risks:
- Data in Transit: Patient information was exposed during transmission
 - Server-Side Storage: MedAI could potentially store and analyze sensitive patient queries
 - Insider Threats: MedAI employees could access real patient data during inference
 
Privacy-Preserving Solution: Implemented Homomorphic Encryption using TenSEAL and Microsoft SEAL. Hospitals could encrypt patient data locally and receive encrypted predictions without MedAI ever seeing the raw patient information.
Using homomorphic encryption enabled privacy-preserving predictions where hospitals could get readmission risk scores without exposing patient data to MedAI or third parties.
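As a simplified illustration, the sketch below computes an encrypted linear risk score with TenSEAL's CKKS scheme (API as best understood; the encryption parameters, features, and weights are illustrative, and real deployments require careful parameter selection and more elaborate model evaluation):

```python
# Encrypted linear scoring sketch with TenSEAL/CKKS: the hospital encrypts
# the feature vector, the server computes on ciphertext, and only the
# hospital's key can decrypt the result.
import tenseal as ts

# Hospital side: build an encryption context and encrypt the features.
context = ts.context(
    ts.SCHEME_TYPE.CKKS,
    poly_modulus_degree=8192,
    coeff_mod_bit_sizes=[60, 40, 40, 60],
)
context.global_scale = 2**40
context.generate_galois_keys()

patient_features = [0.3, 1.2, 0.7, 0.0, 2.1]      # hypothetical normalized features
encrypted_features = ts.ckks_vector(context, patient_features)

# Server side (MedAI): compute a linear score directly on the ciphertext.
weights = [0.5, -0.2, 0.8, 0.1, 0.4]
bias = 0.05
encrypted_score = encrypted_features.dot(weights)

# Hospital side: only the key holder can decrypt; the bias is added after
# decryption here just to keep the ciphertext operations minimal.
risk_score = encrypted_score.decrypt()[0] + bias
print(risk_score)
```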
Libraries & Tools:
References:
- Distillation as a Defense to Adversarial Perturbations Against Deep Neural Networks (Papernot et al., 2016) - Uses model distillation to improve robustness and privacy
 - Membership Inference Attacks Against Machine Learning Models (Shokri et al., 2017) - Introduces membership inference attacks and defenses
 
Example Scenario:
MedAI needed to deploy their model to hospital systems, but the model itself contained privacy risks:
- Model Inversion Attacks: Adversaries could extract training data patterns from model parameters
 - Membership Inference: Attackers could determine which patients were used for training
 - Intellectual Property: Model weights revealed MedAI's proprietary algorithms
 
Privacy-Preserving Solution: Used Knowledge Distillation and Model Compression techniques to create privacy-preserving model versions. Implemented defenses against membership inference attacks using techniques from Adversarial Robustness Toolbox.
The solution enabled deployment of sanitized models that maintain prediction accuracy while preventing privacy attacks and protecting both patient data and intellectual property.
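A minimal knowledge-distillation sketch in PyTorch follows (architectures, temperature, and data are placeholders, not MedAI's deployed models): a smaller "student" is trained to match the softened outputs of the trained "teacher", so the artifact that ships never directly fits the raw training labels.

```python
# Knowledge distillation sketch: train a compact student to mimic the
# teacher's softened output distribution via KL divergence.
import torch
from torch import nn
import torch.nn.functional as F

teacher = nn.Sequential(nn.Linear(28, 64), nn.ReLU(), nn.Linear(64, 2))  # assume already trained
student = nn.Sequential(nn.Linear(28, 8), nn.ReLU(), nn.Linear(8, 2))

optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
temperature = 4.0

features = torch.randn(256, 28)     # stand-in batch of (possibly synthetic) records

teacher.eval()
for step in range(100):
    with torch.no_grad():
        soft_targets = F.softmax(teacher(features) / temperature, dim=1)
    student_log_probs = F.log_softmax(student(features) / temperature, dim=1)

    # KL divergence between softened teacher and student distributions.
    loss = F.kl_div(student_log_probs, soft_targets, reduction="batchmean") * temperature**2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```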
Libraries & Tools:
References:
- The Algorithmic Foundations of Differential Privacy (Dwork & Roth, 2014) - Comprehensive textbook on differential privacy theory and practice
 - Rényi Differential Privacy (Mironov, 2017) - Introduces Rényi differential privacy for better privacy accounting
 
Example Scenario:
MedAI was using differential privacy across multiple model versions and data analysis tasks, but poor privacy budget management created vulnerabilities:
- Budget Exhaustion: Researchers were using up privacy budgets too quickly
 - Composition Attacks: Multiple queries with small epsilon values could accumulate to reveal private information
 - Inconsistent Accounting: Different teams used different privacy accounting methods
 
Privacy-Preserving Solution: Implemented centralized privacy budget management using Google's dp-accounting and TensorFlow Privacy Accountant. Created governance policies for privacy budget allocation across research projects.
This established sustainable privacy budget management that allowed long-term research while maintaining strong privacy guarantees.
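To make the governance idea concrete, here is a simplified budget ledger using basic (sequential) composition, where the epsilons of released analyses simply add up. This is an illustration only; a real deployment would use a tighter accountant (e.g., RDP) from a library such as Google's dp-accounting:

```python
# Simplified privacy-budget ledger: deny any analysis that would push the
# cumulative epsilon past the organization-wide cap (basic composition).
class PrivacyBudgetLedger:
    def __init__(self, total_epsilon):
        self.total_epsilon = total_epsilon
        self.spent = []

    def request(self, project, epsilon):
        if sum(e for _, e in self.spent) + epsilon > self.total_epsilon:
            raise RuntimeError(f"Budget exceeded: denying {project} (epsilon={epsilon})")
        self.spent.append((project, epsilon))
        return epsilon

    def remaining(self):
        return self.total_epsilon - sum(e for _, e in self.spent)

ledger = PrivacyBudgetLedger(total_epsilon=3.0)
ledger.request("readmission-model-v2", 1.0)      # hypothetical projects
ledger.request("cohort-statistics", 0.5)
print("Remaining budget:", ledger.remaining())
```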
Libraries & Tools:
References:
- Evaluating Differentially Private Machine Learning in Practice (Jayaraman & Evans, 2019) - Empirical evaluation of privacy-utility trade-offs in DP ML
 - Machine Learning with Membership Privacy using Adversarial Regularization (Nasr et al., 2018) - Uses adversarial training to improve membership privacy
 
Example Scenario:
MedAI needed to continuously assess whether their privacy protections were working effectively:
- Unknown Vulnerabilities: New attack methods could bypass existing protections
 - Compliance Auditing: Regulators required proof that privacy protections were effective
 - Risk Assessment: Needed quantitative measures of privacy risk for business decisions
 
Privacy-Preserving Solution: Deployed ML Privacy Meter for continuous privacy auditing, implemented membership inference attack simulations using IMIA, and established regular privacy risk assessments.
This created a comprehensive privacy monitoring system that provides early warning of potential privacy breaches and demonstrates compliance to regulators.
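A basic loss-threshold membership-inference audit in the style of Shokri et al. (2017) is sketched below. It is a generic audit illustration, not the ML Privacy Meter or IMIA API; the model and data are synthetic placeholders:

```python
# Membership-inference audit sketch: if training ("member") losses are
# systematically lower than held-out ("non-member") losses, the model
# leaks membership information. AUC near 0.5 indicates little leakage.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, log_loss

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(500, 28)), rng.integers(0, 2, 500)
X_test, y_test = rng.normal(size=(500, 28)), rng.integers(0, 2, 500)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

def per_sample_loss(model, X, y):
    probs = model.predict_proba(X)
    return np.array([log_loss([yi], [pi], labels=[0, 1]) for yi, pi in zip(y, probs)])

member_losses = per_sample_loss(model, X_train, y_train)
nonmember_losses = per_sample_loss(model, X_test, y_test)

# The attack score is the negative loss: lower loss -> more likely a member.
scores = np.concatenate([-member_losses, -nonmember_losses])
membership = np.concatenate([np.ones(500), np.zeros(500)])
print("Membership inference AUC:", roc_auc_score(membership, scores))
```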
- Google's Differential Privacy Tutorial
 - OpenDP Tutorial Series
 - Opacus Tutorials
 - TensorFlow Privacy Tutorials
 - IBM Differential Privacy Library Tutorials
 
- TensorFlow Federated Tutorials
 - Flower Federated Learning Tutorials
 - PySyft Tutorials
 - FedML Tutorials
 - NVFlare Examples
 
Contributions welcome! Read the contribution guidelines first.