PyIPU is a Python package that implements the Iterative Proportional Updating (IPU) algorithm proposed by Ye et al. (2009) in the paper "Methodology to match distributions of both household and person attributes in generation of synthetic populations". This implementation is based on the ipfr R package.
The IPU algorithm is a general case of iterative proportional fitting that can satisfy two disparate sets of marginals that do not agree on a single total. A common example is balancing population data using household- and person-level marginal controls for survey expansion or synthetic population creation.
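To make the idea concrete, here is a minimal NumPy sketch of an IPU-style loop. It is not PyIPU's implementation: the function name `ipu_sketch`, the incidence-matrix layout, and the toy data are assumptions made for illustration. Each row is a household, each column is a household or person category, and the loop rescales the weights of the households that contribute to a category until every weighted column total is close to its target.

```python
import numpy as np


def ipu_sketch(incidence, targets, max_iterations=200, tol=1e-6):
    """Toy IPU loop (illustrative only, not the PyIPU code).

    incidence[i, j] is how much household i contributes to constraint j:
    1/0 for a household category, a person count for a person category.
    """
    incidence = np.asarray(incidence, dtype=float)
    targets = np.asarray(targets, dtype=float)
    w = np.ones(incidence.shape[0])  # start from unit seed weights

    for _ in range(max_iterations):
        for j in range(incidence.shape[1]):
            current = w @ incidence[:, j]
            if current > 0:
                # Rescale only the households that contribute to constraint j
                contributes = incidence[:, j] > 0
                w[contributes] *= targets[j] / current
        # Stop once the worst relative miss across all constraints is small
        fitted = w @ incidence
        if np.max(np.abs(fitted - targets) / targets) < tol:
            break
    return w


# Hypothetical data: columns are [hh size 1, hh size 2, young persons, old persons]
incidence = np.array([
    [1, 0, 1, 0],
    [0, 1, 1, 1],
    [0, 1, 2, 0],
    [1, 0, 0, 1],
])
targets = np.array([70, 30, 70, 60])

weights = ipu_sketch(incidence, targets)
fitted = weights @ incidence
print("Weights:", weights)
print("Fitted totals:", fitted)
```

The household columns sum to 100 households while the person columns sum to 130 persons, so the two sets of targets do not share a single total, which is the situation the paragraph above describes.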
Key features:
- Support for both household and person level constraints
- Handling of multiple geographies
- Configurable convergence criteria
- Detailed reporting of results
- Faster than traditional IPF
pip install pyipu
Or install from source:
git clone https://github.com/williamagyapong/pyipu.git
cd pyipu
pip install -e .
- Python 3.7+
- NumPy
- Pandas
- Matplotlib
- scikit-learn
import pandas as pd
import numpy as np
from pyipu import ipu
# Create a simple household seed table
hh_seed = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'siz': [1, 2, 2, 1],
    'weight': [1, 1, 1, 1],
    'geo_cluster': [1, 1, 2, 2]
})
# Create household targets
hh_targets = {}
hh_targets['siz'] = pd.DataFrame({
    'geo_cluster': [1, 2],
    '1': [75, 100],
    '2': [25, 150]
})
# Run IPU
result = ipu(hh_seed, hh_targets, max_iterations=5)
# Access the results
print(result['weight_tbl']) # Household table with weights
print(result['primary_comp']) # Comparison of results to targets
result['weight_dist'] # Matplotlib figure showing weight distribution for diagnostics
TODO: Include an example of using the PyIPU package with household and person level constraints to demonstrate how to use the IPU algorithm with both household and person level seed tables and targets, which is a common use case in population synthesis.
import numpy as np
from pyipu import ipu_matrix
# Create a matrix
mtx = np.array([
    [10, 20, 30],
    [40, 50, 60],
    [70, 80, 90]
])
# Define row and column targets
row_targets = np.array([100, 200, 300])
col_targets = np.array([200, 250, 150])
# Balance the matrix
balanced_mtx = ipu_matrix(mtx, row_targets, col_targets)
print(balanced_mtx)
print("Row sums:", balanced_mtx.sum(axis=1))
print("Column sums:", balanced_mtx.sum(axis=0))
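For intuition about what balancing a matrix to row and column targets means, the following is a sketch of classical iterative proportional fitting in plain NumPy. It is an assumption-laden illustration, not the code behind `ipu_matrix`; the helper name `ipf_matrix_sketch` is hypothetical. It alternately rescales rows and columns until the row sums match their targets.

```python
import numpy as np


def ipf_matrix_sketch(mtx, row_targets, col_targets, max_iterations=1000, tol=1e-8):
    """Classical IPF on a 2D array (illustrative only)."""
    m = np.asarray(mtx, dtype=float).copy()
    row_targets = np.asarray(row_targets, dtype=float)
    col_targets = np.asarray(col_targets, dtype=float)

    for _ in range(max_iterations):
        # Scale each row so that row sums match the row targets
        m *= (row_targets / m.sum(axis=1))[:, None]
        # Scale each column so that column sums match the column targets
        m *= (col_targets / m.sum(axis=0))[None, :]
        # Columns are exact after the last step, so only rows need checking
        if np.allclose(m.sum(axis=1), row_targets, rtol=tol, atol=0):
            break
    return m


mtx = np.array([
    [10, 20, 30],
    [40, 50, 60],
    [70, 80, 90],
])
row_targets = np.array([100, 200, 300])
col_targets = np.array([200, 250, 150])

balanced = ipf_matrix_sketch(mtx, row_targets, col_targets)
print("Row sums:", balanced.sum(axis=1))
print("Column sums:", balanced.sum(axis=0))
```

This simple version requires the row and column targets to share the same grand total (600 here).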
import pandas as pd
from pyipu import ipu, synthesize
# Create a simple household seed table
hh_seed = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'siz': [1, 2, 2, 1],
    'income': ['low', 'med', 'high', 'low'],
    'weight': [1, 1, 1, 1],
    'geo_cluster': [1, 1, 2, 2]
})
# Create household targets
hh_targets = {}
hh_targets['siz'] = pd.DataFrame({
    'geo_cluster': [1, 2],
    '1': [75, 100],
    '2': [25, 150]
})
hh_targets['income'] = pd.DataFrame({
    'geo_cluster': [1, 2],
    'low': [60, 120],
    'med': [30, 80],
    'high': [10, 50]
})
# Run IPU
result = ipu(hh_seed, hh_targets, max_iterations=10)
# Create a synthetic population
synthetic_pop = synthesize(result['weight_tbl'], group_by='geo_cluster')
print("Synthetic population (first 10 rows):")
print(synthetic_pop.head(10))
ipu(primary_seed, primary_targets,
    secondary_seed=None, secondary_targets=None,
    primary_id="id", secondary_importance=1,
    relative_gap=0.01, max_iterations=100, absolute_diff=10,
    weight_floor=0.00001, verbose=False,
    max_ratio=10000, min_ratio=0.0001)
Parameters:
- primary_seed: DataFrame containing the primary seed table (e.g., households)
- primary_targets: Dictionary of DataFrames with target marginals for primary seed
- secondary_seed: Optional DataFrame containing the secondary seed table (e.g., persons)
- secondary_targets: Optional dictionary of DataFrames with target marginals for secondary seed
- primary_id: Column name that links primary and secondary seed tables
- secondary_importance: Value between 0 and 1 signifying the importance of secondary targets
- relative_gap: Convergence threshold for percent RMSE between iterations
- max_iterations: Maximum number of iterations to perform
- absolute_diff: Threshold below which absolute differences don't matter for reporting
- weight_floor: Minimum weight to allow in any cell
- verbose: Whether to print iteration details and worst marginal stats
- max_ratio: Maximum weight as a multiple of the average weight
- min_ratio: Minimum weight as a multiple of the average weight
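The max_ratio and min_ratio parameters bound each weight relative to an average weight. A rough sketch of that kind of clamp is shown below; the helper name `clamp_weights` and the way the average weight is supplied are assumptions for illustration, and PyIPU's own handling may differ in detail.

```python
import numpy as np


def clamp_weights(weights, avg_weight, min_ratio=0.0001, max_ratio=10000):
    """Keep weights within [min_ratio, max_ratio] times the average weight.

    Hypothetical helper: avg_weight is passed in directly here, whereas a
    real implementation would derive it from the seed table and targets.
    """
    w = np.asarray(weights, dtype=float)
    return np.clip(w, min_ratio * avg_weight, max_ratio * avg_weight)


# Tight ratios chosen so that the effect is visible on toy numbers
clamped = clamp_weights([0.5, 10, 400], avg_weight=10, min_ratio=0.1, max_ratio=20)
print(clamped)
```

With the defaults from the signature above, the bounds are wide enough that they only catch extreme weights.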
Returns:
A dictionary with the following keys:
- weight_tbl: The primary_seed with weight, avg_weight, and weight_factor columns
- weight_dist: A matplotlib figure showing the weight distribution
- primary_comp: A DataFrame comparing the primary seed results to targets
- secondary_comp: A DataFrame comparing the secondary seed results to targets (only if secondary_seed is provided)
ipu_matrix(mtx, row_targets, column_targets, **kwargs)
Parameters:
- mtx: 2D numpy array to balance
- row_targets: Array of targets for row sums
- column_targets: Array of targets for column sums
- **kwargs: Additional arguments passed to ipu()
Returns:
A 2D numpy array with the balanced matrix
synthesize(weight_tbl, group_by=None, primary_id="id")
Parameters:
- weight_tbl: DataFrame containing the weight table output by ipu()
- group_by: Optional column name to group by before sampling (e.g., geography)
- primary_id: Column name of the primary ID in the weight table
Returns:
A DataFrame with one record for each synthesized member of the population. A new_id column is created, but the previous primary_id column is maintained to facilitate joining back to other data sources.
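One plausible way to turn a weight table into individual records is weighted sampling with replacement, sketched below with pandas. This is an illustration under stated assumptions, not the package's code: the helper name `synthesize_sketch`, the choice to draw a number of records equal to the rounded sum of weights, and the fixed `random_state` are all hypothetical.

```python
import pandas as pd


def synthesize_sketch(weight_tbl, group_by=None, random_state=0):
    """Draw records in proportion to their weights (illustrative only)."""

    def draw(df):
        # Assumed rule: the number of draws is the rounded total weight
        n = int(round(df["weight"].sum()))
        return df.sample(n=n, replace=True, weights="weight",
                         random_state=random_state)

    if group_by is None:
        sampled = draw(weight_tbl)
    else:
        # Sample within each group so that group totals are respected
        sampled = pd.concat([draw(g) for _, g in weight_tbl.groupby(group_by)])

    sampled = sampled.drop(columns=["weight"]).reset_index(drop=True)
    # The original id column is kept; new_id numbers the synthetic records
    sampled.insert(0, "new_id", list(range(1, len(sampled) + 1)))
    return sampled


# Hypothetical weight table shaped like the weight_tbl described above
weight_tbl = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "siz": [1, 2, 2, 1],
    "weight": [75.0, 25.0, 150.0, 100.0],
    "geo_cluster": [1, 1, 2, 2],
})

synthetic = synthesize_sketch(weight_tbl, group_by="geo_cluster")
print(synthetic.head())
print(synthetic.groupby("geo_cluster").size())
```

Sampling within each geography keeps the number of synthetic records per cluster equal to that cluster's rounded weight total, even though the mix of seed records inside a cluster is random.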
MIT
If you use PyIPU in your research, please cite:
- William, O. A. (2025). *PyIPU*: Python implementation of the Iterative
Proportional Updating (IPU) algorithm [https://www.github.com/williamagyapong/pyipu]
(Version 0.1.0).
- Ye, X., Konduri, K., Pendyala, R. M., Sana, B., & Waddell, P. (2009).
A methodology to match distributions of both household and person attributes
in the generation of synthetic populations.
In 88th Annual Meeting of the Transportation Research Board, Washington, DC.
pyipu/
├── pyipu/
│ ├── __init__.py # Package initialization
│ ├── version.py # Version information
│ ├── core.py # Main IPU implementation
│ ├── utils.py # Helper functions
│ └── synthesis.py # Synthetic population generation
├── examples/
│ ├── __init__.py
│ ├── basic_example.py # Simple example with household data
│ ├── household_person_example.py # Example with household and person data
│ ├── matrix_example.py # Example of a simple matrix balancing
│ └── synthesis_example.py # Example for synthetic population generation
├── tests/
│ ├── __init__.py
│ └── test_ipu.py # Unit tests
├── .gitignore # Git ignore file
├── LICENSE # MIT License
├── README.md # Documentation
├── run_tests.py # Script to run tests
└── setup.py # Package installation script
Contributions are welcome! Please feel free to submit a Pull Request.
A big thanks to Kyle Ward, the author of the ipfr package which provided the implementation logic for pyipu. It is fair to say that pyipu is the Python version of ipfr.