| 1 | +# JSON Serialization Performance in Mypy |
| 2 | + |
| 3 | +## Overview |
| 4 | + |
| 5 | +Mypy uses JSON serialization extensively for caching type checking results, which is critical for incremental type checking performance. This document explains how mypy's JSON serialization works and how to optimize it. |
| 6 | + |
| 7 | +## Basic Usage |
| 8 | + |
| 9 | +Mypy provides two main functions for JSON serialization in `mypy.util`: |
| 10 | + |
| 11 | +```python |
| 12 | +from mypy.util import json_dumps, json_loads |
| 13 | + |
| 14 | +# Serialize an object to JSON bytes |
| 15 | +data = {"module": "mypy.main", "mtime": 1234567890.123} |
| 16 | +serialized = json_dumps(data) |
| 17 | + |
| 18 | +# Deserialize JSON bytes back to a Python object |
| 19 | +deserialized = json_loads(serialized) |
| 20 | +``` |
| 21 | + |
| 22 | +## Performance Optimization with orjson |
| 23 | + |
| 24 | +By default, mypy uses Python's standard `json` module for serialization. However, you can significantly improve performance by installing `orjson`, a fast JSON library written in Rust. |
| 25 | + |
| 26 | +### Installation |
| 27 | + |
| 28 | +```bash |
| 29 | +# Install mypy with the faster-cache optional dependency |
| 30 | +pip install "mypy[faster-cache]" |
| 31 | + |
| 32 | +# Or install orjson separately |
| 33 | +pip install orjson |
| 34 | +``` |
| 35 | + |
| 36 | +### Performance Benefits |
| 37 | + |
| 38 | +When orjson is available, mypy automatically uses it for JSON operations. Approximate speedups from benchmarks (actual results vary with data shape and hardware): |
| 39 | + |
| 40 | +- **Small objects** (< 1KB): 2-3x faster serialization and deserialization |
| 41 | +- **Medium objects** (10-100KB): 3-5x faster |
| 42 | +- **Large objects** (> 100KB): 5-10x faster |
| 43 | + |
| 44 | +For large projects with extensive caching, this can result in noticeable improvements in incremental type checking speed. |
| 45 | + |
| 46 | +## Key Guarantees |
| 47 | + |
| 48 | +### Deterministic Output |
| 49 | + |
| 50 | +`json_dumps` produces deterministic output, and `json_loads` reverses it faithfully: |
| 51 | + |
| 52 | +1. **Sorted Keys**: Dictionary keys are always sorted alphabetically |
| 53 | +2. **Consistent Encoding**: The same object always produces the same bytes |
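| 54 | +3. **Roundtrip Consistency**: `json_loads(json_dumps(obj)) == obj` for values built from JSON-native types (dicts with string keys, lists, strings, numbers, booleans, `None`) |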
| 55 | + |
| 56 | +This is critical for: |
| 57 | +- Cache invalidation (detecting when cached data has changed) |
| 58 | +- Test reproducibility |
| 59 | +- Comparing serialized output across different runs |
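
To illustrate why sorted keys give deterministic bytes, here is a minimal sketch that uses only the standard `json` module with compact, sorted-key settings. The helper name `stable_dumps` is hypothetical and is not part of `mypy.util`; it only mirrors the behavior described above.

```python
import json


def stable_dumps(obj: object) -> bytes:
    """Hypothetical helper: compact JSON bytes with sorted keys."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


# Two dicts with the same contents but different insertion order.
first = {"module": "mypy.main", "mtime": 1234567890.123}
second = {"mtime": 1234567890.123, "module": "mypy.main"}

# Sorting keys removes the dependence on insertion order.
same_bytes = stable_dumps(first) == stable_dumps(second)

# Roundtrip equality holds for JSON-native values; a tuple comes back as a list.
roundtrip = json.loads(stable_dumps({"deps": ("a", "b")}))
```

Without `sort_keys=True`, the two dicts would serialize to different bytes even though they compare equal as Python objects.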
| 60 | + |
| 61 | +### Error Handling |
| 62 | + |
| 63 | +The functions include robust error handling: |
| 64 | + |
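| 65 | +1. **Large Integers**: orjson rejects integers outside the 64-bit range, so serialization falls back to the standard `json` module, which supports arbitrary-precision integers |
| 66 | +2. **orjson Errors**: Other orjson failures also fall back to the standard `json` module, except a `TypeError` for a non-serializable object, which is re-raised |
| 67 | +3. **Invalid JSON**: Malformed input raises a decode error from the active backend |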
| 68 | + |
| 69 | +## Debug Mode |
| 70 | + |
| 71 | +For debugging purposes, you can enable pretty-printed output: |
| 72 | + |
| 73 | +```python |
| 74 | +# Compact output (default), here with data = {"key": "value", "number": 42} |
| 75 | +compact = json_dumps(data) |
| 76 | +# Output: b'{"key":"value","number":42}' |
| 77 | + |
| 78 | +# Pretty-printed output (two-space indent, keys still sorted) |
| 79 | +pretty = json_dumps(data, debug=True) |
| 80 | +# Output: b'{\n  "key": "value",\n  "number": 42\n}' |
| 81 | +``` |
| 82 | + |
| 83 | +## Benchmarking |
| 84 | + |
| 85 | +Mypy includes a benchmarking utility to measure JSON serialization performance: |
| 86 | + |
| 87 | +```bash |
| 88 | +# Run standard benchmarks |
| 89 | +python -m mypy.json_bench |
| 90 | +``` |
| 91 | + |
| 92 | +This will show: |
| 93 | +- Whether orjson is installed and being used |
| 94 | +- Performance metrics for various data sizes |
| 95 | +- Comparison of serialization vs deserialization speed |
| 96 | +- Serialized data sizes |
| 97 | + |
| 98 | +Example output: |
| 99 | +``` |
| 100 | +============================================================ |
| 101 | +JSON Serialization Performance Benchmark |
| 102 | +============================================================ |
| 103 | +Using orjson: True |
| 104 | +Iterations: 1000 |
| 105 | +Object type: dict |
| 106 | +Serialized size: 20,260 bytes |
| 107 | +------------------------------------------------------------ |
| 108 | +json_dumps avg: 0.0823 ms |
| 109 | +json_loads avg: 0.0456 ms |
| 110 | +Roundtrip avg: 0.1279 ms |
| 111 | +============================================================ |
| 112 | +``` |
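
For a quick measurement outside the bundled utility, a minimal sketch with the standard `timeit` module is shown below. The payload shape, the iteration count, and the local `dumps` helper are all assumptions chosen for illustration; they are not taken from `mypy.json_bench` and do not reflect real cache data.

```python
import json
import timeit

# Synthetic payload; the shape is an assumption, not mypy's real cache data.
payload = {f"key_{i}": {"line": i, "name": f"symbol_{i}"} for i in range(500)}


def dumps(obj: object) -> bytes:
    """Compact, sorted-key serialization using the standard library."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


iterations = 200
encoded = dumps(payload)

# timeit returns total seconds for all iterations; convert to ms per call.
dumps_ms = timeit.timeit(lambda: dumps(payload), number=iterations) / iterations * 1000
loads_ms = timeit.timeit(lambda: json.loads(encoded), number=iterations) / iterations * 1000

print(f"Serialized size: {len(encoded):,} bytes")
print(f"dumps avg: {dumps_ms:.4f} ms")
print(f"loads avg: {loads_ms:.4f} ms")
```

Swapping the body of `dumps` for an orjson call (if installed) gives a like-for-like comparison on the same payload.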
| 113 | + |
| 114 | +## Implementation Details |
| 115 | + |
| 116 | +### Why Sorted Keys Matter |
| 117 | + |
| 118 | +Mypy requires sorted keys for several reasons: |
| 119 | + |
| 120 | +1. **Cache Consistency**: The cache system uses serialized JSON as part of cache keys. Unsorted keys would cause cache misses even when data hasn't changed. |
| 121 | + |
| 122 | +2. **Test Stability**: Many tests (e.g., `testIncrementalInternalScramble`) rely on deterministic output to verify correct behavior. |
| 123 | + |
| 124 | +3. **Diff-Friendly**: When debugging cache issues, having sorted keys makes it easier to compare JSON output. |
| 125 | + |
| 126 | +### Fallback Behavior |
| 127 | + |
| 128 | +The implementation includes multiple fallback layers: |
| 129 | + |
| 130 | +``` |
| 131 | +Try orjson (if available) |
| 132 | + ├─> Success: Return result |
| 133 | + ├─> 64-bit integer overflow: Fall back to standard json |
| 134 | + ├─> Other TypeError: Re-raise (non-serializable object) |
| 135 | + └─> Other errors: Fall back to standard json |
| 136 | + |
| 137 | +Use standard json module |
| 138 | + ├─> Success: Return result |
| 139 | + └─> Error: Propagate exception to caller |
| 140 | +``` |
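
The layered fallback in the diagram can be sketched roughly as follows. This is an illustrative reconstruction, not the code in `mypy.util`: the function name `dumps_with_fallback` is hypothetical, and the sketch runs with or without orjson installed.

```python
import json

try:
    import orjson  # optional accelerator; may not be installed
except ImportError:
    orjson = None


def dumps_with_fallback(obj: object) -> bytes:
    """Hypothetical sketch of the layered fallback described above."""
    if orjson is not None:
        try:
            return orjson.dumps(obj, option=orjson.OPT_SORT_KEYS)
        except TypeError as exc:
            # orjson reports oversized integers with this message; any other
            # TypeError means the object is genuinely non-serializable.
            if str(exc) != "Integer exceeds 64-bit range":
                raise
        except Exception:
            pass  # any other orjson failure: use the standard library instead
    # Standard library path: handles arbitrary-precision integers.
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


# 2**70 does not fit in 64 bits, so this takes the standard json path
# whether or not orjson is installed.
big = dumps_with_fallback({"b": 1, "a": 2**70})
```

Because both paths sort keys and emit compact output, the bytes are the same whichever backend ends up handling the call.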
| 141 | + |
| 142 | +## Testing |
| 143 | + |
| 144 | +Comprehensive tests are available in `mypy/test/test_json_serialization.py`: |
| 145 | + |
| 146 | +```bash |
| 147 | +# Run JSON serialization tests |
| 148 | +python -m unittest mypy.test.test_json_serialization -v |
| 149 | +``` |
| 150 | + |
| 151 | +Tests cover: |
| 152 | +- Basic serialization and deserialization |
| 153 | +- Edge cases (large integers, Unicode, nested structures) |
| 154 | +- Error handling |
| 155 | +- Deterministic output |
| 156 | +- Performance with large objects |
| 157 | + |
| 158 | +## Best Practices |
| 159 | + |
| 160 | +1. **Install orjson wherever mypy runs repeatedly**: It speeds up cache reads and writes in CI/CD and local development |
| 161 | +2. **Use debug mode sparingly**: Only enable when actively debugging |
| 162 | +3. **Monitor cache sizes**: Large serialized objects can impact disk I/O |
| 163 | +4. **Test with both backends**: Ensure your code works with and without orjson |
| 164 | + |
| 165 | +## Troubleshooting |
| 166 | + |
| 167 | +### "Integer exceeds 64-bit range" warnings |
| 168 | + |
| 169 | +If you see this in logs, it means orjson encountered a very large integer and fell back to standard json. This is expected behavior and doesn't indicate a problem. |
| 170 | + |
| 171 | +### Performance not improving after installing orjson |
| 172 | + |
| 173 | +1. Verify orjson is installed: `python -c "import orjson; print(orjson.__version__)"` |
| 174 | +2. Run benchmarks: `python -m mypy.json_bench` |
| 175 | +3. Check that mypy is using the correct Python environment |
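
The environment check in the steps above can also be scripted. The sketch below uses only the standard library; run it with the same interpreter that runs mypy, since orjson installed in a different environment will not be picked up.

```python
import importlib.util
import sys

# find_spec returns None when the module cannot be imported from this environment.
orjson_available = importlib.util.find_spec("orjson") is not None

print(f"Interpreter: {sys.executable}")
print(f"orjson importable: {orjson_available}")
```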
| 176 | + |
| 177 | +### JSON decode errors |
| 178 | + |
| 179 | +If you encounter JSON decode errors: |
| 180 | +1. Check that the input is valid UTF-8 encoded bytes |
| 181 | +2. Verify the JSON structure is valid |
| 182 | +3. Re-serialize the original data with `json_dumps(obj, debug=True)` to inspect the formatted output |
| 183 | + |
| 184 | +## Contributing |
| 185 | + |
| 186 | +When modifying JSON serialization code: |
| 187 | + |
| 188 | +1. Run the test suite: `python -m unittest mypy.test.test_json_serialization` |
| 189 | +2. Run benchmarks to verify performance: `python -m mypy.json_bench` |
| 190 | +3. Test with and without orjson installed |
| 191 | +4. Update this documentation if behavior changes |