Problem
- We recently ran into a couple of issues regarding the scalability of some GETTSIM/TTSIM code (see "Optimize JAX performance in data preparation pipeline" ttsim#34, "Optimize `aggregation_numpy.sum_by_p_id` ..." ttsim#40, "Optimizations: `tt.shared.join`, several `ttsim.interface_dag_elements.fail_if` functions" ttsim#41, and "Optimize `bürgergeld__in_anderer_bg_als_kindergeldempfänger`" #1076).
- The "problematic" code in question worked perfectly fine and passed all tests. However, our tests (reasonably) use only very small datasets, which means that CI currently misses scalability issues (regarding runtime and/or memory usage), both in the existing code and in new PRs.
- For the PRs mentioned above, we used profiling/benchmark scripts (which can be found here) to identify and fix the problematic code. Currently, these are "just" scripts (not part of the GETTSIM/TTSIM code) that users have to run locally.
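To make the idea concrete, here is a minimal sketch of what such a script does (this is not the actual script linked above; `build_data` and `run_pipeline` are hypothetical stand-ins for the data preparation and the code under test): it times the computation at increasing dataset sizes so that super-linear growth in runtime becomes visible.

```python
"""Minimal sketch of a local scalability check (not GETTSIM's actual script)."""

import time

import numpy as np


def build_data(n_rows, rng):
    # Hypothetical stand-in for the synthetic input data used in the benchmarks.
    return {
        "p_id": np.arange(n_rows),
        "income": rng.uniform(0, 5_000, size=n_rows),
    }


def run_pipeline(data):
    # Hypothetical stand-in for the code under test
    # (e.g. an aggregation or a fail_if-style check).
    return np.bincount(data["p_id"], weights=data["income"])


def main():
    rng = np.random.default_rng(0)
    timings = {}
    for n_rows in (10_000, 100_000, 1_000_000, 4_000_000):
        data = build_data(n_rows, rng)
        start = time.perf_counter()
        run_pipeline(data)
        timings[n_rows] = time.perf_counter() - start

    # Runtime should grow roughly linearly with the number of rows; much
    # steeper growth hints at a scalability problem that small-data unit
    # tests will not catch.
    for n_rows, seconds in timings.items():
        print(f"{n_rows:>9,} rows: {seconds:8.4f} s")


if __name__ == "__main__":
    main()
```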
Potential solutions
- I propose to automate or semi-automate the scalability tests as part of CI:
- Automated approach: Run a benchmark script (probably a more polished version of this one) in CI on the PR branch vs. the main branch and create a "regressions" report; a rough sketch of the comparison/report step is shown after this list. (This might be feasible through GitHub's own CI?)
- Semi-automated approach: Run the benchmark "on demand" by posting a keyword in a PR comment. For example, the maintainers of the Julia language have set up a GitHub bot they call "nanosoldier". When called in a PR comment with the keyword @nanosoldier, it runs a large benchmark suite on the most popular packages in the Julia ecosystem, once for the PR and once for the main branch, and compares the results. Here is a recent example and the corresponding report. However, this feature comes at a relatively steep cost: they had to create a dedicated package that implements the functionality. I'm pretty sure something similar exists somewhere in the Python world, but I haven't found anything so far. It also comes at a financial cost, because the benchmarks run on rented servers.
For us, the "on demand" approach seems like overkill: currently, it takes only ~5 minutes to run the full PR-vs-main benchmark (timing both the NumPy and the JAX-CPU backend) on the full GETTSIM DAG with up to 4M rows in the dataset, on my (not very powerful) laptop.
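To illustrate the automated approach, here is a rough sketch of the comparison/report step only (the file names, the JSON format, and the 20 % threshold are assumptions for illustration, not an agreed-upon design): the CI job would run the benchmark once on main and once on the PR branch, write the timings to two JSON files, and then a small script turns them into a Markdown table and flags regressions.

```python
"""Hypothetical sketch of the "regressions report" step for CI.

Assumes the benchmark script has been run twice (once on main, once on the
PR branch) and has written its timings to JSON files of the form
{"<benchmark name>": <seconds>, ...}.
"""

import json
import sys
from pathlib import Path

THRESHOLD = 1.2  # flag benchmarks that got more than 20 % slower


def main(main_file: str, pr_file: str) -> int:
    main_timings = json.loads(Path(main_file).read_text())
    pr_timings = json.loads(Path(pr_file).read_text())

    print("| Benchmark | main [s] | PR [s] | Ratio |")
    print("|---|---|---|---|")
    regressions = []
    for name, main_s in sorted(main_timings.items()):
        pr_s = pr_timings.get(name)
        if pr_s is None:
            continue
        ratio = pr_s / main_s
        flag = " (regression)" if ratio > THRESHOLD else ""
        print(f"| {name} | {main_s:.3f} | {pr_s:.3f} | {ratio:.2f}{flag} |")
        if ratio > THRESHOLD:
            regressions.append(name)

    # A non-zero exit code makes the CI job fail; alternatively, the table
    # could simply be posted as a PR comment.
    return 1 if regressions else 0


if __name__ == "__main__":
    sys.exit(main(sys.argv[1], sys.argv[2]))
```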