Problem
- We recently ran into a couple of issues regarding the scalability of some GETTSIM/TTSIM code (see "Optimize JAX performance in data preparation pipeline" ttsim#34, "Optimize `aggregation_numpy.sum_by_p_id` ..." ttsim#40, "Optimizations: `tt.shared.join`, several `ttsim.interface_dag_elements.fail_if` functions" ttsim#41, and "Optimize `bürgergeld__in_anderer_bg_als_kindergeldempfänger`" #1076).
- The "problematic" code in question worked perfectly fine and passed all tests. However, our tests (reasonably) use only very small datasets, which means that CI currently misses scalability issues (regarding runtime and/or memory usage), both in the existing code and in new PRs.
- For the PRs mentioned above, we used profiling/benchmark scripts (which can be found here) to identify and fix the problematic code. Currently, these are "just" scripts (not part of the GETTSIM/TTSIM code) that users have to run locally.
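To make the idea concrete, here is a minimal sketch of what such a script does (this is not the actual script linked above; `build_data` and `run_pipeline` are hypothetical stand-ins for the data preparation and the code under test): it times the computation at increasing dataset sizes so that super-linear growth in runtime becomes visible.

```python
"""Minimal sketch of a local scalability check (not GETTSIM's actual script)."""

import time

import numpy as np


def build_data(n_rows, rng):
    # Hypothetical stand-in for the synthetic input data used in the benchmarks.
    return {
        "p_id": np.arange(n_rows),
        "income": rng.uniform(0, 5_000, size=n_rows),
    }


def run_pipeline(data):
    # Hypothetical stand-in for the code under test
    # (e.g. an aggregation or a fail_if-style check).
    return np.bincount(data["p_id"], weights=data["income"])


def main():
    rng = np.random.default_rng(0)
    timings = {}
    for n_rows in (10_000, 100_000, 1_000_000, 4_000_000):
        data = build_data(n_rows, rng)
        start = time.perf_counter()
        run_pipeline(data)
        timings[n_rows] = time.perf_counter() - start

    # Runtime should grow roughly linearly with the number of rows; much
    # steeper growth hints at a scalability problem that small-data unit
    # tests will not catch.
    for n_rows, seconds in timings.items():
        print(f"{n_rows:>9,} rows: {seconds:8.4f} s")


if __name__ == "__main__":
    main()
```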
Potential solutions
- I propose to automate or semi-automate the scalability tests as part of CI:
- Automated approach: Run a benchmark script (probably a more polished version of this one) in CI on the PR branch vs. the main branch and create a "regressions" report; a rough sketch of the comparison/report step is shown after this list. (This might be feasible through GitHub's own CI?)
- Semi-automated approach: Run the benchmark "on demand" by posting a keyword in a PR comment. For example, the maintainers of the Julia language have set up a GitHub bot they call "nanosoldier". When called in a PR comment with the keyword @nanosoldier, it runs a large benchmark suite on the most popular packages in the Julia ecosystem, once for the PR and once for the main branch, and compares the results. Here is a recent example and the corresponding report. However, this feature comes at a relatively steep cost: they had to create a dedicated package that implements the functionality. I'm pretty sure something similar exists somewhere in the Python world, but I haven't found anything so far. It also comes at a financial cost, because the benchmarks run on rented servers.
For us, the "on demand" approach seems like overkill: currently, it takes only ~5 minutes to run the full PR-vs-main benchmark (timing both the NumPy and the JAX-CPU backend) on the full GETTSIM DAG with up to 4M rows in the dataset, on my (not very powerful) laptop.
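To illustrate the automated approach, here is a rough sketch of the comparison/report step only (the file names, the JSON format, and the 20 % threshold are assumptions for illustration, not an agreed-upon design): the CI job would run the benchmark once on main and once on the PR branch, write the timings to two JSON files, and then a small script turns them into a Markdown table and flags regressions.

```python
"""Hypothetical sketch of the "regressions report" step for CI.

Assumes the benchmark script has been run twice (once on main, once on the
PR branch) and has written its timings to JSON files of the form
{"<benchmark name>": <seconds>, ...}.
"""

import json
import sys
from pathlib import Path

THRESHOLD = 1.2  # flag benchmarks that got more than 20 % slower


def main(main_file: str, pr_file: str) -> int:
    main_timings = json.loads(Path(main_file).read_text())
    pr_timings = json.loads(Path(pr_file).read_text())

    print("| Benchmark | main [s] | PR [s] | Ratio |")
    print("|---|---|---|---|")
    regressions = []
    for name, main_s in sorted(main_timings.items()):
        pr_s = pr_timings.get(name)
        if pr_s is None:
            continue
        ratio = pr_s / main_s
        flag = " (regression)" if ratio > THRESHOLD else ""
        print(f"| {name} | {main_s:.3f} | {pr_s:.3f} | {ratio:.2f}{flag} |")
        if ratio > THRESHOLD:
            regressions.append(name)

    # A non-zero exit code makes the CI job fail; alternatively, the table
    # could simply be posted as a PR comment.
    return 1 if regressions else 0


if __name__ == "__main__":
    sys.exit(main(sys.argv[1], sys.argv[2]))
```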