Consider what programs to run during PGO #130
Replies: 14 comments
- Yeah, one thing that has always bugged me about using tests as our profiling task is that they tend to skew disproportionately towards edge cases and error conditions (I also suspect they do less looping and branching than typical workloads). It's really easy to use the existing tests, but I too expect that we may be able to benefit from revisiting this.
- We (the Pyston team) tried experimenting with this and ran into the issue that it's not currently possible to install packages at PGO-task time. I don't think it's theoretically impossible, but it did limit the tasks we could run.
- The more I learn about how modern CPUs work (e.g. speculation), the more I realize how important it is to get all the branch prediction right. Agreed we need to revisit this.
- Just curious: what's the issue with this? It seems like the instrumented Python should still be capable of building a virtual environment and installing packages into it. Sure, that work would ultimately "count towards" the profile data, but it's probably still a more realistic workload than the tests we're running now. And it would probably be tiny in comparison to the actual profile task.
- Other people here know this much better than I do, but my understanding is that the build directory structure is a fair bit different from the installed directory structure. There's some code that tries to figure out which situation we're in, but installing packages is pretty involved and doesn't work in this half-set-up state. I don't remember the exact issue, but it could be something like there being no lib or site-packages directory. That said, I did just have luck creating a virtualenv using the being-built binary, so maybe we didn't try that before, or I updated my virtualenv, or something.
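For concreteness, here is a minimal sketch of the kind of PGO-time task being discussed, run under whichever interpreter executes it (ideally the being-built binary). The temporary location and the package name are illustrative assumptions, not what the build actually does:

```python
# Hypothetical PGO-task snippet: create a venv with the running interpreter
# and pip-install a package so that this work counts toward the profile.
# The package name is arbitrary; network access (or a local wheel cache)
# is assumed.
import subprocess
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    env_dir = Path(tmp) / "pgo-venv"
    subprocess.run([sys.executable, "-m", "venv", str(env_dir)], check=True)
    pip = env_dir / "bin" / "pip"  # "Scripts\\pip.exe" on Windows
    subprocess.run([str(pip), "install", "requests"], check=True)
```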
- We'd have to be really careful with this, but it may be interesting to try running the full benchmark suite as the profile task. It would give us an idea of what the actual upper bound for this sort of improvement could be in practice. If we only see a 5% improvement, for example, it's probably not worth revisiting our profile task. If the numbers double, though, it's not unrealistic to expect that reworking our profiling could yield 10%+ improvements for real-world workloads.
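A rough sketch of what such an experiment could look like, assuming pyperformance is installed for, and runnable as a module by, the instrumented interpreter; the benchmark subset and output filename are placeholders rather than a recommendation:

```python
# Hypothetical driver for using benchmarks as the profiling workload.
# Assumes `python -m pyperformance` works for the interpreter running this
# script; the selected benchmarks are placeholders.
import subprocess
import sys

benchmarks = "nbody,json_dumps,regex_compile"  # placeholder subset

subprocess.run(
    [sys.executable, "-m", "pyperformance", "run",
     "-b", benchmarks, "-o", "pgo-training-run.json"],
    check=True,
)
```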
- I wouldn't sneeze at 5%. Surely we can run the benchmarks with fewer iterations in a shorter time?
- Using (BTW, setting up a virtual environment and installing packages in the
- Ideally:

  By "not-too-fast" I'm talking about how long the benchmark takes to run. Pretty much every one of the current pyperformance benchmarks runs in a fraction of a second. In contrast, we want the real-world benchmarks to represent actual workload behavior as closely as possible, including duration (within reason). So some quick scripts will run in under a second, but some apps may run for several minutes. Thus this new suite wouldn't be suitable for the quick benchmark runs that we do throughout the day. Instead we'd probably run it once a day (which happens to match the frequency of uploads to speed.python.org 🙂). It just so happens that it would also be great for training PGO.
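As an illustration of the kind of longer-running, application-shaped benchmark being proposed, here is a sketch using pyperf's Runner API; the workload itself is invented for the example and carries no claim about what the suite should contain:

```python
# Sketch of a "not-too-fast" benchmark: the measured function does a
# meaningful amount of work per call instead of finishing in microseconds.
import json

import pyperf


def report_workload():
    records = [
        {"id": i, "name": f"user-{i}", "scores": [i % 7, i % 11, i % 13]}
        for i in range(50_000)
    ]
    decoded = json.loads(json.dumps(records))
    # Aggregate the result so the benchmark does end-to-end work.
    return sum(max(r["scores"]) for r in decoded)


runner = pyperf.Runner()
runner.bench_func("report_generation", report_workload)
```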
- Do we really want to use a "slow" run to train PGO? Building a PGO binary is slow enough as it is. I agree that for speed.python.org we could use something better, but do remember that the runs there already take ~1 hour (due to repetitions, setup, etc.) and I'd rather use the extra capacity for multiple runs per day so we have less guesswork about which commit contributed to a sudden jump.
- I'm still skeptical of using benchmarks to train PGO. I think it would be a good idea (but also a lot of work) to have different workloads for these steps to avoid "gaming" the benchmark suite. I'm envisioning two different suites: one "in-sample" and one "out-of-sample". We'd use the "in-sample" one for profiling and data-gathering, and the "out-of-sample" one for benchmarking. Otherwise, I fear we might end up with a Python executable that's really good at running the benchmark suite and not much else.
- I think in theory I agree with you. I've been warned about benchmark-chasing by folks who created a really fast JS engine that nobody's ever heard of. In the earlier stages of their project the benchmarks were useful, but over time they had to switch to more realistic workloads. But to some extent it depends on what PGO really does. I imagine it gathers info about which branches are likely vs. unlikely, and I've got a feeling that most branches have a very strong bias one way or another (most of the time they check for an error condition). The exception would be branches that check for special cases, e.g. for int or str -- the frequency of the special case in the benchmark could definitely be different than in other real-world code. There are of course other things that PGO (or LTO) does that might or might not affect benchmarks more than other code (e.g. code rearranging). Maybe we should just continue to collect or create new benchmarks (as we're already doing with Eric's project of adding the Pyston benchmarks). Alas, we don't have enough realistic benchmarks to easily reserve half of them for in-sample. :-(
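To make the point about value-dependent branches concrete, here is a small Python-level analogue. The branches PGO actually sees live in the C interpreter; this toy counter only illustrates how the taken/not-taken ratio of a special-case check follows the workload it is trained on:

```python
# Toy illustration: the same special-case branch is heavily biased one way
# under a numeric workload and the other way under a string workload.
from collections import Counter

taken = Counter()


def add(a, b):
    # Analogue of an interpreter fast path that special-cases int operands.
    if type(a) is int and type(b) is int:
        taken["int fast path"] += 1
        return a + b
    taken["generic path"] += 1
    return a + b


for i in range(1000):             # a numeric benchmark trains it one way...
    add(i, i + 1)
for s in ["spam", "eggs"] * 500:  # ...a string-heavy workload, the other.
    add(s, s)

print(taken)
```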
- We already use the "benchmarks" to test each improvement, so we already are gaming things TBH. The Pyston benchmarks would be a good starting point. We also need some numerical and ML benchmarks.
- Benchmarks that spend significant time in C libraries, like numerical and ML code, are unlikely to be affected by our work, though.
- In the discussions of when PGO was added to the `Makefile`, there were some questions of whether the set of unit tests we run is representative of a real load. I wonder if anyone has tried using PGO with more "realistic" programs or a different set of tests, and what impact that would have on performance.
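For anyone who wants to experiment along these lines: the profiling step referred to here is driven by the Makefile's PROFILE_TASK variable, the command the build runs against the instrumented interpreter. Below is a hedged sketch of a standalone "more realistic" task one might point it at; the specific workloads are assumptions chosen only to exercise common stdlib paths, not a vetted profile:

```python
# Hypothetical custom profile task: a standalone script that could stand in
# for the unit tests during profile generation. The workload mix below is
# invented for illustration.
import json
import pickle
import re


def json_workload():
    data = [{"id": i, "name": f"item-{i}", "tags": ["a", "b", "c"]} for i in range(1000)]
    for _ in range(50):
        json.loads(json.dumps(data))


def regex_workload():
    pattern = re.compile(r"(\w+)@(\w+)\.(\w+)")
    text = " ".join(f"user{i}@example.com" for i in range(1000))
    for _ in range(50):
        pattern.findall(text)


def pickle_workload():
    data = {i: list(range(i % 50)) for i in range(1000)}
    for _ in range(50):
        pickle.loads(pickle.dumps(data))


if __name__ == "__main__":
    json_workload()
    regex_workload()
    pickle_workload()
```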