Consider what programs to run during PGO #130
Replies: 14 comments
- Yeah, one thing that has always bugged me about using tests as our profiling task is that they tend to skew disproportionately towards edge cases and error conditions (I also suspect they do less looping and branching than typical workloads). It's really easy to use the existing tests, but I too expect that we may be able to benefit from revisiting this.
- We (the Pyston team) tried experimenting with this and ran into the issue that it's not currently possible to install packages at PGO-task time. I don't think it's theoretically impossible, but it did limit the tasks we could run.
- The more I learn about how modern CPUs work (e.g. speculation), the more I realize how important it is to get all the branch prediction right. Agreed we need to revisit this.
- Just curious: what's the issue with this? It seems like the instrumented Python should still be capable of building a virtual environment and installing packages into it. Sure, that work would ultimately "count towards" the profile data, but it's probably still a more realistic workload than the tests we're running now. And it would probably be tiny in comparison to the actual profile task.
- Other people here know this much better than I do, but my understanding is that the build directory structure is a fair bit different from the installed directory structure. There's some code that tries to figure out which situation we're in, but installing packages is pretty involved and doesn't work in this half-set-up state. I don't remember the exact issue, but it could be something like there being no lib or site-packages directory. That said, I did just have luck creating a virtualenv using the being-built binary, so maybe we didn't try that before, or I updated my virtualenv, or something.
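For concreteness, here is a minimal sketch of the kind of PGO-time task being discussed, run under whichever interpreter executes it (ideally the being-built binary). The temporary location and the package name are illustrative assumptions, not what the build actually does:

```python
# Hypothetical PGO-task snippet: create a venv with the running interpreter
# and pip-install a package so that this work counts toward the profile.
# The package name is arbitrary; network access (or a local wheel cache)
# is assumed.
import subprocess
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    env_dir = Path(tmp) / "pgo-venv"
    subprocess.run([sys.executable, "-m", "venv", str(env_dir)], check=True)
    pip = env_dir / "bin" / "pip"  # "Scripts\\pip.exe" on Windows
    subprocess.run([str(pip), "install", "requests"], check=True)
```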
- We'd have to be really careful with this, but it may be interesting to try running the full benchmark suite as the profile task. It would give us an idea of what the actual upper bound for this sort of improvement could be in practice. If we only see a 5% improvement, for example, it's probably not worth revisiting our profile task. If the numbers double, though, it's not unrealistic to expect that reworking our profiling could yield 10%+ improvements for real-world workloads.
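A rough sketch of what such an experiment could look like, assuming pyperformance is installed for, and runnable as a module by, the instrumented interpreter; the benchmark subset and output filename are placeholders rather than a recommendation:

```python
# Hypothetical driver for using benchmarks as the profiling workload.
# Assumes `python -m pyperformance` works for the interpreter running this
# script; the selected benchmarks are placeholders.
import subprocess
import sys

benchmarks = "nbody,json_dumps,regex_compile"  # placeholder subset

subprocess.run(
    [sys.executable, "-m", "pyperformance", "run",
     "-b", benchmarks, "-o", "pgo-training-run.json"],
    check=True,
)
```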
- I wouldn't sneeze at 5%. Surely we can run the benchmarks with fewer iterations in a shorter time?
- Using (BTW, setting up a virtual environment and installing packages in the
- Ideally:

  By "not-too-fast" I'm talking about how long the benchmark takes to run. Pretty much every one of the current pyperformance benchmarks runs in a fraction of a second. In contrast, we want the real-world benchmarks to represent actual workload behavior as closely as possible, including duration (within reason). So some quick scripts will run in under a second, but some apps may run for several minutes. Thus this new suite wouldn't be suitable for the quick benchmark runs that we do throughout the day. Instead we'd probably run it once a day (which happens to match the frequency of uploads to speed.python.org 🙂). It just so happens that it would also be great for training PGO.
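As an illustration of the kind of longer-running, application-shaped benchmark being proposed, here is a sketch using pyperf's Runner API; the workload itself is invented for the example and carries no claim about what the suite should contain:

```python
# Sketch of a "not-too-fast" benchmark: the measured function does a
# meaningful amount of work per call instead of finishing in microseconds.
import json

import pyperf


def report_workload():
    records = [
        {"id": i, "name": f"user-{i}", "scores": [i % 7, i % 11, i % 13]}
        for i in range(50_000)
    ]
    decoded = json.loads(json.dumps(records))
    # Aggregate the result so the benchmark does end-to-end work.
    return sum(max(r["scores"]) for r in decoded)


runner = pyperf.Runner()
runner.bench_func("report_generation", report_workload)
```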
- Do we really want to use a "slow" run to train PGO? Building a PGO binary is slow enough as it is. I agree that for speed.python.org we could use something better, but do remember that the runs there already take ~1 hour (due to repetitions, setup, etc.) and I'd rather use the extra capacity for multiple runs per day so we have less guesswork about which commit contributed to a sudden jump.
- I'm still skeptical of using benchmarks to train PGO. I think it would be a good idea (but also a lot of work) to have different workloads for these steps to avoid "gaming" the benchmark suite. I'm envisioning two different suites: one "in-sample" and one "out-of-sample". We'd use the "in-sample" one for profiling and data-gathering, and the "out-of-sample" one for benchmarking. Otherwise, I fear we might end up with a Python executable that's really good at running the benchmark suite and not much else.
- I think in theory I agree with you. I've been warned about benchmark-chasing by folks who created a really fast JS engine that nobody's ever heard of. In the earlier stages of their project the benchmarks were useful, but over time they had to switch to more realistic workloads. But to some extent it depends on what PGO really does. I imagine it gathers info about which branches are likely vs. unlikely, and I've got a feeling that most branches have a very strong bias one way or another (most of the time they check for an error condition). The exception would be branches that check for special cases, e.g. for int or str -- the frequency of the special case in the benchmark could definitely be different than in other real-world code. There are of course other things that PGO (or LTO) does that might or might not affect benchmarks more than other code (e.g. code rearranging). Maybe we should just continue to collect or create new benchmarks (as we're already doing with Eric's project of adding the Pyston benchmarks). Alas, we don't have enough realistic benchmarks to easily reserve half of them for in-sample. :-(
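To make the point about value-dependent branches concrete, here is a small Python-level analogue. The branches PGO actually sees live in the C interpreter; this toy counter only illustrates how the taken/not-taken ratio of a special-case check follows the workload it is trained on:

```python
# Toy illustration: the same special-case branch is heavily biased one way
# under a numeric workload and the other way under a string workload.
from collections import Counter

taken = Counter()


def add(a, b):
    # Analogue of an interpreter fast path that special-cases int operands.
    if type(a) is int and type(b) is int:
        taken["int fast path"] += 1
        return a + b
    taken["generic path"] += 1
    return a + b


for i in range(1000):             # a numeric benchmark trains it one way...
    add(i, i + 1)
for s in ["spam", "eggs"] * 500:  # ...a string-heavy workload, the other.
    add(s, s)

print(taken)
```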
- We already use the "benchmarks" to test each improvement, so we already are gaming things TBH. The Pyston benchmarks would be a good starting point. We also need some numerical and ML benchmarks.
- Benchmarks that spend significant time in C libraries, like numerical and ML code, are unlikely to be affected by our work, though.
- In the discussions of when PGO was added to the `Makefile`, there were some questions of whether the set of unit tests we run is representative of a real load. I wonder if anyone has tried using PGO with more "realistic" programs or a different set of tests, and what impact that would have on performance.
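For anyone who wants to experiment along these lines: the profiling step referred to here is driven by the Makefile's PROFILE_TASK variable, the command the build runs against the instrumented interpreter. Below is a hedged sketch of a standalone "more realistic" task one might point it at; the specific workloads are assumptions chosen only to exercise common stdlib paths, not a vetted profile:

```python
# Hypothetical custom profile task: a standalone script that could stand in
# for the unit tests during profile generation. The workload mix below is
# invented for illustration.
import json
import pickle
import re


def json_workload():
    data = [{"id": i, "name": f"item-{i}", "tags": ["a", "b", "c"]} for i in range(1000)]
    for _ in range(50):
        json.loads(json.dumps(data))


def regex_workload():
    pattern = re.compile(r"(\w+)@(\w+)\.(\w+)")
    text = " ".join(f"user{i}@example.com" for i in range(1000))
    for _ in range(50):
        pattern.findall(text)


def pickle_workload():
    data = {i: list(range(i % 50)) for i in range(1000)}
    for _ in range(50):
        pickle.loads(pickle.dumps(data))


if __name__ == "__main__":
    json_workload()
    regex_workload()
    pickle_workload()
```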