Description
I'm thinking of a new (to the best of my knowledge) form of tracing that I call lazy basic block stitching. It is a variant of lazy basic block versioning. The idea is to use lazy basic blocks as a "trace recording/projection hybrid", compiled with a baseline compiler that applies almost no optimizations. These blocks record inline cache information (as JavaScript engines do) and branch bias information. After that, we stitch all the traces together and perform aggressive optimizations on the stitched result.
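To make the two phases concrete, here is a minimal toy sketch of the pipeline. Everything in it (`BasicBlock`, `record`, `stitch`, the `hot` threshold) is a hypothetical illustration of the idea, not CPython's actual API:

```python
# Toy sketch of the two-phase pipeline. All names (BasicBlock, record,
# stitch, the `hot` threshold) are hypothetical, not CPython APIs.

class BasicBlock:
    def __init__(self, offset):
        self.offset = offset        # bytecode offset the block starts at
        self.exec_count = 0         # execution counter (hotness)
        self.branch_bias = {}       # successor offset -> times taken
        self.ic = {}                # inline-cache site -> observed values

def record(block, successor, ic_site=None, ic_value=None):
    # Phase 1: counters baked into baseline-compiled blocks, so
    # "recording" costs almost nothing beyond the memory it uses.
    block.exec_count += 1
    block.branch_bias[successor] = block.branch_bias.get(successor, 0) + 1
    if ic_site is not None:
        block.ic.setdefault(ic_site, set()).add(ic_value)

def stitch(blocks, entry, hot=1000):
    # Phase 2: walk every recorded successor edge of hot blocks to
    # collect a whole trace tree, then optimize that region at once.
    region, seen, worklist = [], set(), [entry]
    while worklist:
        offset = worklist.pop()
        block = blocks.get(offset)
        if block is None or offset in seen or block.exec_count < hot:
            continue
        seen.add(offset)
        region.append(block)
        worklist.extend(block.branch_bias)  # both branch arms stay in the tree
    return region  # hand the stitched region to the heavy optimizer
```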
The benefits:
- Near-zero execution overhead (other than memory) for trace projection/recording.
- Low warmup.
- All the benefits of traces.
- Fewer trace drawbacks: because we stitch the trace tree, the side exits are presumably already formed and don't need expensive deopts back to the interpreter.
Here's my proposal (steps not in any particular order):
1. Lower the side-exit threshold for the current JIT, to allow it to fully form the trace tree easily.
2. Stop projecting traces at branches, to allow a trace tree to form naturally. This is similar to lazy basic block versioning (see the sketch after the next list).
3. Remove the heavy optimizations from the current tier 2, such as the type specializer. This is required because steps 1 and 2 cause many more traces to be created.
4. Stitch the traces together.
5. Only run the heavy optimizations on what we get from the stitched tree.
This would require tracing from:
- RESUME
- Backward jumps
- All branches
- RETURN_VALUE
- YIELD_VALUE
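As a rough illustration of what "stop projecting at branches" would mean, here is a hypothetical projector sketch. The opcode names are real CPython opcodes, but the projector function itself is invented for illustration:

```python
# Hypothetical projector sketch. Opcode names are real CPython opcodes;
# the projector logic itself is invented for illustration.

# Instructions that end a block: every branch, backward jump, return,
# and yield, so each successor lazily becomes its own block.
BLOCK_ENDS = {
    "JUMP_BACKWARD", "RETURN_VALUE", "YIELD_VALUE",
    "POP_JUMP_IF_TRUE", "POP_JUMP_IF_FALSE",
    "POP_JUMP_IF_NONE", "POP_JUMP_IF_NOT_NONE",
}

def project_block(opnames, start):
    """Project one straight-line block starting at `start`.

    Unlike the current projector, which follows one side of a branch,
    this stops at every branch; both arms later get their own blocks,
    so the tree forms naturally (as in lazy basic block versioning).
    """
    block = []
    for op in opnames[start:]:
        block.append(op)
        if op in BLOCK_ENDS:
            break
    return block
```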
Background and what this addresses
This proposal draws on my own experience implementing optimizations for CPython's traces. To date, I have implemented the type specializer (merged upstream), a function inliner (not merged, experimental), and a tagged integer optimizer (not merged, experimental).
In my experience, the function inliner and tagged integer optimizer show massive speedups in microbenchmarks, but massive slowdowns in real-world benchmarks. Why? This is due to the current trace design:
1. Deopting out of a trace requires reconstruction of the world (materializing all inlined frames and boxing all tagged integers); see the sketch after this list.
2. Side exits also require reconstruction of the world.
3. Points 1 and 2 mean a single branch or piece of uncommon control flow wrecks the optimizations we do. Branches are more common than we initially thought.
4. Trace-to-trace jumps are long jumps and still very costly.
5. The problems of nested for loops and recursion are solved in the research literature by trace trees, but trace trees still cost a lot in our current design due to point 4.
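To show what "reconstruction of the world" means in points 1 and 2, here is an illustrative toy. `TaggedInt` and the frame dicts are invented stand-ins for CPython's internal deopt machinery, not the real code:

```python
# Illustrative toy of "reconstructing the world" on a deopt/side exit.
# TaggedInt and the frame dicts are invented stand-ins for CPython's
# internal state, not the real machinery.

class TaggedInt:
    """Unboxed machine integer used only inside the optimized trace."""
    def __init__(self, raw):
        self.raw = raw

def materialize_frame(inlined):
    # The inliner erased this frame; on exit we must allocate a real
    # interpreter frame and copy the locals back into it.
    return {"locals": dict(inlined["locals"]), "lasti": inlined["lasti"]}

def reconstruct_world(inlined_frames, stack):
    # This runs on *every* deopt and side exit, even when the exit
    # immediately lands on another hot trace, which is why frequent
    # exits erase the gains from inlining and tagging.
    frames = [materialize_frame(f) for f in inlined_frames]
    boxed = [v.raw if isinstance(v, TaggedInt) else v for v in stack]
    return frames, boxed
```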
The cost of an optimization is only worth it if the common case is more frequent than the deopt/uncommon case. Unfortunately, in a tracing JIT, side exits are frequent. There are ways to solve most of these problems, as seen in PyPy and LuaJIT 2: inter-trace optimizations and keeping optimization state per trace. However, the end result is just deferring the complexity to the runtime; we end up implementing all sorts of complex machinery to keep our traces performant. The final control-flow graph looks like a trace tree anyway, so why not start with that?
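A back-of-the-envelope expected-cost model makes the trade-off concrete. All of the numbers below are made up for illustration, not measurements:

```python
# Back-of-the-envelope cost model. All numbers are invented for
# illustration, not measurements.
FAST_PATH = 1.0    # per-iteration cost on the optimized (inlined/tagged) path
DEOPT     = 50.0   # cost of reconstructing the world on a side exit
BASELINE  = 2.0    # per-iteration cost of the unoptimized trace

def expected_cost(p_exit):
    return (1 - p_exit) * FAST_PATH + p_exit * DEOPT

print(expected_cost(0.01))  # 1.49 -> beats the 2.0 baseline
print(expected_cost(0.10))  # 5.90 -> a 10% side-exit rate already loses
```

If stitched side exits jump straight to an already-formed sibling block, the reconstruction term shrinks toward the cost of an ordinary jump, which is the whole point of the proposal.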