
Proposal: Lazy Basic Block Stitching #723


Open
Fidget-Spinner opened this issue Mar 19, 2025 · 2 comments

Comments

@Fidget-Spinner
Collaborator

Fidget-Spinner commented Mar 19, 2025

I'm thinking of a new (to the best of my knowledge) form of tracing that I call lazy basic block stitching. It is a variant of lazy basic block versioning. The idea is to use lazy basic blocks as a "trace recording/projection hybrid", compiled with a baseline compiler that performs almost no optimizations at all. This tier records inline cache information (as JavaScript engines do) and branch bias information. After that, we stitch all the traces together and perform aggressive optimizations on the result.
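A minimal sketch of what the baseline recording tier could look like (all names here are hypothetical, not the actual CPython API): blocks are materialized lazily on first execution, and the only work added to execution is bumping branch counters and filling inline caches.

```python
from collections import Counter

class BasicBlock:
    """One lazily compiled basic block in a hypothetical baseline tier."""

    def __init__(self, block_id):
        self.block_id = block_id
        self.compiled = False           # baseline-compiled on first entry
        self.branch_counts = Counter()  # successor id -> times taken
        self.inline_cache = {}          # call site -> observed callee/type

    def ensure_compiled(self):
        # Baseline compilation: no optimization, just executable code
        # with recording hooks attached.
        if not self.compiled:
            self.compiled = True

blocks = {}

def enter_block(block_id, taken_successor):
    """Run a block in the baseline tier, recording which branch it took."""
    block = blocks.setdefault(block_id, BasicBlock(block_id))
    block.ensure_compiled()
    block.branch_counts[taken_successor] += 1
    return block
```

The point is that recording is just a counter increment per branch, which is where the near-zero overhead claim below comes from.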

The benefits:

  • Near-zero execution overhead for trace projection/recording (other than memory).
  • Low warmup.
  • All the benefits of traces.
  • Fewer trace drawbacks: because we stitch the trace tree, the side exits are presumably already formed and don't need expensive deopts to the interpreter.

Here's my proposal (not in order).

  1. Lower the side-exit threshold for the current JIT, to allow it to fully form the trace tree easily.
  2. Stop projecting traces at branches, to allow a trace tree to form naturally. This is similar to lazy basic block versioning.
  3. Remove the heavy optimizations from the current tier 2, such as the type specializer. This is required because steps 1 and 2 cause many more traces to be created.
  4. Stitch the traces together.
  5. Run the heavy optimizations only on what we get from the stitched tree.

This would require tracing from:

  1. RESUME
  2. Backward jumps
  3. All branches
  4. RETURN_VALUE
  5. YIELD_VALUE
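A toy sketch of the stitching step (threshold and data layout are assumptions, not the proposal's actual design): follow each block's recorded branches that are hot enough, and leave cold successors behind as side exits into the baseline tier.

```python
from collections import Counter

HOT_THRESHOLD = 10  # hypothetical hotness cutoff

# Profiling data as the baseline tier might record it:
# block id -> how often each successor branch was taken.
profile = {
    "loop_head": Counter({"body": 95, "exit": 5}),  # heavily biased branch
    "body": Counter({"loop_head": 95}),             # back edge
    "exit": Counter(),
}

def stitch(profile, block_id, visited=None):
    """Stitch recorded blocks into a trace tree (nested dicts),
    following only branches taken at least HOT_THRESHOLD times.
    Cold successors are left as side exits to the baseline tier."""
    visited = visited or set()
    if block_id in visited:
        # Revisiting a block closes a loop in the stitched tree.
        return {"block": block_id, "back_edge": True}
    children = {
        succ: stitch(profile, succ, visited | {block_id})
        for succ, taken in profile[block_id].items()
        if taken >= HOT_THRESHOLD
    }
    return {"block": block_id, "children": children}

tree = stitch(profile, "loop_head")
```

The heavy optimizer would then see the whole stitched tree at once, including the loop back edge, instead of one linear trace at a time.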

Background and what this addresses

These ideas draw from my own experience implementing optimizations for CPython's traces. To date, I have implemented the type specializer (merged upstream), a function inliner (not merged, experimental), and a tagged integer optimizer (not merged, experimental).

The function inliner and tagged integer optimizer show massive speedups in microbenchmarks, but massive slowdowns in real-world benchmarks. Why? This is due to the current trace design:

  1. Deopting out of a trace requires reconstructing the world (materializing all inlined frames, and boxing all tagged ints).
  2. A side exit likewise requires reconstructing the world.
  3. Because of 1 and 2, a single branch or any uncommon control flow wrecks the optimizations we do. Branches are more common than we initially thought.
  4. Trace-to-trace jumps are long jumps and still very costly.
  5. The problem of nested for loops and recursion is solved in research by trace trees, but trace trees still cost a lot in our current design due to 4.

The cost of an optimization is only worth it if the common case is more frequent than the deopt/uncommon case. Unfortunately, in a tracing JIT, side exits are frequent. There are ways to solve most of these problems, as seen in PyPy and LuaJIT 2: inter-trace optimizations, and keeping optimization state per trace. However, that just defers the complexity to the runtime; we end up implementing all sorts of complex machinery to keep our traces performant. The final control-flow graph looks like a trace tree anyway, so why not start with that?
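The cost argument above can be put as back-of-envelope arithmetic (all numbers here are hypothetical): an optimization pays off only when the fast-path saving outweighs the expected cost of leaving the trace, so making exits cheap changes which optimizations are worth doing at all.

```python
# Hypothetical cost model for a trace optimization with side exits.

def optimization_pays_off(exit_rate, fast_path_saving, exit_cost):
    """exit_rate: fraction of executions that leave the trace.
    fast_path_saving: cycles saved per execution that stays on-trace.
    exit_cost: cycles spent reconstructing the world on exit."""
    expected_gain = (1 - exit_rate) * fast_path_saving
    expected_loss = exit_rate * exit_cost
    return expected_gain > expected_loss

# With cheap exits (stitched tree), a 10% exit rate is tolerable:
print(optimization_pays_off(0.10, fast_path_saving=5, exit_cost=20))   # True
# With expensive deopts back to the interpreter, it is not:
print(optimization_pays_off(0.10, fast_path_saving=5, exit_cost=200))  # False
```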

@Fidget-Spinner
Collaborator Author

Fidget-Spinner commented Mar 21, 2025

To deal with generators, and all returns in general:

Trace starting from RETURN_VALUE, beginning at the location the frame is returning to (which can be obtained at runtime). Insert a guard, call it _GUARD_RETURNING_IP, on the returning frame's instruction pointer: EXIT_IF(frame->previous->instr_ptr + frame->previous->return_offset != (_Py_CODEUNIT *)instr_ptr);. This will allow our polymorphism handling to automatically handle generators, and trace stitching to produce efficient generator code.
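A small Python model of what that guard checks (the Frame class and values here are illustrative stand-ins, not CPython's actual frame layout): a trace recorded from a RETURN_VALUE is only valid for the return target it was recorded at, so the stitched code side-exits when control would return somewhere else.

```python
class Frame:
    """Toy stand-in for an interpreter frame with a link to its caller."""
    def __init__(self, previous, instr_ptr, return_offset=0):
        self.previous = previous
        self.instr_ptr = instr_ptr
        self.return_offset = return_offset

def guard_returning_ip(frame, expected_ip):
    """Model of _GUARD_RETURNING_IP: pass only if control returns to the
    address this trace was stitched for, mirroring
    EXIT_IF(frame->previous->instr_ptr + frame->previous->return_offset
            != (_Py_CODEUNIT *)instr_ptr)."""
    caller = frame.previous
    return caller.instr_ptr + caller.return_offset == expected_ip

caller = Frame(previous=None, instr_ptr=100, return_offset=2)
callee = Frame(previous=caller, instr_ptr=0)
print(guard_returning_ip(callee, 102))  # True: the recorded return target
print(guard_returning_ip(callee, 108))  # False: a different (polymorphic) return site
```

A failed guard is just another side exit, so a polymorphic return site naturally grows one stitched continuation per observed target.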

@brandtbucher
Member

Yep, this will handle all forms of underflow. The missing part for tracing generators is tracking where we're going in FOR_ITER and SEND when specializing generators. I have a branch where that works, but I believe it was a net perf loss, I think because it's not helpful unless we can also handle "polymorphic underflow" and recursion too.

The stress test for any of this is bm_generators, which does this (seriously, try to work out the traces on a whiteboard, I dare you 🙂):

from __future__ import annotations

from typing import Iterator

class Tree:
    def __init__(self, left: Tree | None, value: int, right: Tree | None) -> None:
        self.left = left
        self.value = value
        self.right = right

    def __iter__(self) -> Iterator[int]:
        if self.left:
            yield from self.left
        yield self.value
        if self.right:
            yield from self.right

# massive_tree is a large Tree built elsewhere in the benchmark:
for item in massive_tree:
    pass

The trace tree for __iter__ is absolutely wild (note that it's recursive, and each yield from is really a loop, and don't forget about the branches and the third yield in there either), but it's a good, "simple" test for anything we hope to do concerning generators, recursion, and underflow (our three weakest points right now).

(This is actually a great example of code that is incredibly complex for a tracing JIT to handle, but dead-simple for a method JIT to handle, and tracing doesn't actually allow us to optimize much more than a method JIT would.)
