Improve dot lift rewrites #1471

Draft · wants to merge 13 commits into main
Conversation

ricardoV94 (Member) commented Jun 13, 2025

This PR was motivated by the partial jacobian computation example in JAX discussed in jax-ml/jax#5904 (comment)

After #1228 it's actually easier to do this sort of optimization in PyTensor, since there's no Scan to worry about. We already have a bunch of rewrites to lift subtensor operations through elemwise and dot, but we did not have any to lift them through Blockwise (and Blockwise dot, a.k.a. matmul). This PR addresses that.
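For context, a minimal sketch of the pattern this targets (the function and sizes here are made up for illustration, and `vectorize=True` assumes the vectorized `jacobian` option from #1228):

```python
import pytensor
import pytensor.tensor as pt

x = pt.vector("x", shape=(1000,))
y = pt.sqrt((x ** 2).sum()) * x  # any smooth vector -> vector function

# The full Jacobian is built symbolically, but only a 5x5 block is requested.
# With the subtensor lift rewrites, the slice can be pushed through the
# (Blockwise) dots, so the full 1000x1000 Jacobian is never materialized.
jac = pytensor.gradient.jacobian(y, x, vectorize=True)
f = pytensor.function([x], jac[:5, :5])
```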

Some notes on each commit:

  1. Do constant_folding in Python mode. This is not strictly related to this PR, but I noticed a test was taking 10x longer than the others just because a simple constant-folding operation was triggered in the rewrites and the whole C cache was being loaded. That incurs a one-time penalty that's pretty large. For users not interested in the C backend at all, there's no reason to involve that machinery; a single Python evaluation should be fast enough anyway.

  2. Simplified local_upcast_elemwise. This rewrite was overly complex and wasteful, in that it wrapped constants in symbolic expand_dims / alloc + cast; it now does this directly in numpy. This reduces the number of rewrite iterations.

  3. A bunch of improvements to rewrites, including lifting index operations on the batch dimensions of Blockwise and extending the dot subtensor lift to work with the Blockwise case (that rewrite predates Blockwise). The others are self-explanatory.

  4. Canonicalize matvec, vecmat, and vecdot internally to all use matmul (i.e., Blockwise of the 2D x 2D dot operation). This makes things simpler for our rewrites, because we only need to worry about one case. The identities behind this and the previous point are illustrated in the first sketch after this list.

  5. The pre-existing local_batched_matmul_to_core_matmul rewrite was extended to better address cases of batched matvec, vecmat, and vecdot (batch dimensions are moved into the core dimensions). It now moves the non-overlapping batch dimensions of both inputs to their core dimensions. It further tries to avoid reshape (needed when combining multiple batch/core dimensions), so that the subtensor_lift rewrites mentioned above can still work through the result (see the second sketch after this list).
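As a rough illustration of the identities points 3 and 4 rely on (NumPy is used here only to check them numerically; the actual rewrites operate on the PyTensor graph):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 6))
B = rng.normal(size=(6, 6))
v = rng.normal(size=6)

# Subtensor lift through dot: slicing the product equals slicing the
# left operand first, which avoids computing the unneeded rows.
np.testing.assert_allclose((A @ B)[:2], A[:2] @ B)

# matvec / vecmat / vecdot written as the single 2D-dot (matmul) case,
# by expanding the vector operand(s) and squeezing the result.
np.testing.assert_allclose(A @ v, (A @ v[:, None]).squeeze(-1))   # matvec
np.testing.assert_allclose(v @ A, (v[None, :] @ A).squeeze(-2))   # vecmat
np.testing.assert_allclose(v @ v, (v[None, :] @ v[:, None]).squeeze((-2, -1)))  # vecdot
```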
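And a sketch of the batched-to-core matmul idea from point 5: when only one operand has batch dimensions, they can be folded into its core rows so a single core dot does the work (again a NumPy check only; the rewrite tries to avoid the explicit reshape when it can):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3, 5, 6))   # two batch dims + core shape (5, 6)
B = rng.normal(size=(6, 7))         # no batch dims

batched = np.matmul(A, B)                           # shape (4, 3, 5, 7)
core = (A.reshape(-1, 6) @ B).reshape(4, 3, 5, 7)   # one core dot, reshaped back
np.testing.assert_allclose(batched, core)
```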

Benchmark results added in the last commit:
(Note that vectorize=True goes from underperforming (~28 ms) to outperforming (~0.37 ms).)

Before
------------------------------------------------------------------------------------------------- benchmark: 2 tests ------------------------------------------------------------------------------------------------
Name (time in ms)                                        Min                Max               Mean            StdDev             Median               IQR            Outliers       OPS            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_partial_jacobian[vectorize=False]      1.9453 (1.0)       2.8201 (1.0)       2.2296 (1.0)      0.0963 (1.0)       2.2031 (1.0)      0.0855 (1.0)         52;25  448.5095 (1.0)         421           1
test_benchmark_partial_jacobian[vectorize=True]      28.8122 (14.81)    36.9261 (13.09)    34.1470 (15.32)    2.3973 (24.90)    34.8889 (15.84)    2.6797 (31.35)         8;1   29.2851 (0.07)         21           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

After
--------------------------------------------------------------------------------------------------------- benchmark: 2 tests --------------------------------------------------------------------------------------------------------
Name (time in us)                                           Min                   Max                  Mean             StdDev                Median                IQR            Outliers         OPS            Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_partial_jacobian[vectorize=True]        345.7980 (1.0)        658.8850 (1.0)        370.9925 (1.0)      41.1362 (1.0)        357.2400 (1.0)      16.9117 (1.0)         24;34  2,695.4724 (1.0)         287           1
test_benchmark_partial_jacobian[vectorize=False]     2,148.9270 (6.21)     3,062.8910 (4.65)     2,215.2234 (5.97)     77.6787 (1.89)     2,194.7940 (6.14)     44.7890 (2.65)        33;34    451.4217 (0.17)        496           1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

vectorized jacobian code before:

Subtensor{:stop, :stop} [id A] shape=(5, 5) 9
 ├─ DimShuffle{order=[1,0]} [id B] shape=(1000, 1000) 8
 │  └─ Reshape{3} [id C] shape=(1000, 1000, 1) 7
 │     ├─ Dot22 [id D] shape=(1000, 1000) 6
 │     │  ├─ [[0.903246 ... 74841955]] [id E] shape=(1000, 1000)
 │     │  └─ Reshape{2} [id F] shape=(1000, 1000) 5
 │     │     ├─ True_div [id G] shape=(1000, 1000, 1) 4
 │     │     │  ├─ [[[0.0005] ... [0.0005]]] [id H] shape=(1000, 1000, 1)
 │     │     │  └─ Composite{sqrt((0.001 * i0))} [id I] shape=(1000, 1, 1) 3
 │     │     │     └─ ExpandDims{axes=[1, 2]} [id J] shape=(1000, 1, 1) 2
 │     │     │        └─ CGemv{inplace} [id K] shape=(1000,) 1
 │     │     │           ├─ AllocEmpty{dtype='float64'} [id L] shape=(1000,) 0
 │     │     │           │  └─ 1000 [id M] shape=()
 │     │     │           ├─ 1.0 [id N] shape=()
 │     │     │           ├─ [[0.903246 ... 74841955]] [id O] shape=(1000, 1000)
 │     │     │           ├─ x [id P] shape=(?,)
 │     │     │           └─ 0.0 [id Q] shape=()
 │     │     └─ [1000   -1] [id R] shape=(2,)
 │     └─ [1000 1000    1] [id S] shape=(3,)
 ├─ 5 [id T] shape=()
 └─ 5 [id T] shape=()

and after:

Dot22 [id A] shape=(5, 5) 5
 ├─ True_div [id B] shape=(5, 1000) 4
 │  ├─ [[0.0005 0 ... 0.    ]] [id C] shape=(5, 1000)
 │  └─ Composite{sqrt((0.001 * i0))} [id D] shape=(1, 1000) 3
 │     └─ ExpandDims{axis=0} [id E] shape=(1, 1000) 2
 │        └─ CGemv{inplace} [id F] shape=(1000,) 1
 │           ├─ AllocEmpty{dtype='float64'} [id G] shape=(1000,) 0
 │           │  └─ 1000 [id H] shape=()
 │           ├─ 1.0 [id I] shape=()
 │           ├─ [[0.903246 ... 74841955]] [id J] shape=(1000, 1000)
 │           ├─ x [id K] shape=(?,)
 │           └─ 0.0 [id L] shape=()
 └─ [[0.903246 ... 45926986]] [id M] shape=(1000, 5)

📚 Documentation preview 📚: https://pytensor--1471.org.readthedocs.build/en/1471/
