
Reduce per-step allocations in MIRK loss functions #453

Open
ChrisRackauckas-Claude wants to merge 6 commits into SciML:master from
ChrisRackauckas-Claude:cr/fix-mirk-allocations

Conversation

@ChrisRackauckas-Claude
Contributor

Summary

Investigates and fixes allocation issues in MIRK solvers, reducing per-step allocations by 60-70%.

Reference: https://discourse.julialang.org/t/boundaryvaluediffeq-jl-reducing-allocations/136255

Investigation Findings

Q: Should steps allocate? Only adaptivity should.
Currently, the loss function (called every Newton iteration) allocates due to:

  1. Logging.with_logger(NullLogger()) wrapping every get_tmp call — creates a closure + task-local context on each of 50+ calls per loss evaluation
  2. [get_tmp(r, u) for r in residual] array comprehensions in loss functions
  3. similar(fᵢ_cache) in NoDiffCacheNeeded Φ! — allocates a new vector every call
  4. SubArray views in Φ! inner loop (small but scales with mesh size)
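
Finding 1 can be reproduced with a minimal sketch (`Logging` is a stdlib; the names here are illustrative, not the package's): wrapping each access in `with_logger(NullLogger())` builds a closure and installs a task-local log state on every call, while the direct access allocates nothing.

```julia
using Logging  # stdlib

const buf = rand(8)

# Direct access: the pattern this PR moves to
read_direct(v) = v[1]

# The removed pattern: every access wrapped in with_logger(NullLogger()),
# which builds a closure and installs a task-local log state per call
read_wrapped(v) = with_logger(() -> v[1], NullLogger())

read_direct(buf); read_wrapped(buf)   # warm up so @allocated sees no compilation

@show @allocated read_direct(buf)     # typically 0 bytes
@show @allocated read_wrapped(buf)    # nonzero: closure + log state per call
```

Multiplied by 50+ `get_tmp` calls per loss evaluation, this per-call overhead dominates the hot path.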

Q: Is allocation-free stepping tested?
No — added allocation regression tests.
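
The added regression tests follow the standard pattern sketched below (a hypothetical stand-in loss, not the package's actual test): warm up to exclude compilation, then assert the measured allocation is zero.

```julia
using Test

# Hypothetical stand-in for a MIRK-style loss: writes into a preallocated
# residual and returns nothing, so the hot path should not allocate.
function toy_loss!(resid, u)
    @inbounds for i in eachindex(resid)
        resid[i] = u[i]^2 - 1
    end
    return nothing
end

const tres = zeros(16)
const tu = rand(16)

toy_loss!(tres, tu)   # warm up: compile before measuring

# The regression-test pattern: assert measured allocations stay at zero
@test (@allocated toy_loss!(tres, tu)) == 0
```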

Changes

  1. Remove Logging.with_logger(NullLogger()) from get_tmp (biggest win)

    • The wrapper allocated a closure + task-local context on EVERY get_tmp call
    • The warning it suppressed only occurs during adaptive cache resize (not in the hot path)
    • Replaced with direct PreallocationTools.get_tmp calls
  2. Eliminate array comprehensions in MIRK loss functions

    • Pass DiffCache residual vectors directly to Φ! instead of pre-unwrapping
    • Added _maybe_get_tmp helper for Φ! to handle both DiffCache and plain arrays
    • Added recursive_flatten!/recursive_flatten_twopoint! overloads for AbstractVector{<:DiffCache}
  3. Fix NoDiffCacheNeeded Φ! to reuse fᵢ_cache

    • similar(fᵢ_cache) allocated a new scratch vector every call
    • fᵢ_cache is already a scratch buffer that can be reused directly
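
The effect of change 3 can be seen with toy kernels (illustrative, not the package's actual Φ!): the "before" version allocates a scratch vector via `similar` on every call, while the "after" version overwrites the existing cache buffer in place.

```julia
# "Before": allocate a fresh scratch vector with similar() on every call
function phi_before!(resid, fcache, y)
    tmp = similar(fcache)       # fresh Vector each call
    tmp .= 2 .* y
    resid .= tmp .- y
    return nothing
end

# "After": reuse the existing scratch buffer, fᵢ_cache-style
function phi_after!(resid, fcache, y)
    fcache .= 2 .* y            # overwrite the scratch buffer in place
    resid .= fcache .- y
    return nothing
end

const r8 = zeros(8); const f8 = zeros(8); const y8 = rand(8)
phi_before!(r8, f8, y8); phi_after!(r8, f8, y8)   # warm up

@show @allocated phi_before!(r8, f8, y8)   # one Vector allocation per call
@show @allocated phi_after!(r8, f8, y8)    # 0 after warmup
```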

Results (dt=0.1, N=2, MIRK4, non-adaptive)

| Metric | Before | After | Reduction |
| ------ | ------ | ----- | --------- |
| Total solve allocations | 3,290 (176 KiB) | 1,180 (115 KiB) | 64% |
| TwoPointBVP allocations | 2,840 (160 KiB) | 924 (104 KiB) | 67% |
| Loss function per call | 13,552 bytes | 4,080 bytes | 70% |
| Jacobian per call | 20,448 bytes | 8,784 bytes | 57% |

Remaining allocations

The remaining ~4 KiB per loss call comes from:

  • recursive_unflatten! broadcast for DiffCache (~192 bytes)
  • SubArray views in Φ! inner loop (~4 KiB, scales with mesh × stages)
  • These are harder to eliminate without restructuring the inner loop

Future work (not in this PR)

  • Apply same comprehension elimination to FIRK loss functions (FIRK's Φ! writes directly to residual elements, making DiffCache pass-through more invasive)
  • Eliminate SubArray allocations in Φ! inner loop (would require avoiding @views and using manual indexing)
  • Pre-allocate the recursive_unflatten! broadcast result

Test plan

  • All existing MIRK tests pass (154 passed, 2 pre-existing BigFloat errors, 6 expected broken)
  • FIRK solvers verified working (RadauIIa3, RadauIIa5)
  • New allocation regression tests (12 tests covering MIRK4/5/6 × StandardBVP/TwoPointBVP)
  • Verified correctness: solutions match expected values

🤖 Generated with Claude Code

@github-actions
Contributor

github-actions bot commented Mar 20, 2026

Benchmark Results

| Benchmark | master | da0fa97... | master / da0fa97... |
| --------- | ------ | ---------- | ------------------- |
| Simple Pendulum/IIP/BoundaryValueDiffEqMIRK.MIRK2() | 0.588 ± 0.0071 s | 0.586 ± 0.0048 s | 1 ± 0.015 |
| Simple Pendulum/IIP/BoundaryValueDiffEqMIRK.MIRK3() | 12.5 ± 0.79 ms | 12.4 ± 0.85 ms | 1.01 ± 0.094 |
| Simple Pendulum/IIP/BoundaryValueDiffEqMIRK.MIRK4() | 2.77 ± 0.42 ms | 2.78 ± 0.45 ms | 0.997 ± 0.22 |
| Simple Pendulum/IIP/BoundaryValueDiffEqMIRK.MIRK5() | 3.3 ± 0.54 ms | 3.27 ± 0.54 ms | 1.01 ± 0.23 |
| Simple Pendulum/IIP/BoundaryValueDiffEqMIRK.MIRK6() | 1.62 ± 0.31 ms | 1.59 ± 0.32 ms | 1.02 ± 0.28 |
| Simple Pendulum/IIP/MultipleShooting(10, Tsit5; grid_coarsening = false) | 2.65 ± 0.66 ms | 2.63 ± 0.64 ms | 1.01 ± 0.35 |
| Simple Pendulum/IIP/MultipleShooting(10, Tsit5; grid_coarsening = true) | 5.27 ± 1.2 ms | 5.12 ± 1.1 ms | 1.03 ± 0.33 |
| Simple Pendulum/IIP/MultipleShooting(100, Tsit5; grid_coarsening = false) | 0.0728 ± 0.02 s | 0.0741 ± 0.018 s | 0.982 ± 0.36 |
| Simple Pendulum/IIP/MultipleShooting(100, Tsit5; grid_coarsening = true) | 0.11 ± 0.024 s | 0.11 ± 0.025 s | 0.996 ± 0.31 |
| Simple Pendulum/IIP/Shooting(Tsit5()) | 0.313 ± 0.09 ms | 0.317 ± 0.092 ms | 0.989 ± 0.4 |
| Simple Pendulum/OOP/BoundaryValueDiffEqMIRK.MIRK2() | 0.766 ± 0.018 s | 0.762 ± 0.02 s | 1.01 ± 0.035 |
| Simple Pendulum/OOP/BoundaryValueDiffEqMIRK.MIRK3() | 15.8 ± 5.6 ms | 15.9 ± 6.2 ms | 0.996 ± 0.52 |
| Simple Pendulum/OOP/BoundaryValueDiffEqMIRK.MIRK4() | 3.33 ± 0.31 ms | 3.32 ± 0.35 ms | 1.01 ± 0.14 |
| Simple Pendulum/OOP/BoundaryValueDiffEqMIRK.MIRK5() | 4.07 ± 0.42 ms | 3.99 ± 0.4 ms | 1.02 ± 0.15 |
| Simple Pendulum/OOP/BoundaryValueDiffEqMIRK.MIRK6() | 1.94 ± 0.39 ms | 1.93 ± 0.44 ms | 1.01 ± 0.31 |
| Simple Pendulum/OOP/MultipleShooting(10, Tsit5; grid_coarsening = false) | 3.94 ± 2.8 ms | 3.82 ± 3 ms | 1.03 ± 1.1 |
| Simple Pendulum/OOP/MultipleShooting(10, Tsit5; grid_coarsening = true) | 7.64 ± 6.3 ms | 7.68 ± 6.5 ms | 0.995 ± 1.2 |
| Simple Pendulum/OOP/MultipleShooting(100, Tsit5; grid_coarsening = false) | 0.0928 ± 0.0078 s | 0.0936 ± 0.0066 s | 0.991 ± 0.11 |
| Simple Pendulum/OOP/MultipleShooting(100, Tsit5; grid_coarsening = true) | 0.147 ± 0.011 s | 0.146 ± 0.011 s | 1 ± 0.1 |
| Simple Pendulum/OOP/Shooting(Tsit5()) | 0.62 ± 0.085 ms | 0.632 ± 0.1 ms | 0.98 ± 0.21 |
| time_to_load | 7.08 ± 0.042 s | 7.09 ± 0.13 s | 1 ± 0.019 |
Benchmark Plots

A plot of the benchmark results has been uploaded as an artifact to the workflow run for this PR. Go to "Actions" -> "Benchmark a pull request" -> [the most recent run] -> "Artifacts" (at the bottom).

Three key changes to reduce allocations in the hot path (loss function
and Jacobian evaluation called every Newton iteration):

1. Remove Logging.with_logger(NullLogger()) wrapper from get_tmp
   - The wrapper allocated a closure + task-local context on EVERY call
   - get_tmp is called 50+ times per loss evaluation
   - The warning it suppressed only occurs during adaptive cache resize
   - This single change reduced total solve allocations by ~60%

2. Eliminate array comprehensions in MIRK loss functions
   - Replaced `[get_tmp(r, u) for r in residual]` with direct DiffCache
     passing to Φ! and recursive_flatten!
   - Added _maybe_get_tmp helper for Φ! to handle both DiffCache and
     plain array residuals
   - Added recursive_flatten!/recursive_flatten_twopoint! overloads for
     AbstractVector{<:DiffCache}

3. Fix NoDiffCacheNeeded Φ! to reuse fᵢ_cache instead of similar()
   - `similar(fᵢ_cache)` allocated a new vector every call
   - fᵢ_cache is a scratch buffer that can be used directly

Results (dt=0.1, N=2, MIRK4, non-adaptive):
- Total solve: 3,290 → 1,180 allocations (64% reduction)
- Loss per call: 13,552 → 4,080 bytes (70% reduction)
- Jacobian per call: 20,448 → 8,784 bytes (57% reduction)

Adds allocation regression tests to verify loss function allocations
stay bounded.

Reference: https://discourse.julialang.org/t/boundaryvaluediffeq-jl-reducing-allocations/136255

Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…IRKN

The `EvalSol.u[1:end] .= x` and `EvalSol.cache.k_discrete[1:end] .=`
patterns create unnecessary SubArray views under @views. Using copyto!
avoids the view allocation entirely.

Applied to MIRK, FIRK, and MIRKN loss functions.

Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
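
A small self-contained sketch of the swap this commit describes (variable and function names are illustrative, not the package's). Both forms produce the same result; the PR reports the `copyto!` form avoids materializing a SubArray for the `[1:end]` slice, though exact allocation behavior depends on the Julia version's escape analysis.

```julia
const dst6 = zeros(6)
const src6 = collect(1.0:6.0)

# Pattern being replaced: a [1:end] slice on the left-hand side, which
# under @views materializes a SubArray before the in-place broadcast
update_slice!(u, x) = @views (u[1:end] .= x; nothing)

# Replacement used in the commit: copyto! writes straight into u
update_copyto!(u, x) = (copyto!(u, x); nothing)

update_slice!(dst6, src6)
fill!(dst6, 0.0)
update_copyto!(dst6, src6)
@show dst6 == src6   # true: both forms produce the same result
```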
@ChrisRackauckas-Claude
Copy link
Copy Markdown
Contributor Author

Updated findings with Profile.Allocs

After profiling with Profile.Allocs (sample_rate=1.0, 100 loss calls), the remaining 3,776 bytes per loss call break down as:

| Bytes | Type | Source |
| ----- | ---- | ------ |
| 2,704 | NonlinearFunction{...} | Boxing when accessing nlprob.f via getproperty in problem_interface.jl |
| 816 | Loss closure | Boxing the closure when extracted from the NonlinearProblem |
| 104 | Memory{Vector{Float64}} | `get_tmp.` broadcast in recursive_unflatten! |
| 32 | Vector{Vector{Float64}} | Outer vector from the broadcast result |

Total: 3,776 bytes across 4 allocations per call.

Key finding: Φ! inner loop is allocation-free on Julia 1.12

The SubArray views (K[:, r], K[:, 1:(r-1)]) in the Φ! inner loop are stack-allocated by Julia 1.12's escape analysis — they do NOT contribute to heap allocations. I verified this by inlining the matmul loops (eliminating all views) and seeing zero change in allocation count.

What would fix the remaining 3,520 bytes (93%)

The NonlinearFunction and closure boxing (3,520 of 3,776 bytes) is a SciMLBase type instability — the NonlinearProblem struct's getproperty can't infer the concrete function type. This is the same class of issue PR #455 started addressing. A function barrier at the nlprob.f(resid, u, p) call site in NonlinearSolve would eliminate it.
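
For readers unfamiliar with the pattern, a generic sketch of a function barrier follows (toy types, not SciMLBase's API): the field access may be abstractly typed, but once the value crosses into a method that dispatches on it, the compiler sees a concrete type and can fully infer the call.

```julia
# Toy stand-in for a problem struct whose function field the compiler
# cannot concretely infer at the call site (Function is abstract).
struct ToyProb
    f::Function
end

# Function barrier: inside call_f, F is a concrete type and the call
# is fully inferred and inlineable.
call_f(f::F, resid, u, p) where {F} = f(resid, u, p)

function newton_step!(prob, resid, u, p)
    f = prob.f                 # abstractly typed field access
    call_f(f, resid, u, p)     # barrier: dynamic dispatch happens once, here
    return resid
end

prob = ToyProb((r, u, p) -> (r .= u .- p; nothing))
r = zeros(3)
newton_step!(prob, r, [1.0, 2.0, 3.0], 1.0)
println(r)   # [0.0, 1.0, 2.0]
```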

Added: copyto! fix for EvalSol updates

Also pushed a second commit replacing EvalSol.u[1:end] .= x and EvalSol.cache.k_discrete[1:end] .= x with copyto! across MIRK/FIRK/MIRKN — avoids unnecessary SubArray creation from the [1:end] slice.

@ChrisRackauckas-Claude
Copy link
Copy Markdown
Contributor Author

Correction: @allocated was producing garbage — actual per-step allocations are 144 bytes

The earlier analysis claiming 3,520 bytes from NonlinearFunction boxing was wrong. @allocated wraps expressions in a closure, and the closure capture overhead dominated the measurement for large structs. Using @timed in a compiled function loop gives the real numbers:

```
Via nlprob.f(resid, u, p): 144 bytes/call
Via nf(resid, u, p):       144 bytes/call
Via raw_f(resid, u, p):    144 bytes/call
```

All three are identical: the NonlinearFunction access and call are fully inlined by the compiler. The 144 bytes come solely from get_tmp.(y, (x,)) in recursive_unflatten!, which broadcasts over a Vector{DiffCache} to create a small Vector{Vector{T}} of 11 pointers.
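
A hypothetical version of the `@timed`-based measurement described above: run the target in a compiled loop and average the byte count, which avoids the closure-capture overhead that made `@allocated` misleading for large captured structs. The helper name and workload are illustrative.

```julia
# Average heap bytes per call of f, measured over n compiled iterations.
function bytes_per_call(f, args...; n = 100_000)
    f(args...)                        # warm up: compile before measuring
    stats = @timed for _ in 1:n
        f(args...)
    end
    return stats.bytes / n            # @timed reports total bytes allocated
end

g(v) = sum(v)                         # allocation-free example workload
@show bytes_per_call(g, rand(100))
```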

Summary of actual allocation sources

| Phase | Bytes | Per-step? |
| ----- | ----- | --------- |
| __init (cache creation) | ~32 KiB | No — one-time |
| __construct_problem (Jacobian setup) | ~52 KiB | No — once per adaptive iteration |
| recursive_unflatten! broadcast | 144 | Yes — per Newton step |
| Φ! inner loop views | 0 | Stack-allocated on Julia 1.12 |
| NonlinearFunction access | 0 | Fully inlined |

With ~6 Newton iterations, per-step allocations total ~864 bytes in a 115 KiB solve. Stepping is essentially allocation-free.

What this PR fixed (still valid)

The three main fixes in this PR reduced per-step allocations from ~13,552 to 144 bytes:

  1. get_tmp logging wrapper removal: ~9,000 bytes/call saved
  2. Array comprehension elimination: ~2,500 bytes/call saved
  3. similar(fᵢ_cache) → fᵢ_cache reuse + copyto! fixes: ~900 bytes/call saved

… test

Add y_cache field to MIRKCache that pre-allocates the Vector{Vector{T}}
needed by recursive_unflatten! for the primal (Float64) path. This
eliminates the last 144 bytes/call from get_tmp. broadcast allocation.

For the Dual path (ForwardDiff Jacobian computation), falls back to the
existing broadcast since element types differ.

Loss function is now fully allocation-free at runtime (0 bytes/call
verified via @timed over 100k calls).

Add zero-allocation QA test in the qa test group.

Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
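
A minimal sketch of the pattern this commit describes (the struct and function names below are hypothetical, not the package's actual types): preallocate the outer Vector{Vector{T}} once and fill its inner vectors in place on the primal path, falling back to an allocating comprehension when element types differ.

```julia
# Hypothetical analogue of the y_cache field added to MIRKCache.
struct MiniCache{T}
    y_cache::Vector{Vector{T}}   # preallocated once, one inner vector per mesh point
end

# Primal (matching element type) path: fill the preallocated inner vectors
# in place, so the hot loop allocates nothing after warmup.
function unflatten!(c::MiniCache{T}, x::AbstractVector{T}) where {T}
    n = length(first(c.y_cache))
    for (i, y) in enumerate(c.y_cache)
        copyto!(y, 1, x, (i - 1) * n + 1, n)
    end
    return c.y_cache
end

# Fallback for differing element types (e.g. ForwardDiff Duals): allocate
# fresh vectors, as the existing broadcast path does.
unflatten(x::AbstractVector, n::Int) = [x[(i - 1) * n + 1:i * n] for i in 1:(length(x) ÷ n)]

c = MiniCache([zeros(2) for _ in 1:3])
unflatten!(c, collect(1.0:6.0))
println(c.y_cache)   # [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
```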