
Reduce per-step allocations in MIRK loss functions #453

Open
ChrisRackauckas-Claude wants to merge 6 commits into SciML:master from
ChrisRackauckas-Claude:cr/fix-mirk-allocations

Conversation

@ChrisRackauckas-Claude
Contributor

Summary

Investigates and fixes allocation issues in MIRK solvers, reducing per-step allocations by 60-70%.

Reference: https://discourse.julialang.org/t/boundaryvaluediffeq-jl-reducing-allocations/136255

Investigation Findings

Q: Should steps allocate? Only adaptivity should.
Currently, the loss function (called every Newton iteration) allocates due to:

  1. Logging.with_logger(NullLogger()) wrapping every get_tmp call — creates a closure + task-local context on each of 50+ calls per loss evaluation
  2. [get_tmp(r, u) for r in residual] array comprehensions in loss functions
  3. similar(fᵢ_cache) in NoDiffCacheNeeded Φ! — allocates a new vector every call
  4. SubArray views in Φ! inner loop (small but scales with mesh size)
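
Finding 1 can be reproduced with a minimal sketch (`Logging` is a stdlib; the names here are illustrative, not the package's): wrapping each access in `with_logger(NullLogger())` builds a closure and installs a task-local log state on every call, while the direct access allocates nothing.

```julia
using Logging  # stdlib

const buf = rand(8)

# Direct access: the pattern this PR moves to
read_direct(v) = v[1]

# The removed pattern: every access wrapped in with_logger(NullLogger()),
# which builds a closure and installs a task-local log state per call
read_wrapped(v) = with_logger(() -> v[1], NullLogger())

read_direct(buf); read_wrapped(buf)   # warm up so @allocated sees no compilation

@show @allocated read_direct(buf)     # typically 0 bytes
@show @allocated read_wrapped(buf)    # nonzero: closure + log state per call
```

Multiplied by 50+ `get_tmp` calls per loss evaluation, this per-call overhead dominates the hot path.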

Q: Is allocation-free stepping tested?
No — added allocation regression tests.
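
The added regression tests follow the standard pattern sketched below (a hypothetical stand-in loss, not the package's actual test): warm up to exclude compilation, then assert the measured allocation is zero.

```julia
using Test

# Hypothetical stand-in for a MIRK-style loss: writes into a preallocated
# residual and returns nothing, so the hot path should not allocate.
function toy_loss!(resid, u)
    @inbounds for i in eachindex(resid)
        resid[i] = u[i]^2 - 1
    end
    return nothing
end

const tres = zeros(16)
const tu = rand(16)

toy_loss!(tres, tu)   # warm up: compile before measuring

# The regression-test pattern: assert measured allocations stay at zero
@test (@allocated toy_loss!(tres, tu)) == 0
```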

Changes

  1. Remove Logging.with_logger(NullLogger()) from get_tmp (biggest win)

    • The wrapper allocated a closure + task-local context on EVERY get_tmp call
    • The warning it suppressed only occurs during adaptive cache resize (not in the hot path)
    • Replaced with direct PreallocationTools.get_tmp calls
  2. Eliminate array comprehensions in MIRK loss functions

    • Pass DiffCache residual vectors directly to Φ! instead of pre-unwrapping
    • Added _maybe_get_tmp helper for Φ! to handle both DiffCache and plain arrays
    • Added recursive_flatten!/recursive_flatten_twopoint! overloads for AbstractVector{<:DiffCache}
  3. Fix NoDiffCacheNeeded Φ! to reuse fᵢ_cache

    • similar(fᵢ_cache) allocated a new scratch vector every call
    • fᵢ_cache is already a scratch buffer that can be reused directly
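
The effect of change 3 can be seen with toy kernels (illustrative, not the package's actual Φ!): the "before" version allocates a scratch vector via `similar` on every call, while the "after" version overwrites the existing cache buffer in place.

```julia
# "Before": allocate a fresh scratch vector with similar() on every call
function phi_before!(resid, fcache, y)
    tmp = similar(fcache)       # fresh Vector each call
    tmp .= 2 .* y
    resid .= tmp .- y
    return nothing
end

# "After": reuse the existing scratch buffer, fᵢ_cache-style
function phi_after!(resid, fcache, y)
    fcache .= 2 .* y            # overwrite the scratch buffer in place
    resid .= fcache .- y
    return nothing
end

const r8 = zeros(8); const f8 = zeros(8); const y8 = rand(8)
phi_before!(r8, f8, y8); phi_after!(r8, f8, y8)   # warm up

@show @allocated phi_before!(r8, f8, y8)   # one Vector allocation per call
@show @allocated phi_after!(r8, f8, y8)    # 0 after warmup
```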

Results (dt=0.1, N=2, MIRK4, non-adaptive)

| Metric | Before | After | Reduction |
| ------ | ------ | ----- | --------- |
| Total solve allocations | 3,290 (176 KiB) | 1,180 (115 KiB) | 64% |
| TwoPointBVP allocations | 2,840 (160 KiB) | 924 (104 KiB) | 67% |
| Loss function per call | 13,552 bytes | 4,080 bytes | 70% |
| Jacobian per call | 20,448 bytes | 8,784 bytes | 57% |

Remaining allocations

The remaining ~4 KiB per loss call comes from:

  • recursive_unflatten! broadcast for DiffCache (~192 bytes)
  • SubArray views in Φ! inner loop (~4 KiB, scales with mesh × stages)
  • These are harder to eliminate without restructuring the inner loop

Future work (not in this PR)

  • Apply same comprehension elimination to FIRK loss functions (FIRK's Φ! writes directly to residual elements, making DiffCache pass-through more invasive)
  • Eliminate SubArray allocations in Φ! inner loop (would require avoiding @views and using manual indexing)
  • Pre-allocate the recursive_unflatten! broadcast result

Test plan

  • All existing MIRK tests pass (154 passed, 2 pre-existing BigFloat errors, 6 expected broken)
  • FIRK solvers verified working (RadauIIa3, RadauIIa5)
  • New allocation regression tests (12 tests covering MIRK4/5/6 × StandardBVP/TwoPointBVP)
  • Verified correctness: solutions match expected values

🤖 Generated with Claude Code

@github-actions
Contributor

github-actions bot commented Mar 20, 2026

Benchmark Results

| Benchmark | master | da0fa97... | master / da0fa97... |
| --------- | ------ | ---------- | ------------------- |
| Simple Pendulum/IIP/BoundaryValueDiffEqMIRK.MIRK2() | 0.588 ± 0.0071 s | 0.586 ± 0.0048 s | 1 ± 0.015 |
| Simple Pendulum/IIP/BoundaryValueDiffEqMIRK.MIRK3() | 12.5 ± 0.79 ms | 12.4 ± 0.85 ms | 1.01 ± 0.094 |
| Simple Pendulum/IIP/BoundaryValueDiffEqMIRK.MIRK4() | 2.77 ± 0.42 ms | 2.78 ± 0.45 ms | 0.997 ± 0.22 |
| Simple Pendulum/IIP/BoundaryValueDiffEqMIRK.MIRK5() | 3.3 ± 0.54 ms | 3.27 ± 0.54 ms | 1.01 ± 0.23 |
| Simple Pendulum/IIP/BoundaryValueDiffEqMIRK.MIRK6() | 1.62 ± 0.31 ms | 1.59 ± 0.32 ms | 1.02 ± 0.28 |
| Simple Pendulum/IIP/MultipleShooting(10, Tsit5; grid_coarsening = false) | 2.65 ± 0.66 ms | 2.63 ± 0.64 ms | 1.01 ± 0.35 |
| Simple Pendulum/IIP/MultipleShooting(10, Tsit5; grid_coarsening = true) | 5.27 ± 1.2 ms | 5.12 ± 1.1 ms | 1.03 ± 0.33 |
| Simple Pendulum/IIP/MultipleShooting(100, Tsit5; grid_coarsening = false) | 0.0728 ± 0.02 s | 0.0741 ± 0.018 s | 0.982 ± 0.36 |
| Simple Pendulum/IIP/MultipleShooting(100, Tsit5; grid_coarsening = true) | 0.11 ± 0.024 s | 0.11 ± 0.025 s | 0.996 ± 0.31 |
| Simple Pendulum/IIP/Shooting(Tsit5()) | 0.313 ± 0.09 ms | 0.317 ± 0.092 ms | 0.989 ± 0.4 |
| Simple Pendulum/OOP/BoundaryValueDiffEqMIRK.MIRK2() | 0.766 ± 0.018 s | 0.762 ± 0.02 s | 1.01 ± 0.035 |
| Simple Pendulum/OOP/BoundaryValueDiffEqMIRK.MIRK3() | 15.8 ± 5.6 ms | 15.9 ± 6.2 ms | 0.996 ± 0.52 |
| Simple Pendulum/OOP/BoundaryValueDiffEqMIRK.MIRK4() | 3.33 ± 0.31 ms | 3.32 ± 0.35 ms | 1.01 ± 0.14 |
| Simple Pendulum/OOP/BoundaryValueDiffEqMIRK.MIRK5() | 4.07 ± 0.42 ms | 3.99 ± 0.4 ms | 1.02 ± 0.15 |
| Simple Pendulum/OOP/BoundaryValueDiffEqMIRK.MIRK6() | 1.94 ± 0.39 ms | 1.93 ± 0.44 ms | 1.01 ± 0.31 |
| Simple Pendulum/OOP/MultipleShooting(10, Tsit5; grid_coarsening = false) | 3.94 ± 2.8 ms | 3.82 ± 3 ms | 1.03 ± 1.1 |
| Simple Pendulum/OOP/MultipleShooting(10, Tsit5; grid_coarsening = true) | 7.64 ± 6.3 ms | 7.68 ± 6.5 ms | 0.995 ± 1.2 |
| Simple Pendulum/OOP/MultipleShooting(100, Tsit5; grid_coarsening = false) | 0.0928 ± 0.0078 s | 0.0936 ± 0.0066 s | 0.991 ± 0.11 |
| Simple Pendulum/OOP/MultipleShooting(100, Tsit5; grid_coarsening = true) | 0.147 ± 0.011 s | 0.146 ± 0.011 s | 1 ± 0.1 |
| Simple Pendulum/OOP/Shooting(Tsit5()) | 0.62 ± 0.085 ms | 0.632 ± 0.1 ms | 0.98 ± 0.21 |
| time_to_load | 7.08 ± 0.042 s | 7.09 ± 0.13 s | 1 ± 0.019 |
Benchmark Plots

A plot of the benchmark results has been uploaded as an artifact to the workflow run for this PR. Go to "Actions" -> "Benchmark a pull request" -> [the most recent run] -> "Artifacts" (at the bottom).

Three key changes to reduce allocations in the hot path (loss function
and Jacobian evaluation called every Newton iteration):

1. Remove Logging.with_logger(NullLogger()) wrapper from get_tmp
   - The wrapper allocated a closure + task-local context on EVERY call
   - get_tmp is called 50+ times per loss evaluation
   - The warning it suppressed only occurs during adaptive cache resize
   - This single change reduced total solve allocations by ~60%

2. Eliminate array comprehensions in MIRK loss functions
   - Replaced `[get_tmp(r, u) for r in residual]` with direct DiffCache
     passing to Φ! and recursive_flatten!
   - Added _maybe_get_tmp helper for Φ! to handle both DiffCache and
     plain array residuals
   - Added recursive_flatten!/recursive_flatten_twopoint! overloads for
     AbstractVector{<:DiffCache}

3. Fix NoDiffCacheNeeded Φ! to reuse fᵢ_cache instead of similar()
   - `similar(fᵢ_cache)` allocated a new vector every call
   - fᵢ_cache is a scratch buffer that can be used directly

Results (dt=0.1, N=2, MIRK4, non-adaptive):
- Total solve: 3,290 → 1,180 allocations (64% reduction)
- Loss per call: 13,552 → 4,080 bytes (70% reduction)
- Jacobian per call: 20,448 → 8,784 bytes (57% reduction)

Adds allocation regression tests to verify loss function allocations
stay bounded.

Reference: https://discourse.julialang.org/t/boundaryvaluediffeq-jl-reducing-allocations/136255

Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…IRKN

The `EvalSol.u[1:end] .= x` and `EvalSol.cache.k_discrete[1:end] .=`
patterns create unnecessary SubArray views under @views. Using copyto!
avoids the view allocation entirely.

Applied to MIRK, FIRK, and MIRKN loss functions.

Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
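
A small self-contained sketch of the swap this commit describes (variable and function names are illustrative, not the package's). Both forms produce the same result; the PR reports the `copyto!` form avoids materializing a SubArray for the `[1:end]` slice, though exact allocation behavior depends on the Julia version's escape analysis.

```julia
const dst6 = zeros(6)
const src6 = collect(1.0:6.0)

# Pattern being replaced: a [1:end] slice on the left-hand side, which
# under @views materializes a SubArray before the in-place broadcast
update_slice!(u, x) = @views (u[1:end] .= x; nothing)

# Replacement used in the commit: copyto! writes straight into u
update_copyto!(u, x) = (copyto!(u, x); nothing)

update_slice!(dst6, src6)
fill!(dst6, 0.0)
update_copyto!(dst6, src6)
@show dst6 == src6   # true: both forms produce the same result
```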
@ChrisRackauckas-Claude
Copy link
Copy Markdown
Contributor Author

Updated findings with Profile.Allocs

After profiling with Profile.Allocs (sample_rate=1.0, 100 loss calls), the remaining 3,776 bytes per loss call break down as:

| Bytes | Type | Source |
| ----- | ---- | ------ |
| 2,704 | NonlinearFunction{...} | Boxing when accessing nlprob.f via getproperty in problem_interface.jl |
| 816 | Loss closure | Boxing the closure when extracted from the NonlinearProblem |
| 104 | Memory{Vector{Float64}} | `get_tmp.` broadcast in recursive_unflatten! |
| 32 | Vector{Vector{Float64}} | Outer vector from the broadcast result |

Total: 3,776 bytes across 4 allocations per call.

Key finding: Φ! inner loop is allocation-free on Julia 1.12

The SubArray views (K[:, r], K[:, 1:(r-1)]) in the Φ! inner loop are stack-allocated by Julia 1.12's escape analysis — they do NOT contribute to heap allocations. I verified this by inlining the matmul loops (eliminating all views) and seeing zero change in allocation count.

What would fix the remaining 3,520 bytes (93%)

The NonlinearFunction and closure boxing (3,520 of 3,776 bytes) is a SciMLBase type instability — the NonlinearProblem struct's getproperty can't infer the concrete function type. This is the same class of issue PR #455 started addressing. A function barrier at the nlprob.f(resid, u, p) call site in NonlinearSolve would eliminate it.
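
For readers unfamiliar with the pattern, a generic sketch of a function barrier follows (toy types, not SciMLBase's API): the field access may be abstractly typed, but once the value crosses into a method that dispatches on it, the compiler sees a concrete type and can fully infer the call.

```julia
# Toy stand-in for a problem struct whose function field the compiler
# cannot concretely infer at the call site (Function is abstract).
struct ToyProb
    f::Function
end

# Function barrier: inside call_f, F is a concrete type and the call
# is fully inferred and inlineable.
call_f(f::F, resid, u, p) where {F} = f(resid, u, p)

function newton_step!(prob, resid, u, p)
    f = prob.f                 # abstractly typed field access
    call_f(f, resid, u, p)     # barrier: dynamic dispatch happens once, here
    return resid
end

prob = ToyProb((r, u, p) -> (r .= u .- p; nothing))
r = zeros(3)
newton_step!(prob, r, [1.0, 2.0, 3.0], 1.0)
println(r)   # [0.0, 1.0, 2.0]
```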

Added: copyto! fix for EvalSol updates

Also pushed a second commit replacing EvalSol.u[1:end] .= x and EvalSol.cache.k_discrete[1:end] .= x with copyto! across MIRK/FIRK/MIRKN — avoids unnecessary SubArray creation from the [1:end] slice.

@ChrisRackauckas-Claude
Copy link
Copy Markdown
Contributor Author

Correction: @allocated was producing garbage — actual per-step allocations are 144 bytes

The earlier analysis claiming 3,520 bytes from NonlinearFunction boxing was wrong. @allocated wraps expressions in a closure, and the closure capture overhead dominated the measurement for large structs. Using @timed in a compiled function loop gives the real numbers:

```
Via nlprob.f(resid, u, p): 144 bytes/call
Via nf(resid, u, p):       144 bytes/call
Via raw_f(resid, u, p):    144 bytes/call
```

All three are identical: the NonlinearFunction access and call are fully inlined by the compiler. The 144 bytes come solely from get_tmp.(y, (x,)) in recursive_unflatten!, which broadcasts over a Vector{DiffCache} to create a small Vector{Vector{T}} of 11 pointers.
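
A hypothetical version of the `@timed`-based measurement described above: run the target in a compiled loop and average the byte count, which avoids the closure-capture overhead that made `@allocated` misleading for large captured structs. The helper name and workload are illustrative.

```julia
# Average heap bytes per call of f, measured over n compiled iterations.
function bytes_per_call(f, args...; n = 100_000)
    f(args...)                        # warm up: compile before measuring
    stats = @timed for _ in 1:n
        f(args...)
    end
    return stats.bytes / n            # @timed reports total bytes allocated
end

g(v) = sum(v)                         # allocation-free example workload
@show bytes_per_call(g, rand(100))
```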

Summary of actual allocation sources

| Phase | Bytes | Per-step? |
| ----- | ----- | --------- |
| __init (cache creation) | ~32 KiB | No — one-time |
| __construct_problem (Jacobian setup) | ~52 KiB | No — once per adaptive iteration |
| recursive_unflatten! broadcast | 144 | Yes — per Newton step |
| Φ! inner loop views | 0 | Stack-allocated on Julia 1.12 |
| NonlinearFunction access | 0 | Fully inlined |

With ~6 Newton iterations, per-step allocations total ~864 bytes in a 115 KiB solve. Stepping is essentially allocation-free.

What this PR fixed (still valid)

The three main fixes in this PR reduced per-step allocations from ~13,552 to 144 bytes:

  1. get_tmp logging wrapper removal: ~9,000 bytes/call saved
  2. Array comprehension elimination: ~2,500 bytes/call saved
  3. similar(fᵢ_cache) → fᵢ_cache reuse + copyto! fixes: ~900 bytes/call saved

… test

Add y_cache field to MIRKCache that pre-allocates the Vector{Vector{T}}
needed by recursive_unflatten! for the primal (Float64) path. This
eliminates the last 144 bytes/call from get_tmp. broadcast allocation.

For the Dual path (ForwardDiff Jacobian computation), falls back to the
existing broadcast since element types differ.

Loss function is now fully allocation-free at runtime (0 bytes/call
verified via @timed over 100k calls).

Add zero-allocation QA test in the qa test group.

Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
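
A minimal sketch of the pattern this commit describes (the struct and function names below are hypothetical, not the package's actual types): preallocate the outer Vector{Vector{T}} once and fill its inner vectors in place on the primal path, falling back to an allocating comprehension when element types differ.

```julia
# Hypothetical analogue of the y_cache field added to MIRKCache.
struct MiniCache{T}
    y_cache::Vector{Vector{T}}   # preallocated once, one inner vector per mesh point
end

# Primal (matching element type) path: fill the preallocated inner vectors
# in place, so the hot loop allocates nothing after warmup.
function unflatten!(c::MiniCache{T}, x::AbstractVector{T}) where {T}
    n = length(first(c.y_cache))
    for (i, y) in enumerate(c.y_cache)
        copyto!(y, 1, x, (i - 1) * n + 1, n)
    end
    return c.y_cache
end

# Fallback for differing element types (e.g. ForwardDiff Duals): allocate
# fresh vectors, as the existing broadcast path does.
unflatten(x::AbstractVector, n::Int) = [x[(i - 1) * n + 1:i * n] for i in 1:(length(x) ÷ n)]

c = MiniCache([zeros(2) for _ in 1:3])
unflatten!(c, collect(1.0:6.0))
println(c.y_cache)   # [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
```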