Reduce per-step allocations in MIRK loss functions #453

ChrisRackauckas-Claude wants to merge 6 commits into SciML:master
Three key changes to reduce allocations in the hot path (loss function
and Jacobian evaluation called every Newton iteration):
1. Remove Logging.with_logger(NullLogger()) wrapper from get_tmp
- The wrapper allocated a closure + task-local context on EVERY call
- get_tmp is called 50+ times per loss evaluation
- The warning it suppressed only occurs during adaptive cache resize
- This single change reduced total solve allocations by ~60%
2. Eliminate array comprehensions in MIRK loss functions
- Replaced `[get_tmp(r, u) for r in residual]` with direct DiffCache
passing to Φ! and recursive_flatten!
- Added _maybe_get_tmp helper for Φ! to handle both DiffCache and
plain array residuals
- Added recursive_flatten!/recursive_flatten_twopoint! overloads for
AbstractVector{<:DiffCache}
3. Fix NoDiffCacheNeeded Φ! to reuse fᵢ_cache instead of similar()
- `similar(fᵢ_cache)` allocated a new vector every call
- fᵢ_cache is a scratch buffer that can be used directly
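The second change can be sketched as follows. This is a minimal, self-contained illustration of the pattern, not the package's actual code: `loss_kernel!` stands in for `Φ!`, and `_maybe_get_tmp` is reconstructed from its description above.

```julia
using PreallocationTools

# Stand-in for the Φ! kernel: fills a residual buffer in place.
loss_kernel!(res, u) = (@. res = u^2 - 1; nothing)

# Helper as described above: unwrap a DiffCache, pass plain arrays through.
_maybe_get_tmp(dc::DiffCache, u) = get_tmp(dc, u)
_maybe_get_tmp(x::AbstractArray, _) = x

# Allocating pattern: the comprehension builds a fresh Vector on every call.
function loss_comprehension!(residual, u)
    tmps = [get_tmp(r, u) for r in residual]
    foreach(t -> loss_kernel!(t, u), tmps)
    return nothing
end

# Allocation-free pattern: unwrap inside the loop, no intermediate array.
function loss_direct!(residual, u)
    for r in residual
        loss_kernel!(_maybe_get_tmp(r, u), u)
    end
    return nothing
end

u = rand(4)
residual = [DiffCache(zeros(4)) for _ in 1:3]
measure(f!, res, u) = @allocated f!(res, u)
measure(loss_direct!, residual, u)  # warm up, then re-measure: expect 0 bytes
```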
Results (dt=0.1, N=2, MIRK4, non-adaptive):
- Total solve: 3,290 → 1,180 allocations (64% reduction)
- Loss per call: 13,552 → 4,080 bytes (70% reduction)
- Jacobian per call: 20,448 → 8,784 bytes (57% reduction)
Adds allocation regression tests to verify loss function allocations
stay bounded.
Reference: https://discourse.julialang.org/t/boundaryvaluediffeq-jl-reducing-allocations/136255
Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…IRKN

The `EvalSol.u[1:end] .= x` and `EvalSol.cache.k_discrete[1:end] .=` patterns create unnecessary SubArray views under `@views`. Using `copyto!` avoids the view allocation entirely. Applied to MIRK, FIRK, and MIRKN loss functions.

Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
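A minimal sketch of the distinction with toy buffers (whether the view version actually allocates is version-dependent; the `copyto!` form sidesteps the question entirely):

```julia
dest = [zeros(3) for _ in 1:8]
src  = [rand(3)  for _ in 1:8]

# Range assignment builds a SubArray wrapper for dest[1:end] before writing.
step_view!(d, s) = (@views d[1:end] .= s; nothing)

# copyto! copies the elements with no intermediate view object.
step_copyto!(d, s) = (copyto!(d, s); nothing)

measure(f!, d, s) = @allocated f!(d, s)
measure(step_copyto!, dest, src)  # warm up; re-measuring gives 0 bytes
```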
Updated findings with Profile.Allocs

After profiling with `Profile.Allocs`:

Key finding: the `Φ!` inner loop is allocation-free on Julia 1.12. The SubArray views (…)

What would fix the remaining 3,520 bytes (93%): the (…)

Added:
Correction:
| Phase | Bytes | Per-step? |
|---|---|---|
| `__init` (cache creation) | ~32 KiB | No (one-time) |
| `__construct_problem` (Jacobian setup) | ~52 KiB | No (once per adaptive iteration) |
| `recursive_unflatten!` broadcast | 144 | Yes (per Newton step) |
| `Φ!` inner loop views | 0 | Stack-allocated on Julia 1.12 |
| `NonlinearFunction` access | 0 | Fully inlined |
With ~6 Newton iterations, per-step allocations total ~864 bytes in a 115 KiB solve. Stepping is essentially allocation-free.
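The attribution in the table comes from allocation profiling. A minimal `Profile.Allocs` workflow looks like this (`work!` is a toy stand-in with one deliberate allocation per call):

```julia
using Profile

# Toy workload: one deliberate allocation per call (the empty Float64[]).
work!(y, x) = (y .= x .+ sum(abs2, x); push!(Float64[], 1.0); nothing)

x = rand(100); y = similar(x)
work!(y, x)  # compile first so compilation allocations are not recorded

Profile.Allocs.clear()
Profile.Allocs.@profile sample_rate=1 work!(y, x)
results = Profile.Allocs.fetch()
# Each record carries the allocated type, size, and a stacktrace,
# which is what attributes bytes to phases like __init or the loss call.
println(length(results.allocs), " allocations recorded")
```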
What this PR fixed (still valid)

The three main fixes in this PR reduced per-step allocations from ~13,552 to 144 bytes:

- `get_tmp` logging wrapper removal: ~9,000 bytes/call saved
- Array comprehension elimination: ~2,500 bytes/call saved
- `similar(fᵢ_cache)` → `fᵢ_cache` reuse + `copyto!` fixes: ~900 bytes/call saved
… test
Add a y_cache field to MIRKCache that pre-allocates the Vector{Vector{T}}
needed by recursive_unflatten! for the primal (Float64) path. This
eliminates the last 144 bytes/call from the recursive_unflatten!
broadcast allocation.
For the Dual path (ForwardDiff Jacobian computation), falls back to the
existing broadcast since element types differ.
Loss function is now fully allocation-free at runtime (0 bytes/call
verified via @timed over 100k calls).
Add zero-allocation QA test in the qa test group.
Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
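The idea behind this commit can be sketched like this; the struct and function are illustrative stand-ins, not the actual `MIRKCache` internals:

```julia
# Pre-allocated primal buffers; reused when the element type matches.
struct YCache{T}
    y_cache::Vector{Vector{T}}
end

# Primal (e.g. Float64) path: unflatten into the pre-allocated buffers.
function unflatten!(c::YCache{T}, flat::AbstractVector{T}, n::Int) where {T}
    for (i, y) in enumerate(c.y_cache)
        copyto!(y, 1, flat, (i - 1) * n + 1, n)
    end
    return c.y_cache
end

# Dual path (ForwardDiff Jacobians): element types differ, so fall back
# to allocating fresh vectors of the Dual element type.
unflatten!(::YCache, flat::AbstractVector, n::Int) =
    [flat[(i - 1) * n + 1:i * n] for i in 1:(length(flat) ÷ n)]
```

On the matching-eltype path this reuses the buffers outright, while the generic fallback preserves correctness for differing element types at the cost of the original allocation.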
Summary
Investigates and fixes allocation issues in MIRK solvers, reducing per-step allocations by 60-70%.
Reference: https://discourse.julialang.org/t/boundaryvaluediffeq-jl-reducing-allocations/136255
Investigation Findings
Q: Should steps allocate? Only adaptivity should.
Currently, the loss function (called every Newton iteration) allocates due to:
- `Logging.with_logger(NullLogger())` wrapping every `get_tmp` call, which creates a closure plus a task-local context on each of the 50+ calls per loss evaluation
- `[get_tmp(r, u) for r in residual]` array comprehensions in the loss functions
- `similar(fᵢ_cache)` in the `NoDiffCacheNeeded` `Φ!`, which allocates a new vector every call

Q: Is allocation-free stepping tested?
No — added allocation regression tests.
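A regression test of that shape (the `loss!` here is a toy in-place function; the PR's actual tests target the MIRK loss functions):

```julia
using Test

# Toy in-place loss; the real tests call the solver's loss functions.
loss!(res, u) = (@. res = u - 1; nothing)

res = zeros(8); u = rand(8)
measure(res, u) = @allocated loss!(res, u)
measure(res, u)  # warm up so compilation is excluded

@testset "loss function stays allocation-free" begin
    @test measure(res, u) == 0
end
```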
Changes
1. Remove `Logging.with_logger(NullLogger())` from `get_tmp` (biggest win)
   - The wrapper allocated a closure and task-local context on every `get_tmp` call
   - Now forwards directly to `PreallocationTools.get_tmp` calls
2. Eliminate array comprehensions in MIRK loss functions
   - Pass `DiffCache` residual vectors directly to `Φ!` instead of pre-unwrapping
   - Added a `_maybe_get_tmp` helper for `Φ!` to handle both DiffCache and plain arrays
   - Added `recursive_flatten!`/`recursive_flatten_twopoint!` overloads for `AbstractVector{<:DiffCache}`
3. Fix `NoDiffCacheNeeded` `Φ!` to reuse `fᵢ_cache`
   - `similar(fᵢ_cache)` allocated a new scratch vector every call
   - `fᵢ_cache` is already a scratch buffer that can be reused directly

Results (dt=0.1, N=2, MIRK4, non-adaptive)
Remaining allocations
The remaining ~4 KiB per loss call comes from:

- `recursive_unflatten!` broadcast for DiffCache (~192 bytes)

Future work (not in this PR)

- … `@views` and using manual indexing)
- … `recursive_unflatten!` broadcast result

Test plan
🤖 Generated with Claude Code