Skip to content

fix: update A_backup before LU to prevent stale QR fallback#901

Merged
ChrisRackauckas merged 2 commits intoSciML:mainfrom
ChrisRackauckas-Claude:fix-stale-a-backup-qr-fallback
Feb 27, 2026
Merged

fix: update A_backup before LU to prevent stale QR fallback#901
ChrisRackauckas merged 2 commits intoSciML:mainfrom
ChrisRackauckas-Claude:fix-stale-a-backup-qr-fallback

Conversation

@ChrisRackauckas-Claude
Copy link
Contributor

Summary

  • Fix stale A_backup causing QR fallback to use wrong matrix when LinearSolve cache is reused across multiple solves
  • A_backup was set once from prob.A at init time and never updated
  • When NonlinearSolve (or any caller) updates cache.A with new Jacobians via copyto!, the QR fallback would restore from the stale initial matrix instead of the current one

Root Cause

Introduced in v3.59.0 (commit b477d98, "Restore cache.A from prob.A before QR safety fallback"). The A_backup field in DefaultLinearSolverInit stored a reference to prob.A and was never updated when cache.A was modified externally.

When NonlinearSolve's NewtonRaphson solves a system with a singular Jacobian:

  1. Each Newton step updates cache.A with the current Jacobian via copyto!
  2. LU factorization fails (singular), triggering QR fallback
  3. QR fallback does copyto!(cache.A, A_backup) — restoring the initial Jacobian, not the current one
  4. QR solves with the wrong matrix, producing wrong Newton steps
  5. Newton iteration fails to converge

This caused SciMLBase's downstream initialization.jl tests to fail: NewtonRaphson computed u0[2] = 1.955 instead of 2.0.

Fix

Before each in-place LU factorization, save the current cache.A into A_backup:

if !(cache.A === cache.cacheval.A_backup)
    copyto!(cache.cacheval.A_backup, cache.A)
end

The === guard skips the copy when alias_A=true (same object), preserving existing alias detection logic.

Applied to all 5 LU variant blocks (LU, GenericLU, MKL, AppleAccelerate → shared block; RFLU; BLIS; CUDA; Metal).

Test plan

  • Added regression test: cache reuse with different singular matrix, verify QR fallback uses current matrix
  • Existing default_algs.jl tests pass
  • @test_broken for SparseMatrixCSC{Float64, Int32} now passes (updated to @test)
  • Verified NonlinearSolve's NewtonRaphson converges correctly on the SciMLBase initialization scenario
  • CI

🤖 Generated with Claude Code

ChrisRackauckas and others added 2 commits February 26, 2026 20:43
When the LinearSolve cache is reused across multiple solves (e.g. by
NonlinearSolve updating cache.A with new Jacobians each Newton step),
A_backup was never updated from its initial value (prob.A at init time).
If LU factorization failed and the QR fallback fired, it would restore
cache.A from the stale A_backup instead of the current matrix, causing
the QR solve to operate on the wrong matrix.

This led to NonlinearSolve's NewtonRaphson failing to converge on
problems with singular Jacobians, producing wrong results (e.g.
1.955 instead of 2.0 in SciMLBase initialization tests).

Fix: save cache.A into A_backup before each in-place LU factorization,
so the QR fallback always restores the correct (current) matrix.

Also updates @test_broken to @test for a SparseMatrixCSC{Float64,Int32}
case that now passes with the fix.

Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move the A_backup sync from the solve path into setproperty!(cache, :A, x).
When callers (e.g. NonlinearSolve) update cache.A in-place via copyto! and
then trigger setproperty! with `cache.A = cache.A`, the backup is now synced
at the interface boundary rather than scattered across every LU variant.

The previous approach (copyto! into A_backup before each LU) mutated prob.A
through the A_backup alias, corrupting user problem data and causing
basictests.jl:108 to fail (values off by 2x).

The guard `x === getfield(cache, :A)` ensures copyto! only fires on the
in-place re-trigger pattern, not when a new matrix object is assigned.

Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ChrisRackauckas-Claude
Copy link
Contributor Author

Revised approach: sync A_backup in setproperty! instead of the solve path

The previous approach (copyto!(A_backup, cache.A) before each LU variant) had two bugs:

  1. Mutated prob.A: A_backup === prob.A (by design), so copyto!(A_backup, cache.A) overwrites the user's problem data through the alias. This caused basictests.jl:108 to fail — values off by exactly 2x because test_interface does cache.A = deepcopy(A2) (where A2 = I/2), triggering the copyto! into prob.A.

  2. MethodError on FunctionOperator: Not all A types support copyto! (e.g. FunctionOperator), causing basictests.jl:719 to error.

The fix

Move the sync into Base.setproperty!(cache::LinearCache, :A, x). NonlinearSolve already uses the correct pattern:

copyto!(lincache.A, new_A)
lincache.A = lincache.A  # important!! triggers special code in setproperty!

The new guard x === getfield(cache, :A) && !(x === A_backup) ensures copyto! only fires on this in-place re-trigger pattern (same object re-assigned after mutation), not when a new matrix object is assigned. This is one fix at the interface boundary instead of five copies scattered across every LU variant.

What passes

  • All LinearSolve Core tests pass locally (Julia 1.12)
  • The NonlinearSolve initialization scenario (NewtonRaphson with singular Jacobian, u0 ≈ [2.0, 1.0]) passes locally
  • The @test_broken for SparseMatrixCSC{Float64, Int32} is reverted back since that's still broken (Sparspak Int32 issue, not related)

What this does NOT fix

Issue #887's reproducer (n=20 singular matrix where LUFactorization via LAPACK gives garbage but reports Success) is a pre-existing bug on main — LU doesn't detect the near-singularity, so the QR fallback never triggers. That's a separate issue about singularity detection thresholds, not about stale backups.

@ChrisRackauckas ChrisRackauckas merged commit 21f1f3d into SciML:main Feb 27, 2026
311 of 321 checks passed
ChrisRackauckas-Claude pushed a commit to ChrisRackauckas-Claude/LinearSolve.jl that referenced this pull request Feb 28, 2026
When an ODE system is resized (e.g., via resize! callbacks), A_backup in
DefaultLinearSolverInit retains its original dimensions while cache.A gets
resized externally. The setproperty! method then crashes with BoundsError
when trying to copyto! between mismatched sizes.

Two fixes:
1. setproperty!: check size(A_backup) == size(x) before copyto!, and
   replace A_backup with copy(x) when sizes differ.
2. New Base.resize!(cache::LinearCache, i): proactively resize A_backup
   when integrators call resize!, and mark cache as stale.

Fixes regression from SciML#901 that broke OrdinaryDiffEq resize tests.

Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants