Default OpenBLAS thread settings cause severe performance regressions for decomposition-heavy MNE operations on multi-core Linux machines. Shrunk covariance estimation is 13.3x slower, Maxwell filtering (tSSS) is 6.5x slower, and Maxwell SSS is 9.2x slower at the default thread count compared to the optimal setting.
This affects any Linux user with OpenBLAS (the default NumPy BLAS on most systems) and ≥8 cores.
- Catastrophic at default threads: Maxwell filter and shrunk covariance, which are dominated by SVD/QR/eigh on tall-skinny or small matrices where thread-synchronization overhead exceeds the compute benefit.
- BLAS-insensitive: bandpass/notch filter, TFR, PSD, and source morph, which are FFT-dominated or sparse and completely unaffected.
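The split is easy to demonstrate directly (an illustrative check, not from the original report): NumPy's FFT uses pocketfft, which is not a BLAS routine, so BLAS thread limits should leave its timing flat, while SVD timing shifts with the limit.

```python
# Illustrative: FFT-style work ignores BLAS thread caps, SVD does not.
# Shapes here are arbitrary small examples, not the report's benchmarks.
import time
import numpy as np
from threadpoolctl import threadpool_limits

rng = np.random.default_rng(0)
sig = rng.standard_normal(2**20)        # FFT-style workload
mat = rng.standard_normal((2000, 306))  # small decomposition workload

def best_of(fn, repeats=3):
    # Best wall-clock time over a few runs to reduce noise.
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - t0)
    return best

for n in (1, 8):
    # Cap only BLAS pools; rfft timing should barely move across n.
    with threadpool_limits(limits=n, user_api="blas"):
        t_fft = best_of(lambda: np.fft.rfft(sig))
        t_svd = best_of(lambda: np.linalg.svd(mat, full_matrices=False))
        print(f"BLAS={n}: rfft {t_fft * 1e3:6.1f} ms, SVD {t_svd * 1e3:6.1f} ms")
```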
Root cause
Per-operation benchmarks confirm the issue is specific to decomposition routines on medium-sized matrices:
| Operation                    | BLAS=1  | BLAS=4  | BLAS=8 | BLAS=16  |
|------------------------------|---------|---------|--------|----------|
| SVD (10000×306)              | 230 ms  | 120 ms  | 99 ms  | 111 ms ↑ |
| QR+pivot (10000×306)         | 200 ms  | 86 ms   | 71 ms  | 76 ms ↑  |
| matmul (306×306)@(306×10000) | 39 ms   | 13 ms   | 8 ms   | 7 ms     |
| matmul (5000×5000)           | 4848 ms | 1229 ms | 645 ms | 631 ms   |

(↑ = slower than at 8 threads)
SVD and QR on tall-skinny matrices (the kind used in Maxwell tSSS, shrunk covariance, and ICA) peak at 4-8 threads and regress at 16 due to OpenBLAS thread-synchronization overhead. Large square matmul keeps improving out to 16 threads, but the decomposition-heavy operations don't benefit from the extra threads.
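A minimal sketch of such a per-operation measurement, using the tall-skinny shape from the table; the threadpoolctl sweep is my assumption about how the numbers were gathered, and plain (unpivoted) QR stands in for the report's "QR+pivot" row:

```python
# Sweep BLAS thread counts over decomposition routines on a 10000x306 matrix.
import time
import numpy as np
from threadpoolctl import threadpool_limits

rng = np.random.default_rng(42)
tall = rng.standard_normal((10000, 306))  # tall-skinny shape from the table

def timed(fn, repeats=3):
    # Best-of-N wall-clock time to reduce scheduling noise.
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - t0)
    return best

for n in (1, 4, 8, 16):
    with threadpool_limits(limits=n, user_api="blas"):
        t_svd = timed(lambda: np.linalg.svd(tall, full_matrices=False))
        # Plain QR; the report's "QR+pivot" row used column pivoting.
        t_qr = timed(lambda: np.linalg.qr(tall, mode="reduced"))
        print(f"BLAS={n:2d}: SVD {t_svd * 1e3:7.1f} ms, QR {t_qr * 1e3:7.1f} ms")
```

On a machine with fewer cores the 16-thread setting is simply clamped, so the script is safe to run anywhere.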
Immediate user workaround
```shell
export OPENBLAS_NUM_THREADS=4  # add to .bashrc
```
This single environment variable provides larger speedups than any code-level optimization for decomposition-heavy workflows.
Possible fixes
1. Documentation: add a note to the installation/performance docs recommending OPENBLAS_NUM_THREADS=4 for multi-core Linux systems.
2. Runtime detection: use threadpoolctl (already an MNE dependency) to detect and warn when OpenBLAS is running with too many threads.
3. Runtime control: wrap decomposition-heavy calls in threadpool_limits(limits=N) to cap BLAS threads, as proposed in "Improve control over number of threads used in an mne call" (#10522) but never implemented.
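The two threadpoolctl-based ideas (detecting oversubscription, and capping threads around a call) can be sketched together; the threshold, helper names, and warning text below are illustrative assumptions, not existing MNE API:

```python
# Sketch: detect an oversubscribed OpenBLAS pool, and cap threads per call.
import warnings
import numpy as np  # importing numpy loads its BLAS, so it shows up below
from threadpoolctl import threadpool_info, threadpool_limits

def warn_if_blas_oversubscribed(max_reasonable=8):
    # Inspect loaded thread pools and warn on high OpenBLAS thread counts.
    # The threshold of 8 is an illustrative choice, not a tuned value.
    for pool in threadpool_info():
        if pool.get("internal_api") == "openblas" and pool["num_threads"] > max_reasonable:
            warnings.warn(
                f"OpenBLAS is using {pool['num_threads']} threads; "
                "decomposition-heavy operations may be faster with "
                "OPENBLAS_NUM_THREADS=4."
            )

def run_with_blas_cap(fn, *args, n_threads=4, **kwargs):
    # Cap BLAS threads only for the duration of one call (hypothetical helper).
    with threadpool_limits(limits=n_threads, user_api="blas"):
        return fn(*args, **kwargs)

warn_if_blas_oversubscribed()
u, s, vt = run_with_blas_cap(np.linalg.svd, np.eye(4))
```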
Benchmark environment: AMD EPYC 7R13, 16 vCPU, OpenBLAS 0.3.31; all timings use 60 s of sample data.
Of these, capping threads with threadpool_limits is the most robust fix but also the most invasive; the documentation note is low-effort and immediately helpful.
Reproduction
Benchmark script (requires MNE sample data)
Compare:

```shell
OPENBLAS_NUM_THREADS=4 python bench.py    # optimal
OPENBLAS_NUM_THREADS=16 python bench.py   # default on a 16-core machine
```

Related
threadpoolctl to control BLAS threads via n_jobs (closed, never implemented)