@quake quake commented Mar 10, 2025

The prefetchnta instruction is better suited for our trace data access pattern because:

  • Trace data is accessed only once during asm execution
  • Using a non-temporal prefetch reduces cache pollution because it does not displace more frequently used data (e.g. instructions_cache)
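To illustrate the idea outside of the hand-written assembly, here is a minimal Rust sketch of a non-temporal prefetch on data that is read exactly once. It is not the code changed in this PR: the helper names `prefetch_nta` and `sum_trace_once` and the `PREFETCH_DISTANCE` value are hypothetical, and it assumes the standard `_mm_prefetch` intrinsic with `_MM_HINT_NTA` on x86_64 (on other targets the helper is a no-op).

```rust
/// How many elements ahead to prefetch. Hypothetical tuning value,
/// chosen only for illustration.
const PREFETCH_DISTANCE: usize = 16;

/// Issue a non-temporal prefetch hint for `data[index]`, if it exists.
#[cfg(target_arch = "x86_64")]
#[inline(always)]
fn prefetch_nta<T>(data: &[T], index: usize) {
    use std::arch::x86_64::{_mm_prefetch, _MM_HINT_NTA};
    if let Some(item) = data.get(index) {
        // SAFETY: the pointer comes from a valid reference, and a prefetch
        // is only a hint that does not read or write program-visible memory.
        unsafe { _mm_prefetch::<_MM_HINT_NTA>(item as *const T as *const i8) };
    }
}

/// Fallback for non-x86_64 targets: do nothing.
#[cfg(not(target_arch = "x86_64"))]
#[inline(always)]
fn prefetch_nta<T>(_data: &[T], _index: usize) {}

/// Walk the slice once, hinting that upcoming elements will not be reused.
fn sum_trace_once(trace: &[u64]) -> u64 {
    let mut total = 0u64;
    for (i, value) in trace.iter().enumerate() {
        prefetch_nta(trace, i + PREFETCH_DISTANCE);
        total = total.wrapping_add(*value);
    }
    total
}
```

The hint does not change results, only where the fetched lines are placed in the cache hierarchy, and the exact behavior depends on the CPU. That is why the effect has to be confirmed by benchmarks.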

Running the benchmark multiple times shows measurable improvements on two different x86 CPUs (low and medium spec):

interpret secp256k1_bench via assembly
                        time:   [4.6501 ms 4.6595 ms 4.6743 ms]
                        change: [-1.8326% -1.6108% -1.3108%] (p = 0.00 < 0.05)
                        Performance has improved.
interpret secp256k1_bench via assembly
                        time:   [3.4878 ms 3.4889 ms 3.4901 ms]
                        change: [-2.1179% -1.9138% -1.7849%] (p = 0.00 < 0.05)
                        Performance has improved.

eval-exec commented Mar 11, 2025

On my Intel i9-14900K
develop branch:

$ rm Cargo.lock; cargo bench
     Running benches/bits_benchmark.rs (target/release/deps/bits_benchmark-2198b6531a9750c2)
Gnuplot not found, using plotters backend
roundup via remainder   time:   [0.0000 ps 0.0000 ps 0.0000 ps]
Found 13 outliers among 100 measurements (13.00%)
  4 (4.00%) high mild
  9 (9.00%) high severe

roundup via bit ops     time:   [0.0000 ps 0.0000 ps 0.0000 ps]
Found 11 outliers among 100 measurements (11.00%)
  3 (3.00%) high mild
  8 (8.00%) high severe

roundup via multication time:   [0.0000 ps 0.0000 ps 0.0000 ps]
Found 12 outliers among 100 measurements (12.00%)
  4 (4.00%) high mild
  8 (8.00%) high severe

roundup via remainder #2
                        time:   [0.0000 ps 0.0000 ps 0.0000 ps]
Found 13 outliers among 100 measurements (13.00%)
  4 (4.00%) high mild
  9 (9.00%) high severe

roundup via bit ops #2  time:   [0.0000 ps 0.0000 ps 0.0000 ps]
Found 13 outliers among 100 measurements (13.00%)
  5 (5.00%) high mild
  8 (8.00%) high severe

roundup via multication #2
                        time:   [0.0000 ps 0.0000 ps 0.0000 ps]
Found 12 outliers among 100 measurements (12.00%)
  4 (4.00%) high mild
  8 (8.00%) high severe

     Running benches/vm_benchmark.rs (target/release/deps/vm_benchmark-fbea187d4c738a4c)
Gnuplot not found, using plotters backend
interpret secp256k1_bench
                        time:   [6.0670 ms 6.0788 ms 6.0925 ms]
Found 20 outliers among 100 measurements (20.00%)
  8 (8.00%) high mild
  12 (12.00%) high severe

This PR:

     Running benches/bits_benchmark.rs (target/release/deps/bits_benchmark-2198b6531a9750c2)
Gnuplot not found, using plotters backend
roundup via remainder   time:   [0.0000 ps 0.0000 ps 0.0000 ps]
                        change: [-47.895% -3.1511% +79.757%] (p = 0.92 > 0.05)
                        No change in performance detected.
Found 13 outliers among 100 measurements (13.00%)
  5 (5.00%) high mild
  8 (8.00%) high severe

roundup via bit ops     time:   [0.0000 ps 0.0000 ps 0.0000 ps]
                        change: [-77.445% -53.415% +14.119%] (p = 0.13 > 0.05)
                        No change in performance detected.
Found 12 outliers among 100 measurements (12.00%)
  4 (4.00%) high mild
  8 (8.00%) high severe

roundup via multication time:   [0.0000 ps 0.0000 ps 0.0000 ps]
                        change: [-46.332% +2.2388% +92.464%] (p = 0.95 > 0.05)
                        No change in performance detected.
Found 12 outliers among 100 measurements (12.00%)
  3 (3.00%) high mild
  9 (9.00%) high severe

roundup via remainder #2
                        time:   [0.0000 ps 0.0000 ps 0.0000 ps]
                        change: [-46.500% -0.8975% +87.882%] (p = 0.98 > 0.05)
                        No change in performance detected.
Found 13 outliers among 100 measurements (13.00%)
  5 (5.00%) high mild
  8 (8.00%) high severe

roundup via bit ops #2  time:   [0.0000 ps 0.0000 ps 0.0000 ps]
                        change: [-42.725% +14.514% +133.60%] (p = 0.75 > 0.05)
                        No change in performance detected.
Found 12 outliers among 100 measurements (12.00%)
  4 (4.00%) high mild
  8 (8.00%) high severe

roundup via multication #2
                        time:   [0.0000 ps 0.0000 ps 0.0000 ps]
                        change: [-48.268% +0.2155% +91.318%] (p = 0.99 > 0.05)
                        No change in performance detected.
Found 13 outliers among 100 measurements (13.00%)
  5 (5.00%) high mild
  8 (8.00%) high severe

     Running benches/vm_benchmark.rs (target/release/deps/vm_benchmark-fbea187d4c738a4c)
Gnuplot not found, using plotters backend
interpret secp256k1_bench
                        time:   [6.0170 ms 6.0249 ms 6.0342 ms]
                        change: [-1.1492% -0.8870% -0.6528%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 14 outliers among 100 measurements (14.00%)
  1 (1.00%) high mild
  13 (13.00%) high severe

@eval-exec

Executing rm Cargo.lock; cargo bench "interpret secp256k1_bench via assembly" --features asm

On develop

     Running benches/bits_benchmark.rs (target/release/deps/bits_benchmark-d81b136bca03814f)
Gnuplot not found, using plotters backend
     Running benches/vm_benchmark.rs (target/release/deps/vm_benchmark-64854b411dd08e91)
Gnuplot not found, using plotters backend
Benchmarking interpret secp256k1_bench via assembly: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.2s, enable flat sampling, or reduce sample count to 50.
interpret secp256k1_bench via assembly
                        time:   [1.6080 ms 1.6105 ms 1.6133 ms]
                        change: [-0.1044% +0.1182% +0.3405%] (p = 0.29 > 0.05)
                        No change in performance detected.
Found 11 outliers among 100 measurements (11.00%)
  5 (5.00%) high mild
  6 (6.00%) high severe

Benchmarking interpret secp256k1_bench via assembly mop: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.1s, enable flat sampling, or reduce sample count to 50.
interpret secp256k1_bench via assembly mop
                        time:   [1.5962 ms 1.5991 ms 1.6025 ms]
                        change: [+0.0404% +0.2952% +0.5713%] (p = 0.05 > 0.05)
                        No change in performance detected.
Found 16 outliers among 100 measurements (16.00%)
  3 (3.00%) high mild
  13 (13.00%) high severe

Benchmarking interpret secp256k1_bench via assembly mop (memoized decoder): Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.8s, enable flat sampling, or reduce sample count to 60.
Benchmarking interpret secp256k1_bench via assembly mop (memoized decoder): Collecting 100 samples in estimated 6.8175 s (
interpret secp256k1_bench via assembly mop (memoized decoder)
                        time:   [1.3446 ms 1.3472 ms 1.3502 ms]
                        change: [-0.8764% -0.2432% +0.3953%] (p = 0.49 > 0.05)
                        No change in performance detected.
Found 11 outliers among 100 measurements (11.00%)
  6 (6.00%) high mild
  5 (5.00%) high severe

Benchmarking interpret secp256k1_bench via assembly mop (memoized dynamic length decoder): Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.5s, enable flat sampling, or reduce sample count to 60.
Benchmarking interpret secp256k1_bench via assembly mop (memoized dynamic length decoder): Collecting 100 samples in estim
interpret secp256k1_bench via assembly mop (memoized dynamic length decoder)
                        time:   [1.0916 ms 1.0938 ms 1.0966 ms]
Found 12 outliers among 100 measurements (12.00%)
  7 (7.00%) high mild
  5 (5.00%) high severe

This PR:

     Running benches/bits_benchmark.rs (target/release/deps/bits_benchmark-d81b136bca03814f)
Gnuplot not found, using plotters backend
     Running benches/vm_benchmark.rs (target/release/deps/vm_benchmark-64854b411dd08e91)
Gnuplot not found, using plotters backend
Benchmarking interpret secp256k1_bench via assembly: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.0s, enable flat sampling, or reduce sample count to 50.
interpret secp256k1_bench via assembly
                        time:   [1.5726 ms 1.5750 ms 1.5777 ms]
                        change: [-2.4700% -2.2590% -2.0520%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  8 (8.00%) high severe

Benchmarking interpret secp256k1_bench via assembly mop: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.1s, enable flat sampling, or reduce sample count to 50.
interpret secp256k1_bench via assembly mop
                        time:   [1.5896 ms 1.5922 ms 1.5951 ms]
                        change: [-0.9286% -0.6165% -0.3233%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 16 outliers among 100 measurements (16.00%)
  7 (7.00%) high mild
  9 (9.00%) high severe

Benchmarking interpret secp256k1_bench via assembly mop (memoized decoder): Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.8s, enable flat sampling, or reduce sample count to 60.
Benchmarking interpret secp256k1_bench via assembly mop (memoized decoder): Collecting 100 samples in estimated 6.7910 s (
interpret secp256k1_bench via assembly mop (memoized decoder)
                        time:   [1.3417 ms 1.3441 ms 1.3469 ms]
                        change: [-0.8756% -0.2123% +0.4776%] (p = 0.55 > 0.05)
                        No change in performance detected.
Found 10 outliers among 100 measurements (10.00%)
  6 (6.00%) high mild
  4 (4.00%) high severe

Benchmarking interpret secp256k1_bench via assembly mop (memoized dynamic length decoder): Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.6s, enable flat sampling, or reduce sample count to 60.
Benchmarking interpret secp256k1_bench via assembly mop (memoized dynamic length decoder): Collecting 100 samples in estim
interpret secp256k1_bench via assembly mop (memoized dynamic length decoder)
                        time:   [1.1151 ms 1.1174 ms 1.1202 ms]
                        change: [+1.2550% +2.1229% +3.0103%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
  7 (7.00%) high mild
  5 (5.00%) high severe
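As a rough cross-check of the reported percentages, the relative change can be recomputed from the point estimates above. This is only a sketch: the helper name `relative_change_percent` is hypothetical, and Criterion derives its own figure from the full sample distribution, so small differences from its output are expected.

```rust
/// Relative change of `new` against `old`, in percent.
/// Negative values mean the new measurement is faster.
fn relative_change_percent(old: f64, new: f64) -> f64 {
    (new - old) / old * 100.0
}

/// Point estimates in milliseconds, copied from the logs above:
/// (benchmark name, develop, this PR).
fn point_estimates() -> [(&'static str, f64, f64); 4] {
    [
        ("via assembly", 1.6105, 1.5750),
        ("via assembly mop", 1.5991, 1.5922),
        ("via assembly mop (memoized decoder)", 1.3472, 1.3441),
        ("via assembly mop (memoized dynamic length decoder)", 1.0938, 1.1174),
    ]
}

/// Format one line per benchmark with its recomputed relative change.
fn summary_lines() -> Vec<String> {
    point_estimates()
        .iter()
        .map(|(name, old, new)| {
            format!("{name}: {:+.2}%", relative_change_percent(*old, *new))
        })
        .collect()
}
```

Under these numbers the plain assembly benchmark improves by roughly 2.2%, while the memoized dynamic length decoder variant is roughly 2.2% slower, which matches the direction of the Criterion output.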

eval-exec commented Mar 11, 2025

I created a bash script that runs cargo bench "interpret secp256k1_bench via assembly" --features asm 21 times on each branch:

#!/usr/bin/env bash
# Alternate between the two branches and benchmark each one, 21 rounds in total.
set -e

for i in {0..20}; do
    echo "git checkout to develop"
    git checkout develop
    cargo bench "interpret secp256k1_bench via assembly" --features asm
    echo "git checkout to quake/prefetchnta"
    git checkout quake/prefetchnta
    cargo bench "interpret secp256k1_bench via assembly" --features asm
done

The bench result log file: bench.log
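To compare the repeated runs, the point estimates can be pulled out of the Criterion output and averaged. The following is a minimal sketch, assuming the log contains lines in the same `time:   [low mid high]` format shown in the earlier comments. The helper names `point_estimate_ms` and `mean` are hypothetical, and the sketch does not reproduce the actual contents of bench.log.

```rust
/// Extract the middle (point-estimate) value, converted to milliseconds,
/// from a Criterion line such as `time:   [1.5726 ms 1.5750 ms 1.5777 ms]`.
/// Returns `None` for any line that does not match that shape.
fn point_estimate_ms(line: &str) -> Option<f64> {
    let rest = line.trim().strip_prefix("time:")?;
    let inner = rest.trim().strip_prefix('[')?.strip_suffix(']')?;
    // Expect exactly three "<value> <unit>" pairs.
    let tokens: Vec<&str> = inner.split_whitespace().collect();
    if tokens.len() != 6 {
        return None;
    }
    let value: f64 = tokens[2].parse().ok()?;
    let scale = match tokens[3] {
        "ps" => 1e-9,
        "ns" => 1e-6,
        "µs" | "us" => 1e-3,
        "ms" => 1.0,
        "s" => 1e3,
        _ => return None,
    };
    Some(value * scale)
}

/// Arithmetic mean of the collected estimates, or `None` if there are none.
fn mean(values: &[f64]) -> Option<f64> {
    if values.is_empty() {
        None
    } else {
        Some(values.iter().sum::<f64>() / values.len() as f64)
    }
}
```

Because the script alternates branches, the estimates for each benchmark name would need to be split into develop and quake/prefetchnta groups (for example by position in the log) before averaging.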
