small prime FFT based on ulong #2107

vneiger · 2024-11-09T10:24:02Z

This PR aims to provide a ulong-based version of the small-prime FFT. This is a draft; comments and suggestions are highly welcome on any aspect (for example, I have no idea whether n_fft is a good name).

For the moment, the features implemented are:

- forward FFT, inverse FFT, transposed forward FFT, transposed inverse FFT
- restriction on the modulus: it must have at most 62 bits (for performance reasons)
- the length must be a power of 2 (other lengths need zero padding, which gives non-smooth timings between powers of 2)
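To make the feature list concrete, here is a minimal sketch of a forward/inverse transform pair of power-of-2 length over a word-size prime. It is not the code of this PR: the names `toy_dft` and `toy_idft` are hypothetical, the bit-reversed output order is an assumption of this sketch, and the roots are recomputed on the fly rather than precomputed. It relies on `unsigned __int128`, a GCC/Clang extension.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* (a * b) mod p through a 128-bit product (GCC/Clang extension) */
static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t p)
{
    return (uint64_t)(((unsigned __int128)a * b) % p);
}

/* a^e mod p by binary exponentiation */
static uint64_t powmod(uint64_t a, uint64_t e, uint64_t p)
{
    uint64_t r = 1 % p;
    for (a %= p; e != 0; e >>= 1)
    {
        if (e & 1) r = mulmod(r, a, p);
        a = mulmod(a, a, p);
    }
    return r;
}

/* Hypothetical forward transform, in place. n is a power of 2, w is a
   primitive n-th root of unity mod p, entries are in [0, p).
   Decimation in frequency: natural-order input, bit-reversed output. */
static void toy_dft(uint64_t *a, size_t n, uint64_t w, uint64_t p)
{
    assert(n != 0 && (n & (n - 1)) == 0);  /* length must be a power of 2 */
    assert((p >> 62) == 0);                /* modulus of at most 62 bits */
    for (size_t len = n; len >= 2; len >>= 1)
    {
        size_t half = len / 2;
        for (size_t start = 0; start < n; start += len)
        {
            uint64_t t = 1;  /* t = w^j for the current layer's root w */
            for (size_t j = 0; j < half; j++)
            {
                uint64_t u = a[start + j];
                uint64_t v = a[start + j + half];
                uint64_t s = u + v;     if (s >= p) s -= p;
                uint64_t d = u + p - v; if (d >= p) d -= p;
                a[start + j] = s;
                a[start + j + half] = mulmod(d, t, p);
                t = mulmod(t, w, p);
            }
        }
        w = mulmod(w, w, p);  /* root for the next, half-length layer */
    }
}

/* Hypothetical inverse of toy_dft, in place: bit-reversed input,
   natural-order output. Uses Fermat inversion, so p must be prime. */
static void toy_idft(uint64_t *a, size_t n, uint64_t w, uint64_t p)
{
    uint64_t winv = powmod(w, p - 2, p);
    uint64_t ninv = powmod((uint64_t)n % p, p - 2, p);
    assert(n != 0 && (n & (n - 1)) == 0);
    for (size_t len = 2; len <= n; len <<= 1)
    {
        size_t half = len / 2;
        uint64_t wl = powmod(winv, n / len, p);  /* inverse root of this layer */
        for (size_t start = 0; start < n; start += len)
        {
            uint64_t t = 1;
            for (size_t j = 0; j < half; j++)
            {
                uint64_t u = a[start + j];
                uint64_t v = mulmod(a[start + j + half], t, p);
                uint64_t s = u + v;     if (s >= p) s -= p;
                uint64_t d = u + p - v; if (d >= p) d -= p;
                a[start + j] = s;
                a[start + j + half] = d;
                t = mulmod(t, wl, p);
            }
        }
    }
    for (size_t i = 0; i < n; i++)
        a[i] = mulmod(a[i], ninv, p);  /* each layer doubled the values */
}
```

For instance, with p = 17 and w = 4 (a primitive 4th root of unity mod 17), applying `toy_dft` and then `toy_idft` to a length-4 array returns the original entries.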

Performance: observed on a few different machines (AMD Zen 4 and various Intel processors). This slightly outperforms NTL's versions of the forward and inverse FFTs (speedup of 0% to 30% depending on the length). It is between 2 and 4 times slower, often around 3, than the vectorized floating-point-based small-prime FFT in fft_small (or than the similar AVX-based version in NTL). This version uses no SIMD: enabling or disabling automatic vectorization does not change performance, and a straightforward "manual" vectorization should not bring much, because every few operations there is a full 64-bit multiplication (umul_ppmm). (Still, I made some experiments that suggest AVX could help, maybe substantially on AMD processors, which have a very fast vpmullq, but I leave this aside for later.)
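To illustrate where the full 64-bit multiplication comes from, here is a sketch of modular multiplication by a fixed value with a precomputed quotient (often called Shoup's trick), a common building block in word-size FFT butterflies. Whether this PR's butterflies have exactly this shape is an assumption; the names `shoup_precomp` and `shoup_mulmod` are hypothetical, and the `unsigned __int128` product (a GCC/Clang extension) stands in for umul_ppmm.

```c
#include <assert.h>
#include <stdint.h>

/* Precomputation for a fixed multiplier w < p: floor(w * 2^64 / p). */
static uint64_t shoup_precomp(uint64_t w, uint64_t p)
{
    assert(w < p);
    return (uint64_t)(((unsigned __int128)w << 64) / p);
}

/* Returns a * w mod p in [0, p), given wpre = shoup_precomp(w, p).
   Valid for any 64-bit a, provided p < 2^63. */
static uint64_t shoup_mulmod(uint64_t a, uint64_t w, uint64_t wpre, uint64_t p)
{
    /* High word of a * wpre: this is the full 64x64 -> 128 bit product,
       which has no cheap counterpart in 64-bit SIMD lanes. */
    uint64_t q = (uint64_t)(((unsigned __int128)a * wpre) >> 64);
    /* Low words only; arithmetic wraps mod 2^64, the true value is in [0, 2p). */
    uint64_t r = a * w - q * p;
    return (r >= p) ? r - p : r;
}
```

Apart from the high product, the function uses two low 64-bit multiplications, one subtraction and one conditional correction, so the high product is the operation that resists straightforward vectorization.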

Planned:

- more thorough test files (for the transposed variants, which are only tested indirectly at the moment)
- cleaning things up here and there, adding documentation
- adding a mechanism to avoid overly memory-consuming precomputation when a root of unity of very large order is available (maybe, in a first version, simply forbid transforms of length more than 2**25 or so?)
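One possible way to test a transposed variant directly is the duality identity: if the forward transform is the matrix A, then for all vectors x and y the dot product of A x with y equals the dot product of x with A^T y. The sketch below builds naive quadratic-time references for both maps, against which a fast implementation could be compared. All names are hypothetical, and the matrix convention (entry (i, j) equal to w^(bitrev(i) * j), i.e. bit-reversed output order) is an assumption of this sketch, not something stated in this PR.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint64_t mulmod_ref(uint64_t a, uint64_t b, uint64_t p)
{
    return (uint64_t)(((unsigned __int128)a * b) % p);
}

static uint64_t powmod_ref(uint64_t a, uint64_t e, uint64_t p)
{
    uint64_t r = 1 % p;
    for (a %= p; e != 0; e >>= 1)
    {
        if (e & 1) r = mulmod_ref(r, a, p);
        a = mulmod_ref(a, a, p);
    }
    return r;
}

/* reverse the low `depth` bits of i */
static size_t bitrev(size_t i, unsigned depth)
{
    size_t r = 0;
    for (unsigned k = 0; k < depth; k++) { r = (r << 1) | (i & 1); i >>= 1; }
    return r;
}

/* entry (i, j) of the assumed transform matrix A */
static uint64_t entry(size_t i, size_t j, unsigned depth, uint64_t w, uint64_t p)
{
    return powmod_ref(w, (uint64_t)bitrev(i, depth) * j, p);
}

/* out = A * in, length n = 2^depth, modulus p < 2^63 */
static void naive_dft(uint64_t *out, const uint64_t *in, unsigned depth,
                      uint64_t w, uint64_t p)
{
    size_t n = (size_t)1 << depth;
    for (size_t i = 0; i < n; i++)
    {
        uint64_t acc = 0;
        for (size_t j = 0; j < n; j++)
            acc = (acc + mulmod_ref(in[j], entry(i, j, depth, w, p), p)) % p;
        out[i] = acc;
    }
}

/* out = A^T * in: same entries, roles of rows and columns swapped */
static void naive_dft_t(uint64_t *out, const uint64_t *in, unsigned depth,
                        uint64_t w, uint64_t p)
{
    size_t n = (size_t)1 << depth;
    for (size_t j = 0; j < n; j++)
    {
        uint64_t acc = 0;
        for (size_t i = 0; i < n; i++)
            acc = (acc + mulmod_ref(in[i], entry(i, j, depth, w, p), p)) % p;
        out[j] = acc;
    }
}

/* dot product mod p */
static uint64_t dot(const uint64_t *x, const uint64_t *y, size_t n, uint64_t p)
{
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc = (acc + mulmod_ref(x[i], y[i], p)) % p;
    return acc;
}
```

A direct test would replace one of the naive maps by the fast routine under test and check the identity on random x and y, for several depths and primes.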

Planned, but likely not within this PR:

- truncated FFT variants, for smooth performance when the length varies from one power of 2 to the next
- versions with strides, useful e.g. for polynomial matrices stored as a list of matrix coefficients (e.g., might help for the half-GCD algorithm)

vneiger added 30 commits September 16, 2024 11:24

add profile for powmod

0bf0127

Merge branch 'main' into introduce_nmod_fft

c363adb

add .h file

4a887b4

fix ifndef

aee38b3

context and init code

17faaea

add profile

e738256

fix include

dcaede7

improve profile init

cd50787

rename ctx init

afa5ddc

testing init

fd24de2

fix explanations and complete test for init

211ab75

remove printf

6368823

forgot to add main

9eeedd6

dft, test passes

3fa7944

add profile

ff33533

clean things a bit

f4520c9

introducing dft32 base case

e10c29c

dft32 base case

7b605a6

cleaning things

1f236d8

testing from length 1

9bf18c7

fix

fb88c54

remove useless function argument

f6cc96c

vaguely faster with added lazy14 layer

a675b68

clean explanations

28b3276

finalize lazy14 version

b71649d

small fixes

8cd392c

tentative fix for flint_bits == 32

9fa9020

dft8 is now a macro, code generation was too unpredictable

ccd3f71

putting more args slightly slows down for large lengths...

f0587e5

macro for dft16 helps, let's see for dft32

4cf7343

vneiger added 30 commits April 11, 2025 23:05

this todo is done (precomputation with inverse roots)

c5926e1

misc notes and comments

6a3dc5c

add dft_lazy44 to timed functions; add notes about attempts

ae76510

rename dft_node0->dft and dft->dft_node ; reorganize todos

422d02e

some enhancements in documentation

5f080b5

some more reorganization

f096ea4

review + doc + clean macros

a686fdd

clean dft.c

4d6659f

some investigations into nmod_poly using n_fft

cdf9bd9

Merge branch 'main' into introduce_nmod_fft

869e1d0

new base case and some cleaning

2aa0f72

reorganize and make sure return is [0..n)

d3d4e6e

remove unused include

a2b2a8f

idft32 and some more cleaning

4d481eb

documentation + clean test t-dft

6d893e6

clean test t-idft

2c66633

test dft_t goes fine

59f0e30

reduce parameters a bit to make tests fast enough

f610be4

test idft_t goes fine

3e651f5

clean profile p-dft

3b906a3

Merge branch 'main' into introduce_nmod_fft

3743e5b

checked that dft32 and idft32 do help (a tiny bit); clean doc in dft

5108018

remove useless temporary macro parameters

f1852d1

doc functions in idft.c

e45cd41

add timings for meteorlake

97aca9c

forgot to specify prime

21a853c

forgot to specify prime

e437cc1

Merge branch 'main' into introduce_nmod_fft

d36d74a

fix small typo and add explanations for depth/node functions

f196781

Merge branch 'main' into introduce_nmod_fft

894f9e7
