-
Notifications
You must be signed in to change notification settings - Fork 267
small prime FFT based on ulong #2107
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Draft
vneiger
wants to merge
109
commits into
flintlib:main
Choose a base branch
from
vneiger:introduce_nmod_fft
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Draft
Changes from 72 commits
Commits
Show all changes
109 commits
Select commit
Hold shift + click to select a range
0bf0127
add profile for powmod
vneiger c363adb
Merge branch 'main' into introduce_nmod_fft
vneiger 4a887b4
add .h file
vneiger aee38b3
fix ifndef
vneiger 17faaea
context and init code
vneiger e738256
add profile
vneiger dcaede7
fix include
vneiger cd50787
improve profile init
vneiger afa5ddc
rename ctx init
vneiger fd24de2
testing init
vneiger 211ab75
fix explanations and complete test for init
vneiger 6368823
remove printf
vneiger 9eeedd6
forgot to add main
vneiger 3fa7944
dft, test passes
vneiger ff33533
add profile
vneiger f4520c9
clean things a bit
vneiger e10c29c
introducing dft32 base case
vneiger 7b605a6
dft32 base case
vneiger 1f236d8
cleaning things
vneiger 9bf18c7
testing from length 1
vneiger fb88c54
fix
vneiger f6cc96c
remove useless function argument
vneiger a675b68
vaguely faster with added lazy14 layer
vneiger 28b3276
clean explanations
vneiger b71649d
finalize lazy14 version
vneiger 8cd392c
small fixes
vneiger 9fa9020
tentative fix for flint_bits == 32
vneiger ccd3f71
dft8 is now a macro, code generation was too unpredictable
vneiger f0587e5
putting more args slightly slows down for large lengths...
vneiger 4cf7343
macro for dft16 helps, let's see for dft32
vneiger b96e0a5
dft32 macroified does help a bit
vneiger 82a6c85
mod4 currently unused
vneiger 0cae527
some notes / todos
vneiger 4818198
test with prime close to announced limitw
vneiger c413b63
limit is 62 bits
vneiger 174bb2d
fix assert and try struct for args... for circumventing strange behav…
vneiger c7f6dff
fft args and introducing iw
vneiger 5b84260
dft 16 macro pointers
vneiger 1fb2479
dft 16 macro pointers
vneiger 4d8071f
dft8 simplify a bit
vneiger 6e751ea
clean
vneiger 192a575
dft16 simplify a bit
vneiger 375a84d
dft16 4 4 now ok as well
vneiger c9b6575
dft32 expanded as well; this is ugly but will help for versions with …
vneiger c24e7b6
starting precomp of tab_inverse(w)
vneiger e0ace33
add tab_iw in fitdepth
vneiger b376f7e
unroll a bit is faster
vneiger 0e0df84
notes about init
vneiger 40539de
wip: use multipoint eval in test
vneiger 8c8b08b
use multipoint eval in test
vneiger b1fb674
idft_t
vneiger dcf2ae7
idft_t, not tested yet
vneiger 9d8845d
minor changes
vneiger a872720
progress
vneiger c29a5d1
add files
vneiger 66523a7
idft test passes
vneiger 453c068
idft test passes
vneiger 82d5cbf
idft in progress
vneiger 07a7dbe
fix name
vneiger f0fed07
idft in progress
vneiger f79ba6f
idft in progress
vneiger c41bcd4
idft in progress
vneiger f09bca0
idft in progress
vneiger 38a5aba
idft in progress
vneiger ece93dc
minor fix
vneiger 9ef50f1
a bit lazier
vneiger 3b5669d
idft becoming good, remains some fine tuning to do
vneiger 614dacb
a bit lazier
vneiger 9567b15
fix name
vneiger 42f6fe6
Merge branch 'flintlib:main' into introduce_nmod_fft
vneiger c438c43
minor comment
vneiger eaf6d5c
Merge branch 'introduce_nmod_fft' of github.com:vneiger/flint into in…
vneiger 4bb083a
remove unnecessary includes
vneiger 99284b5
add todo about tests idft any node small depths
vneiger b67916a
Merge branch 'flintlib:main' into introduce_nmod_fft
vneiger 56dde27
Merge branch 'main' into introduce_nmod_fft
vneiger 5010d9c
small typo in bound
vneiger d7f0c02
Merge branch 'main' into introduce_nmod_fft
vneiger fb03745
Merge branch 'main' into introduce_nmod_fft
vneiger c5926e1
this todo is done (precomputationwith inverse roots)
vneiger 6a3dc5c
misc notes and comments
vneiger ae76510
add dft_lazy44 to timed functions; add notes about attempts
vneiger 422d02e
rename dft_node0->dft and dft->dft_node ; reorganize todos
vneiger 5f080b5
some enhancements in documentation
vneiger f096ea4
some more reorganization
vneiger a686fdd
review + doc + clean macros
vneiger 4d6659f
clean dft.c
vneiger cdf9bd9
some investigations into nmod_poly using n_fft
vneiger 869e1d0
Merge branch 'main' into introduce_nmod_fft
vneiger 2aa0f72
new base case and some cleaning
vneiger d3d4e6e
reorganize and make sure return is [0..n)
vneiger a2b2a8f
remove unused include
vneiger 4d481eb
idft32 and some more cleaning
vneiger 6d893e6
documentation + clean test t-dft
vneiger 2c66633
clean test t-idft
vneiger 59f0e30
test dft_t goes fine
vneiger f610be4
reduce parameters a bit to make tests fast enough
vneiger 3e651f5
test idft_t goes fine
vneiger 3b906a3
clean profile p-dft
vneiger 3743e5b
Merge branch 'main' into introduce_nmod_fft
vneiger 5108018
checked that dft32 and idft32 do help (a tiny bit); clean doc in dft
vneiger f1852d1
remove useless temporary macro parameters
vneiger e45cd41
doc functions in idft.c
vneiger 97aca9c
add timings for meteorlake
vneiger 21a853c
forgot to specify prime
vneiger e437cc1
forgot to specify prime
vneiger d36d74a
Merge branch 'main' into introduce_nmod_fft
vneiger f196781
fix small typo and add explanations for depth/node functions
vneiger 894f9e7
Merge branch 'main' into introduce_nmod_fft
vneiger File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,261 @@ | ||
/* | ||
Copyright (C) 2024 Vincent Neiger | ||
|
||
This file is part of FLINT. | ||
|
||
FLINT is free software: you can redistribute it and/or modify it under | ||
the terms of the GNU Lesser General Public License (LGPL) as published | ||
by the Free Software Foundation; either version 3 of the License, or | ||
(at your option) any later version. See <https://www.gnu.org/licenses/>. | ||
*/ | ||
|
||
#ifndef N_FFT_H | ||
#define N_FFT_H | ||
|
||
#include "flint.h" | ||
#include "nmod.h" | ||
#include "nmod_vec.h" | ||
#include "ulong_extras.h" | ||
|
||
#define N_FFT_CTX_DEFAULT_DEPTH 12 | ||
|
||
#ifdef __cplusplus | ||
extern "C" { | ||
#endif | ||
|
||
/** | ||
* TODO[short term] augment precomputations with inverse roots | ||
* TODO[short term] add testing for general variants, not only node0 | ||
* TODO[long term] large depth can lead to heavy memory usage | ||
* --> provide precomputation-free functions | ||
* TODO[long term] on zen4 (likely on other cpus as well) ctx_init becomes | ||
* slower at some point, losing a factor 4 or more, probably due to caching; | ||
* what is annoying is that the depth where it becomes slower is significantly | ||
* smaller (~13-14) when tab_iw has been incorporated compared to without | ||
* tab_iw (it was depth ~20-21); see if this can be understood, and maybe play | ||
* with vectorization for those simple functions | ||
* TODO[later] provide forward function which reduces output to [0..n) ? | ||
* unclear this is useful... to be decided later | ||
*/ | ||
|
||
/** n_fft context: | ||
* parameters and tabulated powers of the primitive root of unity "w". | ||
**/ | ||
|
||
typedef struct | ||
{ | ||
ulong mod; // modulus, odd prime | ||
ulong max_depth; // maximum supported depth (w has order 2**max_depth) | ||
ulong cofactor; // prime is 1 + cofactor * 2**max_depth | ||
ulong depth; // depth supported by current precomputation | ||
nn_ptr tab_w; // tabulated powers of w, see below | ||
nn_ptr tab_iw; // tabulated powers of 1/w, see below | ||
ulong tab_w2[2*FLINT_BITS]; // tabulated powers w**(2**k), see below | ||
ulong tab_inv2[2*FLINT_BITS]; // tabulated inverses of 2**k, see below | ||
} n_fft_ctx_struct; | ||
typedef n_fft_ctx_struct n_fft_ctx_t[1]; | ||
|
||
|
||
/** Requirements (not checked upon init): | ||
* - mod is an odd prime < 2**(FLINT_BITS-2) | ||
* - max_depth must be >= 3 (so, 8 must divide mod - 1) | ||
* Total memory cost of precomputations for arrays tab_{w,iw,w2,inv2}: | ||
* at most 2 * (2*FLINT_BITS + 2**depth) ulong's | ||
*/ | ||
|
||
/** tab_w2: | ||
* - length 2*FLINT_BITS, with undefined entries at index 2*(max_depth-1) and beyond | ||
* - contains powers w**d for d a power of 2, and corresponding | ||
* precomputations for modular multiplication: | ||
* -- for 0 <= k < max_depth-1, tab_w2[2*k] = w**(2**(max_depth-2-k)) | ||
* and tab_w2[2*k+1] = floor(tab_w2[2*k] * 2**FLINT_BITS / mod) | ||
* -- for 2*(max_depth-1) <= k < 2*FLINT_BITS, tab_w2[k] is undefined | ||
* | ||
* --> one can retrieve w as tab_w2[2 * (max_depth-2)] | ||
* --> the first elements are tab_w2 = [I, I_pr, J, J_pr, ...] | ||
* where I is a square root of -1 and J is a square root of I | ||
*/ | ||
|
||
/** tab_w: | ||
* - length 2**depth | ||
* - contains 2**(depth-1) first powers of w in (max_depth-1)-bit reversed order, | ||
* and corresponding precomputations for modular multiplication: | ||
* -- for 0 <= k < 2**(depth-1), tab_w[2*k] = w**(br[k]) | ||
* and tab_w[2*k+1] = floor(tab_w[2*k] * 2**FLINT_BITS / mod) | ||
* where br = [0, 2**(max_depth-2), 2**(max_depth-3), 3 * 2**(max_depth-3), ...] | ||
* is the bit reversal permutation of length 2**(max_depth-1) | ||
* (https://en.wikipedia.org/wiki/Bit-reversal_permutation) | ||
* | ||
* In particular the first elements are | ||
* tab_w = [1, 1_pr, I, I_pr, J, J_pr, IJ, IJ_pr, ...] | ||
* where I is a square root of -1, J is a square root of I, and IJ = I*J. Note | ||
* that powers of w beyond 2**(max_depth-1), for example -1, -I, -J, etc. are | ||
* not stored. | ||
**/ | ||
|
||
/** tab_iw: same as tab_w but for the primitive root 1/w */ | ||
|
||
/** tab_inv2: | ||
* - length 2*FLINT_BITS, with undefined entries at index 2*max_depth and beyond | ||
* - contains the modular inverses of 2**k, and corresponding | ||
* precomputations for modular multiplication: | ||
* -- for 0 <= k < max_depth, tab_inv2[2*k] = the inverse of 2**(k+1) | ||
* modulo mod, and tab_inv2[2*k+1] = floor(tab_inv2[2*k] * 2**FLINT_BITS / mod) | ||
* -- for 2*max_depth <= k < 2*FLINT_BITS, tab_inv2[k] is undefined | ||
* | ||
* Recall F->mod == 1 + cofactor * 2**max_depth, so | ||
* 1 == F->mod - cofactor * 2**(max_depth - k) * 2**k | ||
* --> the inverse of 2**k in (0, F->mod) is | ||
* F->mod - cofactor * 2**(max_depth - k), | ||
* we do not really need to store it, but we want the precomputations as well | ||
*/ | ||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
/** Note for init functions, when depth is provided: | ||
* - if it is < 3, it is pretended that it is 3 | ||
* - it it is more than F->max_depth (the maximum possible with the given | ||
* prime), it is reduced to F->max_depth | ||
* After calling init, precomputations support DFTs of length up to 2**depth | ||
*/ | ||
|
||
// initialize with given root and given depth | ||
void n_fft_ctx_init2_root(n_fft_ctx_t F, ulong w, ulong max_depth, ulong cofactor, ulong depth, ulong mod); | ||
|
||
// find primitive root, initialize with given depth | ||
void n_fft_ctx_init2(n_fft_ctx_t F, ulong depth, ulong p); | ||
|
||
// same, with default depth | ||
FLINT_FORCE_INLINE | ||
void n_fft_ctx_init_root(n_fft_ctx_t F, ulong w, ulong max_depth, ulong cofactor, ulong p) | ||
{ n_fft_ctx_init2_root(F, w, max_depth, cofactor, N_FFT_CTX_DEFAULT_DEPTH, p); } | ||
|
||
FLINT_FORCE_INLINE | ||
void n_fft_ctx_init(n_fft_ctx_t F, ulong p) | ||
{ n_fft_ctx_init2(F, N_FFT_CTX_DEFAULT_DEPTH, p); } | ||
|
||
// grows F->depth and precomputations to support DFTs of depth up to depth | ||
void n_fft_ctx_fit_depth(n_fft_ctx_t F, ulong depth); | ||
|
||
void n_fft_ctx_clear(n_fft_ctx_t F); | ||
|
||
|
||
|
||
typedef struct | ||
{ | ||
ulong mod; // modulus, odd prime | ||
ulong mod2; // 2*mod (storing helps for speed) | ||
//ulong mod4; // 4*mod (storing helps for speed) | ||
nn_srcptr tab_w; // tabulated powers of w, see below | ||
} n_fft_args_struct; | ||
typedef n_fft_args_struct n_fft_args_t[1]; | ||
|
||
FLINT_FORCE_INLINE | ||
void n_fft_set_args(n_fft_args_t F, ulong mod, nn_srcptr tab_w) | ||
{ | ||
F->mod = mod; | ||
F->mod2 = 2*mod; | ||
F->tab_w = tab_w; | ||
} | ||
|
||
|
||
|
||
|
||
|
||
/** dft: | ||
* transforms / inverse transforms / transposed transforms | ||
* at length a power of 2 | ||
*/ | ||
|
||
void dft_node0_lazy14(nn_ptr p, ulong depth, n_fft_args_t F); | ||
|
||
/** 2**depth-point DFT | ||
* * in [0..n) / out [0..4n) / max < 4n | ||
* * In-place transform p of length len == 2**depth into | ||
* the concatenation of | ||
* [sum(p[i] * w_k**i for i in range(len), sum(p[i] * (-w_k)**i for i in range(len)] | ||
* for k in range(len), | ||
* where w_k = F->tab_w[2*k] for 0 <= k < 2**(depth-1) | ||
* * By construction these evaluation points are the roots of the polynomial | ||
* x**len - 1, precisely they are all powers of the chosen len-th primitive | ||
* root of unity with exponents listed in bit reversed order | ||
* * Requirements (not checked): depth <= F.depth | ||
*/ | ||
FLINT_FORCE_INLINE void n_fft_dft(nn_ptr p, ulong depth, n_fft_ctx_t F) | ||
{ | ||
n_fft_args_t Fargs; | ||
n_fft_set_args(Fargs, F->mod, F->tab_w); | ||
dft_node0_lazy14(p, depth, Fargs); | ||
} | ||
|
||
// FIXME in progress | ||
// not tested yet --> test == applying dft yields identity | ||
// DOC. Note: output < n. | ||
void idft_node0_lazy12(nn_ptr p, ulong depth, n_fft_args_t F); | ||
FLINT_FORCE_INLINE void n_fft_idft(nn_ptr p, ulong depth, n_fft_ctx_t F) | ||
{ | ||
n_fft_args_t Fargs; | ||
n_fft_set_args(Fargs, F->mod, F->tab_iw); | ||
idft_node0_lazy12(p, depth, Fargs); | ||
|
||
if (depth > 0) | ||
{ | ||
const ulong inv2 = F->tab_inv2[2*depth-2]; | ||
const ulong inv2_pr = F->tab_inv2[2*depth-1]; | ||
//ulong p_hi, p_lo; | ||
for (ulong k = 0; k < (UWORD(1) << depth); k++) | ||
{ | ||
p[k] = n_mulmod_shoup(inv2, p[k], inv2_pr, F->mod); | ||
//umul_ppmm(p_hi, p_lo, inv2_pr, p[k]); | ||
//p[k] = inv2 * p[k] - p_hi * F->mod; | ||
} | ||
// NOTE: apparently no gain from lazy variant, so | ||
// probably better to use non-lazy one (ensures output < n) | ||
} | ||
// FIXME see if that can be made less expensive at least for depths not too | ||
// small, by integrating into base cases of dft_node0 | ||
} | ||
|
||
|
||
|
||
// FIXME in progress | ||
// not tested yet --> test == naive version? | ||
// DOC. Note: output < 2n (?). | ||
FLINT_FORCE_INLINE void n_fft_dft_t(nn_ptr p, ulong depth, n_fft_ctx_t F) | ||
{ | ||
n_fft_args_t Fargs; | ||
n_fft_set_args(Fargs, F->mod, F->tab_w); | ||
idft_node0_lazy12(p, depth, Fargs); | ||
} | ||
|
||
// FIXME in progress | ||
// not tested yet --> test == applying dft_t yields identity? | ||
// DOC. Note: output < n. | ||
FLINT_FORCE_INLINE void n_fft_idft_t(nn_ptr p, ulong depth, n_fft_ctx_t F) | ||
{ | ||
n_fft_args_t Fargs; | ||
n_fft_set_args(Fargs, F->mod, F->tab_iw); | ||
dft_node0_lazy14(p, depth, Fargs); | ||
|
||
if (depth > 0) | ||
{ | ||
// see comments in idft concerning this loop | ||
const ulong inv2 = F->tab_inv2[2*depth-2]; | ||
const ulong inv2_pr = F->tab_inv2[2*depth-1]; | ||
for (ulong k = 0; k < (UWORD(1) << depth); k++) | ||
p[k] = n_mulmod_shoup(inv2, p[k], inv2_pr, F->mod); | ||
} | ||
} | ||
|
||
|
||
|
||
|
||
#ifdef __cplusplus | ||
} | ||
#endif | ||
|
||
#endif /* N_FFT_H */ |
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.