Commit 821729c

Update README_REDC_supplement.md
1 parent a223729 commit 821729c

1 file changed (+2, -2 lines changed)
montgomery_arithmetic/include/hurchalla/montgomery_arithmetic/low_level_api/detail/platform_specific/README_REDC_supplement.md

Lines changed: 2 additions & 2 deletions
@@ -1,10 +1,10 @@
 This file supplements the document [README_REDC.md](README_REDC.md).
 <br><br>
 
-The simple solution to getting a good traditional REDC is to write a delegating function that calls the alternate REDC. With inlining, its total uops will likely be lower than the low-uops asm version further below, and there is a decent chance that the compiler will loop hoist the calculation of invN if we are calling this function from a loop. Thus this version could also achieve latency equal to the low-latency asm version further below. Yet another reason why we might prefer this implementation is that the delegate "REDC_alternate" function can be implemented effectively with just standard C, which would eliminate the chance of inline-asm related bugs, and will sometimes improve performance since inline-asm may hinder compiler optimizations.<br>
+The simplest and usually most effective way to implement the traditional REDC is to write a delegating function that calls the alternate REDC. With inlining, its total uops will likely be lower than the low-uops asm version further below, and there is a decent chance that the compiler will loop hoist the calculation of invN if we are calling this function from a loop. Thus this version could also achieve latency equal to the low-latency asm version further below. In practice, even if the negation is not loop hoisted, REDC will most often be called during Montgomery multiplication, and the negation will not contribute to latency since its calculation will overlap with the preceding multiply in the Montgomery multiplication. Yet another reason why we might prefer this implementation is that the delegate "REDC_alternate" function can be implemented effectively with just standard C, which would eliminate the chance of inline-asm related bugs, and will sometimes improve performance since inline-asm may hinder compiler optimizations.<br>
 
 <pre>
-// On Intel Skylake: ~10 cycles latency, ~8 fused uops
+// On Intel Skylake: ~9-10 cycles latency, ~8 fused uops
 inline uint64_t REDC_traditional_delegating(uint64_t T_hi, uint64_t T_lo,
                                             uint64_t N, uint64_t negInvN)
 {

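The hunk above cuts off before the function body, so the following is a minimal, self-contained sketch of the delegation idea in portable C++, not the repository's code. All four function names ending in `_sketch` are hypothetical. The sketch assumes R = 2^64, an odd modulus N, and an input T = T_hi*R + T_lo with T_hi < N. It also assumes the usual convention that the traditional REDC receives negInvN = -N^(-1) mod R while the alternate REDC receives invN = N^(-1) mod R, so the wrapper needs only a single negation.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: high 64 bits of the 128-bit product a*b,
// computed from 32-bit halves so that only standard C++ is needed.
inline uint64_t mulhi_u64_sketch(uint64_t a, uint64_t b)
{
    const uint64_t mask = 0xFFFFFFFFu;
    uint64_t a_lo = a & mask, a_hi = a >> 32;
    uint64_t b_lo = b & mask, b_hi = b >> 32;
    uint64_t p0 = a_lo * b_lo;
    uint64_t p1 = a_lo * b_hi;
    uint64_t p2 = a_hi * b_lo;
    uint64_t p3 = a_hi * b_hi;
    // Sum of the middle partial products plus the carry out of p0.
    uint64_t mid = (p0 >> 32) + (p1 & mask) + (p2 & mask);
    return p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
}

// Hypothetical helper: N^(-1) mod 2^64 for odd N, via Newton iteration.
// x = N is correct to 3 bits; each step doubles the number of correct bits.
inline uint64_t inverse_mod_2_64_sketch(uint64_t N)
{
    assert(N % 2 == 1);
    uint64_t x = N;
    for (int i = 0; i < 5; ++i)
        x *= 2 - N * x;
    return x;
}

// Sketch of an alternate REDC (assumed form): returns T * R^(-1) mod N,
// using the positive inverse invN = N^(-1) mod R.
inline uint64_t REDC_alternate_sketch(uint64_t T_hi, uint64_t T_lo,
                                      uint64_t N, uint64_t invN)
{
    assert(N % 2 == 1 && T_hi < N);
    uint64_t m = T_lo * invN;                 // m*N has low word equal to T_lo
    uint64_t mN_hi = mulhi_u64_sketch(m, N);  // so only the high word matters
    uint64_t t = T_hi - mN_hi;                // may wrap around (negative)
    if (T_hi < mN_hi)
        t += N;                               // bring the result into [0, N)
    return t;
}

// Sketch of the delegating traditional REDC: negate once, then delegate.
inline uint64_t REDC_traditional_delegating_sketch(uint64_t T_hi, uint64_t T_lo,
                                                   uint64_t N, uint64_t negInvN)
{
    uint64_t invN = static_cast<uint64_t>(0) - negInvN;
    return REDC_alternate_sketch(T_hi, T_lo, N, invN);
}
```

Because the negation depends only on negInvN, a compiler may be able to hoist it out of a loop in which N stays fixed, which is the effect the changed paragraph describes. When it is not hoisted, the negation has no dependency on T_hi or T_lo, so it can execute while an earlier multiply is still in flight.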