Closed
Description
For example, in the AVX2 implementation of the mul_vec_by_scalar_then_add_into function, we can improve throughput by issuing more independent instructions per loop iteration. To do that, we just have to manually unroll the loop.
rlnc/src/common/simd/x86/avx2.rs
Lines 74 to 90 in 061ec3f
for (add_vec_chunk, mul_vec_chunk) in add_vec_iter.by_ref().zip(mul_vec_iter.by_ref()) {
    let mul_vec_chunk_simd = _mm256_lddqu_si256(mul_vec_chunk.as_ptr().cast());
    let chunk_simd_lo = _mm256_and_si256(mul_vec_chunk_simd, l_mask);
    let chunk_simd_lo = _mm256_shuffle_epi8(l_tbl, chunk_simd_lo);
    let chunk_simd_hi = _mm256_srli_epi64(mul_vec_chunk_simd, 4);
    let chunk_simd_hi = _mm256_and_si256(chunk_simd_hi, l_mask);
    let chunk_simd_hi = _mm256_shuffle_epi8(h_tbl, chunk_simd_hi);
    let scaled_res = _mm256_xor_si256(chunk_simd_lo, chunk_simd_hi);
    let add_vec_chunk_simd = _mm256_lddqu_si256(add_vec_chunk.as_ptr().cast());
    let accum_res = _mm256_xor_si256(add_vec_chunk_simd, scaled_res);
    _mm256_storeu_si256(add_vec_chunk.as_mut_ptr().cast(), accum_res);
}
This is possible because AVX2 or AVX-512 instructions may have higher latency, but several independent ones can be in flight at the same time. When pipelined well, this can give us a good performance boost.
So let's explore that and see what we can achieve.
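As a starting point, here is a minimal sketch of what a 2x manually unrolled version of the loop above might look like. It is not the crate's implementation. The function names (`gf256_mul`, `mul_add_scalar_ref`, `mul_add_unrolled_x2`, `mul_add`) are hypothetical. The nibble lookup tables are built on the fly instead of coming from precomputed tables, and the GF(2^8) reduction polynomial 0x11D is an assumption made only so the sketch is self-contained. Each iteration handles 64 bytes as two independent 32-byte chains, and any tail shorter than 64 bytes falls back to the scalar path.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Scalar GF(2^8) multiplication (shift-and-add).
/// ASSUMPTION: reduction polynomial 0x11D, chosen only for this sketch.
fn gf256_mul(mut a: u8, mut b: u8) -> u8 {
    let mut r = 0u8;
    while b != 0 {
        if b & 1 != 0 {
            r ^= a;
        }
        let carry = a & 0x80 != 0;
        a <<= 1;
        if carry {
            a ^= 0x1D;
        }
        b >>= 1;
    }
    r
}

/// Portable reference: add_vec[i] ^= mul_vec[i] * scalar.
fn mul_add_scalar_ref(add_vec: &mut [u8], mul_vec: &[u8], scalar: u8) {
    for (a, &m) in add_vec.iter_mut().zip(mul_vec) {
        *a ^= gf256_mul(m, scalar);
    }
}

/// Hypothetical 2x unrolled AVX2 variant: two independent 32-byte chains per iteration.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn mul_add_unrolled_x2(add_vec: &mut [u8], mul_vec: &[u8], scalar: u8) {
    assert_eq!(add_vec.len(), mul_vec.len());

    // 16-entry tables: products of the scalar with every low / high nibble value.
    let mut lo = [0u8; 16];
    let mut hi = [0u8; 16];
    for i in 0..16u8 {
        lo[i as usize] = gf256_mul(i, scalar);
        hi[i as usize] = gf256_mul(i << 4, scalar);
    }
    let l_tbl = _mm256_broadcastsi128_si256(_mm_loadu_si128(lo.as_ptr().cast()));
    let h_tbl = _mm256_broadcastsi128_si256(_mm_loadu_si128(hi.as_ptr().cast()));
    let l_mask = _mm256_set1_epi8(0x0f);

    let mut add_iter = add_vec.chunks_exact_mut(64);
    let mut mul_iter = mul_vec.chunks_exact(64);

    for (a, m) in add_iter.by_ref().zip(mul_iter.by_ref()) {
        // Issue both loads up front; chains 0 and 1 share no data dependencies.
        let m0 = _mm256_lddqu_si256(m.as_ptr().cast());
        let m1 = _mm256_lddqu_si256(m.as_ptr().add(32).cast());

        let lo0 = _mm256_shuffle_epi8(l_tbl, _mm256_and_si256(m0, l_mask));
        let lo1 = _mm256_shuffle_epi8(l_tbl, _mm256_and_si256(m1, l_mask));
        let hi0 = _mm256_shuffle_epi8(h_tbl, _mm256_and_si256(_mm256_srli_epi64(m0, 4), l_mask));
        let hi1 = _mm256_shuffle_epi8(h_tbl, _mm256_and_si256(_mm256_srli_epi64(m1, 4), l_mask));

        let a0 = _mm256_lddqu_si256(a.as_ptr().cast());
        let a1 = _mm256_lddqu_si256(a.as_ptr().add(32).cast());
        let r0 = _mm256_xor_si256(a0, _mm256_xor_si256(lo0, hi0));
        let r1 = _mm256_xor_si256(a1, _mm256_xor_si256(lo1, hi1));

        _mm256_storeu_si256(a.as_mut_ptr().cast(), r0);
        _mm256_storeu_si256(a.as_mut_ptr().add(32).cast(), r1);
    }

    // Fewer than 64 bytes left: finish with the scalar reference.
    mul_add_scalar_ref(add_iter.into_remainder(), mul_iter.remainder(), scalar);
}

/// Uses the unrolled AVX2 path when the CPU supports it, else the scalar path.
fn mul_add(add_vec: &mut [u8], mul_vec: &[u8], scalar: u8) {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was checked at runtime just above.
            unsafe { mul_add_unrolled_x2(add_vec, mul_vec, scalar) };
            return;
        }
    }
    mul_add_scalar_ref(add_vec, mul_vec, scalar);
}
```

Calling `mul_add(&mut acc, &src, scalar)` on two equal-length byte slices should give the same result as `mul_add_scalar_ref`. Whether 2x (or 4x) unrolling actually helps depends on the target microarchitecture and on whether the loop is already limited by memory bandwidth, so it needs to be benchmarked against the current loop.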