
Optimize SSSE3, AVX2, AVX512 Implementations of GF(2^8) SIMD Arithmetic #30

@itzmeanjan

For example, in the AVX2 implementation of the mul_vec_by_scalar_then_add_into function, we can improve throughput by issuing more independent instructions per iteration of the loop. We just have to manually unroll the loop to get that working.

for (add_vec_chunk, mul_vec_chunk) in add_vec_iter.by_ref().zip(mul_vec_iter.by_ref()) {
    let mul_vec_chunk_simd = _mm256_lddqu_si256(mul_vec_chunk.as_ptr().cast());

    // Table lookup for the low nibble of each byte.
    let chunk_simd_lo = _mm256_and_si256(mul_vec_chunk_simd, l_mask);
    let chunk_simd_lo = _mm256_shuffle_epi8(l_tbl, chunk_simd_lo);

    // Table lookup for the high nibble of each byte.
    let chunk_simd_hi = _mm256_srli_epi64::<4>(mul_vec_chunk_simd);
    let chunk_simd_hi = _mm256_and_si256(chunk_simd_hi, l_mask);
    let chunk_simd_hi = _mm256_shuffle_epi8(h_tbl, chunk_simd_hi);

    // XOR of the two nibble products = scalar * chunk in GF(2^8).
    let scaled_res = _mm256_xor_si256(chunk_simd_lo, chunk_simd_hi);

    // Accumulate into the destination chunk (XOR = addition in GF(2^8)).
    let add_vec_chunk_simd = _mm256_lddqu_si256(add_vec_chunk.as_ptr().cast());
    let accum_res = _mm256_xor_si256(add_vec_chunk_simd, scaled_res);
    _mm256_storeu_si256(add_vec_chunk.as_mut_ptr().cast(), accum_res);
}

This is possible because AVX2 or AVX512 instructions may have higher latency, but the CPU can issue several independent ones in parallel. When the work of adjacent iterations is pipelined nicely, it can give us quite a good performance boost.
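As a rough illustration, a 2x-unrolled body could look something like the sketch below. It reuses the l_tbl / h_tbl / l_mask names from the snippet above, while the function signature, the 64-byte chunking, and the remainder handling are assumed for the example and don't necessarily match the crate's actual API; treat it as a starting point rather than a drop-in replacement.

use core::arch::x86_64::*;

// 2x-unrolled sketch: each iteration consumes 64 bytes and interleaves
// two independent dependency chains so their latencies can overlap.
// Signature and chunk width are illustrative assumptions.
#[target_feature(enable = "avx2")]
unsafe fn mul_vec_by_scalar_then_add_into_x2(
    add_vec: &mut [u8],
    mul_vec: &[u8],
    l_tbl: __m256i,
    h_tbl: __m256i,
    l_mask: __m256i,
) {
    let add_iter = add_vec.chunks_exact_mut(64);
    let mul_iter = mul_vec.chunks_exact(64);

    for (add_chunk, mul_chunk) in add_iter.zip(mul_iter) {
        // Two independent 32-byte loads per iteration.
        let m0 = _mm256_lddqu_si256(mul_chunk.as_ptr().cast());
        let m1 = _mm256_lddqu_si256(mul_chunk.as_ptr().add(32).cast());

        // Low-nibble table lookups for both lanes.
        let lo0 = _mm256_shuffle_epi8(l_tbl, _mm256_and_si256(m0, l_mask));
        let lo1 = _mm256_shuffle_epi8(l_tbl, _mm256_and_si256(m1, l_mask));

        // High-nibble table lookups for both lanes.
        let hi0 = _mm256_shuffle_epi8(h_tbl, _mm256_and_si256(_mm256_srli_epi64::<4>(m0), l_mask));
        let hi1 = _mm256_shuffle_epi8(h_tbl, _mm256_and_si256(_mm256_srli_epi64::<4>(m1), l_mask));

        // scaled = lo ^ hi, i.e. scalar * chunk in GF(2^8), per lane.
        let s0 = _mm256_xor_si256(lo0, hi0);
        let s1 = _mm256_xor_si256(lo1, hi1);

        // Accumulate both lanes into the destination.
        let a0 = _mm256_lddqu_si256(add_chunk.as_ptr().cast());
        let a1 = _mm256_lddqu_si256(add_chunk.as_ptr().add(32).cast());

        _mm256_storeu_si256(add_chunk.as_mut_ptr().cast(), _mm256_xor_si256(a0, s0));
        _mm256_storeu_si256(add_chunk.as_mut_ptr().add(32).cast(), _mm256_xor_si256(a1, s1));
    }

    // A trailing chunk shorter than 64 bytes would fall back to the
    // existing single-vector / scalar path (elided here).
}

The m0 and m1 chains have no data dependencies on each other, so the shuffles and XORs of one chain can execute while the other is still waiting on its loads. A 4x unroll may pay off further, at the cost of more register pressure.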

So explore that and see what we can achieve.
