Description
[MSVC] (CPU: Xeon 1231v3)
I noticed slowdowns while experimenting with several combinations of matrix sizes and simdpp instruction sets.
I implemented matrix multiplication in three different ways: plain C, unmodified simdpp, and my own modification of simdpp. All implementations resemble the code below:
#include <immintrin.h>

matrix<float, 4, 4> result;

// Load the four rows of m2 once up front (rows are 16-byte aligned).
__m128 m2rows[4] =
{
    _mm_load_ps(m2.row(0).data()),
    _mm_load_ps(m2.row(1).data()),
    _mm_load_ps(m2.row(2).data()),
    _mm_load_ps(m2.row(3).data()),
};

auto m1row = m1.begin();
auto resultrow = result.begin();
for (; m1row != m1.end(); m1row += 4, resultrow += 4)
{
    // Broadcast each element of the current m1 row into all four lanes.
    __m128 m1row_v[4] =
    {
        _mm_load_ps1(m1row),
        _mm_load_ps1(m1row + 1),
        _mm_load_ps1(m1row + 2),
        _mm_load_ps1(m1row + 3)
    };
    // Scale each row of m2 by the corresponding element of the m1 row.
    __m128 temp_mul[4] =
    {
        _mm_mul_ps(m1row_v[0], m2rows[0]),
        _mm_mul_ps(m1row_v[1], m2rows[1]),
        _mm_mul_ps(m1row_v[2], m2rows[2]),
        _mm_mul_ps(m1row_v[3], m2rows[3]),
    };
    // Sum the partial products and store one row of the result.
    __m128 resultrow_v = _mm_add_ps(_mm_add_ps(temp_mul[0], temp_mul[1]),
                                    _mm_add_ps(temp_mul[2], temp_mul[3]));
    _mm_store_ps(resultrow, resultrow_v);
}
return result;
(Every structure is properly aligned and sized so that loads/stores of the given SIMD type are valid.)
Below I explain the performance problems observed in each combination of matrix size and instruction set.
Performance is calculated as the average of 100000 iterations over 10 runs of every implementation.
Compiled with the /O2 and /Ob2 optimizations (Release).
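For reference, the harness looked roughly like this (a minimal sketch under those settings; bench_ns and the multiply callable are hypothetical names, not from my actual benchmark code):

#include <chrono>

// Hypothetical harness: averages 100000 iterations per run, over 10 runs.
template <class F>
double bench_ns(F&& multiply)
{
    using clock = std::chrono::steady_clock;
    constexpr int iterations = 100000;
    constexpr int runs = 10;
    double sum_ns = 0.0;
    for (int r = 0; r < runs; ++r)
    {
        auto start = clock::now();
        for (int i = 0; i < iterations; ++i)
            multiply(); // matrix multiplication under test
        auto stop = clock::now();
        sum_ns += std::chrono::duration<double, std::nano>(stop - start).count()
                  / iterations;
    }
    return sum_ns / runs; // average ns per multiplication
}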
matrix 4x4 (SSE2):
plain C: ~9ns
simdpp: ~9ns
personal modification of simdpp: ~9ns
Here everything works as expected.
matrix 8x8 (SSE2):
plain C: ~63ns
simdpp: ~377ns
personal modification of simdpp: ~64ns
In this combination we can see a performance loss on the simdpp side. After analysing the implementation, my fix was to change the place where the actual load happens: the i_load(V& a, const char* p) template function. The MSVC compiler couldn't simplify the for loop written inside it. After changing it to add constexpr offsets to the pointer (the std::index_sequence trick), execution time decreased and matched plain C (this change is part of my personal modification of simdpp).
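For illustration, the unrolling looks roughly like this (a minimal sketch, not the exact simdpp code; vec_n is a hypothetical stand-in for simdpp's multi-register vector types, and only the i_load signature matches the original):

#include <cstddef>
#include <utility>
#include <emmintrin.h>

// Hypothetical multi-register vector: N packed __m128 sub-vectors.
template <std::size_t N>
struct vec_n
{
    __m128 d[N];
    static constexpr std::size_t vec_length = N;
    __m128& vec(std::size_t i) { return d[i]; }
};

// Before: a runtime loop that MSVC fails to simplify, roughly
//   for (unsigned i = 0; i < V::vec_length; ++i, p += 16)
//       load(a.vec(i), p);
// After: a pack expansion over constexpr offsets, so every address
// offset is a compile-time constant and the loop disappears entirely.
template <std::size_t N, std::size_t... I>
void i_load_impl(vec_n<N>& a, const char* p, std::index_sequence<I...>)
{
    ((a.vec(I) = _mm_load_ps(reinterpret_cast<const float*>(p + I * 16))), ...);
}

template <std::size_t N>
void i_load(vec_n<N>& a, const char* p)
{
    i_load_impl(a, p, std::make_index_sequence<N>{});
}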
matrix 16x16 (SSE2):
plain C: ~272ns
simdpp: ~6532ns
personal modification of simdpp: ~270ns
This combination exhibits the same problem as the previous one.
matrix 4x4 (AVX2):
plain C (__m128 fallback): ~9ns
simdpp (__m128 fallback): ~593ns
personal modification of simdpp (__m128 fallback): ~9ns
Even though nothing really changed here (the code still uses __m128 internally, as in the SSE2 4x4 case), simdpp performance dropped significantly. After digging through the implementation and the Intel Intrinsics Guide, I found a solution: replace _mm_broadcast_ss with _mm_load_ss followed by permute4<0,0,0,0>(v), as if I hadn't enabled the AVX2 instruction set. The Intel Intrinsics Guide lists _mm_broadcast_ss with a latency of 6 cycles and a throughput of 0.5 CPI, which means that for the four load_splats I'm doing, the CPU sits idle for about 22 cycles. With _mm_load_ss plus permute4<0,0,0,0>(v) (which is almost always equivalent to _mm_load_ps1), the CPU doesn't waste those cycles: the change translates to the movss and shufps instructions, which both have a throughput of 1 CPI and a latency of 1 cycle.
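Concretely, the replacement amounts to something like the following (a sketch; the splat_* helper names are hypothetical, not simdpp's internal functions):

#include <immintrin.h>

// What simdpp emits with AVX2 enabled: one vbroadcastss,
// latency 6 / throughput 0.5 per the Intel Intrinsics Guide.
static inline __m128 splat_broadcast(const float* p)
{
    return _mm_broadcast_ss(p);
}

// The replacement: movss + shufps, each with latency 1 / throughput 1,
// so four consecutive splats no longer leave the CPU idle.
static inline __m128 splat_load_shuffle(const float* p)
{
    __m128 v = _mm_load_ss(p);                             // movss
    return _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0));  // shufps
}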
matrix 8x8 (AVX2):
plain C (__m256): ~33ns
simdpp (__m256): ~32ns
personal modification of simdpp (__m256): ~34ns
Here simdpp performance matches plain C, as expected.
matrix 16x16 (AVX2):
plain C (__m256): ~260ns
simdpp (__m256): ~2915ns
personal modification of simdpp (__m256): ~257ns
Again simdpp takes a performance hit from the for loop in i_load, as explained in the 8x8 SSE2 case.
The problems stated above also occur elsewhere in the simdpp implementation (e.g. in the add, mul, and sub functions). I haven't tested the code above with G++ or Clang on my platform, but I suspect some of these problems can still occur there (e.g. the one with _mm_broadcast_ss). I also haven't measured these matrix multiplications with AVX512F, since I don't have a CPU supporting that instruction set, but issues of a similar kind may be present there as well.