v0.8.1
In v0.8.1, the built-in transpose function uses a different algorithm to perform the in-place transpose.
Prior to this version, the transpose used a cycle-chasing algorithm, which turned out to have poor cache locality. It has been replaced with an algorithm that allocates a new temporary array. The transpose operation is then simply an iterative copy of the elements into the temporary array, after which the data is copied from the temporary array back to the original array.
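As a rough illustration of the temporary-array approach described above, here is a minimal sketch for the 2-D, row-major case only. It is not the library's actual implementation: the function name `transposeViaTemp` and its signature are hypothetical, and the real transpose also has to handle arbitrary axes and data types.

```go
package main

// transposeViaTemp is a hypothetical helper, not part of the library's API.
// It transposes a row-major rows×cols matrix stored in data. After the call,
// data holds the cols×rows transpose, also in row-major order.
func transposeViaTemp(data []float64, rows, cols int) {
	if len(data) != rows*cols {
		panic("transposeViaTemp: len(data) must equal rows*cols")
	}

	// Allocate the temporary array that receives the transposed elements.
	tmp := make([]float64, len(data))

	// Iteratively copy each element to its transposed position.
	// Element (i, j) of the source becomes element (j, i) of the result.
	for i := 0; i < rows; i++ {
		for j := 0; j < cols; j++ {
			tmp[j*rows+i] = data[i*cols+j]
		}
	}

	// Copy the result back so the caller sees an "in-place" transpose.
	copy(data, tmp)
}
```

Calling `transposeViaTemp(data, 2, 3)` on `[]float64{1, 2, 3, 4, 5, 6}` leaves `data` as `1, 4, 2, 5, 3, 6`. The trade-off is one extra allocation the size of the data in exchange for simple, sequential reads of the source.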
v0.8.1 also includes an improvement to the FlatIterator structure, contributed by
@stuartcarnie. Here are the benchmark results:
benchmark                     old ns/op      new ns/op      delta
BenchmarkComplicatedGet-8     228778         199737         -12.69%

benchmark                     old allocs     new allocs     delta
BenchmarkComplicatedGet-8     2              2              +0.00%

benchmark                     old bytes      new bytes      delta
BenchmarkComplicatedGet-8     112            112            +0.00%