The algorithm is very CPU-intensive, so it might be a good idea to implement that part close to the metal. SIMD or NumPy? There are probably a few ways to do that; everything else can stay in Python.
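
To make the NumPy option concrete, here is a rough sketch. The actual algorithm isn't shown here, so the hot loop below (a sum of squared differences) is only a hypothetical stand-in, and the names `score_pure_python` and `score_numpy` are made up for illustration. The idea is that only the inner loop moves into NumPy's compiled code, while the callers stay plain Python.

```python
import numpy as np


def score_pure_python(a, b):
    """Hypothetical hot loop: sum of squared differences, one element at a time."""
    total = 0.0
    for x, y in zip(a, b):
        d = x - y
        total += d * d
    return total


def score_numpy(a, b):
    """Same computation, with the loop pushed into NumPy's compiled routines."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    d = a - b                    # element-wise subtraction, no Python-level loop
    return float(np.dot(d, d))   # sum of squares as a single dot product


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.random(100_000)
    b = rng.random(100_000)
    # Both versions should agree up to floating-point rounding.
    print(np.isclose(score_pure_python(a, b), score_numpy(a, b)))
```

Whether this is fast enough depends on the real algorithm. If it doesn't vectorize cleanly, a small C extension with explicit SIMD would be the other route, with the same Python-facing function signature.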