See https://github.com/JuliaGPU/CUDA.jl/pull/2813#issuecomment-3148696577 for more details. Some algorithms and array shapes are severely hindered by the constant block size of 256. This might be better implemented in GPUArrays instead; open to discussion.