Optimizing Julia benchmarks
One must first read this, and apply everything it says.
In all the benchmarks, the main problem was memory allocation: if a program allocates large chunks of memory, performance will inevitably be poor.
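A quick way to see whether a function allocates is to use the @time and @allocated macros from Base; a hot function that reports allocations on every call is usually worth rewriting. A minimal sketch (the function and data below are illustrative, not taken from the benchmarks):

    # Scalar loop: accumulates in a local variable, allocates nothing.
    function sum_squares(v)
        s = 0.0
        @inbounds for x in v
            s += x * x
        end
        return s
    end

    v = rand(10^6)
    sum_squares(v)                      # warm-up call: do not time compilation
    @time sum_squares(v)                # prints elapsed time and bytes allocated
    println(@allocated sum_squares(v))  # bytes allocated by the call itself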
- One must first type everything that can be typed, at least where the computing time is spent. Look, for example, at the module Rando.jl in FeStiff/. If we replace:
    seed::Int64
    a::Int64
    c::Int64
    m::Int64
by
    seed
    a
    c
    m
the computing time for the generation of the triangles is now 2.622462655 seconds (on my computer), whereas it was only 0.091999968 seconds when the variables were typed. Now launch the untyped version with ./script-m and have a look at Rando.jl.mem:
            - function fv!(R::RandoData,vmax=1.)
    576000240     R.seed = (R.a * R.seed + R.c) % R.m
    192000096     vmax*Float64(R.seed)/R.m
            - end
Thus, a lot of memory is allocated. Return to the original typed version and launch ./script-m again: you can verify that no memory is allocated in the function when the variables are typed! (A sketch of the typed pattern follows.)
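For reference, here is a minimal sketch of the typed pattern, reconstructed from the snippets above (the constructor values are classic LCG constants chosen for illustration, not necessarily those of FeStiff/; ./script-m presumably runs Julia with --track-allocation to produce the .mem file):

    # With every field typed, the compiler generates allocation-free
    # integer code for fv!.
    mutable struct RandoData
        seed::Int64
        a::Int64
        c::Int64
        m::Int64
    end

    # Linear congruential step, as in the .mem listing above.
    function fv!(R::RandoData, vmax=1.0)
        R.seed = (R.a * R.seed + R.c) % R.m
        vmax * Float64(R.seed) / R.m
    end

    R = RandoData(42, 1103515245, 12345, 2^31)  # illustrative LCG constants
    fv!(R)  # returns a Float64 in [0, vmax); no allocation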
- Forget what you learned with Python/Numpy:
Have a look at MicroBenchmarks/Ju: each benchmark is coded in several programming styles: a vectorized style (like what one would write with Python/Scipy/Numpy) and a naïve style with explicit loops. The explicit-loop style always wins. Do not forget that Julia arrays are stored Fortran-like, i.e. column-major (look at MicroBenchmarks/Ju/main_lapl_2.jl), and do not forget the @simd macro; see the sketch below.
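As an illustration (the names and sizes are mine, not taken from MicroBenchmarks/Ju), compare a vectorized stencil with an explicit-loop version whose inner loop runs over the first, fastest-varying index:

    # Vectorized, NumPy-like style: the slices A[1:end-2, :] and
    # A[3:end, :] each allocate a copy, plus the result array.
    stencil_vec(A) = A[1:end-2, :] .+ A[3:end, :]

    # Explicit-loop style: preallocated output, inner loop over the
    # first (column-major, contiguous) index, vectorized with @simd.
    function stencil_loop!(B, A)
        n, m = size(A)
        @inbounds for j in 1:m
            @simd for i in 2:n-1
                B[i-1, j] = A[i-1, j] + A[i+1, j]
            end
        end
        return B
    end

    A = rand(1000, 1000)
    B = similar(A, 998, 1000)
    stencil_loop!(B, A)   # same result as stencil_vec(A), no temporaries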
- Improving Weno/Ju: look at Weno/Weno.jl. You will see commented-out lines of code: those alternatives run slower than the versions we kept (generally because of memory allocation). For example:
    W.InC[3:2+size] = In[:]

or

    W.InC[3:2+size] = copy(In[:])
are slower than the loop:
    @simd for i = 1:size
        W.InC[2+i] = In[i]
    end
Even the call to ddot (line 64) is slower (and allocates memory) than the naïve implementation (lines 70-80); a sketch of that kind of loop follows.
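For comparison, a naïve allocation-free dot product takes only a few lines. This is a sketch, assuming ddot refers to a BLAS-style dot product; the name mydot is illustrative, not the repository's:

    # Naïve dot product: for the short stencil vectors used in a WENO
    # scheme, this beats a library ddot call and allocates nothing.
    function mydot(a, b)
        s = 0.0
        @inbounds @simd for i in eachindex(a, b)
            s += a[i] * b[i]
        end
        return s
    end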
---to be continued---