Separable filtering #3
base: master
Conversation
Nice to see the first PR! Thank you!
"buildTypes": { | ||
"betterC": { | ||
"buildOptions": ["noBoundsCheck", "releaseMode", "optimize", "inline"], | ||
"dflags-ldc": ["-betterC", "-mcpu=native", "-enable-cross-module-inlining"] |
👍
/+
Module contains memory handling routines used throughout the library.
+/
module mir.cv.core.memory;
Could you please create a repo mir-rt and copy glas/internal/memory there, to be reused both in mir-cv and mir-glas? It already contains the required pragmas.
This is great - will do!
        cpuid_x86_any_init();
    }
}
static Init_cpu_id __cpuid_init;
We need this to get L1/L2/L3 cache levels in the future. At the same time, I remember Walter and Andrei proposed extern(C) shared this() module constructors that would work in betterC.
Utilities to help use SIMD vectors.
+/

module mir.cv.core.simd;
I would propose another approach that is used in mir-glas, see glas/internal/config.
The reason is that mir-cv is open source and Boost licensed, and D compilers are fast. LDC supports compile-time traits for CPU information (see glas internal config). Let's move forward with LDC for betterC libraries. Front-end DCV users would be able to use DMD because of the extern(C) interface.
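For illustration, a rough sketch (not taken from glas/internal/config itself; the exact feature strings are assumptions) of how LDC's compile-time CPU traits can pick a vector width:

version (LDC)
{
    // __traits(targetHasFeature, ...) is LDC-specific
    static if (__traits(targetHasFeature, "avx"))
        enum size_t vectorBits = 256;
    else static if (__traits(targetHasFeature, "sse2"))
        enum size_t vectorBits = 128;
    else
        enum size_t vectorBits = 0; // scalar fallback
}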
We are able to compile multiple mir-cv dynamic backends for different CPUs.
Front-end DCV users would be able to use DMD because of the extern(C) interface.
I think you meant to say 'Front-end DCV users would be able to use mir-cv...'? If that's the case, I concur; I understood that was the idea from the beginning.
I would propose another approach that is used in mir-glas, see glas/internal/config.
Nice! It's really cool there are already some modules that we can reuse in mir libs. I'll study this module and see how I can fit it here. Should this also be copied to mir-rt?
- Yes
- Yes, but only partially, and to mir-algorithm, because it is compile-time logic. We need lists of available vector lengths for different types.
source/mir/cv/imgproc/imfilter.d
Outdated
}

pragma(inline, true)
auto max(T)(T t1, T t2) {
source/mir/cv/imgproc/imfilter.d
Outdated
if (avx) // add version (compile time flag if avx should be supported)
{
    inner_filtering_impl!(AVX!T)(input, temp, hm, vm, output,
        &apply_horizontal_kernel_simd!(AVX!T), &apply_vertical_kernel_simd!(AVX!T));
apply_vertical_kernel_simd is marked pragma(inline, true). If you want apply_vertical_kernel_simd to be inlined, it is more intuitive to pass it as a template parameter.
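For illustration, a minimal sketch (signature simplified, argument list hypothetical) of passing the kernels as template alias parameters so the inliner can see them:

void inner_filtering_impl(V, alias hkernel, alias vkernel, Slices...)(Slices slices)
{
    // hkernel / vkernel are direct, compile-time resolved calls here,
    // so pragma(inline, true) on them can actually take effect
}

// hypothetical call site:
// inner_filtering_impl!(AVX!T,
//     apply_horizontal_kernel_simd!(AVX!T),
//     apply_vertical_kernel_simd!(AVX!T))(input, temp, hm, vm, output);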
source/mir/cv/imgproc/imfilter.d
Outdated
b = output[rrb .. $, 0 .. $].map!toPtr;

zip!true(a, t)[0 .. $, 0 .. $ - hsize].each!((w) { hskernel(w.a, hmask._iterator, w.b, hsize); });
zip!true(t, b)[0 .. $ - vsize, 0 .. $ - hsize / 2].each!((w) { vskernel(w.a, vmask._iterator, w.b, vsize, rowstride); });
zip!true assumes that universal slices have the same strides. Is that true?
Ah, yes. This is true.
source/mir/cv/imgproc/imfilter.d
Outdated
Slice!(Universal, [2], InstructionSet.Scalar*) input, Slice!(Universal,
    [2], InstructionSet.Scalar*) temp, Slice!(Universal, [1],
    InstructionSet.Scalar*) hmask, Slice!(Universal, [1],
    InstructionSet.Scalar*) vmask, Slice!(Universal, [2],
Can we have Contiguous input and use Universal internally, since this function assumes that the slices have the same strides?
Well, to be honest, it's easier for me to mark those Universal at the top, and never worry whether some call (such as strided) will return another kind of slice later on. But anyway, I'll change those to Contiguous - I assume it documents the code better and also makes it more type safe (less error prone), even though this is not an API function.
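For reference, a minimal sketch (module paths and the Slice signature mirror the old-style API used in this diff and are assumptions) of taking Contiguous input and going Universal only internally:

import mir.ndslice.slice : Slice, Contiguous;
import mir.ndslice.topology : universal;

void filter_impl(T)(Slice!(Contiguous, [2], T*) input)
{
    auto u = input.universal; // same data, strides are now explicit
    // ... strided processing works on `u` ...
}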
source/mir/cv/imgproc/imfilter.d
Outdated
}
else
{
    inner_filtering_impl!(Non_SIMD!T)(input, temp, hm, vm, output,
        get_horizontal_kernel_for_mask!T(hsize), get_vertical_kernel_for_mask!T(vsize));
switch (ksize) {
@9il do you know of a nicer way to write this? (to avoid the switch)
@9il I've created mir-rt, and added ...
Thanks @ljubobratovicrelja. Yes. I will remove config because it is a compile-time feature and should go to mir-algorithm.
I will try to make ... BTW, we need to think about completely generic Allocators that would not have an instance field.
You have sent me a link to a special language once; I do not remember which one. But I am sure that the current code can be improved several times over in terms of performance. Please use a multilevel kernel architecture: row level, column level, cache (block) level. Please review how it was done in that language for image processing.
There should be an API available for separate column processing and row processing, and separate kernels for common convolution and border convolution.
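To make the idea concrete, a rough structural sketch of the three levels (all names and signatures are hypothetical, not an implementation):

// row level: convolve one output element along a row
void row_kernel(T)(const(T)* src, const(T)* mask, ref T dst, size_t klen)
{
    T r = 0;
    foreach (i; 0 .. klen)
        r += src[i] * mask[i];
    dst = r;
}

// column level: same idea, stepping by the row stride
void col_kernel(T)(const(T)* src, const(T)* mask, ref T dst, size_t klen, size_t stride)
{
    T r = 0;
    foreach (i; 0 .. klen)
        r += src[i * stride] * mask[i];
    dst = r;
}

// block level: walk L1-sized tiles of the interior calling the two kernels above,
// then run dedicated border kernels along the image edges.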
auto hk = cast(V[])alignedAllocate(ksize * vbytes, 16); // horizontal simd kernel
auto vk = cast(V[])alignedAllocate(ksize * vbytes, 16); // vertical simd kernel

scope (exit)
scope(...) is a try-catch analog. A betterC library should not have it, though.
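For illustration, a minimal self-contained sketch (hypothetical name, plain C allocator) of the pattern a betterC build prefers - explicit cleanup instead of scope(exit):

extern(C) int filter_entry(size_t ksize)
{
    import core.stdc.stdlib : malloc, free;

    auto hk = cast(float*) malloc(ksize * float.sizeof);
    if (hk is null)
        return -1;

    // ... filtering work ...

    free(hk); // explicit cleanup, no try/finally is emitted
    return 0;
}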
// If this is a SIMD instruction set, allocate vectors for kernels

auto hk = cast(V[])alignedAllocate(ksize * vbytes, 16); // horizontal simd kernel
auto vk = cast(V[])alignedAllocate(ksize * vbytes, 16); // vertical simd kernel
V.sizeof.max(16)
You can multiply a vector by a scalar. No need to allocate memory and fill it with vectors.
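For illustration, a minimal sketch (vector width assumed, broadcast done via initialization) of multiplying a SIMD vector by a kernel coefficient directly:

alias V = __vector(float[8]); // AVX-width single-precision vector (assumed)

V scale(V v, float k)
{
    V kv = k;      // the scalar is broadcast to all lanes
    return v * kv; // element-wise multiply, no kernel-vector allocation needed
}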
foreach (i; 0 .. ksize)
{
    hk[i].array[] = hmask[i];
.array may generate bad code. hk[i] = hmask[i]; should work in LDC and recent DMD.
/// SIMD vector trait - build vector type using bitsize and scalar type.
template Vector(size_t bitsize, T)
{
    alias Vector = __vector(T[(bitsize / 8) / T.sizeof]);
Please use __vector directly instead. This would make the code more readable for newcomers.
source/mir/cv/imgproc/imfilter.d
Outdated
}

// Get pointers to where vectors are loaded (has to be public)
static auto toPtr(ref T e)
Please do not use such hacks if possible
source/mir/cv/imgproc/imfilter.d
Outdated
zip!true(a, t, b)
    .blocks(rtiling, ctiling)
    //.parallel
    .each!((b) {
Please use foreach loops instead of each. They are more readable. each is good when the function is already defined.
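For comparison, a small self-contained sketch (module paths per mir-algorithm, exact versions may differ) of the same kind of traversal written with foreach instead of each:

import mir.ndslice.allocation : slice;
import mir.ndslice.topology : iota;

void doubleAll()
{
    auto img = iota(4, 4).slice; // small 2D slice to work on

    // each!-style: img.each!((ref e) { e *= 2; });
    // foreach-style, arguably easier to read:
    foreach (row; img)
        foreach (ref e; row)
            e *= 2;
}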
inner_filtering_impl!(Non_SIMD!T, apply_horizontal_kernel_7!T, apply_vertical_kernel_7!T)
    (input, temp, hm, vm, output);
break;
default:
Why 3, 5, 7?
We would also need 2, 4, 6, 8 :-)
Please use a static foreach in the switch, with compile-time arguments.
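A sketch of the suggestion (static foreach needs a recent front end; the size-templated apply_*_kernel!(T, n) names are hypothetical, the existing code has apply_*_kernel_7!T etc.):

switch (ksize)
{
    static foreach (n; 2 .. 9) // kernel sizes 2 through 8
    {
        case n:
            inner_filtering_impl!(Non_SIMD!T,
                apply_horizontal_kernel!(T, n), // hypothetical size-templated kernels
                apply_vertical_kernel!(T, n))
                (input, temp, hm, vm, output);
            break;
    }
    default:
        // fall back to the runtime-sized kernel path
        break;
}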
BTW, 2, 3, 4, 5, 6, 7, 8 filters can be optimised with SIMD too.
}

// right columns
if (crb < cols - ksize)
Code duplication. Move it to a separate function.
// L1 cache block tiling (under 8192 bytes)
immutable rtiling = 32;
immutable ctiling = max(size_t(1), 256 / (tbytes * ksize));
These tiling constants should be described.
{
    r += (*(i++)) * (*(k++));
}
o[msize / 2] = r;
Buggy architecture: it should be o = r with a proper ref o by design.
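In other words, a minimal sketch (hypothetical name) of the suggested kernel shape:

pragma(inline, true)
void dot_kernel(T)(const(T)* i, const(T)* k, ref T o, size_t msize)
{
    T r = 0;
    foreach (_; 0 .. msize)
        r += (*(i++)) * (*(k++));
    o = r; // the caller decides which output element this lands in
}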
It was halide-lang. And here is an awesome talk describing filter scheduling schemes in detail.
Awesome - I was hoping you'd say that! :)
You'd have to help me on this one. Could you describe this architecture in more detail, or maybe point me to some sources where I can learn about it?
I agree, and that was the plan. Also, we should implement a kernel separability check in the same package. There is a nice explanation in the docs for MATLAB's function.
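For reference, a rough sketch (not part of this PR; the tolerance handling is simplistic and std.math may need replacing in a betterC build) of the usual rank-1 test: a 2D kernel is separable exactly when it is an outer product of a column and a row, i.e. every 2x2 minor vanishes:

bool isSeparable(T)(const(T)[][] k, T eps = cast(T) 1e-6)
{
    import std.math : abs;

    // pick a pivot: the largest-magnitude element of the kernel
    size_t pr, pc;
    T pv = 0;
    foreach (r, row; k)
        foreach (c, x; row)
            if (abs(x) > abs(pv)) { pv = x; pr = r; pc = c; }
    if (abs(pv) <= eps)
        return true; // all-zero kernel, trivially separable

    // rank 1 <=> k[r][c] * pv == k[r][pc] * k[pr][c] for all r, c
    foreach (r, row; k)
        foreach (c, x; row)
            if (abs(x * pv - k[r][pc] * k[pr][c]) > eps)
                return false;
    return true;
}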
else
{
    // ... otherwise just use input buffers.
    auto hk = hmask._iterator[0 .. ksize];
new syntax: auto hk = hmask.field;
Heh, I hope I will have time to describe my thoughts.
Hey @9il!
Just after we opened mir-cv, the next semester started and I got caught up in the academic meat grinder again. So sorry for the delay.
Anyhow, I wanted to take a shot at separable filtering as my first mir-cv addition. I was hoping you'd take a look at what I've done so far. I think this meets betterC standards (once compiled with the betterC flag and linked into another executable, the separable_imfilter function works as expected). I've started a PR right away to have a dedicated chat space, and hopefully this makes change tracking easier for you. I hope that's ok with you.
Design
I've tried building a filtering framework on which many other filtering algorithms could (hopefully) rely. Having kernels as function pointers of this form would, I believe, be flexible enough to support many other filtering algorithms (Sobel, bilateral, median, etc.). I'm not really sure the current design allows this, but nevertheless, that is the goal.
Implementation
Please note that I'm aware this looks more like 'realC' than 'betterC', but I'm hoping you'll point me to where this can be improved. Regardless, I've implemented basic SIMD support, which shows nice results with AVX and single-precision floating-point vectors on my machine. I've also tried to make the code cache friendly - I'm not sure how I'd test for cache misses, but this scheme is definitely faster than the naive variant, so I'm hopeful I've done it right.
Results
On a 512x512x1 test image, here are some rough results (I'll do real profiling once I'm done with the implementation): conv takes about 4 ms, even though it is parallelized, whereas separable_imfilter is not.
Note: these results are with inlined (template) calls (without linking). I failed to set up link-time optimization with LDC; without it, each 3x3 filtering takes about 1-2 ms. Is this expected, and should LTO fix this?
To be done
Feel free to change anything - just please let me know why the change was made, so I can follow. :)