Separable filtering #3

Open: wants to merge 8 commits into master

Conversation

ljubobratovicrelja (Member) commented May 2, 2017

Hey @9il!

Just after we opened mir-cv, next semester started, and I got caught up in an academic meat grinder again. So sorry for the delay.

Anyhow, I wanted to take a shot at separable filtering as my first mir-cv addition. I was hoping you'd take a look at what I've done so far. I think this meets betterC standards (once compiled with the betterC flag and linked into another executable, the separable_imfilter function works as expected).

I've started a PR right away to have a dedicated chat space, which should also make change tracking easier for you. Hope that's ok with you.

Design

I've tried building a filtering framework that many other filtering algorithms could (hopefully) rely on. Having kernels as function pointers of the form sketched below should, I believe, be flexible enough to support many other filtering algorithms (Sobel, bilateral, median, etc.). I'm not really sure the current design allows all of this yet, but nevertheless, that is the goal.
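
Roughly, this is the kernel shape I have in mind (just a sketch; the alias and parameter names are illustrative, mirroring how the kernels are invoked in the implementation):

// Sketch of the intended kernel form (illustrative, not final API).
// The horizontal pass walks along a row; the vertical pass additionally
// needs the row stride to step between rows.
alias HorizontalKernel(T) = void function(const(T)* input, const(T)* mask, T* output, size_t masksize);
alias VerticalKernel(T) = void function(const(T)* input, const(T)* mask, T* output, size_t masksize, size_t rowstride);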

Implementation

Please note that I'm aware this looks more like 'realC' than 'betterC', but I'm hoping you'll point me to where this can be improved. Regardless, I've implemented basic SIMD support, which shows nice results with AVX and single-precision float vectors on my machine. I've also tried making the code cache friendly - I'm not sure how I'd test for cache misses, but this scheme is definitely faster than the naive variant, so I'm hopeful I've done it right.

Results

On a 512x512x1 test image, here are some rough results (I'll do proper profiling once I'm done with the implementation):

  • 3x3 filtering takes ~400 µs (dcv's conv takes about 4 ms even though it is parallelized, while separable_imfilter is not)
  • 5x5 kernel: ~600 µs (dcv ~8 ms)
  • 7x7: ~800 µs (dcv ~20 ms)
  • 15x15: ~1300 µs, and so on (haven't tested dcv, but I'd say it's at least as slow as in the previous cases).

Note: these results are with inlined (template) calls (without linking). I failed to set up link-time optimization with LDC; without it (plain linking), each 3x3 filtering takes about 1-2 ms. Is this expected, and should LTO fix it?
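
For reference, here is roughly the build type I was attempting for LTO (a sketch; -flto is LDC's LTO switch, and the build-type name is mine), in case you spot what's missing:

"betterC-lto": {
"buildOptions": ["noBoundsCheck", "releaseMode", "optimize", "inline"],
"dflags-ldc": ["-betterC", "-mcpu=native", "-enable-cross-module-inlining", "-flto=thin"]
}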

To be done

  • border handling (so far only replication is done)
  • test the filtering framework with other algorithms (Sobel and other filters where a specialized kernel can be made, bilateral, median, etc.)
  • plain (non-separable) filtering
  • making C headers, testing by linking against a C program
  • there will surely be more...

Feel free to change anything - just please let me know the reason for each change so I can follow. :)

ljubobratovicrelja self-assigned this May 2, 2017
ljubobratovicrelja requested a review from 9il May 2, 2017 22:42
9il (Member) left a comment:

Nice to see the first PR! Thank you!

"buildTypes": {
"betterC": {
"buildOptions": ["noBoundsCheck", "releaseMode", "optimize", "inline"],
"dflags-ldc": ["-betterC", "-mcpu=native", "-enable-cross-module-inlining"]
9il (Member):

👍

/+
Module contains memory handling routines used throughout the library.
+/
module mir.cv.core.memory;
9il (Member):

Could you please create a repo mir-rt and copy glas/internal/memory there, to be reused both in mir-cv and mir-glas? It already contains the required pragmas.

ljubobratovicrelja (Member Author):

This is great, will do!

cpuid_x86_any_init();
}
}
static Init_cpu_id __cpuid_init;
9il (Member):

We need this to get the L1/L2/L3 cache sizes in the future. At the same time, I remember Walter and Andrei proposed extern(C) shared static this() module constructors that would work in betterC.

Utilities to help use SIMD vectors.
+/

module mir.cv.core.simd;
9il (Member):

I would propose another approach, the one used in mir-glas; see glas/internal/config.

The reason is that mir-cv is open source and Boost-licensed, and D compilers are fast. LDC supports compile-time traits for CPU information (see glas internal config). Let's move forward with LDC for betterC libraries. Front-end DCV users would still be able to use DMD because of the extern(C) interface.
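
Something along these lines, for example (a sketch; it assumes LDC's targetHasFeature trait, and hasAVX is just an illustrative name):

version (LDC)
{
    // compile-time check of the target's AVX support instead of a runtime cpuid branch
    enum hasAVX = __traits(targetHasFeature, "avx");
}
else
{
    enum hasAVX = false;
}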

9il (Member):

We are able to compile multiple mir-cv dynamic backends for different CPUs.

ljubobratovicrelja (Member Author):

Front-end DCV users would still be able to use DMD because of the extern(C) interface.

I think you meant to say 'front-end DCV users would be able to use mir-cv...'? If that's the case, I concur; I understood that was the idea from the beginning.

I would propose another approach, the one used in mir-glas; see glas/internal/config.

Nice! It's really cool that there are already some modules we can reuse across the mir libs. I'll study this module and see how I can fit it in here. Should this also be copied to mir-rt?

9il (Member):

  1. Yes
  2. Yes, but only partially, and to mir-algorithm, because it is compile-time logic. We need lists of the available vector lengths for different types.

}

pragma(inline, true)
auto max(T)(T t1, T t2) {

if (avx) // add version (compile time flag if avx should be supported)
{
inner_filtering_impl!(AVX!T)(input, temp, hm, vm, output,
&apply_horizontal_kernel_simd!(AVX!T), &apply_vertical_kernel_simd!(AVX!T));
9il (Member):

apply_vertical_kernel_simd is marked pragma(inline, true). If you want apply_vertical_kernel_simd to be inlined, it is more intuitive to pass it as a template parameter.
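
Something like this (a sketch; the signature is illustrative):

// sketch: kernels as alias template parameters, resolved and inlined at compile time
void inner_filtering_impl(V, alias hskernel, alias vskernel, Args...)(Args args)
{
    // same loop body as now, but the hskernel(...) / vskernel(...) calls can inline
}

// the call site then becomes:
inner_filtering_impl!(AVX!T, apply_horizontal_kernel_simd!(AVX!T), apply_vertical_kernel_simd!(AVX!T))
    (input, temp, hm, vm, output);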

b = output[rrb .. $, 0 .. $].map!toPtr;

zip!true(a, t)[0 .. $, 0 .. $ - hsize].each!((w) { hskernel(w.a, hmask._iterator, w.b, hsize); });
zip!true(t, b)[0 .. $ - vsize, 0 .. $ - hsize / 2].each!((w) { vskernel(w.a, vmask._iterator, w.b, vsize, rowstride); });
9il (Member):

zip!true assumes that the universal slices have the same strides. Is that true?

9il (Member):

Ah, yes. This is true.

Slice!(Universal, [2], InstructionSet.Scalar*) input, Slice!(Universal,
[2], InstructionSet.Scalar*) temp, Slice!(Universal, [1],
InstructionSet.Scalar*) hmask, Slice!(Universal, [1],
InstructionSet.Scalar*) vmask, Slice!(Universal, [2],
9il (Member):

Can we have Contiguous input and use universal internally, since this function assumes that the slices have the same strides?

ljubobratovicrelja (Member Author):

Well, to be honest, it's easier for me to mark those Universal at the top and never have to worry whether some call (such as strided) will return another kind of slice later on. But anyway, I'll change those to Contiguous - I assume it documents the code better and also makes it more type safe (less error prone), even though this is not an API function.
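
Something like this, I suppose (a sketch; the iterator type and names are illustrative, following the signature quoted above):

// sketch: Contiguous slices in the signature, relaxed to Universal only internally
void inner_filtering_impl(T)(
    Slice!(Contiguous, [2], T*) input,
    Slice!(Contiguous, [2], T*) temp,
    Slice!(Contiguous, [1], T*) hmask,
    Slice!(Contiguous, [1], T*) vmask,
    Slice!(Contiguous, [2], T*) output)
{
    import mir.ndslice.topology : universal;
    auto a = input.universal; // take a strided (Universal) view only where it is actually needed
    // ...
}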

}
else
{
inner_filtering_impl!(Non_SIMD!T)(input, temp, hm, vm, output,
get_horizontal_kernel_for_mask!T(hsize), get_vertical_kernel_for_mask!T(vsize));
switch (ksize) {
ljubobratovicrelja (Member Author):

@9il do you know of a nicer way to write this? (to avoid the switch)

ljubobratovicrelja (Member Author):

@9il I've created mir-rt and added glas.internal.memory and glas.internal.config as instructed, but can I ask you to take care of things like descriptions, and to check/add details like copyright and license files, etc.? Let me know when you're OK with making a release and registering it on dub.

9il (Member) commented May 4, 2017

Thanks @ljubobratovicrelja. Yes. I will remove config because it is a compile-time feature and should go to mir-algorithm.

9il (Member) commented May 4, 2017

I will try to make memory a template source file and include it into mir-algorithm, as well as config. Sorry about mir-rt; we will use this repo for non-templated code.

BTW, we need to think about completely generic Allocators that would not have an instance field.

9il (Member) left a comment:

You sent me a link to a special language once; I do not remember which one. But I am sure the current code can be improved several times over in terms of performance. Please use a multilevel kernel architecture: row level, column level, cache (block) level. Please review how it is done in that language for image processing.

There should be an API available for separate column processing and row processing, and separate ones for the common convolution and the border convolution.

auto hk = cast(V[])alignedAllocate(ksize * vbytes, 16); // horizontal simd kernel
auto vk = cast(V[])alignedAllocate(ksize * vbytes, 16); // vertical simd kernel

scope (exit)
9il (Member):

scope(...) is a try-catch analog. A betterC library should not have it, though.
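
For example (a sketch; deallocate here stands for whatever routine pairs with alignedAllocate in this module):

// sketch: no scope(exit) in betterC; release the buffers explicitly before returning
auto hk = cast(V[]) alignedAllocate(ksize * vbytes, 16);
auto vk = cast(V[]) alignedAllocate(ksize * vbytes, 16);
// ... filtering work ...
deallocate(vk); // assumed counterpart of alignedAllocate
deallocate(hk);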

// If this is a SIMD instruction set, allocate vectors for kernels

auto hk = cast(V[])alignedAllocate(ksize * vbytes, 16); // horizontal simd kernel
auto vk = cast(V[])alignedAllocate(ksize * vbytes, 16); // vertical simd kernel
9il (Member):

V.sizeof.max(16)

9il (Member):

You can multiply a vector by a scalar. No need to allocate memory and fill it with vectors.
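
For instance (a sketch; acc and srcVec are illustrative loop variables):

// sketch: broadcast the scalar coefficient into a vector on the fly,
// instead of pre-filling hk/vk with repeated values
V m = hmask[i];    // scalar initialization of a __vector broadcasts the value to all lanes
acc += srcVec * m; // element-wise multiply-accumulate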


foreach (i; 0 .. ksize)
{
hk[i].array[] = hmask[i];
9il (Member):

.array may generate bad code. hk[i] = hmask[i]; should work in LDC and recent DMD.

/// SIMD vector trait - build vector type using bitsize and scalar type.
template Vector(size_t bitsize, T)
{
alias Vector = __vector(T[(bitsize / 8) / T.sizeof]);
9il (Member):

Please use __vector directly instead. This would make the code more readable for newcomers.

}

// Get pointers to where vectors are loaded (has to be public)
static auto toPtr(ref T e)
9il (Member):

Please do not use such hacks if possible

zip!true(a, t, b)
.blocks(rtiling, ctiling)
//.parallel
.each!((b) {
9il (Member):

Please use foreach loops instead of each. They are more readable; each is good when the function is already defined.
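
That is, roughly (a sketch; the loop variable names are illustrative):

// sketch: nested foreach over the blocked view instead of .each!
auto blocked = zip!true(a, t, b).blocks(rtiling, ctiling);
foreach (blockRow; blocked)
    foreach (blk; blockRow)
    {
        // process one cache-sized tile here
    }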

inner_filtering_impl!(Non_SIMD!T, apply_horizontal_kernel_7!T, apply_vertical_kernel_7!T)
(input, temp, hm, vm, output);
break;
default:
9il (Member):

Why 3, 5, 7?

9il (Member):

We would also need 2, 4, 6, 8 :-)

9il (Member):

Please use static foreach in the switch and compile-time arguments.
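
Roughly (a sketch; the size-templated kernel names are illustrative generalizations of the existing apply_*_kernel_7!T specializations):

switch (ksize)
{
    // generate one case per supported compile-time kernel size
    static foreach (n; [2, 3, 4, 5, 6, 7, 8])
    {
        case n:
            inner_filtering_impl!(Non_SIMD!T,
                apply_horizontal_kernel!(T, n), // hypothetical size-templated kernels
                apply_vertical_kernel!(T, n))
                (input, temp, hm, vm, output);
            break;
    }
    default:
        // keep the existing generic (runtime mask size) path here
        break;
}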

9il (Member):

BTW, the 2, 3, 4, 5, 6, 7, 8 filters can be optimised with SIMD too.

}

// right columns
if (crb < cols - ksize)
9il (Member):

Code duplication. Move it into a separate function.


// L1 cache block tiling (under 8192 bytes)
immutable rtiling = 32;
immutable ctiling = max(size_t(1), 256 / (tbytes * ksize));
9il (Member):

These tiling constants should be described.

{
r += (*(i++)) * (*(k++));
}
o[msize / 2] = r;
9il (Member):

Architecture issue: it should be o = r, with o passed as a proper ref, by design.

ljubobratovicrelja (Member Author):

You sent me a link to a special language once; I do not remember which one.

It was halide-lang. And here is an awesome talk describing filter scheduling schemes in detail.

But I am sure the current code can be improved several times over in terms of performance.

Awesome, I was hoping you'd say that! :)

Please use a multilevel kernel architecture: row level, column level, cache (block) level.

You'd have to help me on this one. Could you describe this architecture in more detail, or maybe point me to some sources where I can learn about it?

There should be an API available for separate column processing and row processing, and separate ones for the common convolution and the border convolution.

I agree, and that was the plan. We should also implement a kernel separability check in the same package. There is a nice explanation on MATLAB's function.

else
{
// ... otherwise just use input buffers.
auto hk = hmask._iterator[0 .. ksize];
9il (Member):

new syntax: auto hk = hmask.field;

9il (Member) commented May 11, 2017

Heh, I hope I will have time to describe my thoughts.
BTW, this may be helpful: http://docs.algorithm.dlang.io/latest/mir_ndslice_topology.html#.slide
