Separable filtering #3

Open: wants to merge 8 commits into master

Conversation

ljubobratovicrelja (Member) commented May 2, 2017

Hey @9il!

Just after we opened mir-cv, next semester started, and I got caught up in an academic meat grinder again. So sorry for the delay.

Anyhow, I wanted to take a shot at separable filtering as my first mir-cv addition. I was hoping you'd take a look at what I've done so far. I think this meets betterC standards (once compiled with the betterC flag and linked into another executable, the separable_imfilter function works as expected).

I've started a PR right away to have a dedicated chat space, which should also make change tracking easier for you. Hope that's ok with you.

Design

I've tried building a filtering framework that many other filtering algorithms could (hopefully) rely on. Having kernels as function pointers of the form sketched below should, I believe, be flexible enough to support many other filtering algorithms (Sobel, bilateral, median, etc.). I'm not really sure the current design allows all of this yet, but nevertheless, that is the goal.
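
Roughly, this is the kernel shape I have in mind (just a sketch; the alias and parameter names are illustrative, mirroring how the kernels are invoked in the implementation):

// Sketch of the intended kernel form (illustrative, not final API).
// The horizontal pass walks along a row; the vertical pass additionally
// needs the row stride to step between rows.
alias HorizontalKernel(T) = void function(const(T)* input, const(T)* mask, T* output, size_t masksize);
alias VerticalKernel(T) = void function(const(T)* input, const(T)* mask, T* output, size_t masksize, size_t rowstride);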

Implementation

Please note that I'm aware this looks more like 'realC' than 'betterC', but I'm hoping you'll point me to where this can be improved. Regardless, I've implemented basic SIMD support, which shows nice results with AVX and single-precision float vectors on my machine. I've also tried making the code cache friendly - I'm not sure how I'd test for cache misses, but this scheme is definitely faster than the naive variant, so I'm hopeful I've done it right.

Results

On a 512x512x1 test image, here are some rough results (I'll do proper profiling once I'm done with the implementation):

  • 3x3 filtering takes ~400 µs (dcv's conv takes about 4 ms even though it is parallelized, while separable_imfilter is not)
  • 5x5 kernel: ~600 µs (dcv ~8 ms)
  • 7x7: ~800 µs (dcv ~20 ms)
  • 15x15: ~1300 µs, and so on (haven't tested dcv, but I'd say it's at least as slow as in the previous cases).

Note: these results are with inlined (template) calls (without linking). I failed to set up link-time optimization with LDC; without it (plain linking), each 3x3 filtering takes about 1-2 ms. Is this expected, and should LTO fix it?
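
For reference, here is roughly the build type I was attempting for LTO (a sketch; -flto is LDC's LTO switch, and the build-type name is mine), in case you spot what's missing:

"betterC-lto": {
"buildOptions": ["noBoundsCheck", "releaseMode", "optimize", "inline"],
"dflags-ldc": ["-betterC", "-mcpu=native", "-enable-cross-module-inlining", "-flto=thin"]
}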

To be done

  • border handling (so far only replication is done)
  • test the filtering framework with other algorithms (Sobel and other filters where a specialized kernel can be made, bilateral, median, etc.)
  • plain (non-separable) filtering
  • making C headers, testing by linking against a C program
  • there will surely be more...

Feel free to change anything - just please let me know the reason for each change so I can follow. :)

ljubobratovicrelja self-assigned this May 2, 2017
ljubobratovicrelja requested a review from 9il May 2, 2017 22:42
9il (Member) left a comment:

Nice to see the first PR! Thank you!

"buildTypes": {
"betterC": {
"buildOptions": ["noBoundsCheck", "releaseMode", "optimize", "inline"],
"dflags-ldc": ["-betterC", "-mcpu=native", "-enable-cross-module-inlining"]
9il (Member):

👍

/+
Module contains memory handling routines used throughout the library.
+/
module mir.cv.core.memory;
9il (Member):

Could you please create a repo mir-rt and copy glas/internal/memory there, to be reused both in mir-cv and mir-glas? It already contains the required pragmas.

ljubobratovicrelja (Member Author):

This is great, will do!

cpuid_x86_any_init();
}
}
static Init_cpu_id __cpuid_init;
9il (Member):

We need this to get the L1/L2/L3 cache sizes in the future. At the same time, I remember Walter and Andrei proposed extern(C) shared static this() module constructors that would work in betterC.

Utilities to help use SIMD vectors.
+/

module mir.cv.core.simd;
9il (Member):

I would propose another approach, the one used in mir-glas; see glas/internal/config.

The reason is that mir-cv is open source and Boost-licensed, and D compilers are fast. LDC supports compile-time traits for CPU information (see glas internal config). Let's move forward with LDC for betterC libraries. Front-end DCV users would still be able to use DMD because of the extern(C) interface.
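
Something along these lines, for example (a sketch; it assumes LDC's targetHasFeature trait, and hasAVX is just an illustrative name):

version (LDC)
{
    // compile-time check of the target's AVX support instead of a runtime cpuid branch
    enum hasAVX = __traits(targetHasFeature, "avx");
}
else
{
    enum hasAVX = false;
}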

9il (Member):

We are able to compile multiple mir-cv dynamic backends for different CPUs.

ljubobratovicrelja (Member Author):

Front-end DCV users would still be able to use DMD because of the extern(C) interface.

I think you meant to say 'front-end DCV users would be able to use mir-cv...'? If that's the case, I concur; I understood that was the idea from the beginning.

I would propose another approach, the one used in mir-glas; see glas/internal/config.

Nice! It's really cool that there are already some modules we can reuse across the mir libs. I'll study this module and see how I can fit it in here. Should this also be copied to mir-rt?

9il (Member):

  1. Yes
  2. Yes, but only partially, and to mir-algorithm, because it is compile-time logic. We need lists of the available vector lengths for different types.

}

pragma(inline, true)
auto max(T)(T t1, T t2) {

if (avx) // add version (compile time flag if avx should be supported)
{
inner_filtering_impl!(AVX!T)(input, temp, hm, vm, output,
&apply_horizontal_kernel_simd!(AVX!T), &apply_vertical_kernel_simd!(AVX!T));
9il (Member):

apply_vertical_kernel_simd is marked pragma(inline, true). If you want apply_vertical_kernel_simd to be inlined, it is more intuitive to pass it as a template parameter.
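
Something like this (a sketch; the signature is illustrative):

// sketch: kernels as alias template parameters, resolved and inlined at compile time
void inner_filtering_impl(V, alias hskernel, alias vskernel, Args...)(Args args)
{
    // same loop body as now, but the hskernel(...) / vskernel(...) calls can inline
}

// the call site then becomes:
inner_filtering_impl!(AVX!T, apply_horizontal_kernel_simd!(AVX!T), apply_vertical_kernel_simd!(AVX!T))
    (input, temp, hm, vm, output);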

b = output[rrb .. $, 0 .. $].map!toPtr;

zip!true(a, t)[0 .. $, 0 .. $ - hsize].each!((w) { hskernel(w.a, hmask._iterator, w.b, hsize); });
zip!true(t, b)[0 .. $ - vsize, 0 .. $ - hsize / 2].each!((w) { vskernel(w.a, vmask._iterator, w.b, vsize, rowstride); });
9il (Member):

zip!true assumes that the universal slices have the same strides. Is that true?

9il (Member):

Ah, yes. This is true.

Slice!(Universal, [2], InstructionSet.Scalar*) input, Slice!(Universal,
[2], InstructionSet.Scalar*) temp, Slice!(Universal, [1],
InstructionSet.Scalar*) hmask, Slice!(Universal, [1],
InstructionSet.Scalar*) vmask, Slice!(Universal, [2],
9il (Member):

Can we have Contiguous input and use universal internally, since this function assumes that the slices have the same strides?

ljubobratovicrelja (Member Author):

Well, to be honest, it's easier for me to mark those Universal at the top and never have to worry whether some call (such as strided) will return another kind of slice later on. But anyway, I'll change those to Contiguous - I assume it documents the code better and also makes it more type safe (less error prone), even though this is not an API function.
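
Something like this, I suppose (a sketch; the iterator type and names are illustrative, following the signature quoted above):

// sketch: Contiguous slices in the signature, relaxed to Universal only internally
void inner_filtering_impl(T)(
    Slice!(Contiguous, [2], T*) input,
    Slice!(Contiguous, [2], T*) temp,
    Slice!(Contiguous, [1], T*) hmask,
    Slice!(Contiguous, [1], T*) vmask,
    Slice!(Contiguous, [2], T*) output)
{
    import mir.ndslice.topology : universal;
    auto a = input.universal; // take a strided (Universal) view only where it is actually needed
    // ...
}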

}
else
{
inner_filtering_impl!(Non_SIMD!T)(input, temp, hm, vm, output,
get_horizontal_kernel_for_mask!T(hsize), get_vertical_kernel_for_mask!T(vsize));
switch (ksize) {
ljubobratovicrelja (Member Author):

@9il do you know of a nicer way to write this? (to avoid the switch)

ljubobratovicrelja (Member Author):

@9il I've created mir-rt and added glas.internal.memory and glas.internal.config as instructed, but can I ask you to take care of things like descriptions, and to check/add details like copyright and license files, etc.? Let me know when you're OK with making a release and registering it on dub.

9il (Member) commented May 4, 2017

Thanks @ljubobratovicrelja. Yes. I will remove config because it is a compile-time feature and should go to mir-algorithm.

9il (Member) commented May 4, 2017

I will try to make memory a template source file and include it into mir-algorithm, as well as config. Sorry about mir-rt; we will use this repo for non-templated code.

BTW, we need to think about completely generic Allocators that would not have an instance field.

9il (Member) left a comment:

You sent me a link to a special language once; I do not remember which one. But I am sure the current code can be improved several times over in terms of performance. Please use a multilevel kernel architecture: row level, column level, cache (block) level. Please review how it is done in that language for image processing.

There should be an API available for separate column processing and row processing, and separate ones for the common convolution and the border convolution.

auto hk = cast(V[])alignedAllocate(ksize * vbytes, 16); // horizontal simd kernel
auto vk = cast(V[])alignedAllocate(ksize * vbytes, 16); // vertical simd kernel

scope (exit)
9il (Member):

scope(...) is a try-catch analog. A betterC library should not have it, though.
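
For example (a sketch; deallocate here stands for whatever routine pairs with alignedAllocate in this module):

// sketch: no scope(exit) in betterC; release the buffers explicitly before returning
auto hk = cast(V[]) alignedAllocate(ksize * vbytes, 16);
auto vk = cast(V[]) alignedAllocate(ksize * vbytes, 16);
// ... filtering work ...
deallocate(vk); // assumed counterpart of alignedAllocate
deallocate(hk);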

// If this is a SIMD instruction set, allocate vectors for kernels

auto hk = cast(V[])alignedAllocate(ksize * vbytes, 16); // horizontal simd kernel
auto vk = cast(V[])alignedAllocate(ksize * vbytes, 16); // vertical simd kernel
9il (Member):

V.sizeof.max(16)

9il (Member):

You can multiply a vector by a scalar. No need to allocate memory and fill it with vectors.
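
For instance (a sketch; acc and srcVec are illustrative loop variables):

// sketch: broadcast the scalar coefficient into a vector on the fly,
// instead of pre-filling hk/vk with repeated values
V m = hmask[i];    // scalar initialization of a __vector broadcasts the value to all lanes
acc += srcVec * m; // element-wise multiply-accumulate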


foreach (i; 0 .. ksize)
{
hk[i].array[] = hmask[i];
9il (Member):

.array may generate bad code. hk[i] = hmask[i]; should work in LDC and recent DMD.

/// SIMD vector trait - build vector type using bitsize and scalar type.
template Vector(size_t bitsize, T)
{
alias Vector = __vector(T[(bitsize / 8) / T.sizeof]);
9il (Member):

Please use __vector directly instead. This would make the code more readable for newcomers.

}

// Get pointers to where vectors are loaded (has to be public)
static auto toPtr(ref T e)
9il (Member):

Please do not use such hacks if possible

zip!true(a, t, b)
.blocks(rtiling, ctiling)
//.parallel
.each!((b) {
9il (Member):

Please use foreach loops instead of each. They are more readable; each is good when the function is already defined.
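
That is, roughly (a sketch; the loop variable names are illustrative):

// sketch: nested foreach over the blocked view instead of .each!
auto blocked = zip!true(a, t, b).blocks(rtiling, ctiling);
foreach (blockRow; blocked)
    foreach (blk; blockRow)
    {
        // process one cache-sized tile here
    }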

inner_filtering_impl!(Non_SIMD!T, apply_horizontal_kernel_7!T, apply_vertical_kernel_7!T)
(input, temp, hm, vm, output);
break;
default:
9il (Member):

Why 3, 5, 7?

9il (Member):

We would also need 2, 4, 6, 8 :-)

9il (Member):

Please use static foreach in the switch and compile-time arguments.
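
Roughly (a sketch; the size-templated kernel names are illustrative generalizations of the existing apply_*_kernel_7!T specializations):

switch (ksize)
{
    // generate one case per supported compile-time kernel size
    static foreach (n; [2, 3, 4, 5, 6, 7, 8])
    {
        case n:
            inner_filtering_impl!(Non_SIMD!T,
                apply_horizontal_kernel!(T, n), // hypothetical size-templated kernels
                apply_vertical_kernel!(T, n))
                (input, temp, hm, vm, output);
            break;
    }
    default:
        // keep the existing generic (runtime mask size) path here
        break;
}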

9il (Member):

BTW, the 2, 3, 4, 5, 6, 7, 8 filters can be optimised with SIMD too.

}

// right columns
if (crb < cols - ksize)
9il (Member):

Code duplication. Move it into a separate function.


// L1 cache block tiling (under 8192 bytes)
immutable rtiling = 32;
immutable ctiling = max(size_t(1), 256 / (tbytes * ksize));
9il (Member):

These tiling constants should be described.

{
r += (*(i++)) * (*(k++));
}
o[msize / 2] = r;
9il (Member):

Architecture issue: it should be o = r, with o passed as a proper ref, by design.

ljubobratovicrelja (Member Author):

You sent me a link to a special language once; I do not remember which one.

It was halide-lang. And here is an awesome talk describing filter scheduling schemes in detail.

But I am sure the current code can be improved several times over in terms of performance.

Awesome, I was hoping you'd say that! :)

Please use a multilevel kernel architecture: row level, column level, cache (block) level.

You'd have to help me on this one. Could you describe this architecture in more detail, or maybe point me to some sources where I can learn about it?

There should be an API available for separate column processing and row processing, and separate ones for the common convolution and the border convolution.

I agree, and that was the plan. We should also implement a kernel separability check in the same package. There is a nice explanation on MATLAB's function.

else
{
// ... otherwise just use input buffers.
auto hk = hmask._iterator[0 .. ksize];
9il (Member):

new syntax: auto hk = hmask.field;

9il (Member) commented May 11, 2017

Heh, I hope I will have time to describe my thoughts.
BTW, this may be helpful: http://docs.algorithm.dlang.io/latest/mir_ndslice_topology.html#.slide
