[FEATURE REQUEST] Add DCT #702

Open · zeroby0 opened this issue Jan 3, 2025 · 6 comments
Labels
enhancement New feature or request

Comments

zeroby0 commented Jan 3, 2025

Add DCT-2 (Type-II Discrete Cosine Transform) functionality to ulab, similar to SciPy's implementation. DCT is widely used in image and audio processing applications.

Wikipedia: https://en.wikipedia.org/wiki/Discrete_cosine_transform
SciPy: https://docs.scipy.org/doc/scipy/reference/generated/scipy.fftpack.dct.html

Technical Details

  • Implementation can be based on FFT using one of four established methods (detailed in this DSP StackExchange post)
  • I've benchmarked all four methods on a Raspberry Pi Pico using the 2025-01-01 micropython-builder build
  • Benchmark results and the code are at the end of this post

Benefits

  • Essential transform for signal processing applications
  • Minimal code footprint when implemented using existing FFT
  • Potential for performance improvement with C implementation (as opposed to users writing their own DCT)
  • Aligns with ulab's goal of providing NumPy/SciPy-like functionality

Benchmark

  • DCT_2N_M is the mirrored version.
  • DCT_2N_P is the padded version.
  • EXPRANGE is the time taken for L = 2 * np.exp(...) (see the code below), which can be memoized.
| N | DCT_4N | DCT_2N_M | DCT_2N_P | DCT_N | FFT | EXPRANGE |
|---:|---:|---:|---:|---:|---:|---:|
| 4 | 1.455 | 1.976 | 2.026 | 2.076 | 0.462 | 0.992 |
| 8 | 2.136 | 2.541 | 2.568 | 2.476 | 0.678 | 1.199 |
| 16 | 3.618 | 3.691 | 3.741 | 3.293 | 1.096 | 1.654 |
| 32 | 6.819 | 6.125 | 6.244 | 4.923 | 1.735 | 2.532 |
| 64 | 13.798 | 11.278 | 11.448 | 8.36 | 3.23 | 4.312 |
| 128 | 28.994 | 22.136 | 22.56 | 15.504 | 6.481 | 7.843 |
| 256 | 61.7 | 45.2 | 47.0 | 30.6 | 13.2 | 14.6 |
| 512 | 133.5 | 96.0 | 98.5 | 62.8 | 28.6 | 29.5 |
| 1024 | 284.9 | 201.0 | 205.3 | 132.6 | 62.1 | 59.1 |
| 2048 | 611.8 | 416.7 | 425.3 | 268.6 | 134.5 | 117.2 |

All times are in milliseconds, averaged over 10 runs.

I would recommend DCT_N (DCT_FFTN in the code), with L = 2 * np.exp(...) memoized for the values of N used in the last few calls. Of the four methods, it requires the smallest FFT, and its memoized L cache occupies the least memory.

If there is interest and drive to implement this feature, I will help find FFT implementations for the other forms of DCT. I would implement this and create a PR, but wanted to test the waters first: is there interest in the community for this?

Code

from ulab import numpy as np
from time import ticks_ms, ticks_diff


def timer(func):
    def wrapper(*args, **kwargs):
        start = ticks_ms()
        for i in range(10):
            result = func(*args, **kwargs)
        end = ticks_ms()
        print(f"[] {func.__name__} took {ticks_diff(end, start)/10} milliseconds")
        return result
    return wrapper


@timer
def FFT(x):
    return np.fft.fft(x)


@timer
def DCT_FFT4N(x):
    N = x.shape[-1]

    u = np.zeros(4 * N)
    u[1:2*N:2] = x
    u[2*N+1::2] = x[::-1]

    U = np.fft.fft(u)[:N]
    return U.real


@timer
def DCT_FFT2NM(x):
    N = x.shape[-1]

    y = np.zeros(2*N)
    y[:N] = x
    y[N:] = x[::-1]

    Y = np.fft.fft(y)[:N]

    L = np.exp(-1j*np.pi*np.arange(N)/(2*N))

    return (Y.real * L.real) - (Y.imag * L.imag)


@timer
def DCT_FFT2NP(x):
    N = x.shape[-1]

    y = np.zeros(2*N)
    y[:N] = x

    Y = np.fft.fft(y)[:N]

    L = 2 * np.exp(-1j*np.pi*np.arange(N)/(2*N))

    return (Y.real * L.real) - (Y.imag * L.imag)


@timer
def DCT_FFTN(x):
    N = x.shape[-1]

    v = np.zeros(x.shape, dtype=x.dtype)
    v[:(N-1)//2+1] = x[::2]

    if N % 2: # odd length
        v[(N-1)//2+1:] = x[-2::-2]
    else: # even length
        v[(N-1)//2+1:] = x[::-2]

    V = np.fft.fft(v)

    L = 2 * np.exp(-1j*np.pi*np.arange(N)/(2*N))

    return (V.real * L.real) - (V.imag * L.imag)


@timer
def EXPRANGE(N):
    L = 2 * np.exp(-1j*np.pi*np.arange(N)/(2*N))
    return L


for i in range(2, 12):

    P = 2**i

    x = np.arange(P)

    print(f"\nP = {P}")
    y = FFT(x)
    y = DCT_FFT4N(x)
    y = DCT_FFT2NM(x)
    y = DCT_FFT2NP(x)
    y = DCT_FFTN(x)

    z = EXPRANGE(P)

zeroby0 added the enhancement label on Jan 3, 2025

zeroby0 (Author) commented Jan 3, 2025

People tend to run the same-sized DCT many times in their code, so we can alternatively do

def mkDCT(N):

    L_N = 2 * np.exp(-1j*np.pi*np.arange(N)/(2*N))
    def DCT_N(x):
        y = np.zeros(x.shape)

        y[:(N-1)//2+1] = x[::2]

        if N % 2: # odd length
            y[(N-1)//2+1:] = x[-2::-2]
        else: # even length
            y[(N-1)//2+1:] = x[::-2]

        Y = np.fft.fft(y)

        return (Y.real * L_N.real) - (Y.imag * L_N.imag)

    return DCT_N

So we can have memoization, and also have it garbage collected when it's not needed. This breaks correspondence with the scipy API though.

v923z (Owner) commented Jan 3, 2025

If compatibility with scipy is a concern, you should implement this as part of the utils module. I would be a bit cautious with the benchmarks above, for the simple reason that all your functions generate a lot of one-time-use artifacts, thus, I believe that the real gain is actually larger. You can look at the implementation of https://micropython-ulab.readthedocs.io/en/latest/ulab-utils.html#spectrogram, where you can supply pre-allocated RAM to the method.

Having said that, if you think that there is interest in this function, I would be happy to incorporate it here. If you want to boost the visibility of the issue, you can also post in https://github.com/v923z/micropython-ulab/discussions.

zeroby0 (Author) commented Jan 4, 2025

It would be very nice if this could be even faster! I've been looking at the source and documentation of spectrogram and FFT, trying to find a way to avoid creating the y variable, but I haven't figured it out yet. However, y can be shared across DCT calls.

That said, it seems most of the time is spent in the FFT and in generating L_N with the exponential. The gain from removing the y variable might be less than 5%. But I would love to be proven wrong here! Even if the improvement is small, I love seeing clever code!

Do you have any comment on why the FFT is taking so long? The Pico has hardware floating-point support, so I expected it to be much faster than the ulab_samples benchmarks. NVM, the Pico's RP2040 doesn't have an FPU.

| Operation | Time |
|---|---:|
| FFT | 59.3 ms |
| Exponent generation | 58.2 ms |
| Complex multiplication | 8.6 ms |
| Zero allocation | 0.1 ms |
| Rest of the code | 2.7 ms |
def mkDCT(N):

    L_N = 2 * np.exp(-1j*np.pi*np.arange(N)/(2*N))
    y   = np.zeros(N)

    @timer
    def DCT_N(x):

        y[:(N-1)//2+1] = x[::2]

        if N % 2: # odd length
            y[(N-1)//2+1:] = x[-2::-2]
        else: # even length
            y[(N-1)//2+1:] = x[::-2]

        Y = np.fft.fft(y)

        return (Y.real * L_N.real) - (Y.imag * L_N.imag)

    return DCT_N
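The per-operation breakdown in the table above could be reproduced with a sketch along these lines. The helper name and structure are hypothetical, not from the thread; it falls back to time.perf_counter so it also runs under CPython/NumPy:

```python
import numpy as np  # "from ulab import numpy as np" on the device

try:
    from time import ticks_us, ticks_diff  # MicroPython
except ImportError:
    import time
    ticks_us = lambda: int(time.perf_counter() * 1_000_000)
    ticks_diff = lambda end, start: end - start

def profile_stages(x, reps=10):
    """Time each stage of the N-point-FFT DCT separately (microseconds)."""
    N = x.shape[-1]
    stages = {}

    t0 = ticks_us()
    for _ in range(reps):
        v = np.zeros(N)
    stages["zero allocation"] = ticks_diff(ticks_us(), t0) / reps

    # Reorder once, so the FFT timing below sees realistic data.
    v[:(N - 1) // 2 + 1] = x[::2]
    v[(N - 1) // 2 + 1:] = x[-2::-2] if N % 2 else x[::-2]

    t0 = ticks_us()
    for _ in range(reps):
        V = np.fft.fft(v)
    stages["FFT"] = ticks_diff(ticks_us(), t0) / reps

    t0 = ticks_us()
    for _ in range(reps):
        L = 2 * np.exp(-1j * np.pi * np.arange(N) / (2 * N))
    stages["exponent generation"] = ticks_diff(ticks_us(), t0) / reps

    t0 = ticks_us()
    for _ in range(reps):
        out = (V.real * L.real) - (V.imag * L.imag)
    stages["complex multiplication"] = ticks_diff(ticks_us(), t0) / reps

    return stages
```

Calling profile_stages(np.arange(2048.0)) on the Pico should roughly reproduce the FFT/exponent split shown above; absolute numbers will differ by hardware and build.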

v923z (Owner) commented Jan 4, 2025

> It would be very nice if this can be even faster! I've been looking at the source and documentation spectrogram and FFT, and figuring out a way to not create the y variable, but haven't figured it out yet. However, y can be shared across DCT calls.

I'm not quite sure I follow: in the FFT, you return a new array, because you want to leave x alone. So, can't you just do the assignment in C like here?

        *tmp++ = func2(array);
        array += in2->strides[ULAB_MAX_DIMS - 1];
    }
    tmp -= len;
} else {
    // if there is only one input argument, clear the imaginary part
    memset(tmp, 0, len * sizeof(mp_float_t));
}
tmp -= len;
fft_kernel(tmp, tmp + len, len, 1);

> However, it seems most of the time is spent in FFT and in generating the L_N with the exponent.

On a microcontroller, that's not the most efficient code: you calculate the exponential of a purely imaginary number, and then you also take the real and imaginary parts at the very end of your function. The same can be achieved by taking the sin and cos of the same array. You can save a lot by that. Standard math functions accept the out keyword argument, which means that you can also re-use RAM, if you calculate Y.real * L_N.real - Y.imag * L_N.imag in two separate calls.

Also note that the instruction L_N = 2 * np.exp(-1j*np.pi*np.arange(N)/(2*N)) creates

  1. an integer array (np.arange(N)),
  2. then a float array (np.arange(N)/(2*N)),
  3. then another float array (np.pi * np.arange(N)/(2*N)),
  4. then a complex array (-1j*np.pi*np.arange(N)/(2*N)),
  5. then another complex array (np.exp(-1j*np.pi*np.arange(N)/(2*N))),
  6. and finally another complex array (2 * np.exp(-1j*np.pi*np.arange(N)/(2*N))).

And after all this, you then take L_N.real and L_N.imag, which then generate two real arrays.

This is really a lot of hassle, when you actually want something like this:

t = np.linspace(0, 0.5*np.pi, num=N, endpoint=False)

...

Y.real *= np.cos(t)
Y.imag *= np.sin(t)
Y.real += Y.imag
Y.real *= 2

return Y.real

If you really want to, you can save one memory allocation by doing this:

t = np.linspace(0, 0.5*np.pi, num=N, endpoint=False)

...
scratchpad = np.zeros(N)
np.cos(t, out=scratchpad)

Y.real *= scratchpad

np.sin(t, out=scratchpad)

Y.imag *= scratchpad
Y.real += Y.imag
Y.real *= 2

return Y.real
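As a desktop sanity check (plain NumPy rather than ulab), the N-point method with the cos/sin rescaling reproduces the direct DCT-II definition X[k] = 2 Σ x[n] cos(πk(2n+1)/(2N)). Note the combination must be `+`, since Re(V · 2e^{-it}) expands to 2(V.real·cos t + V.imag·sin t):

```python
import numpy as np

def dct2_fftn(x):
    # DCT_FFTN from above, with exp() replaced by cos/sin.
    N = x.shape[-1]
    v = np.zeros(N)
    v[:(N - 1) // 2 + 1] = x[::2]          # even-index samples, in order
    if N % 2:  # odd length
        v[(N - 1) // 2 + 1:] = x[-2::-2]   # odd-index samples, reversed
    else:      # even length
        v[(N - 1) // 2 + 1:] = x[::-2]
    V = np.fft.fft(v)
    t = np.pi * np.arange(N) / (2 * N)     # endpoint-excluded angles pi*k/(2N)
    # Re(V * 2*exp(-1j*t)) = 2*(V.real*cos(t) + V.imag*sin(t))
    return 2 * (V.real * np.cos(t) + V.imag * np.sin(t))

def dct2_direct(x):
    # Textbook DCT-II, O(N^2), as ground truth.
    N = x.shape[-1]
    n = np.arange(N)
    return np.array([2 * np.sum(x * np.cos(np.pi * k * (2 * n + 1) / (2 * N)))
                     for k in range(N)])

x = np.arange(16.0)
assert np.allclose(dct2_fftn(x), dct2_direct(x))
```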

I don't know where the most time is spent, but since the operations are all in-place, your RAM won't be fragmented. I expect this to be very processor-specific, thus, hard numbers probably won't mean too much.

zeroby0 (Author) commented Jan 4, 2025

I thought I knew how to write performant code and, reading this reply, I'm just amazed. I've never seriously written for micro-controllers, but now I wish I had. I'm going to try out these ideas first thing in the morning! Thank you so much!

v923z (Owner) commented Jan 4, 2025

As I said, gains are specific to the hardware, even to the compiler settings, so you might have to experiment. But here are a couple of comments: https://micropython-ulab.readthedocs.io/en/latest/ulab-tricks.html, and I would also highly recommend Damien George's talk at pyconau: https://www.youtube.com/watch?v=hHec4qL00x0.

Part of the problem here is that the word efficient is ill-defined. Do you mean that the code should run fast? Or that it should consume little RAM? Or it could consume RAM, but it shouldn't call the garbage collector all the time? Or it should consume little flash? Or should the call overhead be minimal? These are not trivial questions, and your answer will depend on many factors, on the application, on the hardware, on the acceptable complexity of the firmware etc.
