
Comments (5)

bmcfee commented on July 22, 2024

Thanks for raising this issue. I'll admit to being pretty confused by this!

Just testing out your setup, I'm seeing the following:

In [13]: %timeit resampy.resample(x[:, 0], 48000, 24000, axis=0, parallel=False)
22.9 ms ± 211 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [14]: %timeit resampy.resample(x[:, 0], 48000, 24000, axis=0, parallel=True)
5.59 ms ± 50.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [15]: %timeit resampy.resample(x, 48000, 24000, axis=0, parallel=False)
730 ms ± 1.73 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [16]: %timeit resampy.resample(x, 48000, 24000, axis=0, parallel=True)
277 ms ± 1.61 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So indeed, the multidimensional case is slower, but there's still a speedup from parallel=False to parallel=True as we should expect. I don't think parallelism is the source of the problem.

There were a couple of other changes from 0.2.2 to 0.3 that could be relevant here, notably #90. In the example above, there are no substantive memory layout changes that should affect anything here, and the swapaxes calls are bypassed (since we're resampling axis 0).

The only thing that I can think of that might have caused this regression is a simplification of the interpolation loop code from explicitly iterating over channels:

# Compute the left wing of the filter response
i_max = min(n + 1, (nwin - offset) // index_step)
for i in range(i_max):
    weight = (interp_win[offset + i * index_step]
              + eta * interp_delta[offset + i * index_step])
    for j in range(n_channels):
        y[t, j] += weight * x[n - i, j]

to implicitly doing so:

# Compute the left wing of the filter response
i_max = min(n + 1, (nwin - offset) // index_step)
for i in range(i_max):
    weight = (interp_win[offset + i * index_step]
              + eta * interp_delta[offset + i * index_step])
    y[t] += weight * x[n - i]

This was done to remove the need to flatten the input/output arrays into 2D, which was causing some issues in higher-dimensional interpolation #73. It might be that numba is having a harder time optimizing this now since the number of dimensions is variable at runtime, and the trailing dimensions are syntactically hidden.

from resampy.

bmcfee commented on July 22, 2024

Ok, after digging into this a bit, I think the recommended strategy here would be to rewrite the interpolator as a generalized ufunc using @guvectorize. This would essentially lift the interpolation loop out to run in parallel over all trailing dimensions (I think) and ought to alleviate the issue without having to jump through more hoops.

I'm not sure how this will interact with the parallel/serial implementation though. We can use target='parallel' instead of target='cpu', but I'm not sure it will have the same effect overall. Something to experiment with anyway.


bmcfee commented on July 22, 2024

Took a quick stab at guvectorize here - did not pay off, but maybe it could with a more careful implementation.


bmcfee commented on July 22, 2024

Ok, I've got this back down to parity with serial single-threaded execution.

`@guvectorize` implementation
import numba
from numba import float64


def _resample_f(x, t_out, interp_win, interp_delta, num_table, scale, y):

    index_step = int(scale * num_table)
    time_register = 0.0

    n = 0
    frac = 0.0
    index_frac = 0.0
    offset = 0
    eta = 0.0
    weight = 0.0

    nwin = interp_win.shape[0]
    n_orig = x.shape[0]
    n_out = t_out.shape[0]

    for t in numba.prange(n_out):
        time_register = t_out[t]

        # Grab the top bits as an index to the input buffer
        n = int(time_register)

        # Grab the fractional component of the time index
        frac = scale * (time_register - n)

        # Offset into the filter
        index_frac = frac * num_table
        offset = int(index_frac)

        # Interpolation factor
        eta = index_frac - offset

        # Compute the left wing of the filter response
        i_max = min(n + 1, (nwin - offset) // index_step)
        for i in range(i_max):

            weight = (interp_win[offset + i * index_step] + eta * interp_delta[offset + i * index_step])
            y[t] += weight * x[n-i]

        # Invert P
        frac = scale - frac

        # Offset into the filter
        index_frac = frac * num_table
        offset = int(index_frac)

        # Interpolation factor
        eta = index_frac - offset

        # Compute the right wing of the filter response
        k_max = min(n_orig - n - 1, (nwin - offset) // index_step)
        for k in range(k_max):
            weight = (interp_win[offset + k * index_step] + eta * interp_delta[offset + k * index_step])
            y[t] += weight * x[n + k + 1]


resample_f_p = numba.guvectorize([(float64[:], float64[:], float64[:], float64[:], float64, float64, float64[:])],
                   "(n),(m),(p),(p),(),()->(m)", target='parallel')(_resample_f)
resample_f_s = numba.guvectorize([(float64[:], float64[:], float64[:], float64[:], float64, float64, float64[:])],
                   "(n),(m),(p),(p),(),()->(m)", target='cpu')(_resample_f)

Benchmark results below. resample is the guvectorize version, resampy.resample is 0.3.0.

>>> x = np.random.randn(48000, 1)
>>> y = resample(x, 48000, 24000, axis=0, parallel=True)
>>> y2 = resampy.resample(x, 48000, 24000, axis=0, parallel=True)
>>> np.allclose(y, y2)
True

>>> # Benchmark with and without parallelism
>>> %timeit resample(x, 48000, 24000, axis=0, parallel=False)
26.1 ms ± 336 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> %timeit resample(x, 48000, 24000, axis=0, parallel=True)
26.8 ms ± 923 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> %timeit resampy.resample(x, 48000, 24000, axis=0, parallel=False)
744 ms ± 6.05 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> %timeit resampy.resample(x, 48000, 24000, axis=0, parallel=True)
281 ms ± 4.43 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> # And compare to 1d slices
>>> %timeit resample(x[:, 0], 48000, 24000, axis=0, parallel=False)
25.8 ms ± 721 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> %timeit resample(x[:, 0], 48000, 24000, axis=0, parallel=True)
25.7 ms ± 233 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> %timeit resampy.resample(x[:, 0], 48000, 24000, axis=0, parallel=False)
23.5 ms ± 234 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> %timeit resampy.resample(x[:, 0], 48000, 24000, axis=0, parallel=True)
7.28 ms ± 146 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

So the good news is that we get back down to match 1D serial performance. The bad news is that target='parallel' does not match the speedup of the numba parallel=True option (last line). This makes sense in this context: there is no channel-level parallelism to exploit here (there's only one dummy channel), and guvectorize does not parallelize prange loops inside the kernel.

There might be a way to do this where we split the interpolator into an outer vectorized function and an inner parallel-jitted function, but at present I don't see an easy way to accomplish both with one function.


bmcfee commented on July 22, 2024

Ok, confirmed that the indirect vector-parallel wrapper does the job!

Skipping the gory details, the trick looks as follows:

import numba
from numba import float64


def _resample_f_p(x, t_out, interp_win, interp_delta, num_table, scale, y):
    _resample_loop_p(x, t_out, interp_win, interp_delta, num_table, scale, y)

def _resample_f_s(x, t_out, interp_win, interp_delta, num_table, scale, y):
    _resample_loop_s(x, t_out, interp_win, interp_delta, num_table, scale, y)


# _resample_loop is the prange interpolation kernel from the previous comment
_resample_loop_p = numba.jit(nopython=True, parallel=True)(_resample_loop)
_resample_loop_s = numba.jit(nopython=True, parallel=False)(_resample_loop)

resample_f_p = numba.guvectorize([(float64[:], float64[:], float64[:], float64[:], float64, float64, float64[:])],
                                 "(n),(m),(p),(p),(),()->(m)", nopython=True)(_resample_f_p)

resample_f_s = numba.guvectorize([(float64[:], float64[:], float64[:], float64[:], float64, float64, float64[:])],
                                 "(n),(m),(p),(p),(),()->(m)", nopython=True)(_resample_f_s)

Runtimes on the previous example:

>>> %timeit resample(x, 48000, 24000, axis=0, parallel=False)
23.8 ms ± 510 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> %timeit resample(x, 48000, 24000, axis=0, parallel=True)
7.17 ms ± 453 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> %timeit resampy.resample(x, 48000, 24000, axis=0, parallel=False)
733 ms ± 7.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> %timeit resampy.resample(x, 48000, 24000, axis=0, parallel=True)
279 ms ± 6.81 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> %timeit resampy.resample(x[:, 0], 48000, 24000, axis=0, parallel=False)
21.1 ms ± 497 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> %timeit resampy.resample(x[:, 0], 48000, 24000, axis=0, parallel=True)
6.55 ms ± 284 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

I'm happy to call this one done. Will try to get it implemented properly and shipped as 0.3.1 by the end of the week.

