
Comments (5)

bmcfee commented on July 22, 2024

Thanks for raising this issue. I'll admit to being pretty confused by this!

Just testing out your setup, I'm seeing the following:

In [13]: %timeit resampy.resample(x[:, 0], 48000, 24000, axis=0, parallel=False)
22.9 ms ± 211 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [14]: %timeit resampy.resample(x[:, 0], 48000, 24000, axis=0, parallel=True)
5.59 ms ± 50.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [15]: %timeit resampy.resample(x, 48000, 24000, axis=0, parallel=False)
730 ms ± 1.73 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [16]: %timeit resampy.resample(x, 48000, 24000, axis=0, parallel=True)
277 ms ± 1.61 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So indeed, the multidimensional case is slower, but there's still a speedup from parallel=False to parallel=True as we should expect. I don't think parallelism is the source of the problem.

There were a couple of other changes from 0.2.2 to 0.3 that could be relevant here, notably #90. In the example above, there are no substantive memory layout changes that should affect anything here, and the swapaxes calls are bypassed (since we're resampling axis 0).

The only thing that I can think of that might have caused this regression is a simplification of the interpolation loop code from explicitly iterating over channels:

# Compute the left wing of the filter response
i_max = min(n + 1, (nwin - offset) // index_step)
for i in range(i_max):
    weight = (interp_win[offset + i * index_step]
              + eta * interp_delta[offset + i * index_step])
    for j in range(n_channels):
        y[t, j] += weight * x[n - i, j]

to implicitly doing so:

# Compute the left wing of the filter response
i_max = min(n + 1, (nwin - offset) // index_step)
for i in range(i_max):
    weight = (interp_win[offset + i * index_step]
              + eta * interp_delta[offset + i * index_step])
    y[t] += weight * x[n - i]

This was done to remove the need to flatten the input/output arrays into 2D, which was causing some issues in higher-dimensional interpolation #73. It might be that numba is having a harder time optimizing this now since the number of dimensions is variable at runtime, and the trailing dimensions are syntactically hidden.

from resampy.

bmcfee commented on July 22, 2024

Ok, after digging into this a bit, I think the recommended strategy here would be to rewrite the interpolator as a generalized ufunc using @guvectorize. This would essentially lift the interpolation loop out to run in parallel over all trailing dimensions (I think) and ought to alleviate the issue without having to jump through more hoops.

I'm not sure how this will interact with the parallel/serial implementation though. We can use target='parallel' instead of target='cpu', but I'm not sure it will have the same effect overall. Something to experiment with anyway.


bmcfee commented on July 22, 2024

Took a quick stab at guvectorize here - did not pay off, but maybe it could with a more careful implementation.


bmcfee commented on July 22, 2024

Ok, I've got this back down to parity with serial single-threaded execution.

`@guvectorize` implementation
import numba
from numba import float64


def _resample_f(x, t_out, interp_win, interp_delta, num_table, scale, y):

    index_step = int(scale * num_table)
    time_register = 0.0

    n = 0
    frac = 0.0
    index_frac = 0.0
    offset = 0
    eta = 0.0
    weight = 0.0

    nwin = interp_win.shape[0]
    n_orig = x.shape[0]
    n_out = t_out.shape[0]

    for t in numba.prange(n_out):
        time_register = t_out[t]

        # Grab the top bits as an index to the input buffer
        n = int(time_register)

        # Grab the fractional component of the time index
        frac = scale * (time_register - n)

        # Offset into the filter
        index_frac = frac * num_table
        offset = int(index_frac)

        # Interpolation factor
        eta = index_frac - offset

        # Compute the left wing of the filter response
        i_max = min(n + 1, (nwin - offset) // index_step)
        for i in range(i_max):

            weight = (interp_win[offset + i * index_step] + eta * interp_delta[offset + i * index_step])
            y[t] += weight * x[n-i]

        # Invert P
        frac = scale - frac

        # Offset into the filter
        index_frac = frac * num_table
        offset = int(index_frac)

        # Interpolation factor
        eta = index_frac - offset

        # Compute the right wing of the filter response
        k_max = min(n_orig - n - 1, (nwin - offset) // index_step)
        for k in range(k_max):
            weight = (interp_win[offset + k * index_step] + eta * interp_delta[offset + k * index_step])
            y[t] += weight * x[n + k + 1]


resample_f_p = numba.guvectorize([(float64[:], float64[:], float64[:], float64[:], float64, float64, float64[:])],
                   "(n),(m),(p),(p),(),()->(m)", target='parallel')(_resample_f)
resample_f_s = numba.guvectorize([(float64[:], float64[:], float64[:], float64[:], float64, float64, float64[:])],
                   "(n),(m),(p),(p),(),()->(m)", target='cpu')(_resample_f)

Benchmark results below. resample is the guvectorize version, resampy.resample is 0.3.0.

>>> x = np.random.randn(48000, 1)
>>> y = resample(x, 48000, 24000, axis=0, parallel=True)
>>> y2 = resampy.resample(x, 48000, 24000, axis=0, parallel=True)
>>> np.allclose(y, y2)
True

>>> # Benchmark with and without parallelism
>>> %timeit resample(x, 48000, 24000, axis=0, parallel=False)
26.1 ms ± 336 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> %timeit resample(x, 48000, 24000, axis=0, parallel=True)
26.8 ms ± 923 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> %timeit resampy.resample(x, 48000, 24000, axis=0, parallel=False)
744 ms ± 6.05 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> %timeit resampy.resample(x, 48000, 24000, axis=0, parallel=True)
281 ms ± 4.43 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> # And compare to 1d slices
>>> %timeit resample(x[:, 0], 48000, 24000, axis=0, parallel=False)
25.8 ms ± 721 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> %timeit resample(x[:, 0], 48000, 24000, axis=0, parallel=True)
25.7 ms ± 233 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> %timeit resampy.resample(x[:, 0], 48000, 24000, axis=0, parallel=False)
23.5 ms ± 234 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> %timeit resampy.resample(x[:, 0], 48000, 24000, axis=0, parallel=True)
7.28 ms ± 146 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

So the good news is that we get back down to match 1D serial performance. The bad news is that target='parallel' does not match the speedup of the numba parallel=True option (last line). This makes sense in this context: there is no channel-level parallelism to exploit here (there's only one dummy channel), and guvectorize does not parallelize prange loops inside the kernel.

There might be a way to do this where we split the interpolator into an outer vectorized function and an inner parallel-jitted function, but at present I don't see an easy way to accomplish both with one function.


bmcfee commented on July 22, 2024

Ok, confirmed that the indirect vector-parallel wrapper does the job!

Skipping the gory details, the trick looks as follows:

import numba
from numba import float64


def _resample_f_p(x, t_out, interp_win, interp_delta, num_table, scale, y):
    _resample_loop_p(x, t_out, interp_win, interp_delta, num_table, scale, y)

def _resample_f_s(x, t_out, interp_win, interp_delta, num_table, scale, y):
    _resample_loop_s(x, t_out, interp_win, interp_delta, num_table, scale, y)


# _resample_loop is the prange interpolation kernel from the previous comment
_resample_loop_p = numba.jit(nopython=True, parallel=True)(_resample_loop)
_resample_loop_s = numba.jit(nopython=True, parallel=False)(_resample_loop)

resample_f_p = numba.guvectorize([(float64[:], float64[:], float64[:], float64[:], float64, float64, float64[:])],
                                 "(n),(m),(p),(p),(),()->(m)", nopython=True)(_resample_f_p)

resample_f_s = numba.guvectorize([(float64[:], float64[:], float64[:], float64[:], float64, float64, float64[:])],
                                 "(n),(m),(p),(p),(),()->(m)", nopython=True)(_resample_f_s)

Runtimes on the previous example:

>>> %timeit resample(x, 48000, 24000, axis=0, parallel=False)
23.8 ms ± 510 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> %timeit resample(x, 48000, 24000, axis=0, parallel=True)
7.17 ms ± 453 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> %timeit resampy.resample(x, 48000, 24000, axis=0, parallel=False)
733 ms ± 7.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> %timeit resampy.resample(x, 48000, 24000, axis=0, parallel=True)
279 ms ± 6.81 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> %timeit resampy.resample(x[:, 0], 48000, 24000, axis=0, parallel=False)
21.1 ms ± 497 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> %timeit resampy.resample(x[:, 0], 48000, 24000, axis=0, parallel=True)
6.55 ms ± 284 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

I'm happy to call this one done. Will try to get it implemented properly and shipped as 0.3.1 by the end of the week.

