Comments (5)
Thanks for raising this issue. I'll admit to being pretty confused by this!
Just testing out your setup, I'm seeing the following:
```
In [13]: %timeit resampy.resample(x[:, 0], 48000, 24000, axis=0, parallel=False)
22.9 ms ± 211 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [14]: %timeit resampy.resample(x[:, 0], 48000, 24000, axis=0, parallel=True)
5.59 ms ± 50.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [15]: %timeit resampy.resample(x, 48000, 24000, axis=0, parallel=False)
730 ms ± 1.73 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [16]: %timeit resampy.resample(x, 48000, 24000, axis=0, parallel=True)
277 ms ± 1.61 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
So indeed, the multidimensional case is slower, but there's still a speedup from `parallel=False` to `parallel=True`, as we should expect. I don't think parallelism is the source of the problem.
There were a couple of other changes from 0.2.2 to 0.3 that could be relevant here, notably #90. In the example above, there are no substantive memory layout changes that should affect anything here, and the swapaxes calls are bypassed (since we're resampling axis 0).
The only thing that I can think of that might have caused this regression is a simplification of the interpolation loop code from explicitly iterating over channels (lines 41 to 47 in ccb8557) to doing so implicitly (lines 39 to 44 in f8d4bb9).
This was done to remove the need to flatten the input/output arrays into 2D, which was causing some issues in higher-dimensional interpolation #73. It might be that numba is having a harder time optimizing this now since the number of dimensions is variable at runtime, and the trailing dimensions are syntactically hidden.
from resampy.
Ok, after digging into this a bit, I think the recommended strategy here would be to rewrite the interpolator as a generalized ufunc using `@guvectorize`. This would essentially lift the interpolation loop out to run in parallel over all trailing dimensions (I think) and ought to alleviate the issue without having to jump through more hoops.
I'm not sure how this will interact with the parallel/serial implementation though. We can use `target='parallel'` instead of `target='cpu'`, but I'm not sure it will have the same effect overall. Something to experiment with anyway.
Took a quick stab at `guvectorize` here; it did not pay off, but maybe it could with a more careful implementation.
Ok, I've got this back down to parity with serial single-threaded execution.
`@guvectorize` implementation
```python
import numba
from numba import float64


def _resample_f(x, t_out, interp_win, interp_delta, num_table, scale, y):
    index_step = int(scale * num_table)
    time_register = 0.0

    n = 0
    frac = 0.0
    index_frac = 0.0
    offset = 0
    eta = 0.0
    weight = 0.0

    nwin = interp_win.shape[0]
    n_orig = x.shape[0]
    n_out = t_out.shape[0]

    for t in numba.prange(n_out):
        time_register = t_out[t]

        # Grab the top bits as an index to the input buffer
        n = int(time_register)

        # Grab the fractional component of the time index
        frac = scale * (time_register - n)

        # Offset into the filter
        index_frac = frac * num_table
        offset = int(index_frac)

        # Interpolation factor
        eta = index_frac - offset

        # Compute the left wing of the filter response
        i_max = min(n + 1, (nwin - offset) // index_step)
        for i in range(i_max):
            weight = (interp_win[offset + i * index_step]
                      + eta * interp_delta[offset + i * index_step])
            y[t] += weight * x[n - i]

        # Invert P
        frac = scale - frac

        # Offset into the filter
        index_frac = frac * num_table
        offset = int(index_frac)

        # Interpolation factor
        eta = index_frac - offset

        # Compute the right wing of the filter response
        k_max = min(n_orig - n - 1, (nwin - offset) // index_step)
        for k in range(k_max):
            weight = (interp_win[offset + k * index_step]
                      + eta * interp_delta[offset + k * index_step])
            y[t] += weight * x[n + k + 1]


resample_f_p = numba.guvectorize(
    [(float64[:], float64[:], float64[:], float64[:], float64, float64, float64[:])],
    "(n),(m),(p),(p),(),()->(m)", target='parallel')(_resample_f)

resample_f_s = numba.guvectorize(
    [(float64[:], float64[:], float64[:], float64[:], float64, float64, float64[:])],
    "(n),(m),(p),(p),(),()->(m)", target='cpu')(_resample_f)
```
Benchmark results below. `resample` is the guvectorize version, `resampy.resample` is 0.3.0.
```
>>> x = np.random.randn(48000, 1)
>>> y = resample(x, 48000, 24000, axis=0, parallel=True)
>>> y2 = resampy.resample(x, 48000, 24000, axis=0, parallel=True)
>>> np.allclose(y, y2)
True
>>> # Benchmark with and without parallelism
>>> %timeit resample(x, 48000, 24000, axis=0, parallel=False)
26.1 ms ± 336 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit resample(x, 48000, 24000, axis=0, parallel=True)
26.8 ms ± 923 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit resampy.resample(x, 48000, 24000, axis=0, parallel=False)
744 ms ± 6.05 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit resampy.resample(x, 48000, 24000, axis=0, parallel=True)
281 ms ± 4.43 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> # And compare to 1d slices
>>> %timeit resample(x[:, 0], 48000, 24000, axis=0, parallel=False)
25.8 ms ± 721 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit resample(x[:, 0], 48000, 24000, axis=0, parallel=True)
25.7 ms ± 233 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit resampy.resample(x[:, 0], 48000, 24000, axis=0, parallel=False)
23.5 ms ± 234 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit resampy.resample(x[:, 0], 48000, 24000, axis=0, parallel=True)
7.28 ms ± 146 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
So the good news is that we get back down to match 1D serial performance. The bad news is that `target='parallel'` does not match the numba `parallel=True` option (last line). This makes sense in this context because there's no channel-level parallelism to exploit here (there's only one dummy channel), and `guvectorize` does not parallelize `prange` loops.
There might be a way to do this where we split the interpolator into an outer vectorized function and an inner parallel-jitted function, but at present I don't see an easy way to accomplish both with one function.
Ok, confirmed that the indirect vector-parallel wrapper does the job!
Skipping the gory details, the trick looks as follows:
```python
def _resample_f_p(x, t_out, interp_win, interp_delta, num_table, scale, y):
    _resample_loop_p(x, t_out, interp_win, interp_delta, num_table, scale, y)


def _resample_f_s(x, t_out, interp_win, interp_delta, num_table, scale, y):
    _resample_loop_s(x, t_out, interp_win, interp_delta, num_table, scale, y)


# _resample_loop is the prange interpolation loop from the previous comment
_resample_loop_p = numba.jit(nopython=True, parallel=True)(_resample_loop)
_resample_loop_s = numba.jit(nopython=True, parallel=False)(_resample_loop)

resample_f_p = numba.guvectorize(
    [(float64[:], float64[:], float64[:], float64[:], float64, float64, float64[:])],
    "(n),(m),(p),(p),(),()->(m)", nopython=True)(_resample_f_p)

resample_f_s = numba.guvectorize(
    [(float64[:], float64[:], float64[:], float64[:], float64, float64, float64[:])],
    "(n),(m),(p),(p),(),()->(m)", nopython=True)(_resample_f_s)
```
Runtimes on the previous example:
```
>>> %timeit resample(x, 48000, 24000, axis=0, parallel=False)
23.8 ms ± 510 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit resample(x, 48000, 24000, axis=0, parallel=True)
7.17 ms ± 453 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit resampy.resample(x, 48000, 24000, axis=0, parallel=False)
733 ms ± 7.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit resampy.resample(x, 48000, 24000, axis=0, parallel=True)
279 ms ± 6.81 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit resampy.resample(x[:, 0], 48000, 24000, axis=0, parallel=False)
21.1 ms ± 497 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit resampy.resample(x[:, 0], 48000, 24000, axis=0, parallel=True)
6.55 ms ± 284 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
I'm happy to call this one done. Will try to get it implemented properly and shipped as 0.3.1 by the end of the week.