Comments (43)
@Path-A that's indeed a separate problem, and resolved by PR #109
from resampy.
Just as a heads up, I won't have time to look at this more until next week due to deadlines and travel. But I'm sure we'll get to the bottom of it.
from resampy.
Interesting - worth noting that it can't only be circleCI, since the librosa issue is reported on heroku, but maybe there's some common factor to how both of those platforms use docker that isn't shared by GCP.
I think this is beginning to smell like an upstream issue in numba (or maybe llvmlite), and not something due to how resampy is using numba.
from resampy.
Now i'm really confused, because the issue over in librosa did not have torch involved at all.
from resampy.
You are really good at spotting this
git whatchanged --since="2022-02-16" --until="2022-02-17"
is a life-saver. Also having been bitten by silent MKL quirks in the past.
from resampy.
Just as a heads up, PR #109 implements a handful of changes as mentioned in #108, and also reverts parallel=False by default. I'm not sure this directly resolves the potential MKL conflict / docker issue, but if you could give that a shot and let me know, it would help prioritize putting out a bugfix release next week.
from resampy.
That's deeply weird. If you can't provide a reproducing example, can you at least describe more about what the environment that triggers slow-down looks like?
Nothing in librosa has changed relative to the resampy updates.
The only thing that comes to mind is that resampy 0.3.1 has a bit more going on at compilation time than 0.3.0 (different dtype targets are explicitly listed), but a 20x slowdown in unrelated code makes zeros sense to me.
from resampy.
We observed the slowdown in circleCI while running our tests, the whole instructions are in this circleCI config file. I realize it may be difficult to reproduce, but basically all jobs that install the torch-speech
dependency were getting really slow, so anything like:
pip install transformers[sklearn,torch,testing,sentencepiece,torch-speech]
Our best guess is that it somehow changed things in Numba/NumPy that then reverted to a slow version, but nothing more concrete than that.
from resampy.
Our best guess is that it somehow changed things in Numba/NumPy that then reverted to a slow version, but nothing more concrete than that.
Strange. The dependencies have not changed from 0.3.0 to 0.3.1, so I don't see why that would happen.
The only thing that changed was how we handle multi-channel resampling (see PR #104 ) which now uses numba generalized ufunc vectorization to bring down the interpolation loop to operate independently for each channel. In moving to @guvectorize
, I had to explicitly set out the dtype targets, which I could envision adding a little overhead, but only when the functions are actually called (and then, only the first time).
The only time this should actually have an observable effect is when resampling multichannel signals, and then, it should be strictly better in 0.3.1 (see #102 for discussion). If you're not doing multi-channel, performance should be equivalent (and match or beat that of the 0.2 series).
The other thing that could be happening here is that resampy is going parallel and interfering with other processes in your CI. You could check this quickly by limiting the number of threads that numba can take (see docs) in your environment. At present, resampy exposes an interface to disable parallelism, but librosa does not (see this issue for what needs to happen there), so a global override is probably your best strategy for determining if this is the culprit if you're only accessing resampy through librosa.
from resampy.
@sgugger curious if you've made any headway on this?
from resampy.
We encountered the same weird problem ...
The inference speed of the transformer will slow down with the resampy 0.3.1.
Downgrading the resampy ( 0.3.1 -> 0.2.2) solved the problem.
from resampy.
@vBaiCai thanks - can you provide any kind of hints for reproducing this behavior though?
from resampy.
I tried to reproduce, even using CircleCI's convenient docker image running on a GCP VM which has no issue. So far I can only see this issue when the code is running on CircleCI (with docker executor).
@vBaiCai: do you also see this issue on CircleCI, or some other environments?
The issue occurs after the commit a36275670f284e8f09d71ef3515dc7513aeba6e6
(fixed #102, fast resampling in high dimensions).
from resampy.
With some more investigation, here is my finding
- The
parallel=True
in_resample_loop_p = jit(nopython=True, nogil=True, parallel=True)(_resample_loop)
causes the problem.- If I remove
@guvectorize
(forresample_f_p
), no slow down. - If I change to
parallel=False
, no slow down.
- If I remove
- It doesn't matter if the methods are called or not. As long as they are imported, the issue occurs.
- I can only observe this issue on CircleCI docker executor.
Results (simplified case, see reproduce section below)
-
_resample_loop_p = jit(nopython=True, nogil=True, parallel=True)(_resample_loop)
: -
_resample_loop_p = jit(nopython=True, nogil=True, parallel=False)(_resample_loop)
:
To reproduce (running on CircleCI docker executor)
import torch
from torch import nn
from tqdm import tqdm
"""Numba implementation of resampler"""
from numba import (
guvectorize,
float32,
float64,
jit,
prange,
int16,
int32,
int64,
complex64,
complex128,
)
def _resample_loop(x, t_out, interp_win, interp_delta, num_table, scale, y):
index_step = int(scale * num_table)
time_register = 0.0
n = 0
frac = 0.0
index_frac = 0.0
offset = 0
eta = 0.0
weight = 0.0
nwin = interp_win.shape[0]
n_orig = x.shape[0]
n_out = t_out.shape[0]
for t in prange(n_out):
time_register = t_out[t]
# Grab the top bits as an index to the input buffer
n = int(time_register)
# Grab the fractional component of the time index
frac = scale * (time_register - n)
# Offset into the filter
index_frac = frac * num_table
offset = int(index_frac)
# Interpolation factor
eta = index_frac - offset
# Compute the left wing of the filter response
i_max = min(n + 1, (nwin - offset) // index_step)
for i in range(i_max):
weight = (
interp_win[offset + i * index_step]
+ eta * interp_delta[offset + i * index_step]
)
y[t] += weight * x[n - i]
# Invert P
frac = scale - frac
# Offset into the filter
index_frac = frac * num_table
offset = int(index_frac)
# Interpolation factor
eta = index_frac - offset
# Compute the right wing of the filter response
k_max = min(n_orig - n - 1, (nwin - offset) // index_step)
for k in range(k_max):
weight = (
interp_win[offset + k * index_step]
+ eta * interp_delta[offset + k * index_step]
)
y[t] += weight * x[n + k + 1]
_resample_loop_p = jit(nopython=True, nogil=True, parallel=True)(_resample_loop)
@guvectorize(
[
(int16[:], float64[:], float64[:], float64[:], int32, float32, int16[:]),
(int32[:], float64[:], float64[:], float64[:], int32, float32, int32[:]),
(int64[:], float64[:], float64[:], float64[:], int32, float32, int64[:]),
(float32[:], float64[:], float64[:], float64[:], int32, float32, float32[:]),
(float64[:], float64[:], float64[:], float64[:], int32, float32, float64[:]),
(
complex64[:],
float64[:],
float64[:],
float64[:],
int32,
float32,
complex64[:],
),
(
complex128[:],
float64[:],
float64[:],
float64[:],
int32,
float32,
complex128[:],
),
],
"(n),(m),(p),(p),(),()->(m)",
nopython=True,
)
def resample_f_p(x, t_out, interp_win, interp_delta, num_table, scale, y):
_resample_loop_p(x, t_out, interp_win, interp_delta, num_table, scale, y)
class DummyModel(nn.Module):
def __init__(self, n_layers, dim):
super().__init__()
self.layers = [nn.Linear(dim, dim) for _ in range(n_layers)]
self.activation = nn.Tanh()
def forward(self, hidden_states):
for layer in self.layers:
hidden_states = layer(hidden_states)
hidden_states = self.activation(hidden_states)
return hidden_states
if __name__ == "__main__":
batch_size = 2
seq_len = 16384
# dim = 128
# model = DummyModel(n_layers=16, dim=dim)
# input = torch.ones(batch_size, seq_len, dim)
#
# for i in tqdm(range(16)):
# output = model(input)
# print(output.shape)
from transformers import AutoModelForSpeechSeq2Seq
model = AutoModelForSpeechSeq2Seq.from_pretrained("hf-internal-testing/tiny-random-speech-encoder-decoder")
encoder_input = torch.ones(batch_size, seq_len, dtype=torch.float32)
decoder_input = torch.ones(batch_size, 128, dtype=torch.int32)
for i in tqdm(range(16)):
output = model(inputs=encoder_input, decoder_input_ids=decoder_input)
print(output.logits.shape)
from resampy.
Actually, I also observe (some) slowdown on other environments. For example on GCP VM,
parallel=True
: 30.0it/sparallel=False
: 50.0it/s
(but this doesn't cause real problem, as they are both fast enough)
from resampy.
Thanks @ydshieh for the detailed report!
It doesn't matter if the methods are called or not. As long as they are imported, the issue occurs.
This to me says that the issue is the jit-compile time, rather than the run time. I wonder if adding cache=True
here could help resolve things?
from resampy.
We have some confirmation over in librosa that this is indeed a jit compilation issue and not a runtime issue per se.
Unfortunately, adding cache=True
to the build to amortize this cost triggers a segfault in numba across sessions. This is apparently a known and long-standing issue numba/numba#4807
This is where it gets a little dicey from our side:
- We can cache the serial implementation
resample_f_s
without issue. Caching onresample_f_p
breaks. - We cannot cache the build of the inner loop function that is common to both. This is because numba currently does not use compilation flags when hashing the function, so enabling
cache=True
on these would result in either the parallel implementation being used for the serial or vice versa. Definitely not what we want. This too is a known issue: numba/numba#6845
(1) means that we always have to pay a compilation cost for the parallel import. Now, we could get around this by breaking into submodules and using some lazy-import tricks to defer the parallel build until it's absolutely necessary, but that makes me a bit nervous.
(2) means that at present, we have to pay the compilation cost twice on import until the upstream issue is resolved. I don't see a clean way around this, but I have a branch working right now where the inner loop function is compiled and cached twice using deep copies to avoid collisions, and that looks like it might be working.
from resampy.
@bmcfee Thank you for looking into this. I am not very familiar with these, but I have one question:
If you look the To reproduce (running on CircleCI docker executor)
section above, you will see that no resample_XXX
function is used at all while measuring the timing. There is _resample_loop_p = jit(...)
, but it is done before the time measurement. However, if jit with parallel=True
+ guvectorize
, it takes much long time to finish the loop.
So I don't think the issue (just) comes from compilation cost.
from resampy.
There is
_resample_loop_p = jit(...)
, but it is done before the time measurement. However, ifjit with parallel=True
+guvectorize
, it takes much long time to finish the loop.So I don't think the issue (just) comes from compilation cost.
Well that is downright baffling, and makes me suspect that something is going very wrong either in torch or in numba/llvmlite.
If the function isn't being called at all, compilation is the only thing we have left. Are you absolutely certain that something isn't calling the compiled interpolator function? Sorry, I have no idea what's going on in the transformers object hierarchy, and really don't have the bandwidth to dig into it further.
from resampy.
Hi
We are observing the same issue in torchaudio CI.
resampy 0.3.0 or 0.2.2 works fine but resampy 0.3.1 causes timeout in tutorial generation job.
-
Failure with resampy 0.3.1
https://app.circleci.com/pipelines/github/pytorch/audio/11789/workflows/4f6d4b3d-5cab-4d9c-b9d9-2b2a7a3ebeb0/jobs/802671 -
Successful run after pinning resampy 0.2.2.
https://app.circleci.com/pipelines/github/pytorch/audio/11801/workflows/69164737-8627-464f-bb62-3ee1beb89163/jobs/804295?invite=true#step-106-241
The tutorial experiencing the timeout does not use resampy
. It does not even import it. However, the tutorial executed before the failing one uses librosa.power_to_db
. So there seems to be some hysteresis.
The librosa
version is 0.9.2
in any cases.
All of build_docs
job in build
workflow after July 7th are failing. https://app.circleci.com/pipelines/github/pytorch/audio?branch=nightly
The execution stack looks like
-
CircleCI
- Docker
- Make
- conda
- sphinx_gallery
- Python
- sphinx_gallery
- conda
- Make
- Docker
-
The tutorial experiencing timeout is https://github.com/pytorch/audio/blob/main/examples/tutorials/online_asr_tutorial.py
-
The tutorial executed before is https://github.com/pytorch/audio/blob/main/examples/tutorials/audio_feature_augmentation_tutorial.py which uses
librosa.power_to_db
.
The particular code exhibiting timeout is our custom C++ operation, and this code is calling FFmpeg to fetch MP3 from AWS. The C++ code is bounded to Python using PyTorch, but not sure if it's related.
from resampy.
Are you absolutely certain that something isn't calling the compiled interpolator function?
I will try to find an example which only uses simple torch operations, or see if the the compiled function is called somehow.
from resampy.
I can only observe this issue on CircleCI docker executor.
I'm starting to wonder if docker is a necessary ingredient here - a similar issue was reported in librosa recently librosa/librosa#1541 , note the following comment in the issue:
I'm running things inside of a Docker container running heroku/heroku:20 (based off Ubuntu 20).
If I run import librosa from the host computer running macOS the speed is normal.
The Mac is a 2021 MacBook Pro M1.
Now, I'm pretty certain that the librosa issue is actually a duplicate of this issue, but I mention it because it specifically mentions that taking docker out of the equation does resolve the problem.
The fact that resampy 0.3.0 does not seem to trigger this points to a specific interaction between docker and @guvectorize
(added in 0.3.1), but not parallel compilation (added in 0.3.0).
from resampy.
Hi, yes it seems to relate to docker. More precisely, I only see this issue occurs on CircleCI docker executor.
- same docker image on GCP VM -> no issue
- Linux VM executor (no docker) on CircleCI -> no issue
(also mentioned this in some previous comments)
from resampy.
@bmcfee I am able to reproduce this issue with simple torch ops
-
without
parallel=False
(and withguvectorize
)- job run link
nn.Embedding
:1e-5 ~ 5e-5
second+
:1e-5 ~ 1e-4
second
- job run link
-
with
parallel=True
(and withguvectorize
)- job run link
nn.Embedding
:5e-2
second+
:5e-2
second
- job run link
This is 1000x slowdown!
Would you like to provide a simple version regarding _resample_loop
(could be any dummy function) and guvectorize
signature, and open an issue either in numba, llvm or torch? (You definitely know much better than me on these things). Thank you. (Or we can just use the whole example I provide here)
To reproduce
change the __main__
in this comment to the following
if __name__ == "__main__":
batch_size = 64
seq_len = 128
vocab_size = 1024
hidden_size = 32
layer = nn.Embedding(vocab_size, hidden_size)
inputs = torch.ones(batch_size, seq_len, dtype=torch.int32)
for i in range(10):
output = layer(inputs)
for i in range(10):
s = datetime.datetime.now()
for j in range(100):
output = layer(inputs)
e = datetime.datetime.now()
print(f"nn.Embedding ({i}): {(e - s).total_seconds() / 100} seconds")
print("=" * 40)
x = torch.ones(64, 128, 32, dtype=torch.float32)
y = torch.ones(64, 128, 32, dtype=torch.float32)
for i in range(10):
z = x + y
for i in range(10):
s = datetime.datetime.now()
for j in range(100):
z = x + y
e = datetime.datetime.now()
print(f"z = x + y ({i}): {(e - s).total_seconds() / 100} seconds")
print("=" * 40)
q = torch.ones(64, 4, 128, 8, dtype=torch.float32)
k = torch.ones(64, 4, 8, 128, dtype=torch.float32)
for i in range(10):
attention_scores = torch.matmul(q, k)
for i in range(10):
s = datetime.datetime.now()
for j in range(100):
attention_scores = torch.matmul(q, k)
e = datetime.datetime.now()
print(f"torch.matmul ({i}): {(e - s).total_seconds() / 100} seconds")
print("=" * 40)
from resampy.
Would you like to provide a simple version regarding
_resample_loop
(could be any dummy function) andguvectorize
signature, and open an issue either in numba, llvm or torch? (You definitely know much better than me on these things). Thank you. (Or we can just use the whole example I provide here)
I was able to run this script, and verify that I don't see any slowdown on my dev machine (not using docker). I think that before raising an issue upstream, we should try to reproduce this outside of circleci, just to eliminate other sources of variation on shared machines.
Do you know if it's possible to pull down the docker image you're using so we can try it locally?
from resampy.
It is cimg/python:3.7.12
.
However, running that image with GCP VM didn't produce issue (I tried with some previous test code). I can try the above latest code snippet with docker on GCP. --> Checked, same image running on GCP VM: no issue
In fact, I think it is not the image issue, it is probably the way docker server and docker image is launched on CircleCI.
from resampy.
In fact, I think it is not the image issue, it is probably the way docker server and docker image is launched on CircleCI.
Do we know anything more about how circle and gcp differ in this?
from resampy.
Do we know anything more about how circle and gcp differ in this?
Not from me, unfortunately. However, It would be great if the author of the librosa issue could try my code snippet on heroku
too and see if the slowdown also occurs.
from resampy.
Looks like with torch 1.12, this issue disappear.
See
from resampy.
Sorry to add more confusing information, but we observed this with PyTorch 1.12.0 as well in TorchAudio release branch.
The failing job
https://app.circleci.com/pipelines/github/pytorch/audio/11756/workflows/84b6bb5e-47f5-4494-9663-19f045180d61/jobs/797806
Pinning resampy 0.2.2 fixed the issue.
from resampy.
😵💫
Now i'm really confused, because the issue over in librosa did not have torch involved at all.
Let me run a few more times. I am worried that my previous statement might be False
from resampy.
ok, well nice to know that the torch update is (might be?) orthogonal to this issue.
from resampy.
Let me run a few more times. I am worried that my previous statement might be False
@ydshieh any confirmation on this? I have some cycles to work on this issue now.
from resampy.
I have enough evidence to say yes. More precisely, the script I provided is slow on torch-1.12.0.dev20220216%2Bcpu-cp37-cp37m-linux_x86_64.whl
but not on torch-1.12.0.dev20220217%2Bcpu-cp37-cp37m-linux_x86_64.whl
.
(Of course, all statements are for running on CircleCI with docker executor).
I do have 2 or 3 times not seeing the issue under the same condition.
If you are interested, here is the run page. You can just check the 4 latest runs.
I am trying to build torch from source for the commits between these 2 dates. Hope we can see which change/commit solves the issue.
I feel that the issue opened on librosa is different from the one I face. Here we don't have very slow import, but instead having some ops slowing down once resampy 0.3.1 is imported.
from resampy.
I have enough evidence to say yes. More precisely, the script I provided is slow on
torch-1.12.0.dev20220216%2Bcpu-cp37-cp37m-linux_x86_64.whl
but not ontorch-1.12.0.dev20220217%2Bcpu-cp37-cp37m-linux_x86_64.whl
. (Of course, all statements are for running on CircleCI with docker executor).I do have 2 or 3 times not seeing the issue under the same condition.
Oof. How hard would it be to get a diff of those two versions? My hunch here is that none of this has anything to do with changes in pytorch, and might just come down to inherent randomness of running on a shared host. I'd love to be wrong about that though!
I feel that the issue opened on librosa is different from the one I face. Here we don't have very slow import, but instead having some ops slowing down once resampy 0.3.1 is imported.
Good observation. The slow import might be due to all the build targets in guvectorize that hit at import-time (rather than JIT). I have a local branch that can bring that down substantially by dropping the explicit build targets. The implications of that are rather involved though, so I'll migrate it to another issue.
from resampy.
There are about 120 commits that day in torch
, and look like a lot of files changed.
I also think it's possible from the randomness of different host.
from resampy.
There are about 120 commits that day in
torch
, and look like a lot of files changed.
Looking through the log now, this looks like the only reasonable culprit IMO:
I could definitely see potential issues stemming from MKL interacting badly with (other) parallel code.
from resampy.
You are really good at spotting this - I can double check with the builds from source, but probably not today.
from resampy.
OK!
from resampy.
The PR makes my script run fast independent of the torch versions.
from resampy.
The PR makes my script run fast independent of the torch versions.
Great, thanks for confirming. I'll merge that PR and cut the 0.4 release on monday.
from resampy.
This may be tangentially related, but I've also isolated a huge slowdown to our scripts that import librosa with resampy 0.3.1 on our Coral Dev Board.
The result of sudo python3 -X importtime -c "import librosa"
has 92975737 microseconds (~93 seconds) of the import taken up by resampy.interpn . After downgrading resampy to 0.2.2, resampy.interpn only takes 3066 microseconds (~0.003 seconds) of the import.
To be more transparent about system:
Coral Dev Board (Mendel Linux 5.3) w/ LLVM 7
audioread 2.1.9
joblib 1.1.0
librosa 0.8.1
llvmlite 0.32.0
numba 0.49.0
numpy 1.16.2
resampy 0.3.1
scikit-learn 0.20.2
scipy 1.1.0
SoundFile 0.10.3.post1
from resampy.
Ok, the 0.4.0 release is up on pypi and making its way through conda-forge now. I'll close this issue out, but please reopen it if problems recur with the new release.
from resampy.
Related Issues (20)
- v0.3.0 produces other results than v0.2.2 HOT 3
- Resampy 0.3.0 is slow for multi-dimensional input HOT 5
- Issues with Numba 0.55.2 HOT 4
- Use parallel=True by default can be problematic in multiprocessing applications HOT 8
- Bringing import time back down
- Edge Case, rounding error in length calculation HOT 3
- Missing argument in guvectorize decorator ? HOT 2
- Quality issues? HOT 5
- resampling to the same hz HOT 1
- Incompatible with latest numba HOT 3
- Best quality resampling?
- PyPI status HOT 2
- Add linting CI
- Add pypi packaging workflow HOT 1
- Drop legacy python support
- Fix documentation builds HOT 1
- Update README badges
- Improve default filters and make the whole process reproducible HOT 14
- Cache filters on load
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from resampy.