Hi there! We just debugged a problem on the <a href="https://github.

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

<g-emoji class="g-emoji" alias="face_with_spiral_eyes" fallback-src="https://github.gi

You are really good at spotting this <g-e

Just as a heads up, PR <a class="issue-link js-issue-link" data-error-text="Failed to

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Resampy 0.3.1 slows down other libraries,about bmcfee/resampy

Comments (43)

bmcfee commented on July 22, 2024 2

@Path-A that's indeed a separate problem, and resolved by PR #109

from resampy.

bmcfee commented on July 22, 2024 1

Just as a heads up, I won't have time to look at this more until next week due to deadlines and travel. But I'm sure we'll get to the bottom of it.

from resampy.

bmcfee commented on July 22, 2024 1

Interesting - worth noting that it can't only be circleCI, since the librosa issue is reported on heroku, but maybe there's some common factor to how both of those platforms use docker that isn't shared by GCP.

I think this is beginning to smell like an upstream issue in numba (or maybe llvmlite), and not something due to how resampy is using numba.

from resampy.

bmcfee commented on July 22, 2024 1

😵‍💫

Now i'm really confused, because the issue over in librosa did not have torch involved at all.

from resampy.

bmcfee commented on July 22, 2024 1

You are really good at spotting this

😁 git whatchanged --since="2022-02-16" --until="2022-02-17" is a life-saver. Also having been bitten by silent MKL quirks in the past.

from resampy.

bmcfee commented on July 22, 2024 1

Just as a heads up, PR #109 implements a handful of changes as mentioned in #108, and also reverts parallel=False by default. I'm not sure this directly resolves the potential MKL conflict / docker issue, but if you could give that a shot and let me know, it would help prioritize putting out a bugfix release next week.

from resampy.

bmcfee commented on July 22, 2024

That's deeply weird. If you can't provide a reproducing example, can you at least describe more about what the environment that triggers slow-down looks like?

Nothing in librosa has changed relative to the resampy updates.

The only thing that comes to mind is that resampy 0.3.1 has a bit more going on at compilation time than 0.3.0 (different dtype targets are explicitly listed), but a 20x slowdown in unrelated code makes zeros sense to me.

from resampy.

sgugger commented on July 22, 2024

We observed the slowdown in circleCI while running our tests, the whole instructions are in this circleCI config file. I realize it may be difficult to reproduce, but basically all jobs that install the torch-speech dependency were getting really slow, so anything like:

pip install transformers[sklearn,torch,testing,sentencepiece,torch-speech]

Our best guess is that it somehow changed things in Numba/NumPy that then reverted to a slow version, but nothing more concrete than that.

from resampy.

bmcfee commented on July 22, 2024

Our best guess is that it somehow changed things in Numba/NumPy that then reverted to a slow version, but nothing more concrete than that.

Strange. The dependencies have not changed from 0.3.0 to 0.3.1, so I don't see why that would happen.

The only thing that changed was how we handle multi-channel resampling (see PR #104 ) which now uses numba generalized ufunc vectorization to bring down the interpolation loop to operate independently for each channel. In moving to @guvectorize, I had to explicitly set out the dtype targets, which I could envision adding a little overhead, but only when the functions are actually called (and then, only the first time).

The only time this should actually have an observable effect is when resampling multichannel signals, and then, it should be strictly better in 0.3.1 (see #102 for discussion). If you're not doing multi-channel, performance should be equivalent (and match or beat that of the 0.2 series).

The other thing that could be happening here is that resampy is going parallel and interfering with other processes in your CI. You could check this quickly by limiting the number of threads that numba can take (see docs) in your environment. At present, resampy exposes an interface to disable parallelism, but librosa does not (see this issue for what needs to happen there), so a global override is probably your best strategy for determining if this is the culprit if you're only accessing resampy through librosa.

from resampy.

bmcfee commented on July 22, 2024

@sgugger curious if you've made any headway on this?

from resampy.

vBaiCai commented on July 22, 2024

We encountered the same weird problem ...

The inference speed of the transformer will slow down with the resampy 0.3.1.

Downgrading the resampy ( 0.3.1 -> 0.2.2) solved the problem.

from resampy.

bmcfee commented on July 22, 2024

@vBaiCai thanks - can you provide any kind of hints for reproducing this behavior though?

from resampy.

ydshieh commented on July 22, 2024

I tried to reproduce, even using CircleCI's convenient docker image running on a GCP VM which has no issue. So far I can only see this issue when the code is running on CircleCI (with docker executor).

@vBaiCai: do you also see this issue on CircleCI, or some other environments?

The issue occurs after the commit a36275670f284e8f09d71ef3515dc7513aeba6e6 (fixed #102, fast resampling in high dimensions).

from resampy.

ydshieh commented on July 22, 2024

With some more investigation, here is my finding

The parallel=True in _resample_loop_p = jit(nopython=True, nogil=True, parallel=True)(_resample_loop) causes the problem.
- If I remove @guvectorize (for resample_f_p), no slow down.
- If I change to parallel=False, no slow down.
It doesn't matter if the methods are called or not. As long as they are imported, the issue occurs.
I can only observe this issue on CircleCI docker executor.

Results (simplified case, see reproduce section below)

_resample_loop_p = jit(nopython=True, nogil=True, parallel=True)(_resample_loop):
- 6.20s/it
- https://app.circleci.com/pipelines/github/huggingface/transformers/44062/workflows/91a929fa-cf95-4867-8e8b-8b68a7a24be0/jobs/511312
_resample_loop_p = jit(nopython=True, nogil=True, parallel=False)(_resample_loop):
- 47.26it/s
- https://app.circleci.com/pipelines/github/huggingface/transformers/44064/workflows/dd491839-2aa8-47a2-83e4-ab632bbfcf65/jobs/511328

To reproduce (running on CircleCI docker executor)

import torch
from torch import nn
from tqdm import tqdm


"""Numba implementation of resampler"""
from numba import (
    guvectorize,
    float32,
    float64,
    jit,
    prange,
    int16,
    int32,
    int64,
    complex64,
    complex128,
)


def _resample_loop(x, t_out, interp_win, interp_delta, num_table, scale, y):

    index_step = int(scale * num_table)
    time_register = 0.0

    n = 0
    frac = 0.0
    index_frac = 0.0
    offset = 0
    eta = 0.0
    weight = 0.0

    nwin = interp_win.shape[0]
    n_orig = x.shape[0]
    n_out = t_out.shape[0]

    for t in prange(n_out):
        time_register = t_out[t]

        # Grab the top bits as an index to the input buffer
        n = int(time_register)

        # Grab the fractional component of the time index
        frac = scale * (time_register - n)

        # Offset into the filter
        index_frac = frac * num_table
        offset = int(index_frac)

        # Interpolation factor
        eta = index_frac - offset

        # Compute the left wing of the filter response
        i_max = min(n + 1, (nwin - offset) // index_step)
        for i in range(i_max):

            weight = (
                interp_win[offset + i * index_step]
                + eta * interp_delta[offset + i * index_step]
            )
            y[t] += weight * x[n - i]

        # Invert P
        frac = scale - frac

        # Offset into the filter
        index_frac = frac * num_table
        offset = int(index_frac)

        # Interpolation factor
        eta = index_frac - offset

        # Compute the right wing of the filter response
        k_max = min(n_orig - n - 1, (nwin - offset) // index_step)
        for k in range(k_max):
            weight = (
                interp_win[offset + k * index_step]
                + eta * interp_delta[offset + k * index_step]
            )
            y[t] += weight * x[n + k + 1]


_resample_loop_p = jit(nopython=True, nogil=True, parallel=True)(_resample_loop)


@guvectorize(
    [
        (int16[:], float64[:], float64[:], float64[:], int32, float32, int16[:]),
        (int32[:], float64[:], float64[:], float64[:], int32, float32, int32[:]),
        (int64[:], float64[:], float64[:], float64[:], int32, float32, int64[:]),
        (float32[:], float64[:], float64[:], float64[:], int32, float32, float32[:]),
        (float64[:], float64[:], float64[:], float64[:], int32, float32, float64[:]),
        (
            complex64[:],
            float64[:],
            float64[:],
            float64[:],
            int32,
            float32,
            complex64[:],
        ),
        (
            complex128[:],
            float64[:],
            float64[:],
            float64[:],
            int32,
            float32,
            complex128[:],
        ),
    ],
    "(n),(m),(p),(p),(),()->(m)",
    nopython=True,
)
def resample_f_p(x, t_out, interp_win, interp_delta, num_table, scale, y):
    _resample_loop_p(x, t_out, interp_win, interp_delta, num_table, scale, y)


class DummyModel(nn.Module):
    def __init__(self, n_layers, dim):
        super().__init__()
        self.layers = [nn.Linear(dim, dim) for _ in range(n_layers)]
        self.activation = nn.Tanh()

    def forward(self, hidden_states):
        for layer in self.layers:
            hidden_states = layer(hidden_states)
            hidden_states = self.activation(hidden_states)

        return hidden_states


if __name__ == "__main__":
    batch_size = 2
    seq_len = 16384

    # dim = 128
    # model = DummyModel(n_layers=16, dim=dim)
    # input = torch.ones(batch_size, seq_len, dim)
    #
    # for i in tqdm(range(16)):
    #     output = model(input)
    #     print(output.shape)

    from transformers import AutoModelForSpeechSeq2Seq
    model = AutoModelForSpeechSeq2Seq.from_pretrained("hf-internal-testing/tiny-random-speech-encoder-decoder")
    encoder_input = torch.ones(batch_size, seq_len, dtype=torch.float32)
    decoder_input = torch.ones(batch_size, 128, dtype=torch.int32)

    for i in tqdm(range(16)):
        output = model(inputs=encoder_input, decoder_input_ids=decoder_input)
        print(output.logits.shape)

from resampy.

ydshieh commented on July 22, 2024

Actually, I also observe (some) slowdown on other environments. For example on GCP VM,

parallel=True: 30.0it/s
parallel=False: 50.0it/s

(but this doesn't cause real problem, as they are both fast enough)

from resampy.

bmcfee commented on July 22, 2024

Thanks @ydshieh for the detailed report!

It doesn't matter if the methods are called or not. As long as they are imported, the issue occurs.

This to me says that the issue is the jit-compile time, rather than the run time. I wonder if adding cache=True here could help resolve things?

from resampy.

bmcfee commented on July 22, 2024

We have some confirmation over in librosa that this is indeed a jit compilation issue and not a runtime issue per se.

Unfortunately, adding cache=True to the build to amortize this cost triggers a segfault in numba across sessions. This is apparently a known and long-standing issue numba/numba#4807

This is where it gets a little dicey from our side:

We can cache the serial implementation resample_f_s without issue. Caching on resample_f_p breaks.
We cannot cache the build of the inner loop function that is common to both. This is because numba currently does not use compilation flags when hashing the function, so enabling cache=True on these would result in either the parallel implementation being used for the serial or vice versa. Definitely not what we want. This too is a known issue: numba/numba#6845

(1) means that we always have to pay a compilation cost for the parallel import. Now, we could get around this by breaking into submodules and using some lazy-import tricks to defer the parallel build until it's absolutely necessary, but that makes me a bit nervous.

(2) means that at present, we have to pay the compilation cost twice on import until the upstream issue is resolved. I don't see a clean way around this, but I have a branch working right now where the inner loop function is compiled and cached twice using deep copies to avoid collisions, and that looks like it might be working.

from resampy.

ydshieh commented on July 22, 2024

@bmcfee Thank you for looking into this. I am not very familiar with these, but I have one question:

If you look the To reproduce (running on CircleCI docker executor) section above, you will see that no resample_XXX function is used at all while measuring the timing. There is _resample_loop_p = jit(...), but it is done before the time measurement. However, if jit with parallel=True + guvectorize, it takes much long time to finish the loop.

So I don't think the issue (just) comes from compilation cost.

from resampy.

bmcfee commented on July 22, 2024

There is _resample_loop_p = jit(...), but it is done before the time measurement. However, if jit with parallel=True + guvectorize, it takes much long time to finish the loop.

So I don't think the issue (just) comes from compilation cost.

Well that is downright baffling, and makes me suspect that something is going very wrong either in torch or in numba/llvmlite.

If the function isn't being called at all, compilation is the only thing we have left. Are you absolutely certain that something isn't calling the compiled interpolator function? Sorry, I have no idea what's going on in the transformers object hierarchy, and really don't have the bandwidth to dig into it further.

from resampy.

mthrok commented on July 22, 2024

We are observing the same issue in torchaudio CI.
resampy 0.3.0 or 0.2.2 works fine but resampy 0.3.1 causes timeout in tutorial generation job.

Failure with resampy 0.3.1
https://app.circleci.com/pipelines/github/pytorch/audio/11789/workflows/4f6d4b3d-5cab-4d9c-b9d9-2b2a7a3ebeb0/jobs/802671
Successful run after pinning resampy 0.2.2.
https://app.circleci.com/pipelines/github/pytorch/audio/11801/workflows/69164737-8627-464f-bb62-3ee1beb89163/jobs/804295?invite=true#step-106-241

The tutorial experiencing the timeout does not use resampy. It does not even import it. However, the tutorial executed before the failing one uses librosa.power_to_db. So there seems to be some hysteresis.
The librosa version is 0.9.2 in any cases.

All of build_docs job in build workflow after July 7th are failing. https://app.circleci.com/pipelines/github/pytorch/audio?branch=nightly

The execution stack looks like

CircleCI
- Docker
  - Make
    - conda
      - sphinx_gallery
        
        Python
The tutorial experiencing timeout is https://github.com/pytorch/audio/blob/main/examples/tutorials/online_asr_tutorial.py
The tutorial executed before is https://github.com/pytorch/audio/blob/main/examples/tutorials/audio_feature_augmentation_tutorial.py which uses librosa.power_to_db.

The particular code exhibiting timeout is our custom C++ operation, and this code is calling FFmpeg to fetch MP3 from AWS. The C++ code is bounded to Python using PyTorch, but not sure if it's related.

from resampy.

ydshieh commented on July 22, 2024

Are you absolutely certain that something isn't calling the compiled interpolator function?

I will try to find an example which only uses simple torch operations, or see if the the compiled function is called somehow.

from resampy.

bmcfee commented on July 22, 2024

I can only observe this issue on CircleCI docker executor.

I'm starting to wonder if docker is a necessary ingredient here - a similar issue was reported in librosa recently librosa/librosa#1541 , note the following comment in the issue:

I'm running things inside of a Docker container running heroku/heroku:20 (based off Ubuntu 20).
If I run import librosa from the host computer running macOS the speed is normal.
The Mac is a 2021 MacBook Pro M1.

Now, I'm pretty certain that the librosa issue is actually a duplicate of this issue, but I mention it because it specifically mentions that taking docker out of the equation does resolve the problem.

The fact that resampy 0.3.0 does not seem to trigger this points to a specific interaction between docker and @guvectorize (added in 0.3.1), but not parallel compilation (added in 0.3.0).

from resampy.

ydshieh commented on July 22, 2024

Hi, yes it seems to relate to docker. More precisely, I only see this issue occurs on CircleCI docker executor.

same docker image on GCP VM -> no issue
Linux VM executor (no docker) on CircleCI -> no issue

(also mentioned this in some previous comments)

from resampy.

ydshieh commented on July 22, 2024

@bmcfee I am able to reproduce this issue with simple torch ops

without parallel=False (and with guvectorize)
- job run link
  - nn.Embedding: 1e-5 ~ 5e-5 second
  - +: 1e-5 ~ 1e-4 second
with parallel=True (and with guvectorize)
- job run link
  - nn.Embedding: 5e-2 second
  - +: 5e-2 second

This is 1000x slowdown!

Would you like to provide a simple version regarding _resample_loop (could be any dummy function) and guvectorize signature, and open an issue either in numba, llvm or torch? (You definitely know much better than me on these things). Thank you. (Or we can just use the whole example I provide here)

To reproduce

change the __main__ in this comment to the following

if __name__ == "__main__":

    batch_size = 64
    seq_len = 128

    vocab_size = 1024
    hidden_size = 32
    layer = nn.Embedding(vocab_size, hidden_size)

    inputs = torch.ones(batch_size, seq_len, dtype=torch.int32)

    for i in range(10):
        output = layer(inputs)

    for i in range(10):
        s = datetime.datetime.now()
        for j in range(100):
            output = layer(inputs)
        e = datetime.datetime.now()
        print(f"nn.Embedding ({i}): {(e - s).total_seconds() / 100} seconds")
    print("=" * 40)

    x = torch.ones(64, 128, 32, dtype=torch.float32)
    y = torch.ones(64, 128, 32, dtype=torch.float32)

    for i in range(10):
        z = x + y

    for i in range(10):
        s = datetime.datetime.now()
        for j in range(100):
            z = x + y
        e = datetime.datetime.now()
        print(f"z = x + y ({i}): {(e - s).total_seconds() / 100} seconds")
    print("=" * 40)

    q = torch.ones(64, 4, 128, 8, dtype=torch.float32)
    k = torch.ones(64, 4, 8, 128, dtype=torch.float32)

    for i in range(10):
        attention_scores = torch.matmul(q, k)

    for i in range(10):
        s = datetime.datetime.now()
        for j in range(100):
            attention_scores = torch.matmul(q, k)
        e = datetime.datetime.now()
        print(f"torch.matmul ({i}): {(e - s).total_seconds() / 100} seconds")
    print("=" * 40)

from resampy.

bmcfee commented on July 22, 2024

Would you like to provide a simple version regarding _resample_loop (could be any dummy function) and guvectorize signature, and open an issue either in numba, llvm or torch? (You definitely know much better than me on these things). Thank you. (Or we can just use the whole example I provide here)

I was able to run this script, and verify that I don't see any slowdown on my dev machine (not using docker). I think that before raising an issue upstream, we should try to reproduce this outside of circleci, just to eliminate other sources of variation on shared machines.

Do you know if it's possible to pull down the docker image you're using so we can try it locally?

from resampy.

ydshieh commented on July 22, 2024

It is cimg/python:3.7.12.

However, running that image with GCP VM didn't produce issue (I tried with some previous test code). ~~I can try the above latest code snippet with docker on GCP.~~ --> Checked, same image running on GCP VM: no issue

In fact, I think it is not the image issue, it is probably the way docker server and docker image is launched on CircleCI.

from resampy.

bmcfee commented on July 22, 2024

In fact, I think it is not the image issue, it is probably the way docker server and docker image is launched on CircleCI.

Do we know anything more about how circle and gcp differ in this?

from resampy.

ydshieh commented on July 22, 2024

Do we know anything more about how circle and gcp differ in this?

Not from me, unfortunately. However, It would be great if the author of the librosa issue could try my code snippet on heroku too and see if the slowdown also occurs.

from resampy.

ydshieh commented on July 22, 2024

Looks like with torch 1.12, this issue disappear.

See

1.11 run
https://app.circleci.com/pipelines/github/huggingface/transformers/44796/workflows/b6f297e2-c08c-4b39-8a5d-306f6f16304b

1.12 run
https://app.circleci.com/pipelines/github/huggingface/transformers/44794/workflows/52c5b23a-12fb-461d-aea4-0673b99a9e47

from resampy.

mthrok commented on July 22, 2024

Sorry to add more confusing information, but we observed this with PyTorch 1.12.0 as well in TorchAudio release branch.

The failing job
https://app.circleci.com/pipelines/github/pytorch/audio/11756/workflows/84b6bb5e-47f5-4494-9663-19f045180d61/jobs/797806

Pinning resampy 0.2.2 fixed the issue.

from resampy.

ydshieh commented on July 22, 2024

😵‍💫

Now i'm really confused, because the issue over in librosa did not have torch involved at all.

Let me run a few more times. I am worried that my previous statement might be False

from resampy.

bmcfee commented on July 22, 2024

ok, well nice to know that the torch update is (might be?) orthogonal to this issue.

from resampy.

bmcfee commented on July 22, 2024

Let me run a few more times. I am worried that my previous statement might be False

@ydshieh any confirmation on this? I have some cycles to work on this issue now.

from resampy.

ydshieh commented on July 22, 2024

I have enough evidence to say yes. More precisely, the script I provided is slow on torch-1.12.0.dev20220216%2Bcpu-cp37-cp37m-linux_x86_64.whl but not on torch-1.12.0.dev20220217%2Bcpu-cp37-cp37m-linux_x86_64.whl.
(Of course, all statements are for running on CircleCI with docker executor).

I do have 2 or 3 times not seeing the issue under the same condition.

If you are interested, here is the run page. You can just check the 4 latest runs.

I am trying to build torch from source for the commits between these 2 dates. Hope we can see which change/commit solves the issue.

I feel that the issue opened on librosa is different from the one I face. Here we don't have very slow import, but instead having some ops slowing down once resampy 0.3.1 is imported.

from resampy.

bmcfee commented on July 22, 2024

I have enough evidence to say yes. More precisely, the script I provided is slow on torch-1.12.0.dev20220216%2Bcpu-cp37-cp37m-linux_x86_64.whl but not on torch-1.12.0.dev20220217%2Bcpu-cp37-cp37m-linux_x86_64.whl. (Of course, all statements are for running on CircleCI with docker executor).

I do have 2 or 3 times not seeing the issue under the same condition.

Oof. How hard would it be to get a diff of those two versions? My hunch here is that none of this has anything to do with changes in pytorch, and might just come down to inherent randomness of running on a shared host. I'd love to be wrong about that though!

I feel that the issue opened on librosa is different from the one I face. Here we don't have very slow import, but instead having some ops slowing down once resampy 0.3.1 is imported.

Good observation. The slow import might be due to all the build targets in guvectorize that hit at import-time (rather than JIT). I have a local branch that can bring that down substantially by dropping the explicit build targets. The implications of that are rather involved though, so I'll migrate it to another issue.

from resampy.

ydshieh commented on July 22, 2024

There are about 120 commits that day in torch, and look like a lot of files changed.

I also think it's possible from the randomness of different host.

from resampy.

bmcfee commented on July 22, 2024

There are about 120 commits that day in torch, and look like a lot of files changed.

Looking through the log now, this looks like the only reasonable culprit IMO:

pytorch/pytorch@17b3ba148

I could definitely see potential issues stemming from MKL interacting badly with (other) parallel code.

from resampy.

ydshieh commented on July 22, 2024

You are really good at spotting this - I can double check with the builds from source, but probably not today.

from resampy.

ydshieh commented on July 22, 2024

OK!

from resampy.

ydshieh commented on July 22, 2024

The PR makes my script run fast independent of the torch versions.

from resampy.

bmcfee commented on July 22, 2024

The PR makes my script run fast independent of the torch versions.

Great, thanks for confirming. I'll merge that PR and cut the 0.4 release on monday.

from resampy.

Path-A commented on July 22, 2024

This may be tangentially related, but I've also isolated a huge slowdown to our scripts that import librosa with resampy 0.3.1 on our Coral Dev Board.

The result of sudo python3 -X importtime -c "import librosa" has 92975737 microseconds (~93 seconds) of the import taken up by resampy.interpn . After downgrading resampy to 0.2.2, resampy.interpn only takes 3066 microseconds (~0.003 seconds) of the import.

To be more transparent about system:

Coral Dev Board (Mendel Linux 5.3) w/ LLVM 7
audioread 2.1.9
joblib 1.1.0
librosa 0.8.1
llvmlite 0.32.0
numba 0.49.0
numpy 1.16.2
resampy 0.3.1
scikit-learn 0.20.2
scipy 1.1.0
SoundFile 0.10.3.post1

from resampy.

bmcfee commented on July 22, 2024

Ok, the 0.4.0 release is up on pypi and making its way through conda-forge now. I'll close this issue out, but please reopen it if problems recur with the new release.

from resampy.

Resampy 0.3.1 slows down other libraries about resampy HOT 43 CLOSED

Comments (43)

Results (simplified case, see reproduce section below)

To reproduce (running on CircleCI docker executor)

To reproduce

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent