
soundstream's Introduction

SoundStream: An End-to-End Neural Audio Codec

This repository is an implementation of the article of the same name.

SoundStream's architecture

The RVQ (Residual Vector Quantizer) implementation relies on lucidrains' repository.

I built this implementation to serve my own needs, so some features from the original article are missing.

Missing pieces

  • Denoising: this implementation is not built to denoise, so there is no conditioning signal nor Feature-wise Linear Modulation (FiLM) blocks.
  • Bitrate scalability: for now, quantizer dropout has not been implemented.
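Quantizer dropout, as described in the article, would train the model to operate at multiple bitrates by randomly truncating the number of residual stages per training batch. A minimal sketch of the idea, assuming a list of per-stage quantizer callables (all names here are illustrative, not from this repo):

```python
import torch

def apply_quantizer_dropout(residual_quantizers, e, max_nq=8):
    """Randomly keep only the first nq residual quantizers for this batch,
    so the decoder learns to work from any prefix of the code indices."""
    nq = torch.randint(1, max_nq + 1, (1,)).item()  # sample nq uniformly in [1, max_nq]
    quantized = torch.zeros_like(e)
    residual = e
    for quantizer in residual_quantizers[:nq]:
        q = quantizer(residual)        # quantize the current residual
        residual = residual - q        # pass the remainder to the next stage
        quantized = quantized + q      # accumulate the reconstruction
    return quantized
```

At inference time, transmitting only the first nq stages' indices then yields a proportionally lower bitrate.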

Citations

@misc{zeghidour2021soundstream,
    title   = {SoundStream: An End-to-End Neural Audio Codec},
    author  = {Neil Zeghidour and Alejandro Luebs and Ahmed Omran and Jan Skoglund and Marco Tagliasacchi},
    year    = {2021},
    eprint  = {2107.03312},
    archivePrefix = {arXiv},
    primaryClass = {cs.SD}
}

soundstream's People

Contributors

wesbz


soundstream's Issues

License?

Hi,
The project does not include any license information.
What license applies to this code?

Question about commit loss

Hi, thanks for this cool work! I noticed that your code does not use the commitment loss:

def forward(self, x):
    e = self.encoder(x)
    quantized, _, _ = self.quantizer(e)
    o = self.decoder(quantized)
    return o

Why was it done this way?
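For context, a VQ-VAE style commitment loss penalizes the encoder output for drifting away from its assigned code. A self-contained sketch of the generic formulation (this is not the repo's quantizer, whose extra return values are simply discarded in the snippet above; names and the `beta` weight are illustrative):

```python
import torch
import torch.nn.functional as F

def quantize_with_commit_loss(e, codebook, beta=0.25):
    """Nearest-neighbour quantization with a VQ-VAE style commitment loss.
    `e` is (B, T, D) encoder output; `codebook` is an (N, D) tensor of codes."""
    # distance from every encoder vector to every code vector
    d = torch.cdist(e, codebook.unsqueeze(0).expand(e.size(0), -1, -1))
    idx = d.argmin(dim=-1)                       # (B, T) code indices
    q = codebook[idx]                            # (B, T, D) quantized vectors
    commit_loss = beta * F.mse_loss(e, q.detach())
    # straight-through estimator: gradients flow back to the encoder
    q = e + (q - e).detach()
    return q, idx, commit_loss
```

The commitment term would then be added to the generator loss rather than thrown away.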

Issues on the mismatch of the sequence length

Hi, thank you so much for your contribution to this project! One issue: after padding in the SoundStream model, the output G(x) and the input x have different lengths (e.g., 152000 vs. 151920), which causes a mismatch in feature-map lengths. How can this be solved? Thanks.
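A common workaround, assuming the extra samples come from padding at the edges, is to crop both signals to the shorter length before computing any loss. A generic sketch, not code from this repo:

```python
import torch

def match_lengths(x, g_x):
    """Crop both waveforms to the shorter length along the time axis,
    so that sample-wise and spectral losses see aligned tensors."""
    n = min(x.size(-1), g_x.size(-1))
    return x[..., :n], g_x[..., :n]
```

This hides rather than fixes the padding asymmetry, but it is enough to make the losses computable.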

Updates

Hi! May I know if you will continue to work on this? Thanks!

Is the code runnable without changing parameters?

Hi, first of all thank you for sharing the code!

Having worked with this code for a while, I am wondering how to run it: is the code runnable without any modification?

For me, I first put the trimmed training files (mono, 10 seconds, 44100 Hz, wav) into a folder and changed the path in the script, but there seemed to be some dimension errors and the training procedure could not continue. I looked into the code and added permute calls in a few places to make the errors disappear (e.g. permute(0, 2, 1) right before the quantizer). After these modifications the code finally ran, but the quantizer produces constants, so the results sound horrible.

Therefore, I am wondering if I did anything wrong. In particular, is editing the code necessary, or did I encounter these problems because of a fault in my parameters?

Thank you!

Inquiry about Pre-trained Models

Hello, I've been following your project with great interest.
I was wondering if you have any pre-trained models available for this project?
Thank you for your time and effort in this project.

Question about bit rate

[Screenshot of the paper's bitrate equation]
"each quantizer uses a codebook of size N = 2^(r/Nq) = 2^(80/8) = 1024"
But I find each quantizer's codebook is 1024×512, so is N = 1024×512? And 8×10×9 = 720 bits, not 80 bits?
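For reference, in the paper the bitrate follows from the transmitted indices, not from the codebook's parameter count: each codebook stores N = 1024 vectors, the 512 is the embedding dimension of each code vector (stored in the model, never transmitted), and each quantizer costs log2(N) = 10 bits per frame. A worked sketch of my reading of the paper, not this repo's code:

```python
import math

N = 1024                                    # codebook entries per quantizer
Nq = 8                                      # number of residual quantizers
bits_per_quantizer = math.log2(N)           # 10 bits: only the index is sent
bits_per_frame = Nq * bits_per_quantizer    # 80 bits per encoder frame
frames_per_second = 75                      # 24 kHz input / 320x total striding
bitrate = bits_per_frame * frames_per_second  # 6000 bits/s, i.e. 6 kbps
```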

gradient computation has been modified by an inplace operation

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [1, 1, 1, 7]], which is output 0 of AsStridedBackward0, is at version 2; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck

question about the length

Hi, thanks for your great work, but I am confused about the length computation in the code below:

lengths_s_x = 1 + torch.div(lengths_x, 256, rounding_mode="floor")

I wonder if the length should instead be:

lengths_s_x = torch.div(lengths_x, 256, rounding_mode="floor") - (1024/256 - 1)

Thanks in advance for your answer.
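One way to settle questions like this is to measure the downsampled length empirically rather than derive it. A quick check with a single illustrative convolution (kernel 1024, stride 256, no padding; this is not the repo's actual encoder), compared against the standard PyTorch output-length formula:

```python
import torch

# Illustrative stand-in for a strided encoder layer, not the repo's encoder.
conv = torch.nn.Conv1d(1, 1, kernel_size=1024, stride=256)

x = torch.randn(1, 1, 16000)
out_len = conv(x).size(-1)

# PyTorch's formula for Conv1d with padding=0, dilation=1:
# L_out = floor((L_in - kernel_size) / stride) + 1
formula = (16000 - 1024) // 256 + 1
assert out_len == formula
```

Running the real encoder on a dummy batch in the same way would show directly which of the two length expressions matches.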

spectral_reconstruction_loss

In spectral_reconstruction_loss I see that n_mels = 8, but in the paper they say it is 64:

"t-th frame of a 64-bin mel-spectrogram computed with window length equal to s and hop length equal to s/4"

Problems about this project.

Firstly, thank you for sharing this code. I trained with the VCTK data set, but unfortunately I did not get good results. These are the main problems I found:
1. The audio generated by the generator is just tonal noise, totally irrelevant to the input signal, even after several epochs of training.
2. The components of g_loss are badly unbalanced: the adversarial loss is about 1e0 in magnitude, the feature loss about 1e3, and the reconstruction loss about 1e6. I tried scaling them to the same magnitude, but it did not seem to help the final output signal.
3. I tried to implement the paper myself and also got bad-quality audio. There must be some mistake, and I really don't have a clue.
@wesbz Have you encountered the problems above? Or have you gotten promising results with this project?

Potential issue with `CausalConvTranspose1d`

Hello,

First of all thanks a lot for publishing this repo. I was trying to understand how transposing a causal convolution works, but I am having a tough time wrapping my head around it.

I notice that in CausalConvTranspose1d.forward you took the source code of torch.nn.ConvTranspose1d, with a twist at the end where you remove the last few elements of the output.

Shouldn't the first few elements be removed instead? Causal convolution implies that padding is added to the left of the input signal, and the purpose of a transposed convolution, as I understand it, is to recover a signal similar to the one fed into the convolution operation.

I would therefore expect the first few elements (basically the ones introduced by the padding of the corresponding convolution) to be removed.

Looking forward to hearing your thoughts on this!
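For comparison, one common convention trims kernel_size - stride samples so that the output length is exactly stride times the input length; whether those samples come off the start or the end should mirror how the matching causal convolution was padded. A generic sketch of the end-trimming variant (not this repo's implementation):

```python
import torch
import torch.nn as nn

class CausalConvTranspose1dSketch(nn.Module):
    """Transposed conv trimmed so that output length == stride * input length.
    This sketch trims from the end; trimming from the start instead is the
    alternative discussed above, and the right choice depends on where the
    matching causal convolution placed its padding."""
    def __init__(self, in_ch, out_ch, kernel_size, stride):
        super().__init__()
        self.conv = nn.ConvTranspose1d(in_ch, out_ch, kernel_size, stride)
        self.trim = kernel_size - stride   # extra samples produced by the kernel

    def forward(self, x):
        y = self.conv(x)                   # length (L - 1) * stride + kernel_size
        if self.trim > 0:
            y = y[..., :y.size(-1) - self.trim]
        return y                           # length L * stride
```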
