
soundstream's Introduction

SoundStream: An End-to-End Neural Audio Codec

This repository is an implementation of the article of the same name.

SoundStream's architecture

The RVQ (Residual Vector Quantizer) implementation relies on lucidrains' repository.

I built this implementation to serve my own needs, so some features from the original article are missing.

Missing pieces

  • Denoising: this implementation is not built to denoise, so there is no conditioning signal nor Feature-wise Linear Modulation (FiLM) blocks.
  • Bitrate scalability: for now, quantizer dropout has not been implemented.
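Quantizer dropout, as described in the article, would train the model to operate at multiple bitrates by randomly truncating the number of residual stages per training batch. A minimal sketch of the idea, assuming a list of per-stage quantizer callables (all names here are illustrative, not from this repo):

```python
import torch

def apply_quantizer_dropout(residual_quantizers, e, max_nq=8):
    """Randomly keep only the first nq residual quantizers for this batch,
    so the decoder learns to work from any prefix of the code indices."""
    nq = torch.randint(1, max_nq + 1, (1,)).item()  # sample nq uniformly in [1, max_nq]
    quantized = torch.zeros_like(e)
    residual = e
    for quantizer in residual_quantizers[:nq]:
        q = quantizer(residual)        # quantize the current residual
        residual = residual - q        # pass the remainder to the next stage
        quantized = quantized + q      # accumulate the reconstruction
    return quantized
```

At inference time, transmitting only the first nq stages' indices then yields a proportionally lower bitrate.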

Citations

@misc{zeghidour2021soundstream,
    title   = {SoundStream: An End-to-End Neural Audio Codec},
    author  = {Neil Zeghidour and Alejandro Luebs and Ahmed Omran and Jan Skoglund and Marco Tagliasacchi},
    year    = {2021},
    eprint  = {2107.03312},
    archivePrefix = {arXiv},
    primaryClass = {cs.SD}
}

soundstream's People

Contributors

wesbz


soundstream's Issues

License?

Hi,
The project does not include any license information.
What license applies to this code?

Question about commit loss

Hi, thanks for this cool work! I noticed that your code does not use the commitment loss:

def forward(self, x):
    e = self.encoder(x)
    quantized, _, _ = self.quantizer(e)
    o = self.decoder(quantized)
    return o

Why was it done this way?
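For context, a VQ-VAE style commitment loss penalizes the encoder output for drifting away from its assigned code. A self-contained sketch of the generic formulation (this is not the repo's quantizer, whose extra return values are simply discarded in the snippet above; names and the `beta` weight are illustrative):

```python
import torch
import torch.nn.functional as F

def quantize_with_commit_loss(e, codebook, beta=0.25):
    """Nearest-neighbour quantization with a VQ-VAE style commitment loss.
    `e` is (B, T, D) encoder output; `codebook` is an (N, D) tensor of codes."""
    # distance from every encoder vector to every code vector
    d = torch.cdist(e, codebook.unsqueeze(0).expand(e.size(0), -1, -1))
    idx = d.argmin(dim=-1)                       # (B, T) code indices
    q = codebook[idx]                            # (B, T, D) quantized vectors
    commit_loss = beta * F.mse_loss(e, q.detach())
    # straight-through estimator: gradients flow back to the encoder
    q = e + (q - e).detach()
    return q, idx, commit_loss
```

The commitment term would then be added to the generator loss rather than thrown away.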

Issues on the mismatch of the sequence length

Hi, thank you so much for your contribution to this project! One issue: after padding in the SoundStream model, the output G(x) and the input x have different lengths (e.g., 152000 vs. 151920), which causes a mismatch in feature-map lengths. How can this be solved? Thanks.
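A common workaround, assuming the extra samples come from padding at the edges, is to crop both signals to the shorter length before computing any loss. A generic sketch, not code from this repo:

```python
import torch

def match_lengths(x, g_x):
    """Crop both waveforms to the shorter length along the time axis,
    so that sample-wise and spectral losses see aligned tensors."""
    n = min(x.size(-1), g_x.size(-1))
    return x[..., :n], g_x[..., :n]
```

This hides rather than fixes the padding asymmetry, but it is enough to make the losses computable.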

Updates

Hi! May I know if you will continue to work on this? Thanks!

Is the code runnable without changing parameters?

Hi, first of all thank you for sharing the code!

Having worked with this code for a while, I am wondering how to run it: is the code runnable without any modification?

For me, I first put the trimmed training files (mono, 10 seconds, 44100 Hz, wav) into a folder and changed the path in the script, but there seemed to be some dimension errors and the training procedure could not continue. I looked into the code and added permute calls in a few places to make the errors disappear (e.g. permute(0, 2, 1) right before the quantizer). After these modifications the code finally ran, but the quantizer produces constants, so the results sound horrible.

Therefore, I am wondering if I did anything wrong. In particular, is editing the code necessary, or did I encounter these problems because of a fault in my parameters?

Thank you!

Inquiry about Pre-trained Models

Hello, I've been following your project with great interest.
I was wondering if you have any pre-trained models available for this project?
Thank you for your time and effort in this project.

Question about bit rate

[Screenshot of the paper's bitrate equation]
"each quantizer uses a codebook of size N = 2^(r/Nq) = 2^(80/8) = 1024"
But I find each quantizer's codebook is 1024×512, so is N = 1024×512? And 8×10×9 = 720 bits, not 80 bits?
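For reference, in the paper the bitrate follows from the transmitted indices, not from the codebook's parameter count: each codebook stores N = 1024 vectors, the 512 is the embedding dimension of each code vector (stored in the model, never transmitted), and each quantizer costs log2(N) = 10 bits per frame. A worked sketch of my reading of the paper, not this repo's code:

```python
import math

N = 1024                                    # codebook entries per quantizer
Nq = 8                                      # number of residual quantizers
bits_per_quantizer = math.log2(N)           # 10 bits: only the index is sent
bits_per_frame = Nq * bits_per_quantizer    # 80 bits per encoder frame
frames_per_second = 75                      # 24 kHz input / 320x total striding
bitrate = bits_per_frame * frames_per_second  # 6000 bits/s, i.e. 6 kbps
```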

gradient computation has been modified by an inplace operation

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [1, 1, 1, 7]], which is output 0 of AsStridedBackward0, is at version 2; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck

question about the length

Hi, thanks for your great work, but I am confused about the length computation in the code below:

lengths_s_x = 1 + torch.div(lengths_x, 256, rounding_mode="floor")

I wonder if the length should instead be:

lengths_s_x = torch.div(lengths_x, 256, rounding_mode="floor") - (1024/256 - 1)

Thanks in advance for your answer.
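One way to settle questions like this is to measure the downsampled length empirically rather than derive it. A quick check with a single illustrative convolution (kernel 1024, stride 256, no padding; this is not the repo's actual encoder), compared against the standard PyTorch output-length formula:

```python
import torch

# Illustrative stand-in for a strided encoder layer, not the repo's encoder.
conv = torch.nn.Conv1d(1, 1, kernel_size=1024, stride=256)

x = torch.randn(1, 1, 16000)
out_len = conv(x).size(-1)

# PyTorch's formula for Conv1d with padding=0, dilation=1:
# L_out = floor((L_in - kernel_size) / stride) + 1
formula = (16000 - 1024) // 256 + 1
assert out_len == formula
```

Running the real encoder on a dummy batch in the same way would show directly which of the two length expressions matches.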

spectral_reconstruction_loss

In spectral_reconstruction_loss I see that n_mels = 8, but in the paper they say it is 64:

"t-th frame of a 64-bin mel-spectrogram computed with window length equal to s and hop length equal to s/4"

Problems about this project.

Firstly, thank you for sharing this code. I trained with the VCTK data set, but unfortunately I did not get good results. These are the main problems I found:
1. The audio generated by the generator is just tonal noise, totally irrelevant to the input signal, even after several epochs of training.
2. The components of g_loss are badly unbalanced: the adversarial loss is about 1e0 in magnitude, the feature loss about 1e3, and the reconstruction loss about 1e6. I tried scaling them to the same magnitude, but it did not seem to help the final output signal.
3. I tried to implement the paper myself and also got bad-quality audio. There must be some mistake, and I really don't have a clue.
@wesbz Have you encountered the problems above? Or have you gotten promising results with this project?

Potential issue with `CausalConvTranspose1d`

Hello,

First of all thanks a lot for publishing this repo. I was trying to understand how transposing a causal convolution works, but I am having a tough time wrapping my head around it.

I notice that in CausalConvTranspose1d.forward you took the source code of torch.nn.ConvTranspose1d, with a twist at the end where you remove the last few elements of the output.

Shouldn't the first few elements be removed instead? Causal convolution implies that padding is added to the left of the input signal, and the purpose of a transposed convolution, as I understand it, is to recover a signal similar to the one fed into the convolution operation.

I would therefore expect the first few elements (basically the ones introduced by the padding of the corresponding convolution) to be removed.

Looking forward to hearing your thoughts on this!
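For comparison, one common convention trims kernel_size - stride samples so that the output length is exactly stride times the input length; whether those samples come off the start or the end should mirror how the matching causal convolution was padded. A generic sketch of the end-trimming variant (not this repo's implementation):

```python
import torch
import torch.nn as nn

class CausalConvTranspose1dSketch(nn.Module):
    """Transposed conv trimmed so that output length == stride * input length.
    This sketch trims from the end; trimming from the start instead is the
    alternative discussed above, and the right choice depends on where the
    matching causal convolution placed its padding."""
    def __init__(self, in_ch, out_ch, kernel_size, stride):
        super().__init__()
        self.conv = nn.ConvTranspose1d(in_ch, out_ch, kernel_size, stride)
        self.trim = kernel_size - stride   # extra samples produced by the kernel

    def forward(self, x):
        y = self.conv(x)                   # length (L - 1) * stride + kernel_size
        if self.trim > 0:
            y = y[..., :y.size(-1) - self.trim]
        return y                           # length L * stride
```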
