PhoneLM

About

Text-to-speech using phonemes as inputs and audio codec codes as outputs. Loosely based on MegaByte, VALL-E, and Encodec.

Method

  • Use G2P to encode text into phonemes.
  • Use Encodec to encode and decode audio.
  • Custom LJSpeech dataloader that pairs phonemes with Encodec audio codes (see the encoding sketch after this list).
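
As a rough illustration of the first two steps, the sketch below converts a line of text to phonemes with the g2p_en package and encodes a wav with Meta's Encodec at 24 kHz / 3 kbps, which yields 4 codebooks of 1024 entries each. The file path, example text, and bandwidth setting are illustrative assumptions rather than values taken from this repo.

```python
import torch
import torchaudio
from g2p_en import G2p            # grapheme-to-phoneme converter
from encodec import EncodecModel  # Meta's Encodec neural audio codec

# Text -> phoneme symbols
g2p = G2p()
phonemes = g2p("Printing, in the only sense with which we are at present concerned")
# e.g. ['P', 'R', 'IH1', 'N', 'T', 'IH0', 'NG', ...]

# Audio -> codec codes (24 kHz model at 3 kbps: 4 codebooks of 1024 entries)
codec = EncodecModel.encodec_model_24khz()
codec.set_target_bandwidth(3.0)

wav, sr = torchaudio.load("LJSpeech-1.1/wavs/LJ001-0001.wav")     # illustrative path
wav = torchaudio.functional.resample(wav, sr, codec.sample_rate)  # 22.05 kHz -> 24 kHz
wav = wav.unsqueeze(0)                                            # (batch, channels, time)

with torch.no_grad():
    frames = codec.encode(wav)                     # list of (codes, scale) tuples
codes = torch.cat([c for c, _ in frames], dim=-1)  # (1, 4 codebooks, T frames)
print(len(phonemes), codes.shape)
```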

LJSpeech

  • Overfit the model on one sample from LJSpeech
    • Combine the token spaces of the text and the audio codec codes (a minimal layout sketch follows this list)
    • LJ016-0073-synth.wav: the initial "Mr. Cope" can just about be made out
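
One plausible way to realise the combined token space (the exact layout here is an assumption, not something pinned down by the repo): keep phoneme IDs at the bottom of the vocabulary, shift every Encodec code past them, and concatenate the phoneme prompt with the frame-major codec codes into the single flat sequence that the decoder overfits on.

```python
import torch

# Assumed vocabulary layout: phoneme IDs occupy [0, PHONEME_VOCAB_SIZE) and every
# Encodec code is shifted past them, so both modalities share one flat vocabulary.
PHONEME_VOCAB_SIZE = 128   # upper bound on distinct phoneme symbols (assumption)
CODEBOOK_SIZE = 1024       # entries per Encodec codebook
N_CODEBOOKS = 4            # codebooks at 24 kHz / 3 kbps

def combine(phoneme_ids: torch.Tensor, codes: torch.Tensor) -> torch.Tensor:
    """Build one flat token sequence: [phoneme tokens ... | audio tokens ...].

    phoneme_ids: (P,)             integer phoneme IDs from the G2P step
    codes:       (N_CODEBOOKS, T) integer codes straight from Encodec
    """
    shifted = codes + PHONEME_VOCAB_SIZE                 # move codes out of the phoneme range
    audio_tokens = shifted.transpose(0, 1).reshape(-1)   # frame-major: c1 c2 c3 c4 | c1 ...
    return torch.cat([phoneme_ids, audio_tokens])

phoneme_ids = torch.randint(0, PHONEME_VOCAB_SIZE, (12,))    # stand-in for real phoneme IDs
codes = torch.randint(0, CODEBOOK_SIZE, (N_CODEBOOKS, 150))  # stand-in for Encodec output
sequence = combine(phoneme_ids, codes)
print(sequence.shape)   # (12 + 4 * 150,)
```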

Inspiration

This model is loosely based on the VALL-E paper by Microsoft. It uses the MegaByte-inspired model from Lucidrains as the Transformer decoder. Just as in VALL-E, a user's text prompt is converted into phonemes using G2P (grapheme-to-phoneme conversion), and the Encodec audio codec codes are then predicted. However, unlike VALL-E, only an autoregressive (AR) model is used. In the VALL-E paper, an AR model takes the phonemes together with audio codec code snippets from a source audio and predicts the codes of the first codebook. Once the AR model is finished, a separate non-autoregressive (NAR) model takes the entire sequence and predicts all of the codebook 2 to codebook N codes. This increases the complexity of the approach, since two models are now required, and raises the possibility that the NAR model, which unlike the AR model cannot attend to all past outputs, reduces audio quality and leads to repeated outputs. In practice, the use of phonemes as input to VALL-E may alleviate this; the approach here instead explores predicting the entire sequence autoregressively, across all codebooks at once.
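
Because a single AR model emits the whole sequence, turning a prediction back into audio only requires undoing the token-space offset and handing the codes to Encodec's decoder. This is a minimal sketch under the same assumed layout as above; the random tokens stand in for real model output, and the constants are assumptions.

```python
import torch
from encodec import EncodecModel

PHONEME_VOCAB_SIZE = 128   # assumed phoneme ID range, as in the combining sketch above
N_CODEBOOKS = 4

def tokens_to_codes(audio_tokens: torch.Tensor) -> torch.Tensor:
    """Undo the flat frame-major layout: tokens -> (1, 4, T) Encodec codes."""
    frames = audio_tokens.reshape(-1, N_CODEBOOKS)                      # (T, 4)
    return (frames - PHONEME_VOCAB_SIZE).transpose(0, 1).unsqueeze(0)   # (1, 4, T)

codec = EncodecModel.encodec_model_24khz()
codec.set_target_bandwidth(3.0)

# Stand-in for the audio tokens the AR model would emit after the phoneme prompt
generated_audio_tokens = torch.randint(PHONEME_VOCAB_SIZE, PHONEME_VOCAB_SIZE + 1024, (4 * 150,))

codes = tokens_to_codes(generated_audio_tokens)
with torch.no_grad():
    wav = codec.decode([(codes, None)])   # (1, 1, samples) at 24 kHz
print(wav.shape)
```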

This is inspired by the fact that the authors of the original MegaByte paper perform autoregressive audio prediction on raw audio data. They treat the audio files as raw byte sequences and train a model on 2 TB worth of audio, finding that it achieves a better (lower) bits-per-byte (bpb) score than vanilla Transformer or Perceiver architectures. In principle, this means that the model is more efficient and accurate at modelling raw audio byte sequences than other approaches. Another benefit of the method is that its patch-based autoregressive generation may be well suited to the codebooks used by Encodec. As the patch size can be set to 4 (for 4 codebooks, each of which takes 1 of 1024 values), the local model of MegaByte can focus on modelling individual audio codec frames while the global model focuses on the larger context. Hopefully this greatly improves audio quality compared to VALL-E while being much simpler to train.
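
To make the patch structure concrete, here is a minimal sketch of instantiating Lucidrains' MEGABYTE-pytorch model with a local sequence length of 4, so each patch corresponds to one Encodec frame (4 codebook values). The vocabulary size, model dimensions, and depths are illustrative assumptions, and the keyword arguments follow my reading of the MEGABYTE-pytorch README, which may differ between versions.

```python
import torch
from MEGABYTE_pytorch import MEGABYTE   # Lucidrains' MegaByte implementation

# Assumed sizes: one shared vocabulary for phoneme IDs plus offset Encodec codes;
# the local (inner) sequence length of 4 matches the 4 codebooks of one frame.
model = MEGABYTE(
    num_tokens = 128 + 1024,   # combined phoneme + codec-code vocabulary (assumption)
    dim = (768, 256),          # global and local model dimensions
    depth = (6, 4),            # global and local model depths
    max_seq_len = (1024, 4),   # up to 1024 patches, each patch = one 4-code frame
)

# A dummy batch shaped (batch, patches, patch_size), as the model expects
tokens = torch.randint(0, 128 + 1024, (1, 1024, 4))
loss = model(tokens, return_loss = True)
loss.backward()
```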
