PhoneLM

About

Text to speech using phonemes as inputs and audio codec codes as outputs. Loosely based on MegaByte, VALL-E and Encodec.

Method

Use G2P to encode text.
Use encodec to encode and decode audio.
Custom LJSpeech dataloader to include phonemes and encodec audio codes

LJSpeech

Overfit model on one sample from LJSpeech
- Combine token space of text and audio codec codes
- LJ016-0073-synth.wav The initial "Mr. Cope" can just about be made out

Inspiration

This model is loosely based on the VALL-E paper by Microsoft. It uses the MegaByte inspired model from Lucidrains as the Transformer Decoder model. Just as in VALL-E, a users text prompt is converted into phonemes using G2P (Grapheme-to-phoneme), and then the encodec audio codec codes are predicted. However, unlike VALL-E, only an autoregressive model is used. The VALL-E paper uses an autoregressive model to accept phonemes and audio codec code snippets of a source audio and uses that to predict the first codebook codes. The rest of the codebook codes are then predicted when the AR model is finished, it accepts the entire sequence, and then predicts all of the codebook 2 to codebook N codes. However, this increases the complexity of the approach as two models are now required and raises the possibility that the NAR model can not attend to all past inputs unlike the AR which can reduce audio quality output and may lead to repeating of outputs. In practice, the use of phonemes as input into VALL-E may alleviate this, however, this approach explores just predicting the entire sequence auto-regressively (across all codebooks at once).

This is inspired by the fact that the authors of the original MegaByte paper perform autoregressive audio prediction on raw audio data. They treat the audio files as just raw byte sequences and train a model to predict audio on 2TB worth of audio and find that compared to a vanilla transformer or Perceiver architectures, it scores a higher bpb. In principle, this means that the model is more efficient and accurate at modelling raw audio byte sequences than other approaches. The other benefits of the method is that the patch based auto-regressive generation may be well suited to the codebooks used by encodec. As the patch size can be set to 4 (for 4 codebooks each of which can be 1 of 1024 values), this means the local model of the MegaByte model can focus on modelling individual audio codec elements and the global model can focus on the larger context. Hopefully this greatly improves audio quality compared to VALL-E while being much simpler to train.

whitefu / phonelm Goto Github PK

phonelm's Introduction

PhoneLM

About

Method

LJSpeech

Inspiration

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent