ShepardTTS

ShepardTTS is a free and open-source fine-tuned XTTS v2.0.3 model, trained on paired dialogue/audio samples from the Mass Effect 2 and Mass Effect 3 base games. It is a multilingual and multispeaker model, and can make all of our beloved characters come to life.

Pull requests, feature requests, and discussions are welcome!

If you are a researcher, and you want access to the public ShepardTTS deployment, contact me.

Usage notes

Most voices perform best when narrating medium-length sentences with medium-length words. They tend to produce garbage and artifacts when confronted with very short words and sentences, excessive punctuation, and abbreviations. Sentences that are too long tend to cause hallucinations. As a rule of thumb: provide text input that could reasonably have occurred in the games. The more out-of-domain and unnatural the text input, the lower the chance of a good narration.

This paragraph is a good example of appropriate text input.
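The rule of thumb above can be sketched as a simple pre-flight check. This is an illustrative heuristic only; the thresholds are guesses, not values used anywhere in ShepardTTS:

```python
import re

def looks_narratable(text: str) -> bool:
    """Rough check that text resembles in-domain game dialogue.

    Rejects very short or very long sentences, odd average word lengths,
    and excessive punctuation. Thresholds are illustrative assumptions.
    """
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return False
    avg_word_len = sum(len(w) for w in words) / len(words)
    punct_count = len(re.findall(r"[^\w\s]", text))
    if len(words) < 4 or len(words) > 40:      # very short/long sentences
        return False
    if avg_word_len < 3 or avg_word_len > 9:   # unusually short/long words
        return False
    if punct_count > len(words) // 2:          # excessive punctuation
        return False
    return True
```

Running the example sentence above through such a filter would pass, while a staccato "Hmm. No. Ok." would be rejected.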

Deployment

GitHub Actions automatically produces a fresh image on every push to the main branch. See docker-compose.example.yml for how it can be deployed.

History (and other experiments)

I initially fine-tuned SpeechT5, but the results were disappointing. That model very frequently produced garbage and/or hallucinated output for most voices. Interestingly, it also had a very strong bias towards female speakers.

Dataset

After dumping dialogue strings with the Legendary Explorer and dumping audio samples with Gibbed's ME2/ME3 extractor, you can use create_dataset.py to align and filter the two. This transforms the dialogue-audio pairs into a HuggingFace dataset, which it then exports into the ljspeech format.
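The export step can be sketched as follows. This is a hypothetical reconstruction of what create_dataset.py does, not its actual code; the field layout follows the LJSpeech convention (id|text|normalized text per line of metadata.csv), and the filtering of corrupted strings is an assumption:

```python
def format_ljspeech_metadata(pairs: dict[str, str]) -> list[str]:
    """Turn {audio_sample_id: dialogue_string} pairs into
    LJSpeech-style metadata.csv lines (id|text|text)."""
    lines = []
    for sample_id, text in sorted(pairs.items()):
        text = text.strip()
        if not text or "|" in text:  # skip empty or corrupted strings
            continue
        lines.append(f"{sample_id}|{text}|{text}")
    return lines
```

The resulting lines would then be written to a metadata.csv alongside a wavs/ directory containing the matching audio samples.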

You can then train the model and create character embeddings once training finishes.
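Conceptually, a character embedding can be obtained by mean-pooling one speaker embedding per audio sample. A minimal sketch under that assumption (the real pipeline presumably derives the per-sample vectors from the trained XTTS model's conditioning latents):

```python
def average_embedding(sample_embeddings: list[list[float]]) -> list[float]:
    """Mean-pool per-sample speaker embedding vectors into a single
    character-level embedding."""
    if not sample_embeddings:
        raise ValueError("need at least one sample embedding")
    dim = len(sample_embeddings[0])
    n = len(sample_embeddings)
    # Average each dimension across all samples for this character.
    return [sum(vec[i] for vec in sample_embeddings) / n for i in range(dim)]
```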

The audio samples and dialogue strings are extremely clean. The audio has a sample rate of 24000 Hz (downsampled to 22050 Hz for training). Some dialogue strings are corrupted (possibly an issue with the Legendary Explorer?).
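To illustrate the 24000 Hz to 22050 Hz conversion, here is a naive linear-interpolation resampler; the actual training pipeline most likely uses a proper band-limited resampler (e.g. from torchaudio or librosa) rather than anything like this:

```python
def resample(signal: list[float], src_rate: int, dst_rate: int) -> list[float]:
    """Naive linear-interpolation sample-rate conversion."""
    n_out = int(len(signal) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate  # fractional index into the source
        lo = int(pos)
        hi = min(lo + 1, len(signal) - 1)
        frac = pos - lo
        out.append(signal[lo] * (1 - frac) + signal[hi] * frac)
    return out
```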

Training

The model was trained for 12 epochs on an RTX 3060 with 12 GB of VRAM, which took about 14 hours. Judging from the eval loss, this is roughly the point where it starts overfitting. See train.py for the parameters used.

Future work

See the project board.

GPU inference with DeepSpeed is ~20x faster (minutes to seconds), but renting GPUs is very expensive. Do we have a generous sponsor in the audience, perhaps?

Ethical (and legal) statement

There are probably copyright issues with a generative model trained on game files. More importantly, I'm not sure how the voice actors feel about their voice being cloned. Do not use ShepardTTS for commercial or harmful purposes. This software is a labor of love built for the Mass Effect fan community.

Due to these legal and ethical issues, I will not distribute the game files nor the model checkpoint at this time. Dump and fine-tune yourself.

Risks

Voice cloning technology has been around for a couple of years, and hand-picked audio samples fed to commercial-grade voice models likely produce better audio than ShepardTTS. Furthermore, waveforms produced by this model are easily recognizable as synthetic just by visual inspection, as it always produces some characteristic artifacts.

Access to the public deployment is highly restricted, so there is no straightforward way to use the system in a way that hurts the interests of the original voice actors.

All things considered, this software should not produce additional harm beyond what already exists.

License

The model and its output: Coqui Public Model License (CPML)

The code: GNU General Public License v3.0

Acknowledgements

Open issues

Remove radio samples

Some audio samples have a radio-like filter over them (usually denoted by radio_ in the filename). As we don't want the model to learn from noise, these should probably be removed.

Add some (basic) unit tests

The code is currently not really testable. But it would be nice to have some basic sanity checking and CI setup.

Generate voice embeddings per dialogue/scene

Initial experiments have demonstrated that low-resource voices actually do quite well. There is thus plenty of room to split up, for example, the 4000+ samples for Broshep and Femshep. This would give users more flexibility in picking specific intonation characteristics, as well as higher quality.

It does come with two complications: first, the list of voices in Gradio becomes very long, which is not user-friendly. Second, it may cause choice fatigue, in the sense that users won't know which voice to pick.

A potential solution is to expand the Examples section with suggested/recommended embeddings per character.
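Splitting a character's samples into per-scene groups could be as simple as grouping on a filename prefix. This is a hypothetical sketch; the actual naming scheme of the dumped audio files may differ:

```python
from collections import defaultdict

def group_by_scene(filenames: list[str]) -> dict[str, list[str]]:
    """Group sample filenames like 'scene042_line007.wav' by the
    (assumed) scene prefix before the first underscore."""
    groups: dict[str, list[str]] = defaultdict(list)
    for name in filenames:
        scene = name.split("_", 1)[0]
        groups[scene].append(name)
    return dict(groups)
```

Each resulting group could then get its own embedding, with the most representative ones surfaced in the Examples section.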

Add multilingual audio and text support

XTTS v2.0.3 supports English (en), Spanish (es), French (fr), German (de), Italian (it), Portuguese (pt), Polish (pl), Turkish (tr), Russian (ru), Dutch (nl), Czech (cs), Arabic (ar), Chinese (zh-cn), Japanese (ja), Hungarian (hu), Korean (ko), and Hindi (hi).

The Mass Effect games have dialogue and audio for English, French, German, Italian, Japanese, Spanish, Russian, and Polish. These are voiced by different actors. All of these languages are already supported by XTTS.

It should be reasonably easy to expand the training pipeline and modify the demo so that users can select a language. This would result in an exploding number of available voices, though, and training time would increase by roughly 8x.

Replace user/pass authentication with token

It would be nice if the examples could be viewed and listened to even when not logged in. I'm thinking of doing away with the authentication and replacing it with a token input that must be filled in when running inference.

Extract Legendary Edition dialogue and audio

The LE contains approximately 4x as much dialogue as the base games, presumably because of the DLC. Training on more data is better.

The Legendary Explorer already supports extracting the dialogue strings as xlsx. The only missing functionality is automatic bulk export of all .pcc/.afc files with proper filenames.

Also, it looks like the LE audio samples have a sampling rate of 44100 Hz. Nice.

Get into C# and submit a pull request to the Legendary Explorer to batch export audio.
