
whisperspeech's People

Contributors

bbc-esq, jpc, makaveli10, mengting7tw, signalprime, v3ucn, zoq

whisperspeech's Issues

The sequence length when training semantic token to acoustic token

Hi, I want to ask about the sequence length you use when training your S2A model. SPEAR TTS predicts the tokens of 3 codebooks directly, and we found that if the target audio is longer than 10 s the total sequence length becomes very long, which makes it hard for the transformer to model.
Furthermore, have you considered using a better codec model? We have trained some codec models on TTS data, and I believe they could improve the performance of this project. We trained a codec at 16 kHz, which may be easier to train on. We have released part of the code at https://github.com/yangdongchao/AcademiCodec
If you are interested in this, please let me know; I am willing to provide a better codec model. I want to finish this amazing project together!
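
For context, a back-of-the-envelope sketch of why flattening several codebooks blows up the sequence length, assuming EnCodec's 75 Hz frame rate (the exact numbers depend on the codec actually used):

# rough sequence-length estimate; frame rate and codebook count are assumptions
frame_rate_hz = 75      # EnCodec at 24 kHz emits 75 frames per second
n_codebooks = 3         # SPEAR TTS predicts 3 codebooks directly
seconds = 10

frames = frame_rate_hz * seconds          # 750 frames
flattened_tokens = frames * n_codebooks   # 2250 tokens once flattened
print(frames, flattened_tokens)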

Measuring the acoustic -> semantic -> text modeling difficulty

The semantic tokens are supposed to sit somewhere in between the acoustic tokens and text. Preferably they retain the phonetic content and lose all prosody/speaker information. Will Whisper embeddings perform this task successfully?

In the SPEAR TTS paper the semantic to acoustic task is relatively easy (they used a decoder-only model with 12 layers, about the size of Whisper Base) while the text to semantic task is very hard (T5-Large – 24 layer encoder + 24 layer decoder, the exact same size as Whisper Medium). Judging from our initial tests this will most likely not be the case for our Whisper-based semantic tokens. (#4 (comment))

To quantify this in some way we will try to train the Whisper decoders from scratch (using frozen encoder embeddings) on subsets of LibriLight (this is equivalent to the backtranslation task in the paper). For this task we know what great performance looks like (the pre-trained Whisper model) so we will be able to focus on dataset sizes and training hyperparameters.
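
A minimal sketch of what that frozen-encoder setup could look like, assuming the openai-whisper package (the real training code lives in the repo's notebooks):

import torch
import whisper

# Sketch: keep the pre-trained encoder frozen and re-initialize the decoder so
# it has to be learned from scratch on the frozen encoder embeddings.
model = whisper.load_model("base")

for p in model.encoder.parameters():
    p.requires_grad_(False)            # frozen encoder embeddings

def reinit(m):
    if isinstance(m, (torch.nn.Linear, torch.nn.Embedding, torch.nn.LayerNorm)):
        m.reset_parameters()           # decoder weights start from scratch

model.decoder.apply(reinit)
optimizer = torch.optim.AdamW(model.decoder.parameters(), lr=1e-4)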

6. Gather more multi-lingual data

Right now we are using a subset of Libri-Light, which is a very big (60k hours) dataset of audiobooks read by thousands of speakers. It is pretty good, but there is a lot of (probably more expressive and emotional) speech available in YouTube videos. For the final training run it would be great to have more varied data to improve the quality of the model.

Fine-tuning WhisperSpeech on a custom speech dataset

Hello,
Thank you for sharing this project.
I followed the steps in Whisper Encoder/Decoder training and was able to train models. The TensorBoard curves look good, as do the checkpoints produced. But I do not see steps for using the models for inference. Can someone please share these steps?
Thanks,
Emmanuel
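
A possible starting point, based on the Pipeline usage shown in other issues here; whether the *_ref arguments accept local checkpoint paths (and the exact argument names for custom models) is an assumption worth confirming with the maintainers:

from whisperspeech.pipeline import Pipeline

# Hypothetical local checkpoint paths; the pretrained references look like
# 'collabora/whisperspeech:s2a-q4-base-en+pl.model' elsewhere in this tracker.
pipe = Pipeline(
    t2s_ref='path/to/your/t2s-checkpoint.model',
    s2a_ref='path/to/your/s2a-checkpoint.model',
)
pipe.generate_to_file('output.wav', "Hello, this is a test of my fine-tuned model.", lang='en')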

Update model metadata and information on the Hugging Face Hub.

Hi @jpc and team,

Congratulations on such a brilliant model, and thanks for going the extra mile to make sure it is easily accessible. I'm VB; I lead the advocacy efforts for audio at Hugging Face.
Adding a model card to the checkpoint on the Hub would be great, and adding some metadata would allow users to search for the model. An example model card is suno/bark.

Note: the model card's content could be similar to the GitHub README.

We've seen this as a great way to increase the model's visibility, too.

In addition to that, it'd be great to create a demo for it on Hugging Face Spaces. We'd be happy to support you with GPU grants (for inference on the Space) to help further democratise this model's use.

I'm just a ping away if you need any help.

Cheers!
VB

License

Hi,
Thanks for making this great project! I noticed that on PyPI the license is listed as Apache while on GitHub it's listed as MIT.
Do you know which one should be used, or is the project dual-licensed (both Apache and MIT)?
Thank you!

About text to semantic tokens

Hi, thanks for your repo. I want to ask how you measure the effectiveness of the T2S model. To what value does the loss need to converge for the model to be usable?

Thanks

Extracting semantic tokens

I've experimented with extracting semantic tokens from Whisper and there are two challenges:

  1. Whisper does not use fixed windows during speech-to-text decoding. Instead it uses a sliding-window approach, where each new window starts at the end of the last complete utterance. This probably avoids edge effects and improves quality a lot, but it complicates the encoding because you need to run the full decoding process. (A fixed-window extraction sketch follows after this list.)
  2. Whisper does not have any bottlenecks in the model, which means there is almost no incentive for the network to drop information. As shown by @sidhantls in his blog post on speaker identification in Lex Fridman podcasts, the last encoder layer retains enough information to perform binary speaker identification (Lex vs. all other people) at 80% accuracy.
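
For reference, a minimal sketch (assuming the openai-whisper package) of extracting encoder embeddings over a fixed 30-second window, i.e. without the sliding-window decoding described in point 1:

import whisper

# Extract Whisper encoder embeddings over a fixed 30 s window.
model = whisper.load_model("base")
audio = whisper.pad_or_trim(whisper.load_audio("sample.wav"))   # pad/crop to exactly 30 s
mel = whisper.log_mel_spectrogram(audio).to(model.device)
embeddings = model.encoder(mel.unsqueeze(0))    # shape: (1, 1500, n_audio_state)
print(embeddings.shape)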

Minimize and stabilize the inference dependencies

It would be great to minimize the dependencies that need to be installed for the inference demo.

We need a lot more things for training, but the inference code should be pretty self-contained (we should be able to drop WhisperX, Whisper, and even the SpeechBrain model by default). This would speed up the Colab demo a lot.

It would be good to also freeze the versions so we avoid surprise warnings like these: #35

Response time - how fast is it?

Is it normal that on an RTX 3080 Ti it takes anywhere from 2 seconds for a two-word sentence up to 30 seconds for a 40-word sentence to get a response?

I used this mockup to test:

import torch
import torchaudio
import io
import time
import sounddevice as sd
from whisperspeech.pipeline import Pipeline

pipe = Pipeline(s2a_ref='collabora/whisperspeech:s2a-q4-base-en+pl.model')
output = io.BytesIO()
print(torch.cuda.is_available())

while True:
    input1 = input("Powiedz: ")
    start_time = time.perf_counter()
    output.seek(0)
    pipe.generate_to_file(output, input1, lang='pl')
    output.seek(0)
    stop_time = time.perf_counter()
    print(f"Processing took {stop_time - start_time:0.4f} seconds")
    waveform, sample_rate = torchaudio.load(output)
    audio_numpy = waveform.numpy()
    sd.play(audio_numpy[0], 24000)
    sd.wait()

Powiedz: Napisz coś ciekawszego.
Processing took 4.5448 seconds

Powiedz: Sprwadzamy czas odpowiedzi. Czy czas będzie krótki czy długi? Jak szybko zareaguje? Bardzo fajny model o sporych możliwościach.
Processing took 19.0766 seconds

To run this mockup I had to change line 36 in a2wav.py
from

        torchaudio.save(fname, audio.cpu(), 24000)

to

        torchaudio.save(fname, audio.cpu(), 24000, format="wav")

Maybe there is a simpler way but this is what I could do.

Optimize Semantic token model training

We have a decent working TTS pipeline now, so it makes sense to step back, scale up, and optimize for faster and better results.

Generate Whisper embeddings faster:

  • Use faster-whisper to generate Whisper embeddings at least 3x faster.
  • Add batching to further speed up the process.

Optimize training:

  • Test the warmup steps to deal with inefficient training at the start.
  • Add an option to resume training from a checkpoint.
  • Add memory-efficient scaled dot product attention (see the sketch after this list).
  • Multi-GPU training (something that can be done later, e.g. when using larger embeddings such as whisper-medium, which takes a huge amount of time to train even with LibriLight-small).
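
A minimal sketch of the memory-efficient attention item, using PyTorch's built-in scaled_dot_product_attention (PyTorch >= 2.0); the tensor shapes below are illustrative only, not the model's actual dimensions:

import torch
import torch.nn.functional as F

# Illustrative shapes: (batch, heads, sequence, head_dim).
device = 'cuda' if torch.cuda.is_available() else 'cpu'
dtype = torch.float16 if device == 'cuda' else torch.float32
q = torch.randn(1, 8, 1500, 64, device=device, dtype=dtype)
k = torch.randn(1, 8, 1500, 64, device=device, dtype=dtype)
v = torch.randn(1, 8, 1500, 64, device=device, dtype=dtype)

# Prefer the flash / memory-efficient fused kernels on CUDA.
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_mem_efficient=True):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)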

Colab Demo appears to be broken - libcuda.so not found

Full error:

[2024-01-18 16:33:36,920] [0/0] torch._dynamo.variables.torch: [WARNING] Profiler function <class 'torch.autograd.profiler.record_function'> will be ignored

---------------------------------------------------------------------------

BackendCompilerFailed                     Traceback (most recent call last)

[<ipython-input-7-8f3d1d1ad737>](https://localhost:8080/#) in <cell line: 3>()
      1 # this is very slow right now since our inference code is not very optimized
      2 # but even without this crucial optimization it is still better than real-time on an RTX 4090
----> 3 pipe.generate_to_notebook("""
      4 This is the first demo of Whisper Speech, a fully open source text-to-speech model trained by Collabora and Lion on the Juwels supercomputer.
      5 """)

52 frames

[/usr/lib/python3.10/concurrent/futures/_base.py](https://localhost:8080/#) in __get_result(self)
    401         if self._exception:
    402             try:
--> 403                 raise self._exception
    404             finally:
    405                 # Break a reference cycle with the exception in self._exception

BackendCompilerFailed: backend='inductor' raised:
AssertionError: libcuda.so cannot found!


Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information


You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True

Hoping this is fixed soon. WhisperSpeech is sounding quite promising so far, and I'm eager to test it further, preferably through a typically reliable and less locally resource-intensive method like Colab. Also, partly because I already have many other AI tools installed and have a tendency to accidentally break things when adding another, haha.
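
As a stop-gap, taken directly from the traceback's own suggestion, the compile error can be suppressed so inference falls back to eager mode before calling the pipeline:

# Fall back to eager execution, as suggested at the end of the traceback above.
import torch._dynamo
torch._dynamo.config.suppress_errors = True

pipe.generate_to_notebook("""
This is the first demo of Whisper Speech, a fully open source text-to-speech model trained by Collabora and Lion on the Juwels supercomputer.
""")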

Colab notebook inference error

Hello,
I'm using the demo Colab notebook (with a T4 GPU) from the main page, and I get this error when running this cell:

# this is very slow right now since our inference code is not very optimized
# but even without this crucial optimization it is still better than real-time on an RTX 4090
pipe.generate_to_notebook("""
This is the first demo of Whisper Speech, a fully open source text-to-speech model trained by Collabora and Lion on the Juwels supercomputer.
""")

The error I get:

0.00% [0/749 00:00<?]
[2024-01-18 10:49:50,490] [0/1] torch._dynamo.variables.torch: [WARNING] Profiler function <class 'torch.autograd.profiler.record_function'> will be ignored
---------------------------------------------------------------------------
BackendCompilerFailed                     Traceback (most recent call last)
[<ipython-input-10-8f3d1d1ad737>](https://localhost:8080/#) in <cell line: 3>()
      1 # this is very slow right now since our inference code is not very optimized
      2 # but even without this crucial optimization it is still better than real-time on an RTX 4090
----> 3 pipe.generate_to_notebook("""
      4 This is the first demo of Whisper Speech, a fully open source text-to-speech model trained by Collabora and Lion on the Juwels supercomputer.
      5 """)

52 frames
[/usr/lib/python3.10/concurrent/futures/_base.py](https://localhost:8080/#) in __get_result(self)
    401         if self._exception:
    402             try:
--> 403                 raise self._exception
    404             finally:
    405                 # Break a reference cycle with the exception in self._exception

BackendCompilerFailed: backend='inductor' raised:
AssertionError: libcuda.so cannot found!


Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information


You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True

Long-Form Generation

Hi,
Do you know if it's possible to smoothly generate longer audio with WhisperSpeech? And dialogue with multiple characters?
Thanks!
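
Not an official feature, but one possible workaround (a sketch building on the Pipeline API shown in other issues here; the exact generate() signature and its 24 kHz waveform output are assumptions) is to split the text into sentences, generate each chunk, and concatenate the audio. A different speaker reference could be passed per chunk for multi-character dialogue:

import torch
from whisperspeech.pipeline import Pipeline

pipe = Pipeline(s2a_ref='collabora/whisperspeech:s2a-q4-base-en+pl.model')

text = "First sentence of a longer script. Second sentence. And so on."
sentences = [s.strip() + '.' for s in text.split('.') if s.strip()]

# Generate each sentence separately and stitch the 24 kHz waveforms together.
chunks = [pipe.generate(s, lang='en').cpu() for s in sentences]
audio = torch.cat(chunks, dim=-1)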

New Feature Request: Enable Streaming

Request
At present there is no provision for streaming audio chunks. If this could be implemented, the response time would be even faster than it is now.

Extracting acoustic tokens

We have a notebook that shows how to extract acoustic tokens.

We are using the 1.5 kbps codec model for now, despite the fact that the speech quality is terrible. Google's generation examples have much better quality; one explanation is that they trained a special-purpose VQ codec on the speech-only Libri-Light dataset, and it provides better quality at lower bitrates than general-purpose audio codecs trained on speech, music, and other sounds.

Sound quality is something we can fix with more training later, after we prove the whole pipeline works, so for now we should cut everything else and focus on making training as easy as possible.
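
For anyone who wants the gist without opening the notebook, a minimal sketch of acoustic-token extraction with the encodec package at the 1.5 kbps setting discussed above (the notebook remains the canonical reference):

import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(1.5)          # 1.5 kbps = 2 codebooks at 75 Hz

wav, sr = torchaudio.load("sample.wav")
wav = convert_audio(wav, sr, model.sample_rate, model.channels).unsqueeze(0)

with torch.no_grad():
    frames = model.encode(wav)
# Each frame is a (codes, scale) pair; codes has shape (batch, n_codebooks, n_frames).
codes = torch.cat([code for code, scale in frames], dim=-1)
print(codes.shape)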

Output audio "format not recognized"

I'm making a demo for WhisperSpeech and ran into an error; see here for the discussion and to make a PR:

https://huggingface.co/spaces/Tonic/laion-whisper/discussions/1

This is the error:

Traceback (most recent call last):
  File "/home/user/.local/lib/python3.10/site-packages/spaces/zero/wrappers.py", line 164, in thread_wrapper
    res = future.result()
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/user/app/app.py", line 54, in whisper_speech_demo
    sf.write(tmp_file_name, audio_np, 24000)
  File "/home/user/.local/lib/python3.10/site-packages/soundfile.py", line 343, in write
    with SoundFile(file, 'w', samplerate, channels,
  File "/home/user/.local/lib/python3.10/site-packages/soundfile.py", line 658, in __init__
    self._file = self._open(file, mode_int, closefd)
  File "/home/user/.local/lib/python3.10/site-packages/soundfile.py", line 1216, in _open
    raise LibsndfileError(err, prefix="Error opening {0!r}: ".format(self.name))
soundfile.LibsndfileError: Error opening '/tmp/tmp69sgx7yk.wav': Format not recognised.
Traceback (most recent call last):
  File "/home/user/.local/lib/python3.10/site-packages/gradio/queueing.py", line 495, in call_prediction
    output = await route_utils.call_process_api(
  File "/home/user/.local/lib/python3.10/site-packages/gradio/route_utils.py", line 232, in call_process_api
    output = await app.get_blocks().process_api(
  File "/home/user/.local/lib/python3.10/site-packages/gradio/blocks.py", line 1561, in process_api
    result = await self.call_function(
  File "/home/user/.local/lib/python3.10/site-packages/gradio/blocks.py", line 1179, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/home/user/.local/lib/python3.10/site-packages/anyio/to_thread.py", line 56, in run_sync
    return await get_async_backend().run_sync_in_worker_thread(
  File "/home/user/.local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 2134, in run_sync_in_worker_thread
    return await future
  File "/home/user/.local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 851, in run
    result = context.run(func, *args)
  File "/home/user/.local/lib/python3.10/site-packages/gradio/utils.py", line 678, in wrapper
    response = f(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/spaces/zero/wrappers.py", line 118, in gradio_handler
    raise res.value
soundfile.LibsndfileError: Error opening '/tmp/tmp69sgx7yk.wav': Format not recognised.

Here is the code:

https://huggingface.co/spaces/Tonic/laion-whisper/blob/main/app.py

import spaces
import tempfile
import gradio as gr
import os
from whisperspeech.pipeline import Pipeline
import torch
import soundfile as sf
import numpy as np
import torch.nn.functional as F
from whisperspeech.languages import LANGUAGES
from whisperspeech.pipeline import Pipeline
from whisperspeech.utils import resampler

title = """# 🙋🏻‍♂️ Welcome to🌟Tonic's🌬️💬📝WhisperSpeech

You can use this ZeroGPU Space to test out the current model [🌬️💬📝collabora/whisperspeech](https://huggingface.co/collabora/whisperspeech). 🌬️💬📝collabora/whisperspeech is An Open Source text-to-speech system built by inverting Whisper. Previously known as spear-tts-pytorch. It's like Stable Diffusion but for speech – both powerful and easily customizable.
You can also use 🌬️💬📝WhisperSpeech by cloning this space. 🧬🔬🔍 Simply click here: <a style="display:inline-block" href="https://huggingface.co/spaces/Tonic/laion-whisper?duplicate=true"><img src="https://img.shields.io/badge/-Duplicate%20Space-blue?labelColor=white&style=flat&logo=data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAAAAXNSR0IArs4c6QAAAP5JREFUOE+lk7FqAkEURY+ltunEgFXS2sZGIbXfEPdLlnxJyDdYB62sbbUKpLbVNhyYFzbrrA74YJlh9r079973psed0cvUD4A+4HoCjsA85X0Dfn/RBLBgBDxnQPfAEJgBY+A9gALA4tcbamSzS4xq4FOQAJgCDwV2CPKV8tZAJcAjMMkUe1vX+U+SMhfAJEHasQIWmXNN3abzDwHUrgcRGmYcgKe0bxrblHEB4E/pndMazNpSZGcsZdBlYJcEL9Afo75molJyM2FxmPgmgPqlWNLGfwZGG6UiyEvLzHYDmoPkDDiNm9JR9uboiONcBXrpY1qmgs21x1QwyZcpvxt9NS09PlsPAAAAAElFTkSuQmCC&logoWidth=14" alt="Duplicate Space"></a></h3> 
Join us : 🌟TeamTonic🌟 is always making cool demos! Join our active builder's🛠️community 👻  [![Join us on Discord](https://img.shields.io/discord/1109943800132010065?label=Discord&logo=discord&style=flat-square)](https://discord.gg/GWpVpekp) On 🤗Huggingface: [TeamTonic](https://huggingface.co/TeamTonic) & [MultiTransformer](https://huggingface.co/MultiTransformer) On 🌐Github: [Polytonic](https://github.com/tonic-ai) & contribute to 🌟 [Poly](https://github.com/tonic-ai/poly) 🤗Big thanks to Yuvi Sharma and all the folks at huggingface for the community grant 🤗
"""

@spaces.GPU
def whisper_speech_demo(text, lang, speaker_audio, mix_lang, mix_text):
    pipe = Pipeline()
    speaker_url = None

    if speaker_audio is not None:
        speaker_url = speaker_audio

    if mix_lang and mix_text:
        mixed_langs = lang.split(',') + mix_lang.split(',')
        mixed_texts = [text] + mix_text.split(',')
        stoks = pipe.t2s.generate(mixed_texts, lang=mixed_langs)
        audio_data = pipe.generate(stoks, speaker_url, lang=mixed_langs[0])
    else:
        audio_data = pipe.generate(text, speaker_url, lang)

    resample_audio = resampler(newsr=24000)
    audio_data_resampled = next(resample_audio([{'sample_rate': 22050, 'samples': audio_data.cpu()}]))['samples_24k']

    # Normalize and write to a WAV file
    with tempfile.NamedTemporaryFile(delete=False, suffix='.wav') as tmp_file:
        tmp_file_name = tmp_file.name
        audio_np = audio_data_resampled.numpy()  # Convert to numpy array
    
        # Normalize if necessary
        if audio_np.max() > 1.0 or audio_np.min() < -1.0:
            audio_np = audio_np / np.max(np.abs(audio_np))
    
        # Ensure the audio data is 2D (num_samples, num_channels)
        if audio_np.ndim == 1:
            audio_np = np.expand_dims(audio_np, axis=1)
    
        # Write the file
        sf.write(tmp_file_name, audio_np, 24000)
    
    return tmp_file_name

with gr.Blocks() as demo:
    gr.Markdown(title)

    with gr.Tabs():
        with gr.TabItem("🌬️💬📝Standard TTS"):
            with gr.Row():
                text_input_standard = gr.Textbox(label="Enter text")
                lang_input_standard = gr.Dropdown(choices=list(LANGUAGES.keys()), label="Language")
                speaker_input_standard = gr.Audio(label="Upload or Record Speaker Audio (optional)", sources=["upload", "microphone"], type="filepath")
                placeholder_mix_lang = gr.Textbox(visible=False)  # Placeholder, hidden
                placeholder_mix_text = gr.Textbox(visible=False)  # Placeholder, hidden
                generate_button_standard = gr.Button("Generate Speech")
            output_audio_standard = gr.Audio(label="🌬️💬📝WhisperSpeech")
    
            generate_button_standard.click(
                whisper_speech_demo,
                inputs=[text_input_standard, lang_input_standard, speaker_input_standard, placeholder_mix_lang, placeholder_mix_text],
                outputs=output_audio_standard
            )
    
        with gr.TabItem("🌬️💬📝Mixed Language TTS"):
            with gr.Row():
                placeholder_text_input = gr.Textbox(visible=False)  # Placeholder, hidden
                placeholder_lang_input = gr.Dropdown(choices=[], visible=False)  # Placeholder, hidden
                placeholder_speaker_input = gr.Audio(visible=False)  
                mix_lang_input_mixed = gr.CheckboxGroup(choices=list(LANGUAGES.keys()), label="Select Languages")
                mix_text_input_mixed = gr.Textbox(label="Enter mixed language text", placeholder="e.g., Hello, Cześć")
                generate_button_mixed = gr.Button("Generate Mixed Speech")
            output_audio_mixed = gr.Audio(label="Mixed🌬️💬📝WhisperSpeech")
    
            generate_button_mixed.click(
                whisper_speech_demo,
                inputs=[placeholder_text_input, placeholder_lang_input, placeholder_speaker_input, mix_lang_input_mixed, mix_text_input_mixed],
                outputs=output_audio_mixed
            )

demo.launch()

Would love some direction on resolving this one :-)

Question about semantic tokens extraction

Hello, I read the scripts about semantic token extraction (2C. Whisper semantic embedding extraction.ipynb and 2F. Residual (RQ) semantic token extraction model.ipynb), and I have a few questions I would like to discuss with you.

  1. Is there a mistake in the image below? I cannot find the encode method in vqmodel.
    [image]

  2. Do you measure the speaker information in the Whisper encoder's embeddings after you add the RVQ? Maybe the semantic tokens still contain a lot of speaker information.

Emotion markers

It would be amazing if emotion markers could be supported (or, if they already are, if there were documentation on how to use them), for example providing indicators like <angry>, <excited>, etc., or emojis used for the same purpose.

[Question] Any demo?

Hi, thanks for your great work. Could you share demo audio generated with this code?

Number of dimensions of repeat dims can not be smaller than number of dimensions of tensor

Hi, I got the following error when running the example Colab:

RuntimeError                              Traceback (most recent call last)

[<ipython-input-7-8f3d1d1ad737>](https://localhost:8080/#) in <cell line: 3>()
      1 # this is very slow right now since our inference code is not very optimized
      2 # but even without this crucial optimization it is still better than real-time on an RTX 4090
----> 3 pipe.generate_to_notebook("""
      4 This is the first demo of Whisper Speech, a fully open source text-to-speech model trained by Collabora and Lion on the Juwels supercomputer.
      5 """)

4 frames

[/usr/local/lib/python3.10/dist-packages/whisperspeech/s2a_delar_mup_wds_mlang.py](https://localhost:8080/#) in <listcomp>(.0)
    514         if show_progress_bar: it = progress_bar(it)
    515         with record_function("encode"):
--> 516             stoks, speakers = [x.repeat(bs, 1) for x in (stoks, speakers)]
    517             xenc, xenc_positions, _ = self.run_encoder(stoks, speakers)
    518             toks_positions = torch.arange(N, device=dev)

RuntimeError: Number of dimensions of repeat dims can not be smaller than number of dimensions of tensor

something wrong in the S2A model

Hi,
I carefully read your S2A model code and found that you model 2 codebooks, but when you map the codebook tokens into vectors with nn.Embedding you use the same nn.Embedding for both the first and the second codebook's tokens, which assumes the two codebooks share the same data distribution. I suggest using a different nn.Embedding for each codebook (see the sketch below).
Maybe my suggestion is wrong; I am willing to discuss it with you.
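
A small illustration of the suggestion (not the actual WhisperSpeech code); the codebook size and embedding width are placeholder values:

import torch
from torch import nn

n_codebooks, codebook_size, dim = 2, 1024, 512

# Shared table (the current assumption) vs. one table per codebook.
shared = nn.Embedding(codebook_size, dim)
per_codebook = nn.ModuleList(nn.Embedding(codebook_size, dim) for _ in range(n_codebooks))

tokens = torch.randint(0, codebook_size, (1, n_codebooks, 100))   # (batch, codebook, time)
x_shared = shared(tokens).sum(dim=1)
x_separate = torch.stack(
    [per_codebook[q](tokens[:, q]) for q in range(n_codebooks)], dim=1
).sum(dim=1)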

Output length for file generation

Hello
Is the script designed to only produce 30 seconds of output?
My output files are cut to 30 seconds; is there any way to change this limit?

Thanks

fine-tuning.

Can you write more about training, e.g. what the dataset should look like? I see that you are from Poland; do you plan to add more Polish voices? The current model struggles with accents and style.

Semantic -> acoustic modeling

We got #3 working so now it's time to try to convert from Whisper-based semantic tokens (#3) to EnCodec-based acoustic tokens (#2).

We found out that better semantic tokens (from Whisper medium) make this task a lot easier and even tiny models sound great. Multilingual semantic token training helps and cross-language voice cloning works great.

There are a couple of hypotheses to test:

  • Can we train a forward model or does it have to be autoregressive to get anywhere? (no, but see SoundStorm)
  • To start simple, could we get away with single-speaker training only? This would allow us to ignore the prompting for now and just let the model memorize the speaker. (seems to work on 1000 hrs of one speaker)
  • How much data is needed to get bad performance (low-quality but intelligible speech)? (1000 hours seems enough; it takes about a day on an A100 to train)
  • And finally, last but not least: do the Whisper encoder embeddings retain enough phonetic information to do this at all? (from initial tests in #5 they seem to be closer to speech than to text)

We also still have a couple of engineering challenges:

  • fix the issue where the model starts generating noise after exactly 10 s (this may be related to cross-attention and the 3x length difference between the encoder and decoder contexts)
  • investigate sigmaReparam from Apple (supposed to make training more stable)
  • use the optimized scaled dot product attention kernels from the newest PyTorch (should speed up training a lot)
  • add prompting and multiple-speaker support (we currently condition on SpeechBrain speaker embeddings)
  • switch to Adafactor (should use less memory than Adam, so we can train on smaller GPUs; a sketch follows after this list)
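
A minimal sketch of the Adafactor switch, assuming the Hugging Face transformers implementation; the tiny module below just stands in for the S2A model:

import torch
from transformers.optimization import Adafactor

model = torch.nn.Linear(512, 512)   # placeholder for the S2A model

optimizer = Adafactor(
    model.parameters(),
    scale_parameter=True,
    relative_step=True,
    warmup_init=True,
    lr=None,   # in this mode Adafactor derives its own learning-rate schedule
)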

7. Train the final models

Once all the bugs are ironed out (#4), we have a text-to-semantic model (#9), we have improved the speech codec (#10), and we have more high-quality data (#11), we will train the final models, which should match (or even exceed) the quality Google showed on their SPEAR TTS demo page.

e2e inference results

Thanks for sharing the model. However, when I try to run inference with the trained model, the generated audio is all white noise. I just followed your inference scripts with nothing changed. It seems strange.

ROPE is necessary in VQ Stoks, but position is not provided

x = self.ln_post(self.out_blocks(x))

I'm getting an error here, but upon digging in I don't know what the solution would be, nor why this doesn't happen for every inference.

Essentially, if ROPE is true (and it is by default for VQ stoks), then positional information is expected to be passed here; otherwise you get an error (None can't multiply a float):

x = rope_rotate(x, x_positions * subsampling, *self.rotary(x))

But it isn't, as you can see from the initial call to 'out_blocks'. I see that the 'upgrade' method in the tunables for VQ stoks turns off ROPE, but I don't see this method called anywhere in the repo. Should ROPE be off for VQ stoks?

All progress updates

Progress updates (from newest):

2023-12-10

Another trio of models, this time they support multiple languages (English and Polish). Here are two new samples for a sneak peek. You can check out our Colab to try it yourself!

English speech, female voice (transferred from a Polish language dataset):

whisperspeech-sample.mp4

A Polish sample, male voice:

whisperspeech-sample-pl.mp4

2023-07-14

We have trained a new pair of models, added support for multiple speakers and integrated the Vocos vocoder to deliver a big overall quality boost. And this is not even our last word because we are doing hyperparameter tuning to train bigger, higher-quality models.

An end to end generation example, inspired by one famous president's speech (don't forget to unmute the videos):

Female voice:

we-choose-tts.mp4

Male voice:

we-choose-tts-s467.mp4

We have streamlined the inference pipeline and you can now test the model yourself on Google Colab: Open In Colab

2023-04-13

We have trained a preliminary T->S model and a new 3kbps S->A model which improves the speech quality. Both models are far from perfect yet but we are clearly moving in the right direction (to the moon 🚀🌖!).

End-to-end TTS model with ≈ 6% WER (both T->S and S->A sampled with simple multinomial sampling at T = 0.7, no beam search) see #9 for more details:

(don't forget to unmute the video)

test-e2e-jfk-T0.7.mp4

Ground truth:

we-choose.mp4

2023-04-03

We have trained a working S->A model. It does not sound amazing but that is mostly because of EnCodec quality at 1.5kbps.

Validation set ground truth (don't forget to unmute):

ground-truth.mov

The generated output from the S->A model (multinomial sampling, temperature 0.8):

saar-1300hr-2l-20e-T0.8.mov

Request for Data Usage Permission

Hi, excellent work! I am interested in using the data from this Hugging Face repository for academic research purposes. However, I noticed that there is no license information provided. Could you please clarify whether the data in this repository is available for reuse, and if so, under what terms and conditions?

Thank you for your assistance in clarifying the usage rights for this data.

Sincerely,

5. Improve the EnCodec speech quality

Right now the EnCodec speech quality at 1.5 kbps is pretty terrible (far from what Google shows for their SoundStream-based codec). I am pretty sure the problem is caused by EnCodec being a universal sound codec, because the official samples for SoundStream at 1.5 kbps sound quite similar (Lyra-v2 sounds even worse than that). That's why I suspect SPEAR TTS is based on an unreleased speech-only codec.

Since EnCodec has multi-rate capability, the overall model knows how to represent high-quality speech. The pretty good results we had compressing the Whisper embeddings suggest we might get away with retraining just the quantization layer to reprioritize the bandwidth allocation and improve speech quality (at the cost of ignoring music and other audio); a sketch of that setup follows below.
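
A very rough sketch of that idea, assuming the encodec package; this is an outline of the approach rather than a tested recipe:

from encodec import EncodecModel

# Freeze the pre-trained encoder and decoder; only the quantization layer would
# be retrained on speech-only data.
model = EncodecModel.encodec_model_24khz()
for module in (model.encoder, model.decoder):
    for p in module.parameters():
        p.requires_grad_(False)

# Note: EnCodec's residual VQ codebooks are updated via EMA plus a commitment
# loss rather than plain gradient descent, so the retraining loop would reuse
# EnCodec's own VQ training procedure on speech batches (e.g. Libri-Light).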

Model Share

Hello,

Thank you for this great work. With the latest commits I've seen that you didn't push the checkpoints. I want to try the semantic token extractor. Is there any way to access the latest models?

Kind regards

ImportError when installing locally

Hi,
I'm trying to install the package locally. I cloned the git repository and installed it with pip; however, when running it I get the following error:

ImportError: cannot import name 'languages' from 'whisperspeech'

Improve inference of short sentences

Right now the S2A model struggles with short semantic token sequences generated by the T2S model. For longer sequences the quality also deteriorates towards the end.

This is most likely because the semantic tokens are padded to 1500 tokens but the S2A dataset we train on always has full 30 second (1500 token) fragments.

We can fix it by using the voice activity detection data we extracted for the T2S training to train on sequences of varying length.

4. Text -> semantic tokens modeling

This will be a model that converts text tokens into Whisper-encoder-derived semantic tokens. With that we will have a complete TTS pipeline.

To train it we can re-use Whisper as our back-translation model (the SPEAR TTS authors had to train one from scratch). We can use the existing distillation setup as a starting point, but we will have to make sure we get all the text tokens, since Whisper has a tendency to cut decodings short and ask you (via timestamp tokens) to rerun it with more data.

This was a pretty difficult task in the original SPEAR TTS implementation (they had to use a 24-layer model).
