collabora / whisperspeech
An Open Source text-to-speech system built by inverting Whisper.
Home Page: https://collabora.github.io/WhisperSpeech/
License: MIT License
Hi, I want to ask about the sequence length you use when training your S2A model. SpearTTS directly predicts the tokens of 3 codebooks, and we found that if the target audio is longer than 10 s, the total sequence length becomes very long and a transformer struggles to model such long sequences.
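To make the sequence-length concern concrete, here is a rough back-of-the-envelope estimate, assuming an EnCodec-style codec at 75 frames per second with 3 codebooks flattened into a single token stream (illustrative numbers, not the exact WhisperSpeech setup):

frames_per_second = 75   # assumed EnCodec-style frame rate
num_codebooks = 3        # SpearTTS-style flattened codebooks
for seconds in (10, 20, 30):
    flat_tokens = seconds * frames_per_second * num_codebooks
    print(f"{seconds:2d}s of audio -> {flat_tokens} flattened acoustic tokens")
# 10 s already gives 2250 tokens and 30 s gives 6750, which is painful
# for a plain transformer decoder.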
Furthermore, have you considered using a better codec model? We have trained some codec models on TTS data, and I believe they could improve the performance of this project. We have also trained a codec on 16 kHz audio, which may be easier to train. We have released part of the code at https://github.com/yangdongchao/AcademiCodec
If you are interested in this, please let me know; I am willing to help you build a better codec model. I want to finish this amazing project together!
The semantic tokens are supposed to sit somewhere in between the acoustic tokens and text. Preferably they retain phonetic information and lose all prosody/speaker information. Will Whisper embeddings perform this task successfully?
In the SPEAR TTS paper the semantic to acoustic task is relatively easy (they used a decoder-only model with 12 layers, about the size of Whisper Base) while the text to semantic task is very hard (T5-Large – 24 layer encoder + 24 layer decoder, the exact same size as Whisper Medium). Judging from our initial tests this will most likely not be the case for our Whisper-based semantic tokens. (#4 (comment))
To quantify this in some way we will try to train the Whisper decoders from scratch (using frozen encoder embeddings) on subsets of LibriLight (this is equivalent to the backtranslation task in the paper). For this task we know what great performance looks like (the pre-trained Whisper model) so we will be able to focus on dataset sizes and training hyperparameters.
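A schematic sketch of that experiment, assuming the openai-whisper package: freeze the audio encoder, re-initialize the decoder, and train it on (mel, text-token) pairs. The data loader is a hypothetical placeholder, and the real setup would also need Whisper's tokenizer and special tokens; only the freeze/re-init structure is the point here.

import torch
import whisper

model = whisper.load_model("base")

# Freeze the audio encoder so only its (frozen) embeddings are reused.
for p in model.encoder.parameters():
    p.requires_grad = False

# Re-initialize the decoder weights so it trains from scratch.
model.decoder.apply(lambda m: m.reset_parameters() if hasattr(m, "reset_parameters") else None)

opt = torch.optim.AdamW(model.decoder.parameters(), lr=1e-4)
for mel, tokens in train_loader:            # hypothetical (mel, text-token) batches
    with torch.no_grad():
        audio_features = model.encoder(mel)
    logits = model.decoder(tokens[:, :-1], audio_features)
    loss = torch.nn.functional.cross_entropy(logits.transpose(1, 2), tokens[:, 1:])
    loss.backward()
    opt.step()
    opt.zero_grad()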
Right now we are using (a subset of) Libri-Light, which is a very big (60k hours) dataset of audiobooks read by thousands of speakers. It is pretty good, but there is a lot of (probably more expressive and emotional) speech available in YouTube videos. For the final training run it would be great to have more varied data to improve the quality of the model.
Hello,
Thank you for sharing this project.
I followed the steps in Whisper Encoder/Decoder training
and was able to train models. The TensorBoard logs look good, as do the checkpoints produced, but I do not see steps for using the models for inference. Can someone please share these steps?
Thanks,
Emmanuel
Hi @jpc and team,
Congratulations on such a brilliant model, and thanks for going the extra mile to make sure it is easily accessible. I'm VB, and I lead the advocacy efforts for audio at Hugging Face.
Adding a model card to the checkpoint on the Hub would be great, and adding some metadata allows users to search for the model. An example model card would be suno/bark.
Note: the model card's content could be similar to the GitHub README.
We've seen this as a great way to increase the model's visibility, too.
In addition to that, it'd be great to create a demo for it on the Hugging Face spaces. We'd be happy to support you with GPU grants (for inference on the space) to help further democratise this model's use.
I'm just a ping away if you need any help.
Cheers!
VB
Hi,
Thanks for making this great project! I noticed that on PyPI the license is listed as Apache while on GitHub it's listed as MIT.
Do you know which one should be used, or is the project dual-licensed (both Apache and MIT)?
Thank you!
Hi, thanks for your repo. I want to ask how to measure the effectiveness of the t2s model, or to what extent the loss needs to converge for the model to be usable.
Thanks
I've experimented with extracting semantic tokens from Whisper and there are two challenges:
It would be great to minimize the dependencies that need to be installed for the inference demo.
We need a lot more things for training but the inference code should be pretty self-contained (we should be able to drop WhisperX, Whisper, even the SpeechBrain model by default). This would speed up the Colab demo a lot.
It would be good to also freeze the versions so we avoid surprise warnings like these: #35
Is it normal that on an RTX 3080 Ti it takes anywhere from 2 seconds to get a response for a two-word sentence up to 30 seconds for a 40-word sentence?
I used this mockup to test:
import torch
import torchaudio
from whisperspeech.pipeline import Pipeline
import io
import sounddevice as sd
import time

pipe = Pipeline(s2a_ref='collabora/whisperspeech:s2a-q4-base-en+pl.model')
output = io.BytesIO()
print(torch.cuda.is_available())

while True:
    input1 = input("Powiedz: ")
    start_time = time.perf_counter()
    output.seek(0)
    pipe.generate_to_file(output, input1, lang='pl')
    output.seek(0)
    stop_time = time.perf_counter()
    print(f"Processing took {stop_time - start_time:0.4f} seconds")
    waveform, sample_rate = torchaudio.load(output)
    audio_numpy = waveform.numpy()
    sd.play(audio_numpy[0], 24000)
    sd.wait()
Powiedz: Napisz coś ciekawszego.
Processing took 4.5448 seconds
Powiedz: Sprwadzamy czas odpowiedzi. Czy czas będzie krótki czy długi? Jak szybko zareaguje? Bardzo fajny model o sporych możliwościach.
Processing took 19.0766 seconds
To run this mockup I had to change line 36 in a2wav.py
from
torchaudio.save(fname, audio.cpu(), 24000)
to
torchaudio.save(fname, audio.cpu(), 24000, format="wav")
Maybe there is a simpler way but this is what I could do.
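One possible simplification (untested) would be to skip the BytesIO/torchaudio round-trip entirely and play the tensor returned by pipe.generate() directly; that method is used in the Gradio demo later in this thread, though its exact signature may differ between versions:

import sounddevice as sd
from whisperspeech.pipeline import Pipeline

pipe = Pipeline(s2a_ref='collabora/whisperspeech:s2a-q4-base-en+pl.model')
# pipe.generate() returns the audio tensor directly (signature may vary by version)
audio = pipe.generate("Napisz coś ciekawszego.", lang='pl')
sd.play(audio.cpu().numpy().squeeze(), 24000)  # WhisperSpeech outputs 24 kHz audio
sd.wait()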
Hi
Do you know if CPU and MPS support is on the roadmap?
Thanks!
We have a decent working TTS pipeline now, so it makes sense to step back and scale and optimize it for faster and better results.
Generate Whisper embeddings faster:
Optimize training: (whisper-medium takes a huge amount of time to train even with LibriLight-small.)
Full error:
[2024-01-18 16:33:36,920] [0/0] torch._dynamo.variables.torch: [WARNING] Profiler function <class 'torch.autograd.profiler.record_function'> will be ignored
---------------------------------------------------------------------------
BackendCompilerFailed Traceback (most recent call last)
[<ipython-input-7-8f3d1d1ad737>](https://localhost:8080/#) in <cell line: 3>()
1 # this is very slow right now since our inference code is not very optimized
2 # but even without this crucial optimization it is still better than real-time on an RTX 4090
----> 3 pipe.generate_to_notebook("""
4 This is the first demo of Whisper Speech, a fully open source text-to-speech model trained by Collabora and Lion on the Juwels supercomputer.
5 """)
52 frames
[/usr/lib/python3.10/concurrent/futures/_base.py](https://localhost:8080/#) in __get_result(self)
401 if self._exception:
402 try:
--> 403 raise self._exception
404 finally:
405 # Break a reference cycle with the exception in self._exception
BackendCompilerFailed: backend='inductor' raised:
AssertionError: libcuda.so cannot found!
Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
You can suppress this exception and fall back to eager by setting:
import torch._dynamo
torch._dynamo.config.suppress_errors = True
Hoping this is fixed soon. WhisperSpeech is sounding quite promising so far, and I'm eager to test it further, preferably through a typically reliable and less locally resource-intensive method like Colab. Also, partly because I have many other AI tools installed already and have a tendency to accidentally break things when adding another, haha.
Hi,
Do you know if it's possible to control the pitch of each letter, similar to in Coqui Studio or xVA-Synth?
Thanks!
Hello,
I'm using the demo colab notebook (with T4 GPU) from the Main page and I get this error when running this cell:
# this is very slow right now since our inference code is not very optimized
# but even without this crucial optimization it is still better than real-time on an RTX 4090
pipe.generate_to_notebook("""
This is the first demo of Whisper Speech, a fully open source text-to-speech model trained by Collabora and Lion on the Juwels supercomputer.
""")
The error I get:
0.00% [0/749 00:00<?]
[2024-01-18 10:49:50,490] [0/1] torch._dynamo.variables.torch: [WARNING] Profiler function <class 'torch.autograd.profiler.record_function'> will be ignored
---------------------------------------------------------------------------
BackendCompilerFailed Traceback (most recent call last)
[<ipython-input-10-8f3d1d1ad737>](https://localhost:8080/#) in <cell line: 3>()
1 # this is very slow right now since our inference code is not very optimized
2 # but even without this crucial optimization it is still better than real-time on an RTX 4090
----> 3 pipe.generate_to_notebook("""
4 This is the first demo of Whisper Speech, a fully open source text-to-speech model trained by Collabora and Lion on the Juwels supercomputer.
5 """)
52 frames
[/usr/lib/python3.10/concurrent/futures/_base.py](https://localhost:8080/#) in __get_result(self)
401 if self._exception:
402 try:
--> 403 raise self._exception
404 finally:
405 # Break a reference cycle with the exception in self._exception
BackendCompilerFailed: backend='inductor' raised:
AssertionError: libcuda.so cannot found!
Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
You can suppress this exception and fall back to eager by setting:
import torch._dynamo
torch._dynamo.config.suppress_errors = True
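A possible workaround, taken directly from the suggestion in the traceback above: fall back to eager execution when the inductor backend cannot find libcuda.so. This only sidesteps the compile error; it does not make inference any faster.

import torch._dynamo
torch._dynamo.config.suppress_errors = True  # fall back to eager on compile errors

pipe.generate_to_notebook("""
This is the first demo of Whisper Speech, a fully open source text-to-speech model trained by Collabora and Lion on the Juwels supercomputer.
""")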
Hi,
Do you know if it's possible to smoothly generate longer audio with WhisperSpeech? And dialogue with multiple characters?
Thanks!
Request
At present there is no provision to enable streaming of audio chunks. If this can be implemented, the response time would be even faster than what it is now.
We have a notebook that shows how to extract acoustic tokens.
We are using the 1.5 kbps codec model for now despite the fact that the speech quality is terrible. Google's generation examples have much better quality; one explanation is that they trained a special-purpose VQ codec on the speech-only Libri-Light dataset, and it provides better quality at lower bitrates than general-purpose audio codecs trained on speech, music and other sounds.
Sound quality is something we can fix with more training later on, after we prove the whole pipeline works, so for now we should cut everything non-essential and focus on making training as easy as possible.
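For reference, a minimal illustration of pulling 1.5 kbps acoustic tokens with the public encodec package (a generic sketch, not the notebook's actual code; the file path is a placeholder):

import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(1.5)                        # 1.5 kbps -> 2 codebooks per frame

wav, sr = torchaudio.load("sample.wav")                # placeholder path
wav = convert_audio(wav, sr, model.sample_rate, model.channels)

with torch.no_grad():
    encoded_frames = model.encode(wav.unsqueeze(0))    # list of (codes, scale) tuples
codes = torch.cat([codes for codes, _ in encoded_frames], dim=-1)
print(codes.shape)                                     # (batch, n_codebooks, n_frames)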
I'm making a demo for WhisperSpeech and ran into an error; see the discussion here and feel free to make a PR:
https://huggingface.co/spaces/Tonic/laion-whisper/discussions/1
This is the error:
Traceback (most recent call last):
File "/home/user/.local/lib/python3.10/site-packages/spaces/zero/wrappers.py", line 164, in thread_wrapper
res = future.result()
File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 451, in result
return self.__get_result()
File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
raise self._exception
File "/usr/local/lib/python3.10/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/home/user/app/app.py", line 54, in whisper_speech_demo
sf.write(tmp_file_name, audio_np, 24000)
File "/home/user/.local/lib/python3.10/site-packages/soundfile.py", line 343, in write
with SoundFile(file, 'w', samplerate, channels,
File "/home/user/.local/lib/python3.10/site-packages/soundfile.py", line 658, in __init__
self._file = self._open(file, mode_int, closefd)
File "/home/user/.local/lib/python3.10/site-packages/soundfile.py", line 1216, in _open
raise LibsndfileError(err, prefix="Error opening {0!r}: ".format(self.name))
soundfile.LibsndfileError: Error opening '/tmp/tmp69sgx7yk.wav': Format not recognised.
Traceback (most recent call last):
File "/home/user/.local/lib/python3.10/site-packages/gradio/queueing.py", line 495, in call_prediction
output = await route_utils.call_process_api(
File "/home/user/.local/lib/python3.10/site-packages/gradio/route_utils.py", line 232, in call_process_api
output = await app.get_blocks().process_api(
File "/home/user/.local/lib/python3.10/site-packages/gradio/blocks.py", line 1561, in process_api
result = await self.call_function(
File "/home/user/.local/lib/python3.10/site-packages/gradio/blocks.py", line 1179, in call_function
prediction = await anyio.to_thread.run_sync(
File "/home/user/.local/lib/python3.10/site-packages/anyio/to_thread.py", line 56, in run_sync
return await get_async_backend().run_sync_in_worker_thread(
File "/home/user/.local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 2134, in run_sync_in_worker_thread
return await future
File "/home/user/.local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 851, in run
result = context.run(func, *args)
File "/home/user/.local/lib/python3.10/site-packages/gradio/utils.py", line 678, in wrapper
response = f(*args, **kwargs)
File "/home/user/.local/lib/python3.10/site-packages/spaces/zero/wrappers.py", line 118, in gradio_handler
raise res.value
soundfile.LibsndfileError: Error opening '/tmp/tmp69sgx7yk.wav': Format not recognised.
Here is the code:
https://huggingface.co/spaces/Tonic/laion-whisper/blob/main/app.py
import spaces
import tempfile
import gradio as gr
import os
from whisperspeech.pipeline import Pipeline
import torch
import soundfile as sf
import numpy as np
import torch.nn.functional as F
from whisperspeech.languages import LANGUAGES
from whisperspeech.pipeline import Pipeline
from whisperspeech.utils import resampler
title = """# 🙋🏻♂️ Welcome to🌟Tonic's🌬️💬📝WhisperSpeech
You can use this ZeroGPU Space to test out the current model [🌬️💬📝collabora/whisperspeech](https://huggingface.co/collabora/whisperspeech). 🌬️💬📝collabora/whisperspeech is An Open Source text-to-speech system built by inverting Whisper. Previously known as spear-tts-pytorch. It's like Stable Diffusion but for speech – both powerful and easily customizable.
You can also use 🌬️💬📝WhisperSpeech by cloning this space. 🧬🔬🔍 Simply click here: <a style="display:inline-block" href="https://huggingface.co/spaces/Tonic/laion-whisper?duplicate=true"><img src="https://img.shields.io/badge/-Duplicate%20Space-blue?labelColor=white&style=flat&logo=data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAAAAXNSR0IArs4c6QAAAP5JREFUOE+lk7FqAkEURY+ltunEgFXS2sZGIbXfEPdLlnxJyDdYB62sbbUKpLbVNhyYFzbrrA74YJlh9r079973psed0cvUD4A+4HoCjsA85X0Dfn/RBLBgBDxnQPfAEJgBY+A9gALA4tcbamSzS4xq4FOQAJgCDwV2CPKV8tZAJcAjMMkUe1vX+U+SMhfAJEHasQIWmXNN3abzDwHUrgcRGmYcgKe0bxrblHEB4E/pndMazNpSZGcsZdBlYJcEL9Afo75molJyM2FxmPgmgPqlWNLGfwZGG6UiyEvLzHYDmoPkDDiNm9JR9uboiONcBXrpY1qmgs21x1QwyZcpvxt9NS09PlsPAAAAAElFTkSuQmCC&logoWidth=14" alt="Duplicate Space"></a></h3>
Join us : 🌟TeamTonic🌟 is always making cool demos! Join our active builder's🛠️community 👻 [![Join us on Discord](https://img.shields.io/discord/1109943800132010065?label=Discord&logo=discord&style=flat-square)](https://discord.gg/GWpVpekp) On 🤗Huggingface: [TeamTonic](https://huggingface.co/TeamTonic) & [MultiTransformer](https://huggingface.co/MultiTransformer) On 🌐Github: [Polytonic](https://github.com/tonic-ai) & contribute to 🌟 [Poly](https://github.com/tonic-ai/poly) 🤗Big thanks to Yuvi Sharma and all the folks at huggingface for the community grant 🤗
"""
@spaces.GPU
def whisper_speech_demo(text, lang, speaker_audio, mix_lang, mix_text):
    pipe = Pipeline()
    speaker_url = None
    if speaker_audio is not None:
        speaker_url = speaker_audio
    if mix_lang and mix_text:
        mixed_langs = lang.split(',') + mix_lang.split(',')
        mixed_texts = [text] + mix_text.split(',')
        stoks = pipe.t2s.generate(mixed_texts, lang=mixed_langs)
        audio_data = pipe.generate(stoks, speaker_url, lang=mixed_langs[0])
    else:
        audio_data = pipe.generate(text, speaker_url, lang)
    resample_audio = resampler(newsr=24000)
    audio_data_resampled = next(resample_audio([{'sample_rate': 22050, 'samples': audio_data.cpu()}]))['samples_24k']
    # Normalize and write to a WAV file
    with tempfile.NamedTemporaryFile(delete=False, suffix='.wav') as tmp_file:
        tmp_file_name = tmp_file.name
        audio_np = audio_data_resampled.numpy()  # Convert to numpy array
        # Normalize if necessary
        if audio_np.max() > 1.0 or audio_np.min() < -1.0:
            audio_np = audio_np / np.max(np.abs(audio_np))
        # Ensure the audio data is 2D (num_samples, num_channels)
        if audio_np.ndim == 1:
            audio_np = np.expand_dims(audio_np, axis=1)
        # Write the file
        sf.write(tmp_file_name, audio_np, 24000)
    return tmp_file_name
with gr.Blocks() as demo:
    gr.Markdown(title)
    with gr.Tabs():
        with gr.TabItem("🌬️💬📝Standard TTS"):
            with gr.Row():
                text_input_standard = gr.Textbox(label="Enter text")
                lang_input_standard = gr.Dropdown(choices=list(LANGUAGES.keys()), label="Language")
                speaker_input_standard = gr.Audio(label="Upload or Record Speaker Audio (optional)", sources=["upload", "microphone"], type="filepath")
                placeholder_mix_lang = gr.Textbox(visible=False)  # Placeholder, hidden
                placeholder_mix_text = gr.Textbox(visible=False)  # Placeholder, hidden
            generate_button_standard = gr.Button("Generate Speech")
            output_audio_standard = gr.Audio(label="🌬️💬📝WhisperSpeech")
            generate_button_standard.click(
                whisper_speech_demo,
                inputs=[text_input_standard, lang_input_standard, speaker_input_standard, placeholder_mix_lang, placeholder_mix_text],
                outputs=output_audio_standard
            )
        with gr.TabItem("🌬️💬📝Mixed Language TTS"):
            with gr.Row():
                placeholder_text_input = gr.Textbox(visible=False)  # Placeholder, hidden
                placeholder_lang_input = gr.Dropdown(choices=[], visible=False)  # Placeholder, hidden
                placeholder_speaker_input = gr.Audio(visible=False)
                mix_lang_input_mixed = gr.CheckboxGroup(choices=list(LANGUAGES.keys()), label="Select Languages")
                mix_text_input_mixed = gr.Textbox(label="Enter mixed language text", placeholder="e.g., Hello, Cześć")
            generate_button_mixed = gr.Button("Generate Mixed Speech")
            output_audio_mixed = gr.Audio(label="Mixed🌬️💬📝WhisperSpeech")
            generate_button_mixed.click(
                whisper_speech_demo,
                inputs=[placeholder_text_input, placeholder_lang_input, placeholder_speaker_input, mix_lang_input_mixed, mix_text_input_mixed],
                outputs=output_audio_mixed
            )

demo.launch()
would love some direction to resolve the returns on this one :-)
Hello, I read the scripts about semantic token extraction (2C. Whisper semantic embedding extraction.ipynb and 2F. Residual (RQ) semantic token extraction model.ipynb), and I have a few questions I would like to discuss with you.
It would be amazing if emotion markers could be supported (or, if they already are, some documentation on how to use them), for example indicators like <angry>, <excited>, etc., or emojis used for the same purpose.
Whisper encoder does not seem to carry pitch information (question marks in Whisper output seem to be added based on grammar, not intonation). This should get us better repeatability, higher S2A quality with a smaller model and enable additional conditioning on pitch (instead of just on speaker embeddings).
Hi, thanks for your great work, could you share the demo audio from this code?
Hi, I got the following error when running the example colab:
RuntimeError Traceback (most recent call last)
[<ipython-input-7-8f3d1d1ad737>](https://localhost:8080/#) in <cell line: 3>()
1 # this is very slow right now since our inference code is not very optimized
2 # but even without this crucial optimization it is still better than real-time on an RTX 4090
----> 3 pipe.generate_to_notebook("""
4 This is the first demo of Whisper Speech, a fully open source text-to-speech model trained by Collabora and Lion on the Juwels supercomputer.
5 """)
4 frames
[/usr/local/lib/python3.10/dist-packages/whisperspeech/s2a_delar_mup_wds_mlang.py](https://localhost:8080/#) in <listcomp>(.0)
514 if show_progress_bar: it = progress_bar(it)
515 with record_function("encode"):
--> 516 stoks, speakers = [x.repeat(bs, 1) for x in (stoks, speakers)]
517 xenc, xenc_positions, _ = self.run_encoder(stoks, speakers)
518 toks_positions = torch.arange(N, device=dev)
RuntimeError: Number of dimensions of repeat dims can not be smaller than number of dimensions of tensor
Hi,
I carefully read your S2A model code and found that you model 2 codebooks, but when you convert the codebook tokens into vectors using nn.Embedding, you use the same nn.Embedding for both the first and the second codebook's tokens, which assumes that the tokens of the two codebooks follow the same distribution. I suggest using a different nn.Embedding for each codebook's tokens.
Maybe my suggestion is wrong; I am willing to discuss it with you.
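A minimal sketch of the suggestion above, one nn.Embedding per acoustic codebook with the results summed; this is illustrative and not the repository's actual S2A code:

import torch
import torch.nn as nn

class PerCodebookEmbedding(nn.Module):
    def __init__(self, n_codebooks=2, codebook_size=1024, dim=512):
        super().__init__()
        self.embs = nn.ModuleList(nn.Embedding(codebook_size, dim) for _ in range(n_codebooks))

    def forward(self, toks):                 # toks: (batch, n_codebooks, seq_len)
        # Each codebook gets its own embedding table; the results are summed.
        return sum(emb(toks[:, q]) for q, emb in enumerate(self.embs))

emb = PerCodebookEmbedding()
toks = torch.randint(0, 1024, (1, 2, 750))
print(emb(toks).shape)                       # torch.Size([1, 750, 512])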
torch.compile is confused when we change the sampling temperature and top_k parameters, and recompiles the model. To avoid this I think it should be enough to wrap them in cuda tensors before passing them into generate_next.
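A small illustration of that idea: if the sampling parameters arrive as tensors, torch.compile does not specialize the graph on their Python values and should not recompile when they change. generate_next here is a stand-in for the model's real step function, not the repository's code:

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

@torch.compile
def generate_next(logits, temperature, top_k):
    # temperature and top_k are 0-dim tensors, so changing their values
    # at runtime does not trigger a recompile.
    logits = logits / temperature
    sorted_logits, sorted_idx = torch.sort(logits, descending=True)
    ranks = torch.arange(logits.shape[-1], device=logits.device)
    sorted_logits = sorted_logits.masked_fill(ranks >= top_k, float("-inf"))
    choice = torch.multinomial(sorted_logits.softmax(-1), 1)
    return sorted_idx.gather(-1, choice)

logits = torch.randn(1, 512, device=device)
tok = generate_next(logits, torch.tensor(0.7, device=device), torch.tensor(50, device=device))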
Hello
Is the script designed to only produce 30 seconds of converted audio?
My output files are cut off at 30 seconds; is there any way to change this limit?
Thanks
Can you write more about training, what the dataset should look like, etc.? I see that you are from Poland; do you plan to add more Polish voices? The current model struggles with accents and style.
We got #3 working so now it's time to try to convert from Whisper-based semantic tokens (#3) to EnCodec-based acoustic tokens (#2).
We found out that better semantic tokens (from Whisper medium) make this task a lot easier and even tiny models sound great. Multilingual semantic token training helps, and cross-language voice cloning works great.
There are a couple of hypotheses to test:
We also still have a couple of engineering challenges:
Once all the bugs are ironed out (#4), we have a text to semantic model (#9), we improve the speech codec (#10) and we have more high-quality data (#11) we will train final models that should match (or even exceed) the quality Google showed in their SPEAR TTS demo page.
Thanks for sharing the model. However, when I try to run inference with the trained model, the generated audio is all white noise. I just followed your inference scripts with nothing changed. It seems strange.
WhisperSpeech/whisperspeech/vq_stoks.py
Line 340 in b6fc87c
Getting an error here, but upon digging in I don't know what the solution would be, and I don't know why this doesn't happen for every inference.
Essentially, if RoPE is true (and it is by default for VQ stoks), then positional information is expected to be passed here, otherwise you get an error (None can't be multiplied by a float):
WhisperSpeech/whisperspeech/modules.py
Line 108 in b6fc87c
But it isn't, as you can see from the initial call to 'out_blocks'. I see that the 'upgrade' method in the tunables for VQ stoks turns off RoPE, but I don't see this method called anywhere in the repo. Should RoPE be off for VQ stoks?
WhisperSpeech/whisperspeech/vq_stoks.py
Line 88 in b6fc87c
Shouldn't this mask the actual padding area, which includes a random part of both the left and right part of the audio?
Progress updates (from newest):
Another trio of models, this time they support multiple languages (English and Polish). Here are two new samples for a sneak peek. You can check out our Colab to try it yourself!
English speech, female voice (transferred from a Polish language dataset):
A Polish sample, male voice:
We have trained a new pair of models, added support for multiple speakers and integrated the Vocos vocoder to deliver a big overall quality boost. And this is not even our last word because we are doing hyperparameter tuning to train bigger, higher-quality models.
An end to end generation example, inspired by one famous president's speech (don't forget to unmute the videos):
Female voice:
Male voice:
We have streamlined the inference pipeline and you can now test the model yourself on Google Colab:
We have trained a preliminary T->S model and a new 3kbps S->A model which improves the speech quality. Both models are far from perfect yet but we are clearly moving in the right direction (to the moon 🚀🌖!).
End-to-end TTS model with ≈ 6% WER (both T->S and S->A sampled with simple multinomial sampling at T = 0.7, no beam search) see #9 for more details:
(don't forget to unmute the video)
Ground truth:
We have trained a working S->A model. It does not sound amazing but that is mostly because of EnCodec quality at 1.5kbps.
Validation set ground truth (don't forget to unmute):
The generated output from the S->A model (multinomial sampling, temperature 0.8):
Hi, is there a roadmap for this project? I wonder if it will be available soon.
Hi, excellent work! I am interested in using the data from this repository on Hugging Face for academic research purposes. However, I noticed that there is no license information provided. Could you please clarify whether the data in this repository is available for reuse and, if so, under what terms and conditions?
Thank you for your assistance in clarifying the usage rights for this data.
Sincerely,
Thanks for your great work. It seems you use the speaker ID rather than a reference wav to separate the speakers. I wonder, will this repo support zero-shot voice cloning?
This could also allow us to:
There are a lot of Jupyter notebooks here, and it's not very clear to me what the steps to train a model are...
Right now the EnCodec speech quality at 1.5 kbps is pretty terrible (far from what Google shows for their SoundStream-based codec). I am pretty sure the problem is caused by EnCodec being a universal sound codec, because the official samples for SoundStream at 1.5 kbps sound quite similar (Lyra-v2 sounds even worse than that). That's why I suspect SPEAR TTS is based on an unreleased speech-only codec.
Since EnCodec has multi-rate capability, the overall model already knows how to represent high-quality speech. The pretty good results we had compressing the Whisper embeddings suggest we might get away with retraining just the quantization layer to reprioritize the bandwidth allocation and improve speech quality (at the cost of ignoring music and other audio).
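To illustrate what "retraining just the quantization layer" could look like in the simplest possible terms, here is a generic sketch of a vector-quantization bottleneck with a straight-through estimator that could be trained on frozen encoder features; this is not EnCodec's actual quantizer or training code:

import torch
import torch.nn as nn
import torch.nn.functional as F

class VQBottleneck(nn.Module):
    """Toy VQ layer: nearest-neighbour lookup plus a commitment loss."""
    def __init__(self, codebook_size=1024, dim=128):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, x):                              # x: (batch, time, dim)
        codes = torch.cdist(x, self.codebook.weight[None]).argmin(-1)
        q = self.codebook(codes)
        commit = F.mse_loss(x, q.detach()) + F.mse_loss(x.detach(), q)
        q = x + (q - x).detach()                       # straight-through gradient
        return q, codes, commit

vq = VQBottleneck()
feats = torch.randn(4, 150, 128)                       # stand-in for frozen encoder output
q, codes, commit = vq(feats)
print(q.shape, codes.shape, commit.item())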
Hello,
Thank you for this great work. I've noticed that with the latest commits you didn't push the checkpoints. I want to try the semantic token extractor. Is there any way to access the latest models?
Kind regards
Hi,
I'm trying to install the package locally. I cloned the git repository and installed it w/ pip, however when running it I get the following error:
ImportError: cannot import name 'languages' from 'whisperspeech'
The output quality is good enough that it would be useful to allow more folks to test our model out on Huggingface.
Right now the S2A model struggles with short semantic token sequences generated by the T2S model. For longer sequences the quality also deteriorates towards the end.
This is most likely because the semantic tokens are padded to 1500 tokens but the S2A dataset we train on always has full 30 second (1500 token) fragments.
We can fix it by using the voice activity detection data we extracted for the T2S training to train on sequences of varying length.
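A rough sketch of that fix: crop each padded training example to the voiced length given by the VAD data instead of always feeding full 30-second windows. The token rates come from the numbers above (1500 semantic tokens per 30 s, EnCodec at 75 frames/s); names like vad_seconds are illustrative, not the real dataset code:

import torch

STOKS_PER_SECOND = 1500 // 30   # 50 semantic tokens per second
ATOKS_PER_SECOND = 75           # EnCodec frames per second at 24 kHz

def crop_to_speech(stoks, atoks, vad_seconds):
    """Trim padded semantic/acoustic token sequences to the voiced part."""
    return (stoks[..., :int(vad_seconds * STOKS_PER_SECOND)],
            atoks[..., :int(vad_seconds * ATOKS_PER_SECOND)])

stoks = torch.zeros(1500, dtype=torch.long)        # padded 30 s of semantic tokens
atoks = torch.zeros(2, 2250, dtype=torch.long)     # 2 codebooks, 30 s of acoustic tokens
short_stoks, short_atoks = crop_to_speech(stoks, atoks, vad_seconds=7.5)
print(short_stoks.shape, short_atoks.shape)        # torch.Size([375]) torch.Size([2, 562])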
This will be a model that converts text tokens into Whisper-encoder-derived semantic tokens. With that we will have a complete TTS pipeline.
To train it we can re-use Whisper as our back-translation model (the paper had to train one from scratch). We can use the existing distillation setup as a starting point, but we will have to make sure we get all the text tokens, since Whisper has a tendency to cut the decodings short and ask you (using timestamp tokens) to rerun it with more data.
This was a pretty difficult task in the original SPEAR TTS implementation (they had to use a 24-layer model).
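For the back-translation step, a rough sketch of what labeling unlabeled audio with off-the-shelf Whisper could look like; checking the last segment's end time against the audio duration is one simple way to notice decodings that were cut short (the file path and model size are placeholders):

import whisper

model = whisper.load_model("medium")
result = model.transcribe("librilight_sample.wav", language="en")  # placeholder path

text = result["text"]                      # text targets for T->S training pairs
segments = result["segments"]
last_end = segments[-1]["end"] if segments else 0.0
print(text)
print(f"transcription covers the audio up to {last_end:.1f}s")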
Will it support Mandarin?