fish-speech's Introduction

Fish Speech

English | 中文简体 | Portuguese (Brazil) | 日本語

This codebase and all models are released under CC-BY-NC-SA-4.0 License. Please refer to LICENSE for more details.

Disclaimer

We do not take any responsibility for illegal use of this codebase. Please refer to your local laws regarding the DMCA and other related regulations.

Online Demo

Fish Audio

Quick Start for Local Inference

inference.ipynb

Videos

V1.4 Demo Video: YouTube

Documents

Samples

Credits

Sponsor

fish-speech's People

Contributors

anyacoder, bfs18, blaisewf, cminus01, duliangang, dur-randir, eltociear, erquren, faceair, hscspring, initialencounter, jmoney7823956789378, kenwaytis, leng-yue, naozumi520, oedosoldier, potato-mika, ppmzhang2, pre-commit-ci[bot], sapphirelab, stardust-minus, therc4, thiagojramos, touch-night, tps-f, tsingshui, v3ucn, wblgers, whale-dolphin, yorelog

fish-speech's Issues

[QUESTION] VQGAN Low quality prediction for some audios

Recently I found that the VQGAN does not seem very stable in terms of prediction quality; it gives obviously low-quality predictions in many cases.

Here is one typical bad case for example:

sample-6-gt-and-pred.zip

Is this a weakness of the VQGAN model, or can it be worked around by using a higher sampling rate?

I have seen that in vqgan_pretrain.yaml we have sample_rate: 22050, while in vqgan_pretrain_v2.yaml we have sample_rate: 44100.
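A minimal resampling sketch, assuming torchaudio is available and using a hypothetical file name for the affected clip: the input audio should match the sample_rate of whichever config is used (22050 for vqgan_pretrain.yaml, 44100 for vqgan_pretrain_v2.yaml), since a mismatch alone can degrade reconstruction quality.

import torchaudio

# Hypothetical stand-in for the problematic clip from the zip above.
audio, sr = torchaudio.load("sample-6-gt.wav")
target_sr = 44100  # match the sample_rate of the config actually used
if sr != target_sr:
    audio = torchaudio.functional.resample(audio, orig_freq=sr, new_freq=target_sr)
    torchaudio.save("sample-6-gt-44k.wav", audio, target_sr)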

Inconsistent audio samples between steps in training

image

I am attempting to fine-tune the VQGAN. On the audio page, under sample-0/wavs/prediction, the sound played at step 0 is an inferred voice segment (which is correct). However, at step 2000, the sound played is a different voice clip from the training set, and the content of the two segments is different.

I'm unsure where the issue lies and how to troubleshoot it.

[QUESTION] generated audio always has low pitch noise blended especially for male voice

I used this model to generate a male voice (for example, Zhongli from Genshin), but there is always some low-pitch noise mixed into the generated speech. I am not sure whether this is because male voices are lower in pitch, since I did not observe this issue with female voice generation. When I use the VQGAN for prompt generation, the fake wav sounds okay and is less noisy, but when I use that prompt to generate speech from new text, the noise or electronic artifacts appear. Where is the problem coming from, the VQGAN or the Llama model? Is there any solution? I have tried fine-tuning the VQGAN with 50 hours of audio but did not observe an improvement. Any suggestion to reduce the noise and improve the audio quality would be greatly appreciated.

[Feature] Is training supported on Apple Silicon (MPS)?

I got a MacBook with Apple's own chip and would like to play with this locally, but it seems only CUDA is supported. It reports the following error:

return self._apply(lambda t: t.cuda(device))
RuntimeError: Invalid device, must be cuda device

RuntimeError: cutlassF: no kernel found to launch!

The following error occurred when executing the second part of the inference code. How can I resolve this?

Driver Version: 546.33

nvcc -V

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Nov__3_17:16:49_PDT_2023
Cuda compilation tools, release 12.3, V12.3.103
Build cuda_12.3.r12.3/compiler.33492891_0
Traceback (most recent call last):
  File "/home/meowlgm/fish-speech/tools/llama/generate.py", line 476, in <module>
    main()
  File "/home/meowlgm/anaconda3/envs/fish-speech/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/meowlgm/anaconda3/envs/fish-speech/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/meowlgm/anaconda3/envs/fish-speech/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/meowlgm/anaconda3/envs/fish-speech/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/meowlgm/fish-speech/tools/llama/generate.py", line 440, in main
    y = generate(
  File "/home/meowlgm/anaconda3/envs/fish-speech/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/meowlgm/fish-speech/tools/llama/generate.py", line 237, in generate
    next_token = prefill(
  File "/home/meowlgm/fish-speech/tools/llama/generate.py", line 129, in prefill
    logits = model.forward_generate(x, input_pos)
  File "/home/meowlgm/fish-speech/fish_speech/models/text2semantic/llama.py", line 209, in forward_generate
    return self.compute(x, freqs_cis, mask, input_pos=input_pos)
  File "/home/meowlgm/fish-speech/fish_speech/models/text2semantic/llama.py", line 151, in compute
    x = layer(x, freqs_cis, mask, input_pos=input_pos)
  File "/home/meowlgm/anaconda3/envs/fish-speech/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/meowlgm/anaconda3/envs/fish-speech/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/meowlgm/fish-speech/fish_speech/models/text2semantic/llama.py", line 223, in forward
    h = x + self.attention(self.attention_norm(x), freqs_cis, mask, input_pos)
  File "/home/meowlgm/anaconda3/envs/fish-speech/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/meowlgm/anaconda3/envs/fish-speech/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/meowlgm/fish-speech/fish_speech/models/text2semantic/llama.py", line 278, in forward
    y = F.scaled_dot_product_attention(q, k, v, attn_mask=mask, dropout_p=0.0)
RuntimeError: cutlassF: no kernel found to launch!
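A hedged workaround sketch (not the repository's fix): the error is raised from F.scaled_dot_product_attention when it selects a fused CUDA kernel that the installed build or GPU cannot launch. Disabling the fused backends before generation falls back to the plain math implementation, which may avoid the crash at some speed cost.

import torch

# Run once before the generation code (e.g. near the top of tools/llama/generate.py).
torch.backends.cuda.enable_flash_sdp(False)
torch.backends.cuda.enable_mem_efficient_sdp(False)
torch.backends.cuda.enable_math_sdp(True)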

[Feature] Other language support?

Is your feature request related to a problem? Please describe.
I have long been looking for a text-to-speech project that can be used for Cantonese, but most are outdated and the quality is not great. I tried to port vits2 and innnky/emotional-vits to Cantonese by copying the old Cantonese cleaner, but the results were quite poor.

Describe the solution you'd like
Use chinese-dialect-lexicons as text cleaner?

Describe alternatives you've considered
I've noticed that phonemizer is going to be dropped soon. I wonder if using --no-g2p can achieve this.

Additional context
N/A

2 Questions

Hello, amazing job done here!

But I had 2, or maybe 3, quick questions:

1. Regarding bertvits2, do you have a sample with plain vits2 and then the same sample after adding BERT? I would like to hear the actual difference.

2. Why do you use plain BERT? I have seen PnG-BERT, MP-BERT, and PL-BERT, but none of them are being used. Is it because the plain BERT you use is far better than those three?

3. I listened to the samples at https://speech.fish.audio/en/samples/ and they are really realistic and emotional, almost like ElevenLabs. How did you achieve the emotion? Is it only by integrating BERT? I thought BERT was only for prosody enhancement and didn't imagine it would increase the emotion that much!

[BUG] Generated audio is spliced together with the prompt content

python tools/llama/generate.py
--text "海鸥岛有哪些好玩的地方"
--prompt-text "打开晾衣架照明"
--prompt-tokens "fake.npy"
--checkpoint-path "results/text2semantic_400m_finetune_spk/checkpoints/step_000000400.ckpt"
--speaker "1"
--num-samples 2
--compile

I fine-tuned the VQGAN & Llama with my own corpus.

The audio corresponding to the prompt is: 打开晾衣架照明
The content of the generated audio is: 哎。。。打开晾衣架照明,海鸥岛有哪些好玩的地方

[Question] VQGAN fine-tuning dataset size and training time

I'd like to gather some insights about the fine-tuning process from those who have experienced it.

  • What is the recommended dataset size for fine-tuning?

  • What is the anticipated training time? I noticed there's a max_steps: 100000 configuration for the Lightning trainer, but based on my current training speed, this will take a significant amount of time (>7 days). I have attempted to run it on both a single-card A100 instance and an eight-card V100 instance; both resulted in a training speed of ~7500 epochs per 12 hours.

  • And one more question: Is the VQGAN model sensitive to the emotion or speaking style of the speaker? For example, a single speaker might express different emotions in different samples; will this influence the training results?

Optimize chinese text normalization

The current logic of text_normalize is not capable of correctly processing scenarios involving years.

import re
import cn2an

def text_normalize(text):
    # Replace each Arabic-numeral match with its Chinese reading via an2cn().
    numbers = re.findall(r"\d+(?:\.?\d+)?", text)
    for number in numbers:
        text = text.replace(number, cn2an.an2cn(number), 1)
    return text

If you plan to continue using cn2an, you should consider using the transform method.

Alternatively, consider using WeTextProcessing; from my perspective, it appears to be more professional.
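A hedged sketch of the cn2an transform approach suggested above; the example strings are illustrative, not from the project:

import cn2an

# transform() normalizes numbers in context and reads years digit by digit,
# unlike converting each regex match with an2cn().
print(cn2an.an2cn("2023"))                       # 二千零二十三 (wrong reading for a year)
print(cn2an.transform("2023年3月4日", "an2cn"))   # expected: 二零二三年三月四日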

[BUG] Confusion caused by 'prompt-text/tokens' in inference stage

Thank you so much for your open-source initiative. The transition from a VITS structure to a transformer-based structure is exciting.

Describe the bug
It could be a mistake in the description of the introduction/inference, or it could be something else.

Details
I noticed that in the introduction, the inference stage of the model was described as requiring only text input. But under inference, the user is asked to provide 'prompt-text/tokens'.

Perhaps you can consider adding a simple inference code that requires only --text or explain why prompt input is still required during the inference phase.

Thank you for your contributions.

Must training data be separated by speaker?

Hello,
I want to train a new language from scratch; the data was downloaded from the web and has no speaker labels. Do I need to run speaker clustering first and then train the model?

Thanks!

[BUG] Generated speech is unclear

Describe the bug

Is the unclear pronunciation caused by the "half" parameter breaking something?

fake.wav.zip

Another question is how should I fill in the "prompt-text"? I noticed in the code that "prompt-text" is eventually concatenated with "text", so is it okay if I don't write anything for it?

To Reproduce

Here is the command I used.

python tools/vqgan/inference.py \
    -i "胡子v1_0.wav" \
    --checkpoint-path "checkpoints/vqgan-v1.pth"

python tools/llama/generate.py \
    --text "要转换的文本" \
    --prompt-text "你的参考文本" \
    --prompt-tokens "fake.npy" \
    --checkpoint-path "checkpoints/text2semantic-400m-v0.2-4k.pth" \
    --compile \
    --half

python tools/vqgan/inference.py \
    -i "codes_0.npy" \
    --checkpoint-path "checkpoints/vqgan-v1.pth"

data

You said that the pretrained model was trained on 10k hours of Chinese, 700 hours of English, and 300 hours of Japanese.
Is this data public? Where can I find those 11k hours? What sample rate are they at, 44.1 or 48 kHz? Please link them if possible.

Thanks in Advance!

Question about ASR results.

Nice project! For time reasons, I did not read the code carefully, so I am confused about the ASR results.
The original ASR model here is OpenAI Whisper, and I found that its outputs use spaces instead of punctuation marks. However, Whisper does not perform very well on Mandarin, so I use FunASR instead, but FunASR's outputs include punctuation marks.

  1. Should I replace the punctuation with spaces? (One possible mapping is sketched below.)

  2. I found that the prompt text and text in your demo include punctuation marks, which makes me confused.
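A hedged preprocessing sketch (not the repository's code) showing one way to map FunASR's punctuated transcripts onto the space-separated style Whisper produces here:

import re

# Chinese and ASCII punctuation commonly emitted by FunASR; extend as needed.
_PUNCT = re.compile(r"""[，。！？、；：“”‘’（）,.!?;:'"()]+""")

def punct_to_space(text: str) -> str:
    return _PUNCT.sub(" ", text).strip()

print(punct_to_space("今天天气不错，我们出去走走吧！"))  # -> 今天天气不错 我们出去走走吧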

[BUG] Corrupted generated speeches

My fine-tuned model generates some corrupted speech audio. The outputs all seem to have some words / sentences shuffled.

Here's an example of generated speech:

fake.zip

The expected speech text is:

这是嘟嘟可,是可莉很久以前就交到的好朋友,要记得他的名字哦!以后别叫他挂在你包上的玩偶了!

While the actual speech content in fake.wav is:

这是嘟嘟可,<inaudible>记得他的名字哦!以后别<inaudible>就交到的好朋友,要记得他的名字哦!以后别叫他挂在你包上的玩偶了!

The inference steps for this output:

python tools/vqgan/inference.py -i "/mnt/d/Projects/VocalAI/Datasets/GenshinImpact-Voices-Labeled/Klee/vo_JNJEQ003_6_klee_01.wav" --checkpoint-path "results/vqgan_finetune_GenshinImpact_Klee/checkpoints/step_000020000.ckpt"
python tools/llama/generate.py --text "这是嘟嘟可,是可莉很久以前就交到的好朋友,要记得他的名字哦!以后别叫他挂在你包上的玩偶了!" --prompt-text "妈妈说过,今天大家要迎接风神大人,要请他喝酒!如果风神大人高兴,就会变成风来祝福大家!" --prompt-tokens fake.npy --checkpoint-path "results/text2semantic_finetune_spk_GenshinImpact_Klee/checkpoints/step_000001000.ckpt" --config-name text2semantic_finetune_spk_GenshinImpact_Klee --speaker SPK1 --num-samples 2 --compile
python tools/vqgan/inference.py -i codes_0.npy --checkpoint-path "results/vqgan_finetune_GenshinImpact_Klee/checkpoints/step_000020000.ckpt"

Given that the VQGAN is relatively stable, and this kind of chaotic output is often seen in an LLM's output, I believe these corrupted speeches are caused by the Llama model.

Is this behavior expected from the model? Do you have any suggestions on how to avoid this kind of issue?

[BUG] VQGAN copysyn

When running the command python tools/vqgan/inference.py -i test/01040701.wav, the synthesized fake.wav file contains only NaN values. Upon checking fake.npy, I found that its values are all zeros. Could you please tell me what the value range should be for the input .wav file? The range of the .wav file I input was between -0.5 and 0.5.
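A hedged sanity-check sketch, assuming librosa is installed and reusing the path from the report: confirm the clip loads as finite float audio before feeding it to tools/vqgan/inference.py, to rule out a degenerate input as the source of the all-zero codes.

import numpy as np
import librosa

audio, sr = librosa.load("test/01040701.wav", sr=None, mono=True)
print(f"dtype={audio.dtype} sr={sr} min={audio.min():.3f} max={audio.max():.3f}")
assert np.isfinite(audio).all(), "the input itself contains NaN/inf"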

Training issue

Hello, I have been deeply impressed by your project after reviewing it. Thank you for open-sourcing such a fantastic project. Additionally, I have two questions for you:

  1. Computing Power: Could you please share the GPU model you used to train the current results? Also, approximately how much time did it take for the training?

  2. Are you planning to release the training code?

If it is convenient for you, I would greatly appreciate it if you could clarify these questions for me. Once again, thank you very much.

[BUG] How to reproduce demo for English TTS

I am trying to use the web UI. I load the checkpoint and use the reference wav provided on the demo page, but I cannot reproduce the demo.

Describe the bug
I am trying to reproduce this demo
image

model settings
image

I used same reference wav.
image

But the results are pretty bad
audio.zip

How could I reproduce the demo performance? Thanks!

OutOfMemoryError

Describe the bug
I followed your steps for fine-tuning. During the last step of training, this error occurred.

Screenshots / log

Additional context
My machine is a V100 with 32 GB of memory.
The maximum audio length in my dataset is 8 seconds, with a sample rate of 22050.
The problem still occurs when the batch size is set to 2.

Inquiry About Inference Speed in Git Repository with Specific CUDA and PyTorch Configuration

I have encountered an issue related to inference speed that I hope you can assist me with. Below are the details of my environment and the problem:

CUDA and PyTorch Version:
Due to the limitations of my environment, I am restricted to using CUDA 11.7, which corresponds to PyTorch version 2.1.0.

Modification for Compatibility:
To make the code run smoothly, I had to comment out line 20 in the file tools/llama/generate.py, which is torch._inductor.config.coordinate_descent_tuning = True.

Issue with Inference Speed:
I observed that the inference speed has not significantly improved as expected. The speed seems to have increased only marginally from 30 tokens/second to 35 tokens/second, rather than reaching the anticipated 500 tokens/second.

Given this context, I have a couple of questions:

Have you conducted any experiments, or do you have any insights, related to inference speed with this specific configuration (CUDA 11.7 and PyTorch 2.1.0)?
Is there a known issue or limitation with this setup that could explain the lack of expected acceleration in token processing speed?
Any guidance or information you can provide would be greatly appreciated.

Thank you for your time and assistance.

Can't reproduce the demo's results?

I use the default code and pretrained models from https://speech.fish.audio/zh/, but I cannot generate results like https://speech.fish.audio/zh/samples/. The only difference is that I renamed the file https://huggingface.co/fishaudio/speech-lm-v1/blob/main/tokenizer_config.json to config.json, and found that the id corresponding to pad_token is 36407, not 32311. The audio it generates has a lot of repetitions. Below is my inference code.

text='人间灯火倒映湖中,她的渴望让静水泛起涟漪。若代价只是孤独,那就让这份愿望肆意流淌。流入她所注视的世间,也流入她如湖水般澄澈的目光。'
#<<8
python tools/vqgan/inference.py \
    -i "/tmp/xxx/naxita.wav" \
    --checkpoint-path "/tmp/xxx/fishaudio-speech-lm-v1-pretrained/vqgan-v1.pth"

#8
#<<8
python tools/llama/generate.py \
    --text "${text}" \
    --prompt-text "${text}" \
    --prompt-tokens "fake.npy" \
    --checkpoint-path "/tmp/xxx/fishaudio-speech-lm-v1-pretrained/text2semantic-400m-v0.2-4k.pth" \
    --tokenizer "/tmp/xxx/fishaudio-speech-lm-v1-pretrained" \
    --num-samples 2 \
    --compile
#8

#<<8
python tools/vqgan/inference.py \
    -i "codes_0.npy" \
    --checkpoint-path "/tmp/xxx/fishaudio-speech-lm-v1-pretrained/vqgan-v1.pth"
#8
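A hedged check, assuming the checkpoint directory loads with Hugging Face transformers (as the --tokenizer flag above suggests): print what the tokenizer actually reports for pad_token after the rename, to see where 36407 vs. 32311 comes from.

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("/tmp/xxx/fishaudio-speech-lm-v1-pretrained")
print(tok.pad_token, tok.pad_token_id)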

[Help] "pip install -e" fails

Hello, here is the description:

Describe the bug
I cannot install it; the command "pip install -e ." (or with pip3) fails.

To Reproduce
create a venv (python -m venv venv)
activate it (venv\Scripts\activate)
install torch etc. (pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu121)
run this: (pip3 install -e .)

Expected behavior
I expected it to conclude the installation normally but instead it stopped with an error, which is:

....
copying pynini\lib\py.typed -> build\lib.win-amd64-cpython-310\pynini\lib
      running build_ext
      building '_pywrapfst' extension
      creating build\temp.win-amd64-cpython-310
      creating build\temp.win-amd64-cpython-310\Release
      creating build\temp.win-amd64-cpython-310\Release\extensions
      "C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.38.33130\bin\HostX86\x64\cl.exe" /c /nologo /O2 /W3 /GL /DNDEBUG /MD -IC:\FishSPEECH1\fish-speech\venvFishSpeech\include -IC:\Users\aaa\AppData\Local\Programs\Python\Python310\include -IC:\Users\aaa\AppData\Local\Programs\Python\Python310\Include "-IC:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.38.33130\include" "-IC:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.38.33130\ATLMFC\include" "-IC:\Program Files\Microsoft Visual Studio\2022\Community\VC\Auxiliary\VS\include" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22621.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22621.0\\um" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22621.0\\shared" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22621.0\\winrt" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22621.0\\cppwinrt" /EHsc /Tpextensions/_pywrapfst.cpp /Fobuild\temp.win-amd64-cpython-310\Release\extensions/_pywrapfst.obj -std=c++17 -Wno-register -Wno-deprecated-declarations -Wno-unused-function -Wno-unused-local-typedefs -funsigned-char
      cl : Ligne de commande error D8021 : argument numérique non valide '/Wno-register'
      error: command 'C:\\Program Files\\Microsoft Visual Studio\\2022\\Community\\VC\\Tools\\MSVC\\14.38.33130\\bin\\HostX86\\x64\\cl.exe' failed with exit code 2
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for pynini
  Running setup.py clean for pynini
Successfully built fish-speech
Failed to build pynini
ERROR: Could not build wheels for pynini, which is required to install pyproject.toml-based projects

Screenshots / log
https://imgur.com/m17l6OW

Additional context
before getting this error I had this one:

---------------------------------------- 196.4/196.4 kB ? eta 0:00:00
Collecting pynini==2.1.5
Downloading pynini-2.1.5.tar.gz (627 kB)
  ---------------------------------------- 627.6/627.6 kB 38.6 MB/s eta 0:00:00
Preparing metadata (setup.py) ... error
error: subprocess-exited-with-error

× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [6 lines of output]
   Traceback (most recent call last):
     File "<string>", line 2, in <module>
     File "<pip-setuptools-caller>", line 34, in <module>
     File "C:\Users\Saaa\AppData\Local\Temp\pip-install-t0wvvm_h\pynini_460d91c97a9d40899ca2bd7ffa515beb\setup.py", line 22, in <module>
       from Cython.Build import cythonize
   ModuleNotFoundError: No module named 'Cython'
   [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

[notice] A new release of pip available: 22.2.1 -> 23.3.2
[notice] To update, run: python.exe -m pip install --upgrade pip

So I did this:
py -m pip install --upgrade pip setuptools wheel

Which led me to the current error now.

I tried some solutions such as:

pip install aiohttp
pip install aiohttp==3.9.0b0

These are solutions I found somewhere on Stack Overflow, on a website called weoghnite, and in Microsoft forums.

Maybe I should update my Python? I am using 3.10.6, as seen in the first screenshot.

[Feature] Add ASR utils

Is your feature request related to a problem? Please describe.
No.

Describe the solution you'd like
Integrate Whisper and Fun ASR.

Describe alternatives you've considered
N/A

Additional context
N/A

Examples

Hi,
This project looks very promising! Would you mind providing some English samples?
Thank you!

[Feature] How to accelerate inference

image
Is your feature request related to a problem? Please describe.

Is the inference time normal? Not counting model loading time, is it normal to take more than 30 s to generate speech for 60 English words? How could I accelerate it if I want to bring the inference time down to ~2 s? Thanks!


[BUG] Model output keeps repeating itself, after certain input/max token combination was executed.

Describe the bug
When certain audio and max tokens settings were used to generate the first audio, all subsequent generations will keep using the inputs of the first generation no matter what is given to the model.

To Reproduce
Steps to reproduce the behavior:
See video below:
https://www.awesomescreenshot.com/video/23829885?key=2c741f2c20efe2939f38085c73b3d740

Expected behavior
Subsequent generations should generate what is given as the input, not what is given in the first generation.

Screenshots / log
https://www.awesomescreenshot.com/video/23829885?key=2c741f2c20efe2939f38085c73b3d740

Additional context
Created a PR to fix it. Possibly related to setup_caches; the reason is unknown. Maybe the repo owner can help investigate.

[Feature] llama model size selection

Thank you so much for your open-source project.

Have you considered using llama models of larger sizes?

After completing several fine-tuning tasks, I found that the current Llama model, even after fine-tuning, has difficulty learning how the same word should be read in different contexts.

For example, the word '滚' in the Chinese blessing '财运滚滚' was read as if it were a swear word.
Perhaps this is a limitation of the small-size Llama, or I may need more complex training data for fine-tuning (currently using 20 hours).

Maybe offering different model sizes, as openai-whisper does, would be a good solution.

[Feature] Try Unsloth to optimize Llama fine-tuning performance

Fine-tuning the Llama model currently isn't achievable on a single A100 card machine using the default configuration, as it requires more than 80GB of vRAM. Outdated, see #41 .

It is mentioned that adjusting the configuration or using gradient checkpointing could potentially solve this issue, and I've noticed that you also have implemented FlashAttention to save resources and speed up fine-tuning.

However, Unsloth claims to reduce GPU memory usage by 40% while also providing an over 75% performance gain with its free "Unsloth Open" version, compared to FlashAttention (note that I think they are comparing against FlashAttention version 1, which is outdated). It also has dedicated kernel optimizations for LoRA.

Do you think it's worth giving it a try?

[Feature] Dataset format required for Llama fine-tuning

.
├── SPK1
│   ├── 21.15-26.44.lab
│   ├── 21.15-26.44.mp3
│   ├── 27.51-29.98.lab
│   ├── 27.51-29.98.mp3
│   ├── 30.1-32.71.lab
│   └── 30.1-32.71.mp3
└── SPK2
    ├── 38.79-40.85.lab
    └── 38.79-40.85.mp3

For the "lab" files shown here, if "txt" files are used instead, what should their format look like?
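A hedged sketch, assuming (as the layout above suggests) that each .lab file is a plain-text transcript sharing its basename with an audio file; a .txt-based variant would need to provide the same pairing:

from pathlib import Path

root = Path("SPK1")  # hypothetical path to one of the speaker folders shown above
for lab in sorted(root.glob("*.lab")):
    audio = lab.with_suffix(".mp3")
    print(audio.name, "->", lab.read_text(encoding="utf-8").strip())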

[Feature] Add ms-stft and ms-sb-cqt discriminators

Thanks for a very cool project.

This is the best and simplest LLM-based TTS implementation I have ever seen!

For audio quality, I highly recommend adding the MS-STFT discriminator from EnCodec and the MS-SB-CQT discriminator (https://arxiv.org/abs/2311.14957).

When using the above discriminators, I have observed better audio quality for this kind of autoencoder model.

Thanks!

Intermittent "Invalid argument" error at the inference.py -i step

Environment:

Ubuntu 22.04

PyTorch package versions:

torch                     2.1.2
torchaudio                2.1.2
torchmetrics              1.3.0.post0
torchvision               0.16.2

GPU:

A30

cuda_12.2.r12.2

Driver Version: 535.54.03

Steps to reproduce

In the source-audio stage (-i with a wav file; the lzl1 audio below), the whole pipeline fails and cannot proceed.
In the later generation stage (after other audio files are used as input), -i with an npy file also triggers the error frequently; with slightly longer inference text it happens every time.

# Source audio input stage
python tools/vqgan/inference.py \
    -i "lzl1.wav" \
    --checkpoint-path "checkpoints/vqgan-v1.pth"


# Generation stage
python tools/vqgan/inference.py \
    -i "codes_0.npy" \
    --checkpoint-path "checkpoints/vqgan-v1.pth"

Audio:

17s
16bit,44100Hz,1411 kbps
lzl1.zip

Text:

从小到大,都是别人教你:该做什么,不该做什么,其实,人生,这么复杂,哪里是一句:一份耕耘、一份收获,就可以讲的清楚的呢?遵从内心一次。

Error details

/root/anaconda3/envs/fish-speech/lib/python3.10/site-packages/torch/nn/utils/weight_norm.py:30: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
  warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
2024-01-25 04:19:39.209 | INFO     | __main__:main:51 - Restored model from checkpoint
2024-01-25 04:19:39.210 | INFO     | __main__:main:54 - Processing in-place reconstruction of lzl1.wav
2024-01-25 04:19:40.343 | INFO     | __main__:main:62 - Loaded audio with 17.06 seconds
2024-01-25 04:19:41.554 | INFO     | __main__:main:102 - Generated indices of shape torch.Size([4, 368])
2024-01-25 04:19:42.025 | INFO     | __main__:main:126 - VQ Encoded, indices: torch.Size([4, 1, 368, 1]) equivalent to 21.53 Hz
Traceback (most recent call last):
  File "/path/to/fish-speech/tools/vqgan/inference.py", line 147, in <module>
    main()
  File "/root/anaconda3/envs/fish-speech/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/anaconda3/envs/fish-speech/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
    return func(*args, **kwargs)
  File "/root/anaconda3/envs/fish-speech/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/root/anaconda3/envs/fish-speech/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/root/anaconda3/envs/fish-speech/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/root/anaconda3/envs/fish-speech/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/path/to/fish-speech/tools/vqgan/inference.py", line 135, in main
    fake_audios = model.generator(decoded_mels)
  File "/root/anaconda3/envs/fish-speech/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/fish-speech/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/path/to/fish-speech/fish_speech/models/vqgan/modules/decoder.py", line 76, in forward
    xs += self.resblocks[i * self.num_kernels + j](x)
  File "/root/anaconda3/envs/fish-speech/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/fish-speech/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/path/to/fish-speech/fish_speech/models/vqgan/modules/decoder.py", line 172, in forward
    xt = c1(xt)
  File "/root/anaconda3/envs/fish-speech/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/fish-speech/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/anaconda3/envs/fish-speech/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 310, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/root/anaconda3/envs/fish-speech/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 306, in _conv_forward
    return F.conv1d(input, weight, bias, self.stride,
RuntimeError: Invalid argument

which pytorch version do you use?

python tools/llama/generate.py \
    --text "要转换的文本" \
    --prompt-text "你的参考文本" \
    --prompt-tokens "fake.npy" \
    --checkpoint-path "checkpoints/text2semantic-400m-v0.2-4k.pth" \
    --num-samples 2

Traceback (most recent call last):
  File "xxx/fish-speech/tools/llama/generate.py", line 22, in <module>
    torch._inductor.config.fx_graph_cache = True  # Experimental feature to reduce compilation times, will be on by default in future
  File "xxx/fish-speech/lib/python3.10/site-packages/torch/_dynamo/config_utils.py", line 72, in __setattr__
    raise AttributeError(f"{self.__name__}.{name} does not exist")
AttributeError: torch._inductor.config.fx_graph_cache does not exist

which pytorch version do you use?
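A hedged compatibility sketch (not the repository's code): both torch._inductor.config.fx_graph_cache (line 22 here) and torch._inductor.config.coordinate_descent_tuning (line 20, mentioned in an earlier issue) only exist in newer PyTorch builds, so guarding the assignments avoids this AttributeError on older versions.

import torch._inductor.config as inductor_config

# Only enable the flags that this PyTorch build actually knows about.
for flag in ("coordinate_descent_tuning", "fx_graph_cache"):
    if hasattr(inductor_config, flag):
        setattr(inductor_config, flag, True)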

How much data is needed?

Thank you for the great project! May I ask a couple of questions? 1. How much data is needed to fine-tune the VQGAN and Llama? 2. Do you have any experimental results on the effect of dataset size?
