fish-speech's Introduction

Fish Speech

English | 中文简体 | Portuguese (Brazil) | 日本語

This codebase and all models are released under CC-BY-NC-SA-4.0 License. Please refer to LICENSE for more details.

Disclaimer

We do not take any responsibility for illegal use of this codebase. Please refer to your local laws regarding the DMCA and other related regulations.

Online Demo

Fish Audio

Quick Start for Local Inference

inference.ipynb

Videos

V1.4 Demo Video: YouTube

Documents

Samples

Credits

Sponsor

fish-speech's People

Contributors

anyacoder, bfs18, blaisewf, cminus01, duliangang, dur-randir, eltociear, erquren, faceair, hscspring, initialencounter, jmoney7823956789378, kenwaytis, leng-yue, naozumi520, oedosoldier, potato-mika, ppmzhang2, pre-commit-ci[bot], sapphirelab, stardust-minus, therc4, thiagojramos, touch-night, tps-f, tsingshui, v3ucn, wblgers, whale-dolphin, yorelog

fish-speech's Issues

[QUESTION] VQGAN Low quality prediction for some audios

Recently I found that the VQGAN does not seem very stable in terms of prediction quality; it gives obviously low-quality predictions in many cases.

Here is one typical bad case for example:

sample-6-gt-and-pred.zip

Is this a weakness of the VQGAN model, or can it be worked around by using a higher sampling rate?

I have seen that in vqgan_pretrain.yaml we have sample_rate: 22050, while in vqgan_pretrain_v2.yaml we have sample_rate: 44100.
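A minimal resampling sketch, assuming torchaudio is available and using a hypothetical file name for the affected clip: the input audio should match the sample_rate of whichever config is used (22050 for vqgan_pretrain.yaml, 44100 for vqgan_pretrain_v2.yaml), since a mismatch alone can degrade reconstruction quality.

import torchaudio

# Hypothetical stand-in for the problematic clip from the zip above.
audio, sr = torchaudio.load("sample-6-gt.wav")
target_sr = 44100  # match the sample_rate of the config actually used
if sr != target_sr:
    audio = torchaudio.functional.resample(audio, orig_freq=sr, new_freq=target_sr)
    torchaudio.save("sample-6-gt-44k.wav", audio, target_sr)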

Inconsistent audio samples between steps in training

image

I am attempting to fine-tune the VQGAN. On the audio page, under sample-0/wavs/prediction, the sound played at step 0 is an inferred voice segment (which is correct). However, at step 2000, the sound played is a different voice clip from the training set, and the content of the two segments is different.

I'm unsure where the issue lies and how to troubleshoot it.

[QUESTION] generated audio always has low pitch noise blended especially for male voice

I used this model to generate a male voice (for example, Zhongli from Genshin), but there is always some low-pitch noise mixed into the generated speech. I am not sure whether this is because male voices are lower in pitch, since I did not observe this issue with female voice generation. When I use the VQGAN for prompt generation, the fake wav sounds okay and is less noisy, but when I use that prompt to generate speech from new text, the noise or electronic artifacts appear. Where is the problem coming from, the VQGAN or the Llama model? Is there any solution? I have tried fine-tuning the VQGAN with 50 hours of audio but did not observe an improvement. Any suggestion to reduce the noise and improve the audio quality would be greatly appreciated.

[Feature] Is training supported on Apple Silicon (MPS)?

I got a MacBook with Apple's own chip and would like to play with this locally, but it seems only CUDA is supported. It reports the following error:

return self._apply(lambda t: t.cuda(device))
RuntimeError: Invalid device, must be cuda device

RuntimeError: cutlassF: no kernel found to launch!

The following error occurred when executing the second part of the inference code. How can I resolve this?

Driver Version: 546.33

nvcc -V

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Nov__3_17:16:49_PDT_2023
Cuda compilation tools, release 12.3, V12.3.103
Build cuda_12.3.r12.3/compiler.33492891_0
Traceback (most recent call last):
  File "/home/meowlgm/fish-speech/tools/llama/generate.py", line 476, in <module>
    main()
  File "/home/meowlgm/anaconda3/envs/fish-speech/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/meowlgm/anaconda3/envs/fish-speech/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/meowlgm/anaconda3/envs/fish-speech/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/meowlgm/anaconda3/envs/fish-speech/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/meowlgm/fish-speech/tools/llama/generate.py", line 440, in main
    y = generate(
  File "/home/meowlgm/anaconda3/envs/fish-speech/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/meowlgm/fish-speech/tools/llama/generate.py", line 237, in generate
    next_token = prefill(
  File "/home/meowlgm/fish-speech/tools/llama/generate.py", line 129, in prefill
    logits = model.forward_generate(x, input_pos)
  File "/home/meowlgm/fish-speech/fish_speech/models/text2semantic/llama.py", line 209, in forward_generate
    return self.compute(x, freqs_cis, mask, input_pos=input_pos)
  File "/home/meowlgm/fish-speech/fish_speech/models/text2semantic/llama.py", line 151, in compute
    x = layer(x, freqs_cis, mask, input_pos=input_pos)
  File "/home/meowlgm/anaconda3/envs/fish-speech/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/meowlgm/anaconda3/envs/fish-speech/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/meowlgm/fish-speech/fish_speech/models/text2semantic/llama.py", line 223, in forward
    h = x + self.attention(self.attention_norm(x), freqs_cis, mask, input_pos)
  File "/home/meowlgm/anaconda3/envs/fish-speech/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/meowlgm/anaconda3/envs/fish-speech/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/meowlgm/fish-speech/fish_speech/models/text2semantic/llama.py", line 278, in forward
    y = F.scaled_dot_product_attention(q, k, v, attn_mask=mask, dropout_p=0.0)
RuntimeError: cutlassF: no kernel found to launch!
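A hedged workaround sketch (not the repository's fix): the error is raised from F.scaled_dot_product_attention when it selects a fused CUDA kernel that the installed build or GPU cannot launch. Disabling the fused backends before generation falls back to the plain math implementation, which may avoid the crash at some speed cost.

import torch

# Run once before the generation code (e.g. near the top of tools/llama/generate.py).
torch.backends.cuda.enable_flash_sdp(False)
torch.backends.cuda.enable_mem_efficient_sdp(False)
torch.backends.cuda.enable_math_sdp(True)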

[Feature] Other language support?

Is your feature request related to a problem? Please describe.
I have long been looking for a text-to-speech project that can be used for Cantonese, but most are outdated and the quality is not great. I tried to port vits2 and innnky/emotional-vits to Cantonese by copying the old Cantonese cleaner, but the results were quite poor.

Describe the solution you'd like
Use chinese-dialect-lexicons as text cleaner?

Describe alternatives you've considered
I've noticed that phonemizer is going to be dropped soon. I wonder if using --no-g2p can achieve this.

Additional context
N/A

2 Questions

Hello, amazing job done here!

But I had 2, or maybe 3, quick questions:

1. Regarding bertvits2, do you have a sample with plain vits2 and then the same sample after adding BERT? I would like to hear the actual difference.

2. Why do you use plain BERT? I have seen PnG-BERT, MP-BERT, and PL-BERT, but none of them are being used. Is it because the plain BERT you use is far better than those three?

3. I listened to the samples at https://speech.fish.audio/en/samples/ and they are really realistic and emotional, almost like ElevenLabs. How did you achieve the emotion? Is it only by integrating BERT? I thought BERT was only for prosody enhancement and didn't imagine it would increase the emotion that much!

[BUG] Generated audio is spliced together with the prompt content

python tools/llama/generate.py
--text "海鸥岛有哪些好玩的地方"
--prompt-text "打开晾衣架照明"
--prompt-tokens "fake.npy"
--checkpoint-path "results/text2semantic_400m_finetune_spk/checkpoints/step_000000400.ckpt"
--speaker "1"
--num-samples 2
--compile

I fine-tuned the VQGAN & Llama with my own corpus.

The audio corresponding to the prompt is: 打开晾衣架照明
The content of the generated audio is: 哎。。。打开晾衣架照明,海鸥岛有哪些好玩的地方

[Question] VQGAN fine-tuning dataset size and training time

I'd like to gather some insights about the fine-tuning process from those who have experienced it.

  • What is the recommended dataset size for fine-tuning?

  • What is the anticipated training time? I noticed there's a max_steps: 100000 configuration for the Lightning trainer, but based on my current training speed, this will take a significant amount of time (>7 days). I have attempted to run it on both a single-card A100 instance and an eight-card V100 instance; both resulted in a training speed of ~7500 epochs per 12 hours.

  • And one more question: Is the VQGAN model sensitive to the emotion or speaking style of the speaker? For example, a single speaker might express different emotions in different samples; will this influence the training results?

Optimize chinese text normalization

The current logic of text_normalize is not capable of correctly processing scenarios involving years.

import re
import cn2an

def text_normalize(text):
    # Replace each Arabic-numeral match with its Chinese reading via an2cn().
    numbers = re.findall(r"\d+(?:\.?\d+)?", text)
    for number in numbers:
        text = text.replace(number, cn2an.an2cn(number), 1)
    return text

If you plan to continue using cn2an, you should consider using the transform method.

Alternatively, consider using WeTextProcessing; from my perspective, it appears to be more professional.
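A hedged sketch of the cn2an transform approach suggested above; the example strings are illustrative, not from the project:

import cn2an

# transform() normalizes numbers in context and reads years digit by digit,
# unlike converting each regex match with an2cn().
print(cn2an.an2cn("2023"))                       # 二千零二十三 (wrong reading for a year)
print(cn2an.transform("2023年3月4日", "an2cn"))   # expected: 二零二三年三月四日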

[BUG] Confusion caused by 'prompt-text/tokens' in inference stage

Thank you so much for your open-source initiative. The transition from a VITS structure to a transformer-based structure is exciting.

Describe the bug
It could be a mistake in the description of the introduction/inference, or it could be something else.

Details
I noticed that in the introduction, the inference stage of the model was described as requiring only text input. But under inference, the user is asked to provide 'prompt-text/tokens'.

Perhaps you can consider adding a simple inference code that requires only --text or explain why prompt input is still required during the inference phase.

Thank you for your contributions.

Must training data be separated by speaker?

Hello,
I want to train a new language from scratch; the data was downloaded from the web and has no speaker labels. Do I need to run speaker clustering first and then train the model?

Thanks!

[BUG] Generated speech is unclear

Describe the bug

Is the unclear pronunciation caused by the "half" parameter breaking something?

fake.wav.zip

Another question is how should I fill in the "prompt-text"? I noticed in the code that "prompt-text" is eventually concatenated with "text", so is it okay if I don't write anything for it?

To Reproduce

Here is the command I used.

python tools/vqgan/inference.py \
    -i "胡子v1_0.wav" \
    --checkpoint-path "checkpoints/vqgan-v1.pth"

python tools/llama/generate.py \
    --text "要转换的文本" \
    --prompt-text "你的参考文本" \
    --prompt-tokens "fake.npy" \
    --checkpoint-path "checkpoints/text2semantic-400m-v0.2-4k.pth" \
    --compile \
    --half

python tools/vqgan/inference.py \
    -i "codes_0.npy" \
    --checkpoint-path "checkpoints/vqgan-v1.pth"

data

You said that the pretrained model was trained on 10k hours of Chinese, 700 hours of English, and 300 hours of Japanese.
Is this data public? Where can I find those 11k hours? What sample rate are they at, 44.1 or 48 kHz? Please link them if possible.

Thanks in Advance!

Question about ASR results.

Nice project! For time reasons, I did not read the code carefully, so I am confused about the ASR results.
The original ASR model here is OpenAI Whisper, and I found that its outputs use spaces instead of punctuation marks. However, Whisper does not perform very well on Mandarin, so I use FunASR instead, but FunASR's outputs include punctuation marks.

  1. Should I replace the punctuation with spaces? (One possible mapping is sketched below.)

  2. I found that the prompt text and text in your demo include punctuation marks, which makes me confused.
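A hedged preprocessing sketch (not the repository's code) showing one way to map FunASR's punctuated transcripts onto the space-separated style Whisper produces here:

import re

# Chinese and ASCII punctuation commonly emitted by FunASR; extend as needed.
_PUNCT = re.compile(r"""[，。！？、；：“”‘’（）,.!?;:'"()]+""")

def punct_to_space(text: str) -> str:
    return _PUNCT.sub(" ", text).strip()

print(punct_to_space("今天天气不错，我们出去走走吧！"))  # -> 今天天气不错 我们出去走走吧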

[BUG] Corrupted generated speeches

My fine-tuned model generates some corrupted speech audio. The outputs all seem to have some words / sentences shuffled.

Here's an example of generated speech:

fake.zip

The expected speech text is:

这是嘟嘟可,是可莉很久以前就交到的好朋友,要记得他的名字哦!以后别叫他挂在你包上的玩偶了!

While the actual speech content in fake.wav is:

这是嘟嘟可,<inaudible>记得他的名字哦!以后别<inaudible>就交到的好朋友,要记得他的名字哦!以后别叫他挂在你包上的玩偶了!

The inference steps for this output:

python tools/vqgan/inference.py -i "/mnt/d/Projects/VocalAI/Datasets/GenshinImpact-Voices-Labeled/Klee/vo_JNJEQ003_6_klee_01.wav" --checkpoint-path "results/vqgan_finetune_GenshinImpact_Klee/checkpoints/step_000020000.ckpt"
python tools/llama/generate.py --text "这是嘟嘟可,是可莉很久以前就交到的好朋友,要记得他的名字哦!以后别叫他挂在你包上的玩偶了!" --prompt-text "妈妈说过,今天大家要迎接风神大人,要请他喝酒!如果风神大人高兴,就会变成风来祝福大家!" --prompt-tokens fake.npy --checkpoint-path "results/text2semantic_finetune_spk_GenshinImpact_Klee/checkpoints/step_000001000.ckpt" --config-name text2semantic_finetune_spk_GenshinImpact_Klee --speaker SPK1 --num-samples 2 --compile
python tools/vqgan/inference.py -i codes_0.npy --checkpoint-path "results/vqgan_finetune_GenshinImpact_Klee/checkpoints/step_000020000.ckpt"

Given that the VQGAN is relatively stable, and this kind of chaotic output is often seen in an LLM's output, I believe these corrupted speeches are caused by the Llama model.

Is this behavior expected from the model? Do you have any suggestions on how to avoid this kind of issue?

[BUG] VQGAN copysyn

When running the command python tools/vqgan/inference.py -i test/01040701.wav, the synthesized fake.wav file contains only NaN values. Upon checking fake.npy, I found that its values are all zeros. Could you please tell me what the value range should be for the input .wav file? The range of the .wav file I input was between -0.5 and 0.5.
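A hedged sanity-check sketch, assuming librosa is installed and reusing the path from the report: confirm the clip loads as finite float audio before feeding it to tools/vqgan/inference.py, to rule out a degenerate input as the source of the all-zero codes.

import numpy as np
import librosa

audio, sr = librosa.load("test/01040701.wav", sr=None, mono=True)
print(f"dtype={audio.dtype} sr={sr} min={audio.min():.3f} max={audio.max():.3f}")
assert np.isfinite(audio).all(), "the input itself contains NaN/inf"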

Training issue

Hello, I have been deeply impressed by your project after reviewing it. Thank you for open-sourcing such a fantastic project. Additionally, I have two questions for you:

  1. Computing Power: Could you please share the GPU model you used to train the current results? Also, approximately how much time did it take for the training?

  2. Are you planning to release the training code?

If it is convenient for you, I would greatly appreciate it if you could clarify these questions for me. Once again, thank you very much.

[BUG] How to reproduce demo for English TTS

I am trying to use the web UI. I load the checkpoint and use the reference wav provided on the demo page, but I cannot reproduce the demo.

Describe the bug
I am trying to reproduce this demo
image

model settings
image

I used same reference wav.
image

But the results are pretty bad
audio.zip

How could I reproduce the demo performance? Thanks!

OutOfMemoryError

Describe the bug
I followed your steps for fine-tuning. During the last step of training, this error occurred.

Screenshots / log

Additional context
My machine is a V100 with 32 GB of memory.
The maximum audio length in my dataset is 8 seconds, with a sample rate of 22050.
The problem still occurs when the batch size is set to 2.

Inquiry About Inference Speed in Git Repository with Specific CUDA and PyTorch Configuration

I have encountered an issue related to inference speed that I hope you can assist me with. Below are the details of my environment and the problem:

CUDA and PyTorch Version:
Due to the limitations of my environment, I am restricted to using CUDA 11.7, which corresponds to PyTorch version 2.1.0.

Modification for Compatibility:
To make the code run smoothly, I had to comment out line 20 in the file tools/llama/generate.py, which is torch._inductor.config.coordinate_descent_tuning = True.

Issue with Inference Speed:
I observed that the inference speed has not significantly improved as expected. The speed seems to have increased only marginally from 30 tokens/second to 35 tokens/second, rather than reaching the anticipated 500 tokens/second.

Given this context, I have a couple of questions:

Have you conducted any experiments, or do you have any insights, related to inference speed with this specific configuration (CUDA 11.7 and PyTorch 2.1.0)?
Is there a known issue or limitation with this setup that could explain the lack of expected acceleration in token processing speed?
Any guidance or information you can provide would be greatly appreciated.

Thank you for your time and assistance.

Can't reproduce the demo's results?

I use the default code and pretrained models from https://speech.fish.audio/zh/, but I cannot generate results like https://speech.fish.audio/zh/samples/. The only difference is that I renamed the file https://huggingface.co/fishaudio/speech-lm-v1/blob/main/tokenizer_config.json to config.json, and found that the id corresponding to pad_token is 36407, not 32311. The audio it generates has a lot of repetitions. Below is my inference code.

text='人间灯火倒映湖中,她的渴望让静水泛起涟漪。若代价只是孤独,那就让这份愿望肆意流淌。流入她所注视的世间,也流入她如湖水般澄澈的目光。'
#<<8
python tools/vqgan/inference.py \
    -i "/tmp/xxx/naxita.wav" \
    --checkpoint-path "/tmp/xxx/fishaudio-speech-lm-v1-pretrained/vqgan-v1.pth"

#8
#<<8
python tools/llama/generate.py \
    --text "${text}" \
    --prompt-text "${text}" \
    --prompt-tokens "fake.npy" \
    --checkpoint-path "/tmp/xxx/fishaudio-speech-lm-v1-pretrained/text2semantic-400m-v0.2-4k.pth" \
    --tokenizer "/tmp/xxx/fishaudio-speech-lm-v1-pretrained" \
    --num-samples 2 \
    --compile
#8

#<<8
python tools/vqgan/inference.py \
    -i "codes_0.npy" \
    --checkpoint-path "/tmp/xxx/fishaudio-speech-lm-v1-pretrained/vqgan-v1.pth"
#8
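A hedged check, assuming the checkpoint directory loads with Hugging Face transformers (as the --tokenizer flag above suggests): print what the tokenizer actually reports for pad_token after the rename, to see where 36407 vs. 32311 comes from.

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("/tmp/xxx/fishaudio-speech-lm-v1-pretrained")
print(tok.pad_token, tok.pad_token_id)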

[Help] "pip install -e" fails

Hello, here is the description:

Describe the bug
I cannot install it; the command "pip install -e ." (or with pip3) fails.

To Reproduce
create a venv (python -m venv venv)
activate it (venv\Scripts\activate)
install torch etc. (pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu121)
run this: (pip3 install -e .)

Expected behavior
I expected it to conclude the installation normally but instead it stopped with an error, which is:

....
copying pynini\lib\py.typed -> build\lib.win-amd64-cpython-310\pynini\lib
      running build_ext
      building '_pywrapfst' extension
      creating build\temp.win-amd64-cpython-310
      creating build\temp.win-amd64-cpython-310\Release
      creating build\temp.win-amd64-cpython-310\Release\extensions
      "C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.38.33130\bin\HostX86\x64\cl.exe" /c /nologo /O2 /W3 /GL /DNDEBUG /MD -IC:\FishSPEECH1\fish-speech\venvFishSpeech\include -IC:\Users\aaa\AppData\Local\Programs\Python\Python310\include -IC:\Users\aaa\AppData\Local\Programs\Python\Python310\Include "-IC:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.38.33130\include" "-IC:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.38.33130\ATLMFC\include" "-IC:\Program Files\Microsoft Visual Studio\2022\Community\VC\Auxiliary\VS\include" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22621.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22621.0\\um" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22621.0\\shared" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22621.0\\winrt" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22621.0\\cppwinrt" /EHsc /Tpextensions/_pywrapfst.cpp /Fobuild\temp.win-amd64-cpython-310\Release\extensions/_pywrapfst.obj -std=c++17 -Wno-register -Wno-deprecated-declarations -Wno-unused-function -Wno-unused-local-typedefs -funsigned-char
      cl : Ligne de commande error D8021 : argument numérique non valide '/Wno-register'
      error: command 'C:\\Program Files\\Microsoft Visual Studio\\2022\\Community\\VC\\Tools\\MSVC\\14.38.33130\\bin\\HostX86\\x64\\cl.exe' failed with exit code 2
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for pynini
  Running setup.py clean for pynini
Successfully built fish-speech
Failed to build pynini
ERROR: Could not build wheels for pynini, which is required to install pyproject.toml-based projects

Screenshots / log
https://imgur.com/m17l6OW

Additional context
before getting this error I had this one:

---------------------------------------- 196.4/196.4 kB ? eta 0:00:00
Collecting pynini==2.1.5
Downloading pynini-2.1.5.tar.gz (627 kB)
  ---------------------------------------- 627.6/627.6 kB 38.6 MB/s eta 0:00:00
Preparing metadata (setup.py) ... error
error: subprocess-exited-with-error

× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [6 lines of output]
   Traceback (most recent call last):
     File "<string>", line 2, in <module>
     File "<pip-setuptools-caller>", line 34, in <module>
     File "C:\Users\Saaa\AppData\Local\Temp\pip-install-t0wvvm_h\pynini_460d91c97a9d40899ca2bd7ffa515beb\setup.py", line 22, in <module>
       from Cython.Build import cythonize
   ModuleNotFoundError: No module named 'Cython'
   [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

[notice] A new release of pip available: 22.2.1 -> 23.3.2
[notice] To update, run: python.exe -m pip install --upgrade pip

So I did this:
py -m pip install --upgrade pip setuptools wheel

Which led me to the current error now.

I tried some solutions such as:

pip install aiohttp
pip install aiohttp==3.9.0b0

These are solutions I found somewhere on Stack Overflow, on a website called weoghnite, and in Microsoft forums.

Maybe I should update my Python? I am using 3.10.6, as seen in the first screenshot.

[Feature] Add ASR utils

Is your feature request related to a problem? Please describe.
No.

Describe the solution you'd like
Integrate Whisper and Fun ASR.

Describe alternatives you've considered
N/A

Additional context
N/A

Examples

Hi,
This project looks very promising! Would you mind providing some English samples?
Thank you!

[Feature] How to accelerate inference

image
Is your feature request related to a problem? Please describe.

Is the inference time normal? Not counting model loading time, is it normal to take more than 30 s to generate speech for 60 English words? How could I accelerate it if I want to bring the inference time down to ~2 s? Thanks!


[BUG] Model output keeps repeating itself, after certain input/max token combination was executed.

Describe the bug
When certain audio and max tokens settings were used to generate the first audio, all subsequent generations will keep using the inputs of the first generation no matter what is given to the model.

To Reproduce
Steps to reproduce the behavior:
See video below:
https://www.awesomescreenshot.com/video/23829885?key=2c741f2c20efe2939f38085c73b3d740

Expected behavior
Subsequent generations should generate what is given as the input, not what is given in the first generation.

Screenshots / log
https://www.awesomescreenshot.com/video/23829885?key=2c741f2c20efe2939f38085c73b3d740

Additional context
Created a PR to fix it. Possibly related to setup_caches; the reason is unknown. Maybe the repo owner can help investigate.

[Feature] llama model size selection

Thank you so much for your open-source project.

Have you considered using llama models of larger sizes?

After completing several fine-tuning tasks, I found that the current Llama model, even after fine-tuning, has difficulty learning how the same word should be read in different contexts.

For example, the word '滚' in the Chinese blessing '财运滚滚' was read as if it were a swear word.
Perhaps this is a limitation of the small-size Llama, or I may need more complex training data for fine-tuning (currently using 20 hours).

Maybe offering different model sizes, as openai-whisper does, would be a good solution.

[Feature] Try Unsloth to optimize Llama fine-tuning performance

Fine-tuning the Llama model currently isn't achievable on a single A100 card machine using the default configuration, as it requires more than 80GB of vRAM. Outdated, see #41 .

It is mentioned that adjusting the configuration or using gradient checkpointing could potentially solve this issue, and I've noticed that you also have implemented FlashAttention to save resources and speed up fine-tuning.

However, Unsloth claims to reduce GPU memory usage by 40% while also providing an over 75% performance gain with its free "Unsloth Open" version, compared to FlashAttention (note that I think they are comparing against FlashAttention version 1, which is outdated). It also has dedicated kernel optimizations for LoRA.

Do you think it's worth giving it a try?

[Feature] Dataset format required for Llama fine-tuning

.
├── SPK1
│   ├── 21.15-26.44.lab
│   ├── 21.15-26.44.mp3
│   ├── 27.51-29.98.lab
│   ├── 27.51-29.98.mp3
│   ├── 30.1-32.71.lab
│   └── 30.1-32.71.mp3
└── SPK2
    ├── 38.79-40.85.lab
    └── 38.79-40.85.mp3

For the "lab" files shown here, if "txt" files are used instead, what should their format look like?
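A hedged sketch, assuming (as the layout above suggests) that each .lab file is a plain-text transcript sharing its basename with an audio file; a .txt-based variant would need to provide the same pairing:

from pathlib import Path

root = Path("SPK1")  # hypothetical path to one of the speaker folders shown above
for lab in sorted(root.glob("*.lab")):
    audio = lab.with_suffix(".mp3")
    print(audio.name, "->", lab.read_text(encoding="utf-8").strip())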

[Feature] Add ms-stft and ms-sb-cqt discriminators

Thanks for a very cool project.

This is the best and simplest LLM-based TTS implementation I have ever seen!

For audio quality, I highly recommend adding the MS-STFT discriminator from EnCodec and the MS-SB-CQT discriminator (https://arxiv.org/abs/2311.14957).

When using the above discriminators, I have observed better audio quality for this kind of autoencoder model.

Thanks!

Intermittent "Invalid argument" error at the inference.py -i step

Environment:

Ubuntu 22.04

PyTorch package versions:

torch                     2.1.2
torchaudio                2.1.2
torchmetrics              1.3.0.post0
torchvision               0.16.2

GPU:

A30

cuda_12.2.r12.2

Driver Version: 535.54.03

Steps to reproduce

In the source-audio stage (-i with a wav file; the lzl1 audio below), the whole pipeline fails and cannot proceed.
In the later generation stage (after other audio files are used as input), -i with an npy file also triggers the error frequently; with slightly longer inference text it happens every time.

# Source audio input stage
python tools/vqgan/inference.py \
    -i "lzl1.wav" \
    --checkpoint-path "checkpoints/vqgan-v1.pth"


# Generation stage
python tools/vqgan/inference.py \
    -i "codes_0.npy" \
    --checkpoint-path "checkpoints/vqgan-v1.pth"

Audio:

17s
16bit,44100Hz,1411 kbps
lzl1.zip

Text:

从小到大,都是别人教你:该做什么,不该做什么,其实,人生,这么复杂,哪里是一句:一份耕耘、一份收获,就可以讲的清楚的呢?遵从内心一次。

Error details

/root/anaconda3/envs/fish-speech/lib/python3.10/site-packages/torch/nn/utils/weight_norm.py:30: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
  warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
2024-01-25 04:19:39.209 | INFO     | __main__:main:51 - Restored model from checkpoint
2024-01-25 04:19:39.210 | INFO     | __main__:main:54 - Processing in-place reconstruction of lzl1.wav
2024-01-25 04:19:40.343 | INFO     | __main__:main:62 - Loaded audio with 17.06 seconds
2024-01-25 04:19:41.554 | INFO     | __main__:main:102 - Generated indices of shape torch.Size([4, 368])
2024-01-25 04:19:42.025 | INFO     | __main__:main:126 - VQ Encoded, indices: torch.Size([4, 1, 368, 1]) equivalent to 21.53 Hz
Traceback (most recent call last):
  File "/path/to/fish-speech/tools/vqgan/inference.py", line 147, in <module>
    main()
  File "/root/anaconda3/envs/fish-speech/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/anaconda3/envs/fish-speech/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
    return func(*args, **kwargs)
  File "/root/anaconda3/envs/fish-speech/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/root/anaconda3/envs/fish-speech/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/root/anaconda3/envs/fish-speech/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/root/anaconda3/envs/fish-speech/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/path/to/fish-speech/tools/vqgan/inference.py", line 135, in main
    fake_audios = model.generator(decoded_mels)
  File "/root/anaconda3/envs/fish-speech/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/fish-speech/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/path/to/fish-speech/fish_speech/models/vqgan/modules/decoder.py", line 76, in forward
    xs += self.resblocks[i * self.num_kernels + j](x)
  File "/root/anaconda3/envs/fish-speech/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/fish-speech/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/path/to/fish-speech/fish_speech/models/vqgan/modules/decoder.py", line 172, in forward
    xt = c1(xt)
  File "/root/anaconda3/envs/fish-speech/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/fish-speech/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/anaconda3/envs/fish-speech/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 310, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/root/anaconda3/envs/fish-speech/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 306, in _conv_forward
    return F.conv1d(input, weight, bias, self.stride,
RuntimeError: Invalid argument

which pytorch version do you use?

python tools/llama/generate.py \
    --text "要转换的文本" \
    --prompt-text "你的参考文本" \
    --prompt-tokens "fake.npy" \
    --checkpoint-path "checkpoints/text2semantic-400m-v0.2-4k.pth" \
    --num-samples 2

Traceback (most recent call last):
  File "xxx/fish-speech/tools/llama/generate.py", line 22, in <module>
    torch._inductor.config.fx_graph_cache = True  # Experimental feature to reduce compilation times, will be on by default in future
  File "xxx/fish-speech/lib/python3.10/site-packages/torch/_dynamo/config_utils.py", line 72, in __setattr__
    raise AttributeError(f"{self.__name__}.{name} does not exist")
AttributeError: torch._inductor.config.fx_graph_cache does not exist

which pytorch version do you use?
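A hedged compatibility sketch (not the repository's code): both torch._inductor.config.fx_graph_cache (line 22 here) and torch._inductor.config.coordinate_descent_tuning (line 20, mentioned in an earlier issue) only exist in newer PyTorch builds, so guarding the assignments avoids this AttributeError on older versions.

import torch._inductor.config as inductor_config

# Only enable the flags that this PyTorch build actually knows about.
for flag in ("coordinate_descent_tuning", "fx_graph_cache"):
    if hasattr(inductor_config, flag):
        setattr(inductor_config, flag, True)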

How much data is needed?

Thank you for the great project! May I ask a couple of questions? 1. How much data is needed to fine-tune the VQGAN and Llama? 2. Do you have any experimental results on the effect of dataset size?
