
funcodec's Introduction




English | 中文 | 日本語

Introduction

ModelScope is built upon the notion of "Model-as-a-Service" (MaaS). It seeks to bring together the most advanced machine learning models from the AI community and to streamline the process of leveraging AI models in real-world applications. The core ModelScope library open-sourced in this repository provides the interfaces and implementations that allow developers to perform model inference, training and evaluation.

In particular, with rich layers of API abstraction, the ModelScope library offers a unified experience for exploring state-of-the-art models spanning domains such as CV, NLP, Speech, Multi-Modality, and Scientific computation. Model contributors from different areas can integrate models into the ModelScope ecosystem through the layered APIs, allowing easy and unified access to their models. Once integrated, model inference, fine-tuning, and evaluation can be done with only a few lines of code. At the same time, flexibility is provided so that different components in a model application can be customized wherever necessary.

Apart from harboring implementations of a wide range of models, the ModelScope library also enables the necessary interactions with ModelScope backend services, particularly with the Model-Hub and Dataset-Hub. Such interactions allow the management of various entities (models and datasets) to be performed seamlessly under the hood, including entity lookup, version control, cache management, and much more.

Models and Online Accessibility

Hundreds of models are publicly available on ModelScope (700+ and counting), covering the latest developments in areas such as NLP, CV, Audio, Multi-modality, and AI for Science. Many of these models represent the state of the art in their respective fields and made their open-source debut on ModelScope. Users can visit ModelScope (modelscope.cn) and experience first-hand how these models perform, with just a few clicks. An immediate developer experience is also possible through the ModelScope Notebook, which is backed by a ready-to-use CPU/GPU development environment in the cloud, only one click away on ModelScope.



Some representative examples include:

LLM:

Multi-Modal:

CV:

Audio:

AI for Science:

Note: Most models on ModelScope are public and can be downloaded without account registration from the ModelScope website (www.modelscope.cn). Please refer to the model-download instructions for downloading models with the API provided by the ModelScope library or with git.
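
If you prefer to fetch a model programmatically, below is a minimal sketch using the library's snapshot_download helper (the model ID is the same word-segmentation model used in the QuickTour below; the exact download API may vary slightly between library versions):

>>> from modelscope.hub.snapshot_download import snapshot_download
>>> # download the model files (or reuse the local cache) and return the local directory
>>> model_dir = snapshot_download('damo/nlp_structbert_word-segmentation_chinese-base')
>>> # model_dir can then be passed as the model argument of pipeline(...)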

QuickTour

We provide a unified interface for inference using pipeline, and for fine-tuning and evaluation using Trainer, across different tasks.

For any given task with any type of input (image, text, audio, video...), an inference pipeline can be implemented with only a few lines of code; it automatically loads the underlying model and returns the inference result, as exemplified below:

>>> from modelscope.pipelines import pipeline
>>> word_segmentation = pipeline('word-segmentation', model='damo/nlp_structbert_word-segmentation_chinese-base')
>>> word_segmentation('今天天气不错,适合出去游玩')
{'output': '今天 天气 不错 , 适合 出去 游玩'}

Given an image, portrait matting (a.k.a. background removal) can be accomplished with the following code snippet:

(input image)

>>> import cv2
>>> from modelscope.pipelines import pipeline

>>> portrait_matting = pipeline('portrait-matting')
>>> result = portrait_matting('https://modelscope.oss-cn-beijing.aliyuncs.com/test/images/image_matting.png')
>>> cv2.imwrite('result.png', result['output_img'])

The output image with the background removed is: (output image)

Fine-tuning and evaluation can also be done with a few more lines of code to set up the training dataset and trainer, with the heavy lifting of training and evaluating a model encapsulated in the trainer.train() and trainer.evaluate() interfaces.

For example, the GPT-3 base model (1.3B) can be fine-tuned with the chinese-poetry-collection dataset, resulting in a model that can be used for Chinese poetry generation.

>>> from modelscope.metainfo import Trainers
>>> from modelscope.msdatasets import MsDataset
>>> from modelscope.trainers import build_trainer

>>> train_dataset = MsDataset.load('chinese-poetry-collection', split='train').remap_columns({'text1': 'src_txt'})
>>> eval_dataset = MsDataset.load('chinese-poetry-collection', split='test').remap_columns({'text1': 'src_txt'})
>>> max_epochs = 10
>>> tmp_dir = './gpt3_poetry'

>>> kwargs = dict(
     model='damo/nlp_gpt3_text-generation_1.3B',
     train_dataset=train_dataset,
     eval_dataset=eval_dataset,
     max_epochs=max_epochs,
     work_dir=tmp_dir)

>>> trainer = build_trainer(name=Trainers.gpt3_trainer, default_args=kwargs)
>>> trainer.train()
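
After training, the held-out split can be scored with the same trainer. Below is a minimal sketch; as noted above, the heavy lifting is encapsulated in trainer.evaluate(), and the exact metrics returned depend on the task configuration:

>>> # evaluate the fine-tuned model on eval_dataset and return the metric values
>>> metrics = trainer.evaluate()
>>> print(metrics)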

Why should I use the ModelScope library?

  1. A unified and concise user interface is abstracted for different tasks and different models. Model inference and training can be implemented with as few as 3 and 10 lines of code, respectively, making it convenient for users to explore models across different fields in the ModelScope community. All models integrated into ModelScope are ready to use, which makes it easy to get started with AI in both educational and industrial settings.

  2. ModelScope offers a model-centric development and application experience. It streamlines support for model training, inference, export and deployment, and helps users build their own MLOps based on the ModelScope ecosystem.

  3. The model inference and training processes follow a modular design, and a wealth of functional module implementations is provided, making it convenient for users to customize their own inference, training and other workflows.

  4. For distributed training, especially of large models, rich strategy support is provided, including data parallelism, model parallelism, hybrid parallelism and so on.

Installation

Docker

The ModelScope library currently supports popular deep-learning frameworks for model training and inference, including PyTorch, TensorFlow and ONNX. All releases are tested on and run with Python 3.7+, PyTorch 1.8+, and TensorFlow 1.15 or TensorFlow 2.0+.

To allow out-of-the-box usage of all the models on ModelScope, official Docker images are provided for all releases. Based on these images, developers can skip environment installation and configuration entirely. Currently, the latest CPU and GPU images can be obtained from:

CPU docker image

# py37
registry.cn-hangzhou.aliyuncs.com/modelscope-repo/modelscope:ubuntu20.04-py37-torch1.11.0-tf1.15.5-1.6.1

# py38
registry.cn-hangzhou.aliyuncs.com/modelscope-repo/modelscope:ubuntu20.04-py38-torch2.0.1-tf2.13.0-1.9.5

GPU docker image

# py37
registry.cn-hangzhou.aliyuncs.com/modelscope-repo/modelscope:ubuntu20.04-cuda11.3.0-py37-torch1.11.0-tf1.15.5-1.6.1

# py38
registry.cn-hangzhou.aliyuncs.com/modelscope-repo/modelscope:ubuntu20.04-cuda11.8.0-py38-torch2.0.1-tf2.13.0-1.9.5

Setup Local Python Environment

You can also set up a local ModelScope environment using pip and conda. ModelScope supports Python 3.7 and above. We suggest Anaconda for creating the local Python environment:

conda create -n modelscope python=3.8
conda activate modelscope

PyTorch or TensorFlow can be installed separately according to each model's requirements.

  • Install pytorch doc
  • Install tensorflow doc
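
Once the framework is installed, a quick sanity check of the versions against the requirements above can look like the following (a minimal sketch; import only the framework you actually installed):

>>> import torch
>>> torch.__version__        # expect 1.8 or higher
>>> import tensorflow as tf
>>> tf.__version__           # expect 1.15 or 2.x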

After installing the necessary machine-learning framework, you can install the modelscope library as follows:

If you only want to play around with the modelscope framework, or try out model/dataset downloading, you can install the core modelscope components:

pip install modelscope

If you want to use multi-modal models:

pip install modelscope[multi-modal]

If you want to use nlp models:

pip install modelscope[nlp] -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html

If you want to use cv models:

pip install modelscope[cv] -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html

If you want to use audio models:

pip install modelscope[audio] -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html

If you want to use science models:

pip install modelscope[science] -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html

Notes:

  1. Currently, some audio-task models only support Linux environments with Python 3.7 and TensorFlow 1.15.4. Most other models can be installed and used on Windows and macOS (x86).

  2. Some models in the audio field use the third-party library SoundFile for WAV file processing. On Linux, users need to manually install libsndfile, the system dependency of SoundFile (see the SoundFile documentation). On Windows and macOS, it is installed automatically without user intervention. For example, on Ubuntu you can use the following commands:

    sudo apt-get update
    sudo apt-get install libsndfile1
  3. Some models in computer vision need mmcv-full. You can refer to the mmcv installation guide; a minimal installation is as follows:

    pip uninstall mmcv # if you have installed mmcv, uninstall it
    pip install -U openmim
    mim install mmcv-full

Learn More

We provide additional documentation, including:

License

This project is licensed under the Apache License (Version 2.0).

funcodec's People

Contributors

eltociear, zhihaodu


funcodec's Issues

Required features in Jan. 2024

Hi all. I'm collecting the requested features that will be considered for implementation in Jan. 2024. Please let me know your concerns and feel free to comment below. Thanks, and let's make FunCodec better!

Difference between Encodec and Funcodec

Hi,

first of all, thank you for making this toolkit publicly available.

I have a question regarding the difference between Encodec and Funcodec in your paper:
In Table 3, you list Encodec and Funcodec as different models. Initially I thought Funcodec refers to the frequency-domain model. However, on your demo page, the models "FunCodec" and "FunCodec-2x" are time-domain models, and I was unable to find a difference to the Encodec architecture (besides the training data and the increased stride for the 2x model).

I am probably missing something and would be grateful if you could clarify this.

How does it achieve zero-shot TTS?

Hi author, thanks for sharing this creative project.
When I read the paper and code, I found that speaker labels are not needed when training LauraTTS. The code agrees: dataset.py and the other data-loading .py files show that training relies only on wav.scp and phoneme.list, and the training data does not need to be spliced. So I wonder whether FunCodec and LauraTTS really support zero-shot TTS? If my guess is wrong, thanks for your explanation :)

TypeError: 'NoneType' object is not callable

I didn't run run.sh; I ran codec_train.py directly, because I wanted to understand the architecture of the whole program. But I ran into this problem. Do you know the reason?

Traceback (most recent call last):
File "E:\00\FunCodec-master\funcodec\bin\codec_train.py", line 48, in
main(args=args)
File "E:\00\FunCodec-master\funcodec\bin\codec_train.py", line 23, in main
GANSpeechCodecTask.main(args=args, cmd=cmd)
File "E:\00\FunCodec-master\funcodec\tasks\abs_task.py", line 1130, in main
cls.main_worker(args)
File "E:\00\FunCodec-master\funcodec\tasks\abs_task.py", line 1239, in main_worker
model = cls.build_model(args=args)
File "E:\00\FunCodec-master\funcodec\tasks\gan_speech_codec.py", line 310, in build_model
frontend = frontend_class(**args.frontend_conf)
TypeError: 'NoneType' object is not callable

How can I run inference with only 4 or 8 quantizer layers of FunCodec?

Hello, I would like to test FunCodec's performance with a 4-layer or 8-layer codec. I tried setting the corresponding hyperparameter in the config file directly to 8, but after that change only a single layer of codec tokens is generated. I also tried, in ddp_core_vq.py at inference time, forcing the last 24 of the 32 codec layers to zero, but the audio reconstructed this way is of very poor quality. What is the correct way to do this?

Stage 3

/home/rian0532/anaconda3/envs/py39Ubuntu/lib/python3.9/site-packages/torch/nn/modules/conv.py:306: UserWarning: Applied workaround for CuDNN issue, install nvrtc.so (Triggered internally at ../aten/src/ATen/native/cudnn/Conv_v8.cpp:80.)
return F.conv1d(input, weight, bias, self.stride,
[DESKTOP-PQV8NDO] 2024-04-16 14:56:40,650 (codec_basic:648) INFO: Will update discriminator: forward_step=0, disc_loss=2.0000, gen_loss=0.0000
Traceback (most recent call last):
File "/home/rian0532/anaconda3/envs/py39Ubuntu/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/rian0532/anaconda3/envs/py39Ubuntu/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/mnt/e/000/FunCodec-master/funcodec/bin/codec_train.py", line 48, in
main(args=args)
File "/mnt/e/000/FunCodec-master/funcodec/bin/codec_train.py", line 23, in main
GANSpeechCodecTask.main(args=args, cmd=cmd)
File "/mnt/e/000/FunCodec-master/funcodec/tasks/abs_task.py", line 1130, in main
cls.main_worker(args)
File "/mnt/e/000/FunCodec-master/funcodec/tasks/abs_task.py", line 1431, in main_worker
cls.trainer.run(
File "/mnt/e/000/FunCodec-master/funcodec/train/trainer.py", line 308, in run
all_steps_are_invalid, max_update_stop = cls.train_one_epoch(
File "/mnt/e/000/FunCodec-master/funcodec/train/gan_trainer.py", line 185, in train_one_epoch
retval = model(turn == "generator", batch)
File "/home/rian0532/anaconda3/envs/py39Ubuntu/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/rian0532/anaconda3/envs/py39Ubuntu/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/mnt/e/000/FunCodec-master/funcodec/models/codec_basic.py", line 324, in forward
return self._forward_generator(
File "/mnt/e/000/FunCodec-master/funcodec/models/codec_basic.py", line 528, in _forward_generator
orig_mel, recon_mel = map(mel_transform, (orig_speech, recon_speech))
File "/home/rian0532/anaconda3/envs/py39Ubuntu/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/rian0532/anaconda3/envs/py39Ubuntu/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/mnt/e/000/FunCodec-master/funcodec/models/codec_basic.py", line 66, in forward
mel_output = torch.matmul(self.mel_basis, power_spec)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (64x513 and 4x513)

Inconsistency in Encode Results with Different Batch Sizes

I have noticed that when using different batch sizes for the encode inference, the same data yields different results. Specifically, changing the batch_size parameter seems to affect the outcome even when the input data remains consistent.
I am unsure if this behavior is expected or indicative of a bug. It would be greatly appreciated if you could provide some insights or guidance on this matter. Understanding the expected behavior when varying batch sizes would be crucial for my continued use and trust in the tool's reliability.
Thank you for your attention to this matter and for your continued support of the community with Funcodec.

Discriminator loss?

(screenshot of a loss curve)
As far as I know, shouldn't the loss in the figure above be the generator loss? What does the discriminator loss look like?

zipfile.BadZipFile: File is not a zip file

Issue Description: I followed the instructions in the README.md step by step, until I encountered the following problem when executing the command described under “Use LauraTTS to synthesize speech”:
bash demo.sh --stage 1 --model_name ${model_name} --output_dir results --text "nothing was to be done but to put about, and return in disappointment towards the north."

I encountered the following:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/miniconda3/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/root/FunCodec/funcodec/bin/text2audio_inference.py", line 561, in <module>
    main()
  File "/root/FunCodec/funcodec/bin/text2audio_inference.py", line 557, in main
    inference(**kwargs)
  File "/root/FunCodec/funcodec/bin/text2audio_inference.py", line 381, in inference
    inference_pipeline = inference_func(
  File "/root/FunCodec/funcodec/bin/text2audio_inference.py", line 287, in inference_func
    my_model = Text2Audio.from_pretrained(
  File "/root/FunCodec/funcodec/bin/text2audio_inference.py", line 227, in from_pretrained
    return Text2Audio(**kwargs)
  File "/root/FunCodec/funcodec/bin/text2audio_inference.py", line 75, in __init__
    from funcodec.text.phoneme_tokenizer import G2p_en
  File "/root/FunCodec/funcodec/text/phoneme_tokenizer.py", line 10, in <module>
    import g2p_en
  File "/root/miniconda3/lib/python3.8/site-packages/g2p_en/__init__.py", line 1, in <module>
    from .g2p import G2p
  File "/root/miniconda3/lib/python3.8/site-packages/g2p_en/g2p.py", line 26, in <module>
    nltk.data.find('corpora/cmudict.zip')
  File "/root/miniconda3/lib/python3.8/site-packages/nltk/data.py", line 542, in find
    return ZipFilePathPointer(p, zipentry)
  File "/root/miniconda3/lib/python3.8/site-packages/nltk/compat.py", line 41, in _decorator
    return init_func(*args, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/nltk/data.py", line 394, in __init__
    zipfile = OpenOnDemandZipFile(os.path.abspath(zipfile))
  File "/root/miniconda3/lib/python3.8/site-packages/nltk/compat.py", line 41, in _decorator
    return init_func(*args, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/nltk/data.py", line 935, in __init__
    zipfile.ZipFile.__init__(self, filename)
  File "/root/miniconda3/lib/python3.8/zipfile.py", line 1269, in __init__
    self._RealGetContents()
  File "/root/miniconda3/lib/python3.8/zipfile.py", line 1336, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file

Question: I have already downloaded the two relevant models, but encountered this error while running speech synthesis. I am eager to try out your project and would greatly appreciate your guidance or suggestions for resolving this issue.

Relation between bitrate and token ratio

Hi,

reading your paper, it was unclear to me how exactly the token ratio (TKR) relates to the bitrate.
Initially, I thought this meant the number of frames per second at 16kHz, where 1 codebook index would be generated per frame. But then I realized this can't be right because in Table 3, different TKRs are shown for the same stride.

Could you further explain the relation between TKR and bitrate, maybe with an example, e.g. for one of the FreqCodec models?

ERROR Generating with prompt text and prompt audio

Hi, thank you for sharing FunCodec, this is really awesome work!

I ran into the following issue when trying to generate audio using my own prompt audio and prompt text. Please let me know what the nature of this error is and how it can be fixed. Thank you very much!

File "/home/____/FunCodec/funcodec/bin/text2audio_inference.py", line 617, in <module>
    main()
  File "/home/____/FunCodec/funcodec/bin/text2audio_inference.py", line 613, in main
    inference(**kwargs)
  File "/home/____/FunCodec/funcodec/bin/text2audio_inference.py", line 454, in inference
    return inference_pipeline(data_path_and_name_and_type, raw_inputs=kwargs.get("raw_inputs", None))
  File "/home/____/FunCodec/funcodec/bin/text2audio_inference.py", line 400, in _forward
    ret_val, _ = my_model(*model_inputs)
  File "/home/____/anaconda3/envs/funcodec2/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/____/FunCodec/funcodec/bin/text2audio_inference.py", line 218, in __call__
    gen_speech = self.model.syn_audio(
  File "/home/____/FunCodec/funcodec/models/audio_generation/laura_model.py", line 565, in syn_audio
    _, _, recon_wav, _ = codec_model(codec_emb[:, continual_length:], run_mod="decode_emb")
  File "/home/____/anaconda3/envs/funcodec2/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/____/FunCodec/funcodec/bin/codec_inference.py", line 119, in __call__
    ret_dict = self.model.inference_decoding_emb(*batch)
  File "/home/____/FunCodec/funcodec/models/codec_basic.py", line 829, in inference_decoding_emb
    recon_speech = self._decode(codes)
  File "/home/____/FunCodec/funcodec/models/codec_basic.py", line 390, in _decode
    return self._decode_frame(encoded_frames[0])
  File "/home/____/FunCodec/funcodec/models/codec_basic.py", line 401, in _decode_frame
    out = self.decoder(emb)
  File "/home/____/anaconda3/envs/funcodec2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/____/anaconda3/envs/funcodec2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/____/FunCodec/funcodec/models/decoder/seanet_decoder.py", line 179, in forward
    y = self.model(z.permute(0, 2, 1))
  File "/home/____/anaconda3/envs/funcodec2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/____/anaconda3/envs/funcodec2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/____/anaconda3/envs/funcodec2/lib/python3.8/site-packages/torch/nn/modules/container.py", line 217, in forward
    input = module(input)
  File "/home/____/anaconda3/envs/funcodec2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/____/anaconda3/envs/funcodec2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/____/FunCodec/funcodec/modules/normed_modules/conv.py", line 259, in forward
    x = self.conv(x)
  File "/home/____/anaconda3/envs/funcodec2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/____/anaconda3/envs/funcodec2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/____/FunCodec/funcodec/modules/normed_modules/conv.py", line 157, in forward
    x = self.conv(x)
  File "/home/____/anaconda3/envs/funcodec2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/____/anaconda3/envs/funcodec2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/____/anaconda3/envs/funcodec2/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 310, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/____/anaconda3/envs/funcodec2/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 306, in _conv_forward
    return F.conv1d(input, weight, bias, self.stride,
RuntimeError: Calculated padded input size per channel: (6). Kernel size: (7). Kernel size can't be greater than actual input size

[bug] The codec.txt generated in the encoding stage cannot be read directly?

Following the provided encoding_decoding.sh script, the encoding stage generates a codec.txt file.

The file has the form:
utts_id <space> json.dumps(codecs)

This format cannot be read directly by read_text.py; the load_jsonl_trans_int function needs to be rewritten as follows:

def load_jsonl_trans_int(path: Union[Path, str]) -> Dict[str, np.ndarray]:
    d = read_2column_text(path)
    retval = {}
    for k, v in d.items():
        try:
            value = json.loads(v)
            if isinstance(value, dict):
                retval[k] = np.array(value["trans"], dtype=int)
            elif isinstance(value, list):
                retval[k] = np.array(value, dtype=int)
            else:
                raise TypeError
        except TypeError:
            logging.error(f'Error happened with path="{path}", id="{k}", value="{v}"')
            raise
    return retval

Stage 1 can only be run on one gpu card 0

When I run stage 1

bash encoding_decoding.sh --stage 1 --batch_size 16 --num_workers 4 --gpu_devices "0" \
  --model_dir exp/${model_name} --bit_width 16000 \
  --wav_scp input_wav.scp  --out_dir outputs/codecs/

It seems that gpu_devices can only be set to 0. Any number or array other than 0 gives a CUDA ordinal error.

Getting error while testing LauraTTS

Hi @ZhihaoDU, while running LauraTTS from the egs/LibriTTS/text2speech_laura README I am getting the errors described below.

When using ModelScope, I get the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/rishikesh/.local/lib/python3.10/site-packages/modelscope/pipelines/builder.py", line 170, in pipeline
    return build_pipeline(cfg, task_name=task)
  File "/home/rishikesh/.local/lib/python3.10/site-packages/modelscope/pipelines/builder.py", line 65, in build_pipeline
    return build_from_cfg(
  File "/home/rishikesh/.local/lib/python3.10/site-packages/modelscope/utils/registry.py", line 198, in build_from_cfg
    raise KeyError(
KeyError: 'laura-codec-tts-inference is not in the pipelines registry group text-to-speech. Please make sure the correct version of ModelScope library is used.'

While running the bash command, I get the following error:

File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/rishikesh/code/FunCodec/funcodec/bin/text2audio_inference.py", line 561, in <module>
    main()
  File "/home/rishikesh/code/FunCodec/funcodec/bin/text2audio_inference.py", line 557, in main
    inference(**kwargs)
  File "/home/rishikesh/code/FunCodec/funcodec/bin/text2audio_inference.py", line 381, in inference
    inference_pipeline = inference_func(
  File "/home/rishikesh/code/FunCodec/funcodec/bin/text2audio_inference.py", line 287, in inference_func
    my_model = Text2Audio.from_pretrained(
  File "/home/rishikesh/code/FunCodec/funcodec/bin/text2audio_inference.py", line 227, in from_pretrained
    return Text2Audio(**kwargs)
  File "/home/rishikesh/code/FunCodec/funcodec/bin/text2audio_inference.py", line 53, in __init__
    model, model_args = Text2AudioGenTask.build_model_from_file(
  File "/home/rishikesh/code/FunCodec/funcodec/tasks/abs_task.py", line 1928, in build_model_from_file
    model = cls.build_model(args)
  File "/home/rishikesh/code/FunCodec/funcodec/tasks/text2audio_generation.py", line 206, in build_model
    if args.text_encoder is not None:
AttributeError: 'Namespace' object has no attribute 'text_encoder'

FYI: the bash issue is resolved.

Inquiry about Future Plans for Funcodec with Fewer nq Options

I hope this message finds you well. I am reaching out to commend the exceptional work on Funcodec; it has proven to be a remarkable asset in the community. Currently, I notice that all the available checkpoints are for 32 nq. I am curious to know if there are any plans to release versions with fewer nqs, such as 8 or 12, in the future.
Additionally, I would be interested to learn if there have been any experiments or considerations regarding the impact of a higher number of nqs (like 32) on models similar to valle and whether it affects their performance or efficiency. Your insights on these matters would be greatly appreciated.

Thank you for your dedication to advancing this field. I look forward to your response.

Best regards,

Questions about training from scratch

Hello, I followed the steps in run.sh to train on the LibriTTS-R dataset. Below are the training loss curves. When I use the current checkpoint to synthesize speech, the output is almost pure noise. Based on these losses, does the training appear normal? Thank you!

(training curves: loss, nll_loss, reg_l1_loss, reg_l2_loss)

How should the transformer-based LMModel in the Encodec model be trained and applied?

Hello, I would like to ask how the optional LM model after the quantizer, described in the original Encodec paper, should be trained and applied. Does it require a series of settings in the Encodec model's config file, or does a new train script need to be written? If a new train script is needed, how should it be written? Looking forward to your answer!

run.sh: 34: utils/parse_options.sh: Syntax error: Bad for loop variable

In the text2speech_laura folder, when I run sh run.sh to start model training, I get the error shown in the title. Do I need to add some parameters? Also, where in the project is the config conf/encodec_lstm_16k_n32_600k_step_rmseg_use_power_ds640.yaml from the downloaded pretrained model folder audio_codec-encodec-zh_en-general-16k-nq32ds640-pytorch? I could not find it in the project, and I would like to inspect and modify the model structure. Thanks.

How to check progress?

(screenshot of the training log)
Hi @ZhihaoDU
Stage 3 has started training, but where can I see a progress bar?
I am on a 4080 card and have not changed the training parameters, but I don't think I can keep the same settings as your A800 card, such as batch_size and num_workers. I usually watch the progress bar and then pick an appropriate batch_size, but I cannot do that now.

Is the audio generated by LauraTTS inference just noise?

model_name="speech_synthesizer-laura-en-libritts-16k-codec_nq2-pytorch"
bash demo.sh --stage 1 --model_name ${model_name} --output_dir results --text "nothing was to be done but to put about, and return in disappointment towards the north."

utt1_gen.mp4
utt1_gen_only_lm.mp4

Training Funcodec: Data Sources and Recommendations for Starting From Scratch

In your Funcodec paper, you mentioned that you used 25k hours of data for training the codec. Does this data include open-source datasets like Gigaspeech and WenetSpeech?
If we want to train Funcodec from scratch, do you have any suggestions?
Is it better to use more clean data without background noise or more data with noise?

Low-complexity FreqCodec requires a lot of VRAM

Hi again,

I'm currently trying to retrain FreqCodec models using the configurations released by you (audio_codec-freqcodec_magphase-en-libritts-16k-gr8nq32ds320-pytorch and audio_codec-freqcodec_magphase-en-libritts-16k-gr1nq32ds320-pytorch).
Using an A100 (40GB) GPU, I am able to train the larger model (4.50M params, 2.18 GFlops) without issues at batch size 32.
However, the smaller model (0.52M params, 0.34 GFlops) causes a CUDA OOM error at batch size 32 (and also at 24).

As far as I can tell, in terms of architecture differences, the smaller model uses more groups in the depthwise convolutions, 3 residual layers (instead of 1), and a dilation base of 3 (instead of 2).
I thought depthwise convolutions with more groups would reduce the required memory rather than increase it. Is this a misconception, or do you have any idea what the reason for this could be?

LauraTTS: _pickle.UnpicklingError: invalid load key, 'v'.

Environment

  • PyTorch version: 1.12.0
  • Python version: 3.8

Issue Description

I believe I have correctly installed the required PyTorch version as per the README instructions and have also executed pip install --editable ./ to install the necessary requirements. However, while trying to run the "Use LauraTTS to synthesize speech" example, executing the following command:
bash demo.sh --stage 1 --model_name ${model_name} --output_dir results --text "nothing was to be done but to put about, and return in disappointment towards the north."
I encountered the following error:

Traceback (most recent call last):
  File "/root/miniconda3/envs/lg/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/miniconda3/envs/lg/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/root/autodl-tmp/FunCodec/funcodec/bin/text2audio_inference.py", line 561, in <module>
    main()
  File "/root/autodl-tmp/FunCodec/funcodec/bin/text2audio_inference.py", line 557, in main
    inference(**kwargs)
  File "/root/autodl-tmp/FunCodec/funcodec/bin/text2audio_inference.py", line 381, in inference
    inference_pipeline = inference_func(
  File "/root/autodl-tmp/FunCodec/funcodec/bin/text2audio_inference.py", line 287, in inference_func
    my_model = Text2Audio.from_pretrained(
  File "/root/autodl-tmp/FunCodec/funcodec/bin/text2audio_inference.py", line 227, in from_pretrained
    return Text2Audio(**kwargs)
  File "/root/autodl-tmp/FunCodec/funcodec/bin/text2audio_inference.py", line 53, in __init__
    model, model_args = Text2AudioGenTask.build_model_from_file(
  File "/root/autodl-tmp/FunCodec/funcodec/tasks/abs_task.py", line 1941, in build_model_from_file
    src_state = torch.load(model_file, map_location=device)
  File "/root/miniconda3/envs/lg/lib/python3.8/site-packages/torch/serialization.py", line 713, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/root/miniconda3/envs/lg/lib/python3.8/site-packages/torch/serialization.py", line 920, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: invalid load key, 'v'.

Question:

Is there an issue with how I am using the program? I am eager to experience your project and would greatly appreciate your guidance or suggestions for resolving this issue.

About model details

I want to know how the input shape changes from 257 to 256, thanks.
Also, it seems that the config in this repo differs from the one shipped with the pretrained models?

(screenshot)

TKR?

(screenshot of the table)
The first row of this table shows 400, 200, ..., 50 TKR. I think this is the sampling rate divided by the stride and then multiplied by the number of tokens, right? For example: 16000 / 320 * 8 = 400 TKR.
I guess the number of tokens per frame in each of the first four rows of the table is [8, 4, 2, 1], with the same 16000 Hz sampling rate.
But in the last two rows, do you get the same TKR by changing the sampling rate, or by changing the number of tokens?

Inconsistent waveform amplitude when decoding with FunCodec

Thank you for sharing. When I try FunCodec decoding and ignore the scale output, the reconstructed waveform structure is similar but the amplitude differs significantly. How should I modify the code?

speech2token = Speech2Token("egs/LibriTTS/codec/exp/audio_codec-encodec-zh_en-general-16k-nq32ds320-pytorch/config.yaml", "egs/LibriTTS/codec/exp/audio_codec-encodec-zh_en-general-16k-nq32ds320-pytorch/model.pth", sampling_rate=16000)

audio, rate = librosa.load("egs/LibriTTS/codec/test_wav/BAC009S0002W0122.wav", sr=16000) 
audio_32 = np.reshape(audio, (1,1,-1))
output = speech2token(audio_32, bit_width=16000, run_mod="encode")
tokens = output[0][0]
tokens_t = tokens.permute(1, 2, 0)
audio_re = speech2token(tokens_t, bit_width=16000, run_mod="decode")
(waveform comparison screenshot)

Fails to train on multiple GPUs

Hi,

I can train the codec model on a single GPU, but I cannot train it in multi-GPU mode. The log is:

The environment is:

CUDA Version: 12.2 
alias-free-torch         0.0.6
pytorch-wpe              0.0.1
torch                    1.13.1
torch-complex            0.4.3
torchaudio               0.13.1
torchvision              0.14.1

The main log is:

./run_freqcodec.sh: gpu_num: 2
stage 3: Training
log can be found at ./exp/freqcodec_mag_phase_16k_n32_600k_step_ds640/log/train.log.0

The detailed log is:

-rw-rw-r-- 1 test test   0 Jan 18 11:07 train.log.0
-rw-rw-r-- 1 test test 884 Jan 18 11:07 train.log.1

cat train.log.1
Traceback (most recent call last):
  File "/home/test/python_env/anaconda3/envs/fun_codec/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/test/python_env/anaconda3/envs/fun_codec/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/test/code/enhance/FunCodec/funcodec/bin/codec_train.py", line 32, in <module>
    torch.cuda.set_device(args.gpu_id)
  File "/home/test/python_env/anaconda3/envs/fun_codec/lib/python3.8/site-packages/torch/cuda/__init__.py", line 326, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Chinese raw-text input is not supported

bash demo.sh --stage 2 --model_name ${model_name} --output_dir results --text "你好" \
  --prompt_text "one of these is context" --prompt_audio "demo/8230_279154_000013_000003.wav"
This is not supported.
self.phoneme_tokenizer uses g2p_en to convert English words into phonemes; does the model support Chinese inputs?

Error when running run.sh at stage 4

The error message is as follows:
run.pl: job failed, log is in /mnt/d/Work/FunCodec/egs/LibriTTS/text2speech_laura/dump/libritts/test-other/codecs//logdir/inference.1.log
cat: '/mnt/d/Work/FunCodec/egs/LibriTTS/text2speech_laura/dump/libritts/test-other/codecs//logdir/output./codecs.txt': No such file or directory
Codes are saved to /mnt/d/Work/FunCodec/egs/LibriTTS/text2speech_laura/dump/libritts/test-other/codecs//logdir/output.
/codecs.txt and collected to /mnt/d/Work/FunCodec/egs/LibriTTS/text2speech_laura/dump/libritts/test-other/codecs//codecs.txt.
codec scp files are collected into /mnt/d/Work/FunCodec/egs/LibriTTS/text2speech_laura/dump/libritts/test-other/codec_token.scp

From the log it looks like a path error, but the previous steps (stage 1 to 3) ran without errors:

2024-04-15 15:52:57,508 (codec_inference:233) INFO: param_dict: None
Traceback (most recent call last):
File "/home/zz/anaconda3/envs/Laura/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/zz/anaconda3/envs/Laura/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/zz/work/FunCodec/funcodec/bin/codec_inference.py", line 584, in
main()
File "/home/zz/work/FunCodec/funcodec/bin/codec_inference.py", line 580, in main
inference(**kwargs)
File "/home/zz/work/FunCodec/funcodec/bin/codec_inference.py", line 425, in inference
return inference_pipeline(data_path_and_name_and_type, raw_inputs=None)
File "/home/zz/work/FunCodec/funcodec/bin/codec_inference.py", line 313, in _forward
for keys, batch in loader:
File "/home/zz/anaconda3/envs/Laura/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 652, in next
data = self._next_data()
File "/home/zz/anaconda3/envs/Laura/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1347, in _next_data
return self._process_data(data)
File "/home/zz/anaconda3/envs/Laura/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1373, in _process_data
data.reraise()
File "/home/zz/anaconda3/envs/Laura/lib/python3.9/site-packages/torch/_utils.py", line 461, in reraise
raise exception
FileNotFoundError: Caught FileNotFoundError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/home/zz/anaconda3/envs/Laura/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
data = fetcher.fetch(index)
File "/home/zz/anaconda3/envs/Laura/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 32, in fetch
data.append(next(self.dataset_iter))
File "/home/zz/work/FunCodec/funcodec/datasets/iterable_dataset.py", line 260, in iter
array = func(value)
File "/home/zz/work/FunCodec/funcodec/datasets/iterable_dataset.py", line 28, in load_kaldi
retval = kaldiio.load_mat(input)
File "/home/zz/anaconda3/envs/Laura/lib/python3.9/site-packages/kaldiio/matio.py", line 239, in load_mat
with open_like_kaldi(ark, "rb") as fd:
File "/home/zz/anaconda3/envs/Laura/lib/python3.9/site-packages/kaldiio/utils.py", line 207, in open_like_kaldi
return io.open(name, mode, encoding=encoding)
FileNotFoundError: [Errno 2] No such file or directory: 'dump/libritts/train/arks/wav.00.ark'
