
academicodec's People

Contributors

babysor, liusongxiang, rishikksh20, rongjiehuang, yangdongchao, yt605155624, ywk991112, zhaomingwork


academicodec's Issues

Uncommenting OMP_NUM_THREADS may speed up Encodec training

Uncommenting the OMP_NUM_THREADS code in launch.py can speed up training and improve GPU utilization, because by default all CPU cores are used (on machines with many cores, such as A100 hosts), and the interaction between many cores can be costly. If 1 feels too small, a larger value can additionally be set ahead of train.sh (e.g. 8). Training on LibriTTS has not been tested yet.

# if "OMP_NUM_THREADS" not in os.environ:

also see yangdongchao/SoundStorm#34

This has not yet been verified in this repository.
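
A minimal sketch of the uncommented guard in launch.py (the surrounding code is paraphrased here, not copied from the repo):

    import os

    # Default to a single OpenMP thread per process unless the user overrides it,
    # so DDP workers do not oversubscribe the CPU cores.
    if "OMP_NUM_THREADS" not in os.environ:
        os.environ["OMP_NUM_THREADS"] = "1"

A larger value can then be set ahead of the launcher, e.g. OMP_NUM_THREADS=8 bash train.sh.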

Missing json

The config.json for HiFi-Codec is missing:

FileNotFoundError: [Errno 2] No such file or directory: 'logs/config.json'

--config_path ${log_root}/config.json

Some minor bugs inside HiFi-Codec code

While analyzing the new HiFi-Codec code, I encountered three small bugs:

  1. Torchaudio MelSpectrogram:
    Here:
    melspec = MelSpectrogram(sample_rate=24000, n_fft=s, hop_length=s//4, n_mels=64, wkwargs={"device": device}).to(device)

    MelSpectrogram is not imported before use; it needs:
    from torchaudio.transforms import MelSpectrogram
  2. Modules not present inside HiFi-Codec:
    Here:

    from modules import NormConv2d

    There is no modules package inside the HiFi-Codec folder, so it needs to be copied in, or the reference changed to another model's modules implementation.

  3. Shape of the input tensor x, here:

    c = self.encoder(x.unsqueeze(1))

    In my testing with a 24 kHz mono-channel wav, the shape of x before line 33 comes out as [Batch, Samples, 1], and after the .unsqueeze(1) operation at line 33 it becomes [Batch, 1, Samples, 1], a 4D tensor where a 3D tensor is expected. So the shape of x needs to be checked before line 33: if it has 3 dimensions and the last dimension is 1, the last dimension should be squeezed, as in the sketch below.
    After correcting the shape of x, the code works without error and I get the desired output.
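
A minimal sketch of the guard described above (x and self.encoder come from the issue's context; this is not standalone code):

    # Drop a trailing singleton channel axis so x is [batch, samples]
    # before unsqueeze(1) adds the channel dimension the encoder expects.
    if x.dim() == 3 and x.shape[-1] == 1:
        x = x.squeeze(-1)                 # [B, T, 1] -> [B, T]
    c = self.encoder(x.unsqueeze(1))      # [B, 1, T], a 3D tensor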

Thanks @yangdongchao .

training loss

Hello, what is the final training loss of the Encodec_24k_240d model?

The VQVAE class of HiFi-Codec has no decode function

The VQVAE class of HiFi-Codec has no decode function. Does calling VQVAE's forward function directly stand in for the decode process? Because the logic I see in

syn = self.vqvae(vq_codes)

works that way.

Hmm, I may have misunderstood. Given the way

return acoustic_tokens

is written, is decode just a call to self.generator? Since acoustic_tokens should be the post-quantization result, the vq_code in
https://github.com/yangdongchao/AcademiCodec/blob/d03142b05be6d1023080cb42416f0c4b227e5342/HiFi-Codec-24k-320d/vqvae_tester.py#LL31C1-L31C1
is not acoustic_tokens; it needs to pass through self.quantizer.embed() first to become acoustic_tokens. Is this understanding correct? (A sketch of this reading follows below.)
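
A sketch of the decode path this question hypothesizes (the composition is the asker's reading, not a confirmed API; quantizer and generator stand in for the repo's modules):

    import torch
    from torch import nn

    class DecodeSketch(nn.Module):
        def __init__(self, quantizer: nn.Module, generator: nn.Module):
            super().__init__()
            self.quantizer = quantizer
            self.generator = generator

        def decode(self, vq_codes: torch.Tensor) -> torch.Tensor:
            # vq_codes are raw codebook indices; embed() turns them into the
            # quantized latents (acoustic_tokens), which the generator decodes.
            acoustic_tokens = self.quantizer.embed(vq_codes)
            return self.generator(acoustic_tokens)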

Training Soundstream on Single GPU

Hi @yangdongchao
I am planning to train the SoundStream codec from this repo on the clean subset of the Libri-Light dataset plus the VCTK dataset, and I will open-source the checkpoint, but I only have a single A100. Is it possible to train SoundStream on a single A100 with a lower batch size over a longer period?

Is a pretrained discriminator model available?

I am currently learning this model's training pipeline, but the paper says training to convergence takes more than a month on 8 GPUs, so I certainly cannot train a good model in a short time. I would like to know whether a pretrained discriminator model is available. Thanks.

What datasets are specifically mixed in the HiFi-Codec paper?

Hello, I mixed the three datasets LibriTTS, AISHELL, and VCTK according to the dataset setup in the HiFi-Codec paper, for a total of about 400 hours, but the trained model could not reach the performance of the pretrained model you provided. May I ask what datasets are specifically mixed in your paper's "and more, with a total duration of over 1000 hours"?

Checkpoints could be saved in a unified format

They could be saved in a format like the one below; otherwise, models saved in single-machine mode can hit indexing problems when loaded. I will submit a fix later.

if epoch % config.common.save_interval == 0:
    model_to_save = model.module if config.distributed.data_parallel else model
    disc_model_to_save = disc_model.module if config.distributed.data_parallel else disc_model
    if not config.distributed.data_parallel or dist.get_rank() == 0:
        save_master_checkpoint(epoch, model_to_save, optimizer, scheduler, f'{config.checkpoint.save_location}epoch{epoch}_lr{config.optimization.lr}.pt')
        save_master_checkpoint(epoch, disc_model_to_save, optimizer_disc, disc_scheduler, f'{config.checkpoint.save_location}epoch{epoch}_disc_lr{config.optimization.lr}.pt')

Encodec's training speed

I am training Encodec on my own dataset (300+ hours, 1.2 million samples); one iteration takes 1.7 s (8 V100s, per-GPU batch size 28). It will take 30 days in total to train 300 epochs. 😱
Is this speed normal?
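
As a rough sanity check of that estimate: 1.2 M samples / (8 × 28) ≈ 5,360 iterations per epoch; 5,360 × 1.7 s ≈ 2.5 h per epoch; × 300 epochs ≈ 31 days, which is consistent with the figure above.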

HiFi-Codec-16k bitrate options

Have you evaluated it at higher bitrates? It seems HiFi-Codec-16k only supports two code rates: 1 kbps using 1 quantizer layer and 2 kbps using 2 layers. By the way, the training code seems to train only with 2-layer quantization.

to merge encodec_16k_lanch into academicodec

encodec_16k_lanch fixes the following issue:

from feiteng:
https://github.com/yangdongchao/AcademiCodec/blob/master/academicodec/quantization/core_vq.py#L149 should not be commented out.

With it commented out, multi-GPU training performs somewhat worse (single-GPU training is unaffected). In other words, the released Encodec weights could likely be improved by retraining with the latest code.

dongchao:
OK, I will update this later. If that line is uncommented, the current code cannot run multi-GPU; I now have a version that runs multi-GPU without commenting it out.

However, encodec_16k_lanch has not yet been merged into the repository's academicodec directory.

When I use the pre-trained models for inference with Encodec and HiFi-Codec, an identical error occurs

(soundstream) root@autodl-container-1cb1119f52-820c06c3:~/autodl-tmp/paper/HiFi-Codec# bash test.sh
checkpoint path: ./checkpoint/HiFi-Codec-24k-240d
Init model and load weights
Traceback (most recent call last):
File "./vqvae_copy_syn.py", line 35, in
model = VqvaeTester(args)
File "/root/autodl-tmp/paper/HiFi-Codec/vqvae_tester.py", line 20, in init
self.vqvae = VQVAE(hp.config_path, hp.model_path, with_encoder=True)
File "/root/autodl-tmp/paper/HiFi-Codec/vqvae.py", line 12, in init
ckpt = torch.load(ckpt_path)
File "/root/miniconda3/envs/soundstream/lib/python3.8/site-packages/torch/serialization.py", line 815, in load
return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
File "/root/miniconda3/envs/soundstream/lib/python3.8/site-packages/torch/serialization.py", line 1033, in _legacy_load
magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: invalid load key, '<'.
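
(For what it's worth, pickle's invalid load key '<' usually means the file being loaded is not a PyTorch checkpoint at all but text beginning with '<', such as an HTML error page saved by a failed download, so the checkpoint file may be worth re-downloading and verifying.)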

The Encodec 24k_240 training losses are very large!

Hi yangdongchao!
When I train Encodec 24k_240 at 1 kbps, the model exhibits very high loss and significant oscillation during the early stages. Is this a normal phenomenon?

The training log is as follows:

<epoch:8, iter:8250, total_loss_g:20.7092, adv_g_loss:2.1068, feat_loss:15.4339, rec_loss:3.1594, commit_loss:0.0000, loss_d:1.2053>, d_weight: 1.0000
 46%|██████████████████████████████████████████████████▋                                                            | 8259/18075 [1:34:07<1:51:14,  1.47it/s]<epoch:8, iter:8260, total_loss_g:1448.0029, adv_g_loss:2.0795, feat_loss:1439.2244, rec_loss:6.6940, commit_loss:0.0000, loss_d:0.5836>, d_weight: 1.0000
 46%|██████████████████████████████████████████████████▊                                                            | 8269/18075 [1:34:15<1:51:27,  1.47it/s]<epoch:8, iter:8270, total_loss_g:588.6943, adv_g_loss:2.1234, feat_loss:577.0657, rec_loss:9.4847, commit_loss:0.0000, loss_d:0.8170>, d_weight: 1.0000
 46%|██████████████████████████████████████████████████▊                                                            | 8279/18075 [1:34:21<1:51:37,  1.46it/s]<epoch:8, iter:8280, total_loss_g:316.6624, adv_g_loss:2.1950, feat_loss:306.5796, rec_loss:7.8813, commit_loss:0.0000, loss_d:0.8256>, d_weight: 1.0000
 46%|██████████████████████████████████████████████████▉                                                            | 8289/18075 [1:34:29<1:51:56,  1.46it/s]<epoch:8, iter:8290, total_loss_g:6425.9717, adv_g_loss:2.1269, feat_loss:6398.3364, rec_loss:25.5026, commit_loss:0.0000, loss_d:0.9661>, d_weight: 1.0000
 46%|██████████████████████████████████████████████████▉                                                            | 8299/18075 [1:34:36<1:52:12,  1.45it/s]<epoch:8, iter:8300, total_loss_g:2867.6846, adv_g_loss:2.2306, feat_loss:2847.7778, rec_loss:17.6676, commit_loss:0.0000, loss_d:0.1482>, d_weight: 1.0000
 46%|███████████████████████████████████████████████████                                                            | 8309/18075 [1:34:41<1:52:00,  1.45it/s]<epoch:8, iter:8310, total_loss_g:4510.4780, adv_g_loss:1.9837, feat_loss:4476.9551, rec_loss:31.5352, commit_loss:0.0000, loss_d:1.1329>, d_weight: 1.0000
 46%|███████████████████████████████████████████████████                                                            | 8319/18075 [1:34:47<1:51:03,  1.46it/s]<epoch:8, iter:8320, total_loss_g:3507.8118, adv_g_loss:1.9984, feat_loss:3480.6077, rec_loss:25.1733, commit_loss:0.0000, loss_d:1.0020>, d_weight: 1.0000
 46%|███████████████████████████████████████████████████▏                                                           | 8329/18075 [1:34:56<1:50:40,  1.47it/s]<epoch:8, iter:8330, total_loss_g:17506.3809, adv_g_loss:1.9943, feat_loss:17494.1309, rec_loss:10.2544, commit_loss:0.0000, loss_d:0.8280>, d_weight: 1.0000
 46%|███████████████████████████████████████████████████▏                                                           | 8339/18075 [1:35:01<1:50:31,  1.47it/s]<epoch:8, iter:8340, total_loss_g:30781.5254, adv_g_loss:2.1298, feat_loss:30761.4688, rec_loss:17.8869, commit_loss:0.0000, loss_d:0.4086>, d_weight: 1.0000
 46%|███████████████████████████████████████████████████▎                                                           | 8349/18075 [1:35:08<1:50:59,  1.46it/s]<epoch:8, iter:8350, total_loss_g:361517.0312, adv_g_loss:2.1185, feat_loss:361338.4688, rec_loss:176.4266, commit_loss:0.0000, loss_d:0.2256>, d_weight: 1.0000
 46%|███████████████████████████████████████████████████▎                                                           | 8359/18075 [1:35:15<1:49:04,  1.48it/s]<epoch:8, iter:8360, total_loss_g:32.4452, adv_g_loss:2.1076, feat_loss:28.3426, rec_loss:1.9913, commit_loss:0.0000, loss_d:1.3850>, d_weight: 1.0000
 46%|███████████████████████████████████████████████████▍                                                           | 8369/18075 [1:35:23<1:50:03,  1.47it/s]<epoch:8, iter:8370, total_loss_g:304.8588, adv_g_loss:2.2852, feat_loss:299.7329, rec_loss:2.8386, commit_loss:0.0000, loss_d:1.0175>, d_weight: 1.0000
 46%|███████████████████████████████████████████████████▍                                                           | 8379/18075 [1:35:30<1:50:01,  1.47it/s]<epoch:8, iter:8380, total_loss_g:34873.7617, adv_g_loss:2.1054, feat_loss:34844.2266, rec_loss:27.4251, commit_loss:0.0000, loss_d:0.3069>, d_weight: 1.0000
 46%|███████████████████████████████████████████████████▌                                                           | 8389/18075 [1:35:37<1:50:02,  1.47it/s]<epoch:8, iter:8390, total_loss_g:40341.5039, adv_g_loss:2.2593, feat_loss:40214.2148, rec_loss:125.0235, commit_loss:0.0000, loss_d:0.6393>, d_weight: 1.0000
 46%|███████████████████████████████████████████████████▌                                                           | 8399/18075 [1:35:43<1:50:03,  1.47it/s]<epoch:8, iter:8400, total_loss_g:184210.6719, adv_g_loss:2.0305, feat_loss:184145.6875, rec_loss:62.9335, commit_loss:0.0000, loss_d:1.0710>, d_weight: 1.0000
 47%|███████████████████████████████████████████████████▋                                                           | 8409/18075 [1:35:49<1:48:46,  1.48it/s]<epoch:8, iter:8410, total_loss_g:1336.8246, adv_g_loss:2.1409, feat_loss:1317.9712, rec_loss:16.7082, commit_loss:0.0000, loss_d:0.9688>, d_weight: 1.0000
 47%|███████████████████████████████████████████████████▋                                                           | 8419/18075 [1:35:57<1:49:31,  1.47it/s]<epoch:8, iter:8420, total_loss_g:13977.8945, adv_g_loss:2.2973, feat_loss:13938.0557, rec_loss:37.5274, commit_loss:0.0000, loss_d:0.2749>, d_weight: 1.0000
 47%|███████████████████████████████████████████████████▊                                                           | 8429/18075 [1:36:04<1:49:48,  1.46it/s]<epoch:8, iter:8430, total_loss_g:3301.4082, adv_g_loss:2.1330, feat_loss:3262.6450, rec_loss:36.6189, commit_loss:0.0000, loss_d:0.6580>, d_weight: 1.0000
 47%|███████████████████████████████████████████████████▊                                                           | 8438/18075 [1:36:10<1:49:44,  1.46it/s]

The validation log is as follows:

2023-06-20-12-58: <epoch:0, total_loss_g_valid:155.6049, recon_loss_valid:21.3568, adversarial_loss_valid:1.6380, feature_loss_valid:132.6101, commit_loss_valid:0.0000, valid_loss_d:1.2365, best_epoch:0>
2023-06-20-16-30: <epoch:1, total_loss_g_valid:508.1316, recon_loss_valid:21.7350, adversarial_loss_valid:1.7627, feature_loss_valid:484.6339, commit_loss_valid:0.0000, valid_loss_d:1.0418, best_epoch:0>
2023-06-20-20-02: <epoch:2, total_loss_g_valid:302.2671, recon_loss_valid:20.5088, adversarial_loss_valid:2.1077, feature_loss_valid:279.6506, commit_loss_valid:0.0000, valid_loss_d:1.1599, best_epoch:2>
2023-06-20-23-34: <epoch:3, total_loss_g_valid:1090.3598, recon_loss_valid:20.4632, adversarial_loss_valid:2.0897, feature_loss_valid:1067.8068, commit_loss_valid:0.0000, valid_loss_d:0.9414, best_epoch:3>
2023-06-21-03-07: <epoch:4, total_loss_g_valid:1666.9553, recon_loss_valid:21.7679, adversarial_loss_valid:2.0294, feature_loss_valid:1643.1580, commit_loss_valid:0.0000, valid_loss_d:1.0660, best_epoch:3>
2023-06-21-06-39: <epoch:5, total_loss_g_valid:1438.0695, recon_loss_valid:21.1533, adversarial_loss_valid:2.1540, feature_loss_valid:1414.7622, commit_loss_valid:0.0000, valid_loss_d:1.1304, best_epoch:3>
2023-06-21-10-11: <epoch:6, total_loss_g_valid:918.1003, recon_loss_valid:21.4004, adversarial_loss_valid:2.1242, feature_loss_valid:894.5757, commit_loss_valid:0.0000, valid_loss_d:1.1136, best_epoch:3>
2023-06-21-13-43: <epoch:7, total_loss_g_valid:1691.1200, recon_loss_valid:20.3575, adversarial_loss_valid:2.1024, feature_loss_valid:1668.6601, commit_loss_valid:0.0000, valid_loss_d:0.9036, best_epoch:7>

How should the codes dimension be understood?

Thanks for open-sourcing this excellent work!

I want to confirm my understanding of the ordering of the output codes:
The VQVAE encode function outputs a tensor of shape [B, T, 4].
Suppose B=1, T=2, and the codes are
[[a,b,c,d]
[e,f,g,h]]

My reading:
a is the code from the first quantization of the first half of the T=1 feature,
b is the code from the first quantization of the second half of the T=1 feature,
c is the code obtained by quantizing the residual of a,
...

h is the code obtained by quantizing the residual of f.
Is this reading correct? (A sketch of this hypothesized ordering follows below.)

Thanks,
Puyuan
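
A minimal, self-contained sketch of the group-residual ordering hypothesized above, with 2 groups and 2 residual stages (the structure and codebooks here are assumptions for illustration, not the repo's code):

    import torch

    torch.manual_seed(0)
    D, K = 8, 16                              # feature dim, codebook size
    books = torch.randn(2, 2, K, D // 2)      # [group, stage, K, D/2], toy codebooks

    def encode_frame(feature):
        halves = [feature[: D // 2].clone(), feature[D // 2:].clone()]
        codes = []
        for s in range(2):                    # residual stage (outer loop)
            for g in range(2):                # group = half of the feature (inner loop)
                idx = torch.cdist(halves[g][None], books[g, s]).argmin()
                codes.append(int(idx))
                halves[g] -= books[g, s, idx] # pass the residual to the next stage
        return codes                          # [a, b, c, d] in the ordering asked about

    print(encode_frame(torch.randn(D)))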

Data Augmentation in SoundStream

Hi, thanks for your great work. I noticed that the NSynthDataset used in SoundStream performs data augmentation by adding two audio waveforms together, which does not appear in Encodec. I wonder where this technique was proposed, and have you found that it helps audio quality? Thanks.

It does not converge for VALL-E training

The model does not converge when I use HiFi-Codec to train the NAR stage of VALL-E. The data I used is a Chinese dataset with a duration of 5,000 hours. What can I do to train VALL-E with HiFi-Codec?

An error occurred while running SoundStream inference

The command:

python test.py "../datasets/Nsynth/nsynth-valid/audio/" "./audiofake" \
--resume_path "./model_path/2023-05-10-08-55/best_1.pth"

The error is as follows:
(soundstream) root@autodl-container-1cb1119f52-820c06c3:~/autodl-tmp/paper/SoundStream_24k_240d# bash test.sh
Traceback (most recent call last):
File "test.py", line 159, in
test_batch()
File "test.py", line 151, in test_batch
soundstream.load_state_dict(new_state_dict) # load model
File "/root/miniconda3/envs/soundstream/lib/python3.8/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for SoundStream:
size mismatch for encoder.model.0.conv.conv.bias: copying a param with shape torch.Size([32]) from checkpoint, the shape in current model is torch.Size([48]).
size mismatch for encoder.model.0.conv.conv.weight_g: copying a param with shape torch.Size([32, 1, 1]) from checkpoint, the shape in current model is torch.Size([48, 1, 1]).
size mismatch for encoder.model.0.conv.conv.weight_v: copying a param with shape torch.Size([32, 1, 7]) from checkpoint, the shape in current model is torch.Size([48, 1, 7]).
size mismatch for encoder.model.1.block.1.conv.conv.bias: copying a param with shape torch.Size([16]) from checkpoint, the shape in current model is torch.Size([24]).
size mismatch for encoder.model.1.block.1.conv.conv.weight_g: copying a param with shape torch.Size([16, 1, 1]) from checkpoint, the shape in current model is torch.Size([24, 1, 1]).
size mismatch for encoder.model.1.block.1.conv.conv.weight_v: copying a param with shape torch.Size([16, 32, 3]) from checkpoint, the shape in current model is torch.Size([24, 48, 3]).
size mismatch for encoder.model.1.block.3.conv.conv.bias: copying a param with shape torch.Size([32]) from checkpoint, the shape in current model is torch.Size([48]).
size mismatch for encoder.model.1.block.3.conv.conv.weight_g: copying a param with shape torch.Size([32, 1, 1]) from checkpoint, the shape in current model is torch.Size([48, 1, 1]).
size mismatch for encoder.model.1.block.3.conv.conv.weight_v: copying a param with shape torch.Size([32, 16, 1]) from checkpoint, the shape in current model is torch.Size([48, 24, 1]).
size mismatch for encoder.model.1.shortcut.conv.conv.bias: copying a param with shape torch.Size([32]) from checkpoint, the shape in current model is torch.Size([48]).
size mismatch for encoder.model.1.shortcut.conv.conv.weight_g: copying a param with shape torch.Size([32, 1, 1]) from checkpoint, the shape in current model is torch.Size([48, 1, 1]).
size mismatch for encoder.model.1.shortcut.conv.conv.weight_v: copying a param with shape torch.Size([32, 32, 1]) from checkpoint, the shape in current model is torch.Size([48, 48, 1]).
size mismatch for encoder.model.3.conv.conv.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([96]).
size mismatch for encoder.model.3.conv.conv.weight_g: copying a param with shape torch.Size([64, 1, 1]) from checkpoint, the shape in current model is torch.Size([96, 1, 1]).
size mismatch for encoder.model.3.conv.conv.weight_v: copying a param with shape torch.Size([64, 32, 4]) from checkpoint, the shape in current model is torch.Size([96, 48, 4]).
size mismatch for encoder.model.4.block.1.conv.conv.bias: copying a param with shape torch.Size([32]) from checkpoint, the shape in current model is torch.Size([48]).
size mismatch for encoder.model.4.block.1.conv.conv.weight_g: copying a param with shape torch.Size([32, 1, 1]) from checkpoint, the shape in current model is torch.Size([48, 1, 1]).
size mismatch for encoder.model.4.block.1.conv.conv.weight_v: copying a param with shape torch.Size([32, 64, 3]) from checkpoint, the shape in current model is torch.Size([48, 96, 3]).
size mismatch for encoder.model.4.block.3.conv.conv.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([96]).
size mismatch for encoder.model.4.block.3.conv.conv.weight_g: copying a param with shape torch.Size([64, 1, 1]) from checkpoint, the shape in current model is torch.Size([96, 1, 1]).
size mismatch for encoder.model.4.block.3.conv.conv.weight_v: copying a param with shape torch.Size([64, 32, 1]) from checkpoint, the shape in current model is torch.Size([96, 48, 1]).
size mismatch for encoder.model.4.shortcut.conv.conv.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([96]).
size mismatch for encoder.model.4.shortcut.conv.conv.weight_g: copying a param with shape torch.Size([64, 1, 1]) from checkpoint, the shape in current model is torch.Size([96, 1, 1]).
size mismatch for encoder.model.4.shortcut.conv.conv.weight_v: copying a param with shape torch.Size([64, 64, 1]) from checkpoint, the shape in current model is torch.Size([96, 96, 1]).
size mismatch for encoder.model.6.conv.conv.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([192]).
size mismatch for encoder.model.6.conv.conv.weight_g: copying a param with shape torch.Size([128, 1, 1]) from checkpoint, the shape in current model is torch.Size([192, 1, 1]).
size mismatch for encoder.model.6.conv.conv.weight_v: copying a param with shape torch.Size([128, 64, 8]) from checkpoint, the shape in current model is torch.Size([192, 96, 8]).
size mismatch for encoder.model.7.block.1.conv.conv.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([96]).
size mismatch for encoder.model.7.block.1.conv.conv.weight_g: copying a param with shape torch.Size([64, 1, 1]) from checkpoint, the shape in current model is torch.Size([96, 1, 1]).
size mismatch for encoder.model.7.block.1.conv.conv.weight_v: copying a param with shape torch.Size([64, 128, 3]) from checkpoint, the shape in current model is torch.Size([96, 192, 3]).
size mismatch for encoder.model.7.block.3.conv.conv.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([192]).
size mismatch for encoder.model.7.block.3.conv.conv.weight_g: copying a param with shape torch.Size([128, 1, 1]) from checkpoint, the shape in current model is torch.Size([192, 1, 1]).
size mismatch for encoder.model.7.block.3.conv.conv.weight_v: copying a param with shape torch.Size([128, 64, 1]) from checkpoint, the shape in current model is torch.Size([192, 96, 1]).
size mismatch for encoder.model.7.shortcut.conv.conv.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([192]).
size mismatch for encoder.model.7.shortcut.conv.conv.weight_g: copying a param with shape torch.Size([128, 1, 1]) from checkpoint, the shape in current model is torch.Size([192, 1, 1]).
size mismatch for encoder.model.7.shortcut.conv.conv.weight_v: copying a param with shape torch.Size([128, 128, 1]) from checkpoint, the shape in current model is torch.Size([192, 192, 1]).
size mismatch for encoder.model.9.conv.conv.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([384]).
size mismatch for encoder.model.9.conv.conv.weight_g: copying a param with shape torch.Size([256, 1, 1]) from checkpoint, the shape in current model is torch.Size([384, 1, 1]).
size mismatch for encoder.model.9.conv.conv.weight_v: copying a param with shape torch.Size([256, 128, 10]) from checkpoint, the shape in current model is torch.Size([384, 192, 10]).
size mismatch for encoder.model.10.block.1.conv.conv.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([192]).
size mismatch for encoder.model.10.block.1.conv.conv.weight_g: copying a param with shape torch.Size([128, 1, 1]) from checkpoint, the shape in current model is torch.Size([192, 1, 1]).
size mismatch for encoder.model.10.block.1.conv.conv.weight_v: copying a param with shape torch.Size([128, 256, 3]) from checkpoint, the shape in current model is torch.Size([192, 384, 3]).
size mismatch for encoder.model.10.block.3.conv.conv.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([384]).
size mismatch for encoder.model.10.block.3.conv.conv.weight_g: copying a param with shape torch.Size([256, 1, 1]) from checkpoint, the shape in current model is torch.Size([384, 1, 1]).
size mismatch for encoder.model.10.block.3.conv.conv.weight_v: copying a param with shape torch.Size([256, 128, 1]) from checkpoint, the shape in current model is torch.Size([384, 192, 1]).
size mismatch for encoder.model.10.shortcut.conv.conv.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([384]).
size mismatch for encoder.model.10.shortcut.conv.conv.weight_g: copying a param with shape torch.Size([256, 1, 1]) from checkpoint, the shape in current model is torch.Size([384, 1, 1]).
size mismatch for encoder.model.10.shortcut.conv.conv.weight_v: copying a param with shape torch.Size([256, 256, 1]) from checkpoint, the shape in current model is torch.Size([384, 384, 1]).
size mismatch for encoder.model.12.conv.conv.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([768]).
size mismatch for encoder.model.12.conv.conv.weight_g: copying a param with shape torch.Size([512, 1, 1]) from checkpoint, the shape in current model is torch.Size([768, 1, 1]).
size mismatch for encoder.model.12.conv.conv.weight_v: copying a param with shape torch.Size([512, 256, 12]) from checkpoint, the shape in current model is torch.Size([768, 384, 12]).
size mismatch for encoder.model.13.lstm.weight_ih_l0: copying a param with shape torch.Size([2048, 512]) from checkpoint, the shape in current model is torch.Size([3072, 768]).
size mismatch for encoder.model.13.lstm.weight_hh_l0: copying a param with shape torch.Size([2048, 512]) from checkpoint, the shape in current model is torch.Size([3072, 768]).
size mismatch for encoder.model.13.lstm.bias_ih_l0: copying a param with shape torch.Size([2048]) from checkpoint, the shape in current model is torch.Size([3072]).
size mismatch for encoder.model.13.lstm.bias_hh_l0: copying a param with shape torch.Size([2048]) from checkpoint, the shape in current model is torch.Size([3072]).
size mismatch for encoder.model.13.lstm.weight_ih_l1: copying a param with shape torch.Size([2048, 512]) from checkpoint, the shape in current model is torch.Size([3072, 768]).
size mismatch for encoder.model.13.lstm.weight_hh_l1: copying a param with shape torch.Size([2048, 512]) from checkpoint, the shape in current model is torch.Size([3072, 768]).
size mismatch for encoder.model.13.lstm.bias_ih_l1: copying a param with shape torch.Size([2048]) from checkpoint, the shape in current model is torch.Size([3072]).
size mismatch for encoder.model.13.lstm.bias_hh_l1: copying a param with shape torch.Size([2048]) from checkpoint, the shape in current model is torch.Size([3072]).
size mismatch for encoder.model.15.conv.conv.weight_v: copying a param with shape torch.Size([512, 512, 7]) from checkpoint, the shape in current model is torch.Size([512, 768, 7]).
size mismatch for decoder.model.0.conv.conv.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([768]).
size mismatch for decoder.model.0.conv.conv.weight_g: copying a param with shape torch.Size([512, 1, 1]) from checkpoint, the shape in current model is torch.Size([768, 1, 1]).
size mismatch for decoder.model.0.conv.conv.weight_v: copying a param with shape torch.Size([512, 512, 7]) from checkpoint, the shape in current model is torch.Size([768, 512, 7]).
size mismatch for decoder.model.1.lstm.weight_ih_l0: copying a param with shape torch.Size([2048, 512]) from checkpoint, the shape in current model is torch.Size([3072, 768]).
size mismatch for decoder.model.1.lstm.weight_hh_l0: copying a param with shape torch.Size([2048, 512]) from checkpoint, the shape in current model is torch.Size([3072, 768]).
size mismatch for decoder.model.1.lstm.bias_ih_l0: copying a param with shape torch.Size([2048]) from checkpoint, the shape in current model is torch.Size([3072]).
size mismatch for decoder.model.1.lstm.bias_hh_l0: copying a param with shape torch.Size([2048]) from checkpoint, the shape in current model is torch.Size([3072]).
size mismatch for decoder.model.1.lstm.weight_ih_l1: copying a param with shape torch.Size([2048, 512]) from checkpoint, the shape in current model is torch.Size([3072, 768]).
size mismatch for decoder.model.1.lstm.weight_hh_l1: copying a param with shape torch.Size([2048, 512]) from checkpoint, the shape in current model is torch.Size([3072, 768]).
size mismatch for decoder.model.1.lstm.bias_ih_l1: copying a param with shape torch.Size([2048]) from checkpoint, the shape in current model is torch.Size([3072]).
size mismatch for decoder.model.1.lstm.bias_hh_l1: copying a param with shape torch.Size([2048]) from checkpoint, the shape in current model is torch.Size([3072]).
size mismatch for decoder.model.3.convtr.convtr.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([384]).
size mismatch for decoder.model.3.convtr.convtr.weight_g: copying a param with shape torch.Size([512, 1, 1]) from checkpoint, the shape in current model is torch.Size([768, 1, 1]).
size mismatch for decoder.model.3.convtr.convtr.weight_v: copying a param with shape torch.Size([512, 256, 12]) from checkpoint, the shape in current model is torch.Size([768, 384, 12]).
size mismatch for decoder.model.4.block.1.conv.conv.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([192]).
size mismatch for decoder.model.4.block.1.conv.conv.weight_g: copying a param with shape torch.Size([128, 1, 1]) from checkpoint, the shape in current model is torch.Size([192, 1, 1]).
size mismatch for decoder.model.4.block.1.conv.conv.weight_v: copying a param with shape torch.Size([128, 256, 3]) from checkpoint, the shape in current model is torch.Size([192, 384, 3]).
size mismatch for decoder.model.4.block.3.conv.conv.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([384]).
size mismatch for decoder.model.4.block.3.conv.conv.weight_g: copying a param with shape torch.Size([256, 1, 1]) from checkpoint, the shape in current model is torch.Size([384, 1, 1]).
size mismatch for decoder.model.4.block.3.conv.conv.weight_v: copying a param with shape torch.Size([256, 128, 1]) from checkpoint, the shape in current model is torch.Size([384, 192, 1]).
size mismatch for decoder.model.4.shortcut.conv.conv.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([384]).
size mismatch for decoder.model.4.shortcut.conv.conv.weight_g: copying a param with shape torch.Size([256, 1, 1]) from checkpoint, the shape in current model is torch.Size([384, 1, 1]).
size mismatch for decoder.model.4.shortcut.conv.conv.weight_v: copying a param with shape torch.Size([256, 256, 1]) from checkpoint, the shape in current model is torch.Size([384, 384, 1]).
size mismatch for decoder.model.6.convtr.convtr.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([192]).
size mismatch for decoder.model.6.convtr.convtr.weight_g: copying a param with shape torch.Size([256, 1, 1]) from checkpoint, the shape in current model is torch.Size([384, 1, 1]).
size mismatch for decoder.model.6.convtr.convtr.weight_v: copying a param with shape torch.Size([256, 128, 10]) from checkpoint, the shape in current model is torch.Size([384, 192, 10]).
size mismatch for decoder.model.7.block.1.conv.conv.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([96]).
size mismatch for decoder.model.7.block.1.conv.conv.weight_g: copying a param with shape torch.Size([64, 1, 1]) from checkpoint, the shape in current model is torch.Size([96, 1, 1]).
size mismatch for decoder.model.7.block.1.conv.conv.weight_v: copying a param with shape torch.Size([64, 128, 3]) from checkpoint, the shape in current model is torch.Size([96, 192, 3]).
size mismatch for decoder.model.7.block.3.conv.conv.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([192]).
size mismatch for decoder.model.7.block.3.conv.conv.weight_g: copying a param with shape torch.Size([128, 1, 1]) from checkpoint, the shape in current model is torch.Size([192, 1, 1]).
size mismatch for decoder.model.7.block.3.conv.conv.weight_v: copying a param with shape torch.Size([128, 64, 1]) from checkpoint, the shape in current model is torch.Size([192, 96, 1]).
size mismatch for decoder.model.7.shortcut.conv.conv.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([192]).
size mismatch for decoder.model.7.shortcut.conv.conv.weight_g: copying a param with shape torch.Size([128, 1, 1]) from checkpoint, the shape in current model is torch.Size([192, 1, 1]).
size mismatch for decoder.model.7.shortcut.conv.conv.weight_v: copying a param with shape torch.Size([128, 128, 1]) from checkpoint, the shape in current model is torch.Size([192, 192, 1]).
size mismatch for decoder.model.9.convtr.convtr.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([96]).
size mismatch for decoder.model.9.convtr.convtr.weight_g: copying a param with shape torch.Size([128, 1, 1]) from checkpoint, the shape in current model is torch.Size([192, 1, 1]).
size mismatch for decoder.model.9.convtr.convtr.weight_v: copying a param with shape torch.Size([128, 64, 8]) from checkpoint, the shape in current model is torch.Size([192, 96, 8]).
size mismatch for decoder.model.10.block.1.conv.conv.bias: copying a param with shape torch.Size([32]) from checkpoint, the shape in current model is torch.Size([48]).
size mismatch for decoder.model.10.block.1.conv.conv.weight_g: copying a param with shape torch.Size([32, 1, 1]) from checkpoint, the shape in current model is torch.Size([48, 1, 1]).
size mismatch for decoder.model.10.block.1.conv.conv.weight_v: copying a param with shape torch.Size([32, 64, 3]) from checkpoint, the shape in current model is torch.Size([48, 96, 3]).
size mismatch for decoder.model.10.block.3.conv.conv.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([96]).
size mismatch for decoder.model.10.block.3.conv.conv.weight_g: copying a param with shape torch.Size([64, 1, 1]) from checkpoint, the shape in current model is torch.Size([96, 1, 1]).
size mismatch for decoder.model.10.block.3.conv.conv.weight_v: copying a param with shape torch.Size([64, 32, 1]) from checkpoint, the shape in current model is torch.Size([96, 48, 1]).
size mismatch for decoder.model.10.shortcut.conv.conv.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([96]).
size mismatch for decoder.model.10.shortcut.conv.conv.weight_g: copying a param with shape torch.Size([64, 1, 1]) from checkpoint, the shape in current model is torch.Size([96, 1, 1]).
size mismatch for decoder.model.10.shortcut.conv.conv.weight_v: copying a param with shape torch.Size([64, 64, 1]) from checkpoint, the shape in current model is torch.Size([96, 96, 1]).
size mismatch for decoder.model.12.convtr.convtr.bias: copying a param with shape torch.Size([32]) from checkpoint, the shape in current model is torch.Size([48]).
size mismatch for decoder.model.12.convtr.convtr.weight_g: copying a param with shape torch.Size([64, 1, 1]) from checkpoint, the shape in current model is torch.Size([96, 1, 1]).
size mismatch for decoder.model.12.convtr.convtr.weight_v: copying a param with shape torch.Size([64, 32, 4]) from checkpoint, the shape in current model is torch.Size([96, 48, 4]).
size mismatch for decoder.model.13.block.1.conv.conv.bias: copying a param with shape torch.Size([16]) from checkpoint, the shape in current model is torch.Size([24]).
size mismatch for decoder.model.13.block.1.conv.conv.weight_g: copying a param with shape torch.Size([16, 1, 1]) from checkpoint, the shape in current model is torch.Size([24, 1, 1]).
size mismatch for decoder.model.13.block.1.conv.conv.weight_v: copying a param with shape torch.Size([16, 32, 3]) from checkpoint, the shape in current model is torch.Size([24, 48, 3]).
size mismatch for decoder.model.13.block.3.conv.conv.bias: copying a param with shape torch.Size([32]) from checkpoint, the shape in current model is torch.Size([48]).
size mismatch for decoder.model.13.block.3.conv.conv.weight_g: copying a param with shape torch.Size([32, 1, 1]) from checkpoint, the shape in current model is torch.Size([48, 1, 1]).
size mismatch for decoder.model.13.block.3.conv.conv.weight_v: copying a param with shape torch.Size([32, 16, 1]) from checkpoint, the shape in current model is torch.Size([48, 24, 1]).
size mismatch for decoder.model.13.shortcut.conv.conv.bias: copying a param with shape torch.Size([32]) from checkpoint, the shape in current model is torch.Size([48]).
size mismatch for decoder.model.13.shortcut.conv.conv.weight_g: copying a param with shape torch.Size([32, 1, 1]) from checkpoint, the shape in current model is torch.Size([48, 1, 1]).
size mismatch for decoder.model.13.shortcut.conv.conv.weight_v: copying a param with shape torch.Size([32, 32, 1]) from checkpoint, the shape in current model is torch.Size([48, 48, 1]).
size mismatch for decoder.model.15.conv.conv.weight_v: copying a param with shape torch.Size([1, 32, 7]) from checkpoint, the shape in current model is torch.Size([1, 48, 7]).

Import problem in the custom module distributed/launch.py

When running egs/SoundStream_24k_240d/main3_ddp.py, execution reaches line 9, which imports the custom module academicodec/models/encodec/distributed/launch.py, and launch.py then raises an error at its line 5 saying the module cannot be found.

The fix is simply to rewrite line 5 of launch.py as from . import distributed as dist_fn.
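
For reference, the one-line change (the original absolute import on line 5 is assumed):

    # academicodec/models/encodec/distributed/launch.py, line 5
    # before (fails because 'distributed' is not a top-level module):
    # import distributed as dist_fn
    # after (a relative import resolves the sibling module):
    from . import distributed as dist_fn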

Are Encodec_24k_32d and Encodec_16k_320 actually SoundStream?

Hi, dongchao,
I have recently been surveying the AudioLM line of work and found that your SoundStorm reproduction is relatively complete, so I plan to build on it (since https://github.com/yangdongchao/SoundStorm currently only has S2, not S1). I then came across the AcademiCodec repository. Looking at the test.py and the training file main3_ddp.py of Encodec_24k_32d and Encodec_16k_320, I found that the model they load is SoundStream:

from net3 import SoundStream

So are these two essentially SoundStream models, and is only Encodec_24k_240d the actual EnCodec model?

License missing

There is no LICENSE file.
What is the license for this project and the pretrained models?

Release the pretrained discriminator?

Thanks for open-sourcing your wonderful work!
I was trying to finetune on my own dataset; however, I found that only a pretrained generator is provided, with no pretrained discriminator.
So, would you please release your pretrained discriminator?
Thanks!

Error in "DiscriminatorSTFT"

Line 122: z = self.spec_transform(x) # [B, 2, Freq, Frames, 2]

But when I try to train the model, z has shape torch.Size([8, 1, 513, 43, 2]); the second dim is 1, not 2.
An error then occurs when running z = torch.cat([z.real, z.imag], dim=1):

RuntimeError: real is not implemented for tensors with non-complex dtypes.
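
A hedged sketch of this failure mode (assuming spec_transform returns a real tensor whose trailing dim of size 2 packs [real, imag], e.g. the pre-complex torch.stft output, which is exactly what makes z.real raise):

    import torch

    def split_real_imag(z: torch.Tensor) -> torch.Tensor:
        if not torch.is_complex(z):
            # Rebuild a complex view so z.real / z.imag work as the code expects.
            z = torch.view_as_complex(z.contiguous())
        return torch.cat([z.real, z.imag], dim=1)

    x = torch.randn(8, 22050)
    spec = torch.stft(x, n_fft=1024, hop_length=512,
                      window=torch.hann_window(1024), return_complex=True)
    z = torch.view_as_real(spec).unsqueeze(1)   # [8, 1, 513, Frames, 2] as reported
    print(split_real_imag(z).shape)             # torch.Size([8, 2, 513, Frames])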

Is this a work in progress?

Hello! Do you have any plans to upload the code for SoundStream and Encodec in the near future? Thanks in advance!

About VQ in the Encodec model

I found that this Encodec project does not use an LM over the codebook, unlike Facebook's EnCodec. Have you made any attempts at this?

Validation set

Hello,
I could not figure out which validation set was used for the results in the paper.

Could you help?
