
academicodec's People

Contributors

babysor, liusongxiang, rishikksh20, rongjiehuang, yangdongchao, yt605155624, ywk991112, zhaomingwork


academicodec's Issues

Uncommenting OMP_NUM_THREADS may speed up Encodec training

Uncommenting the OMP_NUM_THREADS code in launch.py can speed up training and improve GPU utilization, because by default all CPU cores are used (on machines with many cores, such as A100 hosts), and the interaction between many cores can be costly. If 1 feels too small, a larger value can additionally be set ahead of train.sh (e.g. 8). Training on LibriTTS has not been tested yet.

# if "OMP_NUM_THREADS" not in os.environ:

also see yangdongchao/SoundStorm#34

This has not yet been verified in this repository.
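
A minimal sketch of the uncommented guard in launch.py (the surrounding code is paraphrased here, not copied from the repo):

    import os

    # Default to a single OpenMP thread per process unless the user overrides it,
    # so DDP workers do not oversubscribe the CPU cores.
    if "OMP_NUM_THREADS" not in os.environ:
        os.environ["OMP_NUM_THREADS"] = "1"

A larger value can then be set ahead of the launcher, e.g. OMP_NUM_THREADS=8 bash train.sh.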

Missing json

The config.json for HiFi-Codec is missing:

FileNotFoundError: [Errno 2] No such file or directory: 'logs/config.json'

--config_path ${log_root}/config.json

Some minor bugs inside HiFi-Codec code

While analyzing the new HiFi-Codec code, I encountered three small bugs:

  1. Torchaudio MelSpectrogram:
    Here:
    melspec = MelSpectrogram(sample_rate=24000, n_fft=s, hop_length=s//4, n_mels=64, wkwargs={"device": device}).to(device)

    MelSpectrogram is not imported before use; it needs:
    from torchaudio.transforms import MelSpectrogram
  2. Modules not present inside HiFi-Codec:
    Here:

    from modules import NormConv2d

    There is no modules package inside the HiFi-Codec folder, so it needs to be copied in, or the reference changed to another model's modules implementation.

  3. Shape of the input tensor x, here:

    c = self.encoder(x.unsqueeze(1))

    In my testing with a 24 kHz mono-channel wav, the shape of x before line 33 comes out as [Batch, Samples, 1], and after the .unsqueeze(1) operation at line 33 it becomes [Batch, 1, Samples, 1], a 4D tensor where a 3D tensor is expected. So the shape of x needs to be checked before line 33: if it has 3 dimensions and the last dimension is 1, the last dimension should be squeezed, as in the sketch below.
    After correcting the shape of x, the code works without error and I get the desired output.
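
A minimal sketch of the guard described above (x and self.encoder come from the issue's context; this is not standalone code):

    # Drop a trailing singleton channel axis so x is [batch, samples]
    # before unsqueeze(1) adds the channel dimension the encoder expects.
    if x.dim() == 3 and x.shape[-1] == 1:
        x = x.squeeze(-1)                 # [B, T, 1] -> [B, T]
    c = self.encoder(x.unsqueeze(1))      # [B, 1, T], a 3D tensor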

Thanks @yangdongchao .

training loss

Hello, what is the final training loss of the Encodec_24k_240d model?

The VQVAE class of HiFi-Codec has no decode function

The VQVAE class of HiFi-Codec has no decode function. Does calling VQVAE's forward function directly stand in for the decode process? Because the logic I see in

syn = self.vqvae(vq_codes)

works that way.

Hmm, I may have misunderstood. Given the way

return acoustic_tokens

is written, is decode just a call to self.generator? Since acoustic_tokens should be the post-quantization result, the vq_code in
https://github.com/yangdongchao/AcademiCodec/blob/d03142b05be6d1023080cb42416f0c4b227e5342/HiFi-Codec-24k-320d/vqvae_tester.py#LL31C1-L31C1
is not acoustic_tokens; it needs to pass through self.quantizer.embed() first to become acoustic_tokens. Is this understanding correct? (A sketch of this reading follows below.)
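
A sketch of the decode path this question hypothesizes (the composition is the asker's reading, not a confirmed API; quantizer and generator stand in for the repo's modules):

    import torch
    from torch import nn

    class DecodeSketch(nn.Module):
        def __init__(self, quantizer: nn.Module, generator: nn.Module):
            super().__init__()
            self.quantizer = quantizer
            self.generator = generator

        def decode(self, vq_codes: torch.Tensor) -> torch.Tensor:
            # vq_codes are raw codebook indices; embed() turns them into the
            # quantized latents (acoustic_tokens), which the generator decodes.
            acoustic_tokens = self.quantizer.embed(vq_codes)
            return self.generator(acoustic_tokens)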

Training Soundstream on Single GPU

Hi @yangdongchao
I am planning to train the SoundStream codec from this repo on the clean subset of the Libri-Light dataset plus the VCTK dataset, and I will open-source the checkpoint, but I only have a single A100. Is it possible to train SoundStream on a single A100 with a lower batch size over a longer period?

Is a pretrained discriminator model available?

I am currently learning this model's training pipeline, but the paper says training to convergence takes more than a month on 8 GPUs, so I certainly cannot train a good model in a short time. I would like to know whether a pretrained discriminator model is available. Thanks.

What datasets are specifically mixed in the HiFi-Codec paper?

Hello, I mixed the three datasets LibriTTS, AISHELL, and VCTK according to the dataset setup in the HiFi-Codec paper, for a total of about 400 hours, but the trained model could not reach the performance of the pretrained model you provided. May I ask what datasets are specifically mixed in your paper's "and more, with a total duration of over 1000 hours"?

Checkpoints could be saved in a unified format

They could be saved in a format like the one below; otherwise, models saved in single-machine mode can hit indexing problems when loaded. I will submit a fix later.

if epoch % config.common.save_interval == 0:
    model_to_save = model.module if config.distributed.data_parallel else model
    disc_model_to_save = disc_model.module if config.distributed.data_parallel else disc_model
    if not config.distributed.data_parallel or dist.get_rank() == 0:
        save_master_checkpoint(epoch, model_to_save, optimizer, scheduler, f'{config.checkpoint.save_location}epoch{epoch}_lr{config.optimization.lr}.pt')
        save_master_checkpoint(epoch, disc_model_to_save, optimizer_disc, disc_scheduler, f'{config.checkpoint.save_location}epoch{epoch}_disc_lr{config.optimization.lr}.pt')

Encodec's training speed

I am training Encodec on my own dataset (300+ hours, 1.2 million samples); one iteration takes 1.7 s (8 V100s, per-GPU batch size 28). It will take 30 days in total to train 300 epochs. 😱
Is this speed normal?
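
As a rough sanity check of that estimate: 1.2 M samples / (8 × 28) ≈ 5,360 iterations per epoch; 5,360 × 1.7 s ≈ 2.5 h per epoch; × 300 epochs ≈ 31 days, which is consistent with the figure above.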

HiFi-Codec-16k bitrate options

Have you evaluated it at higher bitrates? It seems HiFi-Codec-16k only supports two code rates: 1 kbps using 1 quantizer layer and 2 kbps using 2 layers. By the way, the training code seems to train only with 2-layer quantization.

to merge encodec_16k_lanch into academicodec

encodec_16k_lanch fixes the following issue:

from feiteng:
https://github.com/yangdongchao/AcademiCodec/blob/master/academicodec/quantization/core_vq.py#L149 should not be commented out.

With it commented out, multi-GPU training performs somewhat worse (single-GPU training is unaffected). In other words, the released Encodec weights could likely be improved by retraining with the latest code.

dongchao:
OK, I will update this later. If that line is uncommented, the current code cannot run multi-GPU; I now have a version that runs multi-GPU without commenting it out.

However, encodec_16k_lanch has not yet been merged into the repository's academicodec directory.

When I use the pre-trained models for inference with Encodec and HiFi-Codec, an identical error occurs

(soundstream) root@autodl-container-1cb1119f52-820c06c3:~/autodl-tmp/paper/HiFi-Codec# bash test.sh
checkpoint path: ./checkpoint/HiFi-Codec-24k-240d
Init model and load weights
Traceback (most recent call last):
File "./vqvae_copy_syn.py", line 35, in
model = VqvaeTester(args)
File "/root/autodl-tmp/paper/HiFi-Codec/vqvae_tester.py", line 20, in init
self.vqvae = VQVAE(hp.config_path, hp.model_path, with_encoder=True)
File "/root/autodl-tmp/paper/HiFi-Codec/vqvae.py", line 12, in init
ckpt = torch.load(ckpt_path)
File "/root/miniconda3/envs/soundstream/lib/python3.8/site-packages/torch/serialization.py", line 815, in load
return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
File "/root/miniconda3/envs/soundstream/lib/python3.8/site-packages/torch/serialization.py", line 1033, in _legacy_load
magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: invalid load key, '<'.
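
(For what it's worth, pickle's invalid load key '<' usually means the file being loaded is not a PyTorch checkpoint at all but text beginning with '<', such as an HTML error page saved by a failed download, so the checkpoint file may be worth re-downloading and verifying.)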

The Encodec 24k_240 training losses are very large!

Hi yangdongchao!
When I train Encodec 24k_240 at 1 kbps, the model exhibits very high loss and significant oscillation during the early stages. Is this a normal phenomenon?

The training log is as follows:

<epoch:8, iter:8250, total_loss_g:20.7092, adv_g_loss:2.1068, feat_loss:15.4339, rec_loss:3.1594, commit_loss:0.0000, loss_d:1.2053>, d_weight: 1.0000
 46%|██████████████████████████████████████████████████▋                                                            | 8259/18075 [1:34:07<1:51:14,  1.47it/s]<epoch:8, iter:8260, total_loss_g:1448.0029, adv_g_loss:2.0795, feat_loss:1439.2244, rec_loss:6.6940, commit_loss:0.0000, loss_d:0.5836>, d_weight: 1.0000
 46%|██████████████████████████████████████████████████▊                                                            | 8269/18075 [1:34:15<1:51:27,  1.47it/s]<epoch:8, iter:8270, total_loss_g:588.6943, adv_g_loss:2.1234, feat_loss:577.0657, rec_loss:9.4847, commit_loss:0.0000, loss_d:0.8170>, d_weight: 1.0000
 46%|██████████████████████████████████████████████████▊                                                            | 8279/18075 [1:34:21<1:51:37,  1.46it/s]<epoch:8, iter:8280, total_loss_g:316.6624, adv_g_loss:2.1950, feat_loss:306.5796, rec_loss:7.8813, commit_loss:0.0000, loss_d:0.8256>, d_weight: 1.0000
 46%|██████████████████████████████████████████████████▉                                                            | 8289/18075 [1:34:29<1:51:56,  1.46it/s]<epoch:8, iter:8290, total_loss_g:6425.9717, adv_g_loss:2.1269, feat_loss:6398.3364, rec_loss:25.5026, commit_loss:0.0000, loss_d:0.9661>, d_weight: 1.0000
 46%|██████████████████████████████████████████████████▉                                                            | 8299/18075 [1:34:36<1:52:12,  1.45it/s]<epoch:8, iter:8300, total_loss_g:2867.6846, adv_g_loss:2.2306, feat_loss:2847.7778, rec_loss:17.6676, commit_loss:0.0000, loss_d:0.1482>, d_weight: 1.0000
 46%|███████████████████████████████████████████████████                                                            | 8309/18075 [1:34:41<1:52:00,  1.45it/s]<epoch:8, iter:8310, total_loss_g:4510.4780, adv_g_loss:1.9837, feat_loss:4476.9551, rec_loss:31.5352, commit_loss:0.0000, loss_d:1.1329>, d_weight: 1.0000
 46%|███████████████████████████████████████████████████                                                            | 8319/18075 [1:34:47<1:51:03,  1.46it/s]<epoch:8, iter:8320, total_loss_g:3507.8118, adv_g_loss:1.9984, feat_loss:3480.6077, rec_loss:25.1733, commit_loss:0.0000, loss_d:1.0020>, d_weight: 1.0000
 46%|███████████████████████████████████████████████████▏                                                           | 8329/18075 [1:34:56<1:50:40,  1.47it/s]<epoch:8, iter:8330, total_loss_g:17506.3809, adv_g_loss:1.9943, feat_loss:17494.1309, rec_loss:10.2544, commit_loss:0.0000, loss_d:0.8280>, d_weight: 1.0000
 46%|███████████████████████████████████████████████████▏                                                           | 8339/18075 [1:35:01<1:50:31,  1.47it/s]<epoch:8, iter:8340, total_loss_g:30781.5254, adv_g_loss:2.1298, feat_loss:30761.4688, rec_loss:17.8869, commit_loss:0.0000, loss_d:0.4086>, d_weight: 1.0000
 46%|███████████████████████████████████████████████████▎                                                           | 8349/18075 [1:35:08<1:50:59,  1.46it/s]<epoch:8, iter:8350, total_loss_g:361517.0312, adv_g_loss:2.1185, feat_loss:361338.4688, rec_loss:176.4266, commit_loss:0.0000, loss_d:0.2256>, d_weight: 1.0000
 46%|███████████████████████████████████████████████████▎                                                           | 8359/18075 [1:35:15<1:49:04,  1.48it/s]<epoch:8, iter:8360, total_loss_g:32.4452, adv_g_loss:2.1076, feat_loss:28.3426, rec_loss:1.9913, commit_loss:0.0000, loss_d:1.3850>, d_weight: 1.0000
 46%|███████████████████████████████████████████████████▍                                                           | 8369/18075 [1:35:23<1:50:03,  1.47it/s]<epoch:8, iter:8370, total_loss_g:304.8588, adv_g_loss:2.2852, feat_loss:299.7329, rec_loss:2.8386, commit_loss:0.0000, loss_d:1.0175>, d_weight: 1.0000
 46%|███████████████████████████████████████████████████▍                                                           | 8379/18075 [1:35:30<1:50:01,  1.47it/s]<epoch:8, iter:8380, total_loss_g:34873.7617, adv_g_loss:2.1054, feat_loss:34844.2266, rec_loss:27.4251, commit_loss:0.0000, loss_d:0.3069>, d_weight: 1.0000
 46%|███████████████████████████████████████████████████▌                                                           | 8389/18075 [1:35:37<1:50:02,  1.47it/s]<epoch:8, iter:8390, total_loss_g:40341.5039, adv_g_loss:2.2593, feat_loss:40214.2148, rec_loss:125.0235, commit_loss:0.0000, loss_d:0.6393>, d_weight: 1.0000
 46%|███████████████████████████████████████████████████▌                                                           | 8399/18075 [1:35:43<1:50:03,  1.47it/s]<epoch:8, iter:8400, total_loss_g:184210.6719, adv_g_loss:2.0305, feat_loss:184145.6875, rec_loss:62.9335, commit_loss:0.0000, loss_d:1.0710>, d_weight: 1.0000
 47%|███████████████████████████████████████████████████▋                                                           | 8409/18075 [1:35:49<1:48:46,  1.48it/s]<epoch:8, iter:8410, total_loss_g:1336.8246, adv_g_loss:2.1409, feat_loss:1317.9712, rec_loss:16.7082, commit_loss:0.0000, loss_d:0.9688>, d_weight: 1.0000
 47%|███████████████████████████████████████████████████▋                                                           | 8419/18075 [1:35:57<1:49:31,  1.47it/s]<epoch:8, iter:8420, total_loss_g:13977.8945, adv_g_loss:2.2973, feat_loss:13938.0557, rec_loss:37.5274, commit_loss:0.0000, loss_d:0.2749>, d_weight: 1.0000
 47%|███████████████████████████████████████████████████▊                                                           | 8429/18075 [1:36:04<1:49:48,  1.46it/s]<epoch:8, iter:8430, total_loss_g:3301.4082, adv_g_loss:2.1330, feat_loss:3262.6450, rec_loss:36.6189, commit_loss:0.0000, loss_d:0.6580>, d_weight: 1.0000
 47%|███████████████████████████████████████████████████▊                                                           | 8438/18075 [1:36:10<1:49:44,  1.46it/s]

The validation log is as follows:

2023-06-20-12-58: <epoch:0, total_loss_g_valid:155.6049, recon_loss_valid:21.3568, adversarial_loss_valid:1.6380, feature_loss_valid:132.6101, commit_loss_valid:0.0000, valid_loss_d:1.2365, best_epoch:0>
2023-06-20-16-30: <epoch:1, total_loss_g_valid:508.1316, recon_loss_valid:21.7350, adversarial_loss_valid:1.7627, feature_loss_valid:484.6339, commit_loss_valid:0.0000, valid_loss_d:1.0418, best_epoch:0>
2023-06-20-20-02: <epoch:2, total_loss_g_valid:302.2671, recon_loss_valid:20.5088, adversarial_loss_valid:2.1077, feature_loss_valid:279.6506, commit_loss_valid:0.0000, valid_loss_d:1.1599, best_epoch:2>
2023-06-20-23-34: <epoch:3, total_loss_g_valid:1090.3598, recon_loss_valid:20.4632, adversarial_loss_valid:2.0897, feature_loss_valid:1067.8068, commit_loss_valid:0.0000, valid_loss_d:0.9414, best_epoch:3>
2023-06-21-03-07: <epoch:4, total_loss_g_valid:1666.9553, recon_loss_valid:21.7679, adversarial_loss_valid:2.0294, feature_loss_valid:1643.1580, commit_loss_valid:0.0000, valid_loss_d:1.0660, best_epoch:3>
2023-06-21-06-39: <epoch:5, total_loss_g_valid:1438.0695, recon_loss_valid:21.1533, adversarial_loss_valid:2.1540, feature_loss_valid:1414.7622, commit_loss_valid:0.0000, valid_loss_d:1.1304, best_epoch:3>
2023-06-21-10-11: <epoch:6, total_loss_g_valid:918.1003, recon_loss_valid:21.4004, adversarial_loss_valid:2.1242, feature_loss_valid:894.5757, commit_loss_valid:0.0000, valid_loss_d:1.1136, best_epoch:3>
2023-06-21-13-43: <epoch:7, total_loss_g_valid:1691.1200, recon_loss_valid:20.3575, adversarial_loss_valid:2.1024, feature_loss_valid:1668.6601, commit_loss_valid:0.0000, valid_loss_d:0.9036, best_epoch:7>

How should the codes dimension be understood?

Thanks for open-sourcing this excellent work!

I want to confirm my understanding of the ordering of the output codes:
The VQVAE encode function outputs a tensor of shape [B, T, 4].
Suppose B=1, T=2, and the codes are
[[a,b,c,d]
[e,f,g,h]]

My reading:
a is the code from the first quantization of the first half of the T=1 feature,
b is the code from the first quantization of the second half of the T=1 feature,
c is the code obtained by quantizing the residual of a,
...

h is the code obtained by quantizing the residual of f.
Is this reading correct? (A sketch of this hypothesized ordering follows below.)

Thanks,
Puyuan
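
A minimal, self-contained sketch of the group-residual ordering hypothesized above, with 2 groups and 2 residual stages (the structure and codebooks here are assumptions for illustration, not the repo's code):

    import torch

    torch.manual_seed(0)
    D, K = 8, 16                              # feature dim, codebook size
    books = torch.randn(2, 2, K, D // 2)      # [group, stage, K, D/2], toy codebooks

    def encode_frame(feature):
        halves = [feature[: D // 2].clone(), feature[D // 2:].clone()]
        codes = []
        for s in range(2):                    # residual stage (outer loop)
            for g in range(2):                # group = half of the feature (inner loop)
                idx = torch.cdist(halves[g][None], books[g, s]).argmin()
                codes.append(int(idx))
                halves[g] -= books[g, s, idx] # pass the residual to the next stage
        return codes                          # [a, b, c, d] in the ordering asked about

    print(encode_frame(torch.randn(D)))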

Data Augmentation in SoundStream

Hi, thanks for your great work. I noticed that the NSynthDataset used in SoundStream performs data augmentation by adding two audio waveforms together, which does not appear in Encodec. I wonder where this technique was proposed, and have you found that it helps audio quality? Thanks.

It does not converge for VALL-E training

The model does not converge when I use HiFi-Codec to train the NAR stage of VALL-E. The data I used is a Chinese dataset with a duration of 5,000 hours. What can I do to train VALL-E with HiFi-Codec?

An error occurred while running SoundStream inference

The command:

python test.py "../datasets/Nsynth/nsynth-valid/audio/" "./audiofake" \
--resume_path "./model_path/2023-05-10-08-55/best_1.pth"

The error is as follows:
(soundstream) root@autodl-container-1cb1119f52-820c06c3:~/autodl-tmp/paper/SoundStream_24k_240d# bash test.sh
Traceback (most recent call last):
File "test.py", line 159, in
test_batch()
File "test.py", line 151, in test_batch
soundstream.load_state_dict(new_state_dict) # load model
File "/root/miniconda3/envs/soundstream/lib/python3.8/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for SoundStream:
size mismatch for encoder.model.0.conv.conv.bias: copying a param with shape torch.Size([32]) from checkpoint, the shape in current model is torch.Size([48]).
size mismatch for encoder.model.0.conv.conv.weight_g: copying a param with shape torch.Size([32, 1, 1]) from checkpoint, the shape in current model is torch.Size([48, 1, 1]).
size mismatch for encoder.model.0.conv.conv.weight_v: copying a param with shape torch.Size([32, 1, 7]) from checkpoint, the shape in current model is torch.Size([48, 1, 7]).
size mismatch for encoder.model.1.block.1.conv.conv.bias: copying a param with shape torch.Size([16]) from checkpoint, the shape in current model is torch.Size([24]).
size mismatch for encoder.model.1.block.1.conv.conv.weight_g: copying a param with shape torch.Size([16, 1, 1]) from checkpoint, the shape in current model is torch.Size([24, 1, 1]).
size mismatch for encoder.model.1.block.1.conv.conv.weight_v: copying a param with shape torch.Size([16, 32, 3]) from checkpoint, the shape in current model is torch.Size([24, 48, 3]).
size mismatch for encoder.model.1.block.3.conv.conv.bias: copying a param with shape torch.Size([32]) from checkpoint, the shape in current model is torch.Size([48]).
size mismatch for encoder.model.1.block.3.conv.conv.weight_g: copying a param with shape torch.Size([32, 1, 1]) from checkpoint, the shape in current model is torch.Size([48, 1, 1]).
size mismatch for encoder.model.1.block.3.conv.conv.weight_v: copying a param with shape torch.Size([32, 16, 1]) from checkpoint, the shape in current model is torch.Size([48, 24, 1]).
size mismatch for encoder.model.1.shortcut.conv.conv.bias: copying a param with shape torch.Size([32]) from checkpoint, the shape in current model is torch.Size([48]).
size mismatch for encoder.model.1.shortcut.conv.conv.weight_g: copying a param with shape torch.Size([32, 1, 1]) from checkpoint, the shape in current model is torch.Size([48, 1, 1]).
size mismatch for encoder.model.1.shortcut.conv.conv.weight_v: copying a param with shape torch.Size([32, 32, 1]) from checkpoint, the shape in current model is torch.Size([48, 48, 1]).
size mismatch for encoder.model.3.conv.conv.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([96]).
size mismatch for encoder.model.3.conv.conv.weight_g: copying a param with shape torch.Size([64, 1, 1]) from checkpoint, the shape in current model is torch.Size([96, 1, 1]).
size mismatch for encoder.model.3.conv.conv.weight_v: copying a param with shape torch.Size([64, 32, 4]) from checkpoint, the shape in current model is torch.Size([96, 48, 4]).
size mismatch for encoder.model.4.block.1.conv.conv.bias: copying a param with shape torch.Size([32]) from checkpoint, the shape in current model is torch.Size([48]).
size mismatch for encoder.model.4.block.1.conv.conv.weight_g: copying a param with shape torch.Size([32, 1, 1]) from checkpoint, the shape in current model is torch.Size([48, 1, 1]).
size mismatch for encoder.model.4.block.1.conv.conv.weight_v: copying a param with shape torch.Size([32, 64, 3]) from checkpoint, the shape in current model is torch.Size([48, 96, 3]).
size mismatch for encoder.model.4.block.3.conv.conv.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([96]).
size mismatch for encoder.model.4.block.3.conv.conv.weight_g: copying a param with shape torch.Size([64, 1, 1]) from checkpoint, the shape in current model is torch.Size([96, 1, 1]).
size mismatch for encoder.model.4.block.3.conv.conv.weight_v: copying a param with shape torch.Size([64, 32, 1]) from checkpoint, the shape in current model is torch.Size([96, 48, 1]).
size mismatch for encoder.model.4.shortcut.conv.conv.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([96]).
size mismatch for encoder.model.4.shortcut.conv.conv.weight_g: copying a param with shape torch.Size([64, 1, 1]) from checkpoint, the shape in current model is torch.Size([96, 1, 1]).
size mismatch for encoder.model.4.shortcut.conv.conv.weight_v: copying a param with shape torch.Size([64, 64, 1]) from checkpoint, the shape in current model is torch.Size([96, 96, 1]).
size mismatch for encoder.model.6.conv.conv.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([192]).
size mismatch for encoder.model.6.conv.conv.weight_g: copying a param with shape torch.Size([128, 1, 1]) from checkpoint, the shape in current model is torch.Size([192, 1, 1]).
size mismatch for encoder.model.6.conv.conv.weight_v: copying a param with shape torch.Size([128, 64, 8]) from checkpoint, the shape in current model is torch.Size([192, 96, 8]).
size mismatch for encoder.model.7.block.1.conv.conv.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([96]).
size mismatch for encoder.model.7.block.1.conv.conv.weight_g: copying a param with shape torch.Size([64, 1, 1]) from checkpoint, the shape in current model is torch.Size([96, 1, 1]).
size mismatch for encoder.model.7.block.1.conv.conv.weight_v: copying a param with shape torch.Size([64, 128, 3]) from checkpoint, the shape in current model is torch.Size([96, 192, 3]).
size mismatch for encoder.model.7.block.3.conv.conv.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([192]).
size mismatch for encoder.model.7.block.3.conv.conv.weight_g: copying a param with shape torch.Size([128, 1, 1]) from checkpoint, the shape in current model is torch.Size([192, 1, 1]).
size mismatch for encoder.model.7.block.3.conv.conv.weight_v: copying a param with shape torch.Size([128, 64, 1]) from checkpoint, the shape in current model is torch.Size([192, 96, 1]).
size mismatch for encoder.model.7.shortcut.conv.conv.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([192]).
size mismatch for encoder.model.7.shortcut.conv.conv.weight_g: copying a param with shape torch.Size([128, 1, 1]) from checkpoint, the shape in current model is torch.Size([192, 1, 1]).
size mismatch for encoder.model.7.shortcut.conv.conv.weight_v: copying a param with shape torch.Size([128, 128, 1]) from checkpoint, the shape in current model is torch.Size([192, 192, 1]).
size mismatch for encoder.model.9.conv.conv.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([384]).
size mismatch for encoder.model.9.conv.conv.weight_g: copying a param with shape torch.Size([256, 1, 1]) from checkpoint, the shape in current model is torch.Size([384, 1, 1]).
size mismatch for encoder.model.9.conv.conv.weight_v: copying a param with shape torch.Size([256, 128, 10]) from checkpoint, the shape in current model is torch.Size([384, 192, 10]).
size mismatch for encoder.model.10.block.1.conv.conv.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([192]).
size mismatch for encoder.model.10.block.1.conv.conv.weight_g: copying a param with shape torch.Size([128, 1, 1]) from checkpoint, the shape in current model is torch.Size([192, 1, 1]).
size mismatch for encoder.model.10.block.1.conv.conv.weight_v: copying a param with shape torch.Size([128, 256, 3]) from checkpoint, the shape in current model is torch.Size([192, 384, 3]).
size mismatch for encoder.model.10.block.3.conv.conv.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([384]).
size mismatch for encoder.model.10.block.3.conv.conv.weight_g: copying a param with shape torch.Size([256, 1, 1]) from checkpoint, the shape in current model is torch.Size([384, 1, 1]).
size mismatch for encoder.model.10.block.3.conv.conv.weight_v: copying a param with shape torch.Size([256, 128, 1]) from checkpoint, the shape in current model is torch.Size([384, 192, 1]).
size mismatch for encoder.model.10.shortcut.conv.conv.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([384]).
size mismatch for encoder.model.10.shortcut.conv.conv.weight_g: copying a param with shape torch.Size([256, 1, 1]) from checkpoint, the shape in current model is torch.Size([384, 1, 1]).
size mismatch for encoder.model.10.shortcut.conv.conv.weight_v: copying a param with shape torch.Size([256, 256, 1]) from checkpoint, the shape in current model is torch.Size([384, 384, 1]).
size mismatch for encoder.model.12.conv.conv.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([768]).
size mismatch for encoder.model.12.conv.conv.weight_g: copying a param with shape torch.Size([512, 1, 1]) from checkpoint, the shape in current model is torch.Size([768, 1, 1]).
size mismatch for encoder.model.12.conv.conv.weight_v: copying a param with shape torch.Size([512, 256, 12]) from checkpoint, the shape in current model is torch.Size([768, 384, 12]).
size mismatch for encoder.model.13.lstm.weight_ih_l0: copying a param with shape torch.Size([2048, 512]) from checkpoint, the shape in current model is torch.Size([3072, 768]).
size mismatch for encoder.model.13.lstm.weight_hh_l0: copying a param with shape torch.Size([2048, 512]) from checkpoint, the shape in current model is torch.Size([3072, 768]).
size mismatch for encoder.model.13.lstm.bias_ih_l0: copying a param with shape torch.Size([2048]) from checkpoint, the shape in current model is torch.Size([3072]).
size mismatch for encoder.model.13.lstm.bias_hh_l0: copying a param with shape torch.Size([2048]) from checkpoint, the shape in current model is torch.Size([3072]).
size mismatch for encoder.model.13.lstm.weight_ih_l1: copying a param with shape torch.Size([2048, 512]) from checkpoint, the shape in current model is torch.Size([3072, 768]).
size mismatch for encoder.model.13.lstm.weight_hh_l1: copying a param with shape torch.Size([2048, 512]) from checkpoint, the shape in current model is torch.Size([3072, 768]).
size mismatch for encoder.model.13.lstm.bias_ih_l1: copying a param with shape torch.Size([2048]) from checkpoint, the shape in current model is torch.Size([3072]).
size mismatch for encoder.model.13.lstm.bias_hh_l1: copying a param with shape torch.Size([2048]) from checkpoint, the shape in current model is torch.Size([3072]).
size mismatch for encoder.model.15.conv.conv.weight_v: copying a param with shape torch.Size([512, 512, 7]) from checkpoint, the shape in current model is torch.Size([512, 768, 7]).
size mismatch for decoder.model.0.conv.conv.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([768]).
size mismatch for decoder.model.0.conv.conv.weight_g: copying a param with shape torch.Size([512, 1, 1]) from checkpoint, the shape in current model is torch.Size([768, 1, 1]).
size mismatch for decoder.model.0.conv.conv.weight_v: copying a param with shape torch.Size([512, 512, 7]) from checkpoint, the shape in current model is torch.Size([768, 512, 7]).
size mismatch for decoder.model.1.lstm.weight_ih_l0: copying a param with shape torch.Size([2048, 512]) from checkpoint, the shape in current model is torch.Size([3072, 768]).
size mismatch for decoder.model.1.lstm.weight_hh_l0: copying a param with shape torch.Size([2048, 512]) from checkpoint, the shape in current model is torch.Size([3072, 768]).
size mismatch for decoder.model.1.lstm.bias_ih_l0: copying a param with shape torch.Size([2048]) from checkpoint, the shape in current model is torch.Size([3072]).
size mismatch for decoder.model.1.lstm.bias_hh_l0: copying a param with shape torch.Size([2048]) from checkpoint, the shape in current model is torch.Size([3072]).
size mismatch for decoder.model.1.lstm.weight_ih_l1: copying a param with shape torch.Size([2048, 512]) from checkpoint, the shape in current model is torch.Size([3072, 768]).
size mismatch for decoder.model.1.lstm.weight_hh_l1: copying a param with shape torch.Size([2048, 512]) from checkpoint, the shape in current model is torch.Size([3072, 768]).
size mismatch for decoder.model.1.lstm.bias_ih_l1: copying a param with shape torch.Size([2048]) from checkpoint, the shape in current model is torch.Size([3072]).
size mismatch for decoder.model.1.lstm.bias_hh_l1: copying a param with shape torch.Size([2048]) from checkpoint, the shape in current model is torch.Size([3072]).
size mismatch for decoder.model.3.convtr.convtr.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([384]).
size mismatch for decoder.model.3.convtr.convtr.weight_g: copying a param with shape torch.Size([512, 1, 1]) from checkpoint, the shape in current model is torch.Size([768, 1, 1]).
size mismatch for decoder.model.3.convtr.convtr.weight_v: copying a param with shape torch.Size([512, 256, 12]) from checkpoint, the shape in current model is torch.Size([768, 384, 12]).
size mismatch for decoder.model.4.block.1.conv.conv.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([192]).
size mismatch for decoder.model.4.block.1.conv.conv.weight_g: copying a param with shape torch.Size([128, 1, 1]) from checkpoint, the shape in current model is torch.Size([192, 1, 1]).
size mismatch for decoder.model.4.block.1.conv.conv.weight_v: copying a param with shape torch.Size([128, 256, 3]) from checkpoint, the shape in current model is torch.Size([192, 384, 3]).
size mismatch for decoder.model.4.block.3.conv.conv.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([384]).
size mismatch for decoder.model.4.block.3.conv.conv.weight_g: copying a param with shape torch.Size([256, 1, 1]) from checkpoint, the shape in current model is torch.Size([384, 1, 1]).
size mismatch for decoder.model.4.block.3.conv.conv.weight_v: copying a param with shape torch.Size([256, 128, 1]) from checkpoint, the shape in current model is torch.Size([384, 192, 1]).
size mismatch for decoder.model.4.shortcut.conv.conv.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([384]).
size mismatch for decoder.model.4.shortcut.conv.conv.weight_g: copying a param with shape torch.Size([256, 1, 1]) from checkpoint, the shape in current model is torch.Size([384, 1, 1]).
size mismatch for decoder.model.4.shortcut.conv.conv.weight_v: copying a param with shape torch.Size([256, 256, 1]) from checkpoint, the shape in current model is torch.Size([384, 384, 1]).
size mismatch for decoder.model.6.convtr.convtr.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([192]).
size mismatch for decoder.model.6.convtr.convtr.weight_g: copying a param with shape torch.Size([256, 1, 1]) from checkpoint, the shape in current model is torch.Size([384, 1, 1]).
size mismatch for decoder.model.6.convtr.convtr.weight_v: copying a param with shape torch.Size([256, 128, 10]) from checkpoint, the shape in current model is torch.Size([384, 192, 10]).
size mismatch for decoder.model.7.block.1.conv.conv.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([96]).
size mismatch for decoder.model.7.block.1.conv.conv.weight_g: copying a param with shape torch.Size([64, 1, 1]) from checkpoint, the shape in current model is torch.Size([96, 1, 1]).
size mismatch for decoder.model.7.block.1.conv.conv.weight_v: copying a param with shape torch.Size([64, 128, 3]) from checkpoint, the shape in current model is torch.Size([96, 192, 3]).
size mismatch for decoder.model.7.block.3.conv.conv.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([192]).
size mismatch for decoder.model.7.block.3.conv.conv.weight_g: copying a param with shape torch.Size([128, 1, 1]) from checkpoint, the shape in current model is torch.Size([192, 1, 1]).
size mismatch for decoder.model.7.block.3.conv.conv.weight_v: copying a param with shape torch.Size([128, 64, 1]) from checkpoint, the shape in current model is torch.Size([192, 96, 1]).
size mismatch for decoder.model.7.shortcut.conv.conv.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([192]).
size mismatch for decoder.model.7.shortcut.conv.conv.weight_g: copying a param with shape torch.Size([128, 1, 1]) from checkpoint, the shape in current model is torch.Size([192, 1, 1]).
size mismatch for decoder.model.7.shortcut.conv.conv.weight_v: copying a param with shape torch.Size([128, 128, 1]) from checkpoint, the shape in current model is torch.Size([192, 192, 1]).
size mismatch for decoder.model.9.convtr.convtr.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([96]).
size mismatch for decoder.model.9.convtr.convtr.weight_g: copying a param with shape torch.Size([128, 1, 1]) from checkpoint, the shape in current model is torch.Size([192, 1, 1]).
size mismatch for decoder.model.9.convtr.convtr.weight_v: copying a param with shape torch.Size([128, 64, 8]) from checkpoint, the shape in current model is torch.Size([192, 96, 8]).
size mismatch for decoder.model.10.block.1.conv.conv.bias: copying a param with shape torch.Size([32]) from checkpoint, the shape in current model is torch.Size([48]).
size mismatch for decoder.model.10.block.1.conv.conv.weight_g: copying a param with shape torch.Size([32, 1, 1]) from checkpoint, the shape in current model is torch.Size([48, 1, 1]).
size mismatch for decoder.model.10.block.1.conv.conv.weight_v: copying a param with shape torch.Size([32, 64, 3]) from checkpoint, the shape in current model is torch.Size([48, 96, 3]).
size mismatch for decoder.model.10.block.3.conv.conv.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([96]).
size mismatch for decoder.model.10.block.3.conv.conv.weight_g: copying a param with shape torch.Size([64, 1, 1]) from checkpoint, the shape in current model is torch.Size([96, 1, 1]).
size mismatch for decoder.model.10.block.3.conv.conv.weight_v: copying a param with shape torch.Size([64, 32, 1]) from checkpoint, the shape in current model is torch.Size([96, 48, 1]).
size mismatch for decoder.model.10.shortcut.conv.conv.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([96]).
size mismatch for decoder.model.10.shortcut.conv.conv.weight_g: copying a param with shape torch.Size([64, 1, 1]) from checkpoint, the shape in current model is torch.Size([96, 1, 1]).
size mismatch for decoder.model.10.shortcut.conv.conv.weight_v: copying a param with shape torch.Size([64, 64, 1]) from checkpoint, the shape in current model is torch.Size([96, 96, 1]).
size mismatch for decoder.model.12.convtr.convtr.bias: copying a param with shape torch.Size([32]) from checkpoint, the shape in current model is torch.Size([48]).
size mismatch for decoder.model.12.convtr.convtr.weight_g: copying a param with shape torch.Size([64, 1, 1]) from checkpoint, the shape in current model is torch.Size([96, 1, 1]).
size mismatch for decoder.model.12.convtr.convtr.weight_v: copying a param with shape torch.Size([64, 32, 4]) from checkpoint, the shape in current model is torch.Size([96, 48, 4]).
size mismatch for decoder.model.13.block.1.conv.conv.bias: copying a param with shape torch.Size([16]) from checkpoint, the shape in current model is torch.Size([24]).
size mismatch for decoder.model.13.block.1.conv.conv.weight_g: copying a param with shape torch.Size([16, 1, 1]) from checkpoint, the shape in current model is torch.Size([24, 1, 1]).
size mismatch for decoder.model.13.block.1.conv.conv.weight_v: copying a param with shape torch.Size([16, 32, 3]) from checkpoint, the shape in current model is torch.Size([24, 48, 3]).
size mismatch for decoder.model.13.block.3.conv.conv.bias: copying a param with shape torch.Size([32]) from checkpoint, the shape in current model is torch.Size([48]).
size mismatch for decoder.model.13.block.3.conv.conv.weight_g: copying a param with shape torch.Size([32, 1, 1]) from checkpoint, the shape in current model is torch.Size([48, 1, 1]).
size mismatch for decoder.model.13.block.3.conv.conv.weight_v: copying a param with shape torch.Size([32, 16, 1]) from checkpoint, the shape in current model is torch.Size([48, 24, 1]).
size mismatch for decoder.model.13.shortcut.conv.conv.bias: copying a param with shape torch.Size([32]) from checkpoint, the shape in current model is torch.Size([48]).
size mismatch for decoder.model.13.shortcut.conv.conv.weight_g: copying a param with shape torch.Size([32, 1, 1]) from checkpoint, the shape in current model is torch.Size([48, 1, 1]).
size mismatch for decoder.model.13.shortcut.conv.conv.weight_v: copying a param with shape torch.Size([32, 32, 1]) from checkpoint, the shape in current model is torch.Size([48, 48, 1]).
size mismatch for decoder.model.15.conv.conv.weight_v: copying a param with shape torch.Size([1, 32, 7]) from checkpoint, the shape in current model is torch.Size([1, 48, 7]).

Import problem in the custom module distributed/launch.py

When running egs/SoundStream_24k_240d/main3_ddp.py, execution reaches line 9, which imports the custom module academicodec/models/encodec/distributed/launch.py, and launch.py then raises an error at its line 5 saying the module cannot be found.

The fix is simply to rewrite line 5 of launch.py as from . import distributed as dist_fn.
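
For reference, the one-line change (the original absolute import on line 5 is assumed):

    # academicodec/models/encodec/distributed/launch.py, line 5
    # before (fails because 'distributed' is not a top-level module):
    # import distributed as dist_fn
    # after (a relative import resolves the sibling module):
    from . import distributed as dist_fn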

Are Encodec_24k_32d and Encodec_16k_320 actually SoundStream?

Hi, dongchao,
I have recently been surveying the AudioLM line of work and found that your SoundStorm reproduction is relatively complete, so I plan to build on it (since https://github.com/yangdongchao/SoundStorm currently only has S2, not S1). I then came across the AcademiCodec repository. Looking at the test.py and the training file main3_ddp.py of Encodec_24k_32d and Encodec_16k_320, I found that the model they load is SoundStream:

from net3 import SoundStream

So are these two essentially SoundStream models, and is only Encodec_24k_240d the actual EnCodec model?

License missing

There is no LICENSE file.
What is the license for this project and the pretrained models?

Release the pretrained discriminator?

Thanks for open-sourcing your wonderful work!
I was trying to finetune on my own dataset; however, I found that only a pretrained generator is provided, with no pretrained discriminator.
So, would you please release your pretrained discriminator?
Thanks!

Error in "DiscriminatorSTFT"

Line 122: z = self.spec_transform(x) # [B, 2, Freq, Frames, 2]

But when I try to train the model, z has shape torch.Size([8, 1, 513, 43, 2]); the second dim is 1, not 2.
An error then occurs when running z = torch.cat([z.real, z.imag], dim=1):

RuntimeError: real is not implemented for tensors with non-complex dtypes.
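
A hedged sketch of this failure mode (assuming spec_transform returns a real tensor whose trailing dim of size 2 packs [real, imag], e.g. the pre-complex torch.stft output, which is exactly what makes z.real raise):

    import torch

    def split_real_imag(z: torch.Tensor) -> torch.Tensor:
        if not torch.is_complex(z):
            # Rebuild a complex view so z.real / z.imag work as the code expects.
            z = torch.view_as_complex(z.contiguous())
        return torch.cat([z.real, z.imag], dim=1)

    x = torch.randn(8, 22050)
    spec = torch.stft(x, n_fft=1024, hop_length=512,
                      window=torch.hann_window(1024), return_complex=True)
    z = torch.view_as_real(spec).unsqueeze(1)   # [8, 1, 513, Frames, 2] as reported
    print(split_real_imag(z).shape)             # torch.Size([8, 2, 513, Frames])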

Is this a work in progress?

Hello! Do you have any plans to upload the code for SoundStream and Encodec in the near future? Thanks in advance!

About VQ in the Encodec model

I found that this Encodec project does not use an LM over the codebook, unlike Facebook's EnCodec. Have you made any attempts at this?

Validation set

Hello,
I could not figure out which validation set was used for the results in the paper.

Could you help?
