- Streamable high-fidelity audio codec for 48 kHz mono speech with 12.8 kbps bitrate.
- Very low decoding latency on GPU (~6 ms) and CPU (~10 ms with 4 threads).
- Efficient two-stage training (with the pre-trained models, training an encoder for new applications takes only a few hours).
A good audio codec for live applications such as telecommunication is characterized by three key properties: (1) compression, i.e., the bitrate required to transmit the signal should be as low as possible; (2) latency, i.e., encoding and decoding the signal must be fast enough to enable communication with no, or only minimal, noticeable delay; and (3) reconstruction quality of the signal. In this work, we propose an open-source, streamable, and real-time neural audio codec that achieves strong performance along all three axes: it can reconstruct highly natural-sounding 48 kHz speech signals while operating at only 12 kbps and running with less than 6 ms (GPU) / 10 ms (CPU) latency. An efficient training paradigm for developing such neural audio codecs for real-world scenarios is also demonstrated. [paper] [demo]
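For intuition, the headline bitrate follows directly from the frame rate and the quantizer size. A back-of-the-envelope check, assuming the hop size of 300 samples implied by the config names and the 8 residual-VQ codebooks of 1024 entries each described in the paper:

```python
# Bitrate sanity check (assumptions: hop size 300, 8 RVQ codebooks,
# 1024 entries each, i.e. 10 bits per codebook per frame).
fs = 48000                          # sampling rate (Hz)
hop = 300                           # samples per encoded frame
frames_per_second = fs / hop        # 160 frames/s
bits_per_frame = 8 * 10             # 8 codebooks x log2(1024) bits
print(frames_per_second * bits_per_frame / 1000, "kbps")  # -> 12.8
```

The same arithmetic gives 6.4 kbps for the 24 kHz LibriTTS model listed in the tables below.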
- AutoEncoder (symmetric AudioDec, symAD)
1-1. Train an AutoEncoder-based codec model from scratch with only metric loss(es) for the first 200k iterations.
1-2. Fix the encoder, projector, quantizer, and codebook, and train the decoder with the discriminators for the following 500k iterations.
- AutoEncoder + Vocoder (AD v0, v1, v2) (recommended!)
2-1. Extract the stats (global mean and variance) of the codes extracted by the trained Encoder.
2-2. Train the vocoder with the trained Encoder and stats for 500k iterations.
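The two stages divide the work between reconstruction and adversarial training. Below is a minimal PyTorch sketch of that division, with stand-in modules rather than the repository's actual encoder/projector/quantizer/decoder/discriminator implementations:

```python
import torch
import torch.nn as nn

# Stand-in modules; the real implementations live in models/ and layers/.
encoder = nn.Conv1d(1, 64, 7, padding=3)
projector = nn.Conv1d(64, 64, 1)
quantizer = nn.Identity()                 # placeholder for the RVQ codebooks
decoder = nn.Conv1d(64, 1, 7, padding=3)
discriminator = nn.Conv1d(1, 1, 15, padding=7)
metric_loss = nn.L1Loss()                 # placeholder for the metric loss(es)

x = torch.randn(4, 1, 4800)               # dummy waveform batch

# Stage 1-1: train the whole autoencoder with metric loss(es) only.
opt = torch.optim.Adam(
    [*encoder.parameters(), *projector.parameters(), *decoder.parameters()])
y = decoder(quantizer(projector(encoder(x))))
loss = metric_loss(y, x)
opt.zero_grad(); loss.backward(); opt.step()

# Stage 1-2: freeze encoder/projector/quantizer (the codes are now fixed)
# and train only the decoder against the discriminator.
for p in [*encoder.parameters(), *projector.parameters()]:
    p.requires_grad_(False)
opt_g = torch.optim.Adam(decoder.parameters())
opt_d = torch.optim.Adam(discriminator.parameters())
with torch.no_grad():
    z = quantizer(projector(encoder(x)))
y = decoder(z)
g_loss = (1 - discriminator(y)).pow(2).mean() + metric_loss(y, x)
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
d_loss = discriminator(y.detach()).pow(2).mean() \
       + (1 - discriminator(x)).pow(2).mean()
opt_d.zero_grad(); d_loss.backward(); opt_d.step()
```

Fixing the encoder, projector, and quantizer after stage 1-1 is what keeps the later stages cheap: the code statistics stay stable, so a vocoder can be trained (or swapped) without retraining the codec itself.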
- 2023/05/13: First version released.
This repository has been tested on Ubuntu 20.04 with a V100 GPU and the following settings.
- Python 3.8+
- CUDA 11.0+
- PyTorch 1.10+
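A quick environment check against these requirements (standard Python/PyTorch calls only):

```python
import sys
import torch

# Confirm the versions listed above before installing/running.
print("Python :", sys.version.split()[0])      # expect 3.8+
print("PyTorch:", torch.__version__)           # expect 1.10+
print("CUDA   :", torch.version.cuda)          # expect 11.0+
print("GPU available:", torch.cuda.is_available())
```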
- bin: Templates for training, stats extraction, testing, and streaming.
- config: Config files (.yaml).
- dataloader: Source code for data loading.
- exp: Folder for saving models.
- layers: Source code for basic neural layers.
- losses: Source code for loss functions.
- models: Source code for models.
- slurmlogs: Folder for saving slurm logs.
- stats: Folder for saving stats.
- trainer: Source code for trainers.
- utils: Source code for the demo utilities.
- Please download the whole exp folder and put it in the AudioDec project directory.
- Get the list of all I/O devices
$ python -m sounddevice
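If you prefer to pick the device indices programmatically, the same list is available through sounddevice's Python API (the "USB" substring below is only an example; match it to your hardware):

```python
import sounddevice as sd

# Print the full device table, then find candidate I/O devices by name.
print(sd.query_devices())
for idx, dev in enumerate(sd.query_devices()):
    if "USB" in dev["name"]:   # example filter; adjust to your devices
        print(idx, dev["name"],
              "in:", dev["max_input_channels"],
              "out:", dev["max_output_channels"])
```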
- Run the demo
# The LibriTTS model is recommended for arbitrary microphones because of its robustness to microphone channel mismatches.
# Set up the I/O devices according to the device list obtained above
# w/ GPU
$ python demoStream.py --tx_cuda 0 --rx_cuda 0 --input_device 1 --output_device 4 --model libritts_v1
# w/ CPU
$ python demoStream.py --tx_cuda -1 --rx_cuda -1 --input_device 1 --output_device 4 --model libritts_sym
# The input and output audio will be dumped to input.wav and output.wav
- Please download the whole exp folder and put it in the AudioDec project directory.
- Run the demo
# VCTK 48 kHz models
$ python demoFile.py --model vctk_v1 -i xxx.wav -o ooo.wav
# LibriTTS 24 kHz model
$ python demoFile.py --model libritts_v1 -i xxx.wav -o ooo.wav
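The models run at fixed sampling rates (48 kHz for VCTK, 24 kHz for LibriTTS), so resampling inputs to the matching rate is a safe preprocessing step. A small sketch for mono input, assuming soundfile and librosa are installed (neither is implied to be a dependency of this repo; filenames are placeholders):

```python
import soundfile as sf
import librosa

# Resample an arbitrary mono wav to the model's rate before demoFile.py.
target_sr = 48000                      # use 24000 for the LibriTTS model
y, sr = sf.read("input_any_rate.wav")  # placeholder filename
if sr != target_sr:
    y = librosa.resample(y, orig_sr=sr, target_sr=target_sr)
sf.write("input_resampled.wav", y, target_sr)
```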
- Prepare the training/validation/test utterances and put them in three different folders
(ex: corpus/train, corpus/dev, and corpus/test)
- Modify the paths (ex: /mnt/home/xxx/datasets) in
  submit_codec_vctk.sh
  config/autoencoder/symAD_vctk_48000_hop300.yaml
  config/statistic/symAD_vctk_48000_hop300_clean.yaml
  config/vocoder/AudioDec_v1_symAD_vctk_48000_hop300_clean.yaml
- Assign the corresponding analyzer and stats in
  config/statistic/symAD_vctk_48000_hop300_clean.yaml
  config/vocoder/AudioDec_v1_symAD_vctk_48000_hop300_clean.yaml
  (a sanity-check sketch for these fields follows the commands below)
- Follow the usage instructions in submit_codec_vctk.sh to run the training and testing
# stage 0: training autoencoder from scratch
# stage 1: extracting statistics
# stage 2: training vocoder from scratch
# stage 3: testing (symAD)
# stage 4: testing (AE + Vocoder)
# Run stages 0-4
$ bash submit_codec_vctk.sh --start 0 --stop 4 \
--autoencoder "autoencoder/symAD_vctk_48000_hop300" \
--statistic "statistic/symAD_vctk_48000_hop300_clean" \
--vocoder "vocoder/AudioDec_v1_symAD_vctk_48000_hop300_clean"
- Prepare the training/validation/test utterances and modify the paths
- Follow the usage instructions in submit_autoencoder.sh to run the training and testing
# Train AutoEncoder from scratch
$ bash submit_autoencoder.sh --stage 0 \
--tag_name "autoencoder/symAD_vctk_48000_hop300"
# Resume AutoEncoder from previous iterations
$ bash submit_autoencoder.sh --stage 1 \
--tag_name "autoencoder/symAD_vctk_48000_hop300" \
--resumepoint 200000
# Test AutoEncoder
$ bash submit_autoencoder.sh --stage 2 \
--tag_name "autoencoder/symAD_vctk_48000_hop300" \
--subset "clean_test"
All pre-trained models can be accessed via the exp folder (only the generators are provided).
| AutoEncoder | Corpus | Fs | Bitrate | Path |
|---|---|---|---|---|
| symAD | VCTK | 48 kHz | 12.8 kbps | exp/autoencoder/symAD_vctk_48000_hop300 |
| symAD_univ | VCTK | 48 kHz | 12.8 kbps | exp/autoencoder/symADuniv_vctk_48000_hop300 |
| symAD | LibriTTS | 24 kHz | 6.4 kbps | exp/autoencoder/symAD_libritts_24000_hop300 |
| Vocoder | Corpus | Fs | Path |
|---|---|---|---|
| AD v0 | VCTK | 48 kHz | exp/vocoder/AudioDec_v0_symAD_vctk_48000_hop300_clean |
| AD v1 | VCTK | 48 kHz | exp/vocoder/AudioDec_v1_symAD_vctk_48000_hop300_clean |
| AD v2 | VCTK | 48 kHz | exp/vocoder/AudioDec_v2_symAD_vctk_48000_hop300_clean |
| AD_univ | VCTK | 48 kHz | exp/vocoder/AudioDec_v3_symADuniv_vctk_48000_hop300_clean |
| AD v1 | LibriTTS | 24 kHz | exp/vocoder/AudioDec_v1_symAD_libritts_24000_hop300_clean |
- Denoising is easy to set up: simply update the encoder using noisy-clean pairs while keeping the decoder/vocoder unchanged (a sketch of this idea follows the commands below).
- Prepare the noisy-clean corpus and follow the usage instructions in submit_denoise.sh to run the training and testing
# Update the Encoder for denoising
$ bash submit_denoise.sh --stage 0 \
--tag_name "denoise/symAD_vctk_48000_hop300"
# Denoise
$ bash submit_denoise.sh --stage 2 \
--encoder "denoise/symAD_vctk_48000_hop300" \
--decoder "vocoder/AudioDec_v1_symAD_vctk_48000_hop300_clean" \
--encoder_checkpoint 200000 \
--decoder_checkpoint 500000 \
--subset "noisy_test"
# Stream demo w/ GPU
$ python demoStream.py --tx_cuda 0 --rx_cuda 0 --input_device 1 --output_device 4 --model vctk_denoise
# Codec demo w/ files
$ python demoFile.py -i xxx.wav -o ooo.wav --model vctk_denoise
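In code form, the denoising recipe amounts to fine-tuning only the encoder on noisy input against clean targets, with everything downstream frozen. A minimal PyTorch sketch with stand-in modules (not the repository's trainer; the projector is omitted for brevity):

```python
import torch
import torch.nn as nn

# Stand-ins for the pre-trained AudioDec components.
encoder = nn.Conv1d(1, 64, 7, padding=3)    # fine-tuned on noisy input
quantizer = nn.Identity()                   # frozen RVQ placeholder
decoder = nn.Conv1d(64, 1, 7, padding=3)    # frozen decoder/vocoder
for p in decoder.parameters():
    p.requires_grad_(False)

opt = torch.optim.Adam(encoder.parameters())
noisy = torch.randn(4, 1, 4800)             # dummy noisy-clean pair
clean = torch.randn(4, 1, 4800)

recon = decoder(quantizer(encoder(noisy)))  # decode codes of the noisy input
loss = nn.functional.l1_loss(recon, clean)  # pull output toward clean target
opt.zero_grad(); loss.backward(); opt.step()
```

Because only the encoder is updated, the codes stay compatible with the unchanged decoder/vocoder, which is why the same pre-trained checkpoints can be reused above.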
If you find the code helpful, please cite the following article.
@INPROCEEDINGS{10096509,
author={Wu, Yi-Chiao and Gebru, Israel D. and Marković, Dejan and Richard, Alexander},
booktitle={ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
title={{A}udio{D}ec: An Open-Source Streaming High-Fidelity Neural Audio Codec},
year={2023},
doi={10.1109/ICASSP49357.2023.10096509}}
The AudioDec repository was developed based on the following repositories: ParallelWaveGAN, vector-quantize-pytorch, HiFi-GAN, and wavenet_vocoder (see the license note below).
The majority of "AudioDec: An Open-source Streaming High-fidelity Neural Audio Codec" is licensed under CC-BY-NC; however, portions of the project are available under separate license terms: https://github.com/kan-bayashi/ParallelWaveGAN, https://github.com/lucidrains/vector-quantize-pytorch, https://github.com/jik876/hifi-gan, and https://github.com/r9y9/wavenet_vocoder are licensed under the MIT license.