sibozhang / text2video

ICASSP 2022: "Text2Video: text-driven talking-head video synthesis with phonetic dictionary".

Home Page: https://sites.google.com/view/sibozhang/text2video

Topics: vid2vid, video, gan, metaverse, deep-learning, avatar, virtual-humans, aigc, digital-humanities, generative-ai

text2video's Introduction

Text2Video

This is code for ICASSP 2022: "Text2Video: Text-driven Talking-head Video Synthesis with Phonetic Dictionary". Project Page

ICASSP 2022: Text2Video

Introduction

With the advance of deep learning technology, automatic video generation from audio or text has become an emerging and promising research topic. In this paper, we present a novel approach to synthesize video from text. The method builds a phoneme-pose dictionary and trains a generative adversarial network (GAN) to generate video from interpolated phoneme poses. Compared to audio-driven video generation algorithms, our approach has a number of advantages: 1) It only needs a fraction of the training data used by an audio-driven approach; 2) It is more flexible and not subject to vulnerability due to speaker variation; 3) It significantly reduces the preprocessing, training and inference time. We perform extensive experiments to compare the proposed method with state-of-the-art talking face generation methods on a benchmark dataset and datasets of our own. The results demonstrate the effectiveness and superiority of our approach.

Data / Preprocessing

Set up

  1. Clone the repo
git clone git@github.com:sibozhang/Text2Video.git
  2. Download and install the modified vid2vid repo: vid2vid

  3. Download the trained model

Please create a 'checkpoints' folder inside the vid2vid folder and put the trained model in it.

VidTIMIT fadg0 (English, Female)

Dropbox: https://www.dropbox.com/sh/lk6et49v2uyfzjx/AADAFAp02_b3FQchaYxOZ0EMa?dl=0

Baidu Cloud link: https://pan.baidu.com/s/1SSkMKOK9LhClW2JvDCSiLg?pwd=bevj Extraction code: bevj

Xuesong (Chinese, Male)

Dropbox: https://www.dropbox.com/sh/qz3zoma5ac9mw5p/AAARiR8xKvATN4CBSyjWt_uOa?dl=0

Baidu Cloud link: https://pan.baidu.com/s/1DvuBbThYo4n5RIZsc-92rg?pwd=am7d Extraction code: am7d

  4. Prepare the data and folders in the following structure

    Text2Video
    ├── *phoneme_data
    ├── model
    ├── ...
    vid2vid
    ├── ...
    venv
    ├── vid2vid
    
  5. Set up the environment

sudo apt-get install sox libsox-fmt-mp3
pip install zhon
pip install moviepy
pip install ffmpeg
pip install dominate
pip install pydub

For Chinese, we use Vosk to get the timestamp of each word. Please download a Vosk model from https://alphacephei.com/vosk/install and unpack it as 'model' in the current folder, then install:

pip install vosk
pip install cn2an
pip install pypinyin

Testing

  1. Activate the virtual environment vid2vid
source ../venv/vid2vid/bin/activate
  2. Generate video with real audio in English
sh text2video_audio.sh $1 $2

Generate video with TTS audio in English

sh text2video_tts.sh $1 $2 $3

Generate video with TTS audio in Chinese

sh text2video_tts_chinese.sh $1 $2 $3

$1: "input text" $2: person $3: fill f for female or m for male (gender)

Example 1. test VidTIMIT data with real audio.

sh text2video_audio.sh "She had your dark suit in greasy wash water all year." fadg0 f

Example 2. test VidTIMIT data with TTS audio.

sh text2video_tts.sh "She had your dark suit in greasy wash water all year." fadg0 f

Example 3. test with Chinese female TTS audio.

sh text2video_tts_chinese.sh "正在为您查询合肥的天气情况。今天是2020年2月24日,合肥市今天多云,最低温度9摄氏度,最高温度15摄氏度,微风。" henan f

Training with your own data

English Phoneme / Chinese Pinyin model:

  1. Modeling

1.1 Video recording:

Read prompts that cover all phonemes or pinyin; refer to prompts.docx for phonemes or all_pinyin.txt for pinyin under the ./prompts folder. Pause 0.5 seconds between each pronunciation. Use a camera to record video at 1280x720 resolution or higher.

1.2 Phoneme-Mouth/ Pinyin-Mouth Shape Dictionary:

Use montreal-forced-aligner (Google STT) for phonemes, or Vosk for pinyin, to get the timestamp of each word, and store the results in a dictionary file. Each line of the file saves one [phoneme/pinyin, frame number] pair, for example:

phoneme, frame: AA 52 AA0 52 AA1 52 AA2 52 AE 90 AE0 90 AE1 90 AE2 90 AH 127 AH0 127 AH1 127 AH2 127 AO 146 AO0 146 AO1 146 AO2 146 AW 227 AW0 227 AW1 227 AW2 227 ...

pinyin, frame: ba 61 bo 86 bi 540 bu 110 bai 130 bao 154 ban 178 bang 202 ou 225 pa 272 po 298 ...
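For illustration, here is a minimal Python sketch (not the repo's exact loader; the file name is hypothetical) of reading such a dictionary file into a dict mapping each phoneme or pinyin to the frame index of its reference mouth pose:

# Hypothetical example: each line of the file holds "<phoneme-or-pinyin> <frame-number>", e.g. "AA 52" or "ba 61".
def load_pose_dict(path):
    pose_dict = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.split()
            if len(parts) == 2:
                unit, frame = parts
                pose_dict[unit] = int(frame)  # frame index of the recorded mouth pose
    return pose_dict

# e.g. load_pose_dict('fadg0_phoneme_dict.txt')['AA'] -> 52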

1.3 Openpose:

Use OpenPose to process each frame of the video, extract the human body skeleton, and save the results to a separate folder. OpenPose can be downloaded at: https://github.com/CMU-Perceptual-Computing-Lab/openpose. After compiling it successfully, run:

./build/examples/openpose/openpose.bin --image_dir ./images --face --hand --write_json ./keypoints
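This writes one JSON file per frame into ./keypoints. A minimal sketch of reading one of these OpenPose outputs into arrays (assuming a single person per frame; the file name is illustrative):

import json
import numpy as np

def load_keypoints(json_path):
    with open(json_path) as f:
        data = json.load(f)
    person = data['people'][0]  # assumes exactly one person in the frame
    body = np.array(person['pose_keypoints_2d']).reshape(-1, 3)  # (x, y, confidence) triples, 25 points with BODY_25
    face = np.array(person['face_keypoints_2d']).reshape(-1, 3)  # 70 face landmarks when --face is used
    return body, face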

1.4 Use the generated human skeleton sequences and the corresponding video frames to train the vid2vid model, which is then used to render a realistic portrait video from a skeleton video. The code for vid2vid can be downloaded from: https://github.com/NVIDIA/vid2vid

Train:

python train.py --name xx --dataroot datasets/xx --dataset_mode pose --input_nc 3
--openpose_only --num_D 2 --resize_or_crop randomScaleHeight_and_scaledCrop
--loadSize 544 --fineSize 512 --gpu_ids 0,1,2,3,4,5,6,7 --batchSize 8
--max_frames_per_gpu 2 --niter 500 --niter_decay 5 --no_first_img --n_frames_total 12
--max_t_step 4 --niter_step 100 --save_epoch_freq 100 --add_face_disc
--random_drop_prob 0
  2. Video generation

2.1 Generate audio files from text:

Use the Baidu TTS cloud service to generate the required voice. Baidu voice cloud service address: http://wiki.baidu.com/pages/viewpage.action?pageId=342334101. For the specific call, please refer to tts_request.py.
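tts_request.py loads the returned mp3 with pydub (see the tracebacks in the issues below); for illustration, a minimal sketch of loading the mp3 and exporting a wav for the recognizer (paths and the export step are assumptions):

from pydub import AudioSegment

sound = AudioSegment.from_mp3('./input_audio/henan/example.mp3')   # mp3 returned by the TTS service
sound = sound.set_frame_rate(16000).set_channels(1)                # most Vosk models expect 16 kHz mono
sound.export('./input_audio/henan/example.wav', format='wav')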

2.2 Analyze the audio and find the timestamp of each word:

Use the Chinese VOSK speech recognition model to process the speech generated by TTS and produce a <frame number, pinyin> file like the one below. The VOSK code can be downloaded here: https://github.com/alphacep/vosk-api. For the specific call, please refer to pinyin_timestamping.py.

25 xu 29 yao 38 zuo 46 Geng 53 jia 60 chong 65 fen 70 de 75 zun 80 bei 100 yin 105 ci 111 ne 116 wei 118 le
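For illustration, a minimal Vosk sketch of producing such word-level timestamps (not pinyin_timestamping.py itself; the 25 fps frame rate and file names are assumptions):

import json
import wave
from vosk import Model, KaldiRecognizer

wf = wave.open('input.wav', 'rb')
rec = KaldiRecognizer(Model('model'), wf.getframerate())
rec.SetWords(True)  # ask Vosk for per-word start/end times

words = []
while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):
        words += json.loads(rec.Result()).get('result', [])
words += json.loads(rec.FinalResult()).get('result', [])

for w in words:
    print(int(round(w['start'] * 25)), w['word'])  # <frame number, word>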

2.3 For each word in the audio, use its pinyin to look up the corresponding 2D skeleton pose in the dictionary and splice the poses into a dynamic 2D skeleton video; intermediate frames are obtained by interpolation. Please refer to interp_landmarks_motion.py for details.
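For illustration, the core interpolation idea in a few lines of NumPy (interp_landmarks_motion.py is the actual implementation):

import numpy as np

def interpolate_landmarks(pose_a, pose_b, n_frames):
    # pose_a, pose_b: (K, 2) arrays of 2D landmarks for two consecutive key poses
    alphas = np.linspace(0.0, 1.0, n_frames)
    return np.stack([(1 - a) * pose_a + a * pose_b for a in alphas])  # (n_frames, K, 2)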

2.4 Use the vid2vid model to generate the final portrait video from the 2D skeleton video. The example is as follows:

CUDA_VISIBLE_DEVICES=1 python test.py --name xx --dataroot datasets/xx
--dataset_mode pose --input_nc 3 --resize_or_crop scaleHeight --loadSize 512
--openpose_only --how_many 1200 --no_first_img --random_drop_prob 0
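Finally, the rendered frames are combined with the audio into the output video (image2video.py in this repo does this with MoviePy). A minimal sketch under assumed paths and a 25 fps rate:

import glob
from moviepy.editor import ImageSequenceClip, AudioFileClip

frames = sorted(glob.glob('./results/xx/test_latest/*.jpg'))   # frames rendered by vid2vid (path is illustrative)
clip = ImageSequenceClip(frames, fps=25)
clip = clip.set_audio(AudioFileClip('./input_audio/xx/example.wav'))
clip.write_videofile('./results/xx/shared_video_output/example.mp4', fps=25)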

Citation

Please cite our paper in your publications.

Sibo Zhang, Jiahong Yuan, Miao Liao, Liangjun Zhang. PDF Result Video

@INPROCEEDINGS{9747380,  
author={Zhang, Sibo and Yuan, Jiahong and Liao, Miao and Zhang, Liangjun},  
booktitle={ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},   
title={Text2video: Text-Driven Talking-Head Video Synthesis with Personalized Phoneme - Pose Dictionary},   
year={2022},  
volume={},  
number={},  
pages={2659-2663},  
doi={10.1109/ICASSP43922.2022.9747380}
}
@article{zhang2021text2video,
  title={Text2Video: Text-driven Talking-head Video Synthesis with Phonetic Dictionary},
  author={Zhang, Sibo and Yuan, Jiahong and Liao, Miao and Zhang, Liangjun},
  journal={arXiv preprint arXiv:2104.14631},
  year={2021}
}

Appendices

ARPABET

Acknowledgements

This code is based on the vid2vid framework.


text2video's Issues

English pretrained model not downloadable

Can you please provide the pretrained model via a different platform, like Google Drive or something more accessible? I am not able to download it from Baidu.

cloning and downloading errors

Errors extracting the downloaded files when downloading the zip file, and also when cloning:

git clone https://github.com/sibozhang/Text2Video.git
Cloning into 'Text2Video'...
remote: Enumerating objects: 24795, done.
remote: Counting objects: 100% (99/99), done.
remote: Compressing objects: 100% (98/98), done.
remote: Total 24795 (delta 57), reused 3 (delta 1), pack-reused 24696
Receiving objects: 100% (24795/24795), 213.80 MiB | 8.19 MiB/s, done.
Resolving deltas: 100% (1048/1048), done.
error: invalid path '*phoneme_data/VidTIMIT/fadg0.txt'
fatal: unable to checkout working tree
warning: Clone succeeded, but checkout failed.
You can inspect what was checked out with 'git status'
and retry with 'git restore --source=HEAD :/'

chinese error

(vid2vid) root@ecs-5dea:~/Text2Video# sh text2video_tts_chinese.sh "正在为您查询合肥的天气情况。今天是2020年2月24日,合肥市今天多云,最低温度9摄氏度,最高温度15摄氏度,微风。" henan f
正在为您查询合肥的天气情况。今天是2020年2月24日,合肥市今天多云,最低温度9摄氏度,最高温度15摄氏度,微风。
henan
f
input 正在为您查询合肥的天气情况。今天是2020年2月24日,合肥市今天多云,最低温度9摄氏度,最高温度15摄氏度,微风。
stripped_input 正在为您查询合肥的天气情况今天是2020年2月24日合肥市今天多云最低温度9摄氏度最高温度15摄氏度微风
person henan
Traceback (most recent call last):
File "tts_request.py", line 54, in
sound = AudioSegment.from_mp3('./input_audio/{person}/{file_name}.mp3'.format(person=person, file_name=file_name))
File "/root/venv/vid2vid/lib/python3.8/site-packages/pydub/audio_segment.py", line 796, in from_mp3
return cls.from_file(file, 'mp3', parameters=parameters)
File "/root/venv/vid2vid/lib/python3.8/site-packages/pydub/audio_segment.py", line 773, in from_file
raise CouldntDecodeError(
pydub.exceptions.CouldntDecodeError: Decoding failed. ffmpeg returned error code: 1

Output from ffmpeg/avlib:

ffmpeg version 4.2.7-0ubuntu0.1 Copyright (c) 2000-2022 the FFmpeg developers
built with gcc 9 (Ubuntu 9.4.0-1ubuntu1~20.04.1)
configuration: --prefix=/usr --extra-version=0ubuntu0.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-avresample --disable-filter=resample --enable-avisynth --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librsvg --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enable-libwavpack --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzmq --enable-libzvbi --enable-lv2 --enable-omx --enable-openal --enable-opencl --enable-opengl --enable-sdl2 --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-nvenc --enable-chromaprint --enable-frei0r --enable-libx264 --enable-shared
libavutil 56. 31.100 / 56. 31.100
libavcodec 58. 54.100 / 58. 54.100
libavformat 58. 29.100 / 58. 29.100
libavdevice 58. 8.100 / 58. 8.100
libavfilter 7. 57.100 / 7. 57.100
libavresample 4. 0. 0 / 4. 0. 0
libswscale 5. 5.100 / 5. 5.100
libswresample 3. 5.100 / 3. 5.100
libpostproc 55. 5.100 / 55. 5.100
[mp3 @ 0x55e641b4f700] Failed to read frame size: Could not seek to 1160.
./input_audio/henan/正在为您查询合肥的天.mp3: Invalid argument

stripped_chn_punct 正在为您查询合肥的天气情况今天是2020年2月24日合肥市今天多云最低温度9摄氏度最高温度15摄氏度微风
stripped_eng_punct 正在为您查询合肥的天气情况今天是2020年2月24日合肥市今天多云最低温度9摄氏度最高温度15摄氏度微风
len_stripped 52
Please download the model from https://github.com/alphacep/vosk-api/blob/master/doc/models.md and unpack as 'model' in the current folder.
Traceback (most recent call last):
File "interp_landmarks_motion.py", line 48, in
first_didx = int(pinyin_ts[0][0])
IndexError: index 0 is out of bounds for axis 0 with size 0
------------ Options -------------
add_face_disc: False
aspect_ratio: 1.0
basic_point_only: False
batchSize: 1
checkpoints_dir: ./checkpoints
dataroot: datasets/henan
dataset_mode: pose
debug: False
densepose_only: False
display_id: 0
display_winsize: 512
feat_num: 3
fg: False
fg_labels: [26]
fineSize: 512
fp16: False
gpu_ids: [0]
how_many: 1200
input_nc: 3
isTrain: False
label_feat: False
label_nc: 0
loadSize: 512
load_features: False
load_pretrain:
local_rank: 0
max_dataset_size: inf
model: vid2vid
nThreads: 2
n_blocks: 9
n_blocks_local: 3
n_downsample_E: 3
n_downsample_G: 3
n_frames_G: 3
n_gpus_gen: 1
n_local_enhancers: 1
n_scales_spatial: 1
name: henan
ndf: 64
nef: 32
netE: simple
netG: composite
ngf: 128
no_canny_edge: False
no_dist_map: False
no_first_img: True
no_flip: False
no_flow: False
norm: batch
ntest: inf
openpose_only: True
output_nc: 3
phase: test
random_drop_prob: 0.0
random_scale_points: False
remove_face_labels: False
resize_or_crop: scaleHeight
results_dir: ./results/
serial_batches: False
start_frame: 0
tf_log: False
use_instance: False
use_real_img: False
use_single_G: False
which_epoch: latest
-------------- End ----------------
CustomDatasetDataLoader
dataset [PoseDataset] was created
Traceback (most recent call last):
File "test.py", line 23, in
data_loader = CreateDataLoader(opt)
File "/root/vid2vid/data/data_loader.py", line 6, in CreateDataLoader
data_loader.initialize(opt)
File "/root/vid2vid/data/custom_dataset_data_loader.py", line 33, in initialize
self.dataset = CreateDataset(opt)
File "/root/vid2vid/data/custom_dataset_data_loader.py", line 23, in CreateDataset
dataset.initialize(opt)
File "/root/vid2vid/data/pose_dataset.py", line 19, in initialize
self.img_paths = sorted(make_grouped_dataset(self.dir_img))
File "/root/vid2vid/data/image_folder.py", line 38, in make_grouped_dataset
assert os.path.isdir(dir), '%s is not a valid directory' % dir
AssertionError: datasets/henan/test_img is not a valid directory
Moviepy - Building video ./results/henan/shared_video_output/henan_smooth_正在为您查询合肥的天.mp4.
Moviepy - Writing video ./results/henan/shared_video_output/henan_smooth_正在为您查询合肥的天.mp4

t: 0%| | 0/85 [00:00<?, ?it/s, now=None]Traceback (most recent call last):
File "/root/venv/vid2vid/lib/python3.8/site-packages/moviepy/video/io/ffmpeg_writer.py", line 136, in write_frame
self.proc.stdin.write(img_array.tobytes())
BrokenPipeError: [Errno 32] Broken pipe

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "image2video.py", line 70, in
my_clip.write_videofile('./results/{person}/shared_video_output/{person}smooth{audio}.mp4'.format(person=person, audio=file_name),
File "", line 2, in write_videofile
File "/root/venv/vid2vid/lib/python3.8/site-packages/moviepy/decorators.py", line 54, in requires_duration
return f(clip, *a, **k)
File "", line 2, in write_videofile
File "/root/venv/vid2vid/lib/python3.8/site-packages/moviepy/decorators.py", line 135, in use_clip_fps_by_default
return f(clip, *new_a, **new_kw)
File "", line 2, in write_videofile
File "/root/venv/vid2vid/lib/python3.8/site-packages/moviepy/decorators.py", line 22, in convert_masks_to_RGB
return f(clip, *a, **k)
File "/root/venv/vid2vid/lib/python3.8/site-packages/moviepy/video/VideoClip.py", line 300, in write_videofile
ffmpeg_write_video(self, filename, fps, codec,
File "/root/venv/vid2vid/lib/python3.8/site-packages/moviepy/video/io/ffmpeg_writer.py", line 228, in ffmpeg_write_video
writer.write_frame(frame)
File "/root/venv/vid2vid/lib/python3.8/site-packages/moviepy/video/io/ffmpeg_writer.py", line 180, in write_frame
raise IOError(error)
OSError: [Errno 32] Broken pipe

MoviePy error: FFMPEG encountered the following error while writing file ./results/henan/shared_video_output/henan_smooth_正在为您查询合肥的天.mp4:

b'[mp3 @ 0x6836840] Failed to read frame size: Could not seek to 1160.\n../Text2Video/input_audio/henan/\xe6\xad\xa3\xe5\x9c\xa8\xe4\xb8\xba\xe6\x82\xa8\xe6\x9f\xa5\xe8\xaf\xa2\xe5\x90\x88\xe8\x82\xa5\xe7\x9a\x84\xe5\xa4\xa9.mp3: Invalid argument\n'

How can this problem be solved?

How do I train with my own data?

Hi, your work seems amazing. I was looking to train it on my own dataset of only one character.
Can you please share detailed instructions for training?

Thank You.

keypoints file not found error

Hi, Thanks for your great work!

I encounter a "file not found" error when I run the program with the following command:

sh text2video_tts_chinese.sh "正在为您查询合肥的天气情况。今天是2020年2月24日,合肥市今天多云,最低温度9摄氏度,最高温度15摄氏度,微风。" xuesong m


I find that the directory *pinyin_data/xuesong/keypoints_xuesong does exist, and the files "03958_keypoints.json" and "03960_keypoints.json" do exist in the directory. However, the "03959_keypoints.json" file does not exist. Where can I find this file, or is there any other way to deal with it?

Will quality improve much when the dataset for building the dictionary becomes larger?

Hi, thank you very much for your sharing!

I have two questions.

Question 1:
In the paper, you use very little data to establish the phoneme-pose dictionary, for example, 8 minutes for Mandarin. For common methods, I mean common neural networks, a larger training dataset may improve performance. But since you use a dictionary here, will the quality improve much as the dataset gets larger? Have you done any tests and reached any conclusions, or can you offer a pre-judgment?

Question 2: I am new to the subject of audio-visual problems. In the "Text-Driven Video Generation" part of the "Related Works" section of your paper, it seems that few works directly use text as the driver. Could you recommend any other papers or methods that do text-driven talking-head synthesis?

pydub.exceptions.CouldntDecodeError: Decoding failed. ffmpeg returned error code: 1

Hi,
I can't run this program.
I have a GPU server with Ubuntu 20 and CUDA 11.6.
I don't have permission to increase or decrease the CUDA version.
This is my approach:

apt-get update && \
  apt-get install -y nano rsync htop git openssh-server python3-pip python3-venv ninja-build sox libsox-fmt-mp3 ffmpeg && \
  ln -s /usr/bin/python3 /usr/bin/python && \
  rm -rf /var/lib/apt/lists/*



git clone https://github.com/sibozhang/vid2vid.git
git clone https://github.com/sibozhang/Text2Video.git

python3 -m venv ./venv/vid2vid
source ./venv/vid2vid/bin/activate
pip3 install --upgrade pip
pip3 install torch==1.13.1+cu116 torchvision==0.14.1+cu116 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu116
pip install numpy dominate requests  pillow scipy pytz dominate pydub
pip install opencv-python
pip install zhon moviepy ffmpeg 

Then I downloaded the fadg0 model files from Dropbox and put them here:
vid2vid/checkpoints
This is the list of files:

vid2vid/checkpoints/web
vid2vid/checkpoints/iter.txt
vid2vid/checkpoints/latest_net_D.pth
vid2vid/checkpoints/latest_net_D_f.pth
vid2vid/checkpoints/latest_net_D_T0.pth
vid2vid/checkpoints/latest_net_D_T1.pth
vid2vid/checkpoints/latest_net_D_T2.pth
vid2vid/checkpoints/latest_net_G0.pth
vid2vid/checkpoints/loss_log.txt
vid2vid/checkpoints/opt.txt

Then I changed
cxx_args = ['-std=c++11']
to
cxx_args = ['-std=c++14']
in these files:

vid2vid/models/flownet2_pytorch/networks/channelnorm_package/setup.py
vid2vid/models/flownet2_pytorch/networks/correlation_package/setup.py
vid2vid/models/flownet2_pytorch/networks/resample2d_package/setup.py

Then I ran this file:
vid2vid/models/flownet2_pytorch/install.sh

At the end I ran this command:
sh Text2Video/text2video_tts.sh "hi how are you" fadg0 f

and got this error:

hi how are you
fadg0
f
input hi how are you
stripped_input hihowareyou
person fadg0
Traceback (most recent call last):
  File "tts_request.py", line 54, in <module>
    sound = AudioSegment.from_mp3('./input_audio/{person}/{file_name}.mp3'.format(person=person, file_name=file_name))
  File "/b/venv/vid2vid/lib/python3.8/site-packages/pydub/audio_segment.py", line 796, in from_mp3
    return cls.from_file(file, 'mp3', parameters=parameters)
  File "/b/venv/vid2vid/lib/python3.8/site-packages/pydub/audio_segment.py", line 773, in from_file
    raise CouldntDecodeError(
pydub.exceptions.CouldntDecodeError: Decoding failed. ffmpeg returned error code: 1

Output from ffmpeg/avlib:

ffmpeg version 4.2.7-0ubuntu0.1 Copyright (c) 2000-2022 the FFmpeg developers
  built with gcc 9 (Ubuntu 9.4.0-1ubuntu1~20.04.1)
  configuration: --prefix=/usr --extra-version=0ubuntu0.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-avresample --disable-filter=resample --enable-avisynth --enable
-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-lib
jack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librsvg --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libssh --enable-libtheora --en
able-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enable-libwavpack --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzmq --enable-libzvbi --enable-lv2 --enable-omx --enable-openal --enable-opencl --enable-opengl --ena
ble-sdl2 --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-nvenc --enable-chromaprint --enable-frei0r --enable-libx264 --enable-shared
  libavutil      56. 31.100 / 56. 31.100
  libavcodec     58. 54.100 / 58. 54.100
  libavformat    58. 29.100 / 58. 29.100
  libavdevice    58.  8.100 / 58.  8.100
  libavfilter     7. 57.100 /  7. 57.100
  libavresample   4.  0.  0 /  4.  0.  0
  libswscale      5.  5.100 /  5.  5.100
  libswresample   3.  5.100 /  3.  5.100
  libpostproc    55.  5.100 / 55.  5.100
[mp3 @ 0x5572708cd700] Failed to read frame size: Could not seek to 1160.
./input_audio/fadg0/hihowareyo.mp3: Invalid argument

file_name hihowareyo
Traceback (most recent call last):
  File "align_english.py", line 212, in <module>
    tmpbase = '/tmp/' + os.environ['USER'] + '_' + str(os.getpid())
  File "/usr/lib/python3.8/os.py", line 675, in __getitem__
    raise KeyError(key) from None
KeyError: 'USER'

Would you please guide me on how to solve the problem?

KeyError: 'result'

Run: python pinyin_timestamping.py 正在为您查询合肥的天气情况。今天是2020年2月24日,合肥市今天多云,最低温度9摄氏度,最高温度15摄氏度,微风。 henan

Traceback (most recent call last):
File "pinyin_timestamping.py", line 91, in
for item in res['result']:
KeyError: 'result'

frozen image

When driven by text (both Chinese and English), the output video freezes after a few seconds while the audio keeps going.

running tts_request.py

CouldntDecodeError: Decoding failed. ffmpeg returned error code: 1
How can this be solved?
I have installed ffmpeg-1.4 but it still can't segment the audio file.
The error points to: sound = AudioSegment.from_mp3('./input_audio/{person}/{file_name}.mp3'.format(person=person, file_name=file_name))
