
Wav2Lip's People

Contributors

burning846, dipam7, jonathansum, mowshon, prajwalkr, rogue-yogi, rudrabha, snehitvaddi


Wav2Lip's Issues

Shared notebook fails in last step

Hello! After processing the video, it fails. I tried several videos, including one of your own (the dictator one).

/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:523: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint8 = np.dtype([("qint8", np.int8, 1)])
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:524: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint16 = np.dtype([("qint16", np.int16, 1)])
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint32 = np.dtype([("qint32", np.int32, 1)])
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:532: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
np_resource = np.dtype([("resource", np.ubyte, 1)])
Using cuda for inference.
Reading video frames...
Number of frames available for inference: 210
(80, 1595)
Length of mel chunks: 594
0% 0/5 [00:00<?, ?it/s]
0% 0/14 [00:00<?, ?it/s]THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=383 error=11 : invalid argument
0% 0/14 [00:00<?, ?it/s]
Recovering from OOM error; New batch size: 8

0% 0/27 [00:00<?, ?it/s]
4% 1/27 [00:04<01:56, 4.50s/it]
7% 2/27 [00:05<01:23, 3.32s/it]
11% 3/27 [00:05<00:59, 2.50s/it]
15% 4/27 [00:06<00:44, 1.92s/it]
19% 5/27 [00:06<00:33, 1.51s/it]
22% 6/27 [00:07<00:25, 1.23s/it]
26% 7/27 [00:07<00:20, 1.03s/it]
30% 8/27 [00:08<00:17, 1.12it/s]
33% 9/27 [00:09<00:14, 1.22it/s]
37% 10/27 [00:09<00:12, 1.33it/s]
41% 11/27 [00:10<00:11, 1.43it/s]
44% 12/27 [00:10<00:09, 1.51it/s]
48% 13/27 [00:11<00:08, 1.56it/s]
52% 14/27 [00:12<00:08, 1.60it/s]
56% 15/27 [00:12<00:07, 1.62it/s]
59% 16/27 [00:13<00:06, 1.66it/s]
63% 17/27 [00:13<00:05, 1.67it/s]
67% 18/27 [00:14<00:05, 1.69it/s]
70% 19/27 [00:14<00:04, 1.69it/s]
74% 20/27 [00:15<00:04, 1.66it/s]
78% 21/27 [00:16<00:03, 1.65it/s]
81% 22/27 [00:16<00:03, 1.64it/s]
85% 23/27 [00:17<00:02, 1.60it/s]
89% 24/27 [00:18<00:01, 1.61it/s]
93% 25/27 [00:18<00:01, 1.62it/s]
96% 26/27 [00:19<00:00, 1.63it/s]
100% 27/27 [00:21<00:00, 1.23it/s]
Load checkpoint from: checkpoints/wav2lip_gan.pth
0% 0/5 [00:26<?, ?it/s]
Traceback (most recent call last):
File "inference.py", line 277, in
main()
File "inference.py", line 249, in main
model = load_model(args.checkpoint_path)
File "inference.py", line 166, in load_model
checkpoint = _load(path)
File "inference.py", line 157, in _load
checkpoint = torch.load(checkpoint_path)
File "/usr/local/lib/python3.6/dist-packages/torch/serialization.py", line 387, in load
return _load(f, map_location, pickle_module, **pickle_load_args)
File "/usr/local/lib/python3.6/dist-packages/torch/serialization.py", line 581, in _load
deserialized_objects[key].set_from_file(f, offset, f_should_read_directly)
RuntimeError: unexpected EOF, expected 2577609 more bytes. The file might be corrupted.
terminate called after throwing an instance of 'c10::Error'
what(): owning_ptr == NullType::singleton() || owning_ptr->refcount.load() > 0 ASSERT FAILED at /pytorch/c10/util/intrusive_ptr.h:350, please report a bug to PyTorch. intrusive_ptr: Can only intrusive_ptr::reclaim() owning pointers that were created using intrusive_ptr::release(). (reclaim at /pytorch/c10/util/intrusive_ptr.h:350)
frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7fb2ddb06441 in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7fb2ddb05d7a in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
frame #2: THStorage_free + 0xca (0x7fb2deaa029a in /usr/local/lib/python3.6/dist-packages/torch/lib/libcaffe2.so)
frame #3: + 0x53a157 (0x7fb31d20f157 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #4: python3() [0x54f226]
frame #5: python3() [0x572cd0]
frame #6: python3() [0x4b18a8]
frame #7: python3() [0x588a98]
frame #8: python3() [0x5ad558]
frame #9: python3() [0x5ad56e]
frame #10: python3() [0x5ad56e]
frame #11: python3() [0x5ad56e]
frame #12: python3() [0x5ad56e]
frame #13: python3() [0x5ad56e]
frame #14: python3() [0x56b636]

frame #20: __libc_start_main + 0xe7 (0x7fb3673dfb97 in /lib/x86_64-linux-gnu/libc.so.6)

Video looping when the video is shorter than the audio

Hello,
I noticed that if the audio is longer than the video, the video loops back to the start. This creates a visible break at the loop point and does not look natural. Do you have any tips on how to overcome this?
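A possible workaround (not something the repo does, just a sketch): feed the frames in a ping-pong (forward-then-backward) order instead of restarting from frame 0, so the loop point has no hard cut. Here frames stands for the list of frames inference.py reads and num_needed for the number of mel chunks; both names are illustrative.

    # Hypothetical helper: repeat frames forward-then-backward so the loop
    # point has no visible jump when the audio outlasts the video.
    def pingpong_frames(frames, num_needed):
        cycle = frames + frames[-2:0:-1]   # forward pass + reversed pass, endpoints not duplicated
        return [cycle[i % len(cycle)] for i in range(num_needed)]

    # e.g. a 210-frame video driven by 594 mel chunks:
    # looped = pingpong_frames(frames, 594)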

No module named 'tensorflow.contrib'

I am having this problem when running on Windows. Can anybody help me?

(base) C:\Users\XXXXXX>conda activate WAV

(WAV) C:\Users\XXXXXX>cd C:\Users\XXXXXX\wav

(WAV) C:\Users\XXXXXX\wav>python inference.py --checkpoint_path checkpoint/wav2lip.pth --face video/drive.mp4 --audio audio/audio.wav
2020-09-10 05:14:40.379812: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'cudart64_101.dll'; dlerror: cudart64_101.dll not found
2020-09-10 05:14:40.383346: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Traceback (most recent call last):
  File "inference.py", line 3, in <module>
    import scipy, cv2, os, sys, argparse, audio
  File "C:\Users\XXXXXX\wav\audio.py", line 7, in <module>
    from hparams import hparams as hp
  File "C:\Users\XXXXXX\wav\hparams.py", line 1, in <module>
    from tensorflow.contrib.training import HParams
ModuleNotFoundError: No module named 'tensorflow.contrib'

(WAV) C:\Users\XXXXXX\wav>
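hparams.py imports HParams from tensorflow.contrib, which no longer exists in TensorFlow 2.x. A common workaround (a sketch, not an official fix from this repo; depending on what else hparams.py uses you may need more methods) is to replace that import with a tiny stand-in class:

    # Minimal stand-in for tensorflow.contrib.training.HParams (removed in TF 2.x).
    # Use this in hparams.py in place of the tensorflow.contrib import.
    class HParams:
        def __init__(self, **kwargs):
            self.data = dict(kwargs)
            for k, v in kwargs.items():
                setattr(self, k, v)

        def set_hparam(self, name, value):
            self.data[name] = value
            setattr(self, name, value)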

SyncNet fails if batch_size set to 1 in hparams.py

raise ValueError('Expected more than 1 value per channel when training, got input size {}'.format(size))
ValueError: Expected more than 1 value per channel when training, got input size torch.Size([1, 512, 1, 1])

input to syncnet:
torch.Size([1, 15, 48, 96]) torch.Size([1, 1, 80, 16])

Does the SyncNet batch_size have to be a specific ratio of the batch_size used for the GAN?
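As far as the error itself goes, it comes from PyTorch's BatchNorm layers: in training mode they need more than one value per channel to compute batch statistics, so a batch size of 1 fails regardless of the model, and there should be no required ratio between the SyncNet and GAN batch sizes, only that each is greater than 1. A minimal reproduction (assumes only PyTorch is installed):

    import torch
    import torch.nn as nn

    bn = nn.BatchNorm2d(512)              # the kind of layer used in the conv blocks
    bn.train()

    bn(torch.randn(2, 512, 1, 1))         # batch of 2: fine
    try:
        bn(torch.randn(1, 512, 1, 1))     # batch of 1: raises ValueError
    except ValueError as e:
        print(e)                          # "Expected more than 1 value per channel when training ..."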

Pre-trained face detection model not available

Hi! Thanks for releasing your code. However, the pretrained model used for face detection is no longer available from the link you posted. Is there somewhere else we can download it?

Thanks!

Colab seems not working anymore

Hello! I got version conflict errors like the ones below.
Is there any way to work around this?
Thank you!

Building wheels for collected packages: librosa
Building wheel for librosa (setup.py) ... done
Created wheel for librosa: filename=librosa-0.7.0-cp36-none-any.whl size=1598345 sha256=0ae5d8d2902c4310f2c50dd05a2eaec382bf31e30461ae2ce3cc203f286d16b1
Stored in directory: /root/.cache/pip/wheels/49/1d/38/c8ad12fcad67569d8e730c3275be5e581bd589558484a0f881
Successfully built librosa
ERROR: xarray 0.15.1 has requirement setuptools>=41.2, but you'll have setuptools 39.1.0 which is incompatible.
ERROR: google-auth 1.17.2 has requirement setuptools>=40.3.0, but you'll have setuptools 39.1.0 which is incompatible.
ERROR: datascience 0.10.6 has requirement folium==0.2.1, but you'll have folium 0.8.3 which is incompatible.
ERROR: albumentations 0.1.12 has requirement imgaug<0.2.7,>=0.2.5, but you'll have imgaug 0.2.9 which is incompatible.
ERROR: tensorflow 1.10.0 has requirement numpy<=1.14.5,>=1.13.3, but you'll have numpy 1.17.1 which is incompatible.
Installing collected packages: numpy, soundfile, librosa, opencv-contrib-python, opencv-python, setuptools, tensorboard, tensorflow, torch, torchvision, tqdm
Found existing installation: numpy 1.18.5
Uninstalling numpy-1.18.5:
Successfully uninstalled numpy-1.18.5
Found existing installation: librosa 0.6.3
Uninstalling librosa-0.6.3:
Successfully uninstalled librosa-0.6.3
Found existing installation: opencv-contrib-python 4.1.2.30

Issue with inference.py

Using the example command for inference with pretrained models in the README, I get a ModuleNotFoundError, saying: No module named 'numba.decorators', even though I installed all of the requirements via pip3 install -r requirements.txt in a virtual environment. Am I doing something wrong? The exact command I'm running looks like this:

python3 inference.py --checkpoint_path checkpoints --face image.png --audio audio.mp3

Where image.png and audio.mp3 are the pairs I want to sync.

I have the model (Wav2Lip + GAN) in checkpoints/ and face_detection/detection/sfd/s3fd.pth in its proper place.

I'm on Linux Mint 19.1, using Python 3.6.9 and pip 20.2.1.

Help required on line #3 in Colab Notebook

Hi Sir,

Very interesting project :)
I am getting the error below while executing line 5 of the Colab notebook:
cp: cannot stat '/content/gdrive/My Drive/Wav2lip/wav2lip_gan.pth': No such file or directory

I tried to search for the wav2lip_gan.pth file but couldn't find it. Please suggest what to do.
Thanks in advance.

Training Question

I have done all the preprocessing and started training, but after epoch 37 Real is 0. Is this normal, or have I done something wrong?

Starting Epoch: 31
L1: 0.15976321697235107, Sync: 0.0, Percep: 0.624918540318807 | Fake: 0.766441822052002, Real: 0.3562496453523636: : 3it [00:02,  1.07it/s]
Starting Epoch: 32

After epoch 37, Real is 0:

Starting Epoch: 37
L1: 0.15176475048065186, Sync: 0.0, Percep: 0.0 | Fake: 27.63102086385091, Real: 0.0: : 3it [00:03,  1.04s/it]
Starting Epoch: 38

GAN training from scratch too erratic

Is there a random seed that works well consistently?

During training, the loss gets stuck on either Real or Fake.
This was trained from scratch on a dataset of my own: a single identity with about 3 hours of video.

L1: 0.003016742644831538, Sync: 0.0, Percep: 0.7017908990383148 | Fake: 0.6846120357513428, Real: 0.7016448080539703: : 2it [00:03,  1.65s/it]Starting Epoch: 19022

L1: 0.003017835319042206, Sync: 0.0, Percep: 14.170929253101349 | Fake: 0.3378826081752777, Real: 14.172042548656464: : 2it [00:03,  1.66s/it]Starting Epoch: 19023

L1: 0.003155902144499123, Sync: 0.0, Percep: 27.63102149963379 | Fake: 0.0, Real: 27.63102149963379: : 2it [00:03,  1.65s/it]Starting Epoch: 19024

L1: 0.003000790602527559, Sync: 0.0, Percep: 27.63102149963379 | Fake: 0.0, Real: 27.63102149963379: : 2it [00:03,  1.66s/it]Starting Epoch: 19025

I've also seen this happen once syncnet_wt gets set to 0.03.
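On the random-seed question: the training scripts do not appear to fix seeds themselves, so for run-to-run consistency you would have to seed the generators yourself; whether any particular seed trains more stably is another matter. A minimal sketch of what one might add to the training script (an assumption, not existing code in the repo):

    import random
    import numpy as np
    import torch

    def set_seed(seed=0):
        # Seed every RNG the training loop touches so runs are repeatable.
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        # Optional: trade speed for determinism in cuDNN kernels.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False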

Preprocessing cannot be completed with preprocess.py

Sorry for bothering again.
But when I use preprocess.py to process the LRS2 dataset (or my own videos), I only get a lot of black-and-white images of size 1x4 (as *.jpg files in preprocessed_root).
I couldn't figure out what went wrong, and I want to know what a *.jpg output should look like.
Thank you for your attention.

Retrain / Finetuning

Hello,

Is it possible to retrain / fine-tune the pretrained models for a specific voice / person combination, and would this improve the results?

training step

When I trained the Wav2Lip model, I tried to resume from the checkpoint you provided and saw that it is at around 250,000 steps. My own model is still training and is at 40,000+ steps, without restoring your checkpoint. The training loss is nearly 0.02 and the sync loss is about 0.4. I then used the latest checkpoint I saved to run inference.py, and the result is not good: the blending and the lips do not look right. Should I continue training?

How to get higher resolution?

Hi @Rudrabha ,

What great work, and thanks for sharing it.
I am wondering:

  1. How can I generate higher-resolution output?
  2. By cropping the face with a square area (not resizing to 96x96), will the visual quality improve?

Thank you.

Cuda out of memory error when preprocessing data for training

I am trying to train a model with my own data. I have the following directory structure:

Wav2Lip
    |____training_data
               |_______*.mp4 files

I've changed the line in preprocess.py from

filelist = glob(path.join(args.data_root, '*/*.mp4'))

to

filelist = glob(path.join(args.data_root, '*.mp4'))

for my directory structure. However, when I run the command given in the readme, I get the following error for every video:

Traceback (most recent call last):
  File "preprocess.py", line 85, in mp_handler
    process_video_file(vfile, args, gpu_id)
  File "preprocess.py", line 59, in process_video_file
    preds = fa[gpu_id].get_detections_for_batch(np.asarray(fb))
  File "/storage/Wav2Lip/face_detection/api.py", line 66, in get_detections_for_batch
    detected_faces = self.face_detector.detect_from_batch(images.copy())
  File "/storage/Wav2Lip/face_detection/detection/sfd/sfd_detector.py", line 42, in detect_from_batch
    bboxlists = batch_detect(self.face_detector, images, device=self.device)
  File "/storage/Wav2Lip/face_detection/detection/sfd/detect.py", line 68, in batch_detect
    olist = net(imgs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/storage/Wav2Lip/face_detection/detection/sfd/net_s3fd.py", line 71, in forward
    h = F.relu(self.conv1_1(x))
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/conv.py", line 343, in forward
    return self.conv2d_forward(input, self.weight)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/conv.py", line 340, in conv2d_forward
    self.padding, self.dilation, self.groups)
RuntimeError: CUDA out of memory. Tried to allocate 15.82 GiB (GPU 0; 15.90 GiB total capacity; 847.43 MiB already allocated; 14.48 GiB free; 14.57 MiB cached)

My videos are all 1080p.

I'm using Paperspace with a P5000 GPU, 8 CPUs, and 30 GB of RAM. Can you specify what computing power you used to train, and how I can use what I have available to train my own model?
Thanks

Exported Video Shorter Than Imported Video

I believe this happens every time, or at least most of the time: the output video is shorter than the input video. When I create my input_audio.wav and input_vid.mp4, I create them in Premiere and they have exactly the same duration; if the video is 4.8 seconds, the audio is as well. The output video is shorter, though. Thanks!

cffi library has no function, constant or global variable named 'sf_wchar_open'

I can't fix an error when I run the last step. I have installed all the dependencies indicated in the requirements file.
//////////

(Wav2Lip) D:\Users\Documents\Deepfakes\Wav2lip\Wav2Lip-master>python inference.py --checkpoint_path checkpoints/wav2lip_gan.pth --face video.mp4 --audio audio.wav

Using cpu for inference.
Reading video frames...
Number of frames available for inference: 856
Traceback (most recent call last):
File "inference.py", line 276, in
main()
File "inference.py", line 221, in main
wav = audio.load_wav(args.audio, 16000)
File "D:\Users\Documents\Deepfakes\Wav2lip\Wav2Lip-master\audio.py", line 10, in load_wav
return librosa.core.load(path, sr=sr)[0]
File "C:\Users.conda\envs\snek\lib\site-packages\librosa\core\audio.py", line 127, in load
with sf.SoundFile(path) as sf_desc:
File "C:\Users.conda\envs\snek\lib\site-packages\soundfile.py", line 627, in init
self._file = self._open(file, mode_int, closefd)
File "C:\Users.conda\envs\snek\lib\site-packages\soundfile.py", line 1170, in _open
openfunction = _snd.sf_wchar_open
AttributeError: cffi library 'C:\Users.conda\envs\snek\Library\bin\sndfile.dll' has no function, constant or global variable named 'sf_wchar_open'

Output audio shorter than input audio

In my tests the generated audio output was 0.05 seconds shorter than the input. Input: 01:04.16, output: 01:04.11.
Not much but enough to disturb a seamless integration. Is this possibly due to a ffmpeg setting?

The specific meaning of i-2 in wav2lip_train.py

Hi, thank you for sharing this code.
Why is i-2 used in the following code?

def get_segmented_mels(self, spec, start_frame):
        mels = []
        assert syncnet_T == 5
        start_frame_num = self.get_frame_id(start_frame) + 1 # 0-indexing ---> 1-indexing
        if start_frame_num - 2 < 0: return None
        for i in range(start_frame_num, start_frame_num + syncnet_T):
            m = self.crop_audio_window(spec, i - 2)
            if m.shape[0] != syncnet_mel_step_size:
                return None
            mels.append(m.T)

        mels = np.asarray(mels)

        return mels
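For reference, here is a worked example of the indices this produces, assuming get_frame_id simply parses the integer out of the frame filename and syncnet_T == 5 (this only traces the code above; it does not explain the design choice):

    # For start_frame "5.jpg": get_frame_id -> 5, so start_frame_num = 5 + 1 = 6.
    start_frame_num = 6
    windows = [i - 2 for i in range(start_frame_num, start_frame_num + 5)]
    print(windows)   # [4, 5, 6, 7, 8]  -> offsets handed to crop_audio_window
    # For start_frame "0.jpg": start_frame_num = 1 and 1 - 2 < 0, so the sample is skipped.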

Not sure about crop_audio_window in color_syncnet_train.py

About crop_audio_window in color_syncnet_train.py: I'm not sure the conversion from 0-based to 1-based indexing is done correctly. For example, if the frame's file name is 0.jpg (the beginning of the video), the current implementation gives a non-zero start_idx for spec, which I think is wrong. It seems to me that for 0.jpg, the start_idx for spec should be 0.

Cannot install requirements.txt on Windows 10

I get this error message

Could not find a version that satisfies the requirement torch==1.1.0 (from -r requirements.txt (line 6)) (from versions: 0.1.2, 0.1.2.post1, 0.1.2.post2)
No matching distribution found for torch==1.1.0 (from -r requirements.txt (line 6))

If I try to install one of the other versions (0.1.2.post2), the script fails later on.

How to shut it up?

I used the colab and it worked beautifully, THANKS!
I have a problem: whenever the audio has a silent stretch, the original mouth movement remains unchanged.
I've tried using "mmm" to shut it up, but that had no effect, so I used "ha ha ha" to minimize the lip movement.
Is there a way to close the mouth for silence?

How to use the GPU?

Hello everyone!

After many difficulties, I finally managed to get Wav2Lip working. However, rendering is extremely slow and makes my computer unusable for an hour. How can I enable the GPU to speed up the computation?

How much more training is needed with a new dataset?

Thank you for providing really easy-to-understand and neat code.
Using the checkpoint you provided, we are applying it to our own video files, but some videos do not produce the desired quality.
So we are doing a warm start from the checkpoint you provided and training on our additional dataset plus LRS2.
Can you give me any tips on how much more training we need to do?
I'll wait for your answer.

Basic information about the synthesized lips

This is a very good project; thanks for sharing it. When I synthesize a video using my own video and voice, the person in the generated video seems to have faint lips, and there appears to be ghosting. How can I adjust for this?

Video quality differs between the demo page and this GitHub repo

Hello, first I am amazed by this marvelous work. I appreciate your effort for building such a great project.

I have one question though.
I generated a video using your demo page (https://bhaasha.iiit.ac.in/lipsync/) and another using the same video/audio with this GitHub repo, and I found the quality from the demo page was much better.

Is this an issue regarding some parameters or the checkpoint?

Thank you very much

Question regarding sr in audio.load_wav

In line 221 of inference.py, sr has been set to 16000, and mel_idx_multiplier has been set to 80./fps. Is there any particular reason you chose these numbers?
Thank you very much
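For what it's worth, both numbers follow from the default audio hyperparameters; assuming sample_rate = 16000 and hop_size = 200 in hparams.py (my reading of the defaults), the mel spectrogram has 16000 / 200 = 80 frames per second, so 80./fps is just the number of mel frames per video frame:

    sample_rate = 16000    # audio.load_wav resamples everything to 16 kHz
    hop_size = 200         # STFT hop, assumed from the default hparams
    fps = 25               # default video frame rate

    mels_per_second = sample_rate / hop_size     # 80.0 mel frames per second
    mel_idx_multiplier = mels_per_second / fps   # 3.2 mel frames per video frame
    print(mels_per_second, mel_idx_multiplier)   # 80.0 3.2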

Make face detection a generator

face_detect() is by far the slowest part of inference.py, and it also buffers all its results. This means getting results is slow and RAM is wasted. You could make face_detect() a generator instead and the output video would start being written immediately and your RAM usage would be bounded (by the batch size).
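A rough sketch of the idea (hypothetical function and signature; the real face_detect in inference.py also pads and smooths the detected boxes, so the actual change would be a little more involved):

    import numpy as np

    def face_detect_stream(frames, detector, batch_size=16):
        # Yield one detection per frame as soon as its batch finishes, instead of
        # buffering every result before any output video is written.
        for i in range(0, len(frames), batch_size):
            batch = frames[i:i + batch_size]
            preds = detector.get_detections_for_batch(np.asarray(batch))
            for frame, pred in zip(batch, preds):
                yield frame, pred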

You shouldn't open this to the public

This is only my opinion, but this tool is too powerful for fake-news generation and other misuse.
You should at least whitelist people, or take down your "easy demo website".

In any case your work is wonderful, and will be a huge hit !!!

ImportError: No module named 'cv2'

I tried the following command: "python inference.py --checkpoint_path checkpoints/wav2lip_gan.pth --face video.mp4 --audio audio.wav", but I'm getting this error:

"Traceback (most recent call last):
File "inference.py", line 3, in
import scipy, cv2, os, sys, argparse, audio
ImportError: No module named 'cv2'"

What should I do to fix this? I'm on Python 3.5.6.

Training issues in Syncnet

Hi, thanks for sharing. When I tried to train on my own dataset (made by myself, following the LRS2 dataset format) with wav2lip_train.py, it failed with: 'Expected more than 1 value per channel when training, got input size torch.Size([1, 512, 1, 1])'. Could you please help me address this issue?

Mouth Position Delay / Lag when motion is faster

My first few tries were on some random videos, and I noticed that the mouth was not always placed in the correct spot.
So I did a few experiments to replicate the issue before posting it here.

I downloaded a head made in GAN2 (random) and made a simple video of it moving at different speeds. Since it's just a still image moving around, it should be easier to track than a video with angles and whatnot, and you can SEE the visual issue.

ISSUE Description:

  1. It seems like when there is fast motion, the mouth lags "behind" the frame, like a delay.
  2. I don't think it's a face-detection issue, because the mouth is almost in the right position, just about a frame behind.

The above are only guesses. I only have some experience with other deepfake tools such as
FaceSwap and DFL, so I know that even blurry, fast motion can be detected. I just don't know how it works here; probably in a different way, but I hope there is a solution that makes even FAST motion trackable so the mouth sits in the right position without a lag / delay.

I hope this can be fixed if possible; that would be great. Please keep up the good work! :)

To keep things clear and simple, the example video is 640x360 at 24 fps.

Download Video Example Attached as .ZIP file:

result_video.zip

A few screenshots from the video show the mouth not landing where it should (screenshots omitted).


Choice of hyperparameters

Hi, first of all, this is a really amazing project!

I was wondering how you decided to set:

  • sw, sync penalty weight = 0.03
  • sg, adversarial loss = 0.07

Also how did you choose:

  • learning rate
  • any weight decay (does not seem to be any)
  • any other important hyper-parameter that had a large effect

GANs are really hard to train in my experience, so I'm always interested in hearing the story of successful experiments!

Increase the resolution

Hi, thank you for sharing this code. The result is pretty amazing.
But I found that when I used a high-resolution video such as 1920x1080 as input, the output video became blurred. So can we increase the resolution at the training step?

Inference on high res video giving low res output

Hello,
Thank you for this great repository. I am trying it on a high-resolution video, and I noticed that the output is low resolution even if I don't use the --resize_factor flag. What's the reason for this, and how can I get high-resolution output?
Thanks
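As far as I can tell this is inherent to the pipeline: the detected face is resized down to the model's 96x96 input, and the 96x96 prediction is resized back up to the original face box and pasted into the full-resolution frame, so the mouth region cannot carry more detail than 96x96 regardless of the input resolution. A simplified sketch of that paste-back step (variable names are illustrative, not the exact code):

    import cv2
    import numpy as np

    def paste_prediction(frame, pred_96, box):
        # pred_96: the 96x96 face patch generated by the model
        # box: (y1, y2, x1, x2) face coordinates in the original frame
        y1, y2, x1, x2 = box
        upscaled = cv2.resize(pred_96.astype(np.uint8), (x2 - x1, y2 - y1))
        frame[y1:y2, x1:x2] = upscaled   # upscaling the 96x96 patch is what looks blurry
        return frame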

eval loss fluctuation

The sync loss is as follows:

4000step: current averaged_loss is ---------- 1.1633416414260864
5000step: current averaged_loss is ---------- 1.9757428169250488
6000step: current averaged_loss is ---------- 1.9490289688110352
7000step: current averaged_loss is ---------- 2.3177950382232666
8000step: current averaged_loss is ---------- 1.6252386569976807
9000step: current averaged_loss is ---------- 3.818169593811035
10000step: current averaged_loss is ---------- 1.719498872756958
11000step: current averaged_loss is ---------- 1.8442809581756592
12000step: current averaged_loss is ---------- 2.4841384887695312
13000step: current averaged_loss is ---------- 2.462939977645874
14000step: current averaged_loss is ---------- 3.738591432571411
15000step: current averaged_loss is ---------- 2.688401222229004
16000step: current averaged_loss is ---------- 3.177443027496338
17000step: current averaged_loss is ---------- 1.7362573146820068
18000step: current averaged_loss is ---------- 3.5759496688842773
19000step: current averaged_loss is ---------- 3.8388853073120117
20000step: current averaged_loss is ---------- 4.14736270904541
For this model, the training loss has decreased to around 0.1 - 0.2.

The wav2lip model eval loss:
1800step: L1: 0.019517337334755483, Sync loss: 5.680909522589875
2700step:L1: 0.01795881875151617, Sync loss: 5.4678046358124845
3600step:L1: 0.01703862974103992, Sync loss: 5.786964012620793
4500step:L1: 0.016784275337235307, Sync loss: 5.638755851397331
5400step:L1: 0.016678210001135944, Sync loss: 5.832544412830587
6300step:L1: 0.016361638768104446, Sync loss: 5.650567727150149
7200step:L1: 0.016196514041390213, Sync loss: 5.742747967151364
8100step: L1: 0.016216407553923878, Sync loss: 5.588838182910533
9000step: L1: 0.01602265675194806, Sync loss: 5.688869654707154
9900step: 0.016125425466531607, Sync loss: 5.708734381215889
10000step: L1: 0.016125425466531607, Sync loss: 5.708734381215889
10800step: L1: 0.01588278780883967, Sync loss: 5.918756739389199
11700step: L1: 0.01574758412622011, Sync loss: 5.581946962059989
12600step: L1: 0.015821209518815497, Sync loss: 5.620685570930449
13500step: L1: 0.015698263344598055, Sync loss: 5.617209954880784
14400step: L1: 0.015831564212969895, Sync loss: 5.579334572446499
15300step: L1: 0.015908794453667847, Sync loss: 5.662705282341907
16200step: L1: 0.01584938615055678, Sync loss: 5.67902198072507
17100step: L1: 0.015664026094666987, Sync loss: 5.836531450847076
1800step: L1: 0.01570050628954138, Sync loss: 5.806963780977246
1800step: L1: 0.015791057227724118, Sync loss: 5.494967527464351
1800step: L1: 0.015707670103827658, Sync loss: 5.7215739446087674
1800step: L1: 0.015890353739251, Sync loss: 5.7707554375734205
21600step: L1: 0.015616239360752867, Sync loss: 5.709768658187692
18000step: L1: 0.01574522866395843, Sync loss: 5.753696662893309
18900step: L1: 0.015643829487784953, Sync loss: 5.498267574079026
19800step: L1: 0.015661220601660208, Sync loss: 5.759692171500855
20700step: L1: 0.015491276214194195, Sync loss: 5.577403137075068
21600step: L1: 0.015616239360752867, Sync loss: 5.709768658187692
22500step: L1: 0.01574522866395843, Sync loss: 5.753696662893309
23400step: L1: 0.015643829487784953, Sync loss: 5.498267574079026
24300step: L1: 0.015661220601660208, Sync loss: 5.759692171500855
25200step: L1: 0.015491276214194195, Sync loss: 5.577403137075068
26100step: L1: 0.01578893579181268, Sync loss: 5.619578842939902

For this model, the training loss has decreased to around 0.004.

Do you think it is working well? Is it overfitting? @prajwalkr

Training not working

Hi @prajwalkr, I am trying to train with hq_wav2lip_train.py, but I have waited nearly an hour and nothing has happened; my GPU is using only 984 MB and all my CPUs are fully used.
Here is the command I ran:
python hq_wav2lip_train.py --data_root lrs2_preprocessed/ --checkpoint_dir train_checkpoint --syncnet_checkpoint_path checkpoints/lipsync_expert.pth
Output

  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
env/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:550: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])

use_cuda: True
total trainable params 36298035
total DISC trainable params 14113793
Load checkpoint from: checkpoints/lipsync_expert.pth
Starting Epoch: 0
0it [00:00, ?it/s]

(CPU and GPU usage screenshots omitted.)

Increase image size

Hi, if I want to increase the image size, what should I do?
The default image size is 96, but I want to get a clearer face.
Do I need to redesign the network and retrain?

By the way, your work is excellent.

How to improve?

The lip movement does not match the speech very well. I have used both the GAN and non-GAN models and tried two different parameter settings, but the results are still inaccurate. How can I improve the effect?
