yuanxunlu / livespeechportraits

Live Speech Portraits: Real-Time Photorealistic Talking-Head Animation (SIGGRAPH Asia 2021)

License: MIT License

Python 100.00%

livespeechportraits's Issues

Generate live speech for an arbitrary person

Thanks for sharing such nice work.
Can I run this code for an arbitrary person? How is the information in each person's folder obtained?

Thank you for your help!

Train method and code

Thanks to the authors for a very interesting paper and for open-sourcing the code.

I could not find any training method or code in your repo.
If it exists, please let me know where.
If not, could you provide the training method and code?

How to generate APC_feature_base.npy of each person?

Dear fellow,
I found that the manifold projection uses an APC_feature_base.npy file for each person, but it is not clear how to generate this file.
Does it come from the target person's voice data used to train audio2feature_model?

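Hello, sorry to bother you again.
I would like to ask whether you plan to share the training code later, and also a tutorial on TensorRT acceleration.
I also have a few questions:
1. Why is fps set to 60 here? Most videos are 25 fps; will this have a large impact? (Or, how should a 25 fps video be adjusted?)
2. Since 73 pre-defined facial landmarks are used as the intermediate representation, can the output be controlled by editing these landmarks? (Some models generalize poorly: if the edited landmarks fall outside the training data, the output is not ideal.)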
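
Regarding the 60 fps versus 25 fps question above, one possible adjustment is to resample per-frame landmark tracks in time. The sketch below is not from the repository; `resample_landmarks` is a hypothetical helper that linearly interpolates a `[T, N, D]` landmark array from one frame rate to another, and whether this is adequate for the models here is an open question.

```python
import numpy as np

def resample_landmarks(landmarks, src_fps, dst_fps):
    """Linearly interpolate a [T, N, D] landmark track from src_fps to dst_fps."""
    landmarks = np.asarray(landmarks, dtype=np.float64)
    n_src = landmarks.shape[0]
    duration = (n_src - 1) / src_fps               # time stamp of the last source frame
    n_dst = int(np.floor(duration * dst_fps)) + 1  # frames that fit in the same duration
    t_src = np.arange(n_src) / src_fps
    t_dst = np.arange(n_dst) / dst_fps
    flat = landmarks.reshape(n_src, -1)            # one column per coordinate
    cols = [np.interp(t_dst, t_src, flat[:, k]) for k in range(flat.shape[1])]
    return np.stack(cols, axis=1).reshape((n_dst,) + landmarks.shape[1:])
```

For example, a 26-frame track at 25 fps (one second) becomes a 61-frame track at 60 fps.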

is it possible to apply custom faces?

Hello, thank you for the amazing work!

Is there any possible way to have custom faces?
From what I understand, there are only a few faces available right now. Can we use custom inputs, or generate the right inputs ourselves?

Thanks again

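About training a model on own data

Thank you for the open-source code.
I have some questions about training with my own data.
My current understanding is that the model is split into audio2feature, audio2headpose, and feature2face, and that training my own model requires retraining audio2headpose and feature2face.
The dataset for the audio2headpose model needs per-frame 2d_landmark, 3d_landmark, trans, and headpose. From Sec. 4.1 of the paper:
1. 2d_landmark comes from the 73-point landmarks detected by an open-source tool.
2. 3d_landmark, trans, and headpose come from 3D face reconstruction.
Question 1: The datasets are split into audiovisual_dataset and face_dataset. Is audiovisual_dataset used to train audio2feature and audio2headpose, and face_dataset used to train feature2face?
Question 2: 3d_landmark appears in both 3d_fit_data.npz and tracked3D_normalized_pts_fix_contour.npy. What is the difference? How are "3d normalized" and "fix contour" computed?
Question 3: 3d_landmark contains negative values. Is the image center the origin? If so, what do scale, xc, and yc in change_paras.npz each represent?
Question 4: Is the data in tracked2D_normalized_pts_fix_contour.npy taken directly from the open-source detector? Its values are greater than 1, so it does not seem to be normalized. Is its relation to 3d_landmark that one lives in the pixel coordinate system (2D) and the other in the camera coordinate system (3D)?
I still hope you can publish a detailed document on building the dataset.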

build failure

This is a really interesting paper; thanks for the implementation.
I followed the documentation to run the demo, but I hit the following issue. I am using OpenCV 4.5.3, which is the latest release. Please let me know whether this is the right version, or what a resolution for this issue might be. Thank you.
Image2Image translation & Saving results...
Image2Image translation inference: 0%| | 0/672 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "./demo.py", line 256, in <module>
    facedataset.dataset.image_pad)
  File "/home/ranganaths/Documents/ai-proj/LiveSpeechPortraits/datasets/face_dataset.py", line 280, in get_data_test_mode
    feature_map = torch.from_numpy(self.get_feature_image(landmarks, (self.opt.loadSize, self.opt.loadSize), shoulder, pad)[np.newaxis, :].astype(np.float32)/255.)
  File "/home/ranganaths/Documents/ai-proj/LiveSpeechPortraits/datasets/face_dataset.py", line 287, in get_feature_image
    im_edges = self.draw_face_feature_maps(landmarks, size)
  File "/home/ranganaths/Documents/ai-proj/LiveSpeechPortraits/datasets/face_dataset.py", line 317, in draw_face_feature_maps
    im_edges = cv2.line(im_edges, tuple(keypoints[edge[i]]), tuple(keypoints[edge[i+1]]), 255, 2)
cv2.error: OpenCV(4.5.3) :-1: error: (-5:Bad argument) in function 'line'
Overload resolution failed:
Can't parse 'pt1'. Sequence item with index 0 has a wrong type
Can't parse 'pt1'. Sequence item with index 0 has a wrong type
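
A likely cause, judging from the `Can't parse 'pt1'` message, is that the landmark coordinates reach `cv2.line` as floating-point values, which recent OpenCV Python bindings reject for point arguments. The sketch below is a hypothetical workaround, not a fix confirmed by the repository's authors: `to_int_point` rounds a keypoint to a tuple of plain Python integers before it is passed to the drawing call.

```python
import numpy as np

def to_int_point(pt):
    """Convert an (x, y) pair of any numeric type to a tuple of Python ints."""
    x, y = np.rint(np.asarray(pt, dtype=np.float64)).astype(np.int64)[:2]
    return int(x), int(y)

# Assumed usage at the failing call in draw_face_feature_maps:
# im_edges = cv2.line(im_edges, to_int_point(keypoints[edge[i]]),
#                     to_int_point(keypoints[edge[i + 1]]), 255, 2)
```

Rounding rather than truncating keeps the drawn edges closest to the predicted landmark positions.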

about face tracking

First of all, thank you so much for your marvelous work.
Second, regarding face tracking: why do you use it, rather than simply extracting the landmarks from every frame with the landmark detector?
Thanks in advance.

Can you provide the links for the video sequences used in your paper

Hi, I saw that you list 8 video sequences in the appendix, but I cannot find the web links to the videos, which I need to get the exact dataset you used. Can you provide the eight links? Thanks.

APC model

Hello, were the weights of the APC model you provide pretrained on a Chinese corpus?

More details about the clip separation for training

Hi,
Recently I have been trying to reproduce the training part of audio2feature. Could you please share more details about the dataset construction, e.g., the duration of each clip and the overlap between two clips?

Thanks, and looking forward to your reply 😊.

Yunfei

Error while computing GMM Log Loss when predict_length = 5

When training the audio2headpose model using the default options provided in the code (predict_length = 5), the code fails when computing the GMM Log Loss.
The error is the following:
RuntimeError: The expanded size of the tensor (12) must match the existing size (60) at non-singleton dimension 3. Target sizes: [32, 100, 1, 12]. Tensor sizes: [32, 100, 1, 60]

The GMMLogLoss takes as input two tensors:

output with a size of [32, 100, 25] -> [batch_size, time_frame_length, (2 * A2H_GMM_ndim + 1) * A2H_GMM_ncenter]
target with a size of [32, 100, 60] -> [batch_size, time_frame_length, predict_length * 6]

Can you please explain why the model outputs 25 values? I understand that you want to output 12 values (pose and velocity) and, for each, a mu and a sigma (24 predictions), but why do you predict an extra feature?

How do you fix this issue with predict_length=5? Using predict_length = 1 obviously fixes the issue.
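
The shape arithmetic in this report can be checked directly. The sketch below assumes the common mixture-density layout (one mixture weight, `ndim` means, and `ndim` standard deviations per center), which is consistent with the formula quoted above but is not confirmed by the repository's authors; `gmm_param_count` is a hypothetical helper.

```python
def gmm_param_count(ndim, ncenter):
    """Parameters per time step: one weight + ndim means + ndim sigmas per center."""
    return (2 * ndim + 1) * ncenter

ndim, ncenter = 12, 1                 # values implied by the 25-wide output above
out_width = gmm_param_count(ndim, ncenter)
print(out_width)                      # 12 means + 12 sigmas + 1 mixture weight

# The error message shows a 60-wide target being expanded against a 12-wide
# tensor. 60 equals 5 predicted frames x 12 values, so it cannot match ndim.
predict_length = 5
target_width = predict_length * ndim
print(target_width, target_width == ndim)
```

Under that assumption, the extra 25th value would be the mixture weight, and the mismatch arises because the target stacks `predict_length` frames while the mixture is defined over a single 12-dimensional frame.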

About training the custom data!

Thanks for your great work! Could you please describe the training stage in detail? It would be a huge support for other researchers. My research is also related to talking-face generation, and I would love to collaborate in this field!

Error when running the demo

cv2.error: OpenCV(4.5.4) :-1: error: (-5:Bad argument) in function 'line'
Overload resolution failed:
Can't parse 'pt1'. Sequence item with index 0 has a wrong type
Can't parse 'pt1'. Sequence item with index 0 has a wrong type

I got this error when running the demo. Any solutions, please?

FileNotFoundError: [Errno 2] No such file or directory: './data/APC_epoch_160.model'

---------- Loading Model: APC-------------
Traceback (most recent call last):
  File "demo.py", line 146, in <module>
    APC_model.load_state_dict(torch.load(config['model_params']['APC']['ckp_path']), strict=False)
  File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 594, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 230, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 211, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: './data/APC_epoch_160.model'
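
The traceback shows `torch.load` failing because no file exists at the configured checkpoint path, which is relative to the current working directory. A minimal sketch of a pre-flight check using only the standard library follows; `require_checkpoint` is a hypothetical helper, and the hint in its message assumes the pretrained weights are distributed separately from the code.

```python
from pathlib import Path

def require_checkpoint(path):
    """Return the path as a Path object, or raise with a hint if the file is absent."""
    ckpt = Path(path)
    if not ckpt.is_file():
        raise FileNotFoundError(
            f"Checkpoint not found: {ckpt.resolve()}. "
            "Place the pretrained weights at this path and run the script "
            "from the repository root so that relative paths resolve."
        )
    return ckpt

# Assumed usage before loading, with the path taken from the error message:
# ckpt = require_checkpoint('./data/APC_epoch_160.model')
```

Printing the resolved absolute path makes it easier to see whether the script was started from an unexpected directory.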

Are there any instructions on how to train new models?

Thanks for your amazing work. Are there any instructions on how to train new models?

Two questions about `smooth_loss` in `audio2headpose_model`

Hi,
I'm trying to reproduce the training part of audio2headpose these days. I have two questions about the implementation.

1. Does mu_gen=Sample_GMM ... (line 103) in audio2headpose_model benefit performance? Also, the paper says 'We also tried with a Gaussian Mixture Model but found no obvious improvement', which confuses me a little. Are those the same thing? It seems that the Sample_GMM function implements Eq. (8) (please correct me if I am wrong).
2. The computational efficiency of Sample_GMM is rather low. When it is used (smooth_loss > 0), one epoch takes about 2 hours. I find that there are too many for-loops (line 99) and CPU operations. Are there any alternatives?

Can you share the links of these original videos?

Thanks

about Image2Image translation inference issue

Image2Image translation & Saving results...
Image2Image translation inference: 0%| | 0/672 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "demo.py", line 264, in <module>
    facedataset.dataset.image_pad)
  File "C:\Users\23046\LiveSpeechPortraits\datasets\face_dataset.py", line 280, in get_data_test_mode
    feature_map = torch.from_numpy(self.get_feature_image(landmarks, (self.opt.loadSize, self.opt.loadSize), shoulder, pad)[np.newaxis, :].astype(np.float32)/255.)
  File "C:\Users\23046\LiveSpeechPortraits\datasets\face_dataset.py", line 287, in get_feature_image
    im_edges = self.draw_face_feature_maps(landmarks, size)
  File "C:\Users\23046\LiveSpeechPortraits\datasets\face_dataset.py", line 317, in draw_face_feature_maps
    im_edges = cv2.line(im_edges, tuple(keypoints[edge[i]]), tuple(keypoints[edge[i+1]]), 255, 2)
cv2.error: OpenCV(4.5.4) :-1: error: (-5:Bad argument) in function 'line'
Overload resolution failed:
Can't parse 'pt1'. Sequence item with index 0 has a wrong type
Can't parse 'pt1'. Sequence item with index 0 has a wrong type

Audio to Mouth-related Motion

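Hello,
I read the "Audio to Mouth-related Motion" part of the paper. The audio features there come from an APC model trained on a Chinese dataset (the Mandarin Chinese part of the Common Voice dataset), and the audio2feature part then has to be retrained for each target person. Is that right?
But each target person may have only a few minutes of data (3-5 minutes). Won't that hurt generalization considerably, for example when training on a female speaker and testing with a male voice?
One approach I am currently trying is to train an audio-to-facial-landmark mapping with ATnet on a large dataset and then use the audio features extracted by ATnet. The remaining steps are similar to your paper: I fine-tune the mapping from audio features to facial-expression parameters with a 3-5 minute video. In my tests so far, however, the generalization is not satisfactory, so I would like to hear your opinion.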

How can I use my own image to get the corresponding speech video?

Dear fellow,
How can I use my own image to get the corresponding speech video? I would be grateful if you could elaborate, or point out which section of the article covers it.

Is it possible to replace landmark2face model with pix2pixHD?

As both are image-to-image translation models, what is the major difference between this model and pix2pixHD?
Thanks!

beta1 in train

The default beta1 in feature2face training is 0.5, while the article says you trained your model with beta1 = 0.9.
Is beta1 value critical in feature2face training?

Thanks!

Error

Thanks for sharing such nice work. I tried to replicate it but got the following error. Please guide me.
Many thanks.

  File "demo.py", line 264, in <module>
    facedataset.dataset.image_pad)
  File "/content/LiveSpeechPortraits-main/datasets/face_dataset.py", line 280, in get_data_test_mode
    feature_map = torch.from_numpy(self.get_feature_image(landmarks, (self.opt.loadSize, self.opt.loadSize), shoulder, pad)[np.newaxis, :].astype(np.float32)/255.)
  File "/content/LiveSpeechPortraits-main/datasets/face_dataset.py", line 287, in get_feature_image
    im_edges = self.draw_face_feature_maps(landmarks, size)
  File "/content/LiveSpeechPortraits-main/datasets/face_dataset.py", line 317, in draw_face_feature_maps
    im_edges = cv2.line(im_edges, tuple(keypoints[edge[i]]), tuple(keypoints[edge[i+1]]), 255, 2)
cv2.error: OpenCV(4.5.4-dev) :-1: error: (-5:Bad argument) in function 'line'
Overload resolution failed:
Can't parse 'pt1'. Sequence item with index 0 has a wrong type
Can't parse 'pt1'. Sequence item with index 0 has a wrong type

Licensing Training Framework

Hi,

I didn't see the training part in this repo.

Is there a way to license the training code or framework?

Thx.

doc to train own model

Hi,
Thanks for this implementation of the paper. Could you please add some documentation on training your own model? That would help a lot.

thank you

About how to choose candidate img set

Thank you for your outstanding work!
I am not sure how to interpret this sentence in the article:
"For the rest, we sample x- and y-axis rotation by uniform intervals and choose the nearest samples from intervals." in 3.4 'Candidate Image set'.
I do not understand how the last two images were chosen. Can you explain or provide reference code, please?
Thank you!
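
One possible reading of the quoted sentence: divide the observed rotation range into uniform intervals and, for each interval, pick the frame whose rotation is nearest to a representative angle. The sketch below illustrates that reading only. The function name, the number of intervals, and the choice of interval midpoints as targets are all assumptions, not the authors' procedure.

```python
import numpy as np

def nearest_to_uniform_targets(angles, n_targets):
    """Pick frame indices whose angle is nearest to each uniform-interval midpoint."""
    angles = np.asarray(angles, dtype=np.float64)
    edges = np.linspace(angles.min(), angles.max(), n_targets + 1)
    targets = 0.5 * (edges[:-1] + edges[1:])       # midpoint of each interval
    return [int(np.argmin(np.abs(angles - t))) for t in targets]

# Hypothetical per-frame head rotations (degrees) about the x and y axes.
rot_x = np.array([-10.0, -4.0, 0.5, 3.0, 9.0, 10.0])
rot_y = np.array([-20.0, -1.0, 2.0, 15.0, 19.0, 20.0])
print(nearest_to_uniform_targets(rot_x, 2))
print(nearest_to_uniform_targets(rot_y, 2))
```

The same selection could be run once per rotation axis, with the chosen frames added to the candidate set.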

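For specific training data, are head motion and candidate images necessary?

Thank you for the excellent work!
I would like to ask: if the training data is a specially recorded video in which, apart from the face, the rest of the body (neck, shoulders, etc.) barely moves, and the background is something like a green screen, are the upper-body motion synthesis and the candidate image set still necessary? Thanks!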

About the 73 landmark detector

Thanks for the wonderful project. I want to ask whether the 73-landmark detector you used is publicly available. I looked through the paper and the issues but did not find the name of the detector, so I am wondering whether it is an internal tool. Thank you very much.

Is it possible to use 2D landmark if not include head motion?

Dear Dr. Lu:
In the paper, sparse 3D landmarks are used as an intermediate representation, and these 3D positions are later projected onto the 2D image plane via the pre-computed camera intrinsic parameters 𝐾 to obtain the conditional feature maps.
I wonder whether it is OK to use 2D landmarks directly if I don't need head motion, as mentioned in #11.
Thanks again for your answer!
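
For reference, the projection step described in this question follows the standard pinhole camera model. The sketch below assumes the landmarks are already expressed in camera coordinates with positive depth; the intrinsic values are made up for illustration and do not come from the repository.

```python
import numpy as np

def project_points(pts3d, K):
    """Project [N, 3] camera-space points to [N, 2] pixel coordinates with intrinsics K."""
    pts3d = np.asarray(pts3d, dtype=np.float64)
    uvw = pts3d @ K.T                  # rows are (fx*X + cx*Z, fy*Y + cy*Z, Z)
    return uvw[:, :2] / uvw[:, 2:3]    # perspective divide by depth

K = np.array([[1000.0, 0.0, 256.0],
              [0.0, 1000.0, 256.0],
              [0.0, 0.0, 1.0]])       # hypothetical focal length and principal point
pts = np.array([[0.0, 0.0, 2.0], [0.1, -0.2, 2.0]])
print(project_points(pts, K))
```

A point on the optical axis lands on the principal point, and lateral offsets are scaled by focal length over depth.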

questions on feature matching loss. bugs?

Hello, thanks for the great job!

In your code (feature2face_model.py), the feature matching loss is computed as part of the D loss, which is then used to optimize the discriminator. Usually, however, the feature matching loss is used to optimize the generator, and your paper says so too.

I wonder whether there is something wrong with the code.

Thanks!
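
For context, the feature matching loss as formulated in pix2pixHD is a generator-side term. Whether this repository intends exactly the same formulation is an assumption here. With $D^{(i)}$ denoting the $i$-th layer feature map of the discriminator, $N_i$ its number of elements, $T$ the number of layers, $s$ the conditional input, and $x$ the real image:

```latex
\mathcal{L}_{\mathrm{FM}}(G, D) = \mathbb{E}_{(s, x)} \sum_{i=1}^{T} \frac{1}{N_i}
  \left\lVert D^{(i)}(s, x) - D^{(i)}\bigl(s, G(s)\bigr) \right\rVert_1
```

In that formulation the discriminator only serves as a feature extractor for this term, and the term is minimized with respect to $G$.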

cpu inference

Please add an option for CPU inference in demo.py.

Are landmarks + headpose equal to the actual landmarks from the detector?

I see that you disentangle landmarks and head pose, but a general landmark detector outputs the actual landmarks, which already include the head pose. Do you obtain the neutral landmarks by applying the inverse head-pose transform to those detected landmarks?
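
If the detected 3D landmarks are modeled as a rigid transform of pose-free landmarks, removing the head pose amounts to inverting that transform. The sketch below assumes the model `observed = R @ neutral + t` with a rotation matrix `R` and a translation `t`; this convention and both function names are assumptions and may differ from the repository's.

```python
import numpy as np

def apply_head_pose(neutral, R, t):
    """Apply a rigid head pose to [N, 3] pose-free landmarks."""
    return np.asarray(neutral) @ R.T + t

def remove_head_pose(observed, R, t):
    """Invert observed = R @ neutral + t for an [N, 3] landmark array."""
    return (np.asarray(observed) - t) @ R      # row-vector form of R.T @ (p - t)
```

Because `R` is orthonormal, its inverse is its transpose, so no matrix inversion is needed.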

about Data preprocess

Impressive job!

I wonder how to preprocess the images. Specifically, could you please share the scripts on choosing the four candidate images from the sequences and how to draw the shoulder edges since the landmark detectors I have found are all face landmark detectors.

Thanks !

I want to train a model for a specific speaker

Hi, I want to train a model for a specific speaker. Can I train only the Audio2Feature and Audio2Headpose models on top of your APC model weights? (In other words, is the APC model generalizable?) Can you give me some advice? Thanks a lot.

ModuleNotFoundError: No module named 'datasets'

Hello, is a file missing here?
ModuleNotFoundError: No module named 'datasets'

Audio to Mouth-related Motion

Hi, great work! As I read in your paper, one needs the ground-truth mouth displacements to train the Audio2Feature model. You mention [Shi, 2014; Thies, 2016] for tracking the 3D mouth shape and head pose. Are there any open-source tools that could help me get these data?

Questions about training audio2feature model

Hello, I am trying to reconstruct the training code and there are several questions I have:

1. From what I saw in audio2feature_model.py, in the forward module, the size of self.audio_feats is [b, 1, nfeats, nwins], while in audio2feature.py the dimension of audio_features is [b, T, ndim]. From my understanding (correct me if I am wrong), for batch_size=32, T=240*2, ndim=512 (the APC feature dimension), the input batch for the Audio2Feature model should be [32, 480, 512] (480 because mel_frame is n_frames * 2) and the output size is [32, 240, 75]. Is that right?
2. Furthermore, Section 3.2 of your paper says a delay d=18 is added during training, but I do not see this reflected in the code. How does that work in training? For example, is m0 inferred from h0, h1, ..., h18?
3. In audiovisual_dataset.py, you seem to clip the audio into many pieces and extract APC features for each piece. What is the number of clips for a given dataset, e.g. a 4-minute 60 fps video?

Some of these may be naive questions, as I am not very familiar with the audio processing field. Please correct me if I made mistakes. Thanks!

Is my understanding correct?

Hello, my understanding of how the paper is organized is that the third section describes the application stage, in which audio drives an already trained portrait model, and the fourth section describes how, given an in-the-wild video, the corresponding model is trained. Is my understanding correct? Thank you very much!

google colab

please add a google colab for inference

A weird error, And How to run with live speech?

I am wondering how to run it with real-time audio from a microphone, as done in the demo video.
Also, is it possible to run it with PyTorch on the CPU, i.e. without CUDA?
Thanks for sharing!

AttributeError: module 'librosa' has no attribute 'output'

https://stackoverflow.com/questions/63997969/attributeerror-module-librosa-has-no-attribute-output

would suggest using soundfile instead in demo.py

sf.write(tmp_audio_path, tmp_audio_clip, sr)

Streaming audio data inference

Hey there! Many thanks for your work - it looks awesome!
I tried to implement a live, Audio2Headpose-only pipeline based on your work. When running inference on chunks, I see big gaps in the result values between the end of one chunk and the start of the next (APC_model uses historic data), so the curve is not smooth. What is the best way to solve this issue? Can inference with a sliding window help?

thank you
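
One common way to reduce discontinuities at chunk boundaries is to give each chunk some left context from the preceding audio and keep only the outputs that correspond to the new samples. The sketch below shows the chunking logic only, with a stand-in `model` callable that returns one output per input sample. It illustrates the sliding-window idea in general, not the repository's streaming pipeline, and all names are hypothetical.

```python
import numpy as np

def stream_with_context(signal, chunk, context, model):
    """Run model on overlapping windows; keep only outputs for the new samples."""
    outputs = []
    for start in range(0, len(signal), chunk):
        left = max(0, start - context)         # reuse up to `context` past samples
        window = signal[left:start + chunk]
        pred = model(window)                   # assumed: one output per input sample
        outputs.append(pred[start - left:])    # drop outputs for the context part
    return np.concatenate(outputs)
```

If the model's dependence on history is no longer than `context`, the streamed output matches what a single pass over the full signal would produce. For recurrent models with unbounded history, carrying the hidden state across chunks is the usual alternative.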

Could you share the video preprocessing method? I want to extend the project to another person. How should I proceed?

about Audio2Mouth model using own datasets

Now I want to train Audio2Lip. I use only the APC_encoder model to extract the audio features. Then, by slicing the audio feature vectors in order, I build a set of audio vectors for each frame of lip images. However, I found that there are no consecutive sets of audio vectors to predict a frame of lip image. Does this prevent the LSTM from playing its role? The model I use is Audio2Mouth, but I changed 253 to 363 to fit my setup. The prediction quality is not ideal, and with muted audio the lips don't close. Can you tell me how long it takes to train one character, and how long until it becomes effective? Thank you very much.

I used Chinese audio to predict the corresponding lips video
https://user-images.githubusercontent.com/41277638/156123996-e0e92826-de13-46b4-ba86-c729f46d8e42.mp4

As for 3D face tracking: is it computed from the original image (the original video is 1920*1080), and is it affected by camera calibration? If I use DECA to compute the 3D key points for each frame, how does that differ from your method?
What method is used to obtain the ground truth of the head pose? (I use OpenFace 2.2.0 to get the head pose ground truth.)

Thank you again for publishing the code. Thank you very much!