yuanxunlu / livespeechportraits
Live Speech Portraits: Real-Time Photorealistic Talking-Head Animation (SIGGRAPH Asia 2021)
License: MIT License
Thanks for sharing such nice work.
Can I run this code for an arbitrary person? How is the information in each person's folder obtained?
Thank you for your help!
Thanks to the authors for providing a very interesting paper and for opening your code.
In your repo, I could not find any training method or code.
If there is, please let me know.
If there is not, could you provide the training method and code?
Dear fellow,
I found that manifold projection uses an APC_feature_base.npy for each person, but it is not clear how to generate this file.
Is the target person's voice used to train audio2feature_model?
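Hello, sorry to bother you again.
I would like to ask whether you plan to share the training code later, and also a tutorial on TensorRT acceleration.
I also have a few questions:
1. Why is the fps set to 60 here? Most videos are 25 fps, so I wonder whether this has a large effect (or how a 25 fps video should be adjusted).
2. With the 73 pre-defined facial landmarks as the intermediate representation, can the output be controlled by editing these landmarks? (Some models generalize poorly: if the edited landmarks fall outside the training data, the output is not ideal.)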
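For the 25 fps question above, one generic option (an assumption, not something the authors describe) is to resample a per-frame sequence between frame rates by linear interpolation in time. `resample_sequence` is a hypothetical helper; whether models trained at 60 fps tolerate resampled data is not established anywhere in this thread.

```python
def resample_sequence(frames, src_fps, dst_fps):
    """Linearly resample per-frame feature vectors from src_fps to dst_fps."""
    if len(frames) < 2:
        return [list(f) for f in frames]
    duration = (len(frames) - 1) / src_fps           # time of the last source frame
    n_out = int(duration * dst_fps + 1e-9) + 1       # output frames within that span
    out = []
    for k in range(n_out):
        pos = k / dst_fps * src_fps                  # fractional source index
        i = min(int(pos), len(frames) - 2)           # left neighbour
        w = pos - i                                  # weight of the right neighbour
        out.append([(1 - w) * a + w * b for a, b in zip(frames[i], frames[i + 1])])
    return out
```

Calling `resample_sequence(landmarks_60fps, 60, 25)` on a list of flattened landmark vectors would give a 25 fps track covering the same time span.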
Hello, thank you for the amazing work!
Is there any possible way to have custom faces?
From what I understand, there are only a few faces available right now. Are we able to use custom inputs, or generate the right inputs ourselves?
Thanks again
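Thank you for the open-source code.
I have some questions about training with my own data.
My current understanding is that the model is split into audio2feature, audio2headpose, and feature2face, and that training my own model requires retraining audio2headpose and feature2face.
The dataset for the audio2headpose model needs the 2d_landmark, 3d_landmark, trans, and headpose for every frame. From Sec. 4.1 of the paper:
1. 2d_landmark comes from the 73-point landmarks detected by an open-source tool.
2. 3d_landmark, trans, and headpose come from 3D face reconstruction.
Question 1: The dataset is split into audiovisual_dataset and face_dataset. Is audiovisual_dataset used to train audio2feature and audio2headpose, and face_dataset used to train feature2face?
Question 2: 3d_landmark appears in both 3d_fit_data.npz and tracked3D_normalized_pts_fix_contour.npy. What is the difference between them? How are "3D normalized" and "fix contour" done?
Question 3: 3d_landmark contains negative values. Is the image center used as the origin? If so, what do scale, xc, and yc in change_paras.npz represent?
Question 4: Is the data in tracked2D_normalized_pts_fix_contour.npy taken directly from the open-source detector? Its values are greater than 1, so it does not seem to be normalized. Is its relationship to 3d_landmark that one lives in the pixel coordinate system (2D) and the other in the camera coordinate system (3D)?
I still hope you can provide detailed documentation on building the dataset.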
This is a really interesting paper, thanks for the implementation.
I tried to follow the documentation to run the demo, but I am hitting the following issue. I am using OpenCV 4.5.3, which is the latest. Please let me know whether I am using the right version, or what a resolution for this issue might be. Thank you.
Image2Image translation & Saving results...
Image2Image translation inference: 0%| | 0/672 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "./demo.py", line 256, in <module>
    facedataset.dataset.image_pad)
  File "/home/ranganaths/Documents/ai-proj/LiveSpeechPortraits/datasets/face_dataset.py", line 280, in get_data_test_mode
    feature_map = torch.from_numpy(self.get_feature_image(landmarks, (self.opt.loadSize, self.opt.loadSize), shoulder, pad)[np.newaxis, :].astype(np.float32)/255.)
  File "/home/ranganaths/Documents/ai-proj/LiveSpeechPortraits/datasets/face_dataset.py", line 287, in get_feature_image
    im_edges = self.draw_face_feature_maps(landmarks, size)
  File "/home/ranganaths/Documents/ai-proj/LiveSpeechPortraits/datasets/face_dataset.py", line 317, in draw_face_feature_maps
    im_edges = cv2.line(im_edges, tuple(keypoints[edge[i]]), tuple(keypoints[edge[i+1]]), 255, 2)
cv2.error: OpenCV(4.5.3) :-1: error: (-5:Bad argument) in function 'line'
Overload resolution failed:
- Can't parse 'pt1'. Sequence item with index 0 has a wrong type
- Can't parse 'pt1'. Sequence item with index 0 has a wrong type
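This error is commonly seen when the point coordinates passed to `cv2.line` are floats (for example NumPy float32 values) rather than integers. That reading is an assumption based on the "wrong type" message, not a confirmed diagnosis of this repository. A minimal sketch of the usual workaround, with `to_int_point` as a hypothetical helper:

```python
def to_int_point(point):
    """Convert an (x, y) pair of float-like values to a tuple of Python ints."""
    x, y = point
    return (int(round(float(x))), int(round(float(y))))

# Hypothetical use inside draw_face_feature_maps (names taken from the traceback):
# im_edges = cv2.line(im_edges,
#                     to_int_point(keypoints[edge[i]]),
#                     to_int_point(keypoints[edge[i + 1]]),
#                     255, 2)
print(to_int_point((12.7, 40.2)))  # (13, 40)
```

Rounding before the cast keeps the drawn line within half a pixel of the float position; casting the whole landmark array to an integer type once, before the drawing loop, would be an equivalent alternative.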
First of all, thank you so much for your marvelous work.
Second, regarding face tracking: why do you use it instead of simply extracting the landmarks from every frame with the landmark detector?
Thanks in advance.
Hi, I saw that you list 8 video sequences in the appendix, but I cannot find the web links to the videos needed to get the exact dataset you used. Can you provide the eight links? Thanks.
Hello, were the APC model weights you provide pre-trained on a Chinese corpus?
Hi,
Recently I have been trying to reproduce the training of audio2feature. Could you please share more details about the dataset construction, e.g., the duration of each clip and the overlap between two clips?
Thanks, and looking forward to your reply😊.
Yunfei
When training the audio2headpose model using the default options provided in the code (predict_length = 5), the code fails when computing the GMM log loss.
The error is the following:
RuntimeError: The expanded size of the tensor (12) must match the existing size (60) at non-singleton dimension 3. Target sizes: [32, 100, 1, 12]. Tensor sizes: [32, 100, 1, 60]
The GMMLogLoss takes as input two tensors:
output with a size of [32, 100, 25] -> [batch_size, time_frame_length, (2 * A2H_GMM_ndim + 1) * A2H_GMM_ncenter]
target with a size of [32, 100, 60] -> [batch_size, time_frame_length, predict_length * 6]
Can you please explain why the model outputs 25 values? I understand that you want to output 12 values (pose and velocity), each with a mu and a sigma (24 predictions), but why do you predict an extra feature?
How do you fix this issue with predict_length = 5? Using predict_length = 1 obviously fixes the issue.
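The size formula quoted above can be checked with a few lines of arithmetic. Under the common mixture-density parameterization (an assumption here, since the repository's Sample_GMM is not shown in this thread), each mixture center carries one weight, ndim means, and ndim standard deviations, which would account for the extra value. `gmm_output_dim` is a hypothetical helper.

```python
def gmm_output_dim(ndim, ncenter):
    """Size of a GMM head: per center, 1 weight + ndim means + ndim sigmas."""
    return (2 * ndim + 1) * ncenter

ndim, ncenter = 12, 1
print(gmm_output_dim(ndim, ncenter))   # 25, matching the output size in the issue

# The reported target has 60 values per step while the GMM models ndim = 12,
# so a loss that compares them element-wise cannot broadcast 60 against 12.
target_dim = 60
print(target_dim == ndim)              # False: the shapes disagree
```

If the target is meant to hold several frames of 12 values each, either the GMM dimension would have to grow to 60 (121 outputs for one center) or the target would have to be cut to a single frame. Which of these the authors intended is not stated here.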
Thanks for your great job! Can you please describe the training stage in detail? It would be a huge support for others in their research. My research is also related to talking-face generation, and I would love to collaborate in this field!
cv2.error: OpenCV(4.5.4) :-1: error: (-5:Bad argument) in function 'line'
Overload resolution failed:
- Can't parse 'pt1'. Sequence item with index 0 has a wrong type
- Can't parse 'pt1'. Sequence item with index 0 has a wrong type
I got this error when running the demo. Are there any solutions, please?
---------- Loading Model: APC-------------
Traceback (most recent call last):
  File "demo.py", line 146, in <module>
    APC_model.load_state_dict(torch.load(config['model_params']['APC']['ckp_path']), strict=False)
  File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 594, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 230, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 211, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: './data/APC_epoch_160.model'
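The traceback only says that the checkpoint file is absent from ./data; presumably the pretrained weights were not downloaded or were placed elsewhere, though that is an assumption. A small pre-flight check along these lines (with `missing_files` as a hypothetical helper) reports the problem before any model is built:

```python
from pathlib import Path

def missing_files(paths):
    """Return the paths that do not point to an existing file."""
    return [p for p in paths if not Path(p).is_file()]

# Path taken from the traceback above; adjust to wherever the weights were saved.
required = ["./data/APC_epoch_160.model"]
for path in missing_files(required):
    print(f"Missing checkpoint: {path} (download the pretrained data first)")
```

Relative paths resolve against the current working directory, so running the demo from a directory other than the repository root can trigger the same error even when the file exists.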
Thanks for your amazing work. Are there any instructions on how to train new models?
Hi,
I have been trying to reproduce the training of audio2headpose these days, and I have two questions about the implementation.
1. Does mu_gen=Sample_GMM ... (line 103) in audio2headpose_model benefit performance? Besides, the paper says 'We also tried with a Gaussian Mixture Model but found no obvious improvement', which confuses me a little. Are those the same thing? It seems that Eq. (8) is implemented by the Sample_GMM function (please correct me if I am wrong).
2. Sample_GMM is rather slow. When using it (setting smooth_loss > 0), one epoch takes ~2 h. I find that there are too many for-loops (line 99) and CPU operations. Are there other alternatives?
Thanks
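Hello,
The paper's "Audio to Mouth-related Motion" section says the audio features come from an APC model trained on a Chinese dataset (the Mandarin Chinese part of the Common Voice dataset). Does the audio2feature part then have to be retrained for each target person?
Each target person may only have a few minutes of data (3-5 minutes). Won't that hurt generalization considerably, for example when training on a female speaker and testing with a male voice?
One approach I am currently trying is to train an audio-to-facial-landmark mapping with ATnet on a large dataset and then use the audio features extracted by ATnet. The rest is similar to your paper: a 3-5 minute video is used to fine-tune the mapping from audio features to expression-related facial parameters. In my tests so far, however, the generalization does not seem ideal, so I would like to hear your view.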
Dear fellow,
How can I use my own image to get the corresponding speech video? I would be grateful if you could elaborate or point out which section of the article covers it.
Since both are image-to-image translation, what is the major difference between this model and pix2pixHD?
Thanks!
The default beta1 in feature2face training is 0.5, while the article says you train your model with beta1 = 0.9.
Is the beta1 value critical in feature2face training?
Thanks!
Thanks for sharing such nice work. I tried replicating the work, but I got the following error. Please guide me.
Many thanks.
  File "demo.py", line 264, in <module>
    facedataset.dataset.image_pad)
  File "/content/LiveSpeechPortraits-main/datasets/face_dataset.py", line 280, in get_data_test_mode
    feature_map = torch.from_numpy(self.get_feature_image(landmarks, (self.opt.loadSize, self.opt.loadSize), shoulder, pad)[np.newaxis, :].astype(np.float32)/255.)
  File "/content/LiveSpeechPortraits-main/datasets/face_dataset.py", line 287, in get_feature_image
    im_edges = self.draw_face_feature_maps(landmarks, size)
  File "/content/LiveSpeechPortraits-main/datasets/face_dataset.py", line 317, in draw_face_feature_maps
    im_edges = cv2.line(im_edges, tuple(keypoints[edge[i]]), tuple(keypoints[edge[i+1]]), 255, 2)
cv2.error: OpenCV(4.5.4-dev) :-1: error: (-5:Bad argument) in function 'line'
Overload resolution failed:
- Can't parse 'pt1'. Sequence item with index 0 has a wrong type
- Can't parse 'pt1'. Sequence item with index 0 has a wrong type
Hi,
I didn't see the training part in this repo.
Is there a way to license the training code or framework?
Thx.
Hi,
Thanks for this paper implementation. Can you please add some documentation on training one's own model? That would help a lot.
Thank you.
Thank you for your outstanding work!
I am not sure how to understand this sentence in Section 3.4, 'Candidate Image Set', of the article:
"For the rest, we sample x- and y-axis rotation by uniform intervals and choose the nearest samples from intervals."
I do not understand how the last two images were chosen. Can you explain or provide reference code, please?
Thank you!
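One possible reading of the quoted sentence, sketched below as an assumption rather than the authors' procedure: split the observed range of a rotation angle into uniform intervals, take each interval's midpoint as a target, and pick the frame whose angle is nearest to it. `pick_by_uniform_rotation` is hypothetical, and the angles are made-up values.

```python
def pick_by_uniform_rotation(angles, n_picks):
    """Return indices of frames nearest to the centers of n_picks uniform intervals."""
    lo, hi = min(angles), max(angles)
    step = (hi - lo) / n_picks
    picks = []
    for k in range(n_picks):
        target = lo + (k + 0.5) * step               # midpoint of the k-th interval
        nearest = min(range(len(angles)), key=lambda i: abs(angles[i] - target))
        picks.append(nearest)
    return picks

x_rotation = [-10.0, -4.0, 0.5, 3.0, 9.0, 10.0]      # made-up per-frame angles in degrees
print(pick_by_uniform_rotation(x_rotation, 2))       # [1, 3]
```

The same routine could be run once on the x-axis rotation and once on the y-axis rotation to obtain one candidate frame per axis and interval.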
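Thank you for the excellent work!
I would like to ask: if the training data is a specially recorded video in which everything except the face (neck, shoulders, etc.) barely moves, and the background is something like a green screen, are the upper body motion synthesis and candidate image set parts still necessary? Thanks!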
Thanks for the wonderful project. I want to ask whether the 73-landmark detector you used is publicly available. I looked through the paper and the issues but have not found the detector's name, so I am wondering whether it is an internal tool. Thank you very much.
Dear Dr. Lu:
In the paper, sparse 3D landmarks are used as an intermediate representation, and these 3D positions are later projected to the 2D image plane via the pre-computed camera intrinsic parameters 𝐾 to get the conditional feature maps.
I wonder whether it is OK to use 2D landmarks directly if I don't need head motion, as mentioned in #11.
Thanks again for your answer!
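For reference, the projection step mentioned in this question is the standard pinhole model. The sketch below assumes zero skew, points already expressed in camera coordinates with z > 0, and made-up intrinsics; `project_points` is a hypothetical helper, not a function from the repository.

```python
def project_points(points_3d, K):
    """Project camera-space 3D points to pixels with a 3x3 intrinsic matrix K."""
    fx, fy = K[0][0], K[1][1]
    cx, cy = K[0][2], K[1][2]
    pixels = []
    for x, y, z in points_3d:
        # Perspective divide by depth, then shift by the principal point.
        pixels.append((fx * x / z + cx, fy * y / z + cy))
    return pixels

K = [[1000.0, 0.0, 256.0],
     [0.0, 1000.0, 256.0],
     [0.0, 0.0, 1.0]]                                # made-up intrinsics for illustration
print(project_points([(0.0, 0.0, 2.0), (0.1, -0.05, 2.0)], K))
```

A point on the optical axis lands on the principal point (256, 256), which makes a quick sanity check for any K.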
Hello, thanks for the great job!
In your code, feature2face_model.py, the feature matching loss is computed as part of the D loss, which is then used to optimize the discriminator. Usually, however, the feature matching loss is used to optimize the generator, and your paper says so.
I wonder whether there is something wrong with the code.
Thanks!
Please add an option for CPU inference in demo.py.
I found that you disentangle landmarks and head pose, but a general landmark detector detects the actual landmarks, with the head pose already applied. Do you obtain the neutral landmarks by applying the inverse of the head pose to those detected landmarks?
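The inverse operation this question describes can be written down if the head pose is assumed to be a rigid transform, p_cam = R p_obj + t, with rotation R and translation t. Whether the authors derive their pose-free landmarks this way is not confirmed in this thread; `remove_head_pose` is a hypothetical helper.

```python
def remove_head_pose(points, R, t):
    """Undo p_cam = R @ p_obj + t, returning p_obj = R^T @ (p_cam - t)."""
    out = []
    for p in points:
        d = [p[i] - t[i] for i in range(3)]          # remove the translation first
        # R is orthonormal, so its inverse is its transpose:
        # component j of R^T @ d is the dot product of column j of R with d.
        out.append(tuple(sum(R[i][j] * d[i] for i in range(3)) for j in range(3)))
    return out

R = [[0, -1, 0],
     [1, 0, 0],
     [0, 0, 1]]                                      # 90 degree rotation about z
t = (1, 2, 3)
print(remove_head_pose([(1, 3, 3)], R, t))           # [(1, 0, 0)]
```

This needs 3D points; with only 2D detections the depth is missing, which is one reason a 3D tracking step is typically required first.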
Impressive job!
I wonder how to preprocess the images. Specifically, could you please share the scripts for choosing the four candidate images from the sequences and explain how to draw the shoulder edges? The landmark detectors I have found are all face landmark detectors.
Thanks !
Hi, I want to train a model for a specific speaker. Can I train only the Audio2Feature and Audio2Headpose models on top of your APC model weights? (In other words, is the APC model generalizable?) Can you give me some advice? Thanks a lot.
Hello, is a file missing here?
ModuleNotFoundError: No module named 'datasets'
Hi, great work! As I read in your paper, one needs the ground-truth mouth displacements to train the Audio2Feature model. You cite [Shi, 2014; Thies, 2016] for tracking the 3D mouth shape and head pose. Are there any open-source tools that could help me get these data?
Hello, I am trying to reconstruct the training code, and I have several questions:
From what I saw in audio2feature_model.py, in the forward module, the size of self.audio_feats is [b, 1, nfeats, nwins], while in audio2feature.py the dimension of audio_features is [b, T, ndim]. From my understanding (correct me if I am wrong), for batch_size=32, T=240*2, ndim=512 (the APC feature dimension), the input batch for the Audio2Feature model should be [32, 480, 512] (480 because mel_frame is n_frames * 2) and the output size is [32, 240, 75]. Is that right?
Furthermore, Section 3.2 of your paper says a delay d=18 is added during training, but this is not reflected in the code. How does that work in training? For example, is m0 inferred from h0, h1, ..., h18?
In audiovisual_dataset.py, you seem to clip the audio into many pieces and extract the APC feature for each piece. What is the number of clips for a given dataset, e.g., a 4-minute 60 fps video?
Some of these may be stupid questions, as I am not very familiar with the audio processing field. Please correct me if I made mistakes. Thanks!
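On the delay question, one common way to train a causal recurrent model with a look-ahead of d steps is to supervise the output at step t + d with the target for step t, so that the model has consumed h0 ... h(t+d) when it predicts m_t. The sketch below shows only that index shift; it assumes audio features and targets have already been brought to the same frame rate, and it is not taken from the repository. `align_with_delay` is a hypothetical helper.

```python
def align_with_delay(audio_feats, targets, delay):
    """Pair target t with audio step t + delay, dropping steps without a partner."""
    n = min(len(audio_feats) - delay, len(targets))
    if n <= 0:
        return []
    return [(audio_feats[t + delay], targets[t]) for t in range(n)]

# Toy sequences: integers stand in for audio features, letters for mouth targets.
pairs = align_with_delay(list(range(10)), list("abcdefghij"), 3)
print(pairs[0], pairs[-1])                           # (3, 'a') (9, 'g')
```

With delay = 18 this matches the reading in the question: the first target m0 is paired with the model state after h18.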
Hello, my understanding of how the paper is organized is that Section 3 covers the application stage, where audio drives a portrait using an already trained character model, and Section 4 covers taking an in-the-wild video and training the corresponding model. Is my understanding correct? Thank you very much!
Please add a Google Colab notebook for inference.
I am wondering how to run it with real-time audio from a microphone, as done in the demo video.
Also, is it possible to run it with CPU-only PyTorch, i.e., without CUDA?
Thanks for sharing!
https://stackoverflow.com/questions/63997969/attributeerror-module-librosa-has-no-attribute-output
I would suggest using soundfile instead in demo.py:
sf.write(tmp_audio_path, tmp_audio_clip, sr)
Hey there! Many thanks for your work - it looks awesome!
I tried to implement a live Audio2Headpose-only pipeline based on your work. When inferring chunk by chunk, I see big gaps in the result values between the end of one chunk and the start of the next (APC_model uses historical data), so the curve is not smooth. What is the best way to solve this issue? Would inference with a sliding window help?
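A sliding window with overlap is one generic way to hide seams between chunks (an assumption, not the authors' recommendation): run consecutive windows so that they share `overlap` steps, then linearly crossfade the shared steps. `crossfade_chunks` is hypothetical and works on a single scalar pose channel; it expects `overlap` to be no larger than either chunk.

```python
def crossfade_chunks(prev, nxt, overlap):
    """Join two 1-D prediction chunks whose last/first `overlap` steps coincide."""
    if overlap <= 0:
        return list(prev) + list(nxt)
    head, tail = prev[:-overlap], prev[-overlap:]
    blended = []
    for k in range(overlap):
        w = (k + 1) / (overlap + 1)                  # weight of the newer chunk rises
        blended.append((1 - w) * tail[k] + w * nxt[k])
    return list(head) + blended + list(nxt[overlap:])

joined = crossfade_chunks([0.0, 0.0, 0.0, 0.0], [3.0, 3.0, 3.0, 3.0], 2)
print(joined)  # ramps from 0 to 3 across the two shared steps
```

The blend removes the jump but adds `overlap` steps of latency, since the shared steps cannot be emitted until the next chunk has been inferred.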
Could you describe the video preprocessing method? I want to extend the project to another person. How should I do that?
Now I want to train Audio2Lip. I use only the APC_encoder model to extract the audio features. Then I build, for each frame of lip images, a set of audio vectors by slicing the audio feature vectors in order. However, I found that no consecutive sets of audio vectors are used to predict a frame of the lip image. Does this prevent the LSTM from playing its role? The model I use is Audio2Mouth, but I changed 253 to 363 to fit my case. The prediction quality is not ideal, and with muted audio the lips do not close. Can you tell me how long it takes to train a character, and after how long the results become usable? Thank you very much.
I used Chinese audio to predict the corresponding lip video:
https://user-images.githubusercontent.com/41277638/156123996-e0e92826-de13-46b4-ba86-c729f46d8e42.mp4
Hello, thank you for open-sourcing your excellent work. The Google Drive link in the project is incorrect.
As for 3D face tracking: is it computed on the original images (the original video is 1920*1080), and is it affected by camera calibration? If I use DECA to compute the 3D keypoints for each frame, what is the difference between the two methods?
What method is used to obtain the ground truth of the head pose? (I use OpenFace 2.2.0 to get the head pose ground truth.)
Thank you again for publishing the code. Thank you very much!