vipl-audio-visual-speech-understanding / learn-an-effective-lip-reading-model-without-pains Goto Github PK

The PyTorch Code and Model In "Learn an Effective Lip Reading Model without Pains", (https://arxiv.org/abs/2011.07557), which reaches the state-of-art performance in LRW-1000 dataset.

Python 100.00%

lipreading pytorch deep-learning

learn-an-effective-lip-reading-model-without-pains's Introduction

Learn an Effective Lip Reading Model without Pains

Introduction

This is the repository of An Efficient Software for Building Lip Reading Models Without Pains. In this repository, we provide a deep lip reading pipeline as well as pre-trained models and training settings. We evaluate our pipeline on LRW Dataset and LRW1000 Dataset. We obtain 88.4% and 56.0% on LRW and LRW-1000, respectively. The results are comparable and even surpass current state-of-the-art results. Especially, we reach the current state-of-the-art result (56.0%) on LRW-1000 Dataset.

Benchmark

Year	Method	LRW	LRW-1000
2017	Chung et al.	61.1%	25.7%
2017	Stafylakis et al.	83.5%	38.2%
2018	Stafylakis et al.	88.8%	-
2019	Yang et at.	-	38.19%
2019	Wang et al.	83.3%	36.9%
2019	Weng et al.	84.1%	-
2020	Luo et al.	83.5%	38.7%
2020	Zhao et al.	84.4%	38.7%
2020	Zhang et al.	85.0%	45.2%
2020	Martinez et al.	85.3%	41.4%
2020	Ma et al.	87.7%	43.2%
2020	ResNet18 + BiGRU (Baseline + Cosine LR)	85.0%	47.1%
2020	ResNet18 + BiGRU (Baseline with word boundary + Cosine LR)	87.5%	55.0%
2020	Our Method	86.2%	48.3%
2020	Our Method (with word boundary)	88.4%	56.0%

Dataset Preparation

Download LRW Dataset and LRW1000 Dataset and link lrw_mp4 and LRW1000_Public in the root of this repository:

ln -s PATH_TO_DATA/lrw_mp4 .
ln -s PATH_TO_DATA/LRW1000_Public .

Run scripts/prepare_lrw.py and scripts/prepare_lrw1000.py to generate training samples of LRW and LRW-1000 Dataset respectively:

python scripts/prepare_lrw.py
python scripts/prepare_lrw1000.py

The mouth videos, labels, and word boundary information will be saved in the .pkl format. We pack image sequence as jpeg format into our .pkl files and decoding via PyTurboJPEG. If you want to use your own dataset, you may need to modify the utils/dataset.py file.

Pretrain Weights

We provide pretrained weight on LRW/LRW-1000 dataset for evaluation. For smaller datasets, the pretrained weights can be provide a good start point for feature extraction, finetuning, and so on.

Link of pretrained weights: Baidu Yun (code: ivgl)

If you can not access to provided links, please email [email protected] or [email protected].

How to test

To test our provided weights, you should download weights and place them in the root of this repository.

For example, to test baseline on LRW Dataset:

python main_visual.py \
    --gpus='0'  \
    --lr=0.0 \
    --batch_size=128 \
    --num_workers=8 \
    --max_epoch=120 \
    --test=True \
    --save_prefix='checkpoints/lrw-baseline/' \
    --n_class=500 \
    --dataset='lrw' \
    --border=False \
    --mixup=False \
    --label_smooth=False \
    --se=False \
    --weights='checkpoints/lrw-cosine-lr-acc-0.85080.pt'

To test our model in LRW-1000 Dataset:

python main_visual.py \
    --gpus='0'  \
    --lr=0.0 \
    --batch_size=128 \
    --num_workers=8 \
    --max_epoch=120 \
    --test=True \
    --save_prefix='checkpoints/lrw-1000-final/' \
    --n_class=1000 \
    --dataset='lrw1000' \
    --border=True \
    --mixup=False \
    --label_smooth=False \
    --se=True \
    --weights='checkpoints/lrw1000-border-se-mixup-label-smooth-cosine-lr-wd-1e-4-acc-0.56023.pt'

How to train

For example, to train lrw baseline:

python main_visual.py \
    --gpus='0,1,2,3'  \
    --lr=3e-4 \
    --batch_size=400 \
    --num_workers=8 \
    --max_epoch=120 \
    --test=False \
    --save_prefix='checkpoints/lrw-baseline/' \
    --n_class=500 \
    --dataset='lrw' \
    --border=False \
    --mixup=False \
    --label_smooth=False \
    --se=False

Optional arguments:

gpus: the GPU id used for training
lr: learning rate. By default, we automatically applied the Linear Scale Rule in code (e.g., lr=3e-4 for 4 GPUs x 32 video/gpu and lr=1.2e-3 for 8 GPUs x 128 video/gpu). We recommend lr=3e-4 for 32 video/gpu when training from scratch. You need to modify the learning rate based on your setting.
batch_size: batch size
num_workers: the number of processes used for data loading
max_epoch: the maximum epochs in training
test: The test mode. When using this mode, the program will only test once and exit.
weights(optional): The path of pre-trained weight. If this option is specified, the model will load the pre-trained weights by the given location.
save_prefix: the save prefix of model parameters
n_class: the number of total word classes
dataset: the dataset used for training and testing, only lrw and lrw1000 are supported.
border: use word boundary indicated variable for training and testing
mixup: use mixup in training
label_smooth: use label_smooth in training
se: use se module in ResNet-18

More training details and setting can be found in our paper. We plan to include more pretrained models in the future.

Dependencies

PyTorch 1.6
opencv-python
TurboJPEG and PyTurboJPEG

Citation

If you find this code useful in your research, please consider to cite the following papers:

@inproceedings{feng2021efficient,
  title={An Efficient Software for Building LIP Reading Models Without Pains},
  author={Feng, Dalu and Yang, Shuang and Shan, Shiguang},
  booktitle={2021 IEEE International Conference on Multimedia \& Expo Workshops (ICMEW)},
  pages={1--2},
  year={2021},
  organization={IEEE}
}
@article{feng2020learn,
  author       = "Feng, Dalu and Yang, Shuang and Shan, Shiguang and Chen, Xilin",
  title        = "Learn an Effective Lip Reading Model without Pains",
  journal      = "arXiv preprint arXiv:2011.07557",
  year         = "2020",
}

License

The MIT License

learn-an-effective-lip-reading-model-without-pains's People

Contributors

Stargazers

Watchers

learn-an-effective-lip-reading-model-without-pains's Issues

LRW-1000 Dataset cannot be valid anymore?

Hi,
The LRW-1000 seems invalid.

Could you keep it open please?

将LRW-1000中的音频文件转化为预处理所需的npy格式

你好，请问将LRW-1000上的音频文件转化为预处理所需的npy格式，有什么需要注意的吗，还是仅仅读取然后保存就可以了吗？
另外，代码中的 border 是做什么的呢？
十分感谢！

Audio data format in LRW1000？

Hi, I would like to ask the data format of the audio in the LRW1000 dataset is .wav, while the data format of prepare_lrw1000 is .npy, How to convert？

RuntimeError: [enforce fail at inline_container.cc:144] . PytorchStreamReader failed reading zip archive: failed finding central directory

Traceback (most recent call last): File "main_visual.py", line 100, in <module> weight = torch.load(args.weights, map_location=torch.device('cpu')) File "/home/xjw/.conda/envs/open-mmlab/lib/python3.7/site-packages/torch/serialization.py", line 577, in load with _open_zipfile_reader(opened_file) as opened_zipfile: File "/home/xjw/.conda/envs/open-mmlab/lib/python3.7/site-packages/torch/serialization.py", line 241, in __init__ super(_open_zipfile_reader, self).__init__(torch._C.PyTorchFileReader(name_or_buffer)) RuntimeError: [enforce fail at inline_container.cc:144] . PytorchStreamReader failed reading zip archive: failed finding central directory

I try to test lrw1000 model which downloaded from your google drive, but when i get ready eveything, some problems have raisen in test code, i guess the model couldn't load by using torch.load, so could you give some advise for the problem, thank you!

if (args.weights is not None): print('load weights') weight = torch.load(args.weights, map_location=torch.device('cpu')) load_mi

word boundary?

Hi, when test samples of real videos(not in dataset), how to get "word boundary"?

run script/prepare_lrw1000.py ,no file generated in the folder LRW1000_Public_pkl_jpeg

After run script/prepare_lrw1000.py, generated LRW1000_Public_pkl_jpeg folder , and trn ,tst ,val folder in it.

but nothing in these folders.

demo code for video?

Dear author dalu:
Thanks for sharing this simple but effective work. I just wonder will you share the demo code for inference a single video and visualize the result on screen. it would be greate helpful for the community. Thank you again.

incorrect results: am i doing something wrong?

Hi,
I tried to test it on some videos. i crop the lip region using dlib. I get results but the words do not match. What could possibly be wrong? Do you need videos at a specific fps? Please help. Thanks for the code!

RuntimeError: cuDNN error: CUDNN_STATUS_BAD_PARAM

Need help

hi, in prepare_lrw1000.py
audio_file = 'LRW1000_Public/audio/' + audio_file + '.npy'
if(os.path.exists(audio_file)):
should I transform wav files into npy before run prepare_lrw1000?

transformer result

The result of the transformer in your arxiv paper Table2 is 44.5%，which paper you cited?

Training on LRW, accuracy not improving

Hi there,

Thanks for your great work! I have been trying to reproduce the results, however, the training loss didn't decrease and the accuracy was always 0. I followed the instructions and didn't change the code except calculating the ETA in seconds. Do you have any idea what is happening?

Here is parts of the training log.

Start Training, Data Length: 488766
epoch=0,train_iter=0,eta=36370.56232s,CE V=6.21437,lr=0.004800,best_acc=0.000000
Start Testing, Data Length: 25000
start testing

v_acc=0.00000,eta=34.82164
epoch=0,train_iter=1,eta=1969.14507s,CE V=6.45720,lr=0.004800,best_acc=0.000000
Start Testing, Data Length: 25000
start testing

v_acc=0.00000,eta=36.19475

epoch=1,train_iter=3819,eta=2056.03918s,CE V=7.47496,lr=0.004799,best_acc=0.000000
Start Testing, Data Length: 25000
start testing

v_acc=0.00000,eta=37.12496

epoch=2,train_iter=7638,eta=1983.88589s,CE V=7.74652,lr=0.004797,best_acc=0.000000
Start Testing, Data Length: 25000
start testing

v_acc=0.00000,eta=34.23733

epoch=3,train_iter=11457,eta=1980.25018s,CE V=7.42752,lr=0.004793,best_acc=0.000000
Start Testing, Data Length: 25000
start testing

v_acc=0.00000,eta=34.97991

epoch=4,train_iter=15276,eta=2137.25215s,CE V=7.26190,lr=0.004787,best_acc=0.000000
Start Testing, Data Length: 25000
start testing

v_acc=0.00000,eta=37.26590

epoch=5,train_iter=19095,eta=2146.40197s,CE V=7.49472,lr=0.004779,best_acc=0.000000
Start Testing, Data Length: 25000
start testing

v_acc=0.00000,eta=33.89755

epoch=6,train_iter=22914,eta=2150.36547s,CE V=8.05367,lr=0.004770,best_acc=0.000000
Start Testing, Data Length: 25000
start testing

v_acc=0.00000,eta=37.19277

epoch=7,train_iter=26733,eta=2126.86129s,CE V=7.35550,lr=0.004760,best_acc=0.000000
Start Testing, Data Length: 25000
start testing

v_acc=0.00000,eta=34.21873

epoch=8,train_iter=30552,eta=2104.79027s,CE V=7.92131,lr=0.004748,best_acc=0.000000
Start Testing, Data Length: 25000
start testing

v_acc=0.00000,eta=36.37280

epoch=9,train_iter=34371,eta=2036.25903s,CE V=7.50125,lr=0.004734,best_acc=0.000000
Start Testing, Data Length: 25000
start testing

v_acc=0.00000,eta=35.59670

epoch=10,train_iter=38190,eta=2054.36018s,CE V=7.56499,lr=0.004718,best_acc=0.000000
Start Testing, Data Length: 25000
start testing

v_acc=0.00205,eta=37.45618

epoch=12,train_iter=45828,eta=2067.50536s,CE V=7.65708,lr=0.004683,best_acc=0.000000
Start Testing, Data Length: 25000
start testing

v_acc=0.00000,eta=37.46259

epoch=13,train_iter=49647,eta=2110.89804s,CE V=7.37481,lr=0.004662,best_acc=0.000000
Start Testing, Data Length: 25000
start testing

v_acc=0.00000,eta=37.27319

epoch=14,train_iter=53466,eta=2129.09115s,CE V=8.00831,lr=0.004641,best_acc=0.000000
Start Testing, Data Length: 25000
start testing

v_acc=0.00205,eta=37.12486

epoch=15,train_iter=57285,eta=2052.47540s,CE V=7.74086,lr=0.004618,best_acc=0.000000
Start Testing, Data Length: 25000
start testing

v_acc=0.00000,eta=37.35628

epoch=16,train_iter=61104,eta=2103.29428s,CE V=7.08775,lr=0.004593,best_acc=0.000000
Start Testing, Data Length: 25000
start testing

v_acc=0.00000,eta=38.62439

epoch=17,train_iter=64923,eta=2127.42126s,CE V=7.76207,lr=0.004566,best_acc=0.000000
Start Testing, Data Length: 25000
start testing

v_acc=0.00000,eta=39.42123

epoch=18,train_iter=68742,eta=2032.93654s,CE V=7.45536,lr=0.004539,best_acc=0.000000
Start Testing, Data Length: 25000
start testing

v_acc=0.00000,eta=35.68857

epoch=19,train_iter=72561,eta=2047.84631s,CE V=7.57583,lr=0.004509,best_acc=0.000000
Start Testing, Data Length: 25000
start testing

v_acc=0.00000,eta=37.30936

epoch=20,train_iter=76380,eta=2126.75749s,CE V=7.46031,lr=0.004479,best_acc=0.000000
Terminated

What is the metric?

What is the metric of your results (e.g. WER, or accuracy)? It seems not stated in the README or paper.

Preview message

After testing for one video it is only showing the accuracy how to see the predicted word

RuntimeWarning: Mean of empty slice.

load weights
loaded params/tot params:183/183
miss matched params: []
Start Testing, Data Length: 0
start testing

Hi, I already run the scripts/prepare_lrw1000.py and it generate LRW1000_Public_pkl_jpeg folder which has pkl files in trn,val and test. But still report this error

What is the definition of accuracy?

I've read the whole paper, and it fully introduces how the model is built. But there is one thing I don't figure out. What is the definition of accuracy? Is it depend on every single word that the model predicts or the whole sentence?

CUDNN_STATUS_BAD_PARAM error

HI there,
I can only run the test function on cpu device. Have some one faced this error. even when I train the model, it works fine in train mode but when it calls the test function on validation set, it return 'CUDNN_STATUS_BAD_PARAM ' error. I searched and one solution was that to .float() to all your tensors. I did try this solution, but I'm still seeing the same error.
Do any one have any idea?

How to output the predicted word or sentence

Hello，dalu. II want to know how to output the predicted word or sentence. For example, whether it is right or wrong, it will output the predicted result 'about'. And how can I build my own test file with my own video？

where are the codes

Hi, dalu,
"A special setting on LRW-1000 is that we chose 40 frames for each word and put the target word at the center to make it similar to the data in LRW"
Where is the code in this repo?
This line tensor[:t,...] = files.copy() does not mean the center.

How much training time?

Thanks for your great work.
Can you tell me how much time you cost when training model in LRW data? And do you meet with the problem that it cost too much time in loading the data?
Thanks in advance.

questions about testing on my own data

Hi, I am very interested in the work of your team, and I have a few questions

On LRW1000, you need to choose 40 frames for a word, so do I need to adjust it according to the speech rate when testing on other videos?
The accuracy rate is very low when testing my own video, But the demo displayed on the team's homepage have good performance(such as https://vipl.ict.ac.cn/team.php?id=10) ，is the model here trained on LRW-1000? or is there a private data set?
Looking forward to your reply！

Where can I find the code that places the target word in the middle of the frames?

As introduced in the paper, I confirmed codes that the frame is fixed to 40 for each word in the scripts/prepare_lrw1000.py.

However, it was not confirmed that the target word was placed in the middle of the frames.

Do you know where I can find the codes for this?

load weights
loaded params/tot params:151/151
miss matched params: []
Start Testing, Data Length: 0
start testing
main_visual.py:155: RuntimeWarning: Mean of empty slice.
  acc = float(np.array(v_acc).reshape(-1).mean())
/home/ubuntu/anaconda3/envs/learn_lip/lib/python3.8/site-packages/numpy/core/_methods.py:188: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
acc=nan