gesturegeneration / speech_driven_gesture_generation_with_autoencoder

This is the official implementation for IVA '19 paper "Analyzing Input and Output Representations for Speech-Driven Gesture Generation".

Home Page: https://svito-zar.github.io/audio2gestures/

License: Apache License 2.0


speech_driven_gesture_generation_with_autoencoder's Introduction

Aud2Repr2Pose: Analyzing input and output representations for speech-driven gesture generation

Taras Kucherenko, Dai Hasegawa, Gustav Eje Henter, Naoshi Kaneko, Hedvig Kjellström

ImageOfIdea

This repository contains a Keras and TensorFlow based implementation of speech-driven gesture generation with a neural network, which was published at the International Conference on Intelligent Virtual Agents (IVA '19); an extension was published in the International Journal of Human-Computer Interaction in 2021.

The project website contains all the information about this project, including a video explanation of the method and the paper.

Demo on another dataset

This model has been applied to an English-language dataset.

The demo video, as well as the code to run the pre-trained model, is available online.

Requirements

  • Python 3

Initial setup

install packages

# if you have GPU
pip install tensorflow-gpu==1.15.2

# if you don't have GPU
pip install tensorflow==1.15.2

pip install -r requirements.txt

install ffmpeg

# macos
brew install ffmpeg
# ubuntu
sudo add-apt-repository ppa:jonathonf/ffmpeg-4
sudo apt-get update
sudo apt-get install ffmpeg
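
To quickly verify the setup, here is a minimal Python check (not part of the original instructions) that prints the TensorFlow version and confirms that ffmpeg is on the PATH:

# check_setup.py - minimal environment sanity check (illustrative only)
import subprocess

import tensorflow as tf

print("TensorFlow version:", tf.__version__)         # expected: 1.15.2
print("GPU available:", tf.test.is_gpu_available())  # False is fine for the CPU-only install

# ffmpeg must be reachable on the PATH for the data processing steps
subprocess.run(["ffmpeg", "-version"], check=True)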

 


 

How to use this repository?

0. Notation

All parameters that need to be specified by the user are written in CAPSLOCK.

1. Download raw data

  • Clone this repository
  • Download a dataset from https://www.dropbox.com/sh/j419kp4m8hkt9nd/AAC_pIcS1b_WFBqUp5ofBG1Ia?dl=0
  • Create a directory named dataset and put two directories motion/ and speech/ under dataset/

2. Split dataset

  • Put the folder with the dataset in the data_processing directory of this repo: next to the script prepare_data.py
  • Run the following command
python data_processing/prepare_data.py DATA_DIR
# DATA_DIR = directory to save data such as 'data/'

Note: DATA_DIR is not the directory where the raw data is stored (the folder with the data, "dataset", has to be in the root folder of this repo); it is the directory where the post-processed data will be saved. After this step you no longer need "dataset" in the root folder. Use the same DATA_DIR in all the following scripts.

After this command:

  • train/, test/ and dev/ are created under DATA_DIR/
    • in inputs/ inside each directory, audio(id).wav files are stored
    • in labels/ inside each directory, gesture(id).bvh files are stored
  • under DATA_DIR/, three csv files (gg-train.csv, gg-test.csv, gg-dev.csv) are created; these files contain the paths to the actual data
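
To verify the split, you can peek at the generated CSV files. The snippet below is a minimal sketch that only prints the first few rows, since the exact column layout is whatever prepare_data.py writes (DATA_DIR is assumed to be 'data/'):

# inspect_split.py - quick look at the generated file lists (illustrative only)
import pandas as pd

for split in ("train", "dev", "test"):
    df = pd.read_csv("data/gg-{}.csv".format(split))  # DATA_DIR assumed to be 'data/'
    print(split, "-", len(df), "rows")
    print(df.head())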

3. Convert the dataset into vectors

python data_processing/create_vector.py DATA_DIR N_CONTEXT
# N_CONTEXT = length of the context window; in our experiments it was set to '60'
# (this means 30 steps backwards and 30 steps forwards)

Note: if you change the N_CONTEXT value, you also need to update it in the train.py script.

(You are likely to get a warning like this "WARNING:root:frame length (5513) is greater than FFT size (512), frame will be truncated. Increase NFFT to avoid." )

As a result of running this script:

  • numpy binary files X_train.npy and Y_train.npy (the vectorized dataset) are created under DATA_DIR
  • under DATA_DIR/test_inputs/, test audio files such as X_test_audio1168.npy are created
  • when N_CONTEXT = 60, the audio vector's shape is (num of timesteps, 61, 26)
  • the gesture vector's shape is (num of timesteps, 384), where 384 = 64 joints × (x,y,z positions + x,y,z velocities)
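
As a sanity check you can load the resulting arrays and confirm the shapes listed above. A minimal sketch, assuming DATA_DIR = 'data/' and N_CONTEXT = 60:

# check_vectors.py - verify the shapes produced by create_vector.py (illustrative only)
import numpy as np

X_train = np.load("data/X_train.npy")  # audio features with context
Y_train = np.load("data/Y_train.npy")  # gesture targets

print(X_train.shape)  # expected: (num of timesteps, 61, 26) for N_CONTEXT = 60
print(Y_train.shape)  # expected: (num of timesteps, 384) = 64 joints x 6 values

assert X_train.shape[0] == Y_train.shape[0], "audio and gesture frames must be aligned"
assert X_train.shape[1] == 60 + 1  # 30 steps backwards + current frame + 30 steps forwards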

If you don't want to customize anything, you can skip reading about steps 4-7 and just use the already prepared scripts in the example_scripts folder.

 

4. Learn motion representation by AutoEncoder

Create a directory to save training checkpoints, such as chkpt/, and use it as the CHKPT_DIR parameter.

Learn dataset encoding

python motion_repr_learning/ae/learn_dataset_encoding.py DATA_DIR -chkpt_dir=CHKPT_DIR -layer1_width=DIM

The optimal dimensionality (DIM) in our experiment was 325.

Encode dataset

Create the DATA_DIR/DIM directory

python motion_repr_learning/ae/encode_dataset.py DATA_DIR -chkpt_dir=CHKPT_DIR -restore=True -pretrain=False -layer1_width=DIM

More information can be found in the folder motion_repr_learning
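
After encoding, the gesture targets should have DIM dimensions (here 325) instead of 384. A minimal check, assuming the output follows the naming seen later on this page (DATA_DIR/DIM/Y_test_encoded.npy); adjust the path to whatever encode_dataset.py actually writes:

# check_encoding.py - confirm the encoded gesture dimensionality (illustrative only)
import numpy as np

DIM = 325
# file name assumed by analogy with DATA_DIR/325/Y_test_encoded.npy mentioned in the issues below
Y_test_encoded = np.load("data/{}/Y_test_encoded.npy".format(DIM))

print(Y_test_encoded.shape)  # expected: (num of timesteps, 325)
assert Y_test_encoded.shape[1] == DIM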

5. Learn speech-driven gesture generation model

python train.py MODEL_NAME EPOCHS DATA_DIR N_INPUT ENCODE DIM
# MODEL_NAME = name of the hdf5 file for the model, such as 'model_500ep_posvel_60.hdf5'
# EPOCHS = how many epochs we want to train the model for (recommended: 100)
# DATA_DIR = directory with the data (should be the same as above)
# N_INPUT = how many dimensions the speech data has (default: 26)
# ENCODE = whether we train on the encoded gestures (using the proposed model) or on the gestures as they are (using the baseline model)
# DIM = how many dimensions the encoding has (ignored if you don't encode)
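
For reference, a concrete invocation of the proposed (encoded) setup might look as follows. The argument values are only examples, and the exact format expected for the ENCODE flag should be checked against train.py or the prepared scripts in example_scripts:

# run_training.py - illustrative wrapper around train.py; all values are examples only
import subprocess

subprocess.run(
    [
        "python", "train.py",
        "model_500ep_posvel_60.hdf5",  # MODEL_NAME
        "100",                         # EPOCHS (recommended value)
        "data/",                       # DATA_DIR (same as in the previous steps)
        "26",                          # N_INPUT (speech feature dimensionality)
        "True",                        # ENCODE - assumed boolean format, check train.py
        "325",                         # DIM (encoding dimensionality)
    ],
    check=True,
)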

6. Predict gesture

python predict.py MODEL_NAME INPUT_SPEECH_FILE OUTPUT_GESTURE_FILE
# Usage example
python predict.py model.hdf5 data/test_inputs/X_test_audio1168.npy data/test_inputs/predict_1168_20fps.txt
# You need to decode the gestures
python motion_repr_learning/ae/decode.py DATA_DIR ENCODED_PREDICTION_FILE DECODED_GESTURE_FILE -restore=True -pretrain=False -layer1_width=DIM -chkpt_dir=CHKPT_DIR -batch_size=8 

Note: This can be used in a for loop over all the test sequences. Examples are provided in the example_scripts folder of this repository; a minimal Python sketch is also given below.

# The network produces both coordinates and velocity
# So we need to remove velocities
python helpers/remove_velocity.py -g PATH_TO_GESTURES
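
A minimal Python sketch of such a loop over all the test sequences (the canonical versions are the shell scripts in example_scripts; the file naming below is assumed from the examples above):

# predict_all.py - illustrative loop over all test audio files (see example_scripts for the real scripts)
import glob
import os
import subprocess

DATA_DIR = "data/"   # same DATA_DIR as in the previous steps
CHKPT_DIR = "chkpt/"
DIM = "325"
MODEL = "model.hdf5"

for audio_file in sorted(glob.glob(os.path.join(DATA_DIR, "test_inputs", "X_test_audio*.npy"))):
    seq_id = os.path.basename(audio_file).replace("X_test_audio", "").replace(".npy", "")
    encoded = os.path.join(DATA_DIR, "test_inputs", "predict_{}_encoded.txt".format(seq_id))
    decoded = os.path.join(DATA_DIR, "test_inputs", "predict_{}_20fps.txt".format(seq_id))

    # predict (encoded) gestures from the speech input
    subprocess.run(["python", "predict.py", MODEL, audio_file, encoded], check=True)

    # decode them back to the original motion representation
    subprocess.run(["python", "motion_repr_learning/ae/decode.py", DATA_DIR, encoded, decoded,
                    "-restore=True", "-pretrain=False", "-layer1_width={}".format(DIM),
                    "-chkpt_dir={}".format(CHKPT_DIR), "-batch_size=8"], check=True)

# remove the velocities from all predicted gesture files
# (PATH_TO_GESTURES is assumed here to be the directory containing them; see helpers/remove_velocity.py)
subprocess.run(["python", "helpers/remove_velocity.py", "-g",
                os.path.join(DATA_DIR, "test_inputs")], check=True)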

7. Quantitative evaluation

Use the scripts in the evaluation folder of this repository.

Examples are provided in the example_scripts folder of this repository

8. Qualitative evaluation

Use the animation server.

 

Citation

If you use this code in your research, please cite the paper:

@article{kucherenko2021moving,
  title={Moving fast and slow: Analysis of representations and post-processing in speech-driven automatic gesture generation},
  author={Kucherenko, Taras and Hasegawa, Dai and Kaneko, Naoshi and Henter, Gustav Eje and Kjellstr{\"o}m, Hedvig},
  journal={International Journal of Human–Computer Interaction},
  doi={10.1080/10447318.2021.1883883},
  year={2021}
}

Contact

If you encounter any problems, bugs, or issues, please contact me on GitHub or by emailing me at [email protected]. I prefer questions and bug reports on GitHub, as that provides visibility for others who might be encountering the same issues or have the same questions.

speech_driven_gesture_generation_with_autoencoder's People

Contributors: aoikaneko, daihasegawa, svito-zar


speech_driven_gesture_generation_with_autoencoder's Issues

Motion is longer than audio on gesture viewer.

Thank you for your patience with my previous question; I hope you can help me again!

After predicting from the encoded file DATA_DIR/325/Y_test_encoded.npy and removing the velocities, I loaded the corresponding audio and motion files in the gesture viewer. It works, but the motion appears to be longer than the audio. Do you know why this happens? Or should I check whether I got something wrong in the file names?

Also, I noticed that during decoding with decode.py there is the following code:

print(encoding.shape)

# Decode it
decoding = tr.decode(nn, encoding)

print(decoding.shape)

Since the shapes are printed, I think the first dimension of the encoding and the decoding should be the same, since it represents the time steps, but they were different. Is this the reason why the audio and the motion do not have the same length?

About the computer configuration

Hello, I would like to ask you about the computer configuration used to train your model and the approximate training time. :)

update tensorflow

Hi, I tried to run the code locally, but it doesn't work because some of the library versions are too old :( Could you update them?

The results are bad?

Thank you for your wonderful work.

I followed the steps one by one to reproduce the results.

...
205600/206899 [============================>.] - ETA: 2s - loss: 0.0013
206899/206899 [==============================] - 422s 2ms/step - loss: 0.0013 - val_loss: 8.2411e-04

model_500ep_posvel_60

But when I use the visualization tools you provide, the result is not ideal.

Can you help me see what the problem is? Thanks!

How do you divide each long sample into short samples?

I looked over your code. Thank you very much for sharing this great work.

I know that in GENEA there are 23 long samples (each about 10 minutes), with names like "Recording_001.***".

How do you divide the long samples into short samples? (There would be only 24 samples without dividing them, which is a very small sample size.)

Thanks in advance

How do you do the alignment between audio features and posture frames?

This is excellent work. As a student working on gesture generation, I really appreciate this code. Thank you.

However, I've got a question for you guys.

I noticed that in the file create_vector.py, specifically in the create_vectors function, there is a step 3 commented as "align vector length". It is written as:

# Step 3: Align vector length
input_vectors, output_vectors = shorten(input_vectors, output_vectors)

and,

def shorten(arr1, arr2):
    min_len = min(len(arr1), len(arr2))

    arr1 = arr1[:min_len]
    arr2 = arr2[:min_len]

    return arr1, arr2

In my view it is supposed to do some kind of alignment, because in your work all MFCC features were concatenated into one vector during training. Therefore it is crucial to know which part of the MFCC features corresponds to which posture frames.

The problem is that if the vectors are simply cut to the shorter of the MFCC and posture-frame lengths, isn't some information lost? For example, if the MFCC sequence is longer than the posture frames, the part of the MFCCs that exceeds the posture frames will be deleted.

Looking forward to your reply!

How to generate bvh from videos

This is excellent work. As a student working on gesture generation, I really appreciate this code. Thank you.

However, I've got a question for you guys.

I have a collection of video files; how can I generate bvh files from these mp4 files?

Looking forward to your reply!
