
motion-x's People

Contributors

ailingzengzzz, jyuhao88, linghaochan, linjing7, shunlinlu


motion-x's Issues

Use slerp Interpolation for Alignment in face_motion_augmentation

In the face_motion_augmentation code, for samples where face_motion has fewer frames than motion, linear interpolation is currently used for alignment. Since face_motion uses the axis-angle (so3) representation, it is suggested to use slerp interpolation instead. The code could be modified as follows:

import torch
from pytorch3d.transforms import so3_exp_map, so3_log_map

def slerp(axisangle_left, axisangle_right, t):
    """Spherical linear interpolation."""
    # https://en.wikipedia.org/wiki/Slerp
    # t: (time - time_left) / (time_right - time_left), in (0, 1)
    assert (
        axisangle_left.shape == axisangle_right.shape
    ), "axisangle_left and axisangle_right must have the same shape"
    assert (
        axisangle_left.shape[-1] == 3
    ), "axisangle_left and axisangle_right must be axis-angle representations"
    assert (
        t.shape[:-1] == axisangle_left.shape[:-1]
    ), "t must have the same shape as axisangle_left and axisangle_right"

    main_shape = axisangle_left.shape[:-1]
    axisangle_left = axisangle_left.reshape(-1, 3)
    axisangle_right = axisangle_right.reshape(-1, 3)
    t = t.reshape(-1, 1)
    delta_rotation = so3_exp_map(
        so3_log_map(so3_exp_map(-axisangle_left) @ so3_exp_map(axisangle_right)) * t
    )

    return so3_log_map(so3_exp_map(axisangle_left) @ delta_rotation).reshape(*main_shape, 3)


def slerp_interpolate(motion, new_len):
    motion_len, n_joints, axisangle_dims = motion.shape

    new_t = torch.linspace(0, 1, new_len)
    timeline_idx = new_t * (motion_len - 1)
    timeline_idx_left = torch.floor(timeline_idx).long()
    timeline_idx_right = torch.clamp(timeline_idx_left + 1, max=motion_len - 1)

    motion_left = torch.gather(
        motion, 0, timeline_idx_left[:, None, None].expand(-1, n_joints, axisangle_dims)
    )
    motion_right = torch.gather(
        motion,
        0,
        timeline_idx_right[:, None, None].expand(-1, n_joints, axisangle_dims),
    )
    delta_t = timeline_idx - timeline_idx_left.float()

    new_motion = slerp(
        motion_left,
        motion_right,
        delta_t[:, None, None].expand(-1, n_joints, -1),
    )
    return new_motion

if motion_length != face_motion_length:
    face_motion = torch.from_numpy(face_motion)
    n_frames, n_dims = face_motion.shape
    n_joints = n_dims // 3
    face_motion = face_motion.reshape(n_frames, n_joints, 3)
    face_motion = slerp_interpolate(face_motion, motion_length)
    face_motion = face_motion.reshape(motion_length, -1).numpy()
else:
    (
        motion[:, 66 + 90 : 66 + 93],
        motion[:, 159 : 159 + 50],
        motion[:, 209 : 209 + 100],
    ) = (face_motion[:, :3], face_motion[:, 3 : 3 + 50], face_motion[:, 53:153])
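
A quick usage check for the functions above, with random small axis-angle rotations (shapes are illustrative only; requires pytorch3d):

dummy_face_motion = torch.randn(100, 1, 3) * 0.1   # 100 frames, 1 "joint" (e.g. the jaw), axis-angle
resampled = slerp_interpolate(dummy_face_motion, 160)
print(resampled.shape)                              # torch.Size([160, 1, 3])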

Mismatch of motion and text for humanml dataset

Hi Author,
After running humanml.py, the motion data .npy files have no corresponding txt files; in total, there are 26292 mismatched files. Do you have statistics on how many files are in each subset? Thanks

Feature Extractors for Evaluation

Hello, great work! And, thank you for sharing your code.

I'd like to train several motion generation models on Motion-X and evaluate them. Your paper says

We pretrain a motion feature extractor and a text feature extractor for the new motion representation with contrastive loss to map the text and motion into feature space and then evaluate the distance between the text-motion pairs.

I'd like to use your pretrained feature extractors (motion and text) and evaluation code. Do you have a plan to make them publicly available?
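
For reference, a minimal sketch of the contrastive (InfoNCE-style) objective the quoted passage describes, assuming batched, paired motion/text embeddings from two hypothetical encoders; this illustrates the general technique, not the authors' released evaluation code.

import torch
import torch.nn.functional as F

def contrastive_loss(motion_emb, text_emb, temperature=0.07):
    """motion_emb, text_emb: (B, D) embeddings of paired motion/text samples."""
    motion_emb = F.normalize(motion_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = motion_emb @ text_emb.t() / temperature       # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy: match each motion to its text and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Usage with random placeholder embeddings:
loss = contrastive_loss(torch.randn(32, 512), torch.randn(32, 512))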

grab.py

I used your code to process the GRAB dataset and got results like this, where the body appears to float, so I thought something was wrong. I then modified pose_trans in grab.py to change the coordinate transformation from (x, y, z) to (x, z, -y), after which the result looked right. I would like to confirm this.
Also, thank you very much for posting the SMPL-X visualization sequences for all the datasets. In those, the GRAB dataset also has the problem of not being grounded; is that due to not converting pose_trans? Can this script be released?
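
For clarity, a minimal sketch of the axis remapping mentioned above, assuming pose_trans is an (n_frames, 3) array of global translations in (x, y, z) order; the exact change needed inside grab.py may differ.

import numpy as np

def remap_trans(pose_trans: np.ndarray) -> np.ndarray:
    # (x, y, z) -> (x, z, -y)
    x, y, z = pose_trans[:, 0], pose_trans[:, 1], pose_trans[:, 2]
    return np.stack([x, z, -y], axis=-1)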

AIST motions error

Hi! When I try to visualize AIST motions, they appear to have some problems. In my custom visualizer, the other motion datasets appear like this (in this example, from GRAB):

[screenshot: GRAB motion]

But the AIST motions appear like this:

[screenshot: AIST motion]

Can you please recheck the AIST motions?

Visualization as shown in the paper figures

Hi, thanks for the great work. I have a more basic question, and I would greatly appreciate it if you could answer it.
I just want to know how to render the results as the paper figures (e.g. Fig. 6(b)) show. Is there a rendering script you could provide, or is this rendered in software like Blender? Hope to get your help, thanks!

About the dataset

Thank you very much for your work, but I have some questions. How do I download the IDEA400 dataset? Does the IDEA400 dataset have corresponding picture or video data? And when will the video data of Motion-X be released?

Has the data been released?

Great job! This dataset is like ImageNet for 3D motion! It's definitely going to significantly boost performance on various motion-related tasks!

I noticed that you plan to release Motion-X by Sept. 2023. As of now, I cannot find anywhere to access the data. Could you please let me know whether the dataset has been released or not?

Why are face_expr params empty in face_motion_data?

Maybe I'm missing something here, but I wrote a script to output the face_expr params for all data in face_motion_data/GRAB, face_motion_data/EgoBody, and face_motion_data/humanml, and all of them are size [x, 0] arrays, meaning no data.

import os
import numpy as np
import torch

dir = 'MotionDiffuse/face_motion_data/smplx_322/GRAB'
for subdir, dirs, files in os.walk(dir):
    for file in files:
        if file.endswith('.npy'):
            motion = np.load(os.path.join(subdir, file))
            motion_data = torch.tensor(motion).float()
            motion_params = get_params(motion_data)  # splits the 322-dim vector into named parts (see the dict further below)
            print(motion_params['face_expr'].shape)

outputs:

(573, 0)
(250, 0)
(507, 0)
(216, 0)
(240, 0)
....
....

And more explicitly if I run:

dir = 'MotionDiffuse/face_motion_data/smplx_322/GRAB'
for subdir, dirs, files in os.walk(dir):
    for file in files:
        if file.endswith('.npy'):
            motion = np.load(os.path.join(subdir, file))
            motion_data = torch.tensor(motion).float()
            motion_params = get_params(motion_data)
            if motion_params['face_expr'].shape[1] != 0:
                print(os.path.join(subdir, file))

it outputs nothing i.e. shows all tensors are empty.

And for a sanity test you can see all the shapes for one example:

motion = np.load('MotionDiffuse/face_motion_data/smplx_322/EgoBody/recording_20210907_S02_S01_01/body_idx_1/003.npy')
motion = torch.tensor(motion).float()
motion_params = {
            'root_orient': motion[:, :3],  # controls the global root orientation
            'pose_body': motion[:, 3:3+63],  # controls the body
            'pose_hand': motion[:, 66:66+90],  # controls the finger articulation
            'pose_jaw': motion[:, 66+90:66+93],  # controls the jaw pose
            'face_expr': motion[:, 159:159+50],  # controls the face expression
            'face_shape': motion[:, 209:209+100],  # controls the face shape
            'trans': motion[:, 309:309+3],  # controls the global body position
            'betas': motion[:, 312:],  # controls the body shape. Body shape is static
        }
for key in motion_params.keys():
    print(key, motion_params[key].shape)

outputs:

root_orient torch.Size([124, 3])
pose_body torch.Size([124, 63])
pose_hand torch.Size([124, 87])
pose_jaw torch.Size([124, 0])
face_expr torch.Size([124, 0])
face_shape torch.Size([124, 0])
trans torch.Size([124, 0])
betas torch.Size([124, 0])

This shows that everything is actually empty except for root_orient, pose_body, and pose_hand.

Am I missing something?
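
One additional check that might help narrow this down (a small sketch reusing the path from the snippets above): print the raw column count of each file, since a file with fewer than 322 columns cannot contain face_expr or face_shape under the documented layout.

import os
import numpy as np

dir = 'MotionDiffuse/face_motion_data/smplx_322/GRAB'
for subdir, dirs, files in os.walk(dir):
    for file in files:
        if file.endswith('.npy'):
            n_dims = np.load(os.path.join(subdir, file)).shape[1]
            if n_dims != 322:
                print(os.path.join(subdir, file), n_dims)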

What would be orientation of the data and FPS

Awesome work! I wanted to know what the orientation of all the data is: for example, Z-up Y-forward (AMASS) or Y-up Z-forward (AIST++)? Also, are all the motions sampled at the same FPS?
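
For anyone converting between the two conventions, here is a minimal sketch (not from the repo) that rotates a Z-up, Y-forward sequence to Y-up, assuming axis-angle root orientations root_orient of shape (T, 3) and translations trans of shape (T, 3). Note that the resulting forward axis is -Z, so an extra 180° rotation about Y may be needed depending on the target convention.

import numpy as np
from scipy.spatial.transform import Rotation as R

def zup_to_yup(root_orient: np.ndarray, trans: np.ndarray):
    # Rotate the world by -90 degrees about X so that +Z (old up) maps to +Y.
    world_rot = R.from_euler('x', -90, degrees=True)
    new_orient = (world_rot * R.from_rotvec(root_orient)).as_rotvec()
    new_trans = world_rot.apply(trans)
    return new_orient, new_trans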

Any tips for generating sequence motion labels from videos?

"Meanwhile, we input the videos into Video-LLaMA [62] and filter the human action descriptions as supplemental texts."

Hi authors, I want to annotate sequence motion labels in my own video dataset. I tried Video-LLaMA, but the quality is bad; here is an example. Are these results similar to yours? Any tips for improving the quality of the labels? And how do you automatically filter the human action descriptions?

[screenshots of Video-LLaMA outputs]

BMLhandball

Thanks for this huge project. I have one question:
BMLhandball doesn't have an SMPL-X G version, only SMPL+H G.

What is face shape?

Hi, I just want to know the meaning of face shape (100-dim) in the data preprocessing scripts. Thanks.

Error in t2m raw offsets?

Hi! I have done something similar to you for my motion processing, and I have also used 000021 as the example motion, similar to HumanML3D. The raw offset is the general direction of the joint from its parent. However, if you visualize the motion (as shown below), you can see that the relative offset of the first finger joint is at -1 y from the wrist, similar to lines 94-99 of t2m_raw_body_offsets, while finger joints 2 and 3 are on the x-axis. However, you have the raw offsets of all finger joints on the x-axis, denoting that the fingers point along the x-axis for 000021. Can you recheck this?

[render of the example motion 000021]

This is what I have:

p - pinky, r - ring, m - middle, i - index, t - thumb
left hand
[0, -1, 0], # lp1
[-1, 0, 0], # lp2
[-1, 0, 0], # lp3
[0, -1, 0], # lr1
[-1, 0, 0], # lr2
[-1, 0, 0], # lr3
[0, -1, 0], # lm1
[-1, 0, 0], # lm2
[-1, 0, 0], # lm3
[0, -1, 0], # li1
[-1, 0, 0], # li2
[-1, 0, 0], # li3
[0, -1, 0], # lt1
[0, -1, 0], # lt2
[0, -1, 0], # lt3
right hand
[0, -1, 0], # rp1
[1, 0, 0], # rp2
[1, 0, 0], # rp3
[0, -1, 0], # rr1
[1, 0, 0], # rr2
[1, 0, 0], # rr3
[0, -1, 0], # rm1
[1, 0, 0], # rm2
[1, 0, 0], # rm3
[0, -1, 0], # ri1
[1, 0, 0], # ri2
[1, 0, 0], # ri3
[0, -1, 0], # rt1
[0, -1, 0], # rt2
[0, -1, 0], # rt3

plot_3d_global.py fails to run

What steps produce the final data used by plot_3d_global.py?
Also, what is the example object, and where is it defined?
Thank you.
Traceback (most recent call last):
  File "/app/Motion-X-main/Motion-X-main/tomato_represenation/plot_3d_global.py", line 362, in <module>
    joints = np.load(example)
NameError: name 'example' is not defined

AttributeError: 'map' object has no attribute 'reshape'

When I run raw_pose_processing.py, it raises an error:
AttributeError: 'map' object has no attribute 'reshape'
It seems this code was written for Python 2, where map() returns a list rather than a lazy map object.
The error occurs at line 337 in the smplx2joints.py file; the offending line is as follows.
vertices = output.vertices.reshape(batch_size, num_frames, 10475, 3)
I would like to know the environment configuration required for the scripts in the 'tomato_representation' folder.
Please let me know.
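
For what it's worth, a minimal, hypothetical sketch of a Python 3 workaround, assuming output.vertices arrives as a map object over per-frame vertex tensors (the actual structure in smplx2joints.py may differ):

import torch

def materialize_vertices(vertices_obj, batch_size, num_frames):
    # Hypothetical fix: consume the lazy map object before calling reshape.
    if isinstance(vertices_obj, map):
        vertices_obj = torch.stack(list(vertices_obj))
    return vertices_obj.reshape(batch_size, num_frames, 10475, 3)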

about the text descriptions (text_v1.1) augmented by Vicuna 1.5

Thank you very much for uploading the text descriptions augmented by Vicuna 1.5. However, it seems that these descriptions are not significantly different from the previous ones, and the modification date of the files is also October 24, 2023. I wonder if you might have uploaded the wrong files?

About tomato representation

Hi, thanks for your great work!
I tried to run the code in the tomato representation, but it seems there are other tasks to be done before that? Also, I wonder what environment setup is needed to obtain the tomato representation and to use body-only motion.

Framerate different in SMPL+H G and SMPL-X G

I find that the framerates differ between SMPL+H G and SMPL-X G in the AMASS data. For example, 009655 in HumanML3D, whose raw name is CMU/62/62_11, has a 60 fps framerate in SMPL+H G and 120 fps in SMPL-X G, while both have the same total of 3703 frames.

A different framerate will affect the text-motion alignment, because the caption is taken from the respective start frame to the end frame. What do you think about this issue?

And another question about the following code:

if 'humanact12' not in source_path:
    if 'Eyes_Japan_Dataset' in source_path:
        pose = pose[int(3*ex_fps):]
    if 'MPI_HDM05' in source_path:
        pose = pose[int(3*ex_fps):]
    if 'TotalCapture' in source_path:
        pose = pose[int(1*ex_fps):]
    if 'MPI_Limits' in source_path:
        pose = pose[int(1*ex_fps):]
    if 'Transitions_mocap' in source_path:
        pose = pose[int(0.5*ex_fps):]
    pose = pose[int(start_frame*1.5):int(end_frame*1.5)]

Is the [int(3*ex_fps):] slicing offset for the different datasets an empirical value of yours, or is there an official recommendation? The same question applies to pose = pose[int(start_frame*1.5):int(end_frame*1.5)].

Looking for your reply. Thanks very much!
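
For what it's worth, a minimal sketch of how such an index rescale can be expressed (an assumption on my side: the 1.5 factor looks like a framerate ratio such as 30/20, but the repo may intend something else):

def rescale_index(frame_idx: int, caption_fps: float, data_fps: float) -> int:
    """Map a frame index annotated at caption_fps onto a sequence stored at data_fps."""
    return int(round(frame_idx * data_fps / caption_fps))

# e.g. a caption spanning frames 40-120 at 20 fps maps to frames 60-180 at 30 fps
start, end = rescale_index(40, 20, 30), rescale_index(120, 20, 30)
print(start, end)  # 60 180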

A Question on transfer_to_body_only_humanml.py

Thank you for your priceless work. It really helps a lot.
It seems that a Motion-X data sequence is represented as an (nframes, 322) vector, but transfer_to_body_only_humanml.py contains
data_263 = np.concatenate((data[:, :4+(body_joints - 1)*3], data[:, 4+(joints - 1)*3:4+(joints - 1)*3+(body_joints - 1)*6], data[:, 4 + (joints - 1)*9: 4 + (joints - 1)*9 + body_joints*3], data[:, -4:]), axis=1)
with the variable joints=52, and the index 4+(joints-1)*9 is obviously bigger than 322, which causes an AssertionError in
assert data_263.shape[1] == 263
I failed to figure out this problem and am now dying for your assistance. OTZ
By the way, let me express my sincere gratitude again.
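
For reference, a quick sketch of the index arithmetic in that line, under the assumption (not confirmed by the text quoted here) that data is expected to be the 623-dim whole-body tomato feature (the HumanML3D-style layout extended to 52 joints) rather than the raw (nframes, 322) SMPL-X vector:

joints, body_joints = 52, 22

# Whole-body tomato feature width (HumanML3D-style layout):
# root (4) + ric (joints-1)*3 + rot (joints-1)*6 + local velocity joints*3 + foot contact (4)
dim_tomato = 4 + (joints - 1) * 3 + (joints - 1) * 6 + joints * 3 + 4
print(dim_tomato)                               # 623

# Widths of the four slices concatenated into data_263:
widths = [4 + (body_joints - 1) * 3,            # 67
          (body_joints - 1) * 6,                # 126
          body_joints * 3,                      # 66
          4]                                    # 4
print(sum(widths))                              # 263

# Highest index touched by the slicing:
print(4 + (joints - 1) * 9 + body_joints * 3)   # 529, which fits in 623 but exceeds 322

If that assumption holds, the 263-dim extraction only makes sense after the raw 322-dim vectors have been converted to the tomato representation.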

Question about 15.6M frame-level whole-body pose description

Hi authors, I am very amazed by your work.
I notice you use face recognition, posescript, and handscript to generate 15.6M frame-level descriptions, and I have some questions about it.

  1. How do you generate a frame-level description for an RGB video in which only part of the body is visible, for example: (a) only the upper body can be seen, (b) only the face and shoulders can be seen, (c) part of the body is self-occluded or is occluded by loose clothing or other obstacles?
  2. How do you generate a frame-level description for an RGB video with multiple persons? Or do you just delete all videos with multiple persons?
  3. Are these descriptions used in one of your experiments to validate that they are correct, or for a new application? Tab. 4 (text-driven motion generation) seems not to support frame-level descriptions, since those models always require at least 24 frames as input. Do you aggregate all frame-level descriptions into one video-level description (if yes, how do you aggregate them)? Is Tab. 6 the only experiment related to frame-level descriptions (and how do you compute the FID in Tab. 6)?

The whole-body pose description

Thanks for sharing your excellent work. I would like to follow your work, but the sequence-level semantic labels are unsatisfactory in some subsets of Motion-X. I'm wondering if you could release the whole-body pose descriptions.

Question about the data pipeline loss (Eq. 2).

Hi, thanks for the great work! I'm a bit confused about the data pipeline, e.g. Eq. (2). Based on my understanding, the initial human parameters are predicted directly by OSX. Afterward, when the model is updated, it predicts new human body parameters. I want to know whether the parameter loss is calculated between the initial parameters and each round of updated parameters. If so, how many epochs of training were conducted in the data fitting pipeline, and after how many epochs do you update the OSX model parameters? Hope to get your help, thanks!

Magic folder issues

In mocap-dataset-process/README.md

3. Perform face motion augmentation
In this step, we will perform face motion augmentation to replace the face motion, since these mocap datasets do not provide facial expressions. Notably, we keep the original jaw pose of the GRAB dataset.

Move the processed motion data to ../datasets/motion_data/smplx_322
mv EgoBody_motion ../datasets/motion_data/smplx_322/EgoBody
mv humanml ../datasets/motion_data/smplx_322/humanml
mv GRAB_motion ../datasets/motion_data/smplx_322/GRAB

The mv humanml step makes me confused.
What is this humanml folder, and how is it produced?

I checked back and forth in the README and other files, and didn't see how to prepare it.

And then mv GRAB_txt ../datasets/texts/semantic_labels/GRAB: I suppose GRAB_txt is the same thing as GRAB_text (specified in grab.py), right?

What's more interesting is the python aist.py step in the "Process EgoBody Dataset" instructions. Though you've fixed that mistake recently, it makes me wonder whether this README is a valid guide for running the final script.

Question regarding AMASS dataset

[screenshot of the download instructions]
Per the instructions, we need to download the SMPL-X G version of each sub-dataset, but on the AMASS website there is no SMPL-X G version of BMLhandball. Do we just skip that one, or are there other download sources?

Duplicated GRAB datasets in mocap-data-processing

Hi Author,
Thanks for your excellent work.
I found duplicated data in mocap-data:
the GRAB dataset already exists in the AMASS data, and the mocap-data-processing also requires downloading GRAB separately.
Are these two GRAB datasets the same? Thanks

about release time

Thank you for your excellent work! I have filled out the form, but I haven't received any feedback email yet. May I ask approximately when the download link will be released?

Dataset required to train the model

[screenshot of the dataset list] Are the datasets in the image all the ones required to train the model? Also, do we need a specific environment setup before training the model?
