
viola's Introduction

VIOLA: Imitation Learning for Vision-Based Manipulation with Object Proposal Priors

Yifeng Zhu, Abhishek Joshi, Peter Stone, Yuke Zhu

Project | Paper | Simulation Datasets | Real-Robot Datasets | Real Robot Control

Introduction

We introduce VIOLA, an object-centric imitation learning approach to learning closed-loop visuomotor policies for robot manipulation. Our approach constructs object-centric representations based on general object proposals from a pre-trained vision model. It uses a transformer-based policy to reason over these representations and attend to the task-relevant visual factors for action prediction. Such object-based structural priors improve the robustness of deep imitation learning algorithms against object variations and environmental perturbations. We quantitatively evaluate VIOLA in simulation and on real robots. VIOLA outperforms state-of-the-art imitation learning methods by 45.8% in success rate. It has also been deployed successfully on a physical robot to solve challenging long-horizon tasks, such as dining table arrangement and coffee making. More videos and model details can be found in the supplementary materials and on the project website: https://ut-austin-rpl.github.io/VIOLA.
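
As a concrete, purely illustrative picture of the policy described above, the sketch below builds region features from object proposals, concatenates them with a global context feature and a learnable action token, passes the tokens through a transformer, and decodes an action from the action token's output slot. All module names, shapes, and the readout position are assumptions for illustration; the actual implementation lives under viola_bc/.

    # Minimal, hypothetical sketch of an object-centric transformer policy step.
    # Shapes, names, and the action readout are illustrative assumptions,
    # not the repository's exact implementation.
    import torch
    import torch.nn as nn

    class ObjectCentricPolicySketch(nn.Module):
        def __init__(self, feat_dim=64, num_heads=4, num_layers=2, action_dim=7):
            super().__init__()
            # learnable action token, analogous to a CLS token
            self.action_token = nn.Parameter(torch.zeros(1, 1, feat_dim))
            layer = nn.TransformerEncoderLayer(feat_dim, num_heads, batch_first=True)
            self.transformer = nn.TransformerEncoder(layer, num_layers)
            self.action_head = nn.Linear(feat_dim, action_dim)

        def forward(self, context_feat, region_feats, extra_obs_feats):
            # context_feat: (B, 1, D) global visual context
            # region_feats: (B, K, D) features pooled from K object proposals
            # extra_obs_feats: (B, M, D) proprioception / other observation tokens
            B = region_feats.shape[0]
            action_tok = self.action_token.expand(B, -1, -1)
            tokens = torch.cat([context_feat, region_feats, action_tok, extra_obs_feats], dim=1)
            out = self.transformer(tokens)
            # read the action from the slot of the action token
            action_idx = context_feat.shape[1] + region_feats.shape[1]
            return self.action_head(out[:, action_idx, :])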

Real Robot Usage

This codebase does not include the real robot experiment setup. If you are interested in using the real-robot control infrastructure we use, please check out Deoxys! It comes with detailed documentation for getting started.

Installation

Clone the repo with:

git clone --recurse-submodules git@github.com:UT-Austin-RPL/VIOLA.git

Then go into VIOLA/third_party and install each dependency according to its instructions: detectron2, Detic.

Then install all the other dependencies. The most important packages are torch, robosuite, and robomimic:

pip install -r requirements.txt
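
To sanity-check the installation, you can try importing the key packages; this snippet is only a suggestion and assumes each package exposes a __version__ attribute.

    # Hypothetical installation check: confirm the key dependencies import.
    import torch, robosuite, robomimic, detectron2

    for pkg in (torch, robosuite, robomimic, detectron2):
        print(pkg.__name__, getattr(pkg, "__version__", "unknown"))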

Usage

Demonstration collection and dataset creation

By default, we assume demonstrations are collected through SpaceMouse teleoperation:

python data_generation/collect_demo.py --controller OSC_POSITION --num-demonstration 100 --environment stack-two-types --pos-sensitivity 1.5 --rot-sensitivity 1.5

Then create a dataset from the collected hdf5 file:

python data_generation/create_dataset.py --use-actions --use-camera-obs --dataset-name training_set --demo-file PATH_TO_DEMONSTRATION_DATA/demo.hdf5 --domain-name stack-two-types
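
If you want to verify what the script produced, a short h5py walk prints every dataset in the file along with its shape; the path below is only an example, and the exact key layout depends on create_dataset.py.

    # Hypothetical inspection of the generated hdf5 dataset with h5py.
    import h5py

    with h5py.File("PATH_TO_DATASET/training_set.hdf5", "r") as f:  # example path
        def show(name, obj):
            if isinstance(obj, h5py.Dataset):
                print(name, obj.shape, obj.dtype)
        f.visititems(show)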

Augment datasets with color augmentations and object proposals

Add color augmentation to the original dataset:

python data_generation/aug_post_processing.py --dataset-folder DATASET_FOLDER_NAME
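
As a rough illustration of what a color augmentation pass can look like (not necessarily the transform used by aug_post_processing.py), one could jitter brightness, contrast, saturation, and hue with torchvision:

    # Illustrative color jitter; the script's actual augmentation may differ.
    import torch
    from torchvision import transforms

    color_jitter = transforms.ColorJitter(brightness=0.3, contrast=0.3,
                                          saturation=0.3, hue=0.05)
    image = torch.rand(3, 128, 128)   # placeholder RGB image in [0, 1]
    augmented = color_jitter(image)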

Then we generate general object proposals using Detic models:

python data_generation/process_data_w_proposals.py --nms 0.05
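
The --nms flag presumably sets an IoU threshold for non-maximum suppression over the proposals. The standalone snippet below shows how such a low threshold (0.05) suppresses boxes that overlap even slightly, using torchvision.ops.nms; the actual proposal pipeline lives in process_data_w_proposals.py.

    # Standalone illustration of NMS with iou_threshold=0.05.
    import torch
    from torchvision.ops import nms

    boxes = torch.tensor([[10., 10., 50., 50.],
                          [12., 12., 52., 52.],      # overlaps heavily with the first box
                          [80., 80., 120., 120.]])   # disjoint from the others
    scores = torch.tensor([0.9, 0.8, 0.7])
    keep = nms(boxes, scores, iou_threshold=0.05)
    print(keep)  # indices of surviving boxes, here tensor([0, 2])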

Training and evaluation

To train a policy model with our generated dataset, run

python viola_bc/exp.py experiment=stack_viola ++hdf5_cache_mode="low_dim"

For evaluation, run

python viola_bc/final_eval_script.py --state-dir checkpoints/stack --eval-horizon 1000 --hostname ./ --topk 20 --task-name normal

Datasets and trained checkpoints

We also make the datasets and trained checkpoints used in our paper publicly available. You can download them here:

Datasets: download the datasets, unzip the archive under the root folder of the repo, and rename the folder to datasets. Note that our simulation datasets were collected with robosuite v1.3.0, so the textures of the robots and floors in the datasets will not match robosuite v1.4.0.

Checkpoints: download the best-performing checkpoints, unzip the archive under the root folder of the repo, and rename the folder to results.


viola's Issues

questions about real dataset

Hello! Thanks for releasing this great work!
I am trying to reproduce this model in the real world, so I am using the VIOLA dataset. I was wondering about the scale and conventions of the dataset: are the actions absolute or delta? Is the translation measured in meters or centimeters, and is the rotation recorded in radians or degrees? I find the units confusing.
Thanks for answering my questions!

Questions about the action token and inference process

I have some questions:
For questions 1 and 2 (line 168 here):

    # reshape back to the original (pre-flattening) shape
    transformer_out = transformer_out.reshape(original_shape)
    # take the token at index 0 along the third dimension
    action_token_out = transformer_out[:, :, 0, :]
    if per_step:
        # keep only the last slice along the second dimension
        action_token_out = action_token_out[:, -1:, :]
  1. Is there any reason you only use index 0 of the transformer output?
  2. During inference, why do you take the -1: index? Why do you use a different setting for inference?
  3. In your paper, you mention that the action token is used as an input, but I cannot find the code where the action token is used as input. Can you point to where the corresponding code is?
  4. Can you explain why TensorUtils.time_distributed is used on this line?
  5. During inference, is there a reason why you do post-processing on the gripper history?

Thank you in advance!
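
For readers hitting the same questions: one common design, and a plausible but unconfirmed reading of the snippet above, is to prepend a learnable action token to the token sequence so that its output slot (index 0) aggregates the whole sequence, much like a ViT/BERT CLS token. The sketch below shows only that generic pattern; it is not the repository's code.

    # Generic CLS/action-token pattern; an assumption for illustration, not VIOLA's code.
    import torch
    import torch.nn as nn

    B, T, D = 2, 10, 64                                # batch, observation tokens, feature dim
    action_token = nn.Parameter(torch.zeros(1, 1, D))  # learnable token
    obs_tokens = torch.randn(B, T, D)

    tokens = torch.cat([action_token.expand(B, -1, -1), obs_tokens], dim=1)  # slot 0 = action token
    layer = nn.TransformerEncoderLayer(D, nhead=4, batch_first=True)
    out = nn.TransformerEncoder(layer, num_layers=1)(tokens)
    action_token_out = out[:, 0, :]                    # read the action from slot 0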

Question about the action Token and image augmentation

action_token_out = transformer_out[:, :, 0, :]

Hello, I don't understand why the first token of the output is taken directly as action_token_out. After your grouping, the grouped input should follow this order: spatial_context_feature + region_feature + action_token + other obs features. Would the token order change as they pass through the transformer_decoder?

In addition, regarding the image augmentation (padding + random_crop), how many crops do you take? Looking through the code, it seems only the default value num_crops=1 is used. Isn't global information lost if there is only one crop? From what I see in your code, the feature map is extracted from the cropped image.

Could you help me figure out why and how? Thanks a lot
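
For context on the padding + random-crop augmentation asked about above, a minimal torchvision version is sketched below (illustrative only; the repository may implement its randomizer differently). With num_crops=1, each image yields a single randomly shifted view per pass.

    # Illustrative pad-then-random-crop augmentation producing one crop (num_crops=1).
    import torch
    from torchvision import transforms

    pad_and_crop = transforms.Compose([
        transforms.Pad(padding=4),              # zero-pad the borders
        transforms.RandomCrop(size=(128, 128)), # crop back to the original size
    ])
    image = torch.rand(3, 128, 128)
    augmented = pad_and_crop(image)             # one randomly shifted view of the image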
