
viola's Introduction

VIOLA: Imitation Learning for Vision-Based Manipulation with Object Proposal Priors

Yifeng Zhu, Abhishek Joshi, Peter Stone, Yuke Zhu

Project | Paper | Simulation Datasets | Real-Robot Datasets | Real Robot Control

Introduction

We introduce VIOLA, an object-centric imitation learning approach to learning closed-loop visuomotor policies for robot manipulation. Our approach constructs object-centric representations based on general object proposals from a pre-trained vision model. It uses a transformer-based policy to reason over these representations and attend to the task-relevant visual factors for action prediction. Such object-based structural priors improve the robustness of deep imitation learning algorithms against object variations and environmental perturbations. We quantitatively evaluate VIOLA in simulation and on real robots. VIOLA outperforms state-of-the-art imitation learning methods by 45.8% in success rate. It has also been deployed successfully on a physical robot to solve challenging long-horizon tasks, such as dining table arrangement and coffee making. More videos and model details can be found in the supplementary materials and on the project website: https://ut-austin-rpl.github.io/VIOLA.
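
As a concrete, purely illustrative picture of the policy described above, the sketch below builds region features from object proposals, concatenates them with a global context feature and a learnable action token, passes the tokens through a transformer, and decodes an action from the action token's output slot. All module names, shapes, and the readout position are assumptions for illustration; the actual implementation lives under viola_bc/.

    # Minimal, hypothetical sketch of an object-centric transformer policy step.
    # Shapes, names, and the action readout are illustrative assumptions,
    # not the repository's exact implementation.
    import torch
    import torch.nn as nn

    class ObjectCentricPolicySketch(nn.Module):
        def __init__(self, feat_dim=64, num_heads=4, num_layers=2, action_dim=7):
            super().__init__()
            # learnable action token, analogous to a CLS token
            self.action_token = nn.Parameter(torch.zeros(1, 1, feat_dim))
            layer = nn.TransformerEncoderLayer(feat_dim, num_heads, batch_first=True)
            self.transformer = nn.TransformerEncoder(layer, num_layers)
            self.action_head = nn.Linear(feat_dim, action_dim)

        def forward(self, context_feat, region_feats, extra_obs_feats):
            # context_feat: (B, 1, D) global visual context
            # region_feats: (B, K, D) features pooled from K object proposals
            # extra_obs_feats: (B, M, D) proprioception / other observation tokens
            B = region_feats.shape[0]
            action_tok = self.action_token.expand(B, -1, -1)
            tokens = torch.cat([context_feat, region_feats, action_tok, extra_obs_feats], dim=1)
            out = self.transformer(tokens)
            # read the action from the slot of the action token
            action_idx = context_feat.shape[1] + region_feats.shape[1]
            return self.action_head(out[:, action_idx, :])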

Real Robot Usage

This codebase does not include the real robot experiment setup. If you are interested in using the real-robot control infrastructure we use, please check out Deoxys! It comes with detailed documentation for getting started.

Installation

Clone the repo with:

git clone --recurse-submodules git@github.com:UT-Austin-RPL/VIOLA.git

Then go into VIOLA/third_party and install each dependency according to its instructions: detectron2, Detic.

Then install all the other dependencies. The most important packages are torch, robosuite, and robomimic:

pip install -r requirements.txt
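
To sanity-check the installation, you can try importing the key packages; this snippet is only a suggestion and assumes each package exposes a __version__ attribute.

    # Hypothetical installation check: confirm the key dependencies import.
    import torch, robosuite, robomimic, detectron2

    for pkg in (torch, robosuite, robomimic, detectron2):
        print(pkg.__name__, getattr(pkg, "__version__", "unknown"))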

Usage

Demonstration collection and dataset creation

By default, we assume demonstrations are collected through SpaceMouse teleoperation:

python data_generation/collect_demo.py --controller OSC_POSITION --num-demonstration 100 --environment stack-two-types --pos-sensitivity 1.5 --rot-sensitivity 1.5

Then create a dataset from the collected hdf5 file:

python data_generation/create_dataset.py --use-actions --use-camera-obs --dataset-name training_set --demo-file PATH_TO_DEMONSTRATION_DATA/demo.hdf5 --domain-name stack-two-types
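
If you want to verify what the script produced, a short h5py walk prints every dataset in the file along with its shape; the path below is only an example, and the exact key layout depends on create_dataset.py.

    # Hypothetical inspection of the generated hdf5 dataset with h5py.
    import h5py

    with h5py.File("PATH_TO_DATASET/training_set.hdf5", "r") as f:  # example path
        def show(name, obj):
            if isinstance(obj, h5py.Dataset):
                print(name, obj.shape, obj.dtype)
        f.visititems(show)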

Augment datasets with color augmentations and object proposals

Add color augmentation to the original dataset:

python data_generation/aug_post_processing.py --dataset-folder DATASET_FOLDER_NAME
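
As a rough illustration of what a color augmentation pass can look like (not necessarily the transform used by aug_post_processing.py), one could jitter brightness, contrast, saturation, and hue with torchvision:

    # Illustrative color jitter; the script's actual augmentation may differ.
    import torch
    from torchvision import transforms

    color_jitter = transforms.ColorJitter(brightness=0.3, contrast=0.3,
                                          saturation=0.3, hue=0.05)
    image = torch.rand(3, 128, 128)   # placeholder RGB image in [0, 1]
    augmented = color_jitter(image)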

Then we generate general object proposals using Detic models:

python data_generation/process_data_w_proposals.py --nms 0.05
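
The --nms flag presumably sets an IoU threshold for non-maximum suppression over the proposals. The standalone snippet below shows how such a low threshold (0.05) suppresses boxes that overlap even slightly, using torchvision.ops.nms; the actual proposal pipeline lives in process_data_w_proposals.py.

    # Standalone illustration of NMS with iou_threshold=0.05.
    import torch
    from torchvision.ops import nms

    boxes = torch.tensor([[10., 10., 50., 50.],
                          [12., 12., 52., 52.],      # overlaps heavily with the first box
                          [80., 80., 120., 120.]])   # disjoint from the others
    scores = torch.tensor([0.9, 0.8, 0.7])
    keep = nms(boxes, scores, iou_threshold=0.05)
    print(keep)  # indices of surviving boxes, here tensor([0, 2])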

Training and evaluation

To train a policy model with our generated dataset, run

python viola_bc/exp.py experiment=stack_viola ++hdf5_cache_mode="low_dim"

For evaluation, run

python viola_bc/final_eval_script.py --state-dir checkpoints/stack --eval-horizon 1000 --hostname ./ --topk 20 --task-name normal

Datasets and trained checkpoints

We also make the datasets and trained checkpoints used in our paper publicly available. You can download them here:

Datasets: download the datasets, unzip the archive under the root folder of the repo, and rename the folder to datasets. Note that our simulation datasets were collected with robosuite v1.3.0, so the textures of the robots and floors in the datasets will not match robosuite v1.4.0.

Checkpoints: download the best-performing checkpoints, unzip the archive under the root folder of the repo, and rename the folder to results.


viola's Issues

questions about real dataset

Hello! Thanks for releasing this great work!
I am trying to reproduce this model in the real world, so I am using the VIOLA dataset. I was wondering about the scale and conventions of the dataset: are the actions absolute or delta? Is the translation measured in meters or centimeters, and is the rotation recorded in radians or degrees? I find the units confusing.
Thanks for answering my questions!

Questions about the action token and inference process

I have some questions:
For questions 1 and 2 (line 168 here):

    # reshape back to the original (pre-flattening) shape
    transformer_out = transformer_out.reshape(original_shape)
    # take the token at index 0 along the third dimension
    action_token_out = transformer_out[:, :, 0, :]
    if per_step:
        # keep only the last slice along the second dimension
        action_token_out = action_token_out[:, -1:, :]
  1. Is there any reason you only use index 0 of the transformer output?
  2. During inference, why do you take the -1: index? Why do you use a different setting for inference?
  3. In your paper, you mention that the action token is used as an input, but I cannot find the code where the action token is used as input. Can you point to where the corresponding code is?
  4. Can you explain why TensorUtils.time_distributed is used on this line?
  5. During inference, is there a reason why you do post-processing on the gripper history?

Thank you in advance!
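
For readers hitting the same questions: one common design, and a plausible but unconfirmed reading of the snippet above, is to prepend a learnable action token to the token sequence so that its output slot (index 0) aggregates the whole sequence, much like a ViT/BERT CLS token. The sketch below shows only that generic pattern; it is not the repository's code.

    # Generic CLS/action-token pattern; an assumption for illustration, not VIOLA's code.
    import torch
    import torch.nn as nn

    B, T, D = 2, 10, 64                                # batch, observation tokens, feature dim
    action_token = nn.Parameter(torch.zeros(1, 1, D))  # learnable token
    obs_tokens = torch.randn(B, T, D)

    tokens = torch.cat([action_token.expand(B, -1, -1), obs_tokens], dim=1)  # slot 0 = action token
    layer = nn.TransformerEncoderLayer(D, nhead=4, batch_first=True)
    out = nn.TransformerEncoder(layer, num_layers=1)(tokens)
    action_token_out = out[:, 0, :]                    # read the action from slot 0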

Question about the action Token and image augmentation

action_token_out = transformer_out[:, :, 0, :]

Hello, I don't understand why the first token of the output is taken directly as action_token_out. After your grouping, the grouped input should follow this order: spatial_context_feature + region_feature + action_token + other obs features. Would the token order change as they pass through the transformer_decoder?

In addition, regarding the image augmentation (padding + random_crop), how many crops do you take? Looking through the code, it seems only the default value num_crops=1 is used. Isn't global information lost if there is only one crop? From what I see in your code, the feature map is extracted from the cropped image.

Could you help me figure out why and how? Thanks a lot
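
For context on the padding + random-crop augmentation asked about above, a minimal torchvision version is sketched below (illustrative only; the repository may implement its randomizer differently). With num_crops=1, each image yields a single randomly shifted view per pass.

    # Illustrative pad-then-random-crop augmentation producing one crop (num_crops=1).
    import torch
    from torchvision import transforms

    pad_and_crop = transforms.Compose([
        transforms.Pad(padding=4),              # zero-pad the borders
        transforms.RandomCrop(size=(128, 128)), # crop back to the original size
    ])
    image = torch.rand(3, 128, 128)
    augmented = pad_and_crop(image)             # one randomly shifted view of the image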
