commavq

commaVQ is a dataset of 100,000 heavily compressed driving videos for machine learning research. Video compressed this heavily is useful for experimenting with GPT-like video prediction models. This repo includes an encoder/decoder and an example of a video prediction model.

2x$1000 Challenges!

  • Get 1.92 cross entropy loss or less on the val set and on our private val set (using ./notebooks/eval.ipynb). For reference, gpt2m trained on a larger dataset gets 2.02 cross entropy loss.
  • Make gpt2m.onnx run at 0.25 sec/frame or less on a consumer GPU (e.g. NVIDIA 3090) without degradation in cross entropy loss. The current implementation runs at 0.5 sec/frame with KV caching and float16. Note that you are allowed to use other ML inference libraries. The following changes improved the performance of our original implementation:
    • updating onnxruntime-gpu to 1.14 (1.5 sec/frame -> 1.2 sec/frame)
    • using onnxruntime.transformers.optimizer (1.2 sec/frame -> 0.8 sec/frame)
    • using onnxruntime.transformers.optimizer and making sure the Attention op is fused (0.8 sec/frame -> 0.5 sec/frame)
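
Both challenge targets can be measured with a few lines of NumPy and the standard library. The sketch below is an illustration, not the code in ./notebooks/eval.ipynb: it assumes the loss is the mean per-token cross entropy in nats over a 1024-entry vocabulary, and `cross_entropy`, `seconds_per_frame`, and `generate_frame` are hypothetical names introduced here.

```python
import time

import numpy as np


def cross_entropy(logits: np.ndarray, targets: np.ndarray) -> float:
    """Mean cross entropy (in nats) of integer targets under unnormalized logits.

    logits: (N, vocab_size) float array; targets: (N,) integer array.
    """
    logits = logits.astype(np.float64)
    # Subtract the row max for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())


def seconds_per_frame(generate_frame, n_frames: int = 5) -> float:
    """Average wall-clock seconds per call of a frame-generating callable."""
    start = time.perf_counter()
    for _ in range(n_frames):
        generate_frame()
    return (time.perf_counter() - start) / n_frames


# Sanity check: uniform logits over 1024 tokens give a loss of ln(1024).
vocab_size = 1024  # assumption: one logit per 10-bit token value
uniform_logits = np.zeros((4, vocab_size))
targets = np.array([0, 1, 2, 3])
uniform_loss = cross_entropy(uniform_logits, targets)
print(f"uniform-guess loss: {uniform_loss:.3f}")  # about 6.931
```

Under that assumption, a model that guesses uniformly scores about 6.93, which gives a sense of scale for the 2.02 baseline. For the speed challenge, `generate_frame` would wrap whatever call produces one full frame of tokens.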

Overview

A VQ-VAE [1,2] was used to heavily compress each frame into 128 "tokens" of 10 bits each, arranged as an 8x16 grid. Each entry of the dataset is a "segment" of compressed driving video, i.e. 1 min of frames at 20 FPS (1200 frames). Each file therefore has shape 1200x8x16 and is saved as int16.
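
The numbers above can be checked with a little arithmetic. The following sketch uses a randomly generated array as a stand-in for a real segment (an assumption made so the example runs without downloading anything); the constant names are illustrative.

```python
import numpy as np

FRAMES_PER_SEGMENT = 60 * 20  # 1 min at 20 FPS
TOKEN_GRID = (8, 16)          # token grid per frame
BITS_PER_TOKEN = 10

tokens_per_frame = TOKEN_GRID[0] * TOKEN_GRID[1]     # 128
vocab_size = 2 ** BITS_PER_TOKEN                     # 1024 possible token values
bits_per_frame = tokens_per_frame * BITS_PER_TOKEN   # 1280 bits = 160 bytes
info_bytes_per_segment = FRAMES_PER_SEGMENT * bits_per_frame // 8

# Synthetic stand-in for one segment file: same shape and dtype as described above.
rng = np.random.default_rng(0)
segment = rng.integers(
    0, vocab_size, size=(FRAMES_PER_SEGMENT, *TOKEN_GRID), dtype=np.int16
)
stored_bytes = segment.nbytes  # int16 uses 16 bits per token, not 10

print(segment.shape, segment.dtype)  # (1200, 8, 16) int16
print(info_bytes_per_segment)        # 192000
print(stored_bytes)                  # 307200
```

The gap between the two byte counts comes from storing each 10-bit token in a 16-bit integer.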

Note that the compressor is extremely lossy on purpose. It makes the dataset smaller and easy to play with (train GPT with large context size, fast autoregressive generation, etc.). We might extend the dataset to a less lossy version when we see fit.

Download

  • Using Hugging Face datasets:
import numpy as np
from datasets import load_dataset
num_proc = 40 # CPUs go brrrr
ds = load_dataset('commaai/commavq', num_proc=num_proc)
tokens = np.load(ds['0'][0]['path']) # first segment from the first data shard
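
Once a segment is loaded, a GPT-like model needs it as a flat token sequence. The sketch below shows one way to do that, assuming a separator token is placed before each frame. The separator id `BOS_TOKEN = 1024` (the first id outside the 10-bit range) and the helper `segment_to_sequence` are hypothetical choices for illustration, and a random array stands in for the loaded `tokens`.

```python
import numpy as np

BOS_TOKEN = 1024  # hypothetical frame separator, chosen outside the 0..1023 token range


def segment_to_sequence(tokens: np.ndarray, bos_token: int = BOS_TOKEN) -> np.ndarray:
    """Flatten (frames, 8, 16) tokens into a 1-D sequence with one separator per frame."""
    n_frames = tokens.shape[0]
    flat = tokens.reshape(n_frames, -1).astype(np.int64)      # (frames, 128)
    bos = np.full((n_frames, 1), bos_token, dtype=np.int64)   # (frames, 1)
    return np.concatenate([bos, flat], axis=1).reshape(-1)    # (frames * 129,)


# Synthetic stand-in for `tokens = np.load(...)` from the snippet above.
rng = np.random.default_rng(0)
tokens = rng.integers(0, 1024, size=(1200, 8, 16), dtype=np.int16)

sequence = segment_to_sequence(tokens)
print(sequence.shape)  # (154800,)
```

Casting to int64 before concatenating keeps the separator id from being squeezed into the int16 storage type, and matches the integer type most training frameworks expect for token ids.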

Models

In ./models/ you will find 3 neural networks saved in the ONNX format, plus an experimental temporal decoder:

  • ./models/encoder.onnx: the encoder used to compress the frames
  • ./models/decoder.onnx: the decoder used to decompress the frames
  • ./models/gpt2m.onnx: a 300M parameter GPT trained on a larger version of this dataset
  • (experimental) ./models/temporal_decoder.onnx: a temporal decoder, which is a stateful version of the vanilla decoder
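
The input and output names of these models are not listed here, so a reasonable first step is to inspect them rather than guess. The sketch below does that with onnxruntime, assuming the package is installed and the model files are present locally; `describe_onnx_model` is a hypothetical helper name.

```python
def describe_onnx_model(path: str) -> dict:
    """Return the name, type, and shape of every input and output of an ONNX model."""
    # Imported lazily so the helper can be defined without onnxruntime installed.
    import onnxruntime as ort

    # The CPU provider is enough for inspection; no GPU is required.
    session = ort.InferenceSession(path, providers=["CPUExecutionProvider"])
    return {
        "inputs": [(i.name, i.type, i.shape) for i in session.get_inputs()],
        "outputs": [(o.name, o.type, o.shape) for o in session.get_outputs()],
    }


# Example usage, once the files are available:
#   for name in ("encoder", "decoder"):
#       print(name, describe_onnx_model(f"./models/{name}.onnx"))
```

The reported names and shapes can then be used to build the feed dictionary for `session.run`.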

Examples

Check out ./notebooks/encode.ipynb and ./notebooks/decode.ipynb for an example of how to visualize the dataset using a segment of driving video from comma's drive to Taco Bell.

Check out ./notebooks/gpt.ipynb for an example of how to use a pretrained GPT model to imagine future frames.
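
"Imagining" a future frame amounts to sampling its 128 tokens one at a time, each conditioned on everything before it. The sketch below illustrates that loop only; it is not the notebook's code. `next_token_logits` is a hypothetical callable standing in for the GPT, and `dummy_logits` is a placeholder that returns uniform logits so the example runs without a model.

```python
import numpy as np

TOKENS_PER_FRAME = 128  # 8 x 16 grid
VOCAB_SIZE = 1024       # 10-bit tokens


def sample_next_frame(context, next_token_logits, rng, temperature=1.0):
    """Autoregressively sample one frame of tokens given a list of context tokens."""
    sequence = list(context)
    for _ in range(TOKENS_PER_FRAME):
        logits = np.asarray(next_token_logits(sequence), dtype=np.float64) / temperature
        # Softmax with max subtraction for numerical stability.
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        sequence.append(int(rng.choice(VOCAB_SIZE, p=probs)))
    # The newly sampled tokens are the last 128 entries of the sequence.
    return np.array(sequence[-TOKENS_PER_FRAME:], dtype=np.int16).reshape(8, 16)


def dummy_logits(sequence):
    """Placeholder model: every token is equally likely, regardless of context."""
    return np.zeros(VOCAB_SIZE)


rng = np.random.default_rng(0)
context = rng.integers(0, VOCAB_SIZE, size=2 * TOKENS_PER_FRAME).tolist()
frame = sample_next_frame(context, dummy_logits, rng)
print(frame.shape, frame.dtype)  # (8, 16) int16
```

A real model would recompute far less per step by caching attention keys and values, which is the KV caching mentioned in the challenges.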

source_video.mp4
compressed_video.mp4
generated.mp4

References

[1] Van Den Oord, Aaron, Oriol Vinyals, and Koray Kavukcuoglu. "Neural discrete representation learning." Advances in neural information processing systems 30 (2017).

[2] Esser, Patrick, Robin Rombach, and Bjorn Ommer. "Taming transformers for high-resolution image synthesis." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2021.
