
Modified PySlowFast (with MViT v2) for AI City Challenge

Introduction

The Naturalistic Driving Action Recognition track of the AI City Challenge requires temporally localizing driver actions given multi-view video streams. Our system, Stargazer, achieved second place on the public leaderboard and third place in the final test. It is built on the Improved Multiscale Vision Transformer (MViTv2) with large-scale pretraining on the Kinetics-700 dataset. Our CVPR workshop paper detailing the design is here.

Citations

If you find this code useful in your research, please cite:

@inproceedings{liang2022stargazer,
  title={Stargazer: A transformer-based driver action detection system for intelligent transportation},
  author={Liang, Junwei and Zhu, He and Zhang, Enwei and Zhang, Jun},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={3160--3167},
  year={2022}
}

Requirements

  • FFmpeg >= 3.4, for cutting the videos into clips for training.
  • Python 3.8, tqdm, decord, OpenCV, PyAV, PyTorch >= 1.9.0, fairscale (a possible pip setup is sketched below).
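
A minimal environment sketch (the pip package names are our best-guess equivalents for the dependencies above; they are not pinned by the original authors):

  $ pip install tqdm decord opencv-python av fairscale "torch>=1.9.0"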

Data Preparation

  1. Put all videos into a single directory under data/A1_A2_videos/. There should be 60 ".MP4" files under this directory.
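
    A quick sanity check of the count (a simple sketch, assuming the uppercase ".MP4" extension as above):

      $ ls data/A1_A2_videos/*.MP4 | wc -l   # should print 60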

  2. Download our processed annotations from here. This file simply re-formats the original annotations. Put it under data/annotations/.

  3. Generate files for training on A1 videos.

    • Generate the processed annotation file and the video-cutting commands

      $ python scripts/aicity_convert_anno.py data/annotations/annotation_A1.edited.csv \
      data/A1_A2_videos/ data/annotations/processed_anno_original.csv \
      A1_cut.sh data/A1_clips/ --resolution=-2:540
      

      The processed_anno_original.csv should have 1115 lines.
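
      The --resolution=-2:540 option corresponds to ffmpeg's scale filter: the height is set to 540 and the width is chosen automatically (kept divisible by 2). Each line of A1_cut.sh should therefore be an ffmpeg command roughly like the following sketch (file names and timestamps are illustrative; the exact flags are produced by the script):

      $ ffmpeg -i data/A1_A2_videos/VIDEO.MP4 -ss START -to END \
        -vf scale=-2:540 data/A1_clips/CLIP.mp4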

    • Cut the videos (instead of GNU parallel, you can also run bash A1_cut.sh directly).

      $ mkdir data/A1_clips
      $ parallel -j 4 < A1_cut.sh
      
    • Make annotation splits (without empty segments; see the paper for details)

      $ python scripts/aicity_split_anno.py data/annotations/processed_anno_original.csv \
      data/annotations/pyslowfast_anno_na0 --method 1
      
    • Make annotation splits (with empty segments)

      $ python scripts/aicity_split_anno.py data/annotations/processed_anno_original.csv \
      data/annotations/pyslowfast_anno_naempty0 --method 2
      
    • Make annotation files for training on the whole A1 set

      $ mkdir data/annotations/pyslowfast_anno_na0/full
      $ cat data/annotations/pyslowfast_anno_na0/splits_1/train.csv \
      data/annotations/pyslowfast_anno_na0/splits_1/val.csv \
      > data/annotations/pyslowfast_anno_na0/full/train.csv
      $ cp data/annotations/pyslowfast_anno_na0/splits_1/val.csv \
      data/annotations/pyslowfast_anno_na0/full/
      
    • Download the pre-trained K700 checkpoint from here. Put k700_train_mvitV2_full_16x4_fromscratch_e200_448.pyth under models/. This model achieves 71.91% top-1 accuracy on the Kinetics-700 validation set.

Training

Train using the 16x4, 448-crop K700 pretrained model on A1 videos for 200 epochs, as in the paper. Here we test on a machine with 3 GPUs (11 GB memory per GPU). The code base supports multi-machine training as well.

First, we need to add the repository root to PYTHONPATH:

  $ export PYTHONPATH=$PWD/:$PYTHONPATH;

Remove Dashboard_User_id_24026_NoAudio_3.24026.533.535.MP4 from data/annotations/pyslowfast_anno_na0/full/train.csv.
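
One way to do this (a sketch using grep's fixed-string mode; any text editor works too):

  $ grep -vF "Dashboard_User_id_24026_NoAudio_3.24026.533.535.MP4" \
    data/annotations/pyslowfast_anno_na0/full/train.csv > /tmp/train.csv && \
    mv /tmp/train.csv data/annotations/pyslowfast_anno_na0/full/train.csv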

Train:

  $ mkdir -p exps/aicity_train
  $ cd exps/aicity_train
  $ python ../../tools/run_net.py --cfg ../../configs/Aicity/MVITV2_FULL_B_16x4_CONV_448.yaml \
  TRAIN.CHECKPOINT_FILE_PATH ../../models/k700_train_mvitV2_full_16x4_fromscratch_e200_448.pyth \
  DATA.PATH_PREFIX ../../data/A1_clips \
  DATA.PATH_TO_DATA_DIR ../../data/annotations/pyslowfast_anno_na0/full \
  TRAIN.ENABLE True TRAIN.BATCH_SIZE 3 NUM_GPUS 3 TEST.BATCH_SIZE 3 TEST.ENABLE False \
  DATA_LOADER.NUM_WORKERS 8 SOLVER.BASE_LR 0.000005 SOLVER.WARMUP_START_LR 1e-7 \
  SOLVER.WARMUP_EPOCHS 30.0 SOLVER.COSINE_END_LR 1e-7 SOLVER.MAX_EPOCH 200 LOG_PERIOD 1000 \
  TRAIN.CHECKPOINT_PERIOD 100 TRAIN.EVAL_PERIOD 200 USE_TQDM True \
  DATA.DECODING_BACKEND decord DATA.TRAIN_CROP_SIZE 448 DATA.TEST_CROP_SIZE 448 \
  TRAIN.AUTO_RESUME True TRAIN.CHECKPOINT_EPOCH_RESET True \
  TRAIN.MIXED_PRECISION False MODEL.ACT_CHECKPOINT True \
  TENSORBOARD.ENABLE False TENSORBOARD.LOG_DIR tb_log \
  MIXUP.ENABLE False MODEL.LOSS_FUNC cross_entropy \
  MODEL.DROPOUT_RATE 0.5 MVIT.DROPPATH_RATE 0.4 \
  SOLVER.OPTIMIZING_METHOD adamw

The model that ranked No. 2 on the leaderboard was trained on 2x8 A100 GPUs with a global batch size of 64 and a learning rate of 1e-4 (also with gradient checkpointing but without mixed-precision training). For the 3-GPU run above, we therefore use a batch size of 3 and a learning rate of 0.000005, following the linear scaling rule. To reproduce our results, however, a similar global batch size is recommended.
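
A quick check of the linear scaling arithmetic:

  $ python -c "print(1e-4 * 3 / 64)"
  4.6875e-06    # rounded to 5e-6 (0.000005) in the command above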

To run this code on multiple machines with PyTorch DDP, add --init_method "tcp://${MAIN_IP}:${PORT}" --num_shards ${NUM_MACHINE} --shard_id ${INDEX} to the command. ${MAIN_IP} is the IP of the root node and ${INDEX} is each node's index (0 for the root node). A two-machine example is sketched below.
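
For instance, a hypothetical two-machine run (the IP and port are placeholders, and [...] stands for the same configuration overrides as in the single-machine command above):

  # on the root node (10.0.0.1):
  $ python ../../tools/run_net.py --cfg ../../configs/Aicity/MVITV2_FULL_B_16x4_CONV_448.yaml \
    --init_method "tcp://10.0.0.1:9999" --num_shards 2 --shard_id 0 NUM_GPUS 8 [...]
  # on the second node:
  $ python ../../tools/run_net.py --cfg ../../configs/Aicity/MVITV2_FULL_B_16x4_CONV_448.yaml \
    --init_method "tcp://10.0.0.1:9999" --num_shards 2 --shard_id 1 NUM_GPUS 8 [...]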

Inference

To get the submission file for a test set, we need the model, a threshold file, the videos, and the video_ids.csv.

  1. Get the model

    Follow the Training section above, or download our checkpoint from here. Put the model under models/. This is the model that ranked No. 2 on the A2 public leaderboard.

  2. Get the thresholds. Put them under thresholds/.

    • Best public leaderboard threshold from here (empirically searched).

    • Best general leaderboard threshold from here (grid searched).

    • Threshold for a model trained on A1 pyslowfast_anno_naempty0/splits_1, empirically searched, from here.

  3. Run sliding-window classification (single GPU).

    Given a list of video names and the path to the videos, run the model. The 16x4, 448 model with batch_size=1 takes about 5 GB of GPU memory. The sliding-window arithmetic implied by these options is spelled out after the command.

     # cd back to the root path
     $ python scripts/run_action_classification_temporal_inf.py A2_videos.lst data/A1_A2_videos/ \
     models/aicity_train_mvitV2_16x4_fromk700_e200_lr0.0001_yeswarmup_nomixup_dp0.5_dpr0.4_adamw_na0_full_448.pyth \
     test/16x4_s16_448_full_na0_A2test \
     --model_dataset aicity --frame_length 16 --frame_stride 4 --proposal_length 64 \
     --proposal_stride 16 --video_fps 30.0  --frame_size 448 \
     --pyslowfast_cfg configs/Aicity/MVITV2_FULL_B_16x4_CONV_448.yaml \
     --batch_size 1 --num_cpu_workers 4
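
     These options define the sliding-window geometry (assuming, as the flag names suggest, that proposal lengths and strides are counted in frames): each 64-frame proposal spans 64 / 30 ≈ 2.13 s at 30 fps, consecutive proposals overlap by 64 - 16 = 48 frames, and the 16 sampled frames at stride 4 cover exactly 16 × 4 = 64 frames.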
    
  4. Run post-processing with the given threshold file to get the submission file.

    $ python scripts/aicity_inf.py test/16x4_s16_448_full_na0_A2test thresholds/public_leaderboard_thres.txt \
    A2_video_ids.csv test/16x4_s16_448_full_na0_A2test.txt --agg_method avg \
    --chunk_sort_base_single_vid score --chunk_sort_base_multi_vid length --use_num_chunk 1
    

    The submission file is test/16x4_s16_448_full_na0_A2test.txt. It should achieve F1 = 0.3295 on the A2 test set, as on the public leaderboard.

Acknowledgement

This code base is heavily adapted from the PySlowFast code base.
