Behnoosh Parsa, Ashis G. Banerjee; Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2021, pp. 2352-2362
Abstract
In this work, we propose a new approach to Human Activity Evaluation (HAE) in long videos using graph-based multi-task modeling. Previous works in activity evaluation either directly compute a metric using a detected skeleton or use the scene information to regress the activity score. These approaches are insufficient for accurate activity assessment since they only compute an average score over a clip, and do not consider the correlation between the joints and body dynamics. Moreover, they are highly scene-dependent, which makes the generalizability of these methods questionable. We propose a novel multi-task framework for HAE that utilizes a Graph Convolutional Network backbone to embed the interconnections between human joints in the features. In this framework, we solve the Human Activity Segmentation (HAS) problem as an auxiliary task to improve activity assessment. The HAS head is powered by an encoder-decoder Temporal Convolutional Network to semantically segment long videos into distinct activity classes, whereas HAE uses a Long Short-Term Memory (LSTM)-based architecture. We evaluate our method on the UW-IOM and TUM Kitchen datasets and discuss the success and failure cases in these two datasets.
The details of the best-performing multi-task learning network architecture are shown in the following figure.
How to run the code
Required Environment
To install all the requirements for this project, you can create a conda environment from MLTGCN_environment.yml by executing the following command in your terminal:
Configuration for experiments on the UW-IOM and TUM Kitchen datasets is in the config_files folder. You can change the task by editing the experiment files, for example config_UW_exp.yml.
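As an illustration, the task entry in such an experiment file might look like the following; the key name is an assumption for illustration, not taken from the repository:

```yaml
# config_UW_exp.yml (hypothetical excerpt -- key name is illustrative)
task: MTL   # one of: classification, regression, MTL, MTL-Emb
```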
There are four tasks to choose from: ['classification', 'regression', 'MTL', 'MTL-Emb'].
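If you script your experiments, the chosen task string can be sanity-checked against this list before launching a run. The helper below is a hypothetical sketch (not part of the repository):

```python
# Hypothetical helper (not from the repository): validate the 'task' value
# chosen in an experiment file against the four supported options.
VALID_TASKS = ['classification', 'regression', 'MTL', 'MTL-Emb']

def check_task(task: str) -> str:
    """Return the task unchanged if it is supported, else raise ValueError."""
    if task not in VALID_TASKS:
        raise ValueError(f"unknown task {task!r}; choose one of {VALID_TASKS}")
    return task
```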