
action-recognition-visual-attention's People

Contributors

kracwarlock

action-recognition-visual-attention's Issues

UCF11 evaluation problem

[screenshot]

After update 100 the model begins to predict, but as shown above, the prediction has been running for over 24 hours without progressing to a new model update. Is that normal? If not, what might be the main cause of this problem? I'm using your code from issue #6 for the data preprocessing, with a 6:2:2 train/validation/test split. Thanks!

Results reproducibility

Hi @kracwarlock! Thank you for sharing the code for your amazing paper! To reproduce your published results, I was wondering how you selected the validation split for the HMDB-51 and Hollywood2 datasets. For those two datasets, could you please share your valid_labels.txt, train_labels.txt, test_labels.txt, train_filenames.txt, test_filenames.txt and valid_filenames.txt files? I would really appreciate it :)

How is the soft attention model implemented in this project?

Hi, @kracwarlock. I am confused about the implementation of the soft attention model. Why is the code related to alpha (pstate & pctx) in the lstm_cond_layer function different from equation (4) in your paper? I hope you can explain in more detail how the weights W_i, mapping to the i-th element of the location softmax, are implemented in this project. Thanks a lot.
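As I read equation (4) of the paper, the location softmax puts a distribution over the K = 49 spatial locations using one weight vector W_i per location applied to the previous hidden state. A minimal numpy sketch of that equation (shapes are illustrative; this is not the repo's Theano code, whose pstate & pctx terms follow the Show, Attend and Tell formulation):

```python
import numpy as np

def location_softmax(h_prev, W):
    # Eq. (4): l_{t,i} = exp(W_i^T h_{t-1}) / sum_j exp(W_j^T h_{t-1})
    # h_prev: (dim,) previous hidden state; W: (K, dim), one row W_i per location
    scores = W @ h_prev
    scores -= scores.max()          # numerical stability; doesn't change the result
    e = np.exp(scores)
    return e / e.sum()              # (K,) non-negative, sums to 1

rng = np.random.default_rng(0)
l_t = location_softmax(rng.normal(size=512), rng.normal(size=(49, 512)))
```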

Your time for one 128-batch?

Hi @kracwarlock ,

This is my first time training a network with Theano. I wonder if my setup is wrong, because training is very slow even though Theano prints that my GPU is being used; training networks in Caffe is much faster for me. Do you remember roughly how many seconds one 128-image batch took during your training? It takes me about 60 seconds per batch.

Thank you.

Multi-layer LSTM

Hi @kracwarlock, sorry to bother you again. I am opening another issue here, related to your comment:

I see that I did not release the multi-layer LSTM code. I will try to do that as soon as I have time. Till then this is how it is done https://github.com/kelvinxu/arctic-captions/blob/master/capgen.py#L542-L548. In the paper the X means the feature of a single sample. In the code everything is done on a batch.

I tried the approach you suggested, but soon realized that it cannot work: by simply replicating the lstm_cond_layer, you also replicate the theano.scan that iterates over the n_steps. It seems to me that this prevents the upper LSTM layer from providing the location softmax to the lower one at each time step. I reckon the multiple layers should be implemented inside a single theano.scan instance.

Could you please either comment on that or provide the original code of the multilayer LSTM?
Thanks in advance!
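For what it's worth, one way to satisfy that constraint is to put all three layers inside a single step function, so only one loop iterates over time and the top layer's previous hidden state is available when computing the location softmax for the next step. A runnable numpy stand-in for the scan (a sketch with toy sizes, not the authors' code):

```python
import numpy as np

def lstm_cell(x, h, c, W):
    # Minimal LSTM cell; W maps concat([x, h]) to the 4*dim gate pre-activations.
    dim = h.shape[0]
    z = W @ np.concatenate([x, h])
    sig = lambda a: 1.0 / (1.0 + np.exp(-a))
    i, f, o, g = z[:dim], z[dim:2*dim], z[2*dim:3*dim], z[3*dim:]
    c_new = sig(f) * c + sig(i) * np.tanh(g)
    return sig(o) * np.tanh(c_new), c_new

def step(ctx, states, Ws, W_att):
    # ONE time step for the whole 3-layer stack: the location softmax is
    # computed from the TOP layer's previous hidden state, then every layer
    # is updated in sequence -- so a single outer loop (one theano.scan)
    # iterates over time.
    scores = W_att @ states[-1][0]
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                    # location softmax over K locations
    x = alpha @ ctx                         # attended feature vector
    new_states = []
    for (h, c), W in zip(states, Ws):
        h, c = lstm_cell(x, h, c, W)
        new_states.append((h, c))
        x = h                               # feed the layer above
    return new_states, alpha

# smoke run with toy sizes (dim, K, feat are hypothetical)
dim, K, feat = 8, 49, 16
rng = np.random.default_rng(1)
Ws = [rng.normal(scale=0.1, size=(4 * dim, feat + dim))] + \
     [rng.normal(scale=0.1, size=(4 * dim, 2 * dim)) for _ in range(2)]
W_att = rng.normal(scale=0.1, size=(K, dim))
states = [(np.zeros(dim), np.zeros(dim))] * 3
ctx = rng.normal(size=(K, feat))
for t in range(5):                          # the single "scan" over time
    states, alpha = step(ctx, states, Ws, W_att)
```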

Can anyone share the code mentioned in closed issue #6 that combines the features into h5 format?

I am interested in the combination of deep learning and attention mechanisms; I think it is a good way to see what a deep network focuses on. I have extracted the features from the convolutional layers of GoogLeNet, but the second link in issue #6 is no longer available ("To combine the individual files generated by this script that he sent me I used https://gist.github.com/kracwarlock/96499936487d6125dd010319669c6648").
Can anyone share this code again?
Thanks very much!


Input data

Hello @kracwarlock
I have two questions about the input data:

  1. When training on Hollywood2, I get a memory error ("Cannot allocate memory") before the training loop starts, probably due to the large amount of data. During pre-processing I sparsify the features, and my training feature file is about 10 GB. How big is this file for you?
  2. If I use a subset of the dataset (to avoid the memory problem above), training starts fine, but at Epoch 0, Update 1240 an error occurs: "NaN detected in cost".

Thanks!
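One workaround worth trying for the first problem (an assumption about the cause, not a confirmed fix) is to avoid loading the whole feature file into memory and instead let h5py read only the slice needed for each batch:

```python
import h5py, numpy as np, os, tempfile

# build a small stand-in train_features.h5 (assumed layout: one flattened
# 7*7*1024 row per frame; 100 frames here instead of the full dataset)
path = os.path.join(tempfile.mkdtemp(), "train_features.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("features",
                     data=np.random.rand(100, 7 * 7 * 1024).astype("float32"))

with h5py.File(path, "r") as f:
    dset = f["features"]     # a handle only -- nothing is loaded yet
    batch = dset[10:26]      # h5py reads just these 16 rows from disk
```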

Initialization of LSTM layers

How did you initialize the cell state and the hidden state of the LSTM layers? You give an equation but don't explain it much. I wonder what the f_init function is; I read the code, and my guess is that it is a tanh layer. How did you do this separately for the 3 layers? I also don't know what X means: is it the feature map of a single sample, or of a batch?
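From reading arctic-captions, f_init appears to be a tanh layer applied to the feature map averaged over spatial locations, with a separate pair of init networks for h_0 and c_0. A hedged numpy sketch under that reading (shapes are illustrative; a separate weight set per layer is the natural extension for the 3-layer case):

```python
import numpy as np

def f_init(X, W_h, b_h, W_c, b_c):
    # X: (K, D) feature map of ONE sample (K = 49 locations, D = 1024).
    # h_0 and c_0 each come from a tanh layer on the mean annotation vector.
    mean_ctx = X.mean(axis=0)               # average over the K locations
    h0 = np.tanh(W_h @ mean_ctx + b_h)
    c0 = np.tanh(W_c @ mean_ctx + b_c)
    return h0, c0

rng = np.random.default_rng(0)
X = rng.normal(size=(49, 1024))
h0, c0 = f_init(X,
                rng.normal(scale=0.01, size=(512, 1024)), np.zeros(512),
                rng.normal(scale=0.01, size=(512, 1024)), np.zeros(512))
```

In the batched code X would instead be (batch, K, D) with the mean taken over axis 1, which matches the comment that "in the paper the X means the feature of a single sample" while "in the code everything is done on a batch".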

How to create the h5 features file

dataspace = H5S.create_simple( frames, [7 7 1024], {'H5S_UNLIMITED' 'H5S_UNLIMITED'} );
fid = H5F.create('train_features.h5', 'H5F_ACC_TRUNC', 'H5P_DEFAULT', 'H5P_DEFAULT');
dataset = H5D.create( fid, 'features', 'H5T_IEEE_F32LE', dataspace );

This MATLAB snippet does not work from Python. How can the h5 features file be created?
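For reference, an equivalent file can be created from Python with h5py. This is a sketch under the assumption that the dataset is named 'features' (matching dataset_name in the repo's config) and that features are stored as one flattened 7×7×1024 vector per frame:

```python
import h5py, numpy as np, os, tempfile

frames = 12                                            # e.g. from train_framenum.txt
feats = np.random.rand(frames, 7 * 7 * 1024).astype("float32")  # placeholder data

path = os.path.join(tempfile.mkdtemp(), "train_features.h5")
with h5py.File(path, "w") as f:
    # resizable along the frame axis, like H5S_UNLIMITED maxdims in the MATLAB code
    f.create_dataset("features", data=feats, maxshape=(None, feats.shape[1]))
```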

Possible bug in data_handler: Reset()

Hi @kracwarlock!
I'd like to check with you the following:

  • On data_handler.py, the Reset() function shuffles self.frame_indices_ and self.labels_
  • Then on GetBatch(), it retrieves some values from these arrays to build the batch :
    start = self.frame_indices_[self.frame_row_]
    label = self.labels_[self.frame_row_]
    length = self.lengths_[self.frame_row_]
  • And based on the length, it decides which features to include:
    if length >= self.seq_length_ * skip: ...
    else:

Shouldn't self.lengths_ be shuffled in Reset() like the other two arrays?

Thank you!
Nuno Garcia
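If that is indeed the bug, the usual fix is to draw one permutation and apply it to every per-video array, so row i still refers to the same video after shuffling. A small sketch (hypothetical array contents):

```python
import numpy as np

def reset(frame_indices, labels, lengths, rng):
    # Shuffle all per-video arrays with ONE shared permutation so that
    # frame_indices_[i], labels_[i] and lengths_[i] keep referring to the
    # same video (the suspected bug is lengths_ being left unshuffled).
    perm = rng.permutation(len(labels))
    return frame_indices[perm], labels[perm], lengths[perm]

rng = np.random.default_rng(0)
fi = np.arange(5) * 100                 # fi[i] == 100 * labels[i] by construction
lb = np.array([0, 1, 2, 3, 4])
ln = np.array([30, 40, 50, 60, 70])     # ln[i] == 30 + 10 * labels[i]
fi2, lb2, ln2 = reset(fi, lb, ln, rng)
# rows stay aligned after shuffling
assert all(ln2[i] == 30 + 10 * lb2[i] for i in range(5))
```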

Location softmax

Hi, thanks for sharing the code!
It looks like the location softmax implemented in the conditional LSTM is not the one you describe in the paper 'Action Recognition with Visual Attention' (eq. 4), but rather the one described in eqns. 4, 5, 6 and 9 of 'Show, Attend and Tell: Neural Image Caption Generation with Visual Attention'. Could you please comment on that? Much appreciated, thanks!

Python Scripts for CNN Encoding Generation

Hi,

I am using TensorFlow in Python to generate CNN encodings for the video sequences in the UCF101 dataset, dumping the outputs sequentially into an HDF5 file. My code currently takes 20 s per video to store its data in the HDF5 file, so for 9500 video sequences it will take a very long time.

Can someone share their experience in it?
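In my experience (a guess at the bottleneck, not a diagnosis of this particular script), per-frame writes and repeated file opens dominate such timings; preallocating a chunked dataset and writing one video's features per slice assignment is usually much faster:

```python
import h5py, numpy as np, os, tempfile

path = os.path.join(tempfile.mkdtemp(), "features.h5")
frames_per_video, n_videos, D = 30, 10, 7 * 7 * 32   # small stand-in sizes

with h5py.File(path, "w") as f:
    dset = f.create_dataset("features",
                            shape=(frames_per_video * n_videos, D),
                            dtype="float32",
                            chunks=(frames_per_video, D))   # one chunk per video
    for v in range(n_videos):
        feats = np.random.rand(frames_per_video, D).astype("float32")
        # one slice write per video instead of one write per frame
        dset[v * frames_per_video:(v + 1) * frames_per_video] = feats
```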

How to get the h5 features file

Hello, after extracting the features in matrix form, I tried to convert them to an HDF5 file with MATLAB, but I don't know whether I got the right format. Does this look right?

dataspace = H5S.create_simple( frames,[7 7 1024],{'H5S_UNLIMITED' 'H5S_UNLIMITED'} );
fid = H5F.create('train_features.h5','H5F_ACC_TRUNC', 'H5P_DEFAULT', 'H5P_DEFAULT');
dataset = H5D.create( fid,'features',H5T_IEEE_F32LE,dataspace);

where frames is read from train_framenum.txt.
Could you please also add some files showing how to generate the features file? Thank you!

Enquiries regarding Data Preprocessing

Thanks for making this interesting project open source. I am trying to replicate the work discussed in the paper; however, training on the Hollywood2 and UCF11 datasets does not converge. I suspect that something is wrong with my extracted features.

  • I use the Python interface of Caffe to extract the features from the "inception_5b/output" layer of GoogLeNet. The shape of the features is (1024, 7, 7), but according to other forum posts it should be (7, 7, 1024), so I swapped the axes accordingly. Is that the difference between the MATLAB and Python interfaces?
  • Among the 1024 feature maps, approximately 35% consist only of zeros. Is that normal?
  • In the MATLAB script, how do you specify the name of the feature layer you intend to use, such as "inception_5b/output"? The script simply calls scores = caffe('forward', {input_data{i}});.

Any help would be greatly appreciated :-)
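On the first point, the (1024, 7, 7) vs (7, 7, 1024) difference is most likely just memory order: Caffe's Python blobs are channel-first, while MATLAB's column-major arrays come out spatial-first. Swapping axes with transpose (not reshape, which would scramble the values) preserves each location's channel vector; a quick numpy check:

```python
import numpy as np

caffe_feat = np.random.rand(1024, 7, 7).astype("float32")  # Caffe's C x H x W output
feat = caffe_feat.transpose(1, 2, 0)                       # -> (7, 7, 1024)

assert feat.shape == (7, 7, 1024)
# sanity check: channel c at spatial location (i, j) is preserved
assert feat[3, 4, 100] == caffe_feat[100, 3, 4]
```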

How to get the dataset?

How did you obtain the dataset files, e.g. train_features.h5, train_framenum.txt, train_labels.txt, etc.?

self.data_file = '/ais/gobi3/u/shikhar/ucf11/dataset/train_features.h5'
self.num_frames_file = '/ais/gobi3/u/shikhar/ucf11/dataset/train_framenum.txt'
self.labels_file = '/ais/gobi3/u/shikhar/ucf11/dataset/train_labels.txt'
self.vid_name_file = '/ais/gobi3/u/shikhar/ucf11/dataset/train_filename.txt'
self.dataset_name = 'features'

I don't know how to get these. Do I download them from the web, or generate them myself? Could you tell me how? Thanks.
