
egocentric-gaze-prediction's Introduction

Code for the paper "Predicting Gaze in Egocentric Video by Learning Task-dependent Attention Transition" (ECCV2018)

This is the github repository containing the code for the paper "Predicting Gaze in Egocentric Video by Learning Task-dependent Attention Transition" by Yifei Huang, Minjie Cai, Zhenqiang Li and Yoichi Sato.

Requirements

The code is tested to work correctly with:

  • GPU environment
  • Anaconda Python 3.6.4
  • Pytorch v0.4.0
  • NumPy
  • OpenCV
  • tqdm

Simple test code

Output gaze prediction using one image only!

  1. Download the pretrained models (spatial, late) and put them into path/to/models.

  2. Prepare some images named with **_img.jpg in path/to/imgs/.

  3. Run run_spatialstream.py --trained_model /path/to/models/spatial.pth.tar --trained_late /path/to/models/late.pth.tar --dir /path/to/imgs/ and see the results.

This module assumes fixation at the predicted gaze position, without any attention transition. Note that the model is trained on the GTEA Gaze+ dataset; I haven't tested images from other datasets, so using images from the same dataset is recommended.

Model architecture

Code usage

For simplicity of tuning, we separate the training of each module (SP, AT and LF).

Dataset preparation

We use the GTEA Gaze+ and GTEA Gaze datasets.

For the optical flow images, use dense flow to extract all optical flow images and put them into path/to/opticalflow/images (e.g. gtea_imgflow/); a rough OpenCV-based sketch is given after the directory listing below. The flow images will be in different sub-folders like:

    .
    +---gtea_imgflow
        +---Alireza_American
        |       flow_x_00001.jpg
        |       flow_x_00002.jpg
        |       ...
        |       flow_y_00001.jpg
        |       ...
        +---Ahmad_Burger
        |       flow_x_00001.jpg
        |       ...
        ...
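
The authors use the external dense_flow tool for this step; purely as an illustration of the expected output layout, here is a rough OpenCV-based sketch that writes flow_x_/flow_y_ images per video (the Farneback parameters and the value-range mapping are my assumptions, not the settings used in the paper):

    # Rough stand-in for the dense flow extraction step (illustration only).
    import os
    import cv2
    import numpy as np

    def extract_flow(video_path, out_dir, bound=20.0):
        os.makedirs(out_dir, exist_ok=True)
        cap = cv2.VideoCapture(video_path)
        ok, prev = cap.read()
        prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
        idx = 1
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                                0.5, 3, 15, 3, 5, 1.2, 0)
            # Map flow values from [-bound, bound] to 0..255 grayscale images.
            for c, name in ((0, 'x'), (1, 'y')):
                img = np.clip((flow[..., c] + bound) * 255.0 / (2 * bound), 0, 255)
                cv2.imwrite(os.path.join(out_dir, 'flow_%s_%05d.jpg' % (name, idx)),
                            img.astype(np.uint8))
            prev_gray = gray
            idx += 1

    # e.g. extract_flow('Alireza_American.avi', 'gtea_imgflow/Alireza_American')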

All images should be put into path/to/images (e.g. gtea_images/).

The ground truth gaze image is generated from the gaze data by placing a 2D Gaussian at the gaze position; a minimal sketch is shown below. We recommend giving the ground truth images the same names as the RGB images. Put the ground truth gaze maps into path/to/gt/images (e.g. gtea_gts/). For a 1280x720 image we use a Gaussian variance of 70. A processing reference can be found in data/dataset_preprocessing.py.
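
As an illustration only (the repository's own preprocessing is in data/dataset_preprocessing.py), here is a minimal sketch of rendering one ground-truth gaze map; whether the value 70 quoted above is used as the variance or the standard deviation, and the 224x224 output size, are my assumptions:

    # Minimal sketch: render a ground-truth gaze map as a 2D Gaussian.
    import cv2
    import numpy as np

    def make_gaze_map(gaze_x, gaze_y, width=1280, height=720, sigma=70.0,
                      out_size=(224, 224)):
        xs, ys = np.meshgrid(np.arange(width), np.arange(height))
        g = np.exp(-((xs - gaze_x) ** 2 + (ys - gaze_y) ** 2) / (2.0 * sigma ** 2))
        g = (255.0 * g / g.max()).astype(np.uint8)
        return cv2.resize(g, out_size)

    # Save with the same base name as the corresponding RGB frame, e.g.:
    cv2.imwrite('gtea_gts/Alireza_American_000011.jpg', make_gaze_map(640, 360))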

We also use predicted fixation/saccade labels in our model. Examples for the GTEA Gaze+ dataset are in the folder fixsac. You may use any method to predict fixation.

Running the code

To run the complete experiment, after preparing the data, run

python gaze_full.py --train_sp --train_lstm --train_late --extract_lstm --extract_late --flowPath path/to/opticalflow/images --imagePath path/to/images --fixsacPath path/to/fixsac/folder --gtPath path/to/gt/images

The whole model is not trained end to end. We extract data for each module and train them separately. We recommend first training the spatial and temporal streams separately, and then training the full SP module using the pretrained spatial and temporal models. Direct training of SP results in slightly worse final results but better SP results.

Details of args can be seen in gaze_full.py or by typing python gaze_full.py -h.

Pre-trained model

You can find a pre-trained SP module here

The module is trained using a leave-one-subject-out strategy; this model is trained with 'Alireza' left out.

Publication:

Y. Huang, M. Cai, Z. Li and Y. Sato, "Predicting Gaze in Egocentric Video by Learning Task-dependent Attention Transition," European Conference on Computer Vision (ECCV), 2018. (oral presentation, acceptance rate: 2%)
[Arxiv preprint]

[CVF Open Access]

Citation

Please cite the following paper if you find this repository useful.

@inproceedings{huang2018predicting,
  title={Predicting gaze in egocentric video by learning task-dependent attention transition},
  author={Huang, Yifei and Cai, Minjie and Li, Zhenqiang and Sato, Yoichi},
  booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
  pages={754--769},
  year={2018}
}

Contact

For any questions, please contact

Yifei Huang: hyf(.at.)iis.u-tokyo.ac.jp


egocentric-gaze-prediction's Issues

How to train the fixation state predictor without training data in GTEA Gaze

I have a question about your paper.
According to your paper, you use the GTEA Gaze dataset and the GTEA Gaze Plus dataset.
But it seems that fixation/saccade information is not included in GTEA Gaze, while it is included in GTEA Gaze Plus.

So, do you train the fixation state predictor only on GTEA Gaze Plus?
I think we cannot train the fixation state predictor on the GTEA Gaze dataset.

There are some problems when building dense_flow on Windows

Hi, I tried to use this repo but I got stuck at the very first step :(
My system is Win10 and I used cmake-gui to build dense_flow. After fixing endless errors during generation, I finally got a .sln and many other files for Visual Studio.
So I opened Visual Studio 2019 to continue the build. I clicked "Rebuild" on "ALL_BUILD", but there were hundreds of errors and warnings.
It is painful to face such problems because I already spent a whole day getting cmake to generate the project, and now more problems have appeared.
So I began to wonder whether this can be built on the Windows platform at all?
Thank you :)

What is the value of the variance of 2D Gaussian?

The ground truth gaze image is generated from the gaze data by placing a 2D Gaussian at the gaze position. Put the ground truth gaze maps into path/to/gt/images.

I want to generate the ground truth gaze maps, so I need the value of the variance of the 2D Gaussian. I could not find it in your paper. Would you tell us that value?

Thank you.

Something wrong with the data paths in data/STdatas.py

In data/STdatas.py, you wrote:

if __name__ == '__main__':
    imgPath = '../gtea_imgflow'
    gtPath = '../gtea_gts'
    fixsacPath = '../fixsac'

But I guess it should be

if __name__ == '__main__':
    imgPath = '../../gtea_imgflow'
    gtPath = '../../gtea_gts'
    fixsacPath = '../fixsac'

Am I wrong?

Which directory structure is correct??

We used dense flow to generate the input files.
But we want to know which directory structure is correct.

A.
gtea_imgflow/flow/Alireza_American/flow_x_00001.jpg,flow_x_00002.jpg,…,flow_y_00001.jpg,…

or

B.
gtea_imgflow/Alireza_American/flow_x_00001.jpg,flow_x_00002.jpg,…,flow_y_00001.jpg,…

As far as I can see from your code, B is correct. But as far as I can see from your README.md, A is correct.

Thank you.

The SP module cannot be run during testing

You wrote the following in gaze_full.py:

if args.train_sp:
    sp = SP(lr=args.lr, loss_save=args.sp_save_img, save_name=args.save_sp, ...)
    sp.train()
    args.pretrained_model = os.path.join(args.save_path, args.save_sp)

But it should be

sp = SP(lr=args.lr, loss_save=args.sp_save_img, save_name=args.save_sp, ...)
if args.train_sp:
    sp.train()
    args.pretrained_model = os.path.join(args.save_path, args.save_sp)
else:
    sp.testSP()

Where are Gt^s and Gt^a saved?

I want to know where Gt^s and Gt^a will be saved.

I think Gt^a will be saved in ../new_feat.
And predicted gaze image will be generated in ../new_pred.

Where can we find Saliency map Gt^a?

Question about function extract_late in AT.py

Hi, thank you for creating this repo! I'm a little confused by the code of extract_late.

In equation 2 of your paper, you obtain the weights w_{t-1} by cropping and averaging the spatial latent representation F_{t-1} at time t-1. This spatial latent representation seems to be feature_s from line 226, and the cropping is done in lines 235-241. Lines 242-252 then implement equations 3 and 4, depending on whether frame t-1 is a fixation or not. In equation 4 you apply the new weights w_t to the spatial latent representation F_t at time t, which makes a lot of sense, but in the code feat = get_weighted(chn_weight, feature_s) the new chn_weight w_t is still used to weight the same feature_s from time t-1. Maybe I missed something? Thanks in advance for your help!
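
For reference, a minimal sketch of what a channel-weighting step like get_weighted presumably does; the function name and tensor shapes are taken from this discussion and are illustrative, not the repository's actual implementation:

    import torch

    def get_weighted(chn_weight, feature):
        # chn_weight: per-channel weights, shape (C,)
        # feature: spatial latent representation, shape (N, C, H, W)
        # Scale each channel of the feature map by its weight.
        return feature * chn_weight.view(1, -1, 1, 1)

    w_t = torch.rand(512)                   # hypothetical channel weights
    feature_s = torch.rand(1, 512, 14, 14)  # hypothetical spatial latent representation
    feat = get_weighted(w_t, feature_s)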

sigma of gaussian filter used in auc

In utils.py you have used a Gaussian filter to compute the AUC.

z = ndimage.filters.gaussian_filter(z, 14)

Is there a reason to set sigma to 14?

The number of ground truth images and the number of RGB images are not the same.

We read your dataset_preprocessing.py and understood how to generate the ground truth and fixation.txt.
But the number of ground truth images isn't the same as the number of RGB frames.

For example, Alireza_American.avi in the GTEA Gaze+ dataset has 19824 frames, but dataset_preprocessing.py generates 19847 images.

We think this causes an error in gaze_full.py. How did you solve it?

How to run testing?

We have finished training. So next, we want to test. In order to test, should we run

python3 gaze_full.py --flowPath ../gtea_imgflow_pre --imagePath ../gtea_images --fixsacPath ../fixsac --gtPath ../gtea_gts --pretrained_model save/best_SP.pth.tar --pretrained_lstm save/best_lstm.pth.tar --pretrained_late save/best_late.pth.tar --extract_lstm --extract_late

Is it correct?
In this case, will SP module work?

Pretrained models

Would it be possible to provide pretrained models from your experiments which can be used directly for evaluation? A script to use the pretrained models to generate saliency maps for a video would also be very helpful.

Thanks for putting up your code! :)

late_fusion.py does not work

late_fusion.py does not work because you cannot cat two tensors whose shapes differ. The dimensions of f and g are (10,1,14,14) and (10,1,224,224) at line 20.

I think line 19 should not be commented out.
You have to change the dimension of f from (10,1,14,14) to (10,1,224,224) by upsampling.
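
A minimal sketch of the suggested fix, assuming f and g are PyTorch tensors with the shapes mentioned above (on older PyTorch such as 0.4.0, F.upsample plays the role of F.interpolate):

    import torch
    import torch.nn.functional as F

    f = torch.rand(10, 1, 14, 14)
    g = torch.rand(10, 1, 224, 224)

    # Upsample f to g's spatial size before concatenating along the channel dim.
    f_up = F.interpolate(f, size=g.shape[-2:], mode='bilinear', align_corners=False)
    fused = torch.cat((f_up, g), dim=1)  # shape (10, 2, 224, 224)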

The number of convolution layer is not enough in models/model_SP.py

You wrote, "The SP module is a set of 5 convolution layer groups following the inverse order of VGG16 while changing all max-pooling layers into upsampling layers." in your paper. In VGG16, the final convolution layer group has three layers, so one more layer has to be added to the first convolution layer group of the decoder.

I mean that models/model_SP.py should be modified.
You wrote

self.decoder = nn.Sequential(
    nn.Conv2d(512, 512, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(512, 512, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Upsample(scale_factor=2),

But I guess it should be

self.decoder = nn.Sequential(
    nn.Conv2d(512, 512, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(512, 512, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(512, 512, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Upsample(scale_factor=2),

Thank you.

computeAAEAUC

In the function computeAAEAUC, why is it necessary to subtract 112 from the x and y indices?

r1 = np.array([predicted[0]-112, predicted[1]-112, d])
r2 = np.array([i-112, j-112, d])
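
For context, a minimal sketch of how an average angular error could be computed from two such vectors; the 112 offset matches the centre of a 224x224 map, and d would be an assumed viewing distance in pixels (this only illustrates the fragment above, it is not the repository's exact computeAAEAUC):

    import numpy as np

    def angular_error_deg(pred, gt, d, center=112.0):
        # Turn 2D map coordinates into 3D rays through the map centre,
        # with the image plane placed at distance d.
        r1 = np.array([pred[0] - center, pred[1] - center, d])
        r2 = np.array([gt[0] - center, gt[1] - center, d])
        cos = np.dot(r1, r2) / (np.linalg.norm(r1) * np.linalg.norm(r2))
        return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

    print(angular_error_deg((120, 100), (130, 98), d=300))  # hypothetical values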

Something wrong with directory structure.

According to your README.md, directory structure should be like

path/to/images
path/to/opticalflow/images/flow/ …
path/to/gt/images

But in data/dataset_preprocessing.py, in lines 80 to 85, you browse

gtea_gaze/
gtea_flows/
gtea_images/
gtea_gts/
gtea_imgflow/

We have to create these folders to run your Python file, but we do not know what kind of data should be in each folder. Would you explain the correct directory structure?

We have to delete 10 ground truth images for each video.

To train the SP module, we have to delete the files below.

gtea_gt/Alireza_American_000000.jpg,…,Alireza_American_000010.jpg
gtea_gt/Alireza_Burger_000000.jpg,…,Alireza_Burger_000010.jpg



gtea_gt/Yin_Turkey_000000.jpg,…,Yin_Turkey_000010.jpg

I recommend writing this important information in the README.md.

There is an error in running the code dataset_preprocessing.py

First of all, thank you very much for your work and code.
There are some questions I would like to ask about running the code.

Your code dataset_preprocessing.py can be used to convert the coordinate points in the ground truth .txt files into saliency maps. But in the txt files there are often two or three gaze points for the same frame; it seems your code does not determine the correspondence between gaze points and frames, so the resulting saliency maps may be incorrect.

Another question is why the following check is made in the function parsetxt (line 42)?

elif int(s[5]) not in nframe:
    if nframe[-1] + 1 == int(s[5]):
        if int(round(float(s[3]))) in range(1280) and int(round(float(s[4]))) in range(960):

I look forward to your reply. Thank you very much.

Is g_{t-1} taken from the ground truth during training?

I have a question about the channel weight extractor in the Attention Transition module.
In the channel weight extractor, we need the coordinates of the predicted gaze point in the previous frame, g_{t-1}.

In your paper, in Section 3.4, you write that g_{t-1} is the PREDICTED gaze point.
But in AT.py you use the function computeAAEAUC to obtain g_{t-1}, and it seems that you find g_{t-1} only from sample['gt'], which is the GROUND TRUTH.

So my understanding is:
During training, you use the GROUND TRUTH gaze point of the previous frame.
During testing, you use the PREDICTED gaze point of the previous frame.

Is this correct?

TypeError: expected np.ndarray (got NoneType)

Thanks for your shared code! When I followed the guidelines, I met the problem below:

Traceback (most recent call last):
  File "gaze_full.py", line 99, in <module>
    sp.train()
  File "/home/lqx/LabCode/Gaze/SP.py", line 199, in train
    loss1 = self.trainSP()
  File "/home/lqx/LabCode/Gaze/SP.py", line 125, in trainSP
    for i, sample in tqdm(enumerate(self.STTrainLoader)):
  File "/usr/local/lib/python3.5/dist-packages/tqdm/std.py", line 1178, in __iter__
    for obj in iterable:
  File "/home/lqx/.local/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 286, in __next__
    return self._process_next_batch(batch)
  File "/home/lqx/.local/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 307, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
TypeError: Traceback (most recent call last):
  File "/home/lqx/.local/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 57, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/lqx/.local/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 57, in <listcomp>
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/lqx/LabCode/Gaze/data/STdatas.py", line 68, in __getitem__
    flowarr.append(torch.from_numpy(currflowx))
TypeError: expected np.ndarray (got list).

I used Python 3.5, CUDA 8.0, torch 0.4, OpenCV 2.4.13, NumPy 1.14. It seems to be caused by a wrong NumPy version.

GPU memory

Could you tell me how much GPU memory you used for your experiments? Thanks.

And could you tell me the system and hardware information of your computer?

Hi again, do you run this project on Ubuntu or macOS?
And what kind of hardware is required to train the model successfully? My laptop has an i5-6300HQ and an NVIDIA GTX 960M with 8 GB of RAM, and I'm worried that it can't train the model. I'll be very sad if it can't run at all, but if it's just slower than a high-performance computer, that would be lucky for me.
Thank you!

Should we resize input files by ourselves?

By running dataset_preprocessing.py, we can generate the gaze maps, and they are resized to (224, 224) in dataset_preprocessing.py.

But the input image files are not resized in dataset_preprocessing.py. Do we have to resize the input image files ourselves, or will they be resized to (224, 224) automatically when running gaze_full.py?
