pandla-vijay / video-captioning-using-spatio-temporal-features-and-gaussian-attention

This project utilizes advanced deep learning techniques to automatically generate contextually relevant captions for videos by extracting spatial and temporal features, while incorporating Gaussian attention to focus on important regions. This enhances video indexing, retrieval, and accessibility for visually impaired individuals.

Language: Jupyter Notebook (100%)
Topics: gru, lstm, msvd, spatio-temporal-data, video-captioning

Video-Captioning-using-Spatio-temporal-features-and-Gaussian-Attention

Video captioning is a challenging task in the domain of computer vision and natural language processing that aims to automatically generate descriptive textual representations for video content. The incorporation of spatio-temporal features allows the model to capture both spatial information related to individual frames and temporal dynamics across the video sequence. Coupled with the power of attention mechanisms, such as Gaussian Attention, the model can effectively focus on salient regions in the video, further enhancing the quality and relevance of generated captions.
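
To make the idea of Gaussian attention concrete, here is a minimal sketch of one common formulation applied along the time axis: each frame gets a weight from a Gaussian centered on a position of interest, and the frame features are averaged with those weights. This is an illustration under assumptions, not the project's notebook code. The function names, the `center` and `width` parameters, and the choice of attending over frames rather than spatial regions are all hypothetical; in a trained model the center and width would typically be predicted from the decoder state rather than passed in by hand.

```python
import math


def gaussian_attention_weights(num_frames, center, width):
    """Return normalized Gaussian weights over frame indices 0..num_frames-1.

    `center` and `width` are hypothetical parameters (mean and standard
    deviation of the Gaussian, in units of frame index).
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    if width <= 0:
        raise ValueError("width must be positive")
    # Work with log-scores and subtract the maximum so that at least one
    # score equals 1.0 and the normalizer can never underflow to zero.
    logits = [-((t - center) ** 2) / (2.0 * width ** 2) for t in range(num_frames)]
    peak = max(logits)
    scores = [math.exp(v - peak) for v in logits]
    total = sum(scores)
    return [s / total for s in scores]


def attend(frame_features, center, width):
    """Weighted average of per-frame feature vectors under Gaussian attention.

    `frame_features` is a list of equal-length lists, one per frame.
    Returns (context_vector, weights).
    """
    weights = gaussian_attention_weights(len(frame_features), center, width)
    dim = len(frame_features[0])
    context = [
        sum(w * frame[d] for w, frame in zip(weights, frame_features))
        for d in range(dim)
    ]
    return context, weights


if __name__ == "__main__":
    # Five toy frames with 2-dimensional features.
    features = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0], [4.0, 1.0]]
    context, weights = attend(features, center=2.0, width=1.0)
    print("weights:", [round(w, 3) for w in weights])
    print("context:", [round(c, 3) for c in context])
```

A small `width` concentrates the caption decoder on a few frames around `center`, while a large `width` approaches a plain average over the whole clip.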

In this project, we explore the application of Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks to video captioning, leveraging spatio-temporal features and Gaussian Attention. We employ the widely used MSVD (Microsoft Research Video Description) dataset, containing a diverse range of videos with corresponding human-generated captions, to train and evaluate our model.

Our objective is to develop a robust and contextually-aware video captioning system that can automatically generate meaningful and accurate descriptions, effectively bridging the gap between visual content and natural language, with potential applications in video indexing, retrieval, and accessibility for visually impaired individuals. Through this project, we aim to contribute to the growing field of video understanding and multimodal AI research, while fostering advancements in human-computer interaction and multimedia analysis.

For extracting the features

  1. Download the original MSVD dataset from here
  2. For the already extracted features use this

Results

After training the LSTM and GRU models on spatio-temporal features, we used the learned weights to generate captions for new, unseen videos. These weights determine how much attention is given to different regions and frames within a video, so that captions are driven by the most relevant visual cues. We evaluated both models with the BLEU (1- to 4-gram), METEOR, ROUGE_L, and CIDEr metrics listed in the evaluation table. Of the two, the GRU model with Gaussian attention achieved the higher METEOR score, 0.304 versus 0.206 for the LSTM model, and it outperformed the LSTM model on every other reported metric as well.

Evaluation table

Model         Bleu_1  Bleu_2  Bleu_3  Bleu_4  METEOR  ROUGE_L  CIDEr
LSTM + GAUSS  0.641   0.438   0.330   0.241   0.206   0.568    0.302
GRU + GAUSS   0.782   0.652   0.537   0.425   0.304   0.684    0.647
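
The gap between the two rows can be read off metric by metric. The short script below is only a convenience sketch: the scores are copied from the table, while the names `METRICS`, `SCORES`, and `score_gaps` are introduced here for illustration and do not come from the project.

```python
# Scores copied from the evaluation table; the names are illustrative only.
METRICS = ["Bleu_1", "Bleu_2", "Bleu_3", "Bleu_4", "METEOR", "ROUGE_L", "CIDEr"]
SCORES = {
    "LSTM + GAUSS": [0.641, 0.438, 0.330, 0.241, 0.206, 0.568, 0.302],
    "GRU + GAUSS": [0.782, 0.652, 0.537, 0.425, 0.304, 0.684, 0.647],
}


def score_gaps(baseline, candidate):
    """Return {metric: (absolute gap, relative gap)} of candidate over baseline."""
    gaps = {}
    for name, base, cand in zip(METRICS, SCORES[baseline], SCORES[candidate]):
        gaps[name] = (cand - base, (cand - base) / base)
    return gaps


if __name__ == "__main__":
    gaps = score_gaps("LSTM + GAUSS", "GRU + GAUSS")
    for name, (abs_gap, rel_gap) in gaps.items():
        print(f"{name:8s} {abs_gap:+.3f} ({rel_gap:+.1%})")
```

Relative gaps are easier to compare across metrics than absolute ones, since the metrics are on different scales.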
