
Figure 2. Sequence-to-sequence model based on RNN

To describe an object's actions in a video, we build a sequence-to-sequence model as shown in Figure 2. After feature extraction, the image sequences pass through the encoder, built from LSTM layers that store information from the previous image frames and thereby support predicting the actions in the following frames. Once the sequence of image features has passed through the encoder, the decoder receives a context vector containing the characteristic information. This feature vector is combined with the decoder inputs and fed to the LSTM layers to decode the information.

2.2.2. Sequence-to-sequence model with attention

The sequence-to-sequence model encodes the whole sequence of information extracted from the image frames into a single feature vector, which loses much of the important information held in the intermediate states. To limit this, we improve the sequence-to-sequence model by combining it with an attention mechanism; specifically, we propose scaled dot-product attention for the model. The attention model is defined as in (6) [25], [28]:
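Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V                  (6)

where Q, K, and V are the query, key, and value matrices and d_k is the dimension of the keys; the 1/sqrt(d_k) scaling keeps the dot products in a range where the softmax does not saturate. This is the standard scaled dot-product attention formulation of [25].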

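To make the pipeline concrete, the following is a minimal PyTorch-style sketch of an LSTM encoder-decoder with scaled dot-product attention over the encoder outputs. The layer sizes, the single-layer LSTMs, and the choice of the decoder hidden state as the query are illustrative assumptions, not the paper's exact configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Seq2SeqWithAttention(nn.Module):
    """Illustrative LSTM encoder-decoder with scaled dot-product attention.
    All dimensions are hypothetical placeholders."""

    def __init__(self, feat_dim=2048, vocab_size=10000, hidden_dim=512, embed_dim=256):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTM(embed_dim + hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def attention(self, query, keys, values):
        # query: (batch, 1, hidden); keys/values: (batch, T, hidden)
        d_k = keys.size(-1)
        scores = torch.bmm(query, keys.transpose(1, 2)) / d_k ** 0.5  # (batch, 1, T)
        weights = F.softmax(scores, dim=-1)          # attention weights over frames
        return torch.bmm(weights, values)            # (batch, 1, hidden) context vector

    def forward(self, frame_feats, captions):
        # frame_feats: (batch, T, feat_dim) CNN features of the frame sequence
        # captions: (batch, L) token ids of the target description (teacher forcing)
        enc_out, (h, c) = self.encoder(frame_feats)   # encoder states for all frames
        emb = self.embed(captions)                    # (batch, L, embed_dim)
        state = (h, c)
        dec_h = h[-1].unsqueeze(1)                    # last encoder state as first query
        outputs = []
        for t in range(emb.size(1)):
            context = self.attention(dec_h, enc_out, enc_out)        # attend over frames
            step_in = torch.cat([emb[:, t:t+1, :], context], dim=-1) # word + context
            dec_out, state = self.decoder(step_in, state)
            dec_h = dec_out                            # next query is current decoder state
            outputs.append(self.out(dec_out))
        return torch.cat(outputs, dim=1)               # (batch, L, vocab_size) word scores

At each decoding step the current decoder state attends over all encoder outputs, so the generated description is no longer forced through a single fixed-length context vector.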