[This article belongs to Volume - 58, Issue - 01, 2026]
Gongcheng Kexue Yu Jishu/Advanced Engineering Science
Journal ID : AES-27-03-2026-92

Title : MULTI-ATTENTION TRANSFORMER FRAMEWORK FOR HUMAN ACTION RECOGNITION
T Mita Kumari, Abhimanyu Sahu and Dinesh Kumar Dash

Abstract :

Human Action Recognition (HAR) is an important subfield of computer vision, motivated by its broad applications in intelligent surveillance, healthcare, sports, and human-computer interaction. Although convolutional and recurrent neural networks have brought substantial improvements, current methods often struggle to capture long-range temporal structure and intricate spatio-temporal relationships in video data. This paper addresses these shortcomings by proposing a Multi-Attention Transformer Framework for effective and robust human action recognition. The proposed model combines several attention mechanisms, namely spatial attention, temporal attention, and channel attention, within a transformer-based architecture to learn better feature representations and capture global contextual dependencies. A hybrid feature extraction approach is used: convolutional layers encode local spatial features, and transformer encoders model long-range temporal dependencies. The framework is designed to handle occlusions, viewpoint variations, and complex movements. The proposed approach is evaluated experimentally on benchmark datasets and is shown to outperform traditional CNN-based, RNN-based, and standard transformer models in terms of accuracy, precision, and computational efficiency. The multi-attention modules substantially improve the discriminative power of the model without compromising its scalability to real-world settings. The results demonstrate the usefulness of multi-attention transformer architectures for advancing the state of the art in human action recognition. The proposed framework can support the development of more advanced and robust HAR systems and is a step toward further research in multimodal and real-time action recognition.
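
The abstract does not give layer-level details, so the following is only a minimal NumPy sketch of how channel, spatial, and temporal attention could be chained over per-frame convolutional features. It is not the authors' implementation. All function names, tensor shapes, and weight matrices are assumptions made for illustration. The input `feats` stands in for the output of a CNN backbone, and the sketch omits multi-head attention, positional encodings, residual connections, layer normalization, and training.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feats, w1, w2):
    # feats: (T, C, H, W). Average over space, then gate each channel
    # through a small bottleneck (assumed squeeze-and-excitation style).
    squeezed = feats.mean(axis=(2, 3))                      # (T, C)
    gates = sigmoid(np.maximum(squeezed @ w1, 0.0) @ w2)    # (T, C)
    return feats * gates[:, :, None, None]

def spatial_attention(feats):
    # Weight each spatial position by a softmax over a channel-averaged
    # saliency map, then pool each frame into one token.
    T, C, H, W = feats.shape
    saliency = feats.mean(axis=1).reshape(T, H * W)         # (T, H*W)
    weights = softmax(saliency, axis=-1).reshape(T, 1, H, W)
    return (feats * weights).sum(axis=(2, 3))               # (T, C)

def temporal_attention(tokens, wq, wk, wv):
    # Single-head scaled dot-product self-attention across frames.
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv         # (T, D) each
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1) # (T, T)
    return attn @ v, attn

def init_params(C, D, n_classes, rng):
    # Hypothetical random weights; a real model would learn these.
    shapes = {"w1": (C, C // 4), "w2": (C // 4, C), "wq": (C, D),
              "wk": (C, D), "wv": (C, D), "w_out": (D, n_classes)}
    return {name: 0.1 * rng.normal(size=s) for name, s in shapes.items()}

def classify_clip(feats, params):
    # Channel attention -> spatial attention -> temporal attention -> classifier.
    x = channel_attention(feats, params["w1"], params["w2"])
    tokens = spatial_attention(x)
    ctx, attn = temporal_attention(tokens, params["wq"], params["wk"], params["wv"])
    logits = ctx.mean(axis=0) @ params["w_out"]
    return softmax(logits), attn

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(8, 16, 7, 7))   # 8 frames, 16 channels, 7x7 maps
    probs, attn = classify_clip(feats, init_params(16, 32, 5, rng))
    print(probs.shape, attn.shape)
```

In this sketch, each row of the returned `attn` matrix shows how strongly one frame attends to every other frame in the clip, which is the mechanism a transformer encoder uses to relate distant frames.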