Group Activity Recognition Method Based on Attention Mechanism
  • English Title: Group Activity Recognition Method Based on Attention Mechanism
  • Authors: Wang Chuanxu; Gong Yuting
  • Affiliation: School of Information Science and Technology, Qingdao University of Science and Technology
  • Keywords: group activity; image processing; attention mechanism; behavior recognition
  • Journal: Journal of Data Acquisition and Processing (数据采集与处理)
  • Journal Abbreviation: SJCJ
  • Publication Date: 2019-05-15
  • Year: 2019
  • Volume/Issue: v.34, No.155 (Issue 03)
  • Pages: 38-45 (8 pages)
  • Article ID: SJCJ201903004
  • CN: 32-1367/TN
  • Funding: National Natural Science Foundation of China (61472196, 61672305)
  • Language: Chinese
Abstract
In video-based group activity recognition, traditional deep learning methods generally process convolutional features with standard (max/average) pooling and do not account for the importance of key individuals to the classification of the group activity. To address this, we propose an attention-based model for detecting behavior in group activity videos. The model focuses on the key people in the activity and pools convolutional features dynamically according to the attention weights assigned to each person, so that the group activity in the video is identified correctly. On the group activity datasets CAD (Collective Activity Dataset) and CAE (Collective Activity Extended Dataset), the recognition accuracy of our model surpasses that of many existing models that use standard pooling structures.
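The core idea in the abstract — replacing max/average pooling with a weighted sum driven by attention over the people in the scene — can be illustrated with a minimal NumPy sketch. This is not the paper's actual architecture (which learns attention parameters jointly with CNN/recurrent features); the scoring vector `w`, the function names, and the toy features below are purely illustrative assumptions.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_pool(person_features, w):
    """Pool per-person features by attention weight instead of max/mean.

    person_features: (N, D) array, one D-dim convolutional feature per person.
    w: (D,) scoring vector standing in for the learned attention parameters.
    Returns the attention-weighted group feature (D,) and the weights (N,).
    """
    scores = person_features @ w        # relevance score for each person
    alpha = softmax(scores)             # attention weights, summing to 1
    pooled = alpha @ person_features    # weighted sum replaces max/avg pooling
    return pooled, alpha

# Toy example: three people; the third has a much higher relevance score,
# so the pooled group feature is dominated by that "key person".
feats = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [5.0, 5.0]])
w = np.array([1.0, 1.0])
pooled, alpha = attention_pool(feats, w)
```

Unlike max pooling (which keeps only the single strongest activation per dimension) or average pooling (which treats every person equally), the attention weights let the model emphasize the key person while still blending in context from the others.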
References
[1] Simonyan K, Zisserman A. Two-stream convolutional networks for action recognition in videos[C]//Advances in Neural Information Processing Systems. Montreal, Canada: Curran Associates, 2014: 568-576.
[2] Ibrahim M S, Muralidharan S, Deng Zhiwei, et al. A hierarchical deep temporal model for group activity recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE, 2016: 1971-1980.
[3] Lan T, Sigal L, Mori G. Social roles in hierarchical models for human activity recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Providence, USA: IEEE, 2012: 1354-1361.
[4] Ramanathan V, Yao B, Li F F. Social role discovery in human events[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Portland, USA: IEEE, 2013: 2475-2482.
[5] Hajimirsadeghi H, Yan W, Vahdat A, et al. Visual recognition by counting instances: A multi-instance cardinality potential kernel[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE, 2015: 2596-2605.
[6] Wang M, Ni B, Yang X. Recurrent modeling of interaction context for collective activity recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, USA: IEEE, 2017: 2-8.
[7] Li X, Chuah M C. SBGAR: Semantics based group activity recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE, 2017: 2876-2885.
[8] Choi W, Savarese S. A unified framework for multi-target tracking and collective activity recognition[C]//European Conference on Computer Vision. Firenze, Italy: Springer-Verlag, 2012: 215-230.
[9] Choi W, Shahid K, Savarese S. Learning context for collective activity recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Colorado Springs, USA: IEEE, 2011: 3273-3280.
[10] Rensink R A. The dynamic representation of scenes[J]. Visual Cognition, 2000, 7(1/3): 17-42.
[11] Xu K, Ba J, Kiros R, et al. Show, attend and tell: Neural image caption generation with visual attention[C]//Proceedings of the 32nd International Conference on Machine Learning. Lille, France: IMLS, 2015: 2048-2057.
[12] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[C]//Advances in Neural Information Processing Systems. Long Beach, USA: Curran Associates, 2017: 5998-6008.
[13] Zaremba W, Sutskever I, Vinyals O. Recurrent neural network regularization[EB/OL]. https://arxiv.org/abs/1409.2329, 2014.
[14] Bahdanau D, Cho K, Bengio Y. Neural machine translation by jointly learning to align and translate[C]//International Conference on Learning Representations. San Diego, USA: ICLR, 2015: 1-15.
[15] Srivastava N, Hinton G E, Krizhevsky A, et al. Dropout: A simple way to prevent neural networks from overfitting[J]. Journal of Machine Learning Research, 2014, 15(1): 1929-1958.
[16] Kingma D P, Ba J. Adam: A method for stochastic optimization[C]//International Conference on Learning Representations. San Diego, USA: ICLR, 2015: 1-13.
[17] Choi W, Shahid K, Savarese S. What are they doing?: Collective activity classification using spatio-temporal relationship among people[C]//IEEE International Conference on Computer Vision Workshops. Kyoto, Japan: IEEE, 2009: 1282-1289.
[18] Kaneko T, Shimosaka M, Odashima S, et al. A fully connected model for consistent collective activity recognition in videos[J]. Pattern Recognition Letters, 2014, 43(1): 109-118.
[19] Azar S M, Atigh M G, Nickabadi A. A multi stream convolutional neural network framework for group activity recognition[EB/OL]. https://arxiv.org/abs/1812.10328, 2018.
[20] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition[C]//International Conference on Learning Representations. San Diego, USA: ICLR, 2015: 1-14.
[21] Donahue J, Hendricks L A, Guadarrama S, et al. Long-term recurrent convolutional networks for visual recognition and description[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(4): 677-691.