Video understanding identifies and classifies actions and events in video. Many previous works, such as video annotation, have shown promising results in general video understanding; however, producing a fine-grained summary of human activities and their interactions with state-of-the-art video captioning techniques remains difficult. A comprehensive account of individual actions and collective behaviors is valuable for real-time CCTV monitoring, medical treatment, sports video analysis, and similar applications. This research proposes a form of video understanding that focuses primarily on recognizing group activity by learning pairwise similarities between actors' appearances. To measure the similarity between pairs of actor appearances and construct an actor relation graph, the Zero-Mean Normalized Cross-Correlation (ZNCC) and the Zero-Mean Sum of Absolute Differences (ZSAD) are proposed, allowing a graph convolutional network (GCN) to learn to distinguish group actions. We adopt MNASNet as the backbone to extract features from each video frame. A visualization model is also developed that renders every input video frame and annotates the predicted individual actions or collective activity with bounding boxes on each detected person.
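The two pairwise similarity measures named above can be sketched as follows. This is a minimal illustration of the standard ZNCC and ZSAD definitions applied to per-actor feature vectors, plus a hypothetical `relation_graph` helper showing how such scores could populate an actor relation matrix for a GCN; the function names and graph construction are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def zncc(a: np.ndarray, b: np.ndarray) -> float:
    """Zero-Mean Normalized Cross-Correlation of two feature vectors.

    Returns a similarity in [-1, 1]; higher means more alike.
    """
    a = a - a.mean()
    b = b - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def zsad(a: np.ndarray, b: np.ndarray) -> float:
    """Zero-Mean Sum of Absolute Differences; lower means more alike."""
    return float(np.abs((a - a.mean()) - (b - b.mean())).sum())

def relation_graph(features, measure=zncc) -> np.ndarray:
    """Hypothetical helper: build an N x N actor relation matrix from
    N per-actor feature vectors using the chosen similarity measure."""
    n = len(features)
    g = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            g[i, j] = measure(features[i], features[j])
    return g
```

Note that ZNCC is invariant to both the mean and the scale of each vector, while ZSAD is invariant only to the mean, so the relation matrix would typically be normalized (e.g. row-softmaxed) before being used as a GCN adjacency.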
Figure 1
Figure 2
Figure 3
Posted 18 Mar, 2021