Violence can happen anywhere. One way to detect and monitor violence
in a given place is to install closed-circuit television (CCTV), and
the recorded video can be used as evidence in a court of law. Violence
video classification is also an active topic in deep learning. The
most recent violence video dataset is RWF-2000, which contains 2,000
violent and non-violent videos, each 5 seconds long at 30 frames per
second.
The publication that introduced the dataset reports a best accuracy
of 87.25% with its proposed method. In this study, we use a Residual
Network, which is known to mitigate the vanishing gradient problem.
In addition, we apply transfer learning from models pre-trained on
Kinetics and on Kinetics + Moments in Time. We also vary the number
of sampled frames and the location of the sampled frame range. RGB and
optical flow inputs are trained separately with different configurations.
The best accuracy with RGB input is 89.25%, obtained with Kinetics +
Moments in Time pre-training and frames 49-149. The best accuracy with
optical flow input is 88.5%, obtained with Kinetics pre-training and
74 frames. Summing the outputs of the two models raises the accuracy
to 90.5%.
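
As a rough illustration of two steps mentioned above, the following
minimal sketch selects a frame range from a 150-frame clip (5 seconds
at 30 frames per second) and fuses the RGB and optical flow predictions
by summing their per-class scores. It is not the authors' code. The
function names, the inclusive 0-based reading of the range 49-149, the
class order, and the score values are all assumptions made for the
example, and whether the summed outputs are logits or probabilities is
not stated here.

```python
import numpy as np

FRAMES_PER_CLIP = 5 * 30  # 5 seconds at 30 frames per second = 150 frames


def select_frame_range(clip, start, end):
    """Return frames start..end (inclusive, 0-based; this convention is assumed)."""
    if not 0 <= start <= end < clip.shape[0]:
        raise ValueError("frame range lies outside the clip")
    return clip[start:end + 1]


def fuse_by_sum(rgb_scores, flow_scores):
    """Sum the per-class scores of both streams and return the predicted class index."""
    fused = np.asarray(rgb_scores) + np.asarray(flow_scores)
    return np.argmax(fused, axis=-1)


# Dummy clip: 150 frames of 4x4 RGB pixels (tiny size chosen only for illustration).
clip = np.zeros((FRAMES_PER_CLIP, 4, 4, 3), dtype=np.float32)
window = select_frame_range(clip, 49, 149)

# Made-up scores for one video; the class order [non-violent, violent] is hypothetical.
rgb_scores = np.array([0.55, 0.45])
flow_scores = np.array([0.30, 0.70])
prediction = fuse_by_sum(rgb_scores, flow_scores)
```

In this made-up example the RGB scores alone favor the first class,
while the summed scores favor the second, which shows how the two
streams can correct each other.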