Object Detection in Deep Surveillance

Object detection is a key capability required by most computer vision and surveillance applications. Pedestrian detection in particular is a key problem in surveillance, with applications such as person identification, person counting, and tracking. The number of techniques for identifying pedestrians in images has grown steadily in recent years, alongside significant advances in state-of-the-art deep neural network-based frameworks for object detection. Research in object detection and image classification has made strides both in accuracy, in some settings exceeding 99%, and in the level of granularity. A powerful object detector, specifically designed for high-end surveillance applications, is needed that will not only position and label bounding boxes but also return their relative positions; the size of these bounding boxes can vary depending on the object and how it interacts with the physical world. To address these requirements, an extensive evaluation of state-of-the-art algorithms has been performed in this paper. The work presented here trains detectors on the MOT20 dataset using various algorithms and tests them on a custom dataset recorded on our organization's premises using an Unmanned Aerial Vehicle (UAV). The experimental analysis covers Faster-RCNN, SSD, and YOLO models. The Yolov5 model is found to outperform all the other models, with 61% precision and an F-measure of 44%.


Introduction
Surveillance is a time-consuming task that entails collecting and analyzing massive amounts of visual evidence. Data collection is a major failure point of any monitoring scheme. Closed-circuit television (CCTV) cameras can only see people within their line of sight, so criminals can work out when or where they are not being tracked. Standalone security cameras therefore have drawbacks for both prevention and tracking, and the use of UAVs for surveillance becomes essential for the reliability of the system. In recent years, computer vision technology has developed at a fast and promising pace. Part of this success may be attributed to the introduction and application of machine learning and deep learning methods, and part to the development of novel representations and models for real-world computer vision challenges, or of effective solutions to them [1], [2].
Object detection is one field that has seen tremendous progress. It is the computer vision problem of identifying and locating objects in an image frame. Widely studied areas in computer vision include person detection, face detection, and human activity recognition [3]. An object detector aims to detect all instances of one or more classes of objects regardless of size, location, pose, camera view [4], partial occlusion [5], and lighting conditions [6], [7].
Object detection is the initial task performed in many computer vision systems because it allows further information about the identified object and the scene to be obtained. Once an object instance has been identified, additional information can be gathered, such as the instance's location, its trajectory across an image sequence, and further attributes of the object, as well as the presence and placement of other objects in the scene, enabling better estimation of scene-level and contextual information.
These applications impose different requirements: processing time, robustness to occlusion, invariance to rotation, and detection under pose changes. Although several applications detect a single class of objects from a single view, some need to detect a single class from multiple views, or multiple classes from multiple views. Object detection and motion prediction are, in general, the two main building blocks of video monitoring applications. The first and most critical step is object detection, which is directly determined by the contextual information available.
Since video data in surveillance applications comprises an overwhelming amount of unnecessary and redundant data, it must be compressed in the very early stages of data-processing.
Reliable identification of persons is important in visual security systems for a wide range of applications, including suspicious incident detection [16], human motion characterization [17], traffic or crowd analysis, individual recognition, gender classification, and fall detection for the elderly [18]. Object detection has traditionally been accomplished through background subtraction, optical flow, and spatio-temporal filtering methods.
Drones mounted with cameras have been introduced and deployed in a variety of applications, such as detecting anomalies, covering blind spots, and identifying trespassers at night [19]. Automated and efficient object recognition plays an important role in perceiving and analyzing the visual data obtained from drones, and could be further extended to civilian and military sectors. Using UAVs fitted with cameras, researchers can capture aerial (bird's-eye view) photographs from a suitable altitude. From an aerial viewpoint objects appear small, and although aerial photos provide more contextual detail about the scene from a wider viewing angle, object instances are easily missed. UAV-based pedestrian detection models can be split into conventional and deep learning models [20]. Conventional detection models rely on hand-crafted feature extraction, but most such algorithms are designed for specific application environments: their extraction methods break down in complicated real-world scenes, their detection accuracy is too poor, and they are not suited to precise pedestrian detection. To complicate matters further, crowded conditions are still significantly underrepresented in current human detection benchmarks [21], [22], and the occlusion problem of distinguishing individuals in heavily populated areas is far from solved. This paper aims at constructing an autonomous pedestrian detection framework for city surveillance using a drone. State-of-the-art deep learning algorithms have been trained on the MOT20 dataset [23] and tested on custom videos captured by the drone for pedestrian detection. These algorithms help distinguish the various objects on the streets, which can further be used for detecting crimes such as harassment, as well as for video evidence of street accidents, traffic violations, etc. The designed quadcopter drone carries a camera supported by a gimbal, servos to move the rotors and control surfaces, a Wi-Fi module for handling communications, and miscellaneous components such as a long-range video transmission module, a video transmission signal repeater, and MAVLink for long-range communication between the drone and the server.
YOLO deep learning architectures were used for both detection and classification. Faster-RCNN, SSD, and YOLO algorithms were implemented in this work, along with experimental findings from various field experiments and simulations.
The paper is organized as follows: Section 2 presents related work discussing the trends, models, and challenges in pedestrian detection. Section 3 details the methodology used to train and test the object detection models. The experiments performed are described in Section 4, the corresponding results are discussed in Section 5, and Section 6 concludes the paper.

Related work
This paper addresses object detection of pedestrians in crowded scenarios. It demonstrates an object detection system in a typical crowded scenario using the state-of-the-art MOT20 dataset, in which sequences are shot from a single static camera with an elevated viewpoint. We then apply the trained models to the challenging task of detecting and localizing pedestrians from a moving camera mounted on a drone.

Pedestrian detection
Pedestrian detection is a major concern in computer vision, with numerous applications that have the potential to improve the quality of life. The range of strategies for detecting pedestrians in images has grown steadily in recent years. Nevertheless, the many datasets and evaluation protocols in use make precise comparisons difficult [24]. Publicly accessible benchmarks, the most common of which is the INRIA dataset, have stimulated interest and advancement in this field of machine vision [25]. The Caltech Pedestrian Dataset [26], which is twice as large as previously existing datasets, provides richly annotated video taken from a moving car, with complex low-resolution images and occluded persons. The CrowdHuman dataset [27] is vast, well annotated, and diverse: its train and validation subsets contain 470K individual instances, with 22.6 persons per image and various types of occlusion. Each human instance is annotated with a head bounding box, a human visible-region bounding box, and a human full-body bounding box, and baseline results of cutting-edge detection algorithms on CrowdHuman are provided. While significant progress has been achieved in pedestrian recognition [28], [29], detection in congested settings remains difficult. Conventional Non-Maximum Suppression (NMS) [30] runs into significant problems under severe pedestrian occlusion: a low intersection-over-union (IoU) threshold suppresses substantially overlapping pedestrians, whereas a larger one admits more false positives [31]. Researchers have considered pedestrian detection under various configurations, such as moving objects with a static camera and moving objects with a moving camera. The general approach for a static camera is background subtraction, whereas detecting objects while the capturing device is in motion is difficult because the background changes constantly [32].
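To make the NMS trade-off concrete, the following minimal NumPy sketch of greedy NMS (an illustration, not the implementation used in the cited works) shows how the IoU threshold decides which overlapping detections survive:

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression.

    boxes:  (N, 4) array of [x1, y1, x2, y2]
    scores: (N,) confidence scores
    Returns indices of the boxes kept.
    """
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]          # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the top box with every remaining box
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # keep only boxes overlapping the winner less than the threshold
        order = order[1:][iou < iou_threshold]
    return keep
```

With a low `iou_threshold`, a heavily occluded pedestrian whose box overlaps a neighbour's is suppressed along with genuine duplicates, which is exactly the failure mode described above.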
Convolutional networks are made up of convolution layers, pooling layers, and a final fully connected component that is used for a specific task such as classification or detection.

Figure 1 Convolutional Neural Network
The output of a convolution is a two-dimensional matrix called a feature map, generated by sliding filters over the input image.
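As an illustration (not the architecture used in this paper), a minimal tf.keras model of this kind can be sketched as follows; the input size and the two-class head are assumptions:

```python
import tensorflow as tf

# Stacked convolution and pooling layers produce feature maps; a final
# fully connected head performs the classification.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, (3, 3), activation="relu",
                           input_shape=(128, 128, 3)),  # -> 126x126x16 feature maps
    tf.keras.layers.MaxPooling2D((2, 2)),               # -> 63x63x16
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(2, activation="softmax"),     # e.g. person / background
])
model.summary()  # prints the feature-map shape after every layer
```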

Detection methods
Object detection is a method for detecting and locating objects in an image or video. Using this kind of identification and localization, object detection can count objects in a scene, determine and monitor their precise locations, and accurately label them. Two types of domain-specific image object detectors are widely used: two-stage detectors, such as Faster R-CNN [33], and single-stage detectors, such as YOLO and SSD. Single-stage detectors achieve high inference speed, whereas two-stage detectors achieve high localization accuracy and precision. SSD discretizes the output space of bounding boxes into a set of default boxes with varying aspect ratios and scales per feature map position. At prediction time, the network computes scores for each object category's presence in each default box and adjusts the box to better fit the object's shape. In addition, the network combines predictions from multiple feature maps of different resolutions to handle objects of various sizes seamlessly. Since it entirely removes proposal generation and subsequent pixel or feature resampling and encapsulates all computation in a single network, the SSD model is simple compared to methods that involve object proposals; SSD is thus easy to train and straightforward to integrate into systems that require a detection component. In Faster R-CNN, the Region Proposal Network (RPN) shares full-image convolutional features with the detection network, making region proposals nearly cost-free. An RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each location, and is trained end to end to produce high-quality region proposals that Fast R-CNN can employ for detection. Earlier object detection work repurposed classifiers to perform detection; YOLO, a modern detection method, instead frames detection as a regression problem over spatially separated bounding boxes and associated class probabilities. In a single evaluation, one neural network predicts bounding boxes and class probabilities directly from full images. Since a single network forms the entire detection pipeline, the detection output can be optimized end to end.
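As a sketch of this single-pass behaviour, the following hedged example assumes the publicly hosted SSD MobileNet v2 module on TensorFlow Hub and a local file frame.jpg; one forward pass returns boxes, classes, and scores:

```python
import tensorflow as tf
import tensorflow_hub as hub

# Load a pre-trained single-stage detector (SSD with a MobileNet v2 backbone).
detector = hub.load("https://tfhub.dev/tensorflow/ssd_mobilenet_v2/2")

# The module expects a uint8 batch of shape [1, height, width, 3].
image = tf.io.decode_jpeg(tf.io.read_file("frame.jpg"), channels=3)
result = detector(image[tf.newaxis, ...])

# A single forward pass yields boxes, class ids, and confidences.
boxes = result["detection_boxes"][0].numpy()    # normalized [ymin, xmin, ymax, xmax]
scores = result["detection_scores"][0].numpy()
classes = result["detection_classes"][0].numpy()
for box, score, cls in zip(boxes, scores, classes):
    if score > 0.5:
        print(cls, score, box)
```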

Deep learning based multi-object detection overview
Despite major developments in object recognition based on deep neural networks, detecting small and occluded targets remains a challenge. Researchers have addressed this problem with a cross-layer fusion multi-object detection and recognition algorithm based on Faster R-CNN [37].
The authors highlighted the problems and solutions for small object detection, as well as important deep learning approaches including fusing feature maps, integrating context information, balancing foreground and background examples, and producing enough positive examples. They examined four study areas: generic object detection, face detection, object detection in aerial imagery, and segmentation. In addition, the study evaluated the efficiency of several major deep learning techniques for small object detection, including YOLOv3, Faster R-CNN, and SSD, using three large benchmark datasets of small objects. Although the detection precision of these deep learning approaches on small objects was poor, below 0.4, the experimental findings indicate that Faster R-CNN performed best, followed by YOLOv3 [38].

Object detection on UAV images
Object recognition algorithms based on deep learning have quickly become a popular way to process moving images captured by drones. Growing UAV industry dynamics and interest in applications such as surveillance, visual navigation, object identification, and sensor-based obstacle avoidance offer promising prospects for deep learning.
Object detection in images captured by Unmanned Aerial Vehicles (UAVs) is a daunting challenge in computer vision, owing to the difficulty of learning a well-trained object detection model that handles instances of arbitrary orientation, varying scale, and unusual shape in UAV images. Researchers have developed a large-scale benchmark dataset, MOHR, for multi-scale object detection in high-resolution UAV images, in order to promote object detection research and expand its applications in natural scenarios using UAVs [39]. Other authors have conducted an extensive study of state-of-the-art deep learning-based object detection algorithms and discussed their contributions on low-altitude UAV datasets. That survey covers two-stage detectors such as Faster RCNN, Cascade RCNN, and R-FCN; one-stage detectors such as YOLO and its derivatives, SSD, and RetinaNet; and advanced detectors such as CornerNet [40].
Methodology

This section comprises the preliminaries for carrying out the research, including the datasets and the training algorithm. The methodology followed is shown in the flow chart and outlined below. The work has been carried out on two datasets: the models have been evaluated on the state-of-the-art MOT20 dataset as well as on a custom dataset captured by an assembled drone on the campus premises. Both datasets have been used for object detection.
The workflow comprises the following steps:
- Develop the hardware, fitted with the required sensors, and a drone capable of waypoint following.
- Apply algorithms to train the hardware through software.
- Develop a base station for image analysis (using image processing and image enhancement algorithms) to flag unusual activities, applying object detection and data classification algorithms.
- Apply machine learning algorithms such as CNNs (TensorFlow) to train on a custom real-time dataset; training need not be exhaustive because the system will continue to train itself later.
- Automate the entire process for seamless operation of the drone throughout its lifetime.

MOT20 dataset
The objective of the MOT Challenge is to provide a consistent evaluation of different object tracking systems. Because pedestrians are well studied in the tracking field and their precise tracking and detection have great practical significance, the challenge focuses on tracking multiple persons. MOT15 [41], MOT16 [42], and MOT17 have all made significant contributions since their first release by offering clean datasets and a precise methodology for benchmarking multi-object trackers. The MOT20 dataset [43] has eight sequences, half of which are used for training and the other half for testing.

Custom drone-based testing dataset
The inferences have been recorded on a custom dataset created using the drone. The drone is a quadcopter integrated with a Wi-Fi-enabled SJ4000 action camera supported and controlled by a gimbal, a GPS device, a Pixhawk flight controller, a telemetry module, and a 5.8G 48CH transmitter for long-range communication. Sample images from the recorded sequence are given in Figure 3.

Training algorithm
The training process for both datasets has been kept the same, as depicted in Figure 4 and detailed in the following algorithm. YOLO is short for "You Only Look Once". YOLO is a state-of-the-art object detection algorithm; owing to its fast computation and strong results, it has become a standard way of detecting objects in computer vision. Researchers previously used sliding-window object detection; faster approaches were then developed, such as R-CNN [6], Fast R-CNN [7], Faster R-CNN [8], and Cascade R-CNN [9]. In 2016, YOLO was introduced and outperformed the previous object detection algorithms [10]. For each cell in the feature map, YOLOv3 proposes three bounding boxes. A cell predicts an object through one of these bounding boxes if the object's center falls within its receptive field.
When YOLOv3 is trained, each object is assigned a ground-truth bounding box; it is therefore important to identify the cell to which that bounding box belongs.
Each predicted bounding box has the following attributes: center coordinates, predicted width and height, an objectness score, and a list of confidences for each class.
The next stage is to compute, for each of the three predicted boxes in a cell, the probability that the box contains a given class: the objectness score is multiplied element-wise with the class confidence list, and the class with the greatest probability is selected. The center coordinates, width, and height of the bounding box are b_x, b_y, b_w, b_h, while t_x, t_y, t_w, t_h are the network outputs after training:

b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = p_w · e^{t_w}
b_h = p_h · e^{t_h}

where (c_x, c_y) is the offset of the grid cell from the top-left corner and (p_w, p_h) are the dimensions of the anchor box. YOLOv3 predicts offsets relative to anchors because doing so helps avoid unstable gradients during training.
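A minimal sketch of this decoding step, with grid-cell offsets and anchor sizes expressed in grid units (the example values are illustrative assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_yolo_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Decode raw YOLOv3 outputs into a bounding box.

    (cx, cy): top-left offset of the grid cell
    (pw, ph): width/height of the anchor (prior) box
    """
    bx = sigmoid(tx) + cx          # center x, constrained to the cell
    by = sigmoid(ty) + cy          # center y
    bw = pw * np.exp(tw)           # width as a scaling of the anchor
    bh = ph * np.exp(th)           # height
    return bx, by, bw, bh

# Example: cell (3, 4) of the grid, anchor of size 2.5 x 3.0 (grid units)
print(decode_yolo_box(0.2, -0.1, 0.5, 0.3, cx=3, cy=4, pw=2.5, ph=3.0))
```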
The objectness score indicates the likelihood that a given cell is a center cell responsible for predicting a single object, with the proper bounding box containing the object within.
The objectness score represents the probability that an object is present inside the bounding box, while the probability of the detected object belonging to a certain class is represented by the class confidences. The objectness score P_o can be written as:

P_o = P_object × IoU(pred, truth)

where IoU denotes the intersection over union of the predicted and ground-truth bounding boxes, and P_object is the predicted probability that the bounding box contains an object. The network's raw output is passed through a sigmoid function, which returns values between 0 and 1.
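The two quantities in this equation are easy to compute directly; the following sketch (with hypothetical corner-format boxes) computes the IoU and the resulting objectness target:

```python
def iou(box_a, box_b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def objectness_target(p_object, pred_box, truth_box):
    # Po = P(object) * IoU(pred, truth); the network's raw score is
    # separately squashed by a sigmoid so predictions stay in (0, 1).
    return p_object * iou(pred_box, truth_box)

print(objectness_target(0.9, [0, 0, 10, 10], [1, 1, 11, 11]))
```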
In this study, YOLOv5 is used for multi-object detection. It is the latest version of the YOLO family of object detection algorithms at the time of writing; it beats the previous YOLO versions and has set a very high benchmark for object detection.

SSD
SSD is the abbreviation for Single Shot MultiBox Detector. "Single shot" means that tasks such as object localization and classification are done in a single forward pass of the network. SSD was introduced in 2016 and trained on datasets such as COCO and PascalVOC, on which it achieved a promising score of 74% mAP (mean Average Precision) at 59 frames per second. In the Single Shot Detector, the network itself classifies the detected objects. In many respects SSD works like the YOLO algorithm: it takes only one shot to detect the multiple objects present in an image using MultiBox. It does not use a sliding window; instead it divides the image into grid cells, and each grid cell is responsible for detecting objects in its region, where detection means locating an object and predicting its class.

Figure 6 SSD architecture [44]

The design of SSD, shown in Figure 6, is based largely on the VGG-16 architecture, but it discards the fully connected layers. VGG-16 is employed because it performs very well on high-quality image classification and is popular in transfer learning settings. A set of additional convolutional layers is appended to extract features at multiple scales and to progressively reduce the input size to subsequent layers.
MultiBox is a technique for bounding box regression: it proposes class-agnostic bounding box coordinates. MultiBox employs an Inception-style convolutional network in which 1x1 convolutions assist in dimensionality reduction while the width and height remain the same. The MultiBox loss contains two components that made their way into SSD: confidence loss and location loss. Confidence loss measures how confident the network is that a box bounds an object correctly, while location loss measures the distance between the predicted bounding box and the ground truth of the object.
Alpha (α) is a parameter that balances the location loss against the confidence loss; its value is chosen to reduce the loss function most effectively.
In MultiBox, researchers introduced priors (called anchors in Faster-RCNN). These priors help manage the intersection over union (IoU) ratio by providing pre-computed bounding boxes that closely match the ground-truth boxes. The IoU between a prior and the ground truth should exceed 0.5, and priors play an important role in achieving this: they replace predictions that would otherwise start from random coordinates.
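A minimal sketch of prior generation, following the default-box construction described in the SSD paper (the feature-map size, scale, and aspect ratios here are illustrative assumptions):

```python
import itertools
import math

def ssd_priors(feature_map_size, scale, aspect_ratios=(1.0, 2.0, 0.5)):
    """Generate SSD-style default boxes (cx, cy, w, h), normalized to [0, 1].

    One box per aspect ratio is centred on every cell of the feature map;
    scale is the box size relative to the input image.
    """
    priors = []
    f = feature_map_size
    for i, j in itertools.product(range(f), repeat=2):
        cx, cy = (j + 0.5) / f, (i + 0.5) / f   # centre of the cell
        for ar in aspect_ratios:
            w = scale * math.sqrt(ar)
            h = scale / math.sqrt(ar)
            priors.append((cx, cy, w, h))
    return priors

print(len(ssd_priors(feature_map_size=8, scale=0.2)))  # 8*8*3 = 192 priors
```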
Combining the two loss components, the overall SSD objective is:

L(x, c, l, g) = (1/N) (L_conf(x, c) + α · L_loc(x, l, g))    (9)

where N is the number of matched default boxes. In recent years, further refinements have made SSD a stronger performer: first, the priors in SSD are fixed and chosen manually; second, an L1 norm is now used to calculate the location loss, which, while less precise than the L2 norm, gives promising results; third, classification is built in, so for each bounding box a class prediction is produced.
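A small sketch of equation (9), using the smooth-L1 location loss common in SSD implementations (the helper names and toy values are hypothetical):

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1: behaves like L2 near zero and like L1 for large errors."""
    x = np.abs(x)
    return np.where(x < 1.0, 0.5 * x ** 2, x - 0.5)

def ssd_loss(conf_loss, loc_pred, loc_target, num_matched, alpha=1.0):
    """L = (1/N) * (L_conf + alpha * L_loc), per equation (9)."""
    if num_matched == 0:
        return 0.0
    loc_loss = smooth_l1(loc_pred - loc_target).sum()
    return (conf_loss + alpha * loc_loss) / num_matched

# Toy check: 4 matched boxes and a pre-computed confidence loss of 2.0
pred = np.array([[0.1, 0.2, 0.0, -0.1]] * 4)
target = np.zeros((4, 4))
print(ssd_loss(2.0, pred, target, num_matched=4))
```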

Faster RCNN
Faster RCNN, represented in Figure 7, is a convolutional neural network-based object detection architecture developed by Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun in 2015. Faster RCNN is made up of three stages. The first stage is the convolution layers, in which filters are trained to extract relevant features from an image; for example, to obtain important features of a human face, the filters learn shapes and colors that only appear in human faces.

Figure 7 Faster RCNN architecture [45]

The identified features continue to the next stage, the Region Proposal Network (RPN). The RPN is a small neural network that slides over the final feature map of the convolution layers to predict the presence of an object and its bounding box. Its output is a collection of boxes/proposals that will be examined by a classifier and a regressor to check for the presence of objects: the RPN predicts whether an anchor belongs to the background or foreground, then refines the anchor accordingly. The final stage predicts classes and bounding boxes; it comprises another fully connected neural network that takes the regions proposed by the RPN as input and predicts object classes and bounding box refinements. The RPN's total loss combines the classification and regression losses:

L({p_i}, {t_i}) = (1/N_cls) Σ_i L_cls(p_i, p_i*) + λ (1/N_reg) Σ_i p_i* · L_reg(t_i, t_i*)

where p_i is the predicted objectness probability of anchor i, p_i* its ground-truth label, t_i and t_i* the predicted and target box parameters, L_reg a smooth L1 regression loss active only for positive anchors, and λ a balancing weight.
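A minimal NumPy sketch of this multi-task loss (a simplified illustration: the regression term is normalized here by the number of positive anchors rather than N_reg, and λ simply balances the two terms):

```python
import numpy as np

def rpn_loss(p, p_star, t, t_star, lam=1.0):
    """RPN loss: classification over all anchors plus regression on positives.

    p:      (N,) predicted objectness probabilities per anchor
    p_star: (N,) ground-truth labels (1 = foreground, 0 = background)
    t, t_star: (N, 4) predicted / target box regression parameters
    """
    eps = 1e-7
    # binary cross-entropy (log loss) over every anchor
    l_cls = -(p_star * np.log(p + eps)
              + (1 - p_star) * np.log(1 - p + eps)).mean()
    # smooth-L1 regression, counted only for foreground anchors
    d = np.abs(t - t_star)
    sl1 = np.where(d < 1.0, 0.5 * d ** 2, d - 0.5).sum(axis=1)
    l_reg = (p_star * sl1).sum() / max(p_star.sum(), 1.0)
    return l_cls + lam * l_reg

p = np.array([0.9, 0.2, 0.7])
p_star = np.array([1.0, 0.0, 1.0])
t = np.random.randn(3, 4) * 0.1
print(rpn_loss(p, p_star, t, np.zeros((3, 4))))
```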

Experimental Setup
For the experimental analysis, the MOT20 dataset was used, and various models were applied to it starting from pre-trained weights. The system specifications were a T4 GPU with 15 GB memory, 13 GB RAM, 25 GB disk space, and a dual-core 2.2 GHz Xeon CPU.
The experiments were conducted using the TensorFlow custom object detection API.

Yolov5
Yolov5 provides simple functionality for test-time augmentation (TTA), model ensembling, hyperparameter evolution, and export to ONNX, CoreML, and TFLite, and is a family of compound-scaled object detection models trained on the COCO dataset. Yolov5 was custom-trained on the MOT20 dataset, giving a precision of 61% and a recall of 35%. The mAP value evaluated for the model is 0.09, and the F-measure is calculated as 0.44. The metrics can be visualized in Figure 12.

Figure 12
Yolov5 results on the state-of-the-art MOT20 dataset

Modified Yolov5
The Yolov5 model was modified for experimental purposes by tuning hyperparameters, setting the learning rate to 1 × 10⁻⁵ and the beta value to 0.9. However, this slightly degraded the performance of Yolov5. The results shown in Figure 13 give precision and recall values of 44% and 11.93% respectively; the mAP value was observed to be 0.12, and the F-measure is calculated as 19%.
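The reported F-measures follow directly from the harmonic mean of precision and recall; a quick check:

```python
def f_measure(precision, recall):
    """F1 score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(round(f_measure(0.61, 0.35), 2))    # Yolov5:          0.44
print(round(f_measure(0.44, 0.1193), 2))  # modified Yolov5: 0.19
```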
Thus, from the observed F1 scores, it can be stated that the Yolov5 model trained on the MOT20 dataset outperformed the rest of the models with an F1 score of 0.44, followed by SSD, Faster RCNN, and the modified Yolov5. The comparative analysis of all the above figures is depicted in Table 1.
Excerpts from the tested footage of the custom dataset are given below in Figure 14, showing precise person detection.

Figure 14
Sample shots from the test case run on custom drone-based dataset showing person detection

Conclusion
Deep learning methods enable UAV-based surveillance systems to provide real-time video surveillance, help amass useful crime evidence, and help prevent and reduce the chances of theft. They make it possible to automate the monitoring of high-risk regions with precision and to provide coverage even in remote regions where human surveillance is demanding. Pedestrian detection, which offers vital information for the semantics of video evidence, is one of the most significant and critical tasks of any intelligent video surveillance system, and thus one of the most important categories of object detection. Through this paper, an effort has been made to build an understanding of deep learning methods for pedestrian detection. The work presented here evaluates the performance of state-of-the-art deep learning models for object detection in a surveillance setting. The models were trained on the MOT20 dataset and tested on a custom dataset. It was observed that Yolov5 outperformed all the other models with the highest F1 score of 0.44, while the best mAP value of 0.48 was obtained by the SSD MobileNet model.