DEEP VISION BASED SURVEILLANCE SYSTEM TO PREVENT TRAIN-ELEPHANT COLLISIONS

Abstract: Animal conservation is imperative, and technology has been applied to it in many ways. The endangered status of species such as the tiger and the elephant has raised the need for such efforts. Human-Elephant Collision (HEC) has been an active area of research, but an optimal solution has not yet been found. As trains are a widely used mode of transport in Asian countries, rail tracks are laid even through forest areas and hence intrude on wildlife. Elephants, due to their bulky size, often become victims of trains. Such tragedies are common, especially in the green belts of the southern zones of India. To address the problem, we propose a deep vision-based model that identifies elephants near the track using video cameras installed on site. Four different models are proposed for the identification of elephants in images/video: one novel lightweight CNN-based model, and three Transfer Learning (TL) models, i.e., ResNet50, MobileNet, and Inception V3, which have been tuned for elephant detection. These highly accurate and precise models can alert trains and hence save precious lives.


INTRODUCTION
Animal mortality is becoming a major concern worldwide, as it disturbs the ecological balance and, in certain cases, endangers entire species. The International Union for Conservation of Nature (IUCN) has already given endangered status to the Indian elephant. In India in particular, human life is intrinsically entangled with the elephant. Whether in culture, mythology, or Hindu custom, the elephant plays a major role and is considered a sacred animal as well as a symbol of intellectual strength. In Hinduism, Lord Ganesh has an elephant head and is regarded as the remover of obstacles in life. Hence, every auspicious work usually starts with a prayer to the elephant-faced Ganesh. In contrast, the life of elephants at present is full of struggle. Due to extensive human intervention in their habitats, they are now an endangered species. Whether it is the costly ivory tusk of the mammal or deforestation by greedy human beings, it is the gigantic elephants that are at the receiving end [1,2].
Humans have interfered with the habitat of wild animals. Due to the need for space for living, agriculture, and transport, many forests have been cleared. Because of transport needs, many roads and rail tracks exist in or near green zones and forest areas. In India, trains are the cheapest mode of mass transport, connecting every corner of the land for better business prospects. These tracks often pass through forested regions and hence pose a danger to wild animals. Small, agile animals are alert enough to react whenever they sense the danger of an approaching train, whereas bulky animals like elephants cannot manage to save themselves and end up losing their lives. Many steps have been taken to avoid such clashes. Various alert systems and barriers have been placed at crucial sites, but trains still take a huge toll on elephant lives. Human-elephant collision has become a major research topic owing to the declining number of elephants. Many researchers have worked on the impact of railway tracks passing through forest areas. Several researchers have studied the statistics of elephants killed in train collisions over the past few years [3], as well as the measures taken by the Govt. of India to handle the issue. This paper can be considered an extension of the proposal made in [3]. Advances in image processing can be effectively utilized to avoid human-elephant or train-elephant collisions. Image processing powered by machine learning and deep neural networks has given promising results in many real-time applications. Recently, a multilayer feedforward neural network-based approach was proposed for human-robot collision detection [4]. Similarly, a collision avoidance system using an artificial neural network was proposed for a biomimetic autonomous underwater vehicle [5]. Hence, the use of deep learning for detecting, and thereby avoiding, train-elephant collisions can be explored.
Thus, the main contributions of this paper are:

RELATED LITERATURE
A good number of researchers have worked on image identification of elephants, which helps in developing systems to control elephant-train collisions. Many authors have worked on animal detection algorithms using machine learning and deep learning. Koik and Ibrahim [6] presented a literature survey on animal detection methods in digital images. Tanwar et al. [7] presented a survey of algorithms for animal detection. A small summary of contributions made by different researchers in this area is shown in Table 1.

Table 1
[12]: A robust method to track animals and determine their motion pattern
Mammeri et al., 2014 [13]: Two-step classification system using LBP-Adaboost followed by HOG-SVM classifier
Sharma and Shah, 2016 [14]: Animal detection and collision avoidance system using computer vision
Norouzzadeh and Nguyen, 2018 [15]: Automatically identifying, counting, and describing wild animals in camera-trap images with deep learning
Raja et al., 2018 [16]: Image identification using an edge detection algorithm
Devost et al., 2019 [17]: Automated tool for animal detection in camera-trap images
Zotin and Proskurin, 2019 [18]: Animal detection using a series of images under complex shooting conditions
Backs et al., 2017 [19]: Hebb's Law of a learning-based model
Bill et al., 2019 [20]: Kernel Density Estimation based model
Jayakumar et al., 2020 [21]: Animal detection using a deep learning algorithm
Elephant detection and tracking
Venkataraman et al., 2005 [22]: Satellite image-based elephant position of African elephant
Ardovini et al., 2008 [23]: Ear shape-based model for an elephant identification system
Vermeulen et al., 2013 [24]: Unmanned Aerial Vehicles at a height of 100 m for tracking
Zeppelzauer, 2013 [25]: Automated detection of elephants in wildlife video
Sugumar and Jayaparvathy, 2014 [26]: Image feature extraction and similarity matching model based on Euclidian
[32]: Detection of wild elephants using image processing on Raspberry Pi3
Ravikumar et al., 2020 [33]: Transfer Learning based MobileNet model for elephant detection

Several approaches for animal detection and tracking have been proposed [8][9][10][11][12][13][14][15][16][17][18][19][20][21]. Sharma and Shah [14] proposed a real-time animal detection and collision avoidance system using a computer vision technique. Norouzzadeh and Nguyen [15] performed identification and counting of wild animals in camera-trap images with deep learning. Raja et al. [16] studied the prevention of accidents involving wild animals using image detection and an edge algorithm. Devost et al. [17] proposed a new automated tool for animal detection in camera-trap images.
Zotin and Proskurin [18] performed animal detection using a series of images under complex shooting conditions. Backs et al. [19] proposed low-cost electronic devices to avoid train collisions with animals on the track. They performed testing in Banff National Park and Yoho National Park, where grizzly bears, black bears, wolves, and moose were found to be killed in large numbers. The first method uses two paired devices placed at a distance from each other: one device detects a passing train and sends this information to the paired warning device, which is placed at a location with a high strike rate. The second method predicts the train arrival time at a distance, hypothetically taken as 200 meters, and activates an integrated warning mechanism at the desired time. Random Forest classification models were used with 10-fold, 10-repeat cross-validation, and an 80% detection rate was claimed. Bill et al. [20] reported that when wildlife-vehicle collisions (WVC) are categorized by whether they occur inside or outside a cluster, statistically significant local factors can be observed. The Kernel Density Estimation (KDE) method extended with Monte Carlo simulation, called KDE+, was used to identify the more significant clusters. At a 95% confidence level, road width, the presence of shrubs, and habitat type were observed to be the most significant variables. Jayakumar et al. [21] proposed animal detection using a deep learning algorithm.
Other authors have worked specifically on elephant detection and tracking [22][23][24][25][26][27][28][29][30][31][32][33]. Initially, feature extraction was used to detect elephants in frames. Sugumar and Jayaparvathy [26] used images captured with cameras mounted on towers or trees and sent to a distant base station through an RF network. Image feature extraction and similarity matching based on Euclidean and Manhattan distances was performed using multilevel wavelet coefficients obtained from a Haar wavelet decomposition of the image. K-means clustering was used to cluster the images, and F-norm theory was employed to compute the difference, with a threshold of 5 matching images required for the query image before an SMS is sent. The researchers used an optimized distance measure that gave 18.5% better performance than other measures. Zeppelzauer [27] implemented the detection of elephants in wildlife videos. Shukla [33] proposed a MobileNet architecture for successful elephant detection. Although many methods have been proposed for animal and elephant detection, the use of CNN architectures and Transfer Learning (TL) for this purpose has not yet been fully explored. This paper aims to employ CNNs and transfer learning for efficient detection of elephants in different positions on rail tracks so that alert systems can be implemented.

III. DATASET USED
As the experiment aims at identifying elephants on or near the rail track, we searched for images and datasets that serve this purpose. Naude and Joubert [34] have presented a new public benchmark for aerial elephant detection. Many such datasets are publicly available for experimentation on animal detection, but no ideal dataset of animals on rail tracks is available, as it is practically very difficult to capture natural images of an elephant on or near the track. Only a few scattered images are available. Due to this limitation, we decided to use one dataset of elephants and one dataset of rail track images. Two public datasets are used in this research, i.e., ELPephant [35] and RailSem19 [36]. Apart from these datasets, we needed more practical images to test the proposed model. Therefore, we used keywords such as "elephant on rail track" and "animals on rail track" to search for real images. These images are used for training and testing purposes. Table 2 shows the distribution of the number of images used for experimentation. Approximately 20% of the images from the ELPephant and RailSem19 datasets are held out for testing. The remaining 80% are further split into training and validation sets. This split is done randomly to avoid any bias in the experiment. Moreover, the variety of images ensured high variance during the training phase. Fig. 1 shows sample images taken for experimentation. Row 1 shows sample images from the ELPephant dataset. Row 2 shows images from the RailSem19 dataset. Rows 3 and 4 show images found through Google searches with the keywords "animals on rail track" and "elephant on rail track", respectively.
We propose a complete surveillance model for the prevention of train-elephant collisions. Initially, video cameras will be installed at vulnerable sites.
The video from these sites will be converted to frames at a central site, where a trained deep vision model will be kept. This model will inspect the extracted frames for the presence of elephants near the rail tracks. If such a situation is identified, an alarm will be triggered in trains near that track, and a warning sound will be generated at the track to warn the elephant. Fig. 2 illustrates the complete model for train-elephant collision prevention. The step "Visual Inspection by Trained models" in Fig. 2 will be performed by one of the proposed models. We have experimented with several deep learning architectures. First, we designed a new CNN architecture from scratch specifically for this application, trying many combinations of layers to get the best results. Then we experimented with transfer learning using three existing deep learning architectures, i.e., ResNet50, MobileNet, and Inception V3. All the experiments are done using Keras and TensorFlow in Google Colaboratory. Their implementation is discussed in detail below.
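The frame-inspection step described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the deployed system: the function name `monitor_frames`, the `classify` callable (a stand-in for whichever trained model is used), the 0.5 threshold, and the requirement of several consecutive positive frames before alerting are all assumptions introduced here to show how spurious single-frame detections might be suppressed.

```python
from typing import Callable, Iterable, List


def monitor_frames(
    frames: Iterable[object],
    classify: Callable[[object], float],
    threshold: float = 0.5,
    min_consecutive: int = 3,
) -> List[int]:
    """Return the frame indices at which an alert would be raised.

    `classify` is a hypothetical stand-in for the trained model; it is
    assumed to return the probability that a frame contains an elephant.
    An alert fires once per run of positive frames, at the moment the run
    reaches `min_consecutive` frames (an assumed debouncing rule).
    """
    alerts: List[int] = []
    streak = 0
    for index, frame in enumerate(frames):
        if classify(frame) >= threshold:
            streak += 1
        else:
            streak = 0
        if streak == min_consecutive:
            alerts.append(index)
    return alerts


# Stand-in data: each "frame" is already a score, so classify is the identity.
scores = [0.1, 0.9, 0.2, 0.8, 0.95, 0.99, 0.97, 0.3]
alerts = monitor_frames(scores, classify=lambda s: s)
print(alerts)
```

In a real deployment the returned indices would be replaced by calls that notify nearby trains and trigger the trackside warning sound; those interfaces are outside the scope of this sketch.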

A. CNN Architecture
CNN stands for convolutional neural network, a type of neural network that extracts features from images so that the objects present in them can be identified. A CNN combines convolutional, pooling, and fully connected layers. Traditionally, in machine learning, a subject expert handpicks the features used for the identification and classification of objects in an image/video. With a CNN architecture, feature extraction becomes automatic. The convolutional layers apply different types of filters to the images. These filters extract useful features, such as color information and edges. Stacking multiple convolutional layers allows increasingly complex features to be extracted. The extracted features are then reduced by pooling layers so that only significant features pass on to the next layers, which reduces the computational complexity of the architecture. Finally, the reduced feature set is passed to fully connected layers, the last of which classifies the input into binary or multiple classes. Convolutional and pooling layers thus serve as the feature extractor, while the fully connected layers perform the final classification.
Various CNN architectures are possible for feature extraction and classification. Changing the number and type of layers yields a new architecture with a different feature set and outcome. Apart from that, many hyperparameters are available to fine-tune the architecture. For feature extraction, we can use different numbers of convolutional layers with different parameters, such as filter size, activation function, and image size. For the pooling layers, the stride can be selected as required. For the fully connected (dense) layers, an activation function such as 'relu' or 'leakyrelu' can be selected. For the final classification layer, activation functions like 'sigmoid' or 'softmax' can be used: the sigmoid function for binary classification and the softmax function for multi-class classification. Further, learning rate, dropout, and regularization can be used to tune the architecture.
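The distinction between the two output activations mentioned above can be illustrated numerically. The following is a minimal sketch in plain Python; the logit values are arbitrary and chosen only for illustration. It shows that a sigmoid maps a single logit to one probability, while a softmax maps a vector of logits to probabilities that sum to one. For two classes with logits (z, 0), the softmax probability of the first class coincides with sigmoid(z).

```python
import math
from typing import List


def sigmoid(z: float) -> float:
    """Map one logit to a probability in (0, 1); suited to binary output."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits: List[float]) -> List[float]:
    """Map a vector of logits to class probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Arbitrary illustrative logits (not taken from the trained models).
p_elephant = sigmoid(2.0)
class_probs = softmax([2.0, 0.0])
print(p_elephant, class_probs)
```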

Details of Proposed CNN architecture:
The proposed architecture has 5 convolutional layers and 5 pooling layers. The output is then flattened, and a fully connected layer is employed for binary classification. Fig. 3 shows the output of the convolutional and pooling layers as a heat map. Fig. 4 shows the details of the proposed CNN model for classifying images of rail tracks with or without an elephant, including the parameter values used. Data augmentation is used so that the model becomes invariant to translation, rotation, and scaling. A batch size of 32 is used for training as well as testing. The model is trained for 50 epochs. The RMSprop optimizer is used, which is similar to the gradient descent algorithm with momentum. RMSprop restricts the oscillations in the vertical direction; thus, the algorithm can take larger steps in the horizontal direction and converge faster. Categorical_crossentropy is used as the loss function. It measures how far the predicted class probabilities are from the true labels. The 'sigmoid' function is used for binary classification, i.e., to categorize images as with or without an elephant. Dropout is applied to the fully connected layers to limit overfitting.
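The behaviour of RMSprop described above can be illustrated with a small stand-alone sketch. This is not the Keras optimizer used in the experiments; it is a plain-Python version of the standard RMSprop update, applied to an assumed toy objective f(w) = 0.5 (w0^2 + 100 w1^2) whose two directions differ greatly in steepness. The learning rate, decay factor `rho`, and step count are illustrative assumptions.

```python
import math
from typing import Callable, List


def rmsprop(
    grad: Callable[[List[float]], List[float]],
    w0: List[float],
    lr: float = 0.01,
    rho: float = 0.9,
    eps: float = 1e-8,
    steps: int = 500,
) -> List[float]:
    """Minimise a function with the standard RMSprop update rule."""
    w = list(w0)
    s = [0.0] * len(w)  # running average of squared gradients
    for _ in range(steps):
        g = grad(w)
        for i in range(len(w)):
            s[i] = rho * s[i] + (1.0 - rho) * g[i] ** 2
            # Dividing by sqrt(s) damps the step along steep directions
            # and enlarges it along shallow ones.
            w[i] -= lr * g[i] / (math.sqrt(s[i]) + eps)
    return w


def toy_grad(w: List[float]) -> List[float]:
    """Gradient of the toy objective 0.5 * (w0**2 + 100 * w1**2)."""
    return [w[0], 100.0 * w[1]]


w_final = rmsprop(toy_grad, [1.0, 1.0])
print(w_final)
```

Because each coordinate is normalised by its own gradient history, the steep and the shallow direction approach the minimum at a similar pace, which is the effect the paragraph above describes as restricting oscillations in one direction while allowing larger steps in the other.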

B. Transfer Learning:
Transfer learning is a machine learning strategy in which a model developed for one task is reused as the starting point for a model on a second task. It is a popular approach in deep learning, where pre-trained models are used as the starting point for the next model. The pre-trained model (trained on a large dataset) is saved with its weights. The same model is then trained on a new dataset with some changes to the top layers. The final model is then tested on classification or other problems. The concept of transfer learning is visualized in Fig. 5.

Fig. 5 Concept of Transfer Learning
The two basic methodologies for TL are as follows:
1. Develop Model Approach
2. Pre-trained Model Approach
There are three main steps in the pre-trained model approach. First, a pre-trained source model is chosen from the available models. These pre-trained models are usually trained on huge datasets, and a model trained on a similar dataset is usually selected. Secondly, the selected model is reused: the pre-trained model serves as the starting point for a model on the second task of interest. This may involve using all or only some parts of the model. Thirdly, the obtained model is fine-tuned, i.e., adjusted or refined as required by the task of interest. The ImageNet project is a large repository of a variety of images designed for use in visual object recognition. Various models trained on ImageNet are available for research and reuse. When these learned models are used to solve similar problems, this is termed transfer learning. In the current paper, we have used ResNet50, MobileNet, and Inception V3 for binary classification.
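The reuse-and-fine-tune idea of the pre-trained model approach can be shown with a deliberately tiny sketch in plain Python. It is a toy and not the paper's Keras code: the "pre-trained" base is a hypothetical two-unit layer with made-up fixed weights (`FROZEN_W`), and the target data are made-up two-dimensional inputs. What it does show is the mechanism: the base is kept frozen, and only a new sigmoid head is trained on the target task.

```python
import math

# Hypothetical "pre-trained" base: fixed weights that are never updated.
FROZEN_W = [[0.8, -0.3], [0.1, 0.9]]


def frozen_features(x):
    """Frozen base: one linear layer followed by ReLU."""
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in FROZEN_W]


def predict_proba(x, head_w, head_b):
    """New sigmoid head applied on top of the frozen features."""
    f = frozen_features(x)
    z = sum(w * v for w, v in zip(head_w, f)) + head_b
    return 1.0 / (1.0 + math.exp(-z))


def train_head(data, epochs=200, lr=0.5):
    """Fit only the head with stochastic gradient descent on log loss."""
    head_w, head_b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in data:
            f = frozen_features(x)
            err = predict_proba(x, head_w, head_b) - y  # d(loss)/d(logit)
            head_w = [w - lr * err * v for w, v in zip(head_w, f)]
            head_b -= lr * err
    return head_w, head_b


# Toy target task: label 1 = "elephant", 0 = "no elephant" (made-up inputs).
data = [([1.0, 0.0], 1), ([0.9, 0.1], 1), ([0.0, 1.0], 0), ([0.1, 0.8], 0)]
head_w, head_b = train_head(data)
predictions = [int(predict_proba(x, head_w, head_b) > 0.5) for x, _ in data]
print(predictions)
```

In the actual experiments the frozen base would be a network such as ResNet50, MobileNet, or Inception V3 loaded with ImageNet weights, and the new head would be the replaced top layers; optionally, some base layers can later be unfrozen for further fine-tuning.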

ResNet [37], i.e., Residual Network, is a neural network architecture used as the backbone of many computer vision tasks. This model was the winner of the ImageNet challenge in 2015. The key advance of ResNet is that it allows very deep networks with 150+ layers to be trained effectively. Earlier, training such exceptionally deep neural networks was difficult because of the vanishing gradient problem. ResNet addressed this by introducing the idea of the skip connection. Skip connections are now used in many other model structures, such as the Fully Convolutional Network (FCN) and U-Net, where they carry information from earlier layers in the model to later layers. In these designs, they pass information from the downsampling layers to the upsampling layers. The ResNet-50 model consists of 5 stages, each with a convolution block and an identity block. Each convolution block has 3 convolution layers, and each identity block also has 3 convolution layers. ResNet-50 has over 23 million trainable parameters.
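The skip connection can be illustrated with a minimal sketch in plain Python. It is a simplification under stated assumptions: small fully connected layers stand in for the convolution layers of a real ResNet block, batch normalisation is omitted, and the helper names are introduced here only for illustration. The block computes y = ReLU(F(x) + x), so when the learned branch F contributes nothing, the input still passes through unchanged.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def relu(v: Vector) -> Vector:
    return [max(0.0, x) for x in v]


def linear(weights: Matrix, v: Vector) -> Vector:
    """Matrix-vector product; stands in for a convolution layer."""
    return [sum(w * x for w, x in zip(row, v)) for row in weights]


def residual_block(x: Vector, w1: Matrix, w2: Matrix) -> Vector:
    """Compute relu(F(x) + x), where F is two layers with a ReLU between."""
    fx = linear(w2, relu(linear(w1, x)))
    # Skip connection: add the block input to the branch output.
    return relu([a + b for a, b in zip(fx, x)])


x = [1.0, 2.0]
zero = [[0.0, 0.0], [0.0, 0.0]]
y = residual_block(x, zero, zero)  # F(x) = 0, so the block passes x through
print(y)
```

Because the identity path is always present, gradients can also flow back through it directly, which is why such blocks ease the training of very deep networks.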
MobileNets [38] are a class of small, low-computation models that can be used for classification, detection, and other common tasks that are solved using convolutional neural networks. Because of their small size, they are considered well suited for deployment on mobile devices. While MobileNets are faster and smaller than other major networks, such as VGG16, there is a trade-off, and that trade-off is accuracy. MobileNets are typically not as accurate as these larger, resource-heavy models, but they still perform quite well, with only a relatively small reduction in accuracy.

Inception/GoogLeNet [39]
Google devised a module called the Inception module that approximates a sparse CNN with a normal dense construction. Since only a small number of neurons are effective, the number of convolutional filters of a particular kernel size is kept small. It also uses convolutions of different sizes (5x5, 3x3, 1x1) to capture details at varied scales. It exploits the fact that most of the activations in a deep network are either unnecessary (value of zero) or redundant because of correlations between them. Consequently, the most efficient architecture of a deep network will have sparse connections between the activations, which implies that not all 512 output channels will be connected to all 512 input channels. Another salient point about the module is that it has 1x1 bottleneck convolution layers, which massively reduce the computation requirement.
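The saving from a 1x1 bottleneck can be checked with simple arithmetic. The sketch below counts multiplications for a 5x5 convolution from 512 input channels to 512 output channels, with and without a preceding 1x1 convolution. The 28x28 feature-map size and the 64-channel bottleneck width are illustrative assumptions, not values taken from the paper; only the 512-channel figure comes from the text above.

```python
def conv_multiplications(height: int, width: int, c_in: int, c_out: int, k: int) -> int:
    """Multiplications for a k x k convolution with 'same' output size."""
    return height * width * c_in * c_out * k * k


H = W = 28          # assumed feature-map size
C_IN = C_OUT = 512  # channel counts mentioned in the text
BOTTLENECK = 64     # assumed width of the 1x1 bottleneck

# Direct 5x5 convolution: 512 -> 512 channels.
direct = conv_multiplications(H, W, C_IN, C_OUT, 5)

# 1x1 convolution down to 64 channels, then 5x5 convolution up to 512.
bottleneck = (
    conv_multiplications(H, W, C_IN, BOTTLENECK, 1)
    + conv_multiplications(H, W, BOTTLENECK, C_OUT, 5)
)

print(direct, bottleneck, direct / bottleneck)
```

Under these assumptions the bottleneck variant needs roughly 7.7 times fewer multiplications, which gives a sense of the scale of the reduction described above.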

V. RESULT AND DISCUSSION
As mentioned in the methodology, four different CNN models were evaluated for binary classification. Table 3 shows the results and a comparison of the proposed models. The proposed CNN model achieved an accuracy of 99.53% and has the advantage of being a lightweight model with only seven layers and about 2 lakh (200,000) parameters. Fewer parameters make the proposed model computationally efficient. The transfer learning-based models performed better, as they have more layers and more parameters. ResNet50 and MobileNet each achieved an accuracy of 99.81%. Inception V3 performed best in binary classification, achieving an accuracy of 99.91%. The Inception model has relatively few layers and parameters and hence is best suited to the current problem. Fig. 6 shows the confusion matrices for the test dataset of 1056 images, of which 505 contain elephant(s) and 551 do not. Fig. 6a shows the confusion matrix for the proposed lightweight CNN. Fig. 6b, 6c, and 6d show the confusion matrices for the transfer learning models. Fig. 7a and 7b show the training and validation accuracy and loss for the proposed CNN, and Fig. 8a and 8b show the same for the best transfer learning model, i.e., the Inception model.
The trend in the graphs shows that the validation accuracy and loss follow the training accuracy and loss. Table 4 shows a comparison of the proposed model with similar work. Zeppelzauer et al. [27] proposed a multi-modal early warning system for the detection of elephants in wildlife video recordings. Elephants are identified based on a color model of their body, and an SVM classifier is used to predict the presence of an elephant in the recordings. The dataset used had 715 images, and the accuracy obtained was 91.7%. Ravikumar et al. [33] proposed a MobileNet architecture along with the single-shot detection algorithm for the detection of elephants in wildlife videos. This CNN model achieved an accuracy of 92.7%. Our proposed CNN and tuned Inception model outperformed these existing models, with better accuracy and False Positive Ratio. Therefore, the proposed model is better than the existing models.
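The accuracy and False Positive Ratio figures discussed above are derived from confusion-matrix counts, which can be computed as in the following sketch. The counts used here are hypothetical: they respect the reported test-set composition (505 images with elephants, 551 without) and assume a single misclassified image, but the actual cell values of Fig. 6 are not reproduced in the text, so the placement of that one error is an assumption.

```python
from typing import Dict


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard metrics from the four cells of a binary confusion matrix."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
    }


# Hypothetical counts: 505 elephant images, 551 non-elephant images,
# with one non-elephant image wrongly flagged as containing an elephant.
metrics = binary_metrics(tp=505, fp=1, tn=550, fn=0)
print({name: round(value, 4) for name, value in metrics.items()})
```

With one error among 1056 test images, the accuracy rounds to 99.91%, which matches the figure reported for the Inception model; likewise, two and five errors give 99.81% and 99.53%, respectively.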

TESTING BEYOND DATASET:
When testing is done on images with an elephant on the track and on images of an empty rail track, the accuracy obtained is almost 100%. However, we also experimented with complex images containing humans and small animals on the track to check the performance of the proposed model in a real-world situation. These images are not part of the training set. Fig. 9 shows the results obtained by the Inception V3 model for 40 typical real-world images found on the Internet. It was observed that small animals are not mistaken for elephants, as shown in Rows 2-3, and hence the model is ready to be used in a real situation. A few images with big animals, shown in Row 1, are misclassified as containing an elephant. The poor resolution of these images is also one of the constraints. This can certainly be improved with a larger, ideal dataset, which is not currently available.
Fig. 9 Result of Inception V3 model on 40 unseen images from the internet

VI. CONCLUSION
As human activities have led to intervention in wildlife, the consequences are visible in terms of animal extinction. Nowadays many efforts are being made toward animal conservation, and advanced technology can certainly help in this regard. Artificial Intelligence and the Internet of Things together can do a lot in this direction. Along the same lines, this paper aims to develop a model for HEC prevention. It detects elephants on or near the rail track using a deep vision model. Four different models based on CNN and TL are proposed and compared with respect to their effort and efficiency. The Inception V3 model performed best for this application because of its high accuracy and zero false-negative rate. The model can be used to generate an alarm on site and in a train near the track as a warning, and hence to save elephant lives. In the future, a more detailed dataset can be prepared and used for training so that other types of animals are not misclassified as elephants, leading to zero false-positive cases.