Evaluation of Deep Learning models on UV ink: a Fake Money detection scheme with RPN

As soon as coins and paper money were invented, people began making counterfeits. Counterfeit money is fake currency produced without the permission of the state or government, usually to imitate genuine currency and deceive its recipient. In Bangladesh, counterfeiting is a significant and growing problem. Modern banknotes carry several security features that make fake notes easier to identify; one of them is the use of UV ink. Banknote printers deliberately scatter random flecks of color across the surface of the note, which act as an extra layer of protection against counterfeiters. We propose an automatic authentication model that identifies counterfeit money based on these random flecks of color, which are visible under UV light. To obtain a benchmark, existing pre-trained object detection models were used, including MobileNet, Inception, ResNet50, ResNet101, and Inception-ResNet architectures. The optimal model was then built using the Region Proposal Network (RPN) method with Convolutional Neural Network (CNN) based classification, achieving 96.3% accuracy. Reducing the circulation of counterfeit money is critical to curbing inflation in a country's economy. This study will aid in the detection of counterfeit money and, hopefully, reduce its spread.


INTRODUCTION
With the technological improvement of printers and scanners, the threat of forged documents has also grown. Among the various forms of document forgery, counterfeiting of banknotes has become a significant issue, and the economies of many countries are badly impacted by it [9]. An authentication system for banknotes is therefore a major concern in today's world. To the best of our knowledge, forensic signature and handwriting verification have been studied extensively, but very limited work has been conducted on banknote verification. Several embedded features, such as artwork, security threads, watermarks, and UV ink, are used to prevent counterfeiting of banknotes. Bank staff are specially trained to detect counterfeit notes and so far do it manually, but this process is time-consuming and impractical when a large number of documents is involved. Automated systems based on computer vision and deep learning can therefore play a vital role in reducing these cases.
Counterfeit money is money printed without the consent of the government, usually to imitate the currency and deceive the public. During the early years of independence, Bangladesh inherited the monetary management legacy of Pakistan's central bank. The initial situation was volatile, but the government quickly stabilized it by establishing Bangladesh Bank in 1972. Since then, the availability of counterfeit money in Bangladesh has rapidly increased. The Bangladesh Bank itself detects two to three fake notes per million takas (Bangladeshi currency), and counterfeit notes are often found during the initial screening process [22]. As the country's economy rapidly expands, fake currency spreads with it. Currently, an estimated 0.03 percent of the currency in circulation worldwide is counterfeit [15]. In Dhaka, the Rapid Action Battalion (RAB) busted a counterfeit currency factory and confiscated more than 10 million takas in counterfeit notes [2]. The RAB also seized 40 million takas in fake notes, most of which were Tk 1,000 bills, along with $4 million in counterfeit Indian currency [1]. As the economy grows, counterfeiting operations are likely to grow with it. Computer vision and deep learning can play a vital role in reducing these cases.

RELATED WORKS
Object detection is among the most important and challenging problems in computer vision. It aims to identify object instances in natural images from a wide range of predefined categories [12]. Object detection techniques are analyzed in [11] to detect objects on any system running the proposed architecture. That method employs multi-layer convolutional neural networks (CNNs) to create a multi-layer system model that can classify given objects into specified classes. Paper [27] presents a balanced feature fusion SSD (BFSSD) algorithm to improve the efficiency of SSD; the model is trained on the Pascal VOC2007 and VOC2012 datasets and evaluated on the Pascal VOC2007 test set. The research in [13] offered a coherent structure for both training and inference, according to experimental results on the MS COCO, PASCAL VOC, and ILSVRC datasets. The models described above were explored and used as a benchmark to assess the dataset developed in this study. [6] introduces a residual learning approach for training networks that are much deeper than previously used networks; the authors evaluated residual nets on the ImageNet dataset. In [4], image slices are fed into a CNN-based Squeeze and Excitation ResNet model to automatically classify brain tumors from MRI (Magnetic Resonance Imaging) data. Another paper [17] uses transfer learning to propose a new DNN architecture for microscopic image classification; three separate deep CNNs were used: Inception-ResNet-v2, ResNet152, and Inception-v3. The study in [8] employs the MobileNet architecture for image classification on mobile devices and works with a significant amount of training data. The architectures described in [8] were studied to create our proposed architecture and to compare it with different pre-trained models.
Using the now-common terminology of neural networks with "attention" mechanisms, the Region Proposal Network (RPN) and Fast R-CNN were combined into a single network in which the RPN portion tells the unified network where to look [20]. In [14], a multi-task region proposal approach is used to detect small objects in the PASCAL VOC dataset effectively, achieving state-of-the-art object detection accuracy. The research in [3] compares an Enhanced Region Proposal Network (ERPN) to five detection methods on the COCO and PASCAL VOC datasets. The work in [23] implemented a Class Aware Region Proposal Network (CARPN) to generate high-quality region proposals. Selective Search, by contrast, is a region proposal algorithm: [24] combines the strengths of exhaustive search and segmentation by diversifying the search and using several complementary image partitionings to deal with as many image conditions as possible. Anagha Kulkarni et al. [10] explore and expand an alternate approach that partitions the dataset into topic-related shards based on document similarity and scans only the few shards estimated to contain documents relevant to the query; the proposed shard creation techniques are scalable, effective, and self-sufficient, and they produce topic-based shards with low size variance and a high density of relevant documents. In the first stage of this study, an RPN-style pipeline built on the selective search algorithm was used, and the studies listed above informed that stage. The second stage uses one of the most popular deep learning methods, the convolutional neural network (CNN), a multilayer neural network. In recent years, CNNs have become increasingly popular in image processing and have improved the accuracy of many machine learning tasks, evolving into a robust and widely used deep learning model [26].
The datasets used in [21] were ImageNet, CIFAR10, and CIFAR100, and the study focused on evaluating the performance of three standard networks: AlexNet, GoogLeNet, and ResNet50. According to the study, GoogLeNet and ResNet50 can recognize objects with greater precision than AlexNet. The main objective of [7] was to conduct classification experiments on objects obtained from traffic detectors using a CNN and the Histogram of Oriented Gradients (HOG) descriptor. For the second stage of this study, the types of CNN architectures discussed above were studied to develop the proposed CNN architecture.
Accordingly, this study aims to develop a model that can reduce the spread of counterfeit money. A custom dataset was built by scanning taka notes under ultraviolet (UV) light to make the small security thread visible. UV detection of counterfeit currency using this thread has been around since 1976 and has proven remarkably successful [18]; the thread distinguishes genuine money from fake. Existing pre-built deep learning models were first put to the test to obtain a benchmark result: MobileNet, Inception, ResNet50, ResNet101, and Inception-ResNet architectures were used, and they generated mediocre results. An optimal result was then obtained using the Region Proposal Network (RPN) method with CNN-based classification, which achieved satisfactory results. Deep neural networks (DNNs) for object detection are a well-established area of research, in which single-shot and two-shot detectors are considered to offer the best trade-offs between speed and accuracy.

Single-shot object detection
In object detection tasks, the model's goal is to draw tight bounding boxes around desired classes in the picture, along with object labels. Single-shot detection does not carry out a separate region proposal step; it provides final localization and content prediction at the same time. The best-known single-shot approaches are the single-shot multibox detector (SSD) and YOLO [19]. SSD computes localization in a single network pass, tiling a grid of anchor boxes of varying position, size, and aspect ratio onto the image. Zoom augmentation, which shrinks or enlarges the training images, aids generalization. SSD is also better at predicting large objects than Faster R-CNN.

Two-shot object detection
The goal of the task is to draw bounding boxes around every object in a single image, which is critical in a variety of fields, including autonomous driving. The two-shot detection model has two stages: region proposal, followed by region classification and position refinement. Faster R-CNN variants are the most common choice for two-shot detection.

DATASET
The dataset used in this research consisted of 119 images; after standard augmentation, it contained 1428 images with ∼21k annotated thread boxes. A security thread is a delicate ribbon or thread that runs through a banknote; when held under ultraviolet light, the thread in new notes glows. These security threads make copying currency with a commercial color copier difficult. Only 1000 and 500 taka notes were used in this study because they are the largest banknotes in Bangladesh and the two denominations counterfeited most frequently. Photos of both sides of each banknote were taken. Since the chance of an old banknote being counterfeit is extremely low, only pictures of new banknotes were taken; however, once the model has been trained on threads, it can operate on any kind of banknote.
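The paper reports 119 originals expanded to 1428 images (12 variants per original) but does not specify the augmentation operations; purely as an illustration, here is a minimal rotation-and-flip sketch (yielding 8 variants per image, with the function name `augment` assumed):

```python
import numpy as np

def augment(image):
    """Produce augmented copies of an (H, W, C) image array using the four
    90-degree rotations and a horizontal flip of each one. The paper does
    not specify its augmentation set, so these operations are illustrative.
    """
    variants = []
    for k in range(4):                       # 0, 90, 180, 270 degrees
        rotated = np.rot90(image, k=k)
        variants.append(rotated)
        variants.append(np.fliplr(rotated))  # mirrored copy
    return variants

# One dummy "banknote" image yields 8 augmented variants.
dummy = np.zeros((64, 128, 3), dtype=np.uint8)
augmented = augment(dummy)
```

In practice the same transforms would be applied to the thread annotations as well, so each augmented image keeps valid box labels.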

PROPOSED METHOD
The proposed method consists of two stages. The input image first passes through a region proposal algorithm, and each proposed region then goes through a classification algorithm that checks whether it contains the object (foreground) or not (background), as shown in figure 2.

Region Proposal Network Stage
RPNs (Region Proposal Networks) are intended to predict region proposals efficiently across a wide range of scales and aspect ratios. Selective search is used as the region proposal algorithm. Selective search begins with a graph-based segmentation approach that over-segments the image based on pixel intensity. It then adds the bounding boxes of these segments to the list of region proposals and merges adjacent segments based on similarity; at each iteration, larger segments are formed and added to the list. Intersection over Union (IoU) was used to test the accuracy of the proposed method on the dataset. IoU is simply an evaluation metric [25]; it can score any algorithm that produces predicted bounding boxes as output. Each image in the dataset has far fewer foreground regions than background regions, and the two samples need to be balanced. Therefore, from the first 1000 proposed regions, only the first 16 background regions are taken, along with all foreground regions that fall within the desired IoU range.
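The IoU metric described above can be sketched directly from its definition (the helper name `iou` and the corner-coordinate box format are assumptions):

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    if inter == 0:
        return 0.0
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    # Union = sum of areas minus the overlap counted twice.
    return inter / float(area_a + area_b - inter)

# A proposal that half-overlaps the ground truth scores 50/150 = 1/3.
overlap = iou((0, 0, 10, 10), (5, 0, 15, 10))
```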
To keep the dataset balanced, optimal IoU thresholds are set for both foreground and background, and various IoU points were tested to see which combination produces the most balanced training set. As table 1 shows, the number of backgrounds is capped at 16, and since every IoU setting yields more than 16 background candidates, the total number of backgrounds remains the same regardless of the threshold. The number of foregrounds, however, is not capped: all foregrounds falling within the chosen IoU range are kept, so different thresholds can be checked until foreground and background are nearly balanced. Because the objects in the foreground are small, the foreground IoU threshold cannot be set below 0.5. Furthermore, a background region should neither be too close to the foreground nor too far from the object, which is why the background IoU threshold is set at 0.3. The foreground and background of an image are depicted in figure 4.
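The balancing rule above (foreground IoU ≥ 0.5, background IoU ≤ 0.3, at most 16 backgrounds per image) can be sketched as follows; `select_samples` is a hypothetical helper operating on each proposal's precomputed best IoU against the ground-truth thread boxes:

```python
def select_samples(proposal_ious, fg_thresh=0.5, bg_thresh=0.3, max_bg=16):
    """Split proposal indices into foreground and background by IoU,
    keeping at most `max_bg` backgrounds to balance the training set."""
    foreground, background = [], []
    for idx, value in enumerate(proposal_ious):
        if value >= fg_thresh:
            foreground.append(idx)           # thread clearly covered
        elif value <= bg_thresh and len(background) < max_bg:
            background.append(idx)           # far enough from any thread
        # Proposals between the two thresholds are discarded as ambiguous.
    return foreground, background

fg, bg = select_samples([0.9, 0.6, 0.4, 0.1, 0.0, 0.2])
```

Proposals with IoU between 0.3 and 0.5 (such as 0.4 above) are dropped, which mirrors the paper's concern that backgrounds should not sit too close to the foreground.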

Classification Stage
Convolutional neural networks (CNNs) are a form of adaptive image processing system that bridges the gap between general feedforward neural networks and adaptive filters. CNNs are built from multiple layers of artificial neurons and represent a big step forward in image recognition; they are most widely used to analyze visual imagery and often power image classification behind the scenes. Because a CNN extracts features automatically, it produces better results and removes the need for manual feature extraction. The region proposals from the first stage are used as input to the second stage. As figure 5 shows, the CNN model contains 41 layers, including four MaxPooling2D layers. The input image is passed directly from the input layer to a convolutional layer. Convolutional layers apply filters to the original image or to other feature maps, and they hold most of the user-specified parameters in the network; the number and size of the kernels are the most critical. The rectified linear activation function (ReLU) is then applied: every value in the input volume is passed through f(x) = max(0, x), which simply resets all negative activations to 0. Without affecting the convolutional layer's receptive fields, this improves the nonlinear properties of the model and the overall network. The first three MaxPooling2D layers are followed by five convolutional layers with ReLU activations, and then by a fourth MaxPooling2D layer. Next, a dropout layer is used; dropout is a technique for preventing overfitting in which randomly selected neurons are ignored during training. The feature maps are then flattened in the flatten layer, and the result is passed to a dense layer, which gives the network a fully connected layer.
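A hedged Keras sketch of a classifier in this style is shown below; the layer counts, kernel sizes, and input shape are illustrative assumptions, not the paper's exact 41-layer configuration:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_classifier(input_shape=(64, 64, 3)):
    """Illustrative foreground/background classifier: Conv2D + ReLU blocks
    with MaxPooling2D, then Dropout, Flatten, Dense + ReLU layers, and a
    sigmoid output. All sizes here are assumptions for the sketch."""
    return tf.keras.Sequential([
        tf.keras.Input(shape=input_shape),
        layers.Conv2D(64, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu", padding="same"),
        layers.Conv2D(64, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(),
        layers.Dropout(0.2),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.2),
        layers.Dense(1, activation="sigmoid"),  # foreground probability
    ])

model = build_classifier()
```

Each cropped region proposal would be resized to the input shape before being fed through this network.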
Regularization was tested in this model, but accuracy did not improve. Regularization is a technique for reducing model complexity and can help avoid overfitting. In a dense layer, every output from the previous layer is fed to every neuron, and each neuron supplies one output to the next layer; it is the most fundamental layer in neural networks. The ReLU activation function is applied next, followed by another dense layer and another ReLU, then a dropout layer and a final dense layer. The final layer uses a sigmoid activation function because its output lies between 0 and 1, which makes it particularly useful for models that must predict a probability; since probabilities exist only between 0 and 1, sigmoid is the natural choice. After passing through all the layers, the output is a list of proposal bounding boxes classified as foreground. The Non-Maximum Suppression (NMS) technique, a computer vision method for picking one entity out of a slew of overlapping ones, is then used to filter the proposed regions further. After NMS, the output contains the list of filtered proposals.
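Greedy NMS as described above can be sketched in plain Python (the box format, threshold value, and helper name are assumptions):

```python
def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression: keep the highest-scoring box,
    drop remaining boxes overlapping it above `iou_thresh`, repeat."""
    def iou(a, b):
        # Intersection over Union for (x1, y1, x2, y2) boxes.
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, x2 - x1) * max(0, y2 - y1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / float(area_a + area_b - inter) if inter else 0.0

    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep

# Two near-duplicate detections collapse to the stronger one.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
kept = nms(boxes, [0.9, 0.8, 0.7])
```

Here the second box overlaps the first with IoU ≈ 0.68 and is suppressed, while the distant third box survives.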

RESULTS
The results of the pre-built models, the model design of the proposed architecture, and the hyperparameter selection are discussed in this section.

Architecture Selection
Cross-validation is a resampling technique used to evaluate machine learning models on a limited data sample.
The method has a single parameter, k, which determines how many groups a given data sample should be split into; hence it is commonly called k-fold cross-validation. When a specific value for k is chosen, it may be used in place of k in the name, such as k=10 for 10-fold cross-validation. In applied machine learning, cross-validation is used to estimate a model's ability on unseen data. The approach is popular because it is simple to understand and gives a less biased and less optimistic estimate of model skill than other methods, such as a simple train/test split. The TensorFlow models listed in table 2 are pre-trained. Faster R-CNN architectures achieved better results than SSD architectures: among the SSD architectures, MobileNet v2 and Inception v2 performed best, while among the Faster R-CNN architectures, ResNet101 and Inception-ResNet v2 did. Of all the architectures, Inception-ResNet v2 came out on top with a score of 88%. This score is used as the benchmark in this study; the proposed model must achieve better results than this.
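The k-fold splitting described above can be sketched in plain Python (the helper name `kfold_indices` is hypothetical; shuffling is omitted for clarity):

```python
def kfold_indices(n_samples, k=10):
    """Split indices 0..n_samples-1 into k folds; each fold serves once
    as the validation set while the remaining folds form the training set."""
    indices = list(range(n_samples))
    # Distribute any remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, val))
        start += size
    return folds

# 10-fold split of a 100-sample dataset: each validation fold has 10 items.
splits = kfold_indices(100, k=10)
```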
The proposed regions from the first stage must be classified as foreground or background, so a CNN classifier is needed. To design it, several model parameters were tested, such as different node sizes, the number of Conv2D layers, and the number of dense layers. The test results are shown below.
Graph 6 illustrates that some of the configurations underfit while others produced satisfactory results. Graph 7 shows the top three architectures by validation accuracy and loss.

Hyperparameters selection
The best architecture was chosen based on the graph. Next, the hyperparameters must be selected: regularization, batch size, learning rate, and dropout. Hyperparameters are the training variables that are manually set to a predetermined value [5]. The effect of regularization on the accuracy of the chosen architecture was tested. L1 regularization drives parameters toward zero: the L1 norm yields a sparse solution in which most input features have zero weight and only a few have non-zero weights, which is why L1 regularization is also used for feature selection. L2 regularization provides a non-sparse solution by making the weights small but not zero. Because its squared terms exaggerate the errors of outliers, L2 is not robust to outliers, and the regularization term tries to remedy this by penalizing the weights. Among the options, training without a regularizer was marginally better than with the L2 regularizer (see figure 8). The Adam optimizer was trained with learning rates at logarithmic intervals ranging from 0.000001 to 100. Learning rates of 0.0001 and 0.00001 showed the best results, as shown in graph 8; the highest accuracy, 96.3 percent, was reached at 0.0001.
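The logarithmic learning-rate sweep can be sketched as follows; the accuracy values other than the reported 96.3% at 0.0001 are invented placeholders, and `pick_best` is a hypothetical helper:

```python
# Logarithmically spaced learning rates, as swept in the paper
# (1e-6 up to 1e2); the actual model training is elided here.
learning_rates = [10.0 ** e for e in range(-6, 3)]

def pick_best(results):
    """Given {learning_rate: validation_accuracy}, return the best rate."""
    return max(results, key=results.get)

# Hypothetical sweep outcome; only the 96.3% figure at 1e-4 comes from
# the paper, the other accuracies are placeholders for illustration.
example = {1e-3: 0.91, 1e-4: 0.963, 1e-5: 0.955}
best = pick_best(example)
```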
The accuracy of all the dropout values tested on the proposed model is shown in figure 8. The dropouts with the highest accuracy were 0.1, 0.2, and 0.5; at 96.3 percent, 0.2 was the most accurate. Table 9 shows the inference latency of each stage: generating 1000 proposed regions in the first stage took 23.10 seconds, and predicting those 1000 regions and displaying the results in the second stage took 23.80 seconds.
After selecting all the hyperparameters and applying the NMS technique, the outputs of the proposed model and the best TensorFlow model are shown in figure 10.

Hardware Setup
The research was implemented with the TensorFlow 2 framework and Python 3 as the programming language. The supporting packages were Jupyter Notebook, pip3, NumPy, Matplotlib, SciPy, and pandas, all installed on a Linux operating system. The hardware was built around an Intel Core i5 4th-generation processor with a clock speed of 3.20 GHz, 16 GB of DDR3 RAM, and an Nvidia GeForce GTX 1060 GPU with 6 GB of memory.

Evaluation
The TensorFlow models are evaluated with AP as the benchmark metric. Average precision (AP) condenses the precision-recall curve into a single number: AP = Σ_n (R_n − R_{n−1}) · P_n, the sum of the precisions at each threshold weighted by the corresponding increase in recall. To find the best CNN architecture for the proposed method, validation accuracy was used for the CNN classification. Validation accuracy is the accuracy measured on the validation set; it is not used for training but for checking the proposed model's generalization and for early stopping.
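The AP summation can be sketched directly from its definition (the helper name `average_precision` and the sample precision/recall values are assumptions; recalls are assumed sorted in ascending order):

```python
def average_precision(precisions, recalls):
    """AP = sum over thresholds of (R_n - R_{n-1}) * P_n: each precision
    is weighted by the increase in recall at that threshold."""
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap

# A detector whose precision decays as recall grows:
# 0.2*1.0 + 0.3*0.8 + 0.5*0.6 = 0.74
ap = average_precision([1.0, 0.8, 0.6], [0.2, 0.5, 1.0])
```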

DISCUSSION
In this research, four hyperparameters were tuned to their optimal values: the batch size was set to 32, the learning rate to 0.0001, and the dropout to 0.2, with no regularization. The chosen model consisted of 5 Conv2D layers, 64 nodes, and 3 dense layers. After training, the validation accuracy of the proposed model was 96.3%, and security thread detection per image was also high. Each test image took 40.84 seconds, which could be reduced by replacing selective search with a faster proposal method. Overall, the proposed model produced a satisfactory result.
A security thread is a security feature in many banknotes: a thin ribbon threaded through the note's paper to prevent counterfeiting. As a banknote ages, however, its security thread becomes mutilated and less effective, and if the proposed model cannot detect the thread on an old banknote, it could wrongly declare the note counterfeit. The dataset used in this study contained images of both sides of each banknote. When capturing banknotes, the light exposure must be precise for the security thread to be noticeable under UV light; with incorrect exposure, the thread does not appear in the photos. Every country includes its own types of security thread in its banknotes to avoid counterfeiting [16], so the model may also identify a foreign banknote as counterfeit money.

CONCLUSION
It is worth noting that the use of counterfeit money in Bangladesh is on the rise: the Bangladesh Bank detects two to three fake notes per million takas, and counterfeit notes are frequently discovered during the initial screening process. Counterfeit currency lowers the value of real money and raises prices, resulting in inflation as more money circulates in the economy. The country's rapidly growing economy has encouraged the proliferation of counterfeit money. An intelligent system will not fix the problem completely, but it can reduce it significantly and quickly. Object detection technology is being used in a great deal of research to detect counterfeit money, and further research into it has the potential not only to improve the system but to eradicate the problem entirely.