Real-Time Ethiopian Currency Recognition for Visually Disabled Peoples Using Convolutional Neural Network

According to a survey by the Ethiopian Ministry of Health and several non-governmental organizations in 2006 G.C., about 5.3% of the Ethiopian population lives with blindness or low vision. This research aims to develop a Convolutional Neural Network-based model, built on pre-trained models, that enables visually impaired people to recognize Ethiopian currency banknotes in real time. The models attempt to recognize Ethiopian banknotes accurately even when the input images show partially or highly distorted and folded Birr notes. A dataset of 8,500 banknote images (1,700 for each class) was collected in real-life situations with the help of 9 blind persons. The models were evaluated with 500 real-time videos recorded under different conditions. The training, classification, and detection tasks were carried out with the TensorFlow Object Detection API and the pre-trained Faster R-CNN Inception and SSD MobileNet models. All code is implemented in Python. The models were tested on numerous Ethiopian banknotes in different physical conditions and under different lighting conditions. On real-time video, the Faster R-CNN Inception model obtained an average accuracy, precision, recall, and F1-score of 91.8%, 91.8%, 92.8%, and 91.8%, respectively, and the SSD MobileNet model obtained 79.4%, 79.4%, 93.6%, and 84.4%, respectively. As the first research work in this domain, both models perform well, but Faster R-CNN gives the more promising result, with an average accuracy of 91.8%.


Introduction
Vision problems are considered one of the major public health issues on the African continent. A WHO report indicates that 90% of blind people live in developing countries [1]. According to the national survey by the Ethiopian Ministry of Health and several non-governmental organizations in 2006 G.C., about 5.3% of the country's population lives with blindness or low vision (1.6% blind and 3.7% low vision) [2].
In today's world, it is easy to forget that people who struggle to lead a normal life live among us. A person with a vision problem faces numerous difficulties in performing day-to-day activities that seem simple to sighted people, such as accessing information (printed media and mail), mobility, shopping, cooking, recognizing objects, and many other independent living skills [3]. They also suffer serious consequences: compared with a person without a vision problem, they are three times more likely to be unemployed, three times more likely to be involved in a motor vehicle accident, three times more likely to suffer from depression and anxiety disorders, three times more exposed to sexual and other harassment, and twice as likely to fall while walking [4].
As mentioned above, one of the major problems visually disabled people face in their daily routine is recognizing objects such as currency, especially paper currency. Currency is used almost everywhere [5]. Although electronic transactions such as mobile banking and other electronic forms of payment are growing, hand-to-hand cash transactions are still widely used for everyday purposes in Ethiopia. Both paper and coin currencies exist, and the Ethiopian currency is called the "Birr". There are five paper denominations: One Birr, Five Birr, Ten Birr, Fifty Birr, and One Hundred Birr. Each has a specific size, color, and other features, which makes visual identification easy. From informal observation, almost all visually impaired people identify coins simply by touching the specific tactile markings on each coin. Identifying paper currency is much harder for them. The major challenge is that they depend on others to learn the value of a banknote, asking other individuals the well-known question "how much is this?", because Ethiopian banknotes carry no tangible patterns or other marks that would enable a blind or visually impaired person to identify their value.
To reduce this dependence, blind people have come up with numerous coping mechanisms, such as measuring banknotes between their fingers, storing different denominations in different pockets, ordering the money by banknote size, and measuring the size of a banknote against a ready-made piece of paper. These approaches give the visually impaired community valuable support, but they do not remove the challenges entirely: it is easy to mix up notes when new banknotes are received, to forget which pocket holds which denomination, and to misjudge a banknote once it becomes worn. A gap therefore remains that leaves the community feeling dependent on others and causes discomfort in daily life. Such challenges can best be addressed through emerging technologies. To fill this gap, we propose to develop a model using a convolutional neural network that provides real-time Ethiopian currency recognition for visually impaired people.
The Convolutional Neural Network (CNN) is biologically inspired by Hubel and Wiesel's early work on the visual cortex of monkeys [6], and it is designed to imitate the behaviour of the visual cortex. Two main layer types, the convolutional layer and the pooling layer, preserve the 2D structure of the input images inside a CNN. As a result, the neurons in a layer are connected only to a small region of the preceding layer, which is similar to the biological visual cortex [7]. Although early CNNs performed well on simple tasks such as character recognition, they fell out of favour as problem complexity grew and computing resources remained limited. The rebirth of CNNs came when Krizhevsky, Sutskever, and Hinton presented a large improvement in image classification accuracy on ImageNet [7].
Several state-of-the-art approaches exist for object detection. This study describes only two of them, Faster R-CNN and the Single Shot MultiBox Detector, which were selected because of their well-known capabilities with regard to speed and accuracy.
Faster R-CNN was introduced with the main aim of sharing convolutional layers between detection and region proposal generation. Researchers discovered that the feature maps produced by object detection networks can also be used to generate region proposals. The region proposal network (RPN) is the fully convolutional part of Faster R-CNN that produces the region proposals. In this research work, the custom dataset created for the study is used to train and evaluate a pre-trained model that combines Faster R-CNN for detection with Inception v2 for feature extraction.
A Faster R-CNN network can be trained either for detection or only for generating regions of interest. The most common training procedure is as follows. First, two distinct networks are trained; the two networks are then combined and fine-tuned. During fine-tuning, some layers are kept fixed while others are trained in turn [8]. The trained network receives a single image as input, and the shared fully convolutional layers generate feature maps from it. The region proposal network then receives these feature maps and outputs its region proposals. Finally, the feature maps together with the region proposals become the input to the final detection layers, which consist of a region of interest pooling layer followed by classification [9]. Because the convolutional layers are shared, the computational cost of the region proposals is very low, which is an additional benefit of computing them inside the CNN. To handle detection windows of various sizes and shapes, special anchor boxes are used instead of a pyramid of different filter sizes. The anchor is the central idea of the sliding-window step [8].
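The anchor idea can be illustrated with a short sketch. The snippet below generates anchor boxes of several scales and aspect ratios centred on one sliding-window position. The function name `make_anchors` and the scale and ratio values are illustrative assumptions, not the settings used in this study.

```python
import math


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return anchor boxes (xmin, ymin, xmax, ymax) centred on (cx, cy).

    Each anchor has area scale**2 and aspect ratio width / height = ratio.
    The default scales and ratios are illustrative assumptions.
    """
    anchors = []
    for scale in scales:
        for ratio in ratios:
            width = scale * math.sqrt(ratio)
            height = scale / math.sqrt(ratio)
            anchors.append((cx - width / 2, cy - height / 2,
                            cx + width / 2, cy + height / 2))
    return anchors


# One sliding-window position yields len(scales) * len(ratios) anchors.
anchors = make_anchors(cx=128, cy=128)
print(len(anchors))
```

Every position of the sliding window receives the same set of anchor shapes, so a single pass over the feature map covers objects of different sizes without resizing the image or the filters.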
The Single Shot MultiBox Detector (SSD) [10] was presented as a method for detecting objects in an image or a sequence of images (video) in a single shot. SSD is a one-stage detection approach. Region-based object detection approaches such as Faster R-CNN perform two critical stages in distinct steps: generating regions of interest and then classifying the generated regions. SSD, in contrast, predicts the regions (bounding boxes) and classifies them in a single shot. Performing the two critical operations in a single shot makes SSD a good candidate for real-time detection because of the speed it gains. The model saves computation time because it generates no region proposals and does not resample image segments. SSD is less accurate than Faster R-CNN. SSD handles objects of different sizes by using feature maps from different convolutional layers as input to the classifier. The network produces a large number of regions (bounding boxes), each with scores for the object classes in that box. Boxes whose confidence falls below a certain threshold are discarded, and non-maximum suppression removes overlapping duplicates, so that only the most confident boxes are kept. The SSD architecture allows end-to-end training and improves the speed of the detector. The SSD model is easy to train and simple to integrate into any system that requires an object detection module because, as described above, it encapsulates all computation in a single network by removing the proposal generation and feature resampling stages. The authors of SSD evaluated the model on the PASCAL VOC, MS COCO, and ILSVRC datasets and obtained accuracy comparable to that of other methods at a much higher speed.
SSD object detection consists of two parts: feature map extraction, and the application of convolution filters to detect objects.
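The post-processing that a single-shot detector applies to its many candidate boxes (confidence filtering followed by non-maximum suppression) can be sketched in plain Python. This is a minimal illustration and not the implementation inside the TensorFlow Object Detection API; the function names and both threshold values are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (xmin, ymin, xmax, ymax)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def filter_detections(boxes, scores, score_threshold=0.5, iou_threshold=0.5):
    """Return the indices of the boxes that survive, highest score first.

    Step 1 drops boxes whose score is below score_threshold.
    Step 2 (non-maximum suppression) drops any box that overlaps an
    already kept, higher-scoring box by more than iou_threshold.
    """
    candidates = [i for i, score in enumerate(scores) if score >= score_threshold]
    candidates.sort(key=lambda i: scores[i], reverse=True)
    kept = []
    for i in candidates:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept


# Hypothetical candidate boxes and scores for a single class.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60), (0, 0, 5, 5)]
scores = [0.9, 0.8, 0.7, 0.3]
print(filter_detections(boxes, scores))
```

In this example the second box overlaps the first one heavily and is suppressed, and the last box is dropped because its score is below the confidence threshold.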
The TensorFlow Object Detection API is a CNN-based framework built on top of TensorFlow for object detection. Its ease of building, training, and deploying models makes it popular in the research community. It provides numerous pre-trained object detection models, trained on the Common Objects in Context (COCO), KITTI, and Open Images datasets. These models can be used for inference when only the categories in those datasets are of interest, or for initializing a model that is then trained on a custom dataset. The pre-trained models trained on the COCO dataset are listed in the TensorFlow model zoo together with their execution speed, accuracy, and type of output.

Related Work
A recognition system for Pakistani paper currency was proposed by Ahmed Ali and Mansoor [11]. Using image processing techniques, the authors attempted to provide an accurate and intelligent recognition solution for Pakistani paper currency, which comes in different denominations with variations in size, color, and pattern. The study claimed to avoid the purchase of expensive recognition hardware and to minimize human effort. The proposed system was built from a personal computer, a scanner, and classifiers. After the banknote image was captured with the scanner, pre-processing steps such as noise removal, RGB-to-gray conversion, and gray-to-binary conversion were performed. The authors selected k-nearest neighbors (KNN), an instance-based learning algorithm, and identified the Euler number, height, width, aspect ratio, and area as the features that decide the classification. These features are extracted from the training images and stored in a database in MAT-file format. The KNN classifier classifies an unknown instance by relating it to the known instances according to some distance or similarity function. Overall, the proposed approach comprises four procedures: image acquisition, pre-processing, feature extraction, and classification. The authors acquired a total of 100 images with a scanner, 20 from each of the currency notes (10, 20, 50, 100, 500, and 1000).
A Fast-Mobile Money Reader was proposed to enable blind people to exchange United States banknotes without fear by using their smartphones [12]. For feature extraction, the authors selected the Scale Invariant Feature Transform (SIFT) algorithm as a fast approach. Instead of hand-picking features found on the currency bills, they propose a robust machine learning approach trained on the data. Only four United States banknotes ($1, $5, $10, and $20) are used for training and testing. Each image is scaled to 300 × 300 pixels, and a 200-pixel white border is added around it. For experimentation, the authors create artificial training images by rotating the existing images through 90, 180, and 270 degrees and by scaling them to 0.5 and 1.5 times their original size. The proposed system takes continuous snapshots until at least 60% of the banknote is exposed to the phone's camera.
Jegnaw Fentahun [13] proposed the design and development of an automatic recognition system for Ethiopian paper currency with three major aims: to identify Ethiopian paper currency, to distinguish counterfeit banknotes from genuine ones, and to categorize banknotes by denomination. The discriminative features are the main color, the distribution of color, the hue value, and SURF. The proposed system consists of two major components. The currency denomination component accepts a scanned image as input and processes it through its sub-components (pre-processing, feature extraction, and currency categorization) to classify the input banknote into one of the five denominations of Ethiopian currency (1 Birr, 5 Birr, 10 Birr, 50 Birr, and 100 Birr). The currency verification component receives either the output of the denomination component or an image captured by a camera and verifies whether the paper currency is genuine or counterfeit. Banknotes were classified with correlation coefficient-based template matching, and their authenticity was verified by segmenting the thin golden vertical strip found on the 50 and 100 Birr notes.
Another research work [14] proposed an Ethiopian paper currency recognition system and investigated SIFT, GLCM, color momentum, CNN, and a combination of SIFT, GLCM, and color momentum as feature extraction techniques, with a feed-forward ANN as the classifier. The work followed the major image processing phases: image acquisition using a scanner; pre-processing, which removes noise, converts RGB to grayscale, and normalizes the size of the input banknote image to reduce the influence of noise on the recognizer; feature extraction, which extracts descriptive features from the banknotes using a Convolutional Neural Network (CNN) model; and classification, which was performed with a feed-forward artificial neural network classifier. To train and test the proposed model, a total of 2400 banknote images were collected with the scanner, of which 70%, 15%, and 15% were used for training, validation, and testing, respectively.
Almost all of the works reviewed in this section play a significant role in the problem area of currency recognition. However, they have limitations: many were trained and tested on small datasets, some require a static environment for recognition, and some are expensive or require special knowledge to use. In addition, most of the technological solutions require the blind or visually impaired individual to photograph the full banknote by presenting it in front of the camera, which is neither ideal nor practically easy for someone with a vision disability. Some require a static background, distance, position, or environment, which is also impractical for blind or visually impaired people. Others do not account for protrusions (e.g. fingers) between the camera and the banknote, or do not consider that pictures taken by blind or visually impaired individuals may suffer from folding, poor lighting, or capturing only a small piece of the banknote. Developing a real-time currency recognition model using a CNN classifier therefore has the potential to address these problems and to give better, more uniform assistance to the blind and visually impaired community.

Objective Of The Paper
The general objective of this study is to develop a CNN-based model that enables visually impaired or blind people to recognize Ethiopian currency banknotes in a real-time context. To achieve the general objective, the following specific objectives are formulated: to conduct an in-depth literature study on background knowledge such as neural network algorithms, vision impairment and blindness, Ethiopian currency banknotes, the approaches different countries use to recognize banknotes, and other related concepts; to collect datasets for training and testing the model; and to develop the recognition model using a pre-trained CNN model.

Methods
The major aim of this study is to create a model that can recognize Ethiopian currency banknotes. To achieve this aim, an in-depth literature review was carried out on background and other important topics, including visual disability, currency recognition, previous related works, and Convolutional Neural Networks (CNNs). Reviewing previous research is an important groundwork activity because it reveals what has been done and what still needs to be done in a particular problem area [15]. Related literature in the problem domain was reviewed from various sources, including books, journal articles, conference papers, reports, and the internet.
This section presents the fundamental methods carried out in this research. The complete methodology consists of five major phases: data collection; data preparation and preprocessing; justification of the framework and selection of pre-trained models; re-training of the pre-trained models; and evaluation of the re-trained models.

Data Collection
In the data collection phase, banknotes of the five Ethiopian denominations, ranging from new to worn, were gathered from different banks, and a large custom video dataset was recorded that reflects the real-life scenario of the visually impaired community. All data were collected by holding a Samsung Galaxy A10 mobile phone beside the ear of a visually impaired individual for 10 seconds. To capture real-world challenges, all videos were recorded in uncontrolled environments. The recordings were made to include protrusions such as fingers and shadows, as well as various lighting conditions, including moonlight, daylight, artificial light, and transitions from one lighting condition to another. To reduce the time-consuming and tedious task of manual labeling, only 1,700 images of each banknote were selected, for a total of 8,500 images in the dataset. Ideally, the data should be as close as possible to the real-world situation, which is why the custom videos were recorded in the real-world setting of the visually impaired community. All images in the dataset share the same dimensions, 256 × 256 × 3 (width × height × channels).

Data Preprocessing and Data Preparation
The extracted images all have the same width and height, 1080 × 1080 pixels. They were resized to the target size of 256 × 256 pixels, following the work described in [7].
In the data preparation task, image labeling with bounding boxes and data separation were performed.
Image labeling is an essential task for supervised machine learning techniques because the output of the model is determined by the labels fed to it in the training stage. The open-source graphical labeling tool "LabelImg" and the ground truth bounding box technique were used to label the dataset, which holds the five Ethiopian banknotes, as shown in Fig. 1. For each image, the label information was saved as an eXtensible Markup Language (XML) file in PASCAL VOC format, a format compatible with the pre-trained CNN models. The XML file stores important information such as the image name, the folder name, the image size (width, height, and depth), the coordinates of each bounding box (xmin, ymin, xmax, ymax), the class name, and others. We tried to be consistent when deciding whether to include or discard banknotes from labeling. Banknotes that were fully or partially visible and recognizable were included, whereas banknotes that were unrecognizable because of size or position were excluded.
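To make the annotation format concrete, the following sketch reads one PASCAL VOC style annotation with Python's standard library. The sample XML, the file name, and the coordinate values are hypothetical and are not taken from the study's dataset; the function name `read_annotation` is likewise an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation in the PASCAL VOC layout written by LabelImg.
SAMPLE_XML = """<annotation>
  <folder>train</folder>
  <filename>ten_birr_0001.jpeg</filename>
  <size><width>256</width><height>256</height><depth>3</depth></size>
  <object>
    <name>Ten Birr</name>
    <bndbox><xmin>34</xmin><ymin>58</ymin><xmax>210</xmax><ymax>171</ymax></bndbox>
  </object>
</annotation>"""


def read_annotation(xml_text):
    """Return the image metadata and one entry per labeled bounding box."""
    root = ET.fromstring(xml_text)
    image = {
        "filename": root.findtext("filename"),
        "width": int(root.findtext("size/width")),
        "height": int(root.findtext("size/height")),
        "depth": int(root.findtext("size/depth")),
    }
    objects = []
    for obj in root.findall("object"):
        objects.append({
            "class": obj.findtext("name"),
            "xmin": int(obj.findtext("bndbox/xmin")),
            "ymin": int(obj.findtext("bndbox/ymin")),
            "xmax": int(obj.findtext("bndbox/xmax")),
            "ymax": int(obj.findtext("bndbox/ymax")),
        })
    return image, objects


image, objects = read_annotation(SAMPLE_XML)
print(image["filename"], objects[0]["class"])
```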
After image labeling was complete and before the actual training began, the dataset was randomly split into a training set, used to train the model, and a test set, used to evaluate the trained model. The training directory contains the training images with the corresponding XML files generated by the labeling task, and the test directory likewise contains the test images with their corresponding XML files. The split used a ratio of 9:1: 90% of the dataset (7,650 images) is used to train the model, and 10% (850 images) is used to test the trained model.
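A random 9:1 split of this kind can be sketched as follows. The file names and the fixed seed are illustrative assumptions; the study does not state how the shuffle was seeded or whether the split was stratified by class.

```python
import random


def split_dataset(items, train_fraction=0.9, seed=42):
    """Shuffle the items reproducibly and cut them into train and test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical image names standing in for the 8,500 labeled images.
image_names = [f"img_{i:04d}.jpeg" for i in range(8500)]
train_names, test_names = split_dataset(image_names)
print(len(train_names), len(test_names))  # 7650 850
```

Each image's XML file would be moved together with the image so that the two directories stay consistent.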

Pre-Trained Model Selection
Two pre-trained models were selected for training on the Ethiopian currency banknote dataset: SSD with MobileNet v1 and Faster R-CNN with Inception v2. These models were chosen based on the literature review described in the Introduction of this study. Speed, accuracy, detection approach, and problem domain were the main criteria for selecting among the existing pre-trained models.
In SSD with MobileNet v1, feature extraction is performed by the Mobile Network (MobileNet) and detection is performed by SSD. This pre-trained model was selected for three reasons: the problem domain, since blind people need the required information as quickly as possible; its speed; and its lightweight nature, which allows object detection on devices with low computational power such as a smartphone or a Raspberry Pi. MobileNet is a lightweight deep neural network that is efficient for mobile and embedded devices. The principle behind the architecture is the factorization of the standard convolutional filter into a depthwise convolution and a pointwise convolution filter [16].
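The saving obtained from this factorization can be checked with a small calculation of multiply-accumulate operations. The layer dimensions below (a 3 × 3 kernel, 64 input channels, 128 output channels, a 56 × 56 feature map) are illustrative assumptions and do not describe a specific layer used in this study.

```python
def standard_conv_cost(kernel, in_channels, out_channels, feature_map):
    """Multiply-accumulates of a standard convolution layer."""
    return kernel * kernel * in_channels * out_channels * feature_map * feature_map


def separable_conv_cost(kernel, in_channels, out_channels, feature_map):
    """Multiply-accumulates of a depthwise plus a 1x1 pointwise convolution."""
    depthwise = kernel * kernel * in_channels * feature_map * feature_map
    pointwise = in_channels * out_channels * feature_map * feature_map
    return depthwise + pointwise


standard = standard_conv_cost(3, 64, 128, 56)
separable = separable_conv_cost(3, 64, 128, 56)
ratio = separable / standard
# The ratio equals 1 / out_channels + 1 / kernel**2.
print(f"separable / standard = {ratio:.3f}")
```

With a 3 × 3 kernel the factorized layer needs roughly one eighth to one ninth of the computation of the standard layer, which is what makes the architecture attractive for low-power devices.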
In Faster R-CNN with Inception v2, feature extraction is performed by the Inception network and the Faster R-CNN algorithm is applied for detection. This pre-trained model was selected because of the importance of accuracy and because of the detection approach of the model.

Environment Setup
Training CNNs from scratch requires a large amount of data and high-performance computing hardware. The training and evaluation of the models are implemented with the TensorFlow Object Detection API, configured in a Windows 10 environment.
All training, preprocessing, and experimental tasks were done on two laptops running the Windows 10 operating system: an HP ProBook 450 G3 with an Intel(R) Core(TM) i7-6500U CPU @ 2.50 GHz and 16 GB RAM, and an HP Pavilion Power 15 with a 7th-generation Intel(R) Core(TM) i5 CPU @ 2.1 GHz, 8 GB RAM, and 2 GB AMD Radeon graphics. Either the front-facing webcam of the laptop or an external webcam was used to demonstrate real-time recognition. All code presented in this study is written in the Python programming language. Python was chosen because it is easy to learn and because mature resources are available for computer vision and real-time techniques [17].
TensorFlow is an open-source framework released by Google in 2015 with the ambition of being a playground for machine learning. The framework is written in C++, Python, and CUDA. Among the many reasons for using it are its computational speed, which makes TensorFlow suitable for both industrial practice and academic research, its expressive architecture, its mature online support, and the availability of resources [18].
The Open Source Computer Vision Library (OpenCV) is a free and open-source software library for computer vision and machine learning. OpenCV supports numerous programming languages, such as Python, C++, and Java, and it is platform-independent, working on a variety of platforms including Windows, Linux, Android, and iOS [19]. It can therefore be easily accessed and used as a tool. For these reasons, the real-time Ethiopian currency banknote recognition model uses OpenCV together with other machine learning libraries and tools.

Train the Pre-Trained Model
Transfer learning is applied by using the pre-trained SSD MobileNet and Faster R-CNN Inception models to train on the prepared custom dataset of Ethiopian currency. As mentioned above, the TensorFlow Object Detection API was used for training. Before the actual training process begins, several mandatory steps must be completed: XML to CSV conversion, label map creation, TensorFlow record generation, and training pipeline configuration.

XML to CSV Conversion
As mentioned before, the training and test datasets hold XML files generated by the data labeling task, each named after its corresponding image (.jpeg) file. Each XML file contains important values such as the image file name, width, height, category/class name, the bounding box coordinates (xmin, xmax, ymin, ymax), and others. The XML files were converted into two CSV files (train_labels.csv and test_labels.csv), which hold the essential information for all images in the training and test datasets, by editing the xml_to_csv.py file that comes with the API. The test_labels.csv and train_labels.csv files provide tables of 858 and 7655 rows, respectively; the row counts exceed the image counts because some of the images contain more than one class.
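The conversion step can be sketched with the standard library alone. This is an illustrative stand-in for the xml_to_csv.py script, not a copy of it; the column order, the function names, and the sample annotation (one image containing two banknotes) are assumptions.

```python
import csv
import io
import xml.etree.ElementTree as ET

COLUMNS = ["filename", "width", "height", "class", "xmin", "ymin", "xmax", "ymax"]

# Hypothetical annotation with two labeled banknotes in one image.
SAMPLE_XML = """<annotation>
  <filename>img_0001.jpeg</filename>
  <size><width>256</width><height>256</height><depth>3</depth></size>
  <object><name>Five Birr</name>
    <bndbox><xmin>10</xmin><ymin>20</ymin><xmax>120</xmax><ymax>90</ymax></bndbox>
  </object>
  <object><name>Ten Birr</name>
    <bndbox><xmin>130</xmin><ymin>100</ymin><xmax>250</xmax><ymax>180</ymax></bndbox>
  </object>
</annotation>"""


def annotation_rows(xml_text):
    """Yield one CSV row per labeled bounding box in a PASCAL VOC annotation."""
    root = ET.fromstring(xml_text)
    filename = root.findtext("filename")
    width = int(root.findtext("size/width"))
    height = int(root.findtext("size/height"))
    for obj in root.findall("object"):
        coords = [int(obj.findtext(f"bndbox/{tag}"))
                  for tag in ("xmin", "ymin", "xmax", "ymax")]
        yield [filename, width, height, obj.findtext("name")] + coords


def annotations_to_csv(xml_texts):
    """Return CSV text with a header row and one row per bounding box."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(COLUMNS)
    for xml_text in xml_texts:
        writer.writerows(annotation_rows(xml_text))
    return buffer.getvalue()


csv_text = annotations_to_csv([SAMPLE_XML])
print(csv_text)
```

One image with two labeled banknotes produces two rows, which is why a CSV file can contain more rows than there are images.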

Creating a Label Map
The training and detection processes require every label to be mapped to an integer value (ID). A label map file named label_map.pbtxt was therefore created with the five classes and their integer representations, as shown in Fig. 2.
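A label map of this kind can be generated with a few lines of Python. The order of the IDs below is an assumption, since the figure itself is not reproduced here; IDs start at 1 because the Object Detection API reserves 0 for the background class.

```python
CLASSES = ["One Birr", "Five Birr", "Ten Birr", "Fifty Birr", "One Hundred Birr"]


def build_label_map(class_names):
    """Return label map text in the item { id / name } layout, IDs starting at 1."""
    items = []
    for class_id, name in enumerate(class_names, start=1):
        items.append("item {\n  id: %d\n  name: '%s'\n}\n" % (class_id, name))
    return "\n".join(items)


label_map_text = build_label_map(CLASSES)
print(label_map_text)
# The text could then be written to label_map.pbtxt with open(..., "w").
```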

TensorFlow Record (TFRecord) Generation
The Object Detection API requires all labeled training data to be in the TFRecord file format. The CSV files and the corresponding images were therefore converted to TFRecord files by adapting the generate_tfrecord.py file that comes with the API and modifying the row labels to One Birr, Five Birr, Ten Birr, Fifty Birr, and One Hundred Birr, as shown in Fig. 3.
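The part of the conversion that depends on the class names can be sketched without TensorFlow. The snippet below maps each row label to its integer ID and groups the CSV rows by image, normalizing box coordinates to the range [0, 1] before they would be serialized. The helper name `class_text_to_int` follows commonly circulated versions of generate_tfrecord.py and, like the ID order and the sample row, should be read as an assumption.

```python
from collections import defaultdict

# Assumed class-to-ID mapping; it must match the label map used for training.
CLASS_IDS = {"One Birr": 1, "Five Birr": 2, "Ten Birr": 3,
             "Fifty Birr": 4, "One Hundred Birr": 5}


def class_text_to_int(row_label):
    """Translate a class name from the CSV file into its integer ID."""
    if row_label not in CLASS_IDS:
        raise ValueError(f"Unknown class label: {row_label!r}")
    return CLASS_IDS[row_label]


def group_examples(rows):
    """Group CSV rows by image and normalize box coordinates to [0, 1]."""
    examples = defaultdict(lambda: {"xmins": [], "xmaxs": [], "ymins": [],
                                    "ymaxs": [], "classes_text": [], "classes": []})
    for row in rows:
        example = examples[row["filename"]]
        example["xmins"].append(row["xmin"] / row["width"])
        example["xmaxs"].append(row["xmax"] / row["width"])
        example["ymins"].append(row["ymin"] / row["height"])
        example["ymaxs"].append(row["ymax"] / row["height"])
        example["classes_text"].append(row["class"])
        example["classes"].append(class_text_to_int(row["class"]))
    return dict(examples)


# Hypothetical CSV row, already parsed into numeric types.
rows = [{"filename": "img_0001.jpeg", "width": 256, "height": 256,
         "class": "Ten Birr", "xmin": 64, "ymin": 32, "xmax": 192, "ymax": 128}]
examples = group_examples(rows)
print(examples["img_0001.jpeg"])
```

Each grouped entry corresponds to one record: the image bytes plus these lists would be serialized together into the TFRecord file.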

Pipeline Configurations and Fine-Tuning
Transfer learning was selected rather than building the model from scratch. Before the actual training process is triggered, the required information for the object detection training pipeline must be configured. For both pre-trained models, the provided configuration files (ssd_mobilenet_v2_coco.config and faster_rcnn_inception_v2_pets.config) were used as the basis, and some modifications were made to the default configuration. In both cases the configuration file requires the number of classes (five in our case), the location of the checkpoint file delivered with TensorFlow, the train and test TFRecord files created for the training and test datasets respectively, the label map file that holds the target classes/categories, and other important parameters. Accordingly, all the required modifications were made, including the definition of which model and which parameters to use for training. Various important adopted and customized parameters for the two selected pre-trained models are shown in Table 1.
Table 1: Important parameters for Faster R-CNN with Inception v2
The classification accuracy is the total number of correctly classified currencies divided by the total number of validation samples, as shown in Eq. (4 − 1). Faster R-CNN Inception obtained an average detection accuracy of 91.8%, as shown in Table 3, and SSD MobileNet obtained an average detection accuracy of 79.4%, as shown in Table 4.
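The accuracy definition above, together with the precision, recall, and F1-score reported for both models, can be written out as a short sketch. The standard definitions are used; the counts in the example are invented for illustration and are not results from this study.

```python
def accuracy(correct, total):
    """Correctly classified samples divided by all evaluated samples."""
    return correct / total


def precision_recall_f1(true_pos, false_pos, false_neg):
    """Precision, recall, and F1-score from raw detection counts."""
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts for one denomination.
precision, recall, f1 = precision_recall_f1(true_pos=90, false_pos=10, false_neg=5)
print(f"accuracy={accuracy(92, 100):.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

Averaging these per-class values over the five denominations gives figures of the kind reported for each model.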

Conclusions
In this study, the researchers developed and tested models that can detect Ethiopian currency banknotes in a real-time scenario by using transfer learning. The models are able to classify Ethiopian currencies into their respective categories. In-depth reviews of currency recognition studies were performed. Pre-trained Faster R-CNN Inception and SSD MobileNet models were used; both were trained on a custom dataset and evaluated in a real-time scenario. Both single-stage and two-stage detection approaches were thus applied. The detection process takes a frame from a live video as input and attempts to classify it as One Birr, Five Birr, Ten Birr, Fifty Birr, or One Hundred Birr.
Although a few studies have been conducted on Ethiopian currency recognition, their domain is far from that of this research work. As the first Ethiopian currency recognition research in the domain of blindness and vision impairment, the model's evaluation results can be considered good. Both models were evaluated on Ethiopian banknotes in various conditions, from new to worn. The classification accuracy of the models was evaluated using 500 currencies presented in real-time video. In this research work, the