Vision impairment is one of the major public health issues on the African continent. A WHO report indicates that 90% of blind people live in developing countries [1]. According to a 2006 national survey conducted by the Ethiopian Ministry of Health together with several non-governmental organizations, about 5.3% of the country's total population lives with blindness or low vision (1.6% blind and 3.7% low vision) [2].
In today's world, there is a tendency to forget that people who struggle to lead a normal life live among us. A person with a vision problem faces numerous difficulties in performing day-to-day activities that seem simple to the sighted, such as accessing information (printed media and mail), mobility, shopping, cooking, recognizing objects, and many other independent-living skills [3]. They also suffer serious secondary consequences: compared with a person without a vision problem, they are three times more likely to be unemployed, three times more likely to be involved in a motor vehicle accident, three times more likely to suffer from depression and anxiety disorders, three times more likely to be victims of sexual and other assault, and twice as likely to fall while walking [4].
As mentioned above, one of the major problems visually disabled people face in their day-to-day routine is recognizing things such as currency, especially paper currency. Currency is used almost everywhere [5]. Even though electronic transactions such as mobile banking and other electronic forms of payment are growing, hand-to-hand cash transactions are still widely used for daily routines in Ethiopia. Both paper and coin currencies exist, and the Ethiopian currency is called the "Birr". There are five banknotes, namely One Birr, Five Birr, Ten Birr, Fifty Birr, and One Hundred Birr, each with a specific size, color, and other features that make identification simple for sighted people. From a minor observation, almost all visually impaired people identify coins simply by touching the tactile markings stamped on each coin; to identify banknotes, however, they face various challenges. The major one is dependence on others: they must ask the well-known question "how much is this?", because Ethiopian banknotes carry no tactile patterns or other marks that would enable a blind or visually impaired person to identify their value. To minimize this dependence, blind people have devised numerous coping mechanisms, such as measuring a note between their fingers, storing different denominations in different pockets, organizing notes in ascending order of size, and measuring a note against a ready-made paper template. These approaches provide valuable support for the vision-impaired community, but they do not eliminate the challenge entirely: it is easy to mix up newly received banknotes, to forget which pocket holds which denomination, or to misjudge a note whose condition has deteriorated to worn.
This gap leaves the community dependent on others and causes discomfort in daily life. To fill it, and to minimize the resulting discomfort and negative feelings, the challenge can best be addressed through emerging technologies; we therefore propose a model based on a convolutional neural network that provides real-time Ethiopian currency recognition for visually impaired people.
Convolutional Neural Networks (CNNs) are biologically inspired by Hubel and Wiesel's early work [6], which studied the behaviour of the visual cortex in monkeys. Two main layer types, the convolutional layer and the pooling layer, preserve the 2D structure of input images inside a CNN. Each neuron in a layer is connected only to a small region of the layer before it, similar to receptive fields in the biological visual cortex [7]. Although CNNs showed great performance on simple tasks such as character recognition, they fell out of favor as problem complexity grew and computing resources remained limited, until their rebirth when Krizhevsky, Sutskever, and Hinton presented a dramatic image-classification accuracy improvement on ImageNet [7].
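The two layer types described above can be illustrated with a minimal sketch in plain NumPy (an illustration only; a real CNN learns its kernels rather than using a fixed one): a convolution connects each output unit to a small local patch of the input, and pooling downsamples while keeping the 2D layout.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution: each output unit sees only a small
    local patch of the input, mirroring local receptive fields."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(fmap, size=2):
    """Non-overlapping max pooling: downsamples the feature map
    while preserving its 2-D structure."""
    h, w = fmap.shape
    trimmed = fmap[:h - h % size, :w - w % size]
    return trimmed.reshape(h // size, size, w // size, size).max(axis=(1, 3))

image = np.arange(36, dtype=float).reshape(6, 6)  # toy 6x6 "image"
kernel = np.ones((3, 3)) / 9.0                    # simple averaging filter
fmap = conv2d(image, kernel)                      # 4x4 feature map
pooled = max_pool(fmap)                           # 2x2 after pooling
print(fmap.shape, pooled.shape)                   # (4, 4) (2, 2)
```

Stacking such convolution and pooling stages, with learned kernels and nonlinearities in between, is what gives a CNN its hierarchy of spatially local features.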
Several state-of-the-art approaches exist for object detection, but two techniques were nominated here because of their well-known speed and accuracy. This study therefore describes only these two: Faster R-CNN and the Single Shot MultiBox Detector (SSD).
Faster R-CNN was introduced with the main aim of sharing convolutional layers between detection and region-proposal generation. Researchers discovered that the feature maps produced by object detection networks can also be used to generate region proposals. The Region Proposal Network (RPN) is the fully convolutional part of Faster R-CNN that produces these proposals. In this research work, the custom dataset created for this study is trained and evaluated with a pre-trained model that combines Faster R-CNN and Inception-V3, used for detection and feature extraction respectively.
A Faster R-CNN network can be trained either for detection or only for generating regions of interest, that is, purely for feature extraction. The most common training procedure is as follows: first, two distinct networks are trained separately; then they are combined and fine-tuned. During fine-tuning, some layers are kept fixed while others are trained one after the other [8]. The trained network receives a single image as input, and the shared fully convolutional layers generate feature maps. The region proposal network then produces its region proposals from these feature maps. Finally, the feature maps together with the region proposals feed the last detection layers, which include a region-of-interest pooling layer followed by classification [9]. With shared convolutional layers, the computational cost of region proposals is very low, which gives the network an extra benefit. To handle detection windows of various sizes and shapes, special anchor boxes are used instead of a pyramid of different filter sizes; the anchor is the essential idea behind the sliding window [8].
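The anchor mechanism can be sketched as follows. At every sliding-window position on the feature map, one box is emitted per (scale, aspect-ratio) pair; the stride, scale, and ratio values below are the commonly cited Faster R-CNN defaults (3 scales x 3 ratios = 9 anchors per position) and are illustrative, not taken from this paper's configuration.

```python
import numpy as np

def generate_anchors(fmap_h, fmap_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Emit [x1, y1, x2, y2] anchor boxes in input-image coordinates,
    one per (scale, ratio) pair at each feature-map position."""
    anchors = []
    for y in range(fmap_h):
        for x in range(fmap_w):
            # centre of this sliding-window position in the input image
            cx, cy = x * stride + stride / 2, y * stride + stride / 2
            for s in scales:
                for r in ratios:
                    w = s * np.sqrt(r)   # width scales with sqrt(ratio)
                    h = s / np.sqrt(r)   # height scales inversely
                    anchors.append([cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2])
    return np.array(anchors)

a = generate_anchors(2, 2)
print(a.shape)   # (36, 4): 2*2 positions x 9 anchors each
```

The RPN then scores each anchor as object/background and regresses box offsets, so the dense anchor grid replaces an explicit multi-scale image pyramid.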
The Single Shot MultiBox Detector (SSD) [10] was presented as a method to detect objects in an image or a sequence of images (video) using a single shot. SSD is a one-stage detection approach: whereas region-based approaches such as Faster R-CNN perform the two critical stages, generating regions of interest and then classifying those regions, in distinct steps, SSD predicts bounding boxes and class labels in a single pass. Performing both operations in one shot makes SSD a good candidate for real-time detection because of the speed inherent in its design. The model also saves computational time because no region-proposal generation is used and no image segments are resampled, although its accuracy is lower than that of Faster R-CNN.
SSD handles objects of different sizes by feeding feature maps from several convolutional layers into the classifier. The network produces a large number of candidate regions (bounding boxes) with per-class scores; non-maximum suppression eliminates boxes below a certain threshold so that only boxes with higher confidence values proceed to the output. The SSD architecture allows end-to-end training and improves detector speed: it does everything in one shot and is therefore faster than other architectures, though it lags behind in detection accuracy. The model is easy to train and simple to integrate into any system that requires an object detection module because, as described earlier, it encapsulates all computation in a single network by removing the proposal-generation and feature-resampling stages. The authors evaluated the model's accuracy on the PASCAL VOC, MS COCO, and ILSVRC datasets and obtained comparable accuracy at much higher speed than other methods. SSD detection is composed of feature-map extraction and convolutional filters that detect object parts.
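The non-maximum suppression step mentioned above can be sketched in a few lines of NumPy (a generic greedy NMS, not this paper's exact implementation; the thresholds are illustrative): discard low-score boxes, then repeatedly keep the highest-scoring box and drop any remaining box that overlaps it too much.

```python
import numpy as np

def non_max_suppression(boxes, scores, iou_thresh=0.5, score_thresh=0.3):
    """Greedy NMS over [x1, y1, x2, y2] boxes with per-box scores."""
    mask = scores >= score_thresh          # drop low-confidence boxes
    boxes, scores = boxes[mask], scores[mask]
    order = scores.argsort()[::-1]         # best-scoring box first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the kept box with all remaining candidates
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = ((boxes[order[1:], 2] - boxes[order[1:], 0]) *
                  (boxes[order[1:], 3] - boxes[order[1:], 1]))
        iou = inter / (area_i + area_r - inter)
        order = order[1:][iou <= iou_thresh]   # drop heavy overlaps
    return boxes[keep], scores[keep]

boxes = np.array([[0, 0, 10, 10], [1, 1, 10, 10], [20, 20, 30, 30]], float)
scores = np.array([0.9, 0.8, 0.7])
kept_boxes, kept_scores = non_max_suppression(boxes, scores)
print(len(kept_boxes))  # 2: the two near-duplicate boxes collapse to one
```

In a currency-recognition setting this is what ensures a single banknote yields one final detection rather than a cluster of overlapping boxes.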
The TensorFlow Object Detection API is a CNN-based framework built on top of TensorFlow for object detection. Its ease of building, training, and deployment makes it popular among the research community. It provides numerous pre-trained models for object detection, trained on the Common Objects in Context (COCO), KITTI, and Open Images datasets. These models can be used either for inference, if one is interested only in the categories of those datasets, or for initializing a model trained on a custom dataset. The pre-trained COCO models, together with their execution speed, accuracy, and output type, are listed in the TensorFlow model zoo.
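Initializing a model zoo checkpoint for a custom dataset is driven by a `pipeline.config` protobuf file. The abridged fragment below is a sketch of its general shape, assuming an SSD model retrained for the five Birr denominations; all paths are placeholders and the elided fields (`...`) must come from the model's shipped configuration.

```
model {
  ssd {
    num_classes: 5   # the five Birr banknote denominations
    ...
  }
}
train_config {
  batch_size: 24
  fine_tune_checkpoint: "PATH_TO_PRETRAINED_CHECKPOINT"  # placeholder
  ...
}
train_input_reader {
  label_map_path: "PATH_TO_LABEL_MAP/label_map.pbtxt"    # placeholder
  tf_record_input_reader {
    input_path: "PATH_TO_TRAIN_RECORDS/train.record"     # placeholder
  }
}
```

Pointing `fine_tune_checkpoint` at a COCO-trained model while setting `num_classes` to the custom label count is what the API means by initializing from a pre-trained model.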