The approach was designed to detect and recognise American Sign Language (ASL) using machine learning; a Convolutional Neural Network (CNN) is applied to this task. The proposed sign language recognition system has four stages: image acquisition, image preprocessing, image segmentation, and image classification. The architecture of the proposed system is illustrated in Figure 1. The first stage is image acquisition, in which sign images are collected with a camera. The collected sign images are preprocessed with a Gaussian filter, contrast enhancement, histogram equalisation, and an averaging filter. The preprocessed images are then segmented using a Modified Canny Edge Detector (MCED). Finally, the segmented images are classified with a machine learning algorithm: a modified ResNet-101 CNN classifier is trained on the images, and the final CNN settings are fine-tuned until the outcome is sufficiently accurate.
Image preprocessing applies several morphological and filtering operations. In this phase, four methods are used to preprocess the image: a Gaussian filter, contrast enhancement, histogram equalisation, and an averaging filter.
(i) Gaussian filter
The first preprocessing step is noise removal. A Gaussian filter is used to remove noise from the image; it is also used to correct blurriness and to smooth the image (Kaur, S. et al. 2021).
$$G\left(x\right)=\frac{1}{\sigma \sqrt{2\pi }}{e}^{-\frac{{(x-\mu )}^{2}}{2{\sigma }^{2}}} \quad (1)$$
\(\sigma\) denotes the standard deviation, which determines the amount of smoothing, and \(\mu\) is the mean.
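A minimal NumPy sketch of Gaussian smoothing (Eq. 1 extended to two dimensions; the kernel size and \(\sigma\) below are illustrative choices, not values from the paper):

```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    """Build a normalised 2-D Gaussian kernel; sigma controls smoothing."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    g = np.exp(-(xx**2 + yy**2) / (2.0 * sigma**2))
    return g / g.sum()

def gaussian_smooth(image, size=5, sigma=1.0):
    """Denoise by convolving the image with the Gaussian kernel (edge padding)."""
    k = gaussian_kernel(size, sigma)
    pad = size // 2
    padded = np.pad(image, pad, mode="edge")
    out = np.empty_like(image, dtype=float)
    h, w = image.shape
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[i:i + size, j:j + size] * k)
    return out
```

In practice a library routine (e.g. an OpenCV or SciPy Gaussian blur) would replace the explicit loops; the loop form is kept here to mirror Eq. 1 directly.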
(ii) Contrast enhancement
The next preprocessing method is contrast enhancement, which makes image features stand out more clearly. It improves image quality and raises the contrast level of the image while preserving its brightness (Luque-Chang A., 2021).
$${I}_{n}\left(x\right)={I}_{n}\left(x\right)+a\left(\bar{{I}_{1}}-\bar{{I}_{2}}\right)\left(I-{I}_{n}\left(x\right)\right){I}_{1}\left(x\right) \quad (2)$$
\(x\) is the pixel location, \({I}_{n}\left(x\right)\) is the pixel intensity value, \(\bar{{I}_{1}}\) and \(\bar{{I}_{2}}\) are the average pixel intensities, and \(a\) is a constant.
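Eq. 2 involves constants not fully specified here, so as a generic illustration of contrast enhancement, a simple min-max contrast stretch can be sketched instead (a common substitute, not the paper's exact update rule):

```python
import numpy as np

def contrast_stretch(image, low=0.0, high=1.0):
    """Linearly rescale intensities so the darkest pixel maps to `low`
    and the brightest to `high`, increasing global contrast."""
    img = image.astype(float)
    mn, mx = img.min(), img.max()
    if mx == mn:                      # flat image: nothing to stretch
        return np.full_like(img, low)
    return low + (img - mn) * (high - low) / (mx - mn)
```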
(iii) Equalisation of histogram
Histogram equalisation is the next preprocessing approach. It enhances the contrast of an image by altering the intensity distribution of its histogram (Reddy, K.S. and Jaya, T. 2021).
$${D}_{i}=\left[\sum _{j=0}^{i}{N}_{j}\right]\times \frac{\text{maximum intensity level}}{\text{number of pixels}} \quad (3)$$
\({N}_{j}\) is the number of pixels at intensity level \(j\), and \(i\), \(j\) index the intensity levels.
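Eq. 3 translates directly into code: the cumulative pixel count up to level \(i\) is scaled by the maximum intensity over the total pixel count (assuming 8-bit images with 256 levels):

```python
import numpy as np

def equalise_histogram(image, levels=256):
    """Map each intensity i to D_i = (cumulative count up to i) *
    max_level / total_pixels, as in Eq. 3."""
    img = image.astype(int)
    hist = np.bincount(img.ravel(), minlength=levels)   # N_j per level
    cdf = np.cumsum(hist)                               # sum of N_j for j <= i
    mapping = np.round(cdf * (levels - 1) / img.size).astype(int)
    return mapping[img]
```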
(iv) Averaging filter
The next preprocessing approach is average filtering, which smooths images by reducing the intensity difference between neighbouring pixels. The averaging filter scans the image pixel by pixel, replacing each value with the average of the surrounding pixels. The averaging filter is given by Eq. 4 (Shedbalkar, J. et al. 2021).
$$y\left[n\right]=\frac{1}{N}\sum _{i=0}^{N-1}x\left[n-i\right] \quad (4)$$
\(N\) is the length of the average, \(x[n]\) is the present input, \(x[n-i]\) are the previous inputs, and \(y[n]\) is the present output.
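Eq. 4 can be sketched directly; treating samples before the start of the signal as zero is an implementation choice, not something the paper specifies:

```python
import numpy as np

def moving_average(x, N=3):
    """y[n] = (1/N) * sum_{i=0}^{N-1} x[n-i]  (Eq. 4), with zero
    history assumed for samples before the start of the signal."""
    x = np.asarray(x, dtype=float)
    y = np.zeros_like(x)
    for n in range(len(x)):
        window = x[max(0, n - N + 1):n + 1]
        y[n] = window.sum() / N
    return y
```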
3.2 Image Segmentation
Image segmentation splits the image into portions according to its features and characteristics. In this method, the Modified Canny Edge Detector (MCED) is used for segmentation. MCED detects edges and hence defines the boundaries of objects. In MCED, double thresholding and edge tracking are applied, and the final segmented image is produced.
(i) Modified Canny Edge Detector (MCED)
Step 1
The edge direction is found from gradients in the x and y dimensions. In CDE the direction is computed with 3 × 3 convolution kernels, but CDE does not give an exact direction, so MCED is used instead. A larger Sobel operator ignores unnecessary features without affecting the detection of actual edges. The middle-pixel technique calculates the gradient intensity of non-edge pixels using a pair of 5 × 5 convolution kernels. The first pixel of interest (POI) is placed at the centre of the neighbouring pixels, and the relative distance between the centre pixel of interest and each neighbouring pixel is then calculated.
Figure (2a)
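As a baseline for this gradient step, the conventional 3 × 3 Sobel kernels used by CDE can be sketched as follows (this is the standard variant only; the larger MCED operator and the local linear estimation are described next):

```python
import numpy as np

# Standard 3x3 Sobel kernels (the CDE baseline; MCED replaces these
# with a larger operator and local linear kernel estimation).
SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def sobel_gradients(image):
    """Return (Gx, Gy) gradient maps via 3x3 correlation with edge padding."""
    img = np.pad(image.astype(float), 1, mode="edge")
    h, w = image.shape
    gx = np.empty((h, w))
    gy = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            patch = img[i:i + 3, j:j + 3]
            gx[i, j] = np.sum(patch * SOBEL_X)
            gy[i, j] = np.sum(patch * SOBEL_Y)
    return gx, gy
```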
The Taylor expansion of bivariate functions, used in nonparametric regression, is applied in this study to estimate the regression function, and local linear kernel estimation evaluates the gradients (\({G}_{x}\) and \({G}_{y}\)) via Eq. 5 (Qin, X., 2021).
$${y}_{i}\approx \widehat{{y}_{1}}+\varDelta x\cdot {G}_{x}+\varDelta y\cdot {G}_{y} \quad (5)$$
where \(\varDelta x\) and \(\varDelta y\) are the pixel distances and \({y}_{i}\) is the intensity of a neighbouring pixel of the pixel of interest.
The pixel of interest is at (0, 0), and a 25 × 3 matrix was created to define the x and y directions, with the first column filled with 1's and the other two columns holding the relative pixel locations. The weight of each neighbour is calculated using Eq. 6 (Qin, X., 2021).
$$W\left(i,j\right)=\text{exp}\left(-d\left(i,j\right)/2{k}^{2}\right) \quad (6)$$
The gradient intensity is obtained from the weighted residual sum of squares, defined in Eq. 7 [32].
$$Q={\left(Y-X\beta \right)}^{\text{'}}W\left(Y-X\beta \right) \quad (7)$$
Minimising Q with respect to \(\beta\) yields the horizontal and vertical gradients (\({G}_{x}\), \({G}_{y}\)) as the second and third components of the solution in Eq. 8 (Qin, X., 2021).
$$\widehat{\beta }={\left({X}^{\text{'}}WX\right)}^{-1}{X}^{\text{'}}WY=AY \quad (8)$$
\(Y\) contains the intensities of the pixels closest to the pixel of interest. The first row of the matrix \(A\) gives the intercept, the second row gives the gradient along the x-axis, and the third row gives the gradient along the y-axis.
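The weighted least-squares solution of Eq. 8 can be sketched directly; the Gaussian weight form follows Eq. 6, and the kernel bandwidth `k` is an illustrative parameter:

```python
import numpy as np

def local_linear_gradient(y, coords, k=1.0):
    """Estimate (intercept, Gx, Gy) for the pixel of interest at (0, 0)
    from neighbourhood intensities `y` at relative positions `coords`,
    via the weighted least-squares solution of Eq. 8:
    beta = (X' W X)^(-1) X' W Y."""
    X = np.column_stack([np.ones(len(coords)), coords[:, 0], coords[:, 1]])
    d = np.sum(coords**2, axis=1)               # squared distance to the POI
    W = np.diag(np.exp(-d / (2 * k**2)))        # Gaussian weights (Eq. 6 form)
    beta = np.linalg.inv(X.T @ W @ X) @ X.T @ W @ y
    return beta
```

For intensities that are exactly linear in position, the estimate recovers the intercept and both gradients exactly, whatever the weights.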
After that, the edge pixels must be located. In CDE the convolution kernel fully maps every pixel, which affects the precision of the edge pixels.
Figure (2b)
To address this problem, MCED uses different operators for edge pixels, generating an appropriate convolution kernel whose fit is measured by Eq. 9. An alternative set of surrounding pixels is used to determine the gradient of the pixel at (0, 0). The procedure is then repeated, without alteration, for every point in the original image (Qin, X., 2021).
$$SSQ=\sum _{i=1}^{n}{\left({y}_{i}-\widehat{{y}_{i}}\right)}^{2} \quad (9)$$
Step 2
Non-maxima suppression creates a narrow line for the edge by tracking along the edge path and suppressing any pixel value that is not designated an edge.
The magnitude of a single pixel q on a 45-degree edge is compared with the magnitudes of r and p, where r and p are values interpolated from the two nearest pixels. For example, r can be calculated with Eq. 10 (Qin, X., 2021).
$$r=\alpha b+\left(1-\alpha \right)a \quad (10)$$
where a and b are the original magnitudes of the pixels closest to r, and
$$\alpha =\left|{G}_{x}\right|/\left|{G}_{y}\right|$$
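Eq. 10 and the suppression test can be sketched as follows; the rule of keeping q only if it is at least as large as both interpolated neighbours is the standard non-maxima suppression criterion:

```python
def interpolated_magnitude(a, b, gx, gy):
    """Eq. 10: r = alpha*b + (1 - alpha)*a with alpha = |Gx| / |Gy|.
    `a` and `b` are the magnitudes of the two pixels nearest to r."""
    alpha = abs(gx) / abs(gy)
    return alpha * b + (1 - alpha) * a

def keep_after_nms(q, r, p):
    """Pixel q survives non-maxima suppression only if it is at least as
    large as both interpolated neighbours along the gradient direction."""
    return q >= r and q >= p
```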
Step 3
The dual thresholding approach is often used to prevent streaks and to filter out the weak gradient values generated by noise fluctuations.
Step 4
Edge tracking is a technique for identifying strong and weak edges in an image and subsequently removing any weak edges.
Step 5
The final processed image is displayed once the weak edges removed in the previous phase are set to zero.
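Steps 3 to 5 can be sketched together; the single-pass 8-neighbour check below is a simplification of full edge tracking by hysteresis, and the thresholds are illustrative:

```python
import numpy as np

def double_threshold_hysteresis(mag, low, high):
    """Steps 3-5: label pixels strong (>= high) or weak (>= low), keep a
    weak pixel only if one of its 8 neighbours is strong, and set all
    remaining weak edges to zero."""
    strong = mag >= high
    weak = (mag >= low) & ~strong
    out = strong.astype(float)
    h, w = mag.shape
    for i in range(h):
        for j in range(w):
            if weak[i, j]:
                i0, i1 = max(0, i - 1), min(h, i + 2)
                j0, j1 = max(0, j - 1), min(w, j + 2)
                if strong[i0:i1, j0:j1].any():
                    out[i, j] = 1.0
    return out
```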
3.3 Classification
Image classification is the process of employing rules to locate and label groups of pixels or vectors inside an image. The image is classified using a modified deep residual CNN classifier. CNN represents a significant advancement in image identification. CNNs have an input layer, an output layer, and a hidden layer that includes convolutional, ReLU, and pooling layers, as well as fully connected layers.
(i) Modified deep residual CNN
In the proposed method, a modified deep residual ResNet-101 CNN is used to recognise 36 classes: A to Z and 0 to 9. ResNet-101 is a Convolutional Neural Network with 101 layers. A version of the network pre-trained on more than a million images from the ImageNet database can be loaded. We test the proposed method's ability to recognise various types of sign images.
As Figure 3 shows, the first layer of the Convolutional Neural Network is the convolution layer, which applies the convolution operation to the input. The summed result is then passed through an activation function such as ReLU, and the output goes to the pooling layer. The operation of the convolution layer is given by Eq. 11 (Liew, W.S. et al. 2021).
$${O}_{i\left(x,y\right)}={f}^{\text{'}}\left(\sum _{i=0}^{m-1}\sum _{j=0}^{m-1}w\left(i,j\right)\cdot I\left(x+i,y+j\right)+bias\right) \quad (11)$$
The pooling layer reduces the number of variables in the network and the dimensions of the feature map; max pooling is used to extract features such as edges and points. A fully connected layer then connects the neurons of the current layer to the next layer, and the training dataset is used to classify an input image into categories. The output of the neural network is computed using Eq. 12 (Liew, W.S. et al. 2021).
$$y=\sigma \left({w}^{L}\cdots \sigma \left({w}^{2}\sigma \left({w}^{1}x+{b}^{1}\right)+{b}^{2}\right)\cdots +{b}^{L}\right) \quad (12)$$
\(x\) is the input, \(\sigma\) is the activation function, \(w\) denotes the network weights, and \(b\) the biases.
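The max-pooling operation described above can be sketched in NumPy (non-overlapping windows; the window size is an illustrative choice):

```python
import numpy as np

def max_pool2d(x, size=2):
    """Reduce each non-overlapping size x size window to its maximum,
    shrinking the feature map while keeping the strongest responses."""
    h, w = x.shape
    h2, w2 = h // size, w // size
    x = x[:h2 * size, :w2 * size]            # drop any ragged border
    return x.reshape(h2, size, w2, size).max(axis=(1, 3))
```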
The original ResNet-101 has 101 layers, and networks with different structures produce different outcomes. Several changes to ResNet-101 decreased its size, computational cost, and number of residual blocks, and the ReLU activation between residual blocks was removed (https://www.kaggle.com/ayuraj/american-sign-language-dataset). To keep the output size of each layer consistent, a max-pooling layer was added before each convolutional block. The modified ResNet-101 contains fewer filters and lower-complexity parameters than the original. Computation time depends on the amount of data and the size of the neural network; because deeper and more advanced neural networks have high processing complexity, the modified ResNet-101 is employed in this proposed solution. It reduces training and inference time while preserving as much of the original performance as possible. Finally, the proposed method classifies the 36 classes with the modified ResNet-101, and all 36 signs are identified by the proposed ASL system.
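The modifications to ResNet-101 are not specified in enough detail here to reproduce, but the residual connection at the heart of every ResNet block can be sketched in NumPy (fully connected weights stand in for the convolutional layers, purely for illustration):

```python
import numpy as np

def relu(x):
    """Rectified linear unit activation."""
    return np.maximum(x, 0)

def residual_block(x, w1, w2):
    """y = ReLU(F(x) + x): the skip connection that defines a residual
    block, where F is two linear transforms with a ReLU in between."""
    fx = w2 @ relu(w1 @ x)
    return relu(fx + x)
```

The skip connection means that even when the learned transform F contributes nothing, the input passes through unchanged, which is what lets very deep networks such as ResNet-101 train stably.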