The study compares the efficiency and accuracy of CNN models with two other supervised learning algorithms, namely Support Vector Machines (SVM) and Random Forest (RF). To determine the optimum parameters for the traditional models, the grid search algorithm (GridSearch) from the scikit-learn library was used for hyperparameter tuning. For RF, the most critical hyperparameters were assumed to be the number of estimators (n) and the split criterion (Gini or entropy). A value of 200 was chosen for the number of trees in the forest (n), and the Gini criterion was selected, as both the average training time and the cross-validation score were acceptable. The resulting RF model achieved an accuracy of 72% with a training time of 0.142 minutes and a standard deviation of 0.08. On the other hand, the SVM model achieved an accuracy of 62.5% with a training time of 2.43 minutes. Such relatively low accuracies suggest that these traditional models are not well suited to classifying solid waste items from RGB images. This agrees with the findings of Shi et al. (2021), who reported that traditional machine learning algorithms cannot fit the data and balance the sample well, and with Sothe et al. (2020), who indicated that CNN is more accurate than SVM and RF. Therefore, a CNN model was developed in this study. The model was developed using both the local image dataset and the TrashNet dataset, for both single-object and multiple-object images (Mao et al., 2022). Since images of solid waste generated in Jordan were used in the model development, the model was named JONET.
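The tuning step described above can be sketched with scikit-learn's `GridSearchCV`. This is a minimal illustration under stated assumptions, not the study's script: the data are synthetic (generated with `make_classification` as a stand-in for flattened image features), and the candidate values, the three-fold cross-validation and the `random_state` settings are illustrative choices. Only the two tuned hyperparameters (number of estimators and criterion) come from the text.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for flattened image features; NOT the study's data.
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=10,
    n_classes=3, random_state=0,
)

# The two hyperparameters named in the text; candidate values are assumptions.
param_grid = {
    "n_estimators": [50, 100, 200],
    "criterion": ["gini", "entropy"],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,                 # assumed number of folds
    scoring="accuracy",
)
search.fit(X, y)

# Cross-validated score and mean fit time for each candidate, so both
# accuracy and training cost can be weighed when picking a setting.
results = search.cv_results_
for params, score, fit_time in zip(
    results["params"], results["mean_test_score"], results["mean_fit_time"]
):
    print(params, round(float(score), 3), round(float(fit_time), 3))

print("best:", search.best_params_)
```

Inspecting `mean_fit_time` next to `mean_test_score` mirrors the trade-off described in the text, where a setting is accepted when both training time and cross-validation score are acceptable.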
3.1 JONET Model
CNN models have been applied to image processing and recognition in many areas (Bobulski and Kubanek, 2019), including solid waste classification (Shi et al., 2021; Xia et al., 2021; Malik et al., 2022) and intelligent waste identification and recycling (Wu et al., 2023). The concept of transfer learning (Wu et al., 2023) was used in this research to transfer weights from pre-trained models and then retrain them to be compatible with the local garbage and TrashNet datasets. Fine-tuning was also applied by removing the original classifier of the pre-trained models and replacing it with a new one suited to the dataset (Kaya et al., 2019).
To decide on the proper model architecture, a variety of fully connected layers with different numbers of neurons were tested and analyzed. Figure 2a illustrates the selected architecture of the JONET model. As shown, the model consists of two blocks: the first performs feature extraction, while the second is used for image classification. Feature extraction was accomplished by unfreezing all convolutional layers of the pre-trained base model. The extracted features were then passed to a fully connected layer of 1024 neurons for classification of the solid waste items. Sixteen base models were tested and evaluated, as shown in Fig. 2b. DenseNet201 gave the highest accuracy (92.7%) among the tested pre-trained models, with only 67 epochs, and was therefore selected for the classification stage of model development. Various researchers have adopted different pre-trained base models. For example, Kryzhanovsky et al. (2020) introduced a method for selecting a pre-trained CNN for fine-tuning on a new image sample. The method estimates the separability of the pre-trained CNN features for the image classification problem at hand and selects the CNN with the highest separability. The choice of a suitable base model depends on several factors, which may include the CNN architecture and the dataset size and characteristics; a more complex CNN does not always yield a higher accuracy (Choromanski et al., 2020).
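The two-block architecture described above (a fully unfrozen pre-trained base followed by a 1024-neuron fully connected layer) could be sketched in Keras as follows. This is an illustrative sketch, not the authors' implementation: the framework, the global average pooling layer, the ReLU activation, the optimizer and the learning rate are all assumptions, and `build_classifier` is a hypothetical name. The six output classes and the 256x256 input size are taken from the tables in this section. Input preprocessing and data loading are omitted.

```python
import tensorflow as tf


def build_classifier(num_classes=6, input_shape=(256, 256, 3), weights=None):
    """Build a DenseNet201-based classifier (hypothetical helper).

    weights=None builds the network with random weights and avoids a
    download; pass weights="imagenet" to start from pre-trained weights,
    as transfer learning requires.
    """
    # Block 1: feature extraction with the pre-trained base, classifier removed.
    base = tf.keras.applications.DenseNet201(
        include_top=False, weights=weights, input_shape=input_shape
    )
    base.trainable = True  # all convolutional layers unfrozen, as in the text

    inputs = tf.keras.Input(shape=input_shape)
    x = base(inputs)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)  # assumed pooling step

    # Block 2: new classifier head with a 1024-neuron fully connected layer.
    x = tf.keras.layers.Dense(1024, activation="relu")(x)
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)

    model = tf.keras.Model(inputs, outputs)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),  # assumed
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model


model = build_classifier()
print(model.output_shape)
```

Swapping `DenseNet201` for another constructor in `tf.keras.applications` is one way a set of candidate base models could be compared under the same classifier head.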
The performance of the developed deep learning algorithm for classifying the images was evaluated using the performance indices (Chowdhury et al., 2020) accuracy, precision, recall, and F1 score; accuracy, precision, and F1 score are given by Equations 1 to 3. Precision is the fraction of items predicted as a given category that actually belong to that category, while recall is the fraction of items of a given category that were correctly predicted. These indices should be as high as possible.
$$\text{Accuracy}=\frac{\text{TP}+\text{TN}}{\left(\text{TP}+\text{FN}\right)+\left(\text{FP}+\text{TN}\right)} \tag{1}$$
$$\text{Precision}=\frac{\text{TP}}{\text{TP}+\text{FP}} \tag{2}$$
$$\text{F1 Score}=\frac{2\,\text{TP}}{2\,\text{TP}+\text{FN}+\text{FP}} \tag{3}$$
where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives, respectively.
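Recall appears in the classification reports but is not written out in Equations 1 to 3. Assuming the standard definition and the same symbols, it can be expressed as shown here, and the F1 score of Eq. (3) then follows as the harmonic mean of precision and recall:

$$\text{Recall}=\frac{\text{TP}}{\text{TP}+\text{FN}}$$

$$\text{F1 Score}=\frac{2\cdot\text{Precision}\cdot\text{Recall}}{\text{Precision}+\text{Recall}}=\frac{2\,\text{TP}}{2\,\text{TP}+\text{FN}+\text{FP}}$$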
The confusion matrix and classification report for the JONET model tested on the TrashNet dataset are presented in Table 2. Glass has the highest prediction ratio, with an F1 score of 0.98: 100 of the 101 glass images were correctly predicted (99.0%), followed by paper (98.3%) and metal (97.6%), while the plastic category has the lowest prediction ratio (91.8%). The overall accuracy of the model by Eq. (1) is 96.06%, with a training time of 28 min. Using a four-layer CNN and a total of 400 images, Altikat et al. (2022) reported lower prediction accuracies for all items, ranging from 56.7% for plastic to 76.7% for organics. This lower accuracy may be attributed to the relatively small number of images used in training and testing the model, which indicates that the number of images is an important factor in determining the prediction accuracy of a CNN model. For example, Niu et al. (2022) reported an accuracy of 90.62% using 3680 images. Another study by Togacar et al. (2020), which applied a CNN to 13,966 images of the organic class and 11,111 images of the recyclables class, achieved an accuracy of up to 99.95%.
Table 2
Confusion matrix and classification report for JONET tested on the TrashNet dataset (rows: true classes; columns: predicted classes).
| True class | Cardboard | Glass | Metal | Paper | Plastic | Misc.* | Total | Precision (%) | F1 score (%) |
|---|---|---|---|---|---|---|---|---|---|
| Cardboard | 75 | 0 | 0 | 5 | 0 | 1 | 81 | 92.6 | 96 |
| Glass | 0 | 100 | 0 | 0 | 1 | 0 | 101 | 99.0 | 98 |
| Metal | 0 | 2 | 80 | 0 | 0 | 0 | 82 | 97.6 | 97 |
| Paper | 0 | 0 | 1 | 117 | 0 | 1 | 119 | 98.3 | 96 |
| Plastic | 0 | 2 | 2 | 1 | 89 | 3 | 97 | 91.8 | 95 |
| Misc.* | 0 | 0 | 0 | 1 | 0 | 27 | 28 | 96.4 | 90 |
| Total | 75 | 104 | 83 | 124 | 90 | 32 | 508 | | |
| Recall (%) | 100.0 | 96.2 | 96.4 | 94.4 | 98.9 | 84.4 | | | |
*Miscellaneous
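The figures in the classification report can be recomputed directly from the confusion matrix. The sketch below is a plain-Python illustration rather than the authors' code; the matrix values are copied from Table 2, and the names `MATRIX` and `summarize` are hypothetical. For each class, the diagonal count is divided by its row total (the ratio tabulated as "Precision") and by its column total (the ratio tabulated as "Recall"). With true classes in rows, the scikit-learn convention would attach the opposite labels to these two ratios, so the code uses neutral names. It assumes every row and column total is non-zero.

```python
# Confusion matrix from Table 2 (rows: true classes, columns: predicted classes).
CLASSES = ["Cardboard", "Glass", "Metal", "Paper", "Plastic", "Misc."]
MATRIX = [
    [75, 0, 0, 5, 0, 1],
    [0, 100, 0, 0, 1, 0],
    [0, 2, 80, 0, 0, 0],
    [0, 0, 1, 117, 0, 1],
    [0, 2, 2, 1, 89, 3],
    [0, 0, 0, 1, 0, 27],
]


def summarize(matrix):
    """Return (overall accuracy, [(row_ratio, col_ratio, f1), ...])."""
    n = len(matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    total = sum(row_totals)
    correct = sum(matrix[i][i] for i in range(n))

    per_class = []
    for i in range(n):
        tp = matrix[i][i]
        row_ratio = tp / row_totals[i]   # correct / all images of this true class
        col_ratio = tp / col_totals[i]   # correct / all images predicted as this class
        # row_total + col_total = 2*TP + FN + FP, which matches Eq. (3).
        f1 = 2 * tp / (row_totals[i] + col_totals[i])
        per_class.append((row_ratio, col_ratio, f1))
    return correct / total, per_class


accuracy, per_class = summarize(MATRIX)
print(f"overall accuracy: {accuracy:.2%}")
for name, (row_ratio, col_ratio, f1) in zip(CLASSES, per_class):
    print(f"{name:10s} row={row_ratio:.1%} col={col_ratio:.1%} f1={f1:.0%}")
```

The same function can be applied to any of the other confusion matrices in this section by replacing `MATRIX`.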
Using the local garbage dataset alone in JONET also gave relatively high results, as shown in Table 3. The glass class has the highest F1 score of 98% with a precision of 96%, while plastic has the lowest F1 score of 92% with a precision of 87.5%. The overall accuracy was 94.42%, with a training time of 17.67 minutes over 59 epochs. Mao et al. (2022) reported an accuracy of 92.12% when testing on local garbage in Taiwan, with a training time of 12 hours.
Table 3
Confusion matrix and classification report for JONET tested on the local garbage dataset (rows: true classes; columns: predicted classes).
| True class | Cardboard | Glass | Paper | Plastic | Misc.* | Metal | Total | Precision (%) | F1 score (%) |
|---|---|---|---|---|---|---|---|---|---|
| Cardboard | 60 | 0 | 0 | 0 | 0 | 0 | 60 | 100.0 | 96 |
| Glass | 2 | 120 | 2 | 0 | 0 | 1 | 125 | 96.0 | 98 |
| Paper | 0 | 1 | 91 | 1 | 0 | 2 | 95 | 95.8 | 96 |
| Plastic | 1 | 0 | 0 | 28 | 3 | 0 | 32 | 87.5 | 92 |
| Misc.* | 1 | 0 | 1 | 0 | 98 | 3 | 103 | 95.1 | 92 |
| Metal | 1 | 0 | 1 | 0 | 10 | 111 | 123 | 90.2 | 93 |
| Total | 65 | 121 | 95 | 29 | 111 | 117 | 538 | | |
| Recall (%) | 92.3 | 99.2 | 95.8 | 96.6 | 88.3 | 94.9 | | | |
*Miscellaneous
Table 4 presents the classification report and confusion matrix of the JONET model for tests on the combined datasets (TrashNet and local garbage). With the combined datasets, the paper class has the highest prediction ratio, with an F1 score of 0.95, where 175 out of 182 images were correctly predicted (96.2%). The metal class has the lowest prediction ratio (75%) and the lowest F1 score (0.80). With 30 minutes of training, the overall accuracy of the model is 92.73%. The lower accuracy on the combined dataset, compared with testing on the individual datasets, may be explained by the condition of the solid waste items in the two datasets: the TrashNet items are clean, while the items in the local dataset carry some dirt, as they were collected from street containers. Similar results were reported by Wu et al. (2023), who indicated that CNNs are reliable in identifying recyclable solid waste items, with accuracies ranging from 70% to 99.6%. However, when the TrashNet dataset was extended with additional data, either from the internet or generated locally by the researchers, the prediction accuracy ranged between 79.27% and 96.5%.
Table 4
Confusion matrix and classification report for JONET tested on the TrashNet and local garbage datasets (rows: true classes; columns: predicted classes).
| True class | Cardboard | Paper | Glass | Plastic | Metal | Misc.* | Total | Precision (%) | F1 score (%) |
|---|---|---|---|---|---|---|---|---|---|
| Cardboard | 193 | 8 | 1 | 3 | 2 | 6 | 213 | 90.6 | 93 |
| Paper | 5 | 175 | 0 | 1 | 1 | 0 | 182 | 96.2 | 95 |
| Glass | 0 | 0 | 191 | 9 | 0 | 2 | 202 | 94.6 | 94 |
| Plastic | 1 | 1 | 9 | 206 | 3 | 5 | 225 | 91.6 | 92 |
| Metal | 1 | 2 | 2 | 3 | 33 | 3 | 44 | 75.0 | 80 |
| Misc.* | 0 | 2 | 3 | 2 | 0 | 158 | 165 | 95.8 | 93 |
| Total | 200 | 188 | 206 | 224 | 39 | 174 | 1031 | | |
| Recall (%) | 96.5 | 93.1 | 92.7 | 92.0 | 84.6 | 90.8 | | | |
*Miscellaneous
Table 5 summarizes the findings for the models tested in this study. The traditional machine learning models, namely SVM and RF, have the lowest accuracies among the tested models, at 62.5% and 72%, respectively. Among the deep learning models, the highest accuracy was 96.06%, obtained for the JONET model tested on the TrashNet dataset with a training time of 28 minutes. Testing the JONET model on the combined TrashNet and local image datasets gave the lowest JONET accuracy of 92.73%, with a training time of 30 minutes, while using the local garbage image dataset alone resulted in an accuracy of 94.42% with a training time of 17.67 minutes.
Table 5
Summary of the accuracy of the machine learning models tested in the study
| Model number | Model | Dataset | Accuracy (%) | Data split (train:test) | Training time (min) | Epochs | Input image size (pixels) |
|---|---|---|---|---|---|---|---|
| 1 | Support Vector Machine (SVM) | TrashNet + Local Garbage | 62.5 | 80:20 | 2.43 | NA* | 32x32 |
| 2 | Random Forest (RF) | TrashNet + Local Garbage | 72.0 | 80:20 | 0.142 | NA* | 32x32 |
| 3 | JONET model | TrashNet + Local Garbage | 92.73 | 80:20 | 30 | 67 | 256x256 |
| 4 | JONET model | Local Garbage | 94.42 | 80:20 | 17.67 | 59 | 256x256 |
| 5 | JONET model | TrashNet | 96.06 | 80:20 | 28 | 97 | 256x256 |
*NA: not applicable
3.4 Multiple Object Model
In real-life conditions, where segregation of solid waste at source is not practiced, solid waste items are mixed together. Models for detecting multiple solid waste items in a single image are therefore needed, a case that is rarely addressed in the literature (Mao et al., 2022).
To enhance the capability of the CNN to detect classes of solid waste in multiple-object images, a multiple-object model was developed. The items were segmented from the multiple-object image (Mazzeo et al., 2019, Melinto et al., 2020) by splitting the image into single objects, which was achieved using the Canny edge detection algorithm. The image of each item was then passed to the single-object classification model. Figure 3.a shows the steps of extracting each object from the image by identifying the waste item with a bounding box, while Fig. 3.b gives an example of the prediction accuracy of the model using the segmentation technique with the JONET model.
In certain cases, the model confused glass with plastic, which may be attributed to the transparent nature of both materials. Furthermore, the bounding boxes of different objects were observed to overlap, so that in some cases an object was not clearly identified. A similar problem was reported by Mitra (2020), who eliminated it by decreasing the threshold value during model testing. The results of the current study indicate the need for further research to develop reliable models for predicting solid waste items from multiple-object images.
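One common way to quantify and reduce the bounding-box interference noted above is to measure the intersection over union (IoU) of box pairs and discard boxes that overlap an already accepted box too strongly. The sketch below illustrates that general technique; it is not the method used in this study or by Mitra (2020), and the 0.5 threshold and the function names are assumptions. Boxes are `(x, y, w, h)` tuples.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def suppress_overlaps(boxes, threshold=0.5):
    """Keep larger boxes first; drop any box whose IoU with a kept box exceeds threshold."""
    kept = []
    for box in sorted(boxes, key=lambda b: b[2] * b[3], reverse=True):
        if all(iou(box, other) <= threshold for other in kept):
            kept.append(box)
    return kept


candidates = [(0, 0, 10, 10), (1, 1, 10, 10), (50, 50, 5, 5)]
filtered = suppress_overlaps(candidates)
print(filtered)
```

Lowering `threshold` removes more overlapping boxes at the risk of merging objects that genuinely lie close together, so the value would need tuning on real multiple-object images.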