To thoroughly investigate the proposed method's accuracy, we test its capability to correctly assign sandfly specimens at various taxonomic levels: subfamily, genus and subgenus, and species.
Test for accuracy at the family/subfamily taxonomic level.
The accuracy of the classifier was tested at several taxonomic levels, from the family (Psychodidae) through the genera (Phlebotomus, Lutzomyia (Lutzomyia and Migonemyia if the revised taxonomy25,26 is taken into account) and Sergentomyia) down to the species level (12 species). The Psychodidae family encompasses about 2600 species; however, only specimens belonging to the Phlebotominae subfamily are included in the dataset because of their medical importance as pathogen vectors. We first explored the accuracy of the classifier trained on the Phlebotomine dataset combined with non-Psychodidae specimens from the Calliphoridae, Culicidae, Glossinidae, Muscidae, and Tabanidae datasets18. The CNN was trained on this combination to improve the model's accuracy. The dataset contained 1673 pictures of Phlebotomine WIPs. However, five species were represented by too few WIP pictures and were discarded from the training dataset of the Phlebotomine subset. Using this picture set, we assessed how accurately the process discriminates the Psychodidae family from non-Psychodidae dipterans. With our dataset and method, the accuracy of the automatic classification reaches 99.8% (Table 2). Because wing size is not among the descriptors selected during the training process, the classification accuracy presumably relies on other descriptors that are more specific to the WIPs.
Table 2: Psychodidae vs. non-Psychodidae classification accuracy
| Truth / Predicted | Psychodidae | non-Psychodidae dipteran |
|---|---|---|
| Psychodidae (331) | 99.7% | 1 |
| non-Psychodidae dipteran (685) | 1 | 99.8% |
Number of pictures in bold
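The relation between the counts and the percentages in Table 2 can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: it assumes that each off-diagonal "1" in Table 2 is a single misclassified picture, and the names `accuracy_summary` and `TABLE2_COUNTS` are hypothetical.

```python
def accuracy_summary(confusion):
    """Return (per-class accuracy, overall accuracy) from nested counts.

    confusion[true_label][predicted_label] is a number of pictures.
    """
    per_class = {}
    correct = 0
    total = 0
    for true_label, row in confusion.items():
        row_total = sum(row.values())
        hits = row.get(true_label, 0)
        per_class[true_label] = hits / row_total
        correct += hits
        total += row_total
    return per_class, correct / total


# Counts inferred from Table 2 (assumption: each off-diagonal "1" is one picture).
TABLE2_COUNTS = {
    "Psychodidae": {"Psychodidae": 330, "non-Psychodidae": 1},
    "non-Psychodidae": {"Psychodidae": 1, "non-Psychodidae": 684},
}

per_class, overall = accuracy_summary(TABLE2_COUNTS)
for label, value in per_class.items():
    print(f"{label}: {100 * value:.1f}%")
print(f"overall: {100 * overall:.1f}%")
```

Under these assumed counts, the overall accuracy is 1014/1016, which rounds to the 99.8% quoted in the text.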
Test for accuracy at the genus taxonomic level.
Sandfly taxonomy is complex and still evolving. A conservative and simplified approach recognizes six main genera: three in the Old World (Phlebotomus, Sergentomyia, and Chinius) and three in the New World (Lutzomyia, Brumptomyia, and Warileya)1. A revision of the New World genera was recently proposed1,26; in this study we follow the conservative taxonomy27 but also report the names from the revised New World sandfly taxonomy to highlight the changes. Therefore, L. migonei specimens are grouped with those of L. longipalpis in the analysis and in the accuracy computation. We have also focused on the genera that harbor proven or suspected vectors and are thus the most relevant for human or veterinary medicine. Our dataset contains pictures documenting three genera according to Akhoundi et al.1 (Phlebotomus, Lutzomyia, and Sergentomyia), or four according to the revised taxonomy25,26. Clearly, more samples are required to address this question for the New World sandfly fauna. At the genus level, the classification accuracy was at least 90% for every genus (Table 3).
Table 3: Classification accuracy of genera
| Truth / Predicted genus (%) | Phlebotomus | Lutzomyia (Lutzomyia & Migonemyia) | Sergentomyia | non-Psychodidae dipteran |
|---|---|---|---|---|
| Phlebotomus (254) | 98.0 | 1.6 | 0.0 | 0.4 |
| Lutzomyia (Lutzomyia & Migonemyia) (58) | 5.2 | 93.1 | 1.7 | 0.0 |
| Sergentomyia (20) | 0.0 | 10.0 | 90.0 | 0.0 |
| non-Psychodidae dipteran (686) | 0.1 | 0.0 | 0.0 | 99.9 |
Number of pictures in bold
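Grouping L. migonei with L. longipalpis under the conservative taxonomy amounts to collapsing species labels onto genus labels before the accuracy is computed. The sketch below shows one possible way to do this; the text does not state whether the authors collapsed species-level predictions or trained a separate genus-level model, and the mapping, the function name `genus_confusion`, and the toy prediction pairs are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical species-to-genus mapping following the conservative taxonomy.
SPECIES_TO_GENUS = {
    "P. papatasi": "Phlebotomus",
    "L. longipalpis": "Lutzomyia",
    "L. migonei": "Lutzomyia",  # Migonemyia migonei in the revised taxonomy
    "S. schwetzi": "Sergentomyia",
}


def genus_confusion(pairs, mapping=SPECIES_TO_GENUS):
    """Collapse (true species, predicted species) pairs to genus level.

    Returns row-normalised percentages: result[true_genus][predicted_genus].
    """
    counts = defaultdict(Counter)
    for true_species, predicted_species in pairs:
        counts[mapping[true_species]][mapping[predicted_species]] += 1
    return {
        genus: {pred: 100.0 * n / sum(row.values()) for pred, n in row.items()}
        for genus, row in counts.items()
    }


# Made-up pairs for illustration only; these are not data from the study.
toy_pairs = [
    ("L. migonei", "L. longipalpis"),  # wrong species, correct genus
    ("L. longipalpis", "L. longipalpis"),
    ("L. migonei", "S. schwetzi"),     # wrong genus
    ("S. schwetzi", "S. schwetzi"),
    ("P. papatasi", "P. papatasi"),
    ("L. longipalpis", "L. migonei"),  # wrong species, correct genus
]

result = genus_confusion(toy_pairs)
print(result["Lutzomyia"])
```

With this grouping, a confusion between L. migonei and L. longipalpis counts as a correct genus-level assignment, which is why genus-level scores can exceed species-level ones.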
Test for accuracy at the subgenus taxonomic level.
To further investigate how well our methodology agrees with the existing classification, we assessed its reliability at the subgenus level. The subgenera of sandflies have been intensively studied over many decades, with taxonomists holding varying views about their number and designation. The genus Phlebotomus currently encompasses 13 subgenera. For the genus Sergentomyia, ten subgenera are proposed. The genus Chinius is not further divided into subgenera. The taxonomic subdivisions in Neotropical Phlebotominae are rather complex and remain debated26,28. A checklist of American sandflies is available25. We use the classification provided by Akhoundi et al.1. Our dataset does not fully cover the biodiversity of sandflies at the subgeneric level, particularly for New World species; however, we provide data on four subgenera of the genus Phlebotomus, namely Adlerius, Larroussius, Paraphlebotomus and Phlebotomus, which together harbor 30 species that are proven or suspected vectors of many human-infecting Leishmania4. At the subgenus level, the classification accuracy remains high, consistently above 80% (Table 4). Confusion is higher between the Adlerius and Larroussius subgenera, which are regarded as phylogenetically close, than with the Sergentomyia and New World (Lutzomyia and Migonemyia) groups.
Table 4: Classification accuracy of subgenera
| Truth / Predicted subgenus (%) | Adlerius | Larroussius | Paraphlebotomus | Phlebotomus | Lutzomyia (Lutzomyia & Migonemyia) | Sergentomyia | non-Psychodidae dipteran |
|---|---|---|---|---|---|---|---|
| Adlerius (17) | 82.3 | 11.8 | 0.0 | 5.9 | 0.0 | 0.0 | 0.0 |
| Larroussius (96) | 1.0 | 88.6 | 0.0 | 6.3 | 3.1 | 0.0 | 1.0 |
| Paraphlebotomus (45) | 0.0 | 4.4 | 88.9 | 6.7 | 0.0 | 0.0 | 0.0 |
| Phlebotomus (96) | 0.0 | 0.0 | 0.0 | 99.0 | 1.0 | 0.0 | 0.0 |
| Lutzomyia (Lutzomyia & Migonemyia) (58) | 0.0 | 3.5 | 0.0 | 1.7 | 93.1 | 1.7 | 0.0 |
| Sergentomyia (20) | 0.0 | 0.0 | 0.0 | 0.0 | 10.0 | 90.0 | 0.0 |
| non-Psychodidae dipteran (686) | 0.0 | 0.1 | 0.0 | 0.0 | 0.0 | 0.0 | 99.9 |
Number of pictures in bold
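To see which subgenera are most often confused, the off-diagonal entries of Table 4 can be ranked. The sketch below transcribes the percentages from Table 4 (rows are the true groups, columns the predicted ones); the helper `top_confusions` and the shortened labels are hypothetical and only serve as an illustration.

```python
LABELS = [
    "Adlerius", "Larroussius", "Paraphlebotomus", "Phlebotomus",
    "Lutzomyia", "Sergentomyia", "non-Psychodidae",
]

# Row-normalised percentages transcribed from Table 4 (rows: truth).
TABLE4 = [
    [82.3, 11.8, 0.0, 5.9, 0.0, 0.0, 0.0],
    [1.0, 88.6, 0.0, 6.3, 3.1, 0.0, 1.0],
    [0.0, 4.4, 88.9, 6.7, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 99.0, 1.0, 0.0, 0.0],
    [0.0, 3.5, 0.0, 1.7, 93.1, 1.7, 0.0],
    [0.0, 0.0, 0.0, 0.0, 10.0, 90.0, 0.0],
    [0.0, 0.1, 0.0, 0.0, 0.0, 0.0, 99.9],
]


def top_confusions(matrix, labels, k=3):
    """Return the k largest off-diagonal entries as (percent, truth, predicted)."""
    pairs = [
        (matrix[i][j], labels[i], labels[j])
        for i in range(len(labels))
        for j in range(len(labels))
        if i != j and matrix[i][j] > 0
    ]
    return sorted(pairs, reverse=True)[:k]


top = top_confusions(TABLE4, LABELS)
for percent, truth, predicted in top:
    print(f"{truth} classified as {predicted}: {percent}%")
```

The largest entry is Adlerius classified as Larroussius (11.8%), followed by Sergentomyia classified as Lutzomyia (10.0%). Note that the percentages rest on very different sample sizes (17 and 20 pictures, respectively), so each corresponds to only two pictures.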
Test for automatic classification of the 12 sandfly species included in the dataset.
At the species level, our dataset contains pictures of 17 species, but only 12 provided enough images for training. Although limited in species richness given the vast number of sandfly species described (900 to 1000), our dataset is composed of primary proven vectors of the pathogens causing cutaneous and visceral leishmaniasis in the New and Old World, which underlines its relevance for medical entomology. Among the 12 species in the dataset that underwent the learning process, the best accuracy score was obtained for P. papatasi (100%), the proven vector of L. major (an agent of cutaneous leishmaniasis). Leaving aside P. ariasi, which is represented by only four pictures in Table 5 (75.0%), the lowest accuracy score (77.8%) was obtained for P. perniciosus, a proven vector of L. infantum (an agent of visceral leishmaniasis). For all other species in the database, the accuracy scores remain high, above 77% (Table 5). The accuracy of the method in assigning sandfly species must now be tested on a larger sample of field-collected sandflies covering a wider geographic area, including New World species. In addition, with a proper sampling strategy, the proposed methods allow WIP variation at the population level to be investigated.
Table 5: Accuracy scores for specimens of the 12 Phlebotominae species included in the database.
| Truth: species (%) | Class | Predicted: 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| P. tobbi (33) | 1 | 84.8 | 0.0 | 3.0 | 9.1 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| P. perniciosus (36) | 2 | 8.3 | 77.8 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.8 | 0.0 | 2.8 | 5.6 | 0.0 | 2.8 |
| P. argentipes (39) | 3 | 0.0 | 0.0 | 97.4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.6 | 0.0 | 0.0 |
| P. duboscqi (30) | 4 | 0.0 | 0.0 | 0.0 | 96.7 | 0.0 | 0.0 | 3.3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| P. sergenti (45) | 5 | 4.4 | 0.0 | 0.0 | 2.2 | 88.9 | 0.0 | 4.4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| P. ariasi (4) | 6 | 0.0 | 0.0 | 0.0 | 25.0 | 0.0 | 75.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| P. papatasi (27) | 7 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 100.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| P. orientalis (23) | 8 | 4.3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 91.3 | 4.3 | 0.0 | 0.0 | 0.0 | 0.0 |
| P. arabicus (17) | 9 | 5.9 | 0.0 | 5.9 | 0.0 | 0.0 | 0.0 | 0.0 | 5.9 | 82.4 | 0.0 | 0.0 | 0.0 | 0.0 |
| L. longipalpis (35) | 10 | 0.0 | 5.7 | 0.0 | 2.9 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 91.4 | 0.0 | 0.0 | 0.0 |
| L. migonei / M. migonei (23) | 11 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 95.7 | 4.3 | 0.0 |
| S. schwetzi (20) | 12 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 10.0 | 90.0 | 0.0 |
| non-Psychodidae (686) | 13 | 0.1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 99.9 |
Number of pictures in bold
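Because the 12 species are represented by very different numbers of pictures, a per-species score and a pooled score can tell different stories. The sketch below summarises the diagonal of Table 5 for the 12 sandfly species in two ways: an unweighted (macro) mean of the per-species scores, and a picture-weighted score obtained by recovering approximate counts of correctly classified pictures. The recovery by rounding is an assumption, and `TABLE5_DIAGONAL` and `summarize` are hypothetical names.

```python
# (species, number of pictures, diagonal accuracy in %) transcribed from Table 5.
TABLE5_DIAGONAL = [
    ("P. tobbi", 33, 84.8),
    ("P. perniciosus", 36, 77.8),
    ("P. argentipes", 39, 97.4),
    ("P. duboscqi", 30, 96.7),
    ("P. sergenti", 45, 88.9),
    ("P. ariasi", 4, 75.0),
    ("P. papatasi", 27, 100.0),
    ("P. orientalis", 23, 91.3),
    ("P. arabicus", 17, 82.4),
    ("L. longipalpis", 35, 91.4),
    ("L. migonei", 23, 95.7),
    ("S. schwetzi", 20, 90.0),
]


def summarize(rows):
    """Summarise per-species accuracies (macro mean, weighted score, worst class)."""
    total = sum(n for _, n, _ in rows)
    # Assumption: rounding n * pct / 100 recovers the number of correct pictures.
    correct = sum(round(n * pct / 100) for _, n, pct in rows)
    macro = sum(pct for _, _, pct in rows) / len(rows)
    worst = min(rows, key=lambda row: row[2])
    return {
        "total": total,
        "correct": correct,
        "weighted": 100 * correct / total,
        "macro": macro,
        "worst": worst[0],
    }


summary = summarize(TABLE5_DIAGONAL)
print(summary)
```

In this summary the lowest diagonal value belongs to P. ariasi, which rests on only four pictures, so it weighs little in the picture-weighted score but noticeably in the macro mean.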
Morphological species identification remains the gold standard in sandfly taxonomy; however, it is subject to various limitations, including the compromised state of decisive structures in field-collected specimens, intraspecific variability among populations, laborious sample preparation, and declining entomological expertise among taxonomists. Hence, alternative molecular approaches are gradually being applied, namely DNA sequencing (DNA barcoding)29 and MALDI-TOF protein profiling30-35. These methods, however, also have their limitations. Reference DNA sequences for sandflies have been shown to cover less than 50% of the species diversity of the subfamily, and, depending on the sandfly group or genus, different markers rather than a universal set are applied26,36. Moreover, despite decreasing costs per analysis, sequencing is still not always available in some endemic countries and requires considerable expertise. MALDI-TOF protein profiling, a mass spectrometry method, provides a time- and cost-effective alternative, as sample preparation is quick and cheap. However, the required equipment may be prohibitively expensive and is not always readily available to medical entomologists. So far, only in-house databases of sandfly reference protein spectra have been established, which further limits the applications of this approach. In addition, the interoperability of MALDI-TOF requires standardized procedures for sample conservation, for the choice of the adult specimen body part, and even for the trapping method37; standardized preparation procedures and reproducibility between instruments and in-house databases are also desirable33. Hence, an alternative method for identifying adult sandfly species that can be performed where costly and highly sophisticated infrastructure is unavailable is highly desirable.
The application of deep learning leads to robust results in terms of classification performance. The proposed method has the potential to be used in real-life scenarios, since the proposed architecture offers a good compromise compared with the other methodologies reviewed in Cannet et al.15. Future development and technical implementation of this methodology include strengthening the database in terms of Phlebotomine species and population representation, and using generative adversarial networks (GANs) to supplement the database with new species, even those with a low number of representatives. From an application point of view, previous works15,16,24 and this study add evidence for the generic potential of the method for dipteran insect identification. Implementing a SaaS platform would make the complete service available to any remotely located computer with an internet connection.