Machine learning analysis of wing venation patterns accurately identifies Sarcophagidae, Calliphoridae and Muscidae fly species

In medical, veterinary and forensic entomology, the ease and affordability of image data acquisition have resulted in whole-image analysis becoming an invaluable approach for species identification. Krawtchouk moment invariants are a classical mathematical transformation that can extract local features from an image, thus allowing subtle species-specific biological variations to be accentuated for subsequent analyses. We extracted Krawtchouk moment invariant features from binarised wing images of 759 male fly specimens from the Calliphoridae, Sarcophagidae and Muscidae families (13 species and a species variant). Subsequently, we trained the Generalized, Unbiased, Interaction Detection and Estimation (GUIDE) random forests classifier using linear discriminants derived from these features and inferred the species identity of specimens from the test samples. Fivefold cross-validation results show a 98.56 ± 0.38% (standard error) mean identification accuracy at the family level and a 91.04 ± 1.33% mean identification accuracy at the species level. The mean F1-score of 0.89 ± 0.02 reflects a good balance between the precision and recall of the model. The present study consolidates findings from previous small pilot studies on the usefulness of wing venation patterns for inferring species identities. Thus, the stage is set for the development of a mature data analytic ecosystem for routine computer image-based identification of fly species that are of medical, veterinary and forensic importance.


INTRODUCTION
In medical, veterinary and forensic entomology, the identification of fly species is the first step towards using entomological data to establish evidence for estimation of the minimum post-mortem interval in cases of death-related investigations (Amendt et al., 2011). Traditionally, the identification of adults uses external morphological characteristics, but internal characters can also be used, particularly the male genitalia (Sivell, 2021). Difficulties can arise in the case of female flies, for which identification keys are limited for groups such as Sarcophagidae (Giroux et al., 2010; Kurahashi et al., 2021; Kurahashi & Chaiwong, 2013; Lopes, 1961; Richet et al., 2011; Vairo et al., 2015).
A trained taxonomist is required to reliably infer the species identity of a specimen using morphological identification keys. DNA sequence analysis complements traditional morpho-taxonomy by enabling typically reliable species identification even if only morphologically uninformative body parts are available, such as fly legs (Tan et al., 2010). However, DNA-based species identification may be problematic in some situations. For example, Whitworth et al. (2007) found that 12 Protocalliphora species infected with the endosymbiotic bacterium Wolbachia were poorly identifiable using partial COI and COII gene sequences. Picard et al. (2018) showed that two species of Lucilia cannot be separated by standard DNA barcodes but require amplified fragment length polymorphism (AFLP).
Another method that has been used to identify flies of medical, veterinary and forensic importance is geometric morphometrics (Adams et al., 2013). This method is attractive as only basic digital optical instruments are needed to photograph the wing, and mature data analysis software is available. Wing venation patterns captured through a set of homologous landmarks have been shown to be sufficient for clustering many fly species in the Calliphoridae and Sarcophagidae families (Sontigun et al., 2017, 2019). More recently, Khang et al. (2021) showed that a machine learning model that uses fly wing geometric morphometric data produces species identifications that are highly concordant with those inferred using DNA sequence data.
Nevertheless, the number of homologous landmarks that can be identified in the fly wing limits the ability of geometric morphometrics to detect more subtle patterns of wing venation.
In principle, whole-wing image analysis allows more thorough extraction of shape variation information, thus overcoming limitations of geometric morphometrics in species identification tasks. This approach should enable good species identification since divergence between taxa is the main source of variation in wing venation patterns, with rare and incomplete secondary convergence (Perrard et al., 2014). Furthermore, image analysis allows the possibility of developing automated approaches for species identification, thus potentially alleviating the burden of routine species identification on professional taxonomists. However, to use this approach effectively, relatively more sophisticated data processing and analytic steps are required. In medical and veterinary entomology, Macleod et al. (2018) first demonstrated the applicability of pixel-based image analysis for identification of Chrysomya bezziana Villeneuve, 1914 (Diptera: Calliphoridae) at the sex- and population-level, as well as identification of Ch. bezziana, Chrysomya megacephala (Fabricius, 1794) (Diptera: Calliphoridae) and Chrysomya rufifacies (Macquart, 1843) (Diptera: Calliphoridae) at the species level. More recently, Goh and Khang (2021) applied a class of mathematical transformation known as Krawtchouk moment invariants to pixel data from binarised fly wing images. They found that random forests classification of 12 fly species produced an out-of-bag misclassification rate of 0%, compared to about 39% using geometric morphometric data. However, the misclassification rates were likely underestimated because small sample sizes (n = 74) prevented meaningful test samples from being constructed. In this article, we aim to provide a definitive validation of the usefulness of the approach in Goh and Khang (2021) using a substantially larger set of samples with proper test sample estimation of the misclassification rate.

Specimens
Since sex can be an additional source of variation in wing venation patterns, we only used male flies. A total of 759 male flies (3 families; 13 species and a species variant) were used in the present study (see Table S1). The specimens used in this study came from three separate collections. Their geographical sources are given in Table S2. Collection 1 (SHT) consists of specimens collected in Malaysia. It includes three Calliphoridae species: Ch. megacephala, Ch. nigripes and Ch. rufifacies, and all five species of Sarcophagidae. The specimens were collected from various geographical localities and habitats (e.g., primary forests, farms, mangrove swamps, beaches and national parks) in Malaysia. Flies were collected with a handheld insect net by sweeping, with decomposed beef used as bait. The flies were identified using suitable identification keys (González-Mora, 1989; González-Mora & Peris, 1988; Kurahashi et al., 1997, 2021; Peris & González-Mora, 1991; Pont, 1991). Collection 2 (TI) consists of specimens collected in the province of Alicante, Spain. It includes three Calliphoridae species: C. vicina, Ch. albiceps (normal and wing mutant variant) and L. sericata, and a Muscidae species: Sy. nudiseta. For specimens in Collection 2, C. vicina and L. sericata specimens were captured using pork liver baits. Specimens of Ch. albiceps and Sy. nudiseta were obtained by growing larvae obtained from a human autopsy at the Institute of Legal Medicine of Alicante (IMLA, Spain). Collection 3 (AHW) consists of specimens collected mostly from the islands of Indonesia. It includes three Calliphoridae species: Ch. bezziana (from Java, Sulawesi, Sumatra and Sumba islands, Malaysia, India and Africa), Ch. megacephala (from Java, Kalimantan, Lombok, Sumatra, Sulawesi, West Papua and West Timor) and Ch. rufifacies (from Sumatra and Sumba islands). Some Ch. bezziana samples were grown in the laboratory from larvae found at myiasis-infected wounds. The Ch. megacephala and Ch. rufifacies samples were captured using a Lucitrap Modification (LTM) or sticky trap with Bezzilure (Urech et al., 2012) as bait. Samples from Collection 1 were deposited in the MP2 Genetics and Molecular Biology Laboratory, Institute of Biological Sciences, Faculty of Science, Universiti Malaya (Kuala Lumpur, Malaysia); samples from Collection 2 in the Entomological Collection of the University of Alicante (CEUA, Spain); and samples from Collection 3 in the Natural History Museum (London, UK).

Only the right wings from the specimens were used. For specimens in Collections 1 and 2, the wings were dislocated at the basal articulation, stuck onto transparent tape and mounted on an acetate sheet. A video illustration of the procedure is available at https://youtu.be/o7-XnUk0vKY. The wings were then digitised using a stereo microscope (Olympus SZX7, Japan). For specimens in Collection 3, the wings were removed close to their bases using a combination of delicate forceps and a fine scalpel blade, while preserving the basicosta.

The dislocated wings were then placed beneath a cover slip on a glass slide that had been mounted with Euparal. This was accomplished by placing a drop of Euparal on the slide, followed by the wing, a thin layer of Euparal on the wing and finally the cover slip. The mountant was kept as thin as feasible to maximise wing flattening. All mounted wing slides were placed in a 56 °C oven overnight to eliminate air bubbles. Digital images of slide-mounted wings were then obtained using standard entomological photomicrography procedures (Hall et al., 2014) with a Leica DFC295 camera and Leica Application Suite version 3.5.0 software. Figure 1 shows examples of the raw images derived from each of the three collections.

Image pre-processing
We first binarised all raw image data to focus on the wing venation patterns and remove unnecessary features such as background noise and wing membrane details (Figure 2), as the trailing edges of the wings are frequently damaged due to normal wear and tear (Beutler et al., 2020). To do so, different binarisation configurations were applied to different sets of images to accentuate the wing venation patterns. Raw images that could not be properly binarised or contained broken venation patterns were removed (Figure 3). For each binarised image, residual background noise was manually removed to prevent machine learning models from learning it as a signal for identifying the fly species. Next, the images were centred and then oriented with the wing costa parallel to the horizontal axis. Subsequently, they were cropped into images of dimension 724 × 254 pixels (Figure 4) and saved in PNG file format. To optimise the machine learning model's performance and accelerate training, we reduced the size of the images to 256 × 90 pixels, thereby eliminating extraneous features that may impede accurate identification. The time to pre-process each image ranged from approximately 3 to 8 min, with noisier images requiring more time to process. In general, the better the initial quality of the raw image, the less time was needed. In the end, we obtained 759 clean, binarised images of fly wings. Figure S1 shows examples of binarised wing images of representative specimens from each species.
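For readers who prefer a programmatic route, the thresholding and downsampling steps above can be sketched in R with the png package. This is a minimal illustration only; the study used ImageJ and Pixlr E, and the threshold value, file names and target size below are assumptions rather than the settings actually used.

library(png)

binarise_wing <- function(in_path, out_path, threshold = 0.5,
                          target_rows = 90, target_cols = 256) {
  img <- readPNG(in_path)                              # matrix (greyscale) or array (colour)
  if (length(dim(img)) == 3) {                         # convert a colour image to greyscale
    img <- 0.299 * img[, , 1] + 0.587 * img[, , 2] + 0.114 * img[, , 3]
  }
  bin <- ifelse(img < threshold, 1, 0)                 # dark veins -> 1, light background -> 0
  # nearest-neighbour downsampling to the working resolution
  r_idx <- round(seq(1, nrow(bin), length.out = target_rows))
  c_idx <- round(seq(1, ncol(bin), length.out = target_cols))
  writePNG(bin[r_idx, c_idx], out_path)
}

# Hypothetical usage: binarise_wing("wing_raw.png", "wing_bin.png")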

Feature extraction and selection
For each binarised image, its 0-1 pixel matrix was transformed into a Krawtchouk moment matrix of order p (see Supplementary Text 1) using the IM R package (Rajwa et al., 2013). The Krawtchouk moment matrix contains information that allows image reconstruction at varying degrees of resolution, depending on the choice of the order p (Figure 4). For the present study, we set p = 90. For feature extraction, the Krawtchouk moment invariants were used instead because they are invariant to the effects of rotation, scaling and translation in the images. Subsequently, for each image, we flattened its Krawtchouk moment invariants matrix into a vector of length p^2. Each vector was then stacked row by row to produce a feature matrix of dimension n × p^2, where n represents the sample size.

The machine learning model was trained on the feature matrix of the training samples (K). It was then used to infer species identities in the test samples using the latter's feature matrix (L). The columns (features) of K were standardised using the Z-score to ensure that each feature had zero mean and unit variance. This produced the matrix K_norm. Assuming that the features in L have a statistical distribution identical to those in K, we standardised the columns of L using the corresponding column means and standard deviations from K, thereby obtaining the matrix L_norm. This step was important for avoiding data leakage during the testing phase, which could produce spurious performance results.
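For orientation, the two-dimensional Krawtchouk moment of order (n, m) of an N × M image f(x, y) has the following general form; the exact weighted variant and the construction of the rotation-, scale- and translation-invariant version used in this study are given in Supplementary Text 1, so the expression below is indicative only:

Q_{nm} = \sum_{x=0}^{N-1} \sum_{y=0}^{M-1} \bar{K}_n(x;\, p_1,\, N-1)\, \bar{K}_m(y;\, p_2,\, M-1)\, f(x, y), \qquad 0 \le n, m \le p - 1,

where \bar{K}_n denotes the weighted Krawtchouk polynomial of order n, and the parameters p_1, p_2 \in (0, 1) control which local region of the image the moments emphasise.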
Next, to reduce the dimensions of the feature space, we applied principal component analysis (PCA). The eigenvectors of the first k principal components (k = 264) that accounted for approximately 92% of the total variance in the dataset were extracted as the p^2 × k matrix S. Thus, we obtained the matrix of the first k principal component scores: K_pca = K_norm S. Subsequently, we applied linear discriminant analysis (LDA), obtaining the matrix of eigenvectors of the l linear discriminants, T (l = min(k, d - 1) = 13, where d is the number of groups), which is of dimension k × l. Thus, the matrix of linear discriminant scores is given by K_lda = K_pca T = K_norm S T. For the test samples, we transformed the feature matrix L_norm as L_lda = L_norm S T, using the matrices of eigenvectors S and T obtained from the training samples.
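A minimal base-R sketch of the standardisation, PCA and LDA steps described above is given below. Object names follow the notation in the text and the 92% variance cut-off is from the text; everything else is illustrative rather than the exact script used.

library(MASS)

# K: n_train x p^2 training feature matrix; L: n_test x p^2 test feature matrix
# y_train: factor of species labels for the training samples
mu    <- colMeans(K)
sigma <- apply(K, 2, sd)
K_norm <- scale(K, center = mu, scale = sigma)
L_norm <- scale(L, center = mu, scale = sigma)   # reuse training statistics to avoid leakage

pca      <- prcomp(K_norm, center = FALSE, scale. = FALSE)
var_prop <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
k        <- which(var_prop >= 0.92)[1]           # first k components capturing ~92% of variance
S        <- pca$rotation[, 1:k]                  # p^2 x k matrix of eigenvectors
K_pca    <- K_norm %*% S
L_pca    <- L_norm %*% S

fit_lda <- lda(K_pca, grouping = y_train)        # yields l = d - 1 linear discriminants
T_mat   <- fit_lda$scaling                       # k x l matrix of eigenvectors
K_lda   <- K_pca %*% T_mat
L_lda   <- L_pca %*% T_mat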

Exploratory data analysis
To assess the informativeness of the new features obtained using the preceding steps, we performed exploratory data analysis. If the new features were informative, we expected to see reasonably clear patterns of clustering at the family level as well as at the species level. In particular, we expected Muscidae samples to cluster strongly together as they are relatively different from the more closely related Sarcophagidae and Calliphoridae. We also expected mutant Ch. albiceps to form a distinct cluster at the family level because of their broken wing venation patterns. To this end, we applied LDA to the PCA feature matrix of the training samples, K_pca, and then examined the pairwise scatter plots of the linear discriminants.
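For example, pairwise scatter plots of the first few linear discriminants can be examined with base R graphics. This is only a sketch; K_lda and y_train are as defined in the sketch above.

cols <- as.integer(factor(y_train))              # one colour per species
pairs(K_lda[, 1:3], labels = c("LD1", "LD2", "LD3"),
      col = cols, pch = 19,
      main = "Training samples in linear discriminant space")

# A single panel, e.g. LD2 against LD1, with a species legend
plot(K_lda[, 1], K_lda[, 2], col = cols, pch = 19, xlab = "LD1", ylab = "LD2")
legend("topright", legend = levels(factor(y_train)),
       col = seq_along(levels(factor(y_train))), pch = 19, cex = 0.6)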

Model training, testing and performance evaluation
We used stratified random sampling with fivefold cross-validation to allocate 80% of the samples for training and 20% of the samples for testing. To ensure unbiased model training, it was necessary for sample sizes to be balanced. To this end, we implemented the Synthetic Minority Oversampling Technique (SMOTE), a technique for generating synthetic samples from the minority class (Chawla et al., 2002).
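The stratified fold allocation described above can be sketched in base R as follows; this is illustrative only, with y denoting the vector of species labels for all specimens.

set.seed(1)
k_folds <- 5
fold_id <- integer(length(y))
for (sp in unique(y)) {                          # assign folds within each species (stratification)
  idx <- which(y == sp)
  fold_id[idx] <- sample(rep(seq_len(k_folds), length.out = length(idx)))
}
# For fold f, the test set is fold_id == f (~20%) and the training set is the rest (~80%)
test_idx  <- which(fold_id == 1)
train_idx <- which(fold_id != 1)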
We defined species with at most 20 samples as being in the minority class, and those with more than 20 samples as being in the majority class. Since coupling oversampling of the minority class with undersampling of the majority class tends to produce better classification performance (Chawla et al., 2002), we implemented the Edited Nearest Neighbours (ENN) undersampling technique concurrently.
ENN excludes a selected data point and its k-nearest neighbours if the class label (i.e., species identity in the present context) returned by the majority vote of the k-nearest neighbours is different from the class label of the selected data point (Batista et al., 2004). Both SMOTE and ENN were applied to samples from the minority class in the K_lda matrix. In the end, we obtained augmented training datasets containing approximately 600 samples, whereby each group had approximately equal sample size (about 43 per group).

The Generalized, Unbiased, Interaction Detection and Estimation (GUIDE) random forests (Loh, 2009) was used as the classifier. Since six species (all five Sarcophagidae species, and Ch. nigripes) have small sample sizes (10-20 specimens), setting aside samples for a validation set for the purpose of hyperparameter tuning runs the risk of producing an overfitted model that simply captures the idiosyncrasies of a small set of specimens. For this reason, we kept to the default settings in GUIDE. After training the GUIDE random forests using the augmented training samples, we estimated the accuracy of the model using the test samples in each fold and averaged the result. We also estimated the accuracy of predicting the majority class and the minority class separately to understand the model's classification behaviour.

To investigate the overall performance of the GUIDE random forests model in species identification, we considered four metrics: accuracy, precision (i.e., positive predictive value), recall (i.e., sensitivity) and the F1-score. Briefly, accuracy is the proportion of inferred species identities that are correct (reported as a percentage); precision gives the proportion of inferred species identities that are true: Precision = True Positives / (True Positives + False Positives). Recall gives the proportion of a species of interest that is inferred correctly: Recall = True Positives / (True Positives + False Negatives). The F1-score is defined as the harmonic mean of precision and recall: F1 = 2 × Precision × Recall / (Precision + Recall). Values of precision, recall and F1-score close to 1 indicate a model's propensity for making reliable identifications. For each species, the means of precision, recall and F1-score over all five folds were used to summarise the model's performance in correctly identifying that species.
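As an illustration, the rebalancing of the training data in linear discriminant space could be written roughly as follows, assuming the SmoteClassif() and ENNClassif() interfaces of the UBL package; the argument values are illustrative, not the exact configuration used in the study.

library(UBL)

train_df <- data.frame(species = factor(y_train), K_lda)

# Oversample the minority species with SMOTE, then clean with Edited Nearest Neighbours
smoted  <- SmoteClassif(species ~ ., train_df, C.perc = "balance")
cleaned <- ENNClassif(species ~ ., smoted, k = 3)[[1]]   # assumed to return the cleaned data as its first element

# 'cleaned' is the augmented, approximately balanced training set passed to the classifier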
Figure 5 summarises the methodological workflow of the present study.

Software and computing tools
ImageJ version 1.53k was used to binarise the images. Pixlr E (https://pixlr.com/e/) was used for manual denoising of the binarised images.
Computations were done using R version 4.1.3 (R Core Team, 2022).
The IM R package (Rajwa et al., 2013) was used to compute the Krawtchouk moment invariants. SMOTE and ENN were implemented via the UBL R package with default settings (Branco et al., 2016). The random forests model was implemented using the GUIDE programme (version 40.1; Loh, 2009) with default settings (number of trees = 500; number of variables for splitting = 5; splitting fraction = 0.16639; maximum depth = 15; minimum node size = 5). Data processing and analyses were done using a MacBook Air 2020 with an M1 chip and 16 GB RAM, running the macOS Monterey operating system.

Data and code availability
The metadata of the samples, raw images, their binarised versions and the R scripts used are available in the Data Dryad Repository at https://doi.org/10.5061/dryad.vdncjsxzh.

Sample clustering
The training samples show strong clustering patterns at the family level (Figure 6) in the space of the first three linear discriminants (LD1, LD2 and LD3). As expected, samples of the Ch. albiceps wing mutant formed a distinct cluster rather than clustering with species in the Calliphoridae family. Within the Sarcophagidae family, there was also clear separation between all the species (Figure 7). For discriminating C. vicina, L. sericata, mutant Ch. albiceps, Ch. rufifacies and Ch. nigripes, LD1, LD10 and LD13 were sufficient (Figure 8a,b). LD6 was necessary for distinguishing normal Ch. albiceps from Ch. bezziana and Ch. megacephala (Figure 8b). Finally, LD9 allowed differentiation of Ch. bezziana from Ch. megacephala (Figure 8c). The presence of strong species clusters indicates that the Krawtchouk moment invariant features successfully capture important species-specific signals in the wing venation patterns and are therefore useful for inferring species identities. Figure 9 shows that the synthetic minority class samples generated using SMOTE clustered well with actual samples from the same species in linear discriminant space.

Family and species level identification
The GUIDE random forests model identified test samples with a mean accuracy (± standard error) of 98.56 ± 0.38% at the family level (Table S3). This result is concordant with the results of exploratory data analysis (Figure 6). For identification at the species level, GUIDE achieved a mean accuracy of 91.04 ± 1.33% (Table S4), which is substantially higher than the baseline identification accuracy of 7.14% (1/14) obtained from random guessing.
The mean accuracy of identifying the majority class was 92.46 ± 1.13%, and 79.59 ± 5.43% for the minority class (Table S4).
Overall, the GUIDE random forests model achieved a mean precision of 0.91 ± 0.02, a mean recall of 0.86 ± 0.03, and a mean F1-score of 0.89 ± 0.02 (Table 1).
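For reference, the per-species precision, recall and F1-score reported above can be computed from a confusion matrix as follows; this is a minimal sketch in which true_species and predicted_species are hypothetical vectors of test-set labels and model predictions.

cm <- table(true = true_species, predicted = predicted_species)

tp        <- diag(cm)
precision <- tp / colSums(cm)                    # TP / (TP + FP), per species
recall    <- tp / rowSums(cm)                    # TP / (TP + FN), per species
f1        <- 2 * precision * recall / (precision + recall)
accuracy  <- sum(tp) / sum(cm)

round(data.frame(precision, recall, f1), 2)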
We explored the model's patterns of misclassification to better understand its weaknesses. The five folds collectively contributed 68 instances of misclassification (Table S4). Most of the misclassifications involved samples from the Calliphoridae family (76.47%; 52/68), followed by samples from the Sarcophagidae family (16.18%; 11/68) and the Muscidae family (7.35%; 5/68). There could be two reasons for calliphorids making up the largest fraction of misclassifications observed. The first reason is that the number of samples in the Calliphoridae family was much higher (n = 586) than the number of samples in the other two families (n = 173). Assuming that the probability of misclassification is the same for each sample, more errors are therefore expected among calliphorid samples (Tables S1 and S2); indeed, calliphorids constitute 586/759 ≈ 77% of all samples, which is close to their observed 76.47% share of the misclassifications.
The second reason is that, simply by chance, the current model tends to misclassify a species as another species of the same genus more frequently, because five out of six species in the Calliphoridae family are from the genus Chrysomya. Indeed, of the 52 misclassifications in the Calliphoridae family, 42 involved another species within the same genus (Table 2). Misclassification happened most frequently between the following pairs: Ch. bezziana-Ch. megacephala, Ch. albiceps (normal)-Ch. rufifacies and Ch. albiceps (normal)-Ch. megacephala, which accounted for 12, 8 and 7 misclassifications, respectively (Table 2).

DISCUSSION
Our present results consolidate the findings in Goh and Khang (2021) that Krawtchouk moment invariant features extracted from whole-image analysis of fly wing venation patterns enable a random forests classifier to produce highly accurate species identifications. Importantly, our findings have established a high accuracy lower limit (~91%) that is attainable in identification of fly species based on their wing venation patterns. This implies that advanced image-based classification algorithms like deep learning have the potential to further improve identification accuracy. Indeed, deep learning has recently begun to diffuse into species identification in entomology, bringing with it a certain degree of model interpretability (Borowiec et al., 2022; Høye et al., 2021). Recently, Popkov et al. (2022) showed that convolutional neural networks (CNN) produce expert-level accuracy in the identification of species from the plant bug genus Adelphocoris Reuter (Heteroptera: Miridae), which technically requires dissection of the genitalia, using digitised habitus photographs of the specimens. Additionally, CNN enables heatmaps that highlight regions in an image which the algorithm uses to produce an identification, thus allowing the identification rationale to be cross-checked by a taxonomist, if necessary. Nevertheless, deep learning methods require a substantial number of training samples before their potential can be fully realised in a practical setting. Furthermore, they require more sophisticated computing hardware. In contrast, the application of classical feature extraction techniques using Krawtchouk moment invariants is practical and produces reasonably accurate predictions while requiring only ordinary computing devices.

The most time-consuming step in the present workflow is the data cleaning process, which required 3 to 8 min per image for binarisation and subsequent manual denoising. The need to control for batch effects arising from variations in sample preparation and imaging in different collections further increases processing time, since we need to find different optimal configurations to obtain high quality binarised images. If this step is not done properly, the classifier model will confound batch effects with wing venation variation, leading to poor identification outcomes in future samples. We believe the image processing steps in the present study have been effective for removing batch effects, based on two observations: (i) Ch. megacephala and Ch. rufifacies samples form distinct species clusters (Figure 8) even though the samples come from two different collections; (ii) the muscid Sy. nudiseta forms a distinct cluster and does not cluster with other species from the same collection. A standardised protocol for preparing wing specimens and subsequent imaging work would substantially cut the amount of data-cleaning work. The protocol for preparing wing specimens from Collections 1 and 2 is fast and scales well to accommodate the preparation of large numbers of samples. More importantly, the image quality of the wings thus mounted is good, with clear contrast between the wing venation patterns and the background (Figure 1). For samples from Collections 1 and 2, the average time required for manual denoising was <2 min per image. In contrast, it took an average of 8 min for samples from Collection 3.
For a balanced view, we have identified some limitations of the present work. Firstly, some loss of potentially informative species-specific variation is probably inevitable with the image binarisation step. For example, wing membrane pigmentation is potentially informative for discriminating some species, such as Chrysomya marginalis and species from the Isomyia, Stomorhina (Elleboudy et al., 2016) and Hypopygiopsis genera (Heo et al., 2015; Kurahashi et al., 1997). Secondly, the generalisability of the model's performance for identification of female flies is unknown since it was trained using only male specimens. For sarcophagids and calliphorids, females are more likely to be found on carrion (Martín-Vega & Baz, 2013; Szpila et al., 2015). Furthermore, wing venation pattern-based identification may be potentially more impactful for female flies. For example, females of Ch. bezziana and Ch. megacephala are often difficult for the untrained taxonomist to separate without voucher specimens, because of only a subtle difference in the frons and the calypter colour (Irish et al., 2014). In the literature, measurable differences in the wing venation patterns of males and females of some calliphorids have been reported (Hall et al., 2014). However, species-specific clusters containing both male and female samples that are more or less identifiable using wing shape geometric morphometric data have also been reported in calliphorids (Jiménez-Martín et al., 2020; Sontigun et al., 2017), sarcophagids (Sontigun et al., 2019) and muscids (Limsopatham et al., 2021). For future work, expanding the coverage of the training samples to include female specimens would be important to resolve this issue. Thirdly, the model lacks explainability; that is, we cannot infer which region of the wing venation pattern is used by the model to discriminate one species from another. An explainable model is desirable because it provides the rationale behind the predictions (i.e., species identifications in the present context) to the end users (Murdoch et al., 2019). If a model could associate its prediction decision with the region of the image on which it based its decision, such as in the heatmaps of a CNN, this could be invaluable for prompting taxonomists to investigate previously unconsidered morphologies for species delineation. Finally, it is possible that wing venation patterns as represented using Krawtchouk moment invariants might not contain sufficient resolution to identify some species.
Our present machine learning model for automated identification of fly species aims to supplement the expertise of taxonomists by providing a computable summary of their knowledge, without intending to replace them.Recommendations from a machine learning model can be invaluable for aiding a taxonomist in making rapid, and more confident routine identifications, thus freeing them to focus on discovery and description of new species.


FIGURE 1 Raw images of fly wings from three different sources. (a) Z. aquila from Collection 1; (b) Ch. albiceps with mutant venation pattern from Collection 2; (c) Ch. bezziana from Collection 3.
FIGURE 2 Binarisation of a raw image using ImageJ. (a) Raw image from a Z. aquila specimen; (b) the image in (a) after binarisation. The blue rectangle indicates the region where wing venation patterns are likely to be biologically meaningful. The regions bounded by red rectangles are generally noisy or contain biologically uninformative patterns.

FIGURE 3 Examples of low quality binarised images, with boxed problematic regions. (a) Loss of venation patterns caused by excessive brightness; (b) obfuscated venation patterns caused by poor lighting in the original image; (c) damaged veins.

FIGURE 4 Examples of wing image data from Ch. megacephala. (a) Raw image; (b) binarised image after manual denoising, centring and compressing; (c) reconstructed image using Krawtchouk moments of order 90 (see Part 1C of Supplementary Text 1 for mathematical details).
FIGURE 5 Data processing and machine learning workflow for identifying fly species using wing image data. Symbols: μ = vector of feature means; σ = vector of feature standard deviations; S = matrix of eigenvectors of the first k principal components accounting for 92% of total variance; T = matrix of eigenvectors of the linear discriminants. The experiment was repeated five times using different test samples to estimate the variability of the evaluation metrics (accuracy, precision, recall and F1-score). Classification is synonymous with species identification; label is synonymous with species identity established using identification keys.

FIGURE 6 Distribution of training samples at the family level in linear discriminant space. (a) Scatter plot of LD2 against LD1: mutant Ch. albiceps samples and Muscidae samples are separable as distinct clusters using both LD1 and LD2 (dashed lines); (b) scatter plot of LD3 against LD1: Sarcophagidae and Calliphoridae samples are separable as distinct clusters using LD3 (dashed line).

FIGURE 7 Distribution of training samples within the Sarcophagidae family in linear discriminant space. (a) Scatter plot of LD2 against LD1: B. karnyi, Le. alba, Z. aquila and S. princeps are separable based on LD1 and LD2 (dashed lines); (b) scatter plot of LD8 against LD1: A. gressitti is separable from the other four sarcophagid species based on LD8 (dashed line).

FIGURE 8 Distribution of samples within the Calliphoridae family in linear discriminant space. (a) Scatter plot of LD10 against LD1: C. vicina, L. sericata, Ch. albiceps (mutant) and Ch. rufifacies are separable (dashed lines); (b) scatter plot of LD13 against LD6: Ch. nigripes is separable based on LD13, and normal Ch. albiceps is separable from Ch. bezziana and Ch. megacephala based on LD6 (dashed lines); (c) scatter plot of LD9 against LD1: Ch. bezziana is separable from Ch. megacephala based on LD9 (dashed line).
FIGURE 9 Distribution of original and synthetic samples (for the minority class) generated using SMOTE and ENN in linear discriminant space. (a) Samples in LD2-LD1 linear discriminant space for species in the Sarcophagidae family; (b) samples in LD10-LD1 space for species in the Calliphoridae family. Synthetic samples are represented by open symbols. Dashed lines indicate partitions in linear discriminant space that enable identification of species-specific clusters.
TABLE 1 The result of fivefold cross-validation for precision, recall and F1-score of each population in the test sample using the Generalized, Unbiased, Interaction Detection and Estimation random forests model.
In summary, we extracted features from binarised fly wing images using Krawtchouk moment invariants. The latter were transformed into principal components and subsequently linear discriminants for use as features in the GUIDE random forests model. GUIDE was able to achieve a mean classification accuracy close to 99% at the family level, and about 91% at the species level. The mean F1-score of 0.89 suggests that GUIDE random forests has good precision and recall properties. The present work indicates that the development of an image-based automated fly species identification programme using wing venation patterns is viable, practical and potentially invaluable for advancing the fields of medical, veterinary and forensic entomology in the digital age.

TABLE 2 The misclassification matrix of species in the genus Chrysomya.