Sample information
Images of individuals from each of the haplogroups of Triatoma dimidiata were obtained from Gurgel-Gonçalves et al. [20]. These images are part of a collection of images of 51 triatomine species from Mexico and Brazil available for public use in the Dryad repository (http://dx.doi.org/10.5061/dryad.br14k). The original series of images representing the species distributed in Mexico was taken from the following entomological collections in Mexico: Regional Center for Health Research, National Institute of Public Health of Mexico, Guanajuato State Public Health Laboratory, Benito Juárez Autonomous University of Oaxaca, and the Autonomous University of Nuevo León, Monterrey. Details of how these images were taken are described in the referenced publication. We obtained a total of 44, 30, and 40 images of individuals belonging to haplogroups 1, 2, and 3, respectively. The haplogroup assignment of these individuals was corroborated genetically, and this corroboration is a major reason for using these images in quantitative analyses such as ours. From the 114 images, we selected only high-quality images that clearly captured the spot pattern, eliminating the cases where the spots were fused or covered by hyperchromatic wings (Additional file 1). This resulted in a final sample of 101 images (39 from H1, 23 from H2, and 39 from H3).
Image processing
The images were processed to facilitate the extraction of standardized measurements of the spot pattern (Fig. 9). The abdomens were clipped manually, removing the legs and cutting off the head at the thorax level. Subsequently, the images were aligned and re-scaled: the insertion angles of the abdomen and thorax and the back of the body served as references for alignment, and all individuals were scaled to the width of the first individual, which was taken as the reference (image H10355). These transformations may slightly alter the shape and absolute values of the spot measurements, but they are essential to standardize the spatial patterns of the spots and make them comparable, eliminating differences due to body shape or size, whose identifying value has been tested in previous works [20, 26]. For this reason, the quantitative estimates of areas were always expressed relative to the total area of the abdomen, and linear measurements were expressed relative to the square root of the total abdomen area.
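The standardization described above can be illustrated with a minimal sketch. The function names and the numeric values are hypothetical and are not taken from the study's macro; the sketch only shows why dividing areas by the abdomen area, and lengths by its square root, makes the measurements independent of image scale.

```python
import math

def relative_area(spot_area, abdomen_area):
    """Spot area as a percentage of the total abdomen area."""
    return 100.0 * spot_area / abdomen_area

def relative_length(length, abdomen_area):
    """Linear measurement divided by the square root of the abdomen area."""
    return length / math.sqrt(abdomen_area)

# Hypothetical measurements in pixel units, for illustration only.
abdomen_area = 40000.0  # px^2
spot_area = 1000.0      # px^2
spot_length = 50.0      # px

ra = relative_area(spot_area, abdomen_area)
rl = relative_length(spot_length, abdomen_area)

# Doubling the image size multiplies areas by 4 and lengths by 2,
# but leaves both relative measurements unchanged.
ra_scaled = relative_area(4 * spot_area, 4 * abdomen_area)
rl_scaled = relative_length(2 * spot_length, 4 * abdomen_area)
```

Because both quantities are dimensionless, individuals photographed at different magnifications can be compared directly.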
Processing for spot pattern extraction included removing color information (desaturation) and reducing the levels to the central 50% of the image histogram. In some cases, noise produced by surface reflectance of the specimens, or by shadows that artificially connected adjacent spots during the binarization of the images, was eliminated manually.
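A rough sketch of these two steps for a single pixel is given below. It rests on two assumptions that the text does not confirm: desaturation is taken as the plain average of the three channels, and "the central 50% of the histogram" is read as the middle half of the 0-255 intensity range (input levels 64 to 191), stretched linearly to the full range. Other readings, such as percentile-based limits, are equally plausible.

```python
def desaturate(rgb):
    """Convert an (r, g, b) pixel to a gray level by averaging the channels.

    Assumption: a plain channel average; the original work may have used
    a different desaturation formula.
    """
    r, g, b = rgb
    return (r + g + b) / 3.0

def reduce_levels(gray, low=64, high=191):
    """Stretch the input range [low, high] linearly to [0, 255].

    Assumption: low and high mark the middle half of the 0-255 range.
    Values outside the range are clipped to black or white.
    """
    if gray <= low:
        return 0
    if gray >= high:
        return 255
    return round(255 * (gray - low) / (high - low))

# Hypothetical pixel, for illustration only.
pixel = (10, 20, 30)
gray = desaturate(pixel)
adjusted = reduce_levels(gray)
```

Clipping the tails in this way increases the contrast between dark spots and the lighter background before binarization.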
A macro (Additional file 2) was written in the ImageJ program [35] to automate image processing and measurements. This included 8-bit image conversion, binarization with a minimum automatic threshold, background removal, mask conversion, and gap filling. The outlier points, both black and white, were then removed (using radius 6 and threshold 50), and the resulting particles (spots) were measured.
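The final measurement step can be pictured with a small sketch that labels connected foreground regions in a binary mask and reports the area and centroid of each. This is not the ImageJ macro of Additional file 2; the function name `measure_particles` and the use of 4-connectivity are assumptions made for illustration.

```python
from collections import deque

def measure_particles(mask):
    """Label 4-connected foreground regions and measure each one.

    mask is a list of rows containing 0 (background) or 1 (spot).
    Returns one dict per particle with its area in pixels and its
    centroid as (x, y).
    """
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    particles = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited foreground pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            area = len(pixels)
            cy = sum(p[0] for p in pixels) / area
            cx = sum(p[1] for p in pixels) / area
            particles.append({"area": area, "centroid": (cx, cy)})
    return particles

# Hypothetical 3 x 5 mask with two separate spots.
example_mask = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 0, 1],
]
spots = measure_particles(example_mask)
```

Here the mask yields two particles, one of 4 pixels and one of 2 pixels, which mirrors how each spot becomes a separate measured object.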
Commons Attribution 1.0 Universal (CC0 1.0) Public Domain Dedication license (https://creativecommons.org/licenses/by/1.0). Images modified from Gurgel-Gonçalves R, Komp E, Campbell LP, Khalighifar A, Mellenbruch J, Mendonça VJ, et al. Automated identification of insect vectors of Chagas disease in Brazil and Mexico: the virtual vector lab. PeerJ. 2017;5:e3040 [20]
Heat maps were obtained by superimposing the images of the spot patterns of all individuals per haplogroup, using the PAT-GEOM v1.0.0 package developed by Chan et al. [36]. This package allows quantitative analysis of different measures of the coloration pattern and was designed to work with macros in ImageJ. These maps allowed us to visually explore and qualitatively describe the general patterns that characterized each haplogroup.
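The idea behind such a heat map can be sketched as a per-pixel frequency over aligned binary masks. The code below is an illustration only and does not reproduce the PAT-GEOM implementation; the function name `heat_map` and the toy masks are hypothetical.

```python
def heat_map(masks):
    """Per-pixel fraction of individuals with a spot at that pixel.

    masks is a list of aligned binary images of identical size,
    each given as a list of rows containing 0 or 1.
    """
    n = len(masks)
    rows, cols = len(masks[0]), len(masks[0][0])
    return [
        [sum(m[r][c] for m in masks) / n for c in range(cols)]
        for r in range(rows)
    ]

# Three hypothetical 2 x 2 masks from the same haplogroup.
group_masks = [
    [[1, 0], [0, 1]],
    [[1, 1], [0, 0]],
    [[1, 0], [0, 0]],
]
heat = heat_map(group_masks)
```

A value of 1.0 marks a pixel covered by a spot in every individual, and values near 0 mark pixels that are rarely or never covered.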
Quantitative characterization of the spot pattern
The spots were numbered consecutively for identification; spots 1 and 2 were the central spots, and the spots on the edge of the abdomen were numbered with consecutive odd numbers on the left and even numbers on the right. To quantitatively describe the pattern of spots, a series of primary variables were taken at the spot level, as well as derived variables that included both the spot and individual levels.
The variables measured are shown in Fig. 10. The total body area (Ta) was used for standardization purposes only. The relative area (Ra) was the area of each spot expressed as a percentage of Ta (%). The sum of the Euclidean distances (SED) was calculated by taking the centroid coordinates of each spot and calculating, at the individual level, the distance between the central and lateral spots after a Procrustes registration of the complete configurations. The maximum and minimum Feret diameters (MaxFd and MinFd, respectively), as well as the Feret angle (Fa), were calculated for each spot. MaxFd is the largest distance between any two contour points of a shape, and MinFd is its smallest caliper width (the minimum distance between two parallel lines tangent to the contour). Although they are called diameters, they are not strictly analogous to a diameter, since they need not pass through the center of the figure or divide it into symmetrical sections. The Fa is the angle of the MaxFd vector and indicates the general directionality of the spot (its inclination). The aspect ratio (Ar) of each spot (ratio of the minor to the major diameter) was used as an indicator of its shape.
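As an illustration of the Feret variables, the sketch below computes them from a list of contour points. It is not the ImageJ routine used in the study. It assumes that MinFd is the minimum caliper width of the convex hull, that Fa is folded into the range 0-180° in ordinary mathematical axes (image axes with y pointing down would mirror the angle), and that Ar is MinFd divided by MaxFd. The function names are hypothetical.

```python
import math

def convex_hull(points):
    """Andrew's monotone chain; hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def feret_measures(contour):
    """Return MaxFd, MinFd, Fa (degrees) and Ar for a set of contour points."""
    hull = convex_hull(contour)
    if len(hull) < 2:
        raise ValueError("at least two distinct contour points are required")
    # MaxFd: largest distance between any two hull vertices.
    max_fd, pair = 0.0, (hull[0], hull[1])
    for i, a in enumerate(hull):
        for b in hull[i + 1:]:
            d = math.dist(a, b)
            if d > max_fd:
                max_fd, pair = d, (a, b)
    (x1, y1), (x2, y2) = pair
    fa = math.degrees(math.atan2(y2 - y1, x2 - x1)) % 180.0
    # MinFd: the smallest caliper width is reached with one caliper jaw
    # lying flush against a hull edge, so only edge directions are tested.
    min_fd = float("inf")
    n = len(hull)
    for i in range(n):
        a, b = hull[i], hull[(i + 1) % n]
        edge = math.dist(a, b)
        width = max(
            abs((b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0]))
            for p in hull
        ) / edge
        min_fd = min(min_fd, width)
    return {"MaxFd": max_fd, "MinFd": min_fd, "Fa": fa, "Ar": min_fd / max_fd}

# Hypothetical 4 x 2 rectangular spot outline.
rect = feret_measures([(0, 0), (4, 0), (4, 2), (0, 2)])
```

For the rectangle, MaxFd is the diagonal and MinFd is the short side, so an elongated spot gives a low Ar and a nearly circular spot gives an Ar close to 1.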
For each individual, the averages of the variables per spot, the sum of the total Ra of the spots, and the ratio of the mean Ra of the central spots to that of the lateral spots were calculated as derived variables. To calculate the average inclination angle, for both the central and the lateral spots, angles falling in the left quadrant were reflected into the right quadrant (0-90°).
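A minimal sketch of these derived variables follows. It assumes that each spot is a record with a hypothetical `id`, `Ra`, and `Fa` field, that Fa is expressed in the range 0-180°, and that reflection means replacing an angle above 90° with 180° minus that angle. None of these details is confirmed by the text.

```python
def reflect_angle(angle_deg):
    """Fold an angle in [0, 180) into [0, 90] by mirroring about 90 degrees."""
    a = angle_deg % 180.0
    return 180.0 - a if a > 90.0 else a

def mean(values):
    return sum(values) / len(values)

def derived_variables(spots):
    """Individual-level variables from a list of per-spot records.

    Spots 1 and 2 are treated as central and all others as lateral,
    following the numbering described in the text.
    """
    central = [s for s in spots if s["id"] in (1, 2)]
    lateral = [s for s in spots if s["id"] not in (1, 2)]
    return {
        "total_Ra": sum(s["Ra"] for s in spots),
        "central_to_lateral_Ra": mean([s["Ra"] for s in central])
        / mean([s["Ra"] for s in lateral]),
        "mean_Fa_central": mean([reflect_angle(s["Fa"]) for s in central]),
        "mean_Fa_lateral": mean([reflect_angle(s["Fa"]) for s in lateral]),
    }

# Hypothetical individual with two central and two lateral spots.
individual = [
    {"id": 1, "Ra": 4.0, "Fa": 30.0},
    {"id": 2, "Ra": 6.0, "Fa": 150.0},
    {"id": 3, "Ra": 2.0, "Fa": 80.0},
    {"id": 4, "Ra": 3.0, "Fa": 100.0},
]
summary = derived_variables(individual)
```

Without the reflection, mirror-image spots such as 30° and 150° would average to 90°, which would misrepresent an inclination that is 30° on both sides.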
Data analysis
Non-parametric descriptive statistics (median, quartiles, and range) were used because the data were not normally distributed, and traditional descriptors gave a false impression of precision and of marked differences. Statistical comparisons among haplogroups were made using Kruskal-Wallis tests in Statistica v8 software. In addition, a forward stepwise Linear Discriminant Function Analysis (LDFA) was performed to estimate the ability to discriminate haplogroups based on the variables used. Since this method has a series of restrictive assumptions and can only differentiate the groups linearly, a multilayer perceptron neural network classifier was used as an alternative method. Neural networks are supervised machine learning procedures that make no statistical assumptions about the nature of the data, which makes them more powerful and capable of exploring nonlinear relationships in complex sets of variables. The most efficient network topology was found with the automated search procedure of the Statistica v8 software, considering all of the variables analyzed. The network was trained with 60% of the individuals of each haplogroup and validated with the remaining 40%. Assignment to the training or validation set was random, except for individuals wrongly classified by the LDFA, which were forced into the validation set for a more robust check of network performance. The weight assigned by the neural network to each variable was estimated to identify the variables of greatest importance in the discrimination process.
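The partition into training and validation sets can be sketched as follows. This is an illustration and not the Statistica procedure. The function name, the individual identifiers, the set of LDFA-misclassified individuals, and the rounding of the 60% share are all assumptions.

```python
import random

def split_by_haplogroup(individuals, misclassified, train_fraction=0.6, seed=0):
    """Random 60/40 split within each haplogroup.

    individuals: list of (individual_id, haplogroup) pairs.
    misclassified: ids wrongly classified by the LDFA; these are always
    placed in the validation set.
    """
    rng = random.Random(seed)
    groups = {}
    for ind_id, haplogroup in individuals:
        groups.setdefault(haplogroup, []).append(ind_id)
    train, validation = [], []
    for ids in groups.values():
        forced = [i for i in ids if i in misclassified]
        free = [i for i in ids if i not in misclassified]
        rng.shuffle(free)
        # Target training size, capped by the individuals free to be assigned.
        n_train = min(round(train_fraction * len(ids)), len(free))
        train += free[:n_train]
        validation += forced + free[n_train:]
    return train, validation

# Group sizes follow the final sample (39, 23, 39); the ids and the
# misclassified set are hypothetical.
sample = (
    [(f"H1_{i}", "H1") for i in range(39)]
    + [(f"H2_{i}", "H2") for i in range(23)]
    + [(f"H3_{i}", "H3") for i in range(39)]
)
ldfa_errors = {"H1_0", "H2_5", "H3_7"}
train_ids, validation_ids = split_by_haplogroup(sample, ldfa_errors)
```

Forcing the hardest cases into the validation set makes the reported validation accuracy a more conservative estimate than a purely random split would give.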