Phylogenetic tree
We used a recently published Lepidoptera phylogeny that was time-calibrated with genomic and fossil data1. Because our study focused on family-level morphological variation, we dropped tips representing lower taxonomic ranks (e.g., subfamily and genus) using the ‘drop.tip’ function in the R package ‘ape’76. The resulting Lepidoptera tree contained 83 tips, including 74 Lepidoptera families (containing 145,623 described species77) and 9 outgroups.
Estimating diversification rates
To estimate diversification rates through time across the phylogeny, we applied the Bayesian Analysis of Macroevolutionary Mixtures (BAMM) model78 to the trimmed phylogeny using the R package ‘BAMMtools’79. The sampling fraction was adjusted based on the species numbers within each family77 before the BAMM analysis. The priors for the BAMM analysis were determined by the setBAMMpriors function with the modified tree and the estimated number of Lepidoptera species. To avoid difficulty in initial likelihood estimation due to the extremely low sampling fraction, we specified extinctionProbMax as 0.99999 in the control file. A total of 10^9 generations of MCMC searches were launched for the BAMM speciation-extinction analysis, with samples saved every 10^5 generations.
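In BAMM's control-file syntax, the settings described above correspond to entries like the following. This is a sketch: the prior values shown are placeholders that setBAMMpriors would supply, and the sampling-fraction file name is hypothetical.

```
# MCMC settings
numberOfGenerations = 1000000000     # 10^9 generations
mcmcWriteFreq = 100000               # sample every 10^5 generations
eventDataWriteFreq = 100000

# Likelihood settings
extinctionProbMax = 0.99999          # avoid initial-likelihood failures
                                     # under very low sampling fractions

# Clade-specific sampling fractions (hypothetical file name)
useGlobalSamplingProbability = 0
sampleProbsFilename = sampling_fractions.txt

# Priors (placeholder values; use the output of setBAMMpriors)
expectedNumberOfShifts = 1.0
lambdaInitPrior = 1.0
lambdaShiftPrior = 0.05
muInitPrior = 1.0
```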
We used the ‘plot.bammdata’ function to visualize diversification rates on the phylogenetic tree. The first 50% of samples were discarded as burn-in. We then used the ‘getBestShiftConfiguration’ function to identify the best-supported configuration of diversification-rate shifts. The locations of the inferred rate shifts were added to the visualized tree with the ‘addBAMMshifts’ function. Next, we used the ‘credibleShiftSet’ function to estimate the 95% credible set of distinct rate-shift configurations. Diversification rate through time, with an associated confidence interval, was plotted using the ‘plotRateThroughTime’ function for Lepidoptera only (i.e., the outgroups were excluded).
Ancestral state estimation
We dropped tips (families) lacking morphological data from the Lepidoptera phylogeny using the ‘drop.tip’ function. Fifty-two tips remained in the resulting phylogeny for morphological disparity analysis. We treated the six morphogroups (“Similarity tree and morphogroups” section) as six discrete character states, and we applied three different evolutionary models (equal rates (ER), symmetrical (SYM), and all-rates-different (ARD)) to the phylogeny to describe discrete-state evolution and estimate its likelihood using the ‘ace’ function in the R package ‘ape.’ The best-fit model was chosen using a chi-square test. We then applied stochastic trait mapping simulations80 under the chosen model to estimate the ancestral states at nodes, as well as the time spent in each state along the edges of the phylogeny, using the ‘make.simmap’ and ‘describe.simmap’ functions in the R package ‘phytools’81.
Morphological evolution based on continuous phenotypic states
To understand the evolutionary history across the morphospace, we analyzed the evolutionary dynamics of the morphological data in continuous states. Specifically, we applied PCA to the mean key features of families (“Identifying key image features” section) to further reduce the dimensions of interest. The resulting axes PC1, PC2, and PC3 explained 28.6%, 10.1%, and 8.5%, respectively, of the family-level morphological variation (47.2% in total; Extended Table 1). The values of PC1, PC2, and PC3 were used as input trait values for the following analyses.
We performed a disparity-through-time (DTT) analysis and plotted the results using the ‘dtt’ function in the R package ‘geiger’82. The DTT analysis compared the trends of morphological diversification to the expectations of a Brownian motion (BM) model constructed with 1000 random simulations. We further estimated morphological evolutionary rate shifts along lineages using the ‘PhyloEM’ function implemented in the R package ‘PhylogeneticEM’83. We determined whether the evolutionary rates of any of the three PC axes deviated from BM or Ornstein-Uhlenbeck (OU) expectations, and when such deviations occurred.
We employed three functions, namely ‘fitContinuous’ from geiger84, ‘fitDiversityModel’ from phytools81, and ‘OUwie’ from OUwie85, to assess the statistical adequacy of several models explaining the values of PC1, PC2, and PC3. The models we tested were: rate trend (trend), early burst (EB), diversity-dependent (DD), an OU model with a single regime (OU1), and an OU model with multiple regimes (OUM). For OUM, we estimated multiple regimes, or adaptive peaks, for the different morphogroups (morphogroups 1–6, as detailed in the “Similarity tree and morphogroups” section). To compare the candidate models of continuous character evolution, we determined the best-fitting models using the AIC weight, calculated with the ‘aicw’ function in geiger84.
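Akaike weights follow a standard formula: each model's AIC is rescaled relative to the best (lowest) score, and the relative likelihoods are normalized to sum to one. A minimal stdlib sketch (the AIC scores below are illustrative, not from the study):

```python
import math

def aic_weights(aics):
    """Akaike weights: w_i = exp(-delta_i / 2) / sum_j exp(-delta_j / 2),
    where delta_i = AIC_i - min(AIC)."""
    best = min(aics)
    rel = [math.exp(-(a - best) / 2.0) for a in aics]
    total = sum(rel)
    return [r / total for r in rel]

# Illustrative AIC scores for the five candidate models
scores = {"trend": 120.4, "EB": 118.9, "DD": 121.7, "OU1": 115.2, "OUM": 112.8}
weights = dict(zip(scores, aic_weights(list(scores.values()))))
best_model = max(weights, key=weights.get)   # model with the highest weight
```

With these toy scores, the lowest-AIC model (OUM) receives the largest weight, which is how the best-fitting model would be selected.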
Image collection
The specimen images of Lepidoptera were searched within the open database of GBIF and downloaded from their registered institutional websites (accessed on 2021-02-03; the full citations and rights are declared at https://github.com/yuhinas/LepMorphAI). To cover the widest range of taxa (i.e., the greatest number of species) and photographic conventions (e.g., lighting environments and backgrounds) while keeping the total number of images low, we downloaded one image per species from each registered institutional website. That is, if different institutions had photos of the same species, there would be multiple photos of that species, but from different institutions. We trained an object detection model, YOLO v449, for locating and cropping specimens from the original images. The anchors were tweaked for Lepidoptera specimen images, and the bounding-box loss functions were modified to penalize more heavily predicted boxes that were too small and would crop off the edges of specimens. The cropped images were further resized and padded to a resolution of 256x256. The padded areas were filled with the mean RGB of pixels sampled from the edges. In total, we ended up with 32,262 standardized specimen images of 17,186 species. Our modification of YOLO v4 was based on https://github.com/bubbliiiing/yolov4-pytorch.
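The pad-and-fill step can be illustrated on a toy RGB image represented as nested lists. This is a stdlib sketch of the idea only; the actual pipeline resized crops to 256x256 arrays, and the exact edge-sampling scheme is our assumption.

```python
def pad_to_square(img):
    """Pad an H x W image of (R, G, B) tuples to max(H, W) on each side,
    filling the new pixels with the integer mean RGB of the border pixels."""
    h, w = len(img), len(img[0])
    side = max(h, w)
    # Sample border pixels: first/last rows plus first/last columns
    border = img[0] + img[-1] + [row[0] for row in img] + [row[-1] for row in img]
    fill = tuple(sum(px[c] for px in border) // len(border) for c in range(3))
    top, left = (side - h) // 2, (side - w) // 2   # center the original image
    out = [[fill] * side for _ in range(side)]
    for y in range(h):
        for x in range(w):
            out[top + y][left + x] = tuple(img[y][x])
    return out

# Toy 1x3 image padded to 3x3; the fill color equals the (uniform) border color
padded = pad_to_square([[(10, 20, 30)] * 3])
```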
Data augmentation
We applied two sets of random augmentations for variational autoencoder training. To prevent potential learning biases, the first set simulated the variation among photographic conventions, including RGB adjustments, independent treatment of RGB channels for contrast adjustments, and slight variations in zoom levels. The second set was treated as noise, including xy translations, rotations, and obscuring a random percentage of pixels with three different patch sizes. The model was forced to learn the variations introduced by the first set and to ignore the noise introduced by the second set. The model needed to capture as many details as possible to compensate for the information lost to obscured pixels.
Model and training
We used a customized variational autoencoder called DFC-VSC for self-supervised feature extraction (Fig. 1a) and image generation. DFC-VSC was composed of the variational sparse coding (VSC) autoencoder48 and the deep feature consistency module from DFC-VAE50. The former was for extracting sparse and disentangled representations for better interpretability, and the latter guided the model in the training process to focus more on relevant foregrounds (the specimens of Lepidoptera) over the various backgrounds.
The encoder of our VSC implementation digested noise-introduced images of 256x256x3 dimensions with layers composed of residual blocks and encoded them into 512-dimension latent vectors. The latent vectors were regularized to follow spike-and-slab prior distributions with alpha=0.01, as recommended by VSC48. The decoder was symmetric to the encoder and reconstructed denoised images from the latent vectors in the training process. The decoder could later generate specimen images from any given vectors of the same dimensions for feature exploration, visualization, or random creation. Pixel-wise MSE reconstruction loss (L-rec) and spike-and-slab prior distribution loss (L-prior; the KL divergence between two spike-and-slab distributions, instead of between two Gaussians as in a standard VAE) were adopted in the VSC part.
The DFC module played the role of supervisor and defined the deep features on which the VSC module should focus. Our DFC implementation used a ResNet50 model trained from scratch to classify Lepidoptera subfamilies. We chose subfamilies as the classification target because the images of each subfamily covered a sufficient variety of photographic conventions, such as various backgrounds and lighting conditions. In addition, the number of subfamilies (~300) was more appropriate for the classifier than the number of genera (~5000) or families (~90), given the sample size requirements per class for training a classifier. The DFC module could therefore learn to ignore backgrounds irrelevant to the classification task, as shown with the Relevance-weighted Class Activation Mapping (Relevance-CAM)86 visualization (Extended Fig. 9).
In the main DFC-VSC training process, the parameters of the DFC module were locked, and the DFC module worked only as a feature extractor. The feature maps of the major layers, extracted from the original and reconstructed images, were compared and used to calculate a deep-feature MSE loss (L-dfc). This loss was then propagated back to the VSC model to update its parameters. The process made the VSC learn to reconstruct images with high deep-feature consistency with the original images, so the VSC model parameters were balanced toward the relevant foregrounds. Ultimately, DFC-VSC was trained for 40,000 epochs (learning rate=0.0002, Adam optimizer), by which point all three losses had reached a plateau.
Identifying key image features
Features of 512 dimensions from all 32,262 standardized specimen images (a 32262x512 matrix) were extracted with the trained DFC-VSC encoder. To further reduce dimensions for easier analysis and interpretation, we identified the key features related to subfamilies. We applied the back-propagation algorithm to a trained simple classifier targeting subfamilies to determine the feature dimensions with the steepest gradients. We chose subfamily instead of family as the learning target because we wanted an adequate number of key features while maintaining a finer resolution of features than the family level provides. The key features related to subfamilies were then aggregated to the family level for all following analyses.
To avoid the difficulty of interpreting interactions between signed feature values and signed gradients, we extended and converted the 512-D features into 1024-D vectors with non-negative values (referred to as the non-negative features). To do this, we concatenated the original 512-D features with their sign-reversed clones into 1024-D vectors and then set the values at locations that were originally negative to zero. The new 1024-D non-negative vectors were then used to train the simple classifier. The classifier consisted of only two linear layers, each with 512 neurons and followed by 50% dropout and leaky ReLU activations. It achieved 75% top-1 accuracy (i.e., the percentage of cases in which the highest-ranked predicted class label was correct) for classifying 297 subfamilies, each of which contained at least three species. After applying back-propagation, the top-N key features of each subfamily were voted for by its member species (at least three sampled) as the dimension locations with the largest N mean gradients. We tested N from 1 to 20. For each N, we took the corresponding N-union features (i.e., the union of the top-N features of each subfamily) and calculated three clustering indices, the silhouette score, the Davies-Bouldin score, and the Calinski-Harabasz score, to evaluate and select the appropriate N according to subfamily clustering. The overall evaluations showed that the optimal scenario occurred with N=2 and N-union=59 (Extended Fig. 10). Finally, converting the 59 non-negative features back to the 512-D feature space yielded 40 unique key features.
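The sign-splitting conversion described above can be written compactly; a stdlib sketch on a toy 4-D feature vector (the study used 512-D vectors):

```python
def to_nonnegative(features):
    """Concatenate a feature vector with its sign-reversed clone,
    then zero out negative entries, doubling the dimensionality.
    Positive values survive in the first half, the magnitudes of
    negative values in the second half."""
    doubled = list(features) + [-v for v in features]
    return [v if v > 0 else 0.0 for v in doubled]

vec = [1.5, -2.0, 0.0, 0.25]       # toy 4-D feature vector
nonneg = to_nonnegative(vec)       # 8-D, all values >= 0
```

Note that each non-negative dimension maps back to an original dimension by index modulo the original length, which is why the 59 selected non-negative features collapse to fewer unique features in the original space.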
We could see that these 40 features were highly representative of the original 512 features, as shown by the high similarity between typical forms of subfamilies reconstructed with the 40 features and those with all 512 features (Extended Fig. 11).
Similarity tree and morphogroups
We used the mean of the key features of each species to calculate the morphological similarity between any two families (i.e., the mean pairwise distances between any two species from different families). The resulting distance matrix was then used to build a similarity tree with the neighbor-joining algorithm implemented in the ‘scikit-bio’ package in python3. We set the root of the similarity tree to the family Eriocraniidae, the earliest-evolved family with enough data sampled in our dataset. The branches were ladderized with morphological similarity to the root.
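The family-to-family distance described above, the mean pairwise distance between species from two different families, can be sketched with the stdlib (toy 2-D feature vectors stand in for the key features):

```python
import math
from itertools import product

def family_distance(species_a, species_b):
    """Mean pairwise Euclidean distance between the feature vectors
    of species drawn from two different families."""
    dists = [math.dist(u, v) for u, v in product(species_a, species_b)]
    return sum(dists) / len(dists)

fam1 = [(0.0, 0.0), (0.0, 2.0)]    # two species of family 1
fam2 = [(3.0, 0.0)]                # one species of family 2
d = family_distance(fam1, fam2)
```

Computing this for every pair of families fills the distance matrix fed to neighbor joining.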
The tree was then split into K morphogroups for categorical analysis, and each group was required to be composed of one or several adjacent clades. The best K was decided using the following process. First, based on a general definition of clustering87, we attempted to maximize the similarity within each cluster and minimize the similarity among clusters. For any given K, we enumerated all possible combinations of K groups on our similarity tree and calculated the total variation within groups, the ratio of intragroup variation to intergroup variation, and the variation among group characteristics. The sum of these quantities was defined as a "the lower the better" grouping index. Next, we evaluated the grouping indices for each K from 3 to 10. Finally, we found the optimal result (the minimum index) when K = 6 (Extended Fig. 2c).
Wing shape and color patterning analysis
To understand the basic flight type of each family and morphogroup, we first generated each family’s typical form from its mean key features. We then isolated the forewings, with the background removed, from the reconstructed images to calculate the aspect ratios (AR) and the second moment of area (2nd MoA) following the approach of Roy et al.67. For each image, we selected only the left or right forewing, depending on the clarity of the reconstructed wing boundaries. We followed the procedures and equations of the Matlab tool ‘WingImageProcessor’ (https://biomech.web.unc.edu/wing-image-analysis/) to reimplement the calculations of AR and 2nd MoA in Python.
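On a binary wing mask, these two quantities reduce to simple pixel sums. The stdlib sketch below uses the textbook definitions (AR as span squared over area, 2nd MoA as the area-normalized second moment about the wing base); these simplified forms are our assumption and are not the exact WingImageProcessor equations the study reimplemented.

```python
def aspect_ratio(mask):
    """AR of a single wing from a binary mask (rows of 0/1 pixels),
    using the simplified definition span**2 / area."""
    area = sum(sum(row) for row in mask)
    span = sum(1 for x in range(len(mask[0])) if any(row[x] for row in mask))
    return span ** 2 / area

def second_moment_of_area(mask):
    """Second moment of the wing area about the wing base (column 0):
    the mean of x**2 over wing pixels (a simplified, unit-free sketch)."""
    area = sum(sum(row) for row in mask)
    moment = sum(x * x for row in mask for x, v in enumerate(row) if v)
    return moment / area

wing = [[1, 1, 1, 1],
        [1, 1, 1, 1]]            # toy 2x4 rectangular "wing"
ar = aspect_ratio(wing)          # span 4, area 8 -> AR = 2.0
```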
For the other features of shape and color patterning, we measured both the isolated, background-removed forewings and hindwings of each family. We used the built-in functions of the ‘opencv-python’ package to find bounding boxes, perimeters, and areas. The aspect ratio of the bounding box was defined as the box width divided by the box height. The perimeter-area ratio was defined as the wing perimeter divided by the wing area. The mean brightness was defined as the mean grayscale value of the pixels of a wing. We applied the BIRCH clustering algorithm (implemented in the ‘scikit-learn’ python package, with n_clusters=None and threshold=16) to the RGB values across all forewings and all hindwings to obtain non-arbitrary numbers of color groups, K-forewing and K-hindwing. The wing color richness was defined as the number k (1 ≤ k ≤ K) of color groups a wing has. We calculated color evenness from the number of color groups and the pixel counts of each color group, using the same formula as for species evenness.
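Species evenness is conventionally Pielou's index, Shannon entropy divided by its maximum; assuming that formula carries over to color groups (our reading of "the same formula"), a stdlib sketch:

```python
import math

def color_evenness(pixel_counts):
    """Pielou-style evenness over color groups: H / ln(k), where H is the
    Shannon entropy of pixel proportions and k the number of color groups
    present on the wing."""
    counts = [c for c in pixel_counts if c > 0]
    k = len(counts)
    if k < 2:
        return 1.0  # a single color group is trivially even
    total = sum(counts)
    h = -sum((c / total) * math.log(c / total) for c in counts)
    return h / math.log(k)

even = color_evenness([100, 100, 100])   # equal pixel counts -> evenness 1.0
skewed = color_evenness([298, 1, 1])     # one dominant color -> low evenness
```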
We defined a measurement named the Pattern Distinctiveness-Distance Index (PDDI) to indicate the mean horizontal distance of distinctive color regions (such as belts and spots) from the wing base, using the following procedure. First, we calculated pairwise color distances (RGB Euclidean distances) between any two pixels on a wing and then took the mean for each pixel (Extended Fig. 12a). A higher mean color distance indicates higher distinctiveness (i.e., the color differed more from the others and occupied less of the full wing) (Extended Fig. 12b). We truncated mean values above the 95th percentile to suppress noise. Second, we aggregated the 2-dimensional wing distinctiveness maps into a 1-dimensional vector by taking the mean of the highest 50% of each pixel column (Extended Fig. 12c) to highlight the occurrence of distinctiveness along the horizontal axis. Third, we extracted the relative distances from the wing base of the highest 50% of the aggregated distinctiveness values, and calculated the mean distance normalized by the variance as the final PDDI (Extended Fig. 12d). A higher PDDI indicated that the distinctive color regions were located farther from the wing base (i.e., closer to the outer edges of the wing) in the horizontal direction. Finally, we used bootstrapping (resampled 10,000 times with replacement) to estimate and compare the means and variances of wing color patterning between the primitive and recent morphogroups, with the “boot” library in R.
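The second step, collapsing a 2-D distinctiveness map to a 1-D horizontal profile, can be sketched with the stdlib (a toy 2x4 map stands in for a real per-pixel distinctiveness map):

```python
from statistics import mean

def aggregate_columns(dist_map):
    """Collapse a 2-D distinctiveness map (rows x cols) to a 1-D profile
    by averaging the highest 50% of values in each pixel column."""
    n_rows = len(dist_map)
    half = max(1, n_rows // 2)
    profile = []
    for x in range(len(dist_map[0])):
        col = sorted((row[x] for row in dist_map), reverse=True)
        profile.append(mean(col[:half]))
    return profile

# Toy map with a distinctive region near the outer edge (last column);
# the wing base is assumed to be at x = 0
toy = [[0.1, 0.1, 0.2, 0.9],
       [0.1, 0.2, 0.1, 0.8]]
profile = aggregate_columns(toy)   # one aggregated value per pixel column
```

The profile peaks at the column holding the distinctive region, so downstream steps measuring peak positions relative to the base would yield a high PDDI for this toy wing.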
PCA axis visualization and analysis
We applied PCA to the mean key features of families with at least three sampled species (68 families) to investigate morphological disparity through time. For an intuitive understanding of the synergistic effects of the PC axes, instead of using biplots, we visualized the planes constructed from the PC axes through the following steps. We first sampled a set of 4*4 points whose coordinates were the unique combinations of (x, y) for x, y ∈ {-3, -1, 1, 3} on the 2-D planes constructed from any two PC axes. We then inverse-transformed these points back to the morphospace of key features and generated the corresponding 16 images with the decoder of DFC-VSC. By rearranging these images into 4*4 grids according to their coordinates on the PC axes, we could easily see how wing shapes and color patterns vary across the PC planes.
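The grid-sampling step amounts to enumerating coordinate combinations; a stdlib sketch (the `pca` and `decoder` names in the comment are hypothetical stand-ins for the fitted PCA and the DFC-VSC decoder):

```python
from itertools import product

# 4*4 grid of sample points on a PC plane: x, y ∈ {-3, -1, 1, 3}
grid = list(product([-3, -1, 1, 3], repeat=2))

# Each point would then be inverse-transformed to key-feature space and
# decoded into an image, conceptually:
#   image = decoder(pca.inverse_transform(point))   # hypothetical names
```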
We further quantified wing shape and color patterning to look for potential trends along the PC axes. The left forewings and hindwings were isolated, with the background removed, from the images generated on a grid (g-x, g-y) for g-x, g-y ∈ {-3, 0, 3}. Their characters were then measured, including the aspect ratios of the bounding boxes of the wings, mean saturation, mean brightness, color richness and evenness, the perimeter-area ratios, and the PDDI. The characters related to flight capability (i.e., wing AR and 2nd MoA) were measured only on the forewings (see the “Wing shape and color patterning analysis” section for further details).
Hypervolume and trait probability density
We first projected the mean key features of each species onto the PC axes of families and then calculated the trait hypervolumes, performed their set operations, and calculated the trait probability density (TPD) indices of the primitive and recent morphogroups. Only those families that were included in the ancestral state estimation were included in the following analyses. The total number of families used for these analyses was 52 (containing 16,849 species), including 25 families from the primitive morphogroup (containing 3,300 species) and 27 families from the recent morphogroup (containing 13,549 species).
The hypervolume calculations were performed with the R package ‘hypervolume’ v3.0, using the Gaussian kernel and the ‘cross-validation’ method for bandwidth estimation. Other parameters were set to the default values of the package. The resulting hypervolume of the primitive morphogroup was 47.78 with a uniqueness of 25.93, and the hypervolume of the recent morphogroup was 28.91 with a uniqueness of 7.06. Their intersection was 21.85, and their union was 54.84. The recent morphogroup increased the total hypervolume by 14.8% (7.06 / (54.84 - 7.06)).
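These set quantities obey simple inclusion-exclusion identities, which the reported values (rounded to two decimals) can be checked against directly:

```python
# Hypervolumes reported in the text
prim, recent, inter = 47.78, 28.91, 21.85

union = prim + recent - inter            # inclusion-exclusion -> 54.84
uniq_prim = prim - inter                 # primitive-only volume -> 25.93
uniq_recent = recent - inter             # recent-only volume -> 7.06

# Fraction of the total hypervolume contributed by the recent morphogroup
added_fraction = uniq_recent / (union - uniq_recent)   # ~0.148, i.e., 14.8%
```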
The TPD indices were calculated with the R package ‘TPD’ v1.1. We changed the bandwidth estimation method from ‘plug-in’ to ‘smooth cross-validation’ to match our hypervolume calculation. The TPDc was applied at the assemblage level, and the TPDs at the family level. The TPDs required each family to have more observations (i.e., sampled species) than the number of trait dimensions (3 PC axes), so we took two approaches to meet this requirement. First, we removed small families, which decreased the number of families from 52 to 47. Second, we added one dummy observation imputed with median trait values for each of the five smallest families. The currently known species numbers of each family77 were used as weights for calibrating the redundancy estimation. Importantly, both approaches produced qualitatively similar results, with the main difference being in the relative redundancy due to the difference in the number of families. Finally, the difference in distribution between families of the primitive and recent morphogroups in the PCA morphospace was analyzed with Permutational Multivariate Analysis of Variance (PERMANOVA), implemented with the `adonis` function of the `vegan` library, using distance method=`euclidean` and permutations=1000000.