The workflow of the image encoding architecture.
To build a DNA instance-based classifier for the MNIST database, the key issue is to encode the digit (0–9) images so that similar images are encoded by similar DNA sequences. That is, the reverse complement of one image's sequence should be highly likely to hybridize with the sequences of images from the same class, and less likely to hybridize with those of other classes. To this end, the encoding architecture in Fig. 1A includes three blocks: a LeNet-5 backbone for feature extraction, and an encoder and a predictor that map similar feature vectors to similar DNA sequences.
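The Watson–Crick reverse complement underlying this hybridization criterion can be sketched in a few lines; this is a minimal illustration of the standard base-pairing rule, not part of the paper's pipeline:

```python
# Watson-Crick base pairing: A<->T, C<->G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence.

    A strand hybridizes most strongly with its reverse complement,
    which is why similar images should receive similar sequences.
    """
    return "".join(COMPLEMENT[base] for base in reversed(seq))

print(reverse_complement("AACGTT"))  # -> "AACGTT" (a palindromic example)
print(reverse_complement("ATGC"))    # -> "GCAT"
```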
LeNet-5 is a classic seven-layer convolutional neural network for handwritten digit recognition. Given a 28×28 grayscale image, the output of the second fully-connected layer (FC2) is used as its feature vector. The encoder consists of two fully connected layers that translate a feature vector into a one-hot encoded DNA sequence. The predictor includes two convolutional layers that predict the hybridization degree of a DNA duplex, here referred to as the “simulated yield”. In biochemistry, yield is the amount of substance obtained from a particular biological process or reaction, usually expressed as a percentage; the closer the yield is to 1, the stronger the hybridization of the DNA duplex.
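The encoder's one-hot representation of a DNA sequence can be sketched as follows. The `BASES` column order and the per-position argmax decoding are assumptions for illustration; the actual encoder layers and training are as described above:

```python
import numpy as np

BASES = "ACGT"  # assumed column order of the one-hot encoding

def logits_to_sequence(logits: np.ndarray) -> str:
    """Collapse an (L, 4) score matrix to a DNA string via per-position argmax."""
    return "".join(BASES[i] for i in np.argmax(logits, axis=1))

def one_hot(seq: str) -> np.ndarray:
    """One-hot encode a DNA string as an (L, 4) matrix of 0s and 1s."""
    m = np.zeros((len(seq), 4), dtype=int)
    for pos, base in enumerate(seq):
        m[pos, BASES.index(base)] = 1
    return m

# Hypothetical two-position encoder scores: position 1 favors A, position 2 favors T.
seq = logits_to_sequence(np.array([[0.9, 0.0, 0.1, 0.0],
                                   [0.1, 0.2, 0.1, 0.6]]))
print(seq)  # -> "AT"
```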
Figure 1B shows the workflow of the encoding architecture, and the training process includes three stages:
First, train the LeNet-5 backbone to classify the 60,000 images in the MNIST database;
Second, train the predictor on 300,000 DNA sequence pairs whose yields are labeled by NUPACK [26];
Finally, train the encoder based on the similarity of the feature vectors, their MNIST labels, and the predicted yields.
The encoding performance
The MNIST database is composed of two mutually exclusive subsets: a training set of 60,000 images and a testing set of 10,000 images. Figure 2A shows the clustering of the training images on a 2-D plane by t-SNE [27], where the left panel is based on the LeNet-5 feature vectors and the right panel on the encoded DNA sequences. In both cases, images with the same class label mostly cluster in the same group, but their neighboring relationships in the feature vector space may differ from those in the DNA sequence space.
This is because the training process considers both the label similarity and the hybridization constraint. This difference in neighboring relationships can be observed for the four image pairs in Groups A and B. In general, the encoded DNA sequences reflect the similarity of images to a high degree.
To quantify the hybridization degree, we define the yield for a pair of DNA sequences as the ratio of the duplex concentration at equilibrium to the initial strand concentration; this value generally ranges from 0 to 1. Figure 2B shows the distribution of hybridization yield as a function of the Euclidean distance between feature vectors. Most yields are larger than 0.8 when the Euclidean distance is less than 9, and the yield is close to 0 when the distance is larger than 17. When the distance ranges from 9 to 16, about one-third of the yields are close to one and the rest are close to zero. Figure 2C presents four pairs of images to illustrate the relationship between their similarity and Euclidean distance: when the distance is larger than 9, most image pairs tend to be different. All in all, the yield distribution reveals that the encoder ensures similar images usually have a large hybridization yield (> 0.7) while dissimilar ones have a low yield (< 0.2).
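The yield definition above amounts to a simple concentration ratio clipped to [0, 1]; a minimal sketch, with the nanomolar concentrations in the usage line chosen purely for illustration:

```python
def hybridization_yield(duplex_eq_conc: float, initial_strand_conc: float) -> float:
    """Yield = equilibrium duplex concentration / initial strand concentration.

    Clipped to [0, 1]; a value near 1 indicates strong hybridization.
    """
    if initial_strand_conc <= 0:
        raise ValueError("initial concentration must be positive")
    return min(max(duplex_eq_conc / initial_strand_conc, 0.0), 1.0)

# e.g. 8 nM of duplex at equilibrium from strands mixed at 10 nM each:
print(hybridization_yield(8e-9, 1e-8))  # -> 0.8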
Figure 2D shows the top 50 nearest neighbors of a query sequence for an image of digit ‘5’, ranked by the yield predicted by NUPACK. As expected, images of digit ‘5’ form the largest share, at 46%; the second and third largest shares come from digits ‘9’ and ‘6’, at about 14% and 12% respectively. Furthermore, most of the digit-‘5’ images are in the top 30. This means that the total hybridization yield of the query sequence with the sequences of digit ‘5’ should be significantly larger than that with other digits’ sequences, demonstrating that the total yield of the query sequence with each digit class can serve as an indicator of the degree of similarity.
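Tallying the class shares among the top-k neighbors can be sketched as below. The label list is hypothetical, constructed only to mirror the reported 46%/14%/12% shares, not taken from the actual experiment:

```python
from collections import Counter

def class_shares(neighbor_labels):
    """Fraction of each digit label among the top-k nearest neighbors."""
    counts = Counter(neighbor_labels)
    total = len(neighbor_labels)
    return {label: n / total for label, n in counts.items()}

# Hypothetical top-50 neighbor labels mirroring the reported shares:
labels = [5] * 23 + [9] * 7 + [6] * 6 + [3] * 5 + [8] * 5 + [0] * 4
shares = class_shares(labels)
print(shares[5])  # -> 0.46
print(shares[9])  # -> 0.14
```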
Simulated classification performance by NUPACK
We randomly select 1000 images (100 per digit) from the MNIST testing set to verify the performance of the proposed instance-based classifier built on the MNIST training set. Given a query image, we calculate the total yield of its encoding sequence with the sequences of each digit class by NUPACK; the class with the largest total yield is predicted as its label. Details of the simulation experiment can be found in the Methods section.
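The decision rule above, arg-max over per-class total yields, can be sketched in a few lines. The yield values in the usage example are hypothetical, chosen only to illustrate the rule:

```python
def classify_by_total_yield(query_yields: dict) -> int:
    """Predict the digit whose training sequences give the largest total
    hybridization yield with the query sequence.

    query_yields maps each digit label to the list of yields between the
    query sequence and that class's training sequences.
    """
    return max(query_yields, key=lambda label: sum(query_yields[label]))

# Hypothetical yields of one query against three classes' training sequences:
yields = {3: [0.82, 0.75, 0.90], 5: [0.10, 0.05, 0.20], 8: [0.30, 0.10, 0.15]}
print(classify_by_total_yield(yields))  # -> 3
```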
Figures 3A and 3B show the yield distributions of these query sequences with sequences of the same digit class (A) and of other classes (B). Comparing the two, the query sequences usually have larger yields with their own class sequences (> 0.8) and smaller yields with other class sequences (< 0.1). These observations indicate that the query sequences tend to have higher yields with their own class instances than with those of other classes.
Figure 3C presents the average accuracy for each digit class; the overall accuracy is about 95%. Only digit class ‘4’ has a notably lower accuracy (86%), which is consistent with the yield distributions in Figs. 3A and 3B: its query sequences have relatively low yields with their own class sequences and relatively large yields with other class sequences. The main reason may be its relatively low encoding quality. Returning to the clustering results in Fig. 2A, the 6000 training instances of digit ‘4’ are scattered over two sub-clusters surrounded by 6 other clusters. Furthermore, Fig. 3D presents the 51 misclassified query images, most of which are irregular handwriting. Their detailed predicted yields for each class can be found in Supplementary Table 1.
Experimental validation of the hybridization yield
We construct a small classifier by randomly selecting 50 sequences for each digit label. Figure 4A shows the average yield of the 10 query sequences (0–9) predicted by NUPACK. For each query sequence, the predicted yield with its own label is always the largest (see the diagonal value in each row), meaning this small classifier correctly recognizes the digit for all 10 query sequences (0–9).
To verify this observation, we conduct a wet-lab experiment to measure the real hybridization strength between the 10 query sequences and the 50 training samples of digit ‘3’. Figure 4B shows the measured fluorescence intensity; the larger the intensity, the higher the degree of hybridization. Each case is repeated three times. Columns A and B serve as control groups for TE buffer and TE buffer with fluorescent dye respectively, and the remaining columns correspond to query sequences 0 (Q0) through 9 (Q9). Column F (Q3) has the largest fluorescence intensity; that is, query sequence Q3 has the highest hybridization strength with the training sequences, while columns K (Q8), J (Q7), and H (Q5) have relatively larger intensities. These observed fluorescence intensities are consistent with the average predicted yields for the 10 query sequences in Fig. 4A. This demonstrates that the predicted yield can reflect the real hybridization strength, and that it is feasible to build a large DNA-sequence classifier composed of tens of thousands of instances.