TranSpec architecture. Inspired by the classic Transformer, TranSpec adopts an encoder-decoder structure (illustrated in Fig. 1b) and utilizes the multi-head attention mechanism24,25 to translate molecular spectra into the corresponding SMILES. As shown in Fig. 1b, the tokens are first positionally encoded into a \({d}_{\text{model}}=256\)-dimensional vector representation in the SMILES encoder. Here we use Kekulé-style instead of canonical SMILES to reduce the number of token types. Considering the coverage of the QM9S database, only 16 tokens are used in the present work to represent the start (<BOS>), termination (<EOS>), padding (<PAD>), two kinds of special chemical bonds (<=>, <#>), four kinds of elements (<C>, <N>, <O>, <F>), two kinds of branches (<(>, <)>), and five kinds of ring or connection labels (<1>, <2>, <3>, <4>, <5>).
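For illustration, this vocabulary can be sketched as a simple lookup table; the integer indices and the assumption that every token is a single character are ours, only the 16 token types follow the text.

```python
# A minimal sketch of the 16-token SMILES vocabulary described above;
# the integer indices are an illustrative assumption.
SMILES_TOKENS = [
    "<PAD>", "<BOS>", "<EOS>",   # padding, start, termination
    "=", "#",                    # double and triple bonds
    "C", "N", "O", "F",          # elements covered by QM9S
    "(", ")",                    # branches
    "1", "2", "3", "4", "5",     # ring-closure / connection labels
]
token_to_id = {tok: i for i, tok in enumerate(SMILES_TOKENS)}

def tokenize(kekule_smiles: str) -> list:
    """Split a Kekulé SMILES string into single-character tokens and map them to ids."""
    return [token_to_id["<BOS>"]] + [token_to_id[ch] for ch in kekule_smiles] + [token_to_id["<EOS>"]]

# Example: propyne written as Kekulé SMILES
print(tokenize("C#CC"))   # -> [1, 5, 4, 5, 5, 2]
```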
In the spectra encoder, we first use six convolutional layers and a fully connected linear layer to generate a \({d}_{\text{model}}\times {d}_{\text{model}}\) spectral feature matrix for any spectral input. Here the input can be a single IR or Raman spectrum, or packed IR and Raman spectra (referred to as IR&Raman hereafter). The spectral feature matrix then provides the V and K vectors for the subsequent decoder, in which the attention is computed as
$$\mathrm{Attention}\left(Q,K,V\right)=\mathrm{softmax}\left(\frac{Q{K}^{T}}{\sqrt{{d}_{K}}}\right)V$$
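The spectra encoder itself can be sketched as follows, assuming 1-D convolutions over the spectral axis; the channel counts, kernel size, and pooling step are illustrative assumptions, while the six convolutional layers, the fully connected layer, and the \({d}_{\text{model}}\times {d}_{\text{model}}\) output follow the description above.

```python
import torch
import torch.nn as nn

class SpectraEncoder(nn.Module):
    """Sketch of the spectra encoder: six convolutional layers plus a linear layer
    mapping an input spectrum (or packed IR&Raman spectra) to a d_model x d_model
    feature matrix that provides K and V for the decoder.
    Channel counts and kernel sizes are illustrative assumptions."""
    def __init__(self, d_model: int = 256, in_channels: int = 1):
        super().__init__()
        channels = [in_channels, 64, 64, 128, 128, 256, d_model]
        layers = []
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            layers += [nn.Conv1d(c_in, c_out, kernel_size=5, padding=2), nn.ReLU()]
        self.conv = nn.Sequential(*layers)
        self.pool = nn.AdaptiveAvgPool1d(d_model)   # fix the length of the spectral axis
        self.fc = nn.Linear(d_model, d_model)

    def forward(self, spectrum: torch.Tensor) -> torch.Tensor:
        # spectrum: (batch, in_channels, n_points); in_channels=2 for packed IR&Raman
        h = self.pool(self.conv(spectrum))          # (batch, d_model, d_model)
        return self.fc(h.transpose(1, 2))           # d_model x d_model feature matrix per sample

memory = SpectraEncoder(in_channels=2)(torch.randn(4, 2, 4000))  # packed IR&Raman example
```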
The token decoder is composed of a stack of \(N=2\) identical layers, each with three sub-layers. The first is a masked multi-head self-attention26 calculation on the SMILES vector representation:
$$\mathrm{MultiHead}\left(Q,K,V\right)=\mathrm{Concat}\left({\mathrm{head}}_{1},\cdots ,{\mathrm{head}}_{h}\right){W}^{O}$$
$${\mathrm{head}}_{i}=\mathrm{Attention}\left(Q{W}_{i}^{Q},K{W}_{i}^{K},V{W}_{i}^{V}\right)$$
where the projections are parameter matrices \({W}_{i}^{Q}\in {\mathbb{R}}^{{d}_{\text{model}}\times {d}_{q}}\), \({W}_{i}^{K}\in {\mathbb{R}}^{{d}_{\text{model}}\times {d}_{k}}\), \({W}_{i}^{V}\in {\mathbb{R}}^{{d}_{\text{model}}\times {d}_{v}}\) and \({W}^{O}\in {\mathbb{R}}^{h{d}_{v}\times {d}_{\text{model}}}\). In the present work we employ \(h=8\) parallel attention layers, or heads, resulting in \({d}_{q}={d}_{k}={d}_{v}={d}_{\text{model}}/h=32\). The multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions, while the masking ensures that the prediction at each step depends only on the already generated tokens. The second sub-layer is the key component of the network: a cross-attention calculation that uses the same multi-head mechanism to extract molecular structural information from the spectral features. Here the \(Q\) vector is generated by the SMILES self-attention, while the \(V\) and \(K\) vectors come from the spectral feature matrix. The third is a simple, position-wise fully connected feed-forward network. We employ a residual connection around each of the three sub-layers, followed by layer normalization to moderate over-smoothing27. That is, the output of each sub-layer is \(\mathrm{LayerNorm}(x+\mathrm{Sublayer}(x))\), where \(\mathrm{Sublayer}(x)\) is the function implemented by the sub-layer itself.
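A minimal PyTorch sketch of one such decoder layer is shown below, using the library's built-in multi-head attention; the feed-forward width of 1024 is an assumption not stated in the text.

```python
import torch
import torch.nn as nn

class TokenDecoderLayer(nn.Module):
    """Sketch of one decoder layer: masked multi-head self-attention over the SMILES
    tokens, cross-attention taking Q from the tokens and K, V from the spectral
    feature matrix, and a position-wise feed-forward network, each wrapped in a
    residual connection followed by layer normalization."""
    def __init__(self, d_model: int = 256, n_heads: int = 8, d_ff: int = 1024):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, tokens: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, d_model); memory: (batch, d_model, d_model) spectral features
        seq_len = tokens.size(1)
        causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        x = self.norm1(tokens + self.self_attn(tokens, tokens, tokens, attn_mask=causal)[0])
        x = self.norm2(x + self.cross_attn(x, memory, memory)[0])   # Q from tokens, K/V from spectra
        return self.norm3(x + self.ffn(x))
```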
Lastly, we apply an MLP28 and a softmax29 function to the output of the token decoder to produce the probability distribution of the current token. All parameters in the network are optimized by minimizing the cross-entropy loss30 between the output and the target sequence, which is defined as:
$$\text{CrossEntropyLoss}=\sum _{n=1}^{N}\frac{{l}_{n}}{\sum _{n=1}^{N}{w}_{{y}_{n}}},\qquad {l}_{n}=-{w}_{{y}_{n}}\log \frac{\exp \left({x}_{n,{y}_{n}}\right)}{{\sum }_{c=1}^{C}\exp \left({x}_{n,c}\right)}$$
where \(x\) is the input (the predicted logits), \(y\) is the target token, \({w}_{c}\) is the weight of class \(c\), \(C\) is the number of classes (tokens), and \(N\) spans the minibatch dimension, i.e. all tokens in the minibatch.
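In practice this corresponds to PyTorch's `nn.CrossEntropyLoss`, which applies the log-softmax and weighted averaging of the equation above internally; in the sketch below the padding-token index and tensor shapes are assumptions for illustration.

```python
import torch
import torch.nn as nn

vocab_size, pad_id = 16, 0                  # padding index assumed to be 0
logits = torch.randn(8, 32, vocab_size)     # (batch, seq_len, vocab) decoder outputs after the MLP
targets = torch.randint(1, vocab_size, (8, 32))

# nn.CrossEntropyLoss applies the log-softmax internally and averages the
# weighted per-token losses l_n; padding tokens are excluded via ignore_index.
criterion = nn.CrossEntropyLoss(ignore_index=pad_id)
loss = criterion(logits.reshape(-1, vocab_size), targets.reshape(-1))

# the per-token probability distribution used at inference time
probs = torch.softmax(logits, dim=-1)
```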
Translation on theoretical spectra datasets. Taking the QM9S dataset, which includes the IR and Raman spectra of 130K molecules from quantum chemistry calculations, as the benchmark, we first trained and optimized the TranSpec model. As illustrated in Fig. 1c, TranSpec successively outputs the tokens as well as the corresponding probabilities. Here a threshold value must be set for the probabilities; otherwise the number of output SMILES would grow without bound. As shown in Fig. 2a, a threshold value larger than 60% outputs only the most probable token at every step, which is usually called greedy search. It is well known that greedy search loses some global information, resulting in translation accuracies from the IR, Raman, and IR&Raman spectra to SMILES of 55.3%, 57.9% and 63.8%, respectively.
Alternatively, decreasing the threshold value produces more SMILES candidates (shown in Fig. 2a). For example, the average numbers of SMILES candidates produced with a threshold of 20% are 3.2, 4.1 and 3.5 for the IR, Raman, and IR&Raman spectra, while a threshold of 1% produces an average of 358.9, 561 and 366.8 candidates. Including all candidates largely increases the computational cost and decreases the analytic efficiency. We therefore rank the SMILES by the product of the token probabilities at every step to produce the Top N candidates. We regard the translation as successful if the correct SMILES is included in the Top N candidates, and other observable properties (such as the mass, dipole moment31, elemental composition, etc.) can be used to further filter the SMILES candidates. For example, providing the mass information largely decreases the number of candidates and improves the translation efficiency, as shown in Fig. 2b.
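The threshold-based search and probability ranking can be sketched as follows; `decoder_step` is a hypothetical stand-in for TranSpec's autoregressive decoding step, and the default values only mirror the thresholds and Top N sizes discussed above.

```python
import heapq

def threshold_search(decoder_step, memory, threshold=0.05, top_n=100, max_len=40,
                     bos_id=1, eos_id=2):
    """Expand every token whose probability exceeds `threshold` at each step and
    rank complete SMILES by the product of their token probabilities.
    `decoder_step(prefix, memory)` is a hypothetical callable returning the
    probability distribution over the vocabulary for the next token."""
    beams = [([bos_id], 1.0)]          # (token prefix, cumulative probability)
    finished = []
    for _ in range(max_len):
        new_beams = []
        for prefix, score in beams:
            probs = decoder_step(prefix, memory)        # vocabulary-sized list of probabilities
            for tok, p in enumerate(probs):
                if p < threshold:
                    continue                            # discard tokens below the threshold
                cand = (prefix + [tok], score * p)
                (finished if tok == eos_id else new_beams).append(cand)
        if not new_beams:
            break
        beams = new_beams
    # rank completed candidates by cumulative probability and keep the Top N
    return heapq.nlargest(top_n, finished, key=lambda c: c[1])
```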
Figure 2c-e shows TranSpec's translation accuracy rate as a function of the number of Top N candidates at threshold values of 20%, 5% and 1%. We can see from Fig. 2c-d that the translation accuracy from the Raman spectra is a little higher than that from the IR spectra when the threshold is 20% or 5%. However, the translation accuracies from the IR and Raman spectra are almost the same when the threshold is 1% (shown in Fig. 2e). The packed IR&Raman spectra further improve the translation accuracy. We also found that decreasing the threshold value slightly decreases the translation accuracy rate of the Top 1 candidate, resulting in translation accuracy rates of 66.8%, 66.3% and 65.3% for threshold values of 20%, 5% and 1%, respectively. However, a smaller threshold increases the translation accuracy rate if more candidates are included. For example, the translation accuracy rate of the Top 100 candidates from the Raman spectra is 77.8%, 92.3% and 93% for threshold values of 20%, 5% and 1%, respectively. Using the mass information as a filter further improves the translation accuracy, raising the Top 100 accuracy rate from the Raman spectra to 77.8%, 92.8% and 97% for threshold values of 20%, 5% and 1%, respectively.
Translation on experimental spectra datasets. We also tested TranSpec's accuracy by translating the experimental IR spectra extracted from the National Institute of Standards and Technology (NIST) Chemistry WebBook. Here only the 5624 IR spectra measured in the gas phase are used to keep the data uniform. Compared to the 16 tokens used for the theoretical data, three element tokens (<Cl>, <Br>, <S>) are added and two ring or connection tokens (<4>, <5>) are removed for the experimental data. The translation accuracy rates on the experimental IR spectra using the Top 1 and Top 100 candidates with a threshold of 1% are only 8% and 39.9%, respectively. This low translation accuracy can be attributed to the small size of the experimental dataset. To further improve the translation accuracy, we again use the molecular relative mass to filter the SMILES candidates. In this way the translation accuracy rates of the Top 1 and Top 100 candidates with a threshold of 1% on the experimental IR spectra are increased to 21.8% and 66%, respectively (as shown in Fig. 2f).
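A minimal sketch of such a mass-filtering step using RDKit is given below; the mass tolerance is an illustrative assumption.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

def filter_by_mass(smiles_candidates, target_mass, tol=0.5):
    """Keep only chemically valid candidates whose molecular weight matches the
    measured relative mass within `tol` (the tolerance is an illustrative choice)."""
    kept = []
    for smi in smiles_candidates:
        mol = Chem.MolFromSmiles(smi)
        if mol is not None and abs(Descriptors.MolWt(mol) - target_mass) <= tol:
            kept.append(smi)
    return kept

# Example: only the candidate matching the target mass of acetone (58.08) survives
print(filter_by_mass(["CC(=O)C", "CCO", "CCC"], target_mass=58.08))
```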
Table 1. Identification accuracy rate (%) of the 12 most common functional groups using TranSpec, obtained by translating the theoretical IR, Raman, and IR&Raman spectra and the experimental IR spectra.

| Functional group | Theo. IR | Theo. Raman | Theo. IR&Raman | Exp. IR |
|---|---|---|---|---|
| alkane | 99.80 | 99.86 | 99.91 | 96.79 |
| alkene | 96.35 | 98.03 | 97.91 | 92.59 |
| alkyne | 98.92 | 99.57 | 99.79 | 100 |
| aromatics | 95.76 | 96.30 | 97.28 | 96.97 |
| alcohols | 96.95 | 96.74 | 98.36 | 100 |
| aldehydes | 94.83 | 94.21 | 96.84 | 60 |
| ketones | 90.43 | 88.88 | 90.25 | 75 |
| esters | 91.54 | 86.35 | 90.44 | 95.45 |
| ether | 96.89 | 93.87 | 97.33 | 94.34 |
| amides | 83.64 | 82.90 | 87.44 | 55.56 |
| nitriles | 99.34 | 99.12 | 99.56 | 100 |
| amine | 88.44 | 87.10 | 92.76 | 76 |
Identification of functional groups. Considering that greedy search can lose some global information, we examined the capability of TranSpec to identify different functional groups. The identification accuracies of the 12 most common functional groups in the QM9S and experimental IR datasets are listed in Table 1. We use RDKit32 to determine whether each compound contains a specific functional group. As can be seen, about 90% of the functional groups in the QM9S dataset can be identified. Furthermore, we found that the alkane, alkene, alkyne and aromatics groups are more easily identified from the Raman spectra than from the IR spectra, whereas the alcohols, aldehydes, ketones, esters, ether, amides, nitriles and amine groups are more easily identified from the IR spectra. Packing the IR and Raman spectra further enhances the identification accuracy for all groups except the alkene, ketones and esters. The identification accuracy of functional groups from the experimental IR spectra ranges from 55.56% to 100%; the lower values can be attributed to the small size of the dataset and the low coverage of the corresponding groups.
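A sketch of how such functional-group checks can be performed with RDKit substructure matching is given below; the SMARTS patterns shown are common textbook patterns for a few of the groups in Table 1 and are not necessarily those used in this work.

```python
from rdkit import Chem

# Illustrative SMARTS patterns for a subset of the functional groups in Table 1.
FUNCTIONAL_GROUPS = {
    "alkyne":    "[CX2]#[CX2]",
    "alcohols":  "[OX2H][CX4]",
    "aldehydes": "[CX3H1](=O)[#6]",
    "ketones":   "[#6][CX3](=O)[#6]",
    "nitriles":  "[NX1]#[CX2]",
}

def detect_groups(smiles: str) -> list:
    """Return the functional groups present in a molecule via substructure matching."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return []
    return [name for name, smarts in FUNCTIONAL_GROUPS.items()
            if mol.HasSubstructMatch(Chem.MolFromSmarts(smarts))]

print(detect_groups("CC(=O)C"))   # acetone -> ['ketones']
```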
Distinguishing similar molecules from spectra. Taking several molecules extracted from the validation and test sets of QM9S as examples, we demonstrated the capability of TranSpec to distinguish isomers and homologues. As shown in Fig. 3, the IR or Raman spectra of the four isomers or homologues are very similar. However, TranSpec captures the fine distinctions (highlighted by the gray boxes in Fig. 3) and outputs the correct SMILES using greedy search. From this point of view, TranSpec can distinguish molecules with similar structures from very similar spectra.
Overall, the TranSpec model can quickly and efficiently recognize molecular species by directly translating theoretical or experimental IR or Raman spectra into SMILES representations. The translation accuracy rate on the QM9S theoretical spectra dataset was found to be about 60% with greedy search, while the threshold search can reach about 93% if the Top 100 candidates are included. The translation accuracy rate on the experimental IR spectra is lower because of the low coverage and heterogeneity of the dataset. To further increase the translation accuracy, we proposed the IR&Raman packing and mass-filtering strategies. TranSpec thus realizes molecular species recognition, functional group identification, and the distinction of isomers and homologues, validating the possibility of real-time structural identification based on spectroscopic measurements.