Characteristic of mutant sequences in human cancers

Mutation processes leave different signatures in genes. Previous studies suggested that both the mutated and flanking bases influence somatic mutation characteristics. However, the understanding of how flanking sequences influence somatic mutation characteristics is limited. Here, we constructed A long short-term memory – self organizing map (LSTM-SOM) unsupervised neural network. By extracting mutated sequence features via LSTM and clustering similar features with SOM, we obtained 10 classes of mutant sequences (named mutation blots, MBs) from 2,141,527 somatic mutations in The Cancer Genome Atlas (TCGA) database. Differences features were revealed among MBs. MBs were related to clinical features, including age, sex, and cancer stage. Different kinds of MBs for specific genes may affect patient survival. Finally, we clustered the patients into 7 classes by MB composition. Significant differences in survival and clinical features were observed among different patient classes. This study provides a novel method for understanding mutant sequences and revealing the extensive relationships among mutant sequences, clinical features, and cancer patient survival. the results showed a clear relationship between MB and clinical features, the details of the relationship as well as its mechanism still require further study.


Introduction
The stability of the cell genome is continually threatened by endogenous and exogenous factors that may lead to DNA damage 1,2 . If not repaired properly, DNA damage may result in genetic mutations 3,4 . The development of cancers involves a series of genetic mutations 5 . A number of internal and external factors underlying genetic mutations have been identified, such as smoking, alcohol consumption and mismatch repair deficiency 5,6 . In some kinds of cancers, such as colon cancer and breast cancer, there has been a great deal of research elucidating the relationship between genetic mutations and cancer-related processes 7 . However, in most cases, the role of genetic mutations in tumor progression is still poorly understood.
Genetic mutations include single-base substitutions (SBSs), small insertions and deletions (indels), genome rearrangement and chromosome copy-number changes 8 . In clinical studies, patients with mutations in a given gene show differences in survival and drug susceptibility [9][10][11] . With the development of sequencing technology, large amounts of mutation data from cancer patients have been obtained and made available in relevant databases, such as The Cancer Genome Atlas (TCGA) database. In the context of increasing sample sizes, a number of mutation signatures that are correlated with certain mutation processes have been identified. For example, C:G > A:T transversions predominate in smoking-associated lung cancer 12 , and CC:GG > TT:AA double nucleotide substitutions are common in UV light-associated skin cancer 13 .
SBSs contribute the largest proportion of genetic mutations. Mathematical methods have been used to decipher mutation signatures from somatic mutation catalogs 2,8,[15][16][17][18][19][20][21] . Some of the identified mutational signatures have been proposed to be associated with certain etiologies (e.g., in the classification system of Alexandrov et al. 2 , SBS2 may be caused by tobacco smoking) 2 . The clustering methods applied in some studies have included 1-2 bases next to mutated bases, and the results have suggested that flanking bases influence mutation signatures 2,8 . However, the inclusion of adjacent genes in such analyses leads to an exponential increase in the number of possible classifications, which makes it difficult to analyze the effect of flanking sequences on mutation signatures.
The application of machine learning, especially neural networks, makes it possible to effectively mine information from large amounts of data. A long short-term memory (LSTM) network is a special kind of recurrent neural network (RNN). Compared with a naive RNN, LSTM performs better in extracting features from long sequences, such as sentences 22,23 . LSTM has been used to analysis DNA or RNA sequences information in some studies [24][25][26] . A self-organizing map (SOM) algorithm is an unsupervised clustering algorithm. The method of "competitive learning" can identify interconnections between samples and present their categories in a lower-dimensional form 27,28 . The use of LSTM to extract the features of mutated sequences and the identification of similar features with the SOM algorithm provided an approach for analyzing the characteristics of mutated sequences and their relationship with cancer development.

SBS clustering via the LSTM-SOM model
A total of 2,141,527 somatic SBS data points from 9596 patients were collected from the TCGA database. For each SBS sample, 100 flanking bases (50 bases at the 5' end and 50 at the 3' end) were included in the LSTM training data. Flanking base sequences were obtained from Homo sapiens (Human) genome Assembly GRCh38 (hg38).
In brief, our LSTM-SOM model functions by extracting the features of mutant sequences via the LSTM network and then taking the generated feature vector as the input data for the SOM. Units in competitive layers of the SOM are then refreshed to edges closer to the distribution of the input data. In particular, after each iteration of SOM in our LSTM-SOM model, not only will the units in the competitive layer of SOM be refreshed, but the input data generated by LSTM will also be adjusted in the opposite direction (Fig. 1). Then, the refreshed input data are used as the labels to train the LSTM model. The above steps were repeated until the LSTM outputs formed clear classifications (Methods).
Vectors were used to represent the bases of mutated sequences for training. In the LSTM process, the influence of unit data on the LSTM output results decreased with increasing distance to the ending unit. As the training data included flanking sequences on both sides of the mutation sites, the LSTM process was carried out on both sides of the mutation site in opposite directions. Thus, the mutation site was placed at the end of both sequences to expand its influence on LSTM output and to reflect the difference between the reference allele and mutant allele.
One hundred samples from different cancers were selected randomly in each iteration of training (batch size = 100). To avoid overfitting of the model, the weight of the vectors in the competitive layer was updated after all input data were trained in one batch. Each iteration of training included 2 LSTM iterations and 2 SOM iterations. By adjusting the parameters of the model, mutated sequences were clustered into 2 types after one stage of training. Thus, we obtained 8 classes of mutated sequences (for easy understanding, mutated sequences with different features clustered by LSTM-SOM are referred to as mutation blots, MBs) after 3 rounds of training. Then, an additional round of training was performed for 2 classes of MB with a significantly larger number of samples and ultimately revealed 10 classes of MBs, recorded as MB 1 -MB 10 ( Fig. 1).   Table 1).

Fig. 2 | Mutation type and composition of flanking bases in different MBs. Each bar, except "Reference
Allele" and "Mutation Allele" represents one flanking genetic loci. Bars on the left of "Reference Allele" represent bases on the 5' end of mutation site; bars on the right of "Mutation Allele" represent bases on the 3' end of mutation site.  The composition of the bases flanking the mutation sites differed considerably. Generally, units located far from the endpoint had less influence on the LSTM output 23 . This characteristic was reflected in the flanking bases of the mutation site. In all kinds of MBs, the proportions of A, T, C, and G were quite different among the bases near the mutation site of each MB. With an increase in the distance from the mutation site, the proportions of the four bases tended to become balanced.
The clustering results were strongly influenced by the flanking bases of the mutation site. For example, both MB 5 and MB 7 exhibited C>A mutations, and the flanking bases of MB 5 were dominated by T bases, but MB 7 was dominated by A bases. Differences in flanking bases could also be observed in other classes of MBs with similar mutation features, such as MB 2 and MB 6, MB 4 and MB 8 (Fig. 2). In the analysis of cancers with a high incidence (lung, breast, prostate, colon, stomach, bladder, ovary, cervix uteri, liver, thyroid, skin and kidney cancers), the composition of the bases in the mutation site and the flanking sites of each MB basically followed that in the entire sample (Extended Data Fig. 1). We then counted mutated sequences with high frequency in different kinds of MBs. The results showed differences in the mutated sequences between MBs. For example, when the 3 flanking bases were taken into consideration, the mutated sequences with the highest frequency were TTTCTTT>TTTATTT, ATTCTTT>ATTATTT and ATTCTCT>ATTATCT in MB 5 but AAACAAA>AAAAAAA, AAACAAA>AAAAAAT and AGACAAT>AGAAAAT in MB 7, although the 2 classes of MBs exhibited the same mutation signature. Such differences are common among different kinds of MBs (Extended Data Table 2).

Extended Data Fig. 1 | Mutation type and composition of flanking bases of each MB in different cancers.
Extended Data Tabel

MBs in different cancers
Significant differences in the composition of MBs existed among cancers with different pathologies. Overall, MB 4, MB 5, MB 7 and MB 8 accounted for much greater percentages of the MBs than the other classes of MBs, especially MB 4 and MB 8. Malignant mesenchymal tumors (sarcomas) seemed to present a higher percentage of MB 2 and MB 6 than epithelial malignant tumors (carcinomas). Transitional cell carcinoma of the urinary tract showed a distinctly higher MB 1 incidence than other cancers. Cancers of germ cells and the glomus (paragangliomas) exhibited a high proportion of MB 10. An obvious feature of melanomas was the dominance of MB 4 and MB 8. This finding suggested that these classes of MBs may be correlated with ultraviolet light exposure. In general, a high incidence of MB 4 and MB 8 was also observed in other pathologic types of cancers that are considered more likely to be caused by external mutagenic exposure, such as squamous cell carcinoma (SCC), transitional cell carcinoma, malignant mesothelioma and complex epithelial carcinoma (Fig. 3).
The components of MBs varied in different cancers, and some cancers presented distinct features (Fig. 3). The proportions of MBs in different cancers were influenced by the pathological type to some extent. For example, cancers of the skin and lymph nodes showed extraordinarily high proportions of MB 4 and MB 8 but small proportions of other MBs. In both types of cancers, melanoma is the major pathologic type. Lung cancer presented high proportions of MB 5 and MB 7. Among the 2 major pathological types of lung cancer, adenocarcinoma (AC) exhibited much higher proportions of MB 5 and MB 7 than did SCC. This was consistent with the MB composition in the two pathological types. However, for the same pathological type, differences in the MB composition could be observed in different cancers. For example, AC of the colon presented higher proportions of MB 4 and MB 8 than did AC of the lung. SCC of the lung exhibited more MB5 and MB 7 than SCC in the head and neck (Extended Data Fig. 2). Therefore, MB may comprehensively reflect the difference in cancers in both locations and pathological types. We then counted the high-frequency mutated genes in different classes of MBs. In most classes of MBs, the frequency of genes that were commonly mutated in malignant tumors, such as TTN, TP53, and MUC16, was relatively high. Distinct features existed in some classes of MB. The proportion of TP53 mutations was generally high, but it was relatively low in MB 7 and MB 10. Remarkably, BRAF was the most common mutated gene in MB 9 and takes a high proportion. A higher proportion of PIK3CA mutations was observed in MB 8 and MB 10 than in the other classes of MBs (Extended Data Table 3).
More distinct features could be observed when considering specific cancers. For example, in pancreatic cancer, MB 4 and MB 5 contained a higher frequency of KRAS mutations (>1%) than did the other classes of MBs. In kidney cancer (mainly clear cell carcinoma), the frequency of VHL mutations ranked high in MB 3, MB5, MB7 and MB 9. In cancer of skin and thyroid, BRAF mutations were common in MB 9 but not in the other classes of MBs (Extended Data Fig. 3). The differences in the mutated gene composition of different MBs suggested that attention should be paid to the effect on the characteristics of cancer when different MBs occur in the same gene. In a given gene, SBSs may occur at different bases with different features and present as different kinds of MBs. Each gene with a high mutation frequency contained multiple kinds of MBs. In these mutated genes with a high frequency, the composition of MBs varied between different kinds of cancers. Such differences reflected the overall MB composition of each cancer (Fig. 4). To further study the influence of certain genes with different MBs on survival, we analyzed the survival of patients who exhibited mutations in genes with high mutation frequencies (TTN, MUC16, TP53, DNAH5, USH2A, PIK3CA, SYNE1, etc.). Patients were grouped according to the MB classification of specific genes. The results of survival analysis showed a significant correlation between survival and MB for specific genes. Patients carrying genes with MB 4 and MB 8 mutations usually showed better survival. In contrast, MB 1, 6, and 9 in a gene could predict worse survival ( Fig. 4 and Extended Data Fig. 4, 5).

Relationship between MB and clinical features of cancer patients
Analysis was performed to determine the relationship between MBs and the clinical features of tumor patients, including their age, sex, weight, AJCC stage and TNM stage. The change in MBs showed a nonmonotonic trend with patient age. The percentages of MB 2, MB 5, and MB 7 in single patients increased with age within the first interval (<70 for MB 2; <75 for MB 5 and MB 7) but decreased when age exceeded the threshold. This trend was reversed for MB 4 and MB 8. An exception was observed for MB 9, whose proportion in single patients decreased monotonically with age. The proportion of MBs in a single patient generally varied between the sexes. Female patients were likely to show higher percentages of MB 2, MB 3, MB 5, MB 6 and MB 10, while male patients exhibited higher percentages of MB 4, MB 8 and MB 9. The difference was not significant in MB 1 and MB 7. No apparent rule regarding the relationship between the weight and MB composition of a patient was observed (Fig. 5).   (Fig. 5).
Considering the differences in the clinical significance of staging in different cancers, further analysis was performed for each cancer with high incidence. In most cancers, the MB composition at different ages basically followed the pattern shown in the total samples (i.e., presenting a nonmonotonic relationship). Younger and older patients showed a similar feature of the MB composition in the form of a conic structure in the bar graph. This suggests that the similarity of the cancer biology of young and old patients requires further study. The proportion of MB 2 in most cancers was significantly higher in males than in females. Regarding cancer staging, T and M stages showed obvious tendencies in most kinds of cancers, and their trends were basically consistent with those for the total sample. Stomach cancer and colon cancer, in particular, showed opposite MB tendencies in T and N stages compared with the entire sample and with other cancers with high incidence. This suggests that local and lymph node progression in gastrointestinal cancers may exhibit distinct mechanisms (Extended Data Fig. 6 -9). Generally, although the results showed a clear relationship between MB and clinical features, the details of the relationship as well as its mechanism still require further study.

MBs and survival of cancer patients
Usually, there are multiple kinds of MBs in a single patient. To further analyze the influence of the MB composition on the clinical features of patients, a K-means clustering method was used to classify patients according to MB composition. Different kinds of MBs were recorded according to their proportion rather than their number in a single patient. After analysis via the "elbow method" and the examination of clustering results, K=7 was selected as the number of classes to be distinguished. Clustered patients were designated as Classes 1 -7.  (Fig. 6a).
In the survival analysis, significant differences in survival curves were observed in different classes of patients (Fig. 6b). We then carried out a pairwise comparison of survival by applying the log-rank test between different classes of patients. Patients with Classes 2, 4, and 5 showed better survival, and patients with Classes 1, 3, 6, and 7 showed worse survival (Fig. 6c). In the analysis of specific cancers, survival in different classes of patients generally followed the results obtained for the total sample (Extended Data Fig. 10), but with some discrepancies and not that significant. Class 3 patients, in particular, seemed to show poor survival for most of the analyzed cancers. Considering the relatively balanced composition of MBs in patients of Classes 1, 3 and 5, especially Class 3, these results suggested that a balanced MB composition may predict poor survival in patients. Patients of different classes showed distinct clinical features (Fig. 6d, e). According to AJCC staging, a significantly lower proportion of stage IV patients and a higher proportion of stage I patients were observed in Classes 4 and 5, which may be related to the better survival of these 2 classes of patients. Interestingly, Class 4 included significantly more T4 patients but hardly any M1 patients. This suggests that the MB composition of Class 4 may be associated with the local progression of cancers. Class 6 patients showed the highest percentage of AJCC stage 4 and lowest percentage of AJCC stage I, which may be the reason for the poor survival of these patients. In the analysis of age, patients of Class 3 were found to present significantly greater ages. At the same time, the weight of Class 3 patients was also high. These factors may be partly responsible for the poor survival of Class 3 patients. In contrast to the findings for Class 3, although the patients of Class 1 exhibited a similarly higher age, their weight was not very high. Class 1 patients exhibited a high percentage of AJCC stage 1 and a low percentage of stage IV. Moreover, the proportion of N0 patients in Class 1 was significantly higher than that in other classes, and the proportion of M1 patients in Class 1 was average. Therefore, further study is still needed to determine the mechanism causing Class 1 patients to show poor survival.

Discussion
Previous studies have proven that different mutation characteristics may be associated with different triggers involved in various mutation processes and result in differing biological behaviors of cancers. Some mathematical methods are now used for clustering the signature of mutation. A variety of mutation signatures that may be related to the biology and etiology of cancer have been identified 2,8,[15][16][17][18][19][20][21][31][32][33][34] . LSTM is a machine learning approach that is good at extracting the features of long sequences 35 . This approach provided us with a method for extracting the features of mutated sequences across a wider spatial scope. A follow-up SOM method can then be used to discover internal relationships between the extracted features and ultimately obtain different categories of mutant sequences.
There are still some constraints and limitations of this study. The clustering results obtained with the LSTM-SOM model is largely dependent on the selection of SOM parameters (especially the neighborhood function parameter, and the threshold value determines whether units in the SOM competition layer move to the target). We adjusted the parameters so that we could divide the samples into two classes in one round of training and obtained 10 classes after 3 rounds and an extra round of training. However, there exist the possibility that when training with other parameters, the classification obtained may be related to clinical features that were not included in this study, which needs further study.
The results of our study showed that even among patients with mutations in the same gene, significant differences may exist in their clinical features and survival. The mechanism of machine learning models is difficult to explain 36 . It is meaningful to use a mathematical method for exploring the mechanism whereby LSTM-SOM functions, improving the interpretability of the LSTM-SOM model and explaining the formation of different classes of MB to determine how sequences of bases affect the characteristics of cancers. Moreover, molecular biology methods are helpful for explaining the differences in the characteristics of MBs.
The TCGA database includes a large sample of mutation data. Due to the natural differences in cancer incidence, large differences exist between different cancers. In different cancers, MB may be involved in different kinds of cancer-related processes. Therefore, the analysis of the relationship between MBs and distinctive clinical features in specific kinds of cancer can provide more information about how MBs are related to cancer etiology, processes, prognosis and drug susceptibility. In summary, this study provided a method for classifying the characteristics of mutant sequences and explored their relationships with the clinical features and prognosis. Further study of the mechanism of MBs related to cancer characteristics is suggested.
Step  RNNs have long been used in the analysis of sequence data, for example, for speech recognition. A naive RNN effectively analyzes short sequences. When used for long sequences, a naive RNN may show the problems of gradient disappearance and gradient explosion. An LSTM network is based on the network structure of RNNs 22 . The LSTM approach introduces the mechanisms of "forgetting" and "memory". Thus the capacity of the LSTM network to analyze long sequences is improved by controlling the long-term state 23 . As the "forgetting" mechanism of LSTM, the unit closer to the end of the sequences has a greater influence on the output of LSTM. For the mutation site to have the greatest impact on the output result, LSTM is designed to read from both ends of the mutated sequence. The sequence upstream of the mutation site is read from 5' to 3' (forward sequence), and the sequence downstream of the mutation site is read from 3' to 5' (reverse sequence). In this way, the mutation site is placed at the ends of both sequences to reinforce its influence on the LSTM output.
We used the torch.nn package in PyTorch to construct a neural network. The LSTM procedure that we used consists of two hidden layers, each with 64 nodes. The data subsequently entered a full connection layer, and a 1×8 vector was finally output as the feature vector of a single mutated sequence.

Step 2. Clustering by SOM
The SOM consists of two kinds of layers: an input layer and a competition layer. The randomized units in the competition layer were trained to describe the distribution of units in the input layer via the mechanism of "competitive learning" 28 . In the SOM process of LSTM-SOM model, the feature vector obtained from the LSTM process is used as the input. Units in the competition layer are adjusted continuously according to their distance to the input unit. For one input unit, the unit in the competition layer nearest to it is regarded as the "winning unit", which will move the maximal distance to the input unit (target), and for the other units, their travel distance to the target decreases with the increase in the distance to the winning unit. To avoid an excessive concentration of the results, we set a threshold value in the model. When the distance between the competition layer unit and the target is over the threshold value, the unit will move in the opposite direction to the target. In particular, not only will units in the competition layer be updated in our SOM model, but the input unit will also be updated in the opposite direction of the vector sum of the competition layer unit movement. Then, the updated input unit will be used as a label to train the LSTM model.
First, we obtained feature vectors of 100 samples from LSTM in one batch, and they were used as the input units of the SOM. The settings included 200 units in the SOM competition layer. For each input vector, the Euclidean distance between it (x) and each unit in the competition layer ( ! ) was calculated as follows: The unit closest to x is recorded as '"( , and the distance between '"( and each other competition layer unit is calculated as follows: A threshold of was set in the process of training. If ! ( '"( ) ≤ , ! will move in the direction of x; otherwise, ! will move in the opposite direction. The transportation distance decays with an increase in ! ( '"( ). The neighborhood function is referred to the Gaussian function 37 : In the decay function, is a constant that affects the amplitude of transportation distance decay. The update vector is as follows (where L is the learning rate of SOM)： Δ1 ! 3 = : × ( ! ) × ( ! − ) ! ( '"( ) ≤ − × ( ! ) × ( ! − ) ! ( '"( ) > It can be seen that when the distance between ! and the target is less than S, they will approach each other. Otherwise, they will pull away from each other. Due to the existence of the decay function, the influence of distant units on each other is very small, and no excessive dispersion of units was observed in training. To avoid overfitting, the units in the SOM competition layer are updated after each training batch of 100 samples. The samples in each batch are selected randomly from different cancers. To change the discrete status of the input vectors and cause similar input vectors to aggregate, the input units are updated in the opposite direction ( is the input vector): Step 3. Train the LSTM model The updated ( ) is used as the label to train the LSTM network. In this way, the output feature vectors of LSTM with similar features can be gradually closed.