Insight of characteristic of mutation sequences in human cancers via an unsupervised neural network approach.

Background: Mutation processes leave different signatures in genes. Previous studies have suggested that both the mutated and flanking bases influence somatic mutation characteristics. However, understanding of how flanking sequences influence somatic mutation characteristics is limited. Materials and methods: We constructed a long short-term memory self-organizing map (LSTM-SOM) unsupervised neural network. By extracting mutated sequence features via LSTM and clustering similar features with SOM, somatic mutations in The Cancer Genome Atlas database were clustered according to their mutation type and flanking sequences. The relationship between the resulting classes of mutant sequences (named mutation blots, MBs) and cancer characteristics was then analyzed. Finally, we clustered the patients into classes according to their MB composition by the K-means method and studied the differences in clinical features and survival between classes. Results: Ten classes of MBs were obtained from 2,141,527 somatic mutations. Different features in mutated bases and flanking sequences were revealed among MBs. MBs reflected both the site and pathological features of cancers. MBs were related to clinical features, including age, sex, and cancer stage. The class of MB in a given gene was associated with survival. Finally, patients were clustered into 7 classes according to MB composition. Significant differences in survival and clinical features were observed among patient classes. Conclusions: Our study provides a novel method for analyzing the information of mutant sequences and reveals extensive relationships among mutant sequences, clinical features, and cancer patient survival. Further study of the mechanism by which MBs relate to cancer characteristics is suggested.


Introduction
The stability of the cell genome is continually threatened by endogenous and exogenous factors that may lead to DNA damage. 1,2 If not repaired properly, DNA damage may result in genetic mutations. 3,4 The development of cancers involves a series of genetic mutations. 5 A number of internal and external factors underlying genetic mutations have been identified, such as smoking, alcohol consumption and mismatch repair deficiency. 5,6 In some kinds of cancers, such as colon cancer and breast cancer, there has been a great deal of research elucidating the relationship between genetic mutations and cancer-related processes. 7 However, in most cases, the role of genetic mutations in tumor progression is still poorly understood.
Genetic mutations include single-base substitutions (SBSs), small insertions and deletions (indels), genome rearrangements and chromosome copy-number changes. 8 In clinical studies, patients with mutations in a given gene show differences in survival and drug susceptibility. [9][10][11] With the development of sequencing technology, large amounts of mutation data from cancer patients have been obtained and made available in relevant databases, such as The Cancer Genome Atlas (TCGA) database. In the context of increasing sample sizes, a number of mutation signatures that are correlated with certain mutation processes have been identified. 12,13
SBSs contribute the largest proportion of genetic mutations. Mathematical methods have been used to decipher mutation signatures from somatic mutation catalogs. 2,8,[14][15][16][17][18][19][20] The clustering methods applied in some studies have included 1-2 bases next to mutated bases, and the results have suggested that flanking bases influence mutation signatures. 2,8 However, the inclusion of additional adjacent bases in such analyses leads to an exponential increase in the number of possible classifications, which makes it difficult to analyze the effect of flanking sequences on mutation signatures.

A long short-term memory (LSTM) network is a special kind of recurrent neural network (RNN). Compared with a naive RNN, LSTM performs better in extracting features from long sequences, such as sentences. 21,22 LSTM has been used to analyze DNA or RNA sequence information. [23][24][25] A self-organizing map (SOM) algorithm is an unsupervised clustering algorithm. Its "competitive learning" procedure can identify interconnections between samples and present their categories in a lower-dimensional form. 26,27 The use of LSTM to extract the features of mutated sequences, combined with the identification of similar features by the SOM algorithm, provides an approach for analyzing the characteristics of mutated sequences and their relationship with cancer development.

Data availability
SBS data and clinical data of the patients involved in this study were obtained from the TCGA database. In the LSTM-SOM model, 100 flanking bases were included in the analysis, and the flanking sequences were obtained from the Genome Reference Consortium human genome build 38 (GRCh38) based on the mutation sites of SBSs in the TCGA data. Reference bases provided by TCGA were compared with GRCh38 to further ensure accuracy. Owing to the "forgetting" mechanism of LSTM, 21,22 units closer to the end of a sequence have a greater influence on the LSTM output. In our model, the LSTM is therefore designed to read from both ends of the mutated sequence. In this way, the mutation site is placed at the end of both sequences to reinforce its influence on the LSTM output.
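As an illustration of the two-ended reading described above, the following minimal Python sketch builds the two input sequences for one SBS. The function name `two_end_views`, the `ref>alt` token used for the mutation site, and the choice to reverse the 3' flank are assumptions made for illustration; the source does not specify the exact encoding.

```python
def two_end_views(five_prime, ref, alt, three_prime, flank=50):
    """Return two token lists that both END at the mutation site.

    Hypothetical helper (not the authors' code):
    - five_prime / three_prime: flanking bases on each side of the site
    - ref / alt: reference and mutated base at the site
    """
    if len(five_prime) != flank or len(three_prime) != flank:
        raise ValueError("each flank must contain exactly `flank` bases")
    site = f"{ref}>{alt}"  # assumed token for the substitution itself
    # View 1: read the 5' flank in its natural order, finishing at the site.
    left_view = list(five_prime) + [site]
    # View 2: read the 3' flank from its far end inward, finishing at the site.
    right_view = list(reversed(three_prime)) + [site]
    return left_view, right_view


# Toy example with 3 flanking bases per side instead of 50.
left, right = two_end_views("ACG", "C", "T", "GTA", flank=3)
print(left)
print(right)
```

With 50 bases per side, each view would contain 51 tokens, and the last token of both views is the mutation site, which is the position that most strongly influences the final LSTM state.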

LSTM-SOM model building
We used the torch.nn package in PyTorch to construct the neural network.
Step 1. Feature extraction by LSTM
The LSTM that we used consists of two hidden layers, each with 64 nodes. The data subsequently enter a fully connected layer, and a 1×8 vector is finally output as the feature vector of a single mutated sequence.
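A minimal PyTorch sketch of such a feature extractor is given below. The layer sizes (two LSTM layers, 64 hidden units, 8 output features) follow the text; the class name `MutationEncoder`, the embedding layer, the 16-token vocabulary (4 bases plus 12 possible substitutions), and the concatenation of the two final hidden states are assumptions for illustration rather than the authors' implementation.

```python
import torch
import torch.nn as nn


class MutationEncoder(nn.Module):
    """Two-layer LSTM (64 hidden units) + fully connected layer -> 8 features."""

    def __init__(self, n_tokens=16, embed_dim=8, hidden=64, layers=2, out_dim=8):
        super().__init__()
        # Assumed input encoding: integer tokens mapped to learned embeddings.
        self.embed = nn.Embedding(n_tokens, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden, num_layers=layers, batch_first=True)
        # The two views are concatenated, hence 2 * hidden input features.
        self.fc = nn.Linear(2 * hidden, out_dim)

    def forward(self, left_view, right_view):
        # Each view: (batch, seq_len) integer tokens with the mutation site last.
        _, (h_left, _) = self.lstm(self.embed(left_view))
        _, (h_right, _) = self.lstm(self.embed(right_view))
        # h_*[-1] is the final hidden state of the top LSTM layer: (batch, hidden).
        joint = torch.cat([h_left[-1], h_right[-1]], dim=1)
        return self.fc(joint)  # (batch, 8) feature vectors


encoder = MutationEncoder()
left = torch.randint(0, 16, (5, 51))   # 5 toy sequences, 50 flanking bases + site
right = torch.randint(0, 16, (5, 51))
features = encoder(left, right)
print(features.shape)
```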
Step 2. Clustering by SOM
The SOM consists of two kinds of layers: an input layer and a competition layer. 27 In the SOM process of the LSTM-SOM model, the feature vector obtained from the LSTM process is used as the input.
The settings included 200 units in the SOM competition layer. For each input vector $x$, the Euclidean distance between $x$ and each unit $w_i$ in the competition layer was calculated as follows:
$$d_i(x) = \lVert x - w_i \rVert$$
The unit closest to $x$ is recorded as $w_{\min}$, and the distance between $w_{\min}$ and each other competition layer unit is calculated as follows:
$$d_i(w_{\min}) = \lVert w_i - w_{\min} \rVert$$
A threshold $S$ was set in the process of training. If $d_i(w_{\min}) \le S$, $w_i$ will move in the direction of $x$; otherwise, $w_i$ will move in the opposite direction. The distance moved decays with an increase in $d_i(w_{\min})$. The neighborhood function is a Gaussian function of $d_i(w_{\min})$, 28 in which $\sigma$ is a constant that affects the amplitude of the decay of the distance moved. The update vector of each unit is scaled by this neighborhood function and by $L$, the learning rate of the SOM.
Step 3. Train the LSTM model
The updated $x$ is used as the label to train the LSTM network. In this way, the output feature vectors of LSTM with similar features gradually move closer together.
The above three steps are repeated until a clear, stable classification is obtained.
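One SOM step of the kind described above can be sketched in NumPy as follows. The exact neighborhood and update equations are not available here, so the Gaussian form exp(-d^2 / (2 sigma^2)), the signed update of the units, and the rule that moves the input toward the winning unit are all assumptions; `som_step` and its parameter names are hypothetical.

```python
import numpy as np


def som_step(x, units, rank=40, sigma=1.0, lr=0.005):
    """One hedged SOM update for a single input vector x.

    x:     (dim,) feature vector from the LSTM
    units: (n_units, dim) competition-layer units, n_units > rank
    """
    d_x = np.linalg.norm(units - x, axis=1)              # distance of x to every unit
    winner = int(np.argmin(d_x))                         # index of the closest unit
    d_w = np.linalg.norm(units - units[winner], axis=1)  # distance of units to winner
    threshold = np.sort(d_w)[rank]                       # S: distance of the rank-th unit
    h = np.exp(-d_w ** 2 / (2.0 * sigma ** 2))           # assumed Gaussian neighborhood
    # Units within the threshold move toward x, the others move away from it.
    sign = np.where(d_w <= threshold, 1.0, -1.0)
    new_units = units + (sign * lr * h)[:, None] * (x - units)
    # Assumed counterpart update: the input moves toward the winning unit;
    # the updated input then serves as the training label for the LSTM.
    new_x = x + lr * (units[winner] - x)
    return new_units, new_x, winner


rng = np.random.default_rng(0)
units = rng.normal(size=(200, 8))   # 200 units, 8-dimensional features
x = rng.normal(size=8)
new_units, new_x, winner = som_step(x, units)
print(winner, new_x.shape, new_units.shape)
```

In a full training loop under these assumptions, `new_x` would be collected for every mutated sequence and used as the regression target when the LSTM is trained in Step 3.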

Obtain the classification
We adjusted the parameters to optimize the LSTM-SOM model. The units in the competition layer of the SOM were sorted according to their distance to the winning unit $w_{\min}$ (the unit closest to the input vector). The threshold $S$ was set as the distance from $w_{\min}$ to the unit ranked 40th (5% of entire units). The updated input data were used as labels to train the LSTM model for 2 iterations. The LSTM learning rate was set as 0.001. The SOM learning rate was set as 0.005.
Through the adjustment of parameters, 2 classes could be obtained after one round of training. After 3 rounds of training, a total of 8 clustered classes were obtained. It was observed that there were 2 classes showing significantly larger sample sizes than the other classes.
Therefore, an additional round of clustering was carried out in the 2 classes. Finally, we obtained 10 classes of mutated sequences.

Analysis of clinical features
In the analysis of clinical features, measurement data were expressed as the mean ± standard deviation. In the analysis of differences between groups, an independent-samples t test (number of groups = 2) or analysis of variance (ANOVA) (number of groups > 2) was used. The chi-square test was used for difference testing of enumeration data. P<0.05 was considered to indicate a statistically significant difference. The log-rank test was used to analyze the difference in survival between different groups.
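The test selection described above can be sketched with SciPy as follows. This is an illustrative outline only: the helper names `compare_groups` and `compare_counts` are hypothetical, the toy numbers are invented for demonstration, and the log-rank test is omitted because it requires survival-time data and a dedicated survival-analysis routine.

```python
from scipy import stats


def compare_groups(groups, alpha=0.05):
    """Continuous data: t test for 2 groups, one-way ANOVA for more than 2."""
    if len(groups) == 2:
        name = "t test"
        _, p = stats.ttest_ind(groups[0], groups[1])
    else:
        name = "ANOVA"
        _, p = stats.f_oneway(*groups)
    return name, float(p), bool(p < alpha)


def compare_counts(table, alpha=0.05):
    """Enumeration (count) data: chi-square test on a contingency table."""
    result = stats.chi2_contingency(table)
    p = float(result[1])  # second element is the P value
    return p, bool(p < alpha)


# Toy data, not from the study.
print(compare_groups([[1, 2, 3, 4, 5], [11, 12, 13, 14, 15]]))
print(compare_groups([[1, 2, 3], [2, 3, 4], [10, 11, 12]]))
print(compare_counts([[30, 10], [10, 30]]))
```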
In some cases, many groups of patients were involved in the survival analysis, so a heat map was used to show differences in survival between groups, with the difference in survival reflected in the color. In the survival analysis of different MBs in a single frequently mutated gene, some patients exhibited multiple mutations in the same gene and could therefore be assigned to multiple groups. Such patients were excluded from the survival analysis between groups but were included in the survival curves.

Clustering of patients according to the MB composition
Patients were clustered according to their MB composition. Each kind of MB was expressed as its percentage of all MBs in one patient. Clustering was performed with the K-means method in the scikit-learn package. An "elbow method" was used to evaluate the K value (number of clustered groups). 29,30 The K value estimated for individual cancers and for the entire sample was generally between 5 and 8. After comparing the clustering results, K=7 was selected as the class number for K-means clustering.
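A minimal sketch of this patient-level clustering with scikit-learn is shown below. The random count matrix stands in for real per-patient MB counts, and the helper names `mb_composition` and `elbow_inertias` are hypothetical; only the use of K-means, the elbow method, and K=7 come from the text.

```python
import numpy as np
from sklearn.cluster import KMeans


def mb_composition(counts):
    """Convert per-patient MB counts (patients x 10) to fractions per patient."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum(axis=1, keepdims=True)


def elbow_inertias(X, k_values, seed=0):
    """Within-cluster sum of squares for each K, to be inspected for an elbow."""
    return {
        k: KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_
        for k in k_values
    }


rng = np.random.default_rng(0)
counts = rng.integers(1, 50, size=(100, 10))   # toy data: 100 patients, 10 MBs
X = mb_composition(counts)

inertias = elbow_inertias(X, range(2, 11))     # look for the bend in this curve
for k, value in inertias.items():
    print(k, round(value, 4))

# K=7 as selected in the text; labels are the patient classes.
labels = KMeans(n_clusters=7, n_init=10, random_state=0).fit_predict(X)
print(np.bincount(labels))
```

Multiplying the fractions by 100 gives percentages; the scaling does not change which K the elbow method suggests.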

Code available
All mathematical methods were performed with Python.

SBS clustering via the LSTM-SOM model
A total of 2,141,527 somatic SBS data points from 9596 patients were collected from the TCGA database. For each SBS sample, 100 flanking bases (50 bases at the 5' end and 50 at the 3' end) were included in the LSTM training data.
In brief, our LSTM-SOM model functions by extracting the features of mutant sequences via the LSTM network and then taking the generated feature vector as the input data for the SOM. In particular, not only will the units in the competitive layer of SOM be refreshed, but the input data generated by LSTM will also be adjusted in the opposite direction. Then, the refreshed input data are used as the labels to train the LSTM model ( Figure 1A). The above steps were repeated until the LSTM outputs formed clear classifications.
Mutated sequences were clustered into 2 types after one round of training. We obtained 8 classes of mutated sequences (for easy understanding, mutated sequences with different features clustered by LSTM-SOM are referred to as mutation blots, MBs) after 3 rounds of training.
Then, an additional round of training was performed for 2 classes of MB with a significantly larger number of samples and ultimately revealed 10 classes of MBs, recorded as MB 1-MB 10 ( Figure 1B).  Table S1).

Characteristics of different MBs
With an increase in the distance from the mutation site, the proportions of the four bases tended to become balanced.
The clustering results were strongly influenced by the flanking bases of the mutation site.
Differences in flanking bases could be observed in other classes of MBs with similar mutation features, such as MB 2 and MB 6, MB 5 and MB 7, MB 4 and MB 8 ( Figure 2). In the analysis of cancers with a high incidence, the composition of the bases in the mutation site and the flanking sites of each MB basically followed that in the entire sample ( Figure S1).

MBs in different cancers
Significant differences in the composition of MBs existed among cancers with different pathologies ( Figure 3A).

Relationship between MBs and clinical features of cancer patients
In most kinds of cancers, the MB composition at different ages basically followed the pattern shown in the total samples. This suggests that the similarity of the cancer biology of young and old patients requires further study. Regarding cancer staging, T and M stages showed obvious tendencies basically consistent with those for the total sample. Stomach cancer and colon cancer, in particular, showed opposite MB tendencies in T and N stages compared with the entire sample and with other cancers with high incidence. This suggests that local and lymph node progression in gastrointestinal cancers may exhibit distinct mechanisms (Figures S6-S9).  Figure 6A).

MBs and survival of cancer patients
Significant differences in survival were observed among the different classes of patients (Figure 6B). In the pairwise comparison of survival, patients in Classes 2, 4, and 5 showed better survival, and patients in Classes 1, 3, 6, and 7 showed worse survival (Figure 6C). In the analysis of specific cancers, survival in different classes of patients generally followed the results obtained for the total sample (Figure S10). Class 3 patients, in particular, seemed to show poor survival for most of the analyzed cancers. These results suggested that a balanced MB composition may predict poor survival in patients.
Patients of different classes showed distinct clinical features ( Figure 6D and E).
According to AJCC staging, a significantly lower proportion of stage IV patients and a higher proportion of stage I patients were observed in Classes 4 and 5, which may be related to the better survival of these 2 classes of patients. Class 6 patients showed the highest percentage of AJCC stage IV and the lowest percentage of AJCC stage I, which may be the reason for the poor survival of these patients. Patients of Class 3 were found to present significantly greater ages and higher weights. These factors may be partly responsible for the poor survival of Class 3 patients.
Class 1 patients exhibited a high percentage of AJCC stage I and a low percentage of stage IV.
Therefore, further study is still needed to determine the mechanism causing Class 1 patients to show poor survival.

Discussion
Previous studies have identified a variety of mutation signatures that may be associated with different triggers involved in various mutation processes and result in differing biological behaviors of cancers. 2,8,[14][15][16][17][18][19][20][32][33][34][35] It is suggested that both the mutated and flanking bases influence somatic mutation characteristics. However, the relationship between sequence features and their influence on cancer characteristics is still not adequately explained. were not included in this study. In summary, this study provided a method for classifying the characteristics of mutant sequences and found the influences of mutation sequences on cancer characteristics. Further study of the mechanism of MBs related to cancer characteristics is suggested.

Figure S5). For each gene, the left subgraph shows the proportion of MB in all mutation data points from different cancers; the middle subgraph shows the P value of the log-rank test between groups in the whole population; and the right subgraph shows the P value of the log-rank test between cancers with high incidence. Only P values less than 0.05 are shown in the heat map.

Figure S6
Statistics of the proportion of each MB by age in cancers with high incidence. *: P < 0.05 in the t test or ANOVA between groups; **: P < 0.005 in the t test or ANOVA between groups.

Figure S7
Statistics of the proportion of each MB by sex in cancers with high incidence. *: P < 0.05 in the t test or ANOVA between groups; **: P < 0.005 in the t test or ANOVA between groups.

Figure S8
Statistics of the proportion of each MB by T stage and N stage in cancers with high incidence. *: P < 0.05 in the t test or ANOVA between groups; **: P < 0.005 in the t test or ANOVA between groups.

Figure S9
Statistics of the proportion of each MB by M stage and AJCC stage in cancers with high incidence. *: P < 0.05 in the t test or ANOVA between groups; **: P < 0.005 in the t test or ANOVA between groups.

Figure S10
Log-rank test between classes of different cancers. Differences in the P value are reflected in color.