Data acquisition and Data management
NGS Data
The dataset was divided into two groups: 44 training samples and 30 test samples, each split equally between positive and negative cases. The 22 positive (glioma) samples in the training set and the 15 positive samples in the test set were obtained from two published studies [2, 3]. The 37 negative samples from healthy individuals (22 in the training set and 15 in the test set) were obtained from IMMUNOSEQ ANALYZER (https://clients.adaptivebiotech.com/).
High-throughput TCR sequencing analysis of the training samples yielded a total of 1,527,623 TCR β sequences, and for each sequence we recorded its presence or absence in every sample. Because a feature set of this size cannot practically be fed directly into a classification model, we reduced its dimensionality using Fisher's exact test, as described previously [13]. We retained the 602 TCR β sequences with a p-value below 0.1, i.e. those most strongly associated with glioma status, as features for glioma classification.
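The screening step can be illustrated with a minimal sketch. It assumes a two-sided test on a 2x2 table of presence/absence against glioma status; the source does not state which alternative was used. The function names, the sample identifiers and the sequence labels are all hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell with fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    support = range(max(0, col1 - row2), min(row1, col1) + 1)
    # Sum the probabilities of all tables no more likely than the observed one.
    p = sum(q for q in map(prob, support) if q <= p_obs * (1 + 1e-7))
    return min(1.0, p)


def screen_sequences(presence, labels, threshold=0.1):
    """Keep sequences whose presence is associated with the label.

    presence: dict mapping sequence -> set of sample ids that contain it
    labels:   dict mapping sample id -> 1 (glioma) or 0 (healthy)
    """
    positives = {s for s, y in labels.items() if y == 1}
    negatives = set(labels) - positives
    kept = {}
    for seq, samples in presence.items():
        a = len(samples & positives)  # glioma samples containing the sequence
        c = len(samples & negatives)  # healthy samples containing the sequence
        p = fisher_exact_two_sided(a, len(positives) - a, c, len(negatives) - c)
        if p < threshold:
            kept[seq] = p
    return kept


# Toy example (made-up data): seqA occurs only in glioma samples.
labels = {"p1": 1, "p2": 1, "p3": 1, "p4": 1, "n1": 0, "n2": 0, "n3": 0, "n4": 0}
presence = {"seqA": {"p1", "p2", "p3", "p4"}, "seqB": {"p1", "p2", "n1", "n2"}}
kept = screen_sequences(presence, labels, threshold=0.1)
```

Lowering `threshold` (for example to 0.01 or 0.001) keeps fewer sequences that are more strongly associated with the label.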
Machine learning algorithms with different datasets
Classification algorithms on TCR Diversity Indices
For these 74 samples, we analyzed TCR diversity, clonality and singleton frequency. Following common practice in immune repertoire sequencing analysis, we computed six indices for each positive and negative sample: the Clonality index, Shannon diversity index, inverse Simpson diversity index, VJ diversity index, Singleton ratio, and Simpson diversity index.
Of these indices, the Clonality index measures the extent of TCR clone amplification and reflects the number and proportion of major clones in tumors; higher values indicate that some TCR clones are amplified to a higher frequency. Clonality takes values between 0 and 1: a value of 0 indicates that every sequence is unique and no clone is amplified, whereas a value of 1 indicates that all sequences are copies of the same clone. The index is calculated as follows, where productive uniques is the total number of TCR clones in the sample and diversity is the Shannon diversity index [4].
$$\text{Clonality} = 1 - \frac{\text{diversity}}{\ln\left(\text{productive uniques}\right)}$$
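As a minimal sketch of this formula, assuming the input is a list of per-clone read (or template) counts for one sample, clonality can be computed as below. The function names are illustrative, and returning 1.0 for a sample with a single clone is a convention chosen here, not something the source specifies.

```python
from math import log


def shannon_entropy(counts):
    """Shannon diversity H = -sum(p_i * ln(p_i)) over clone frequencies."""
    total = sum(counts)
    return -sum((c / total) * log(c / total) for c in counts if c > 0)


def clonality(counts):
    """Clonality = 1 - H / ln(N), where N is the number of distinct clones."""
    n_unique = sum(1 for c in counts if c > 0)
    if n_unique < 2:
        # Assumption: a repertoire with a single clone is fully clonal.
        return 1.0
    return 1.0 - shannon_entropy(counts) / log(n_unique)


even = clonality([5, 5, 5, 5])       # all clones equally frequent
skewed = clonality([97, 1, 1, 1])    # one dominant clone
```

A perfectly even repertoire gives a value at (or numerically very near) 0, and the value approaches 1 as a single clone comes to dominate.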
The Hvj diversity index measures the diversity of V-J gene combinations among the TCR clones in a sample. Here p_i is the frequency with which the ith V-J gene segment combination is used in the sample, so the index reflects the degree of T cell expansion. It can also reflect the proliferative capacity of T cells under the stress generated by specific antigen recognition, with higher values indicating a greater degree of clonal amplification. It is calculated as follows [5]:
$$H_{VJ} = -\sum_{i} p_i \ln\left(p_i\right)$$
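A short sketch of this calculation is given below. It assumes each clone record carries a V gene, a J gene and a count, and that V-J usage is weighted by count; whether the original analysis weighted by count or by number of distinct clones is not stated. The gene labels in the toy example are for illustration only.

```python
from collections import Counter
from math import log


def vj_diversity(clones):
    """Shannon entropy over V-J pair usage.

    clones: iterable of (v_gene, j_gene, count) tuples for one sample.
    """
    usage = Counter()
    for v_gene, j_gene, count in clones:
        usage[(v_gene, j_gene)] += count  # aggregate clones sharing a V-J pair
    total = sum(usage.values())
    return -sum((n / total) * log(n / total) for n in usage.values() if n > 0)


# Toy example: two V-J pairs, each accounting for half of the counts.
toy_clones = [
    ("TRBV5-1", "TRBJ2-7", 5),
    ("TRBV5-1", "TRBJ2-7", 5),
    ("TRBV7-9", "TRBJ1-1", 10),
]
h_vj = vj_diversity(toy_clones)
```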
The Shannon diversity index measures the diversity of TCR clones in the sample. It combines species richness (the number of different clone types present) with the evenness of the distribution of individuals among these clone types, so a higher Shannon index signifies a greater diversity of TCR clones in the sample. In the formula, p_i is the frequency of the ith clone type [6].
$$H' = -\sum_{i} p_i \ln\left(p_i\right)$$
The Singleton Frequency represents the proportion of naïve T cells in a sample and is used to assess the robustness of the immune system's backbone [14]. Several TCR studies have shown that the number of singletons is positively correlated with the body's ability to resist unfamiliar diseases, so this indicator can also reflect the degree of the body's immune response to the disease. It is calculated as follows, where n_1 is the number of clone types that appear only once in the sample and n is the number of all clone types in the sample.
$$\text{Singleton} = \frac{n_1}{n-1}$$
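The singleton formula translates directly into code. The sketch below assumes per-clone counts as input and follows the n - 1 denominator exactly as written in the formula; the function name is illustrative.

```python
def singleton_frequency(counts):
    """n_1 / (n - 1): clones seen exactly once over (number of clones - 1)."""
    observed = [c for c in counts if c > 0]
    n = len(observed)
    if n < 2:
        raise ValueError("at least two clone types are needed")
    n1 = sum(1 for c in observed if c == 1)
    return n1 / (n - 1)


# Toy example: five clone types, three of which appear once.
ratio = singleton_frequency([1, 1, 3, 7, 1])
```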
Binary classification algorithms
Our previous work on two-dimensional CMV classification showed that extracting two-dimensional features from high-throughput sequencing results can improve diagnostic performance [13]. For glioma, the two-dimensional features are defined for a given Fisher's exact test p-value threshold and consist of the total clonotypes and the associated clonotypes. The total clonotypes is the number of distinct TCRβ sequences observed in a sample, while the associated clonotypes is the number of those sequences that belong to the set of glioma-associated TCR sequences selected at that threshold.
Using these two-dimensional features, we built a classification system comprising 16 algorithms, including neural networks, ensemble learning algorithms and traditional machine learning algorithms.
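A minimal sketch of how the two features could be derived for one sample is shown below. It assumes the glioma-associated set has already been selected at some p-value threshold; the function name and the toy sequences are made up for illustration.

```python
def two_dimensional_features(sample_sequences, associated_sequences):
    """Return (total clonotypes, associated clonotypes) for one sample."""
    unique = set(sample_sequences)                 # distinct TCR beta sequences
    associated = unique & set(associated_sequences)
    return len(unique), len(associated)


# Toy example with made-up CDR3 strings.
sample = ["CASSLGTDTQYF", "CASSPGQGNEQFF", "CASSLGTDTQYF", "CASRTGEAFF"]
glioma_associated = {"CASSLGTDTQYF", "CASSQDRGYEQYF"}
features = two_dimensional_features(sample, glioma_associated)
```

Each sample is thereby reduced to a point in a two-dimensional space, which any binary classifier can then separate.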
Multidimensional classification algorithms
The multidimensional dataset of TCR β sequences uses binary (0/1) data to represent the presence of each sequence in each sample: a value of 0 means the sequence is absent from the sample, and a value of 1 means it is present.
Multidimensional classification algorithms were applied to the tabular glioma data, which recorded the presence or absence of the relevant TCRβ sequences. The algorithms used were random forest, support vector machines with linear and polynomial kernels, logistic regression, linear discriminant analysis, a Bayesian algorithm, gradient boosting decision trees (GBDT), XGBoost (XGB), and a decision tree.
To identify glioma-specific marker sequences while using fewer markers and maintaining high accuracy, we applied dimensionality reduction based on Fisher's exact test, a method commonly used in bioinformatics. Applying different p-value cutoffs yielded groups of sequences with varying degrees of association with glioma. These groups were then fed into the classification algorithms to seek higher classification accuracy with a reduced number of sequences.
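The 0/1 encoding can be sketched as follows, assuming each sample is given as a collection of sequences and the feature list is the ordered set of screened sequences. All names and sequences are hypothetical.

```python
def presence_matrix(samples, features):
    """Build one 0/1 row per sample, with one column per feature sequence.

    samples:  dict mapping sample id -> iterable of sequences in that sample
    features: ordered list of sequences used as columns
    """
    rows = {}
    for sample_id, sequences in samples.items():
        seen = set(sequences)
        rows[sample_id] = [1 if f in seen else 0 for f in features]
    return rows


feature_list = ["CASSLGTDTQYF", "CASSQDRGYEQYF", "CASRTGEAFF"]
toy_samples = {
    "sample_1": ["CASSLGTDTQYF", "CASRTGEAFF", "CASSPGQGNEQFF"],
    "sample_2": ["CASSQDRGYEQYF"],
}
matrix = presence_matrix(toy_samples, feature_list)
```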
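One way such a comparison could be set up is sketched below with scikit-learn. This is not the authors' pipeline: the hyperparameters, the 5-fold stratified cross-validation and the use of `BernoulliNB` for the "Bayesian algorithm" are all assumptions, and XGBoost is left out because it requires a separate package. The data are randomly generated.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def compare_classifiers(X, y, n_splits=5, seed=0):
    """Mean cross-validated accuracy for several classifiers on 0/1 features."""
    models = {
        "random_forest": RandomForestClassifier(n_estimators=100, random_state=seed),
        "svm_linear": SVC(kernel="linear"),
        "svm_poly": SVC(kernel="poly", degree=3),
        "logistic_regression": LogisticRegression(max_iter=1000),
        "lda": LinearDiscriminantAnalysis(),
        "naive_bayes": BernoulliNB(),  # assumed stand-in for "Bayesian algorithm"
        "gbdt": GradientBoostingClassifier(random_state=seed),
        "decision_tree": DecisionTreeClassifier(random_state=seed),
    }
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return {
        name: cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
        for name, model in models.items()
    }


# Synthetic data: 44 samples, 30 binary features, the first 5 enriched in positives.
rng = np.random.default_rng(0)
y = np.array([1] * 22 + [0] * 22)
X = (rng.random((44, 30)) < 0.2).astype(int)
X[:22, :5] |= (rng.random((22, 5)) < 0.7).astype(int)
scores = compare_classifiers(X, y)
```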
Dimension reduction and core feature extraction
Due to computational limitations, we did not perform differential gene analysis on the initial set of more than 1.5 million sequences. Instead, we focused on the 602 sequences corresponding to a Fisher's exact test threshold of 0.1. Starting from this subset, we looked for feature extraction methods that outperform Fisher's exact test and identify the TCRβ sequences most informative for glioma diagnosis. Apart from Fisher's exact test, this study primarily used two algorithms for feature extraction: Lasso and RFECV.
Lasso (Least Absolute Shrinkage and Selection Operator) is a regularization method commonly used for linear regression that also performs feature selection, i.e. it identifies which of the original features to retain. It does so by adding an L1 regularization term to the loss function, which shrinks the coefficients of unimportant features to exactly zero. Unlike univariate methods, Lasso considers the joint effect of all features rather than evaluating the importance of each feature individually. In this study, we used Lasso to select features from the multidimensional data obtained through Fisher's exact test at thresholds of 0.1, 0.01, and 0.001.
RFECV (Recursive Feature Elimination with Cross-Validation) is another feature selection method. Starting with all features, it trains a model at each iteration, ranks the features by model-based importance and eliminates the least important ones; cross-validation is then used to evaluate each resulting feature subset, and the best-performing subset is selected. One advantage of RFECV is that it chooses the number of features automatically through cross-validation, avoiding the subjectivity and limitations of manual feature selection. The selected subset can also improve the generalization ability and prediction performance of the model. In this study, RFECV was combined with nine classification algorithms for feature extraction, and the effectiveness of the extracted features was evaluated with 12 classification algorithms separately. This approach improved feature extraction, yielding higher classification accuracy with fewer sequences.
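The selection mechanism can be sketched with scikit-learn's `Lasso`. Fitting a linear Lasso directly on the 0/1 labels and the value `alpha=0.05` are assumptions made for illustration; the source does not give the regularization strength or say whether a logistic variant was used. The data and feature names are synthetic.

```python
import numpy as np
from sklearn.linear_model import Lasso


def lasso_select(X, y, feature_names, alpha=0.05):
    """Return the features whose Lasso coefficient is non-zero."""
    model = Lasso(alpha=alpha, max_iter=10000)
    model.fit(X, y)
    keep = np.flatnonzero(model.coef_ != 0)  # L1 penalty zeroes out the rest
    return [feature_names[i] for i in keep], model.coef_[keep]


# Synthetic data: 44 samples, 20 binary features; feature 0 tracks the label.
rng = np.random.default_rng(0)
y = np.array([1] * 22 + [0] * 22)
X = (rng.random((44, 20)) < 0.3).astype(int)
X[:, 0] = y
names = [f"seq_{i}" for i in range(X.shape[1])]
selected, coefficients = lasso_select(X, y, names)
```

Larger values of `alpha` drive more coefficients to zero and so keep fewer sequences.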
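A minimal sketch of the procedure using scikit-learn's `RFECV` follows. Logistic regression as the base estimator, one feature removed per step and 5-fold stratified cross-validation are assumptions; the source combined RFECV with nine classifiers whose settings are not given here. The data and feature names are synthetic.

```python
import numpy as np
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold


def rfecv_select(X, y, feature_names, seed=0):
    """Recursively drop the weakest feature and keep the best CV-scored subset."""
    selector = RFECV(
        estimator=LogisticRegression(max_iter=1000),  # supplies coef_ for ranking
        step=1,                                       # remove one feature per round
        cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=seed),
        scoring="accuracy",
        min_features_to_select=1,
    )
    selector.fit(X, y)
    return [name for name, keep in zip(feature_names, selector.support_) if keep]


# Synthetic data: 44 samples, 20 binary features; feature 0 tracks the label.
rng = np.random.default_rng(0)
y = np.array([1] * 22 + [0] * 22)
X = (rng.random((44, 20)) < 0.3).astype(int)
X[:, 0] = y
names = [f"seq_{i}" for i in range(X.shape[1])]
selected = rfecv_select(X, y, names)
```

Swapping in a different `estimator` (any model exposing `coef_` or `feature_importances_`) is how the procedure would be repeated across several classifiers.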