## TCRosetta overview

TCRosetta is a user-friendly and powerful server for analyzing and annotating the TCR repertoire. Because the TRB has higher diversity than TCR alpha chain, most TCR-Seq data only focus on the TRB [20]. Thus, all analyses for TCR repertoire in TCRosetta refer to the TRB. TCRosetta supports several input formats, including CDR3 sequence list, AIRR-compatible formatted file, .csv/.tsv formatted file, as well as output file from TCR extraction software like MiXCR [21], CATT [22], IMSEQ [23] and RTCR[24]. The input data will firstly get through pre-processing to ensure data quality (Figure 1A). TCRosetta provides two modes of analyses for TCR repertoire. The general analyses include repertoire diversity, CDR3 length distribution, V/J gene usage, V-J gene utilization, and clonality (Figure 1B). The advanced analyses contain network analysis, public analysis, embedding analysis, and enrichment analysis (Figure 1C).

## Data pre-processing

All input TCR sequences are further processed to ensure the reliability and quality of the sequences. The quality control will be performed based on the following rules: (i) Identical CDR3 sequences with different V/J genes will be merged and only keep the V/J gene with the highest frequency; (ii) Different alleles of the same V/J family will be merged and only keep the family information; (iii) Only sequences containing the V gene, J gene, and complete in-frame CDR3 sequences will be retained; (iv) The complete CDR3 sequence should begin with the cysteine (C), end with the phenylalanine (F) or tryptophan (W) and contain no stop codon according to the IMGT (ImMunoGeneTics) rules [25]; (v) Then the low-quality CDR3 sequences with length less than 8 and greater than 24 will be removed.

## Measure TCR repertoire diversity by Renyi entropy

The diversity measures the number of distinct clones and their frequencies in the T-cell repertoire. We adopt the widely used Renyi entropy to quantify the TCR repertoire diversity [26]. Renyi entropy can graphically represent the distribution of abundant clones and rare clones within a given repertoire, in addition to assigning a numerical value to repertoire diversity. Here is the frequency of the sequence in TCR repertoire; is the number of unique sequences in TCR repertoire; is the base of the logarithm, which determines the choice of units of entropy measure. The order sets the degree of sensitivity of the diversity index to sequences abundance in TCR repertoire, as shown in formula (1). When , all sequences are weighted equally. The entropy measure becomes a function of the number of unique sequences, but not dependent on their abundance. When , the Renyi entropy is equivalent to the Shannon entropy.

## The TCR repertoire clonality analysis

Aside from the diversity of TCR repertoire, the description of equivalency in species abundance can also be used to measure the dominance of clones in a repertoire; thus, referred as clonal evenness [27]. The clonal evenness of a repertoire can be calculated using Pielou’s index, and the complement of clonal evenness is often used as clonality. Thus, we use 1-Pielou’s index to calculate the TCR repertoire clonality [28]. A clonal score of 0 represents a maximally diverse population with even frequencies, and a value close to 1 means a repertoire driven by clonal dominance, as shown in formula (2). Here, is the frequency of the sequence; is the number of unique sequences, and is the base of the logarithm.

## Cluster TCR sequence and construct TCR network

Similarity in TCR CDR3 sequences implies structural resemblance for antigen recognition, which may share antigen specificity [29]. Therefore, clustering similar CDR3 sequences is an important way to identify antigen-specific receptors. Although there are a few methods for TCR clustering, GIANA is the best tool for clustering large scale TCR repertoire (> 106 sequences) with higher clustering accuracy, specificity and efficiency[17]. GIANA converts the sequence alignment and clustering problem into a classic nearest neighbor search in high-dimensional Euclidean space. Thus, we integrate the GIANA into network analysis in TCRosetta to cluster TCR sequences and then construct TCR network. The workflow of network analysis contains the following steps: (i) Use GIANA to cluster similar CDR3 sequences; (ii) Use the Muscle [30] to align sequences obtained in the previous step and then convert it to a distance matrix by calculating Hamming distance between these sequences; (iii) Transform the distance matrix into a adjacency matrix, where two sequences are connected if their Hamming distance is less than 3; (iv) Construct network by igraph (https://igraph.org/) based on the distance matrix from the previous step; (v) Calculate the betweenness of each node in the network by betweenness centrality, which is a measure of the centrality in a graph based on the shortest paths and reflects the control of the node in graph theory; (vi) Mix the betweenness and degree as the weight of a node in the network; (vii) Use the random walk algorithm [31] to discover network communities in the TCR network.

## Calculate public score of TCR sequence

Each TCR sequence can be generated in a large number of ways, comprising the recombination of V(D)J segments, random insertions and deletions [18], which make the generation process cannot be described exactly. High-throughput sequencing of large TCR repertoires has enabled the development of methods to predict the probability of generation by V(D)J recombination. Among several methods calculating the generation probability of TCR sequence, OLGA [19] is the best one balanced the accuracy and efficiency by dynamic programming methods. Therefore, we apply OLGA to calculate the generation probability in TCRosetta. We define the public score as the generation probability of a TCR sequence generated in healthy individuals to reflect the healthy degree of the sequence. The parameters of OLGA model were trained by a background, which consists of 5,000,000 randomly selected TCR sequences from healthy samples in TCRdb [32]. The Mann–Whitney U test is used to compare the difference in the public score distribution between input data and background to distinguish whether the input data are healthy.

## Embedding analysis based on 3-mer TCR motif

Previous studies demonstrated that only partial CDR3 sequences called “motifs” would contact specific peptides, which forms a part of the TCR specificity [33]. Based on this observation, we employ the k-mer (k = 3) abundance distribution of a TCR repertoire to represent the feature of the TCR repertoire. Each TCR repertoire is represented by a distribution of 3-mer abundance over 203 = 8000 dimensions. Such high dimension has more specificity information of TCR repertoire but also introduces lots of noise in the data. Among those dimensionality-reduction methods, PHATE [34] generates a low-dimensional embedding specific for visualization, which provides an accurate, denoised representation of a TCR repertoire, and is highly scalable both in memory and runtime. Therefore, we apply PHATE to embedding TCR repertoire in TCRosetta.

We use the TCR data of eight diseases (Breast cancer, COVID-19, Crohn’s disease, Melanoma, Yellow fever vaccine, classical Hodgkin lymphoma, Non-small cell lung cancer and Cytomegalovirus) in TCRdb to train a PHATE model to learn the embedding representation of k-mer abundance distribution of TCR repertoire. We perform the following steps for training the model. (i) For each sample, split all sequences into 3-mer using a sliding window with step length 1; (ii) Count all 3-mer motifs to form a TCR motif count matrix and exclude samples with either an extremely large (top 20%) or small number (bottom 20%) of motifs; (iii) Filter out motifs that do not appear in more than 50% of the disease samples because that a disease-specific motif should be present in most of samples of this disease; (iv) We use the Mann–Whitney U test with P-value > 0.05 to calculate the difference between the motifs distribution of healthy samples and disease samples to filter out non-disease-specific motifs; (v) Remove the batch effect by scprep (https://scprep.readthedocs.io/en/stable/index.html) and normalize using Z-score for integrating the motif count matrix from eight diseases; (vi) Use PHATE to train a distribution model. For input data, we split all sequences into 3-mers using a sliding window and count all 3-mers motifs to form a TCR motif count matrix. We then embed the matrix into two-dimensional space by our pre-trained model (described above) to obtain its position in the two-dimensional distribution and to discover its possible disease information.

## Annotation-based disease preference evaluation

The existing amount of TCR sequences with clinical information motivates us to further explore the possibility of annotating clinical information for unknown samples, which is a function missing in all current tools. The annotation function in TCRosetta allows users to search multiple CDR3 sequences in our reference consisting of 244 million high-quality and annotated CDR3 sequences over 55 clinical conditions. We use Elasticsearch (https://www.elastic.co/), a fast and distributed search engine for all types of data, to quickly and exactly search in large scale CDR3 sequence data. Because only a few CDR3 sequences are potential disease-specific sequences and they are usually cloned with relatively high frequencies in a TCR repertoire [35]. TCRosetta only keeps the top 3000 (sorted by frequency) sequences for fuzzy search with 0 or 1 mismatch amino acid for CDR3 sequence annotation. We then perform statistics to access potential disease preferences based on annotation results. Because sequences annotated for different diseases are more likely to be nonspecific sequences, we exclude sequences annotated with more than five different diseases in the above annotation results. Then we calculate the number of unique sequences for each disease in the upload sample. To further investigate the possible disease preference for the upload data, we use the Fisher Exact test to calculate for disease , which represents the significance of difference between input data and reference, as shown in formula (3). Only diseases with will be left. We calculate the for each disease , which represents the number of sequences annotated with this disease. Here is the number of total sequences in annotation result, is number of total sequences in reference. is the number of sequences for disease in reference.