A large-scale database of T-cell receptor beta (TCRβ) sequences and binding associations from natural and synthetic exposure to SARS-CoV-2

We describe the establishment and current content of the ImmuneCODE™ database, which includes hundreds of millions of T-cell Receptor (TCR) sequences from over 1,400 subjects exposed to or infected with the SARS-CoV-2 virus, as well as over 135,000 high-confidence SARS-CoV-2-specific TCRs. This database is made freely available, and the data contained in it can be downloaded and analyzed online or offline to assist with the global efforts to understand the immune response to the SARS-CoV-2 virus and develop new interventions.


Introduction
The emergence of SARS-CoV-2 in December 2019 (ref. 1) and the ensuing global health emergency declared by the WHO at the end of January 2020 (ref. 2) created an urgent need to understand the disease and its causative agent. Initial studies have shown a strong T-cell-based adaptive immune response3,4,5, but its detailed nature remains uncharacterized. We therefore applied our previously described immunoSEQ® Assay6,7,8 and MIRA™ tool9,10 to deepen the understanding of the adaptive immune response to SARS-CoV-2 infection in support of COVID-19 research.
We partnered with Microsoft, Illumina, Labcorp/Covance, and health organizations across the world to generate the ImmuneCODE database described herein. These data are being made freely available to the scientific community so that any researcher, public health official or organization can use them to accelerate ongoing global efforts to develop better diagnostics, vaccines and therapeutics, as well as to answer important questions about the virus.
The database consists of two distinct but related datasets. (A) The immunoSEQ dataset includes 1,414 deeply-sampled TCRβ repertoires from subjects who at the time of sampling either had been exposed to, were actively suffering from, or had recovered from COVID-19. These data originate from two sources (Table 1): (1) ImmuneRACE (Immune Response Action to COVID-19 Events), an ongoing prospective study enrolling participants across the U.S. to decode how immune systems detect and respond to the virus, which includes self-reported demographic and clinical data, and (2) thousands of de-identified, geographically and ethnically diverse patient blood samples collected by institutions around the world. (B) The MIRA dataset maps TCRs binding to SARS-CoV-2 virus epitopes, and includes data obtained from exposed subjects and naïve controls. In total, the MIRA dataset includes more than 135,000 high-confidence SARS-CoV-2-specific TCRs.
The data include varying degrees of demographic and clinical information (as allowed by each institution and corresponding IRB). Additional metadata may be added in the future.
The ImmuneCODE database will continue to grow both as we continue to recruit participants to ImmuneRACE and as we add samples collected by additional institutions. This will result in additional T-cell repertoires of exposed and infected individuals and SARS-CoV-2-specific TCRs, allowing the association of T-cell signatures with disease and outcomes. We hope that this freely available resource will inform our understanding of the immune response to the virus and that it will be useful for researchers around the world by accelerating their work in basic and applied immunology, thus contributing to the development of new therapeutic and preventive measures.

Dataset Access
The ImmuneCODE database includes both immunoSEQ and MIRA data (Figure 1a) and is being shared through the immuneACCESS® data portal (Figure 1b), which enables the export of complete or selected data, as well as real-time analysis using a rich suite of custom-built tools. Data are available at https://clients.adaptivebiotech.com/pub/covid-2020 (DOI 10.21417/ADPT2020COVID). Note that the dataset will continue to grow over time; subjects described in this article can be identified by selecting samples with the "ImmunoCODERelease" tag value "002".

immunoSEQ data
The ongoing ImmuneRACE study aims to enroll 1,000 subjects who have been exposed to, are currently infected with, or have recovered from COVID-19. The current release of the database includes T-cell repertoire data from the first 160 participants in the study (including multiple samples from some subjects); new data will be added as they are generated. This release also includes T-cell repertoire data from 1,254 subjects from 6 global collaborators (Table 1); new T-cell repertoires may be generated both by adding new samples from these ongoing studies and by bringing additional institutions into this effort.
These data were generated from participant samples using the TCRβ immunoSEQ Assay as previously described6,7,8. They include a list of unique TCRβ rearrangements found in each analyzed sample, a count for each rearrangement, and sample-level metadata. Certain pre-configured analyses that we believe will be most used will also be available through immuneACCESS, so that users do not need to recreate them. The data can be exported for offline analysis using dedicated links on the immuneACCESS project page.
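As a rough illustration of working with the exported per-rearrangement counts, the sketch below computes a productive frequency per rearrangement, following the definition given in the sequence-level field table (templates for a rearrangement divided by the sum of productive templates in the sample). The key names `sample_name`, `rearrangement`, `templates` and `frame_type`, and the value `"In"` for in-frame rearrangements, are assumptions made for this example and should be checked against the actual export.

```python
from collections import defaultdict


def productive_frequencies(rows):
    """Return {(sample, rearrangement): frequency} for productive rows.

    `rows` is a list of dicts; the key names used here are assumed,
    not taken from the export specification.
    """
    # First pass: sum productive templates per sample.
    totals = defaultdict(int)
    for row in rows:
        if row["frame_type"] == "In":
            totals[row["sample_name"]] += int(row["templates"])

    # Second pass: divide each productive count by its sample total.
    freqs = {}
    for row in rows:
        if row["frame_type"] != "In":
            continue
        total = totals[row["sample_name"]]
        if total:
            key = (row["sample_name"], row["rearrangement"])
            freqs[key] = int(row["templates"]) / total
    return freqs


# Toy input: two productive rearrangements and one out-of-frame one.
example_rows = [
    {"sample_name": "S1", "rearrangement": "AAA", "templates": "6", "frame_type": "In"},
    {"sample_name": "S1", "rearrangement": "CCC", "templates": "2", "frame_type": "In"},
    {"sample_name": "S1", "rearrangement": "GGG", "templates": "5", "frame_type": "Out"},
]
example_freqs = productive_frequencies(example_rows)
```

Non-productive rearrangements are excluded from both the numerator and the denominator, so the frequencies within a sample sum to one.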
By default the immunoSEQ Analyzer includes many metadata fields that are useful across different research contexts; Tables 2 and 3 describe the key fields most relevant to this dataset and should be useful to users interested in understanding the definitions of the different fields. Specifically, Table 2 describes the sample-level fields included, whereas Table 3 describes the sequence-level fields. The amount of metadata available varies by source and participant; we include all available, uncurated metadata for each sample in the "sample_tags" field. In almost all cases, these include de-identified subject IDs, COVID-19 status, age in years, and sex.
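For offline work, the release tag can be used to restrict an exported sample list to the subjects described in this article. The sketch below is illustrative only: it assumes that "sample_tags" is exported as a comma-separated list of "Key:Value" pairs and that the "ImmunoCODERelease" tag appears in that field. Neither assumption is confirmed by the text above, so the parsing should be adapted to the actual export format.

```python
def parse_sample_tags(raw):
    """Split an assumed 'Key:Value,Key:Value' string into a dict."""
    tags = {}
    for item in raw.split(","):
        key, sep, value = item.partition(":")
        if sep:  # skip fragments that have no key/value separator
            tags[key.strip()] = value.strip()
    return tags


def select_release(samples, release="002"):
    """Keep samples whose ImmunoCODERelease tag equals `release`."""
    selected = []
    for sample in samples:
        tags = parse_sample_tags(sample.get("sample_tags", ""))
        if tags.get("ImmunoCODERelease") == release:
            selected.append(sample)
    return selected


# Hypothetical sample records for illustration.
example_samples = [
    {"sample_name": "A", "sample_tags": "ImmunoCODERelease:002, Age:45 Years"},
    {"sample_name": "B", "sample_tags": "ImmunoCODERelease:003"},
    {"sample_name": "C", "sample_tags": ""},
]
release_002 = select_release(example_samples)
```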

MIRA data
Antigen-specific TCRs were identified using the Multiplex Identification of Antigen-Specific T-Cell Receptors Assay (MIRA)9,10. MIRA is a high-throughput multiplex tool that enables the identification of antigen-specific TCRs to large numbers of query antigens (hundreds to thousands at a time and in parallel) by combining immune assays with T-cell receptor sequencing. We use cell sorting based on the upregulation of activation markers to separate a population of antigen-specific T cells. This positive population is sequenced via immunoSEQ, and clonotypes specific to an antigen are identified by virtue of their enrichment in the positive population compared to a sample of unenriched or unsorted T cells.
With the goal of identifying SARS-CoV-2-speci c TCRs, we interrogated T-cell repertoires from both healthy donors and COVID-19 patients. Input cell types used varied and included PBMCs from healthy donors or COVID-19 patients, and naïve T cells from healthy donors. To maximize TCR yield per experiment, we expanded T cells from both types of input cells. When starting with PBMCs from either healthy donors or COVID-19 patients, T cells were expanded polyclonally with soluble anti-CD3. When starting with naïve CD8+ T cells from healthy donors, T cells were expanded following co-culture with monocyte-derived DCs loaded with a pool of all peptides derived from SARS-CoV-2.
We used two different MIRA tool approaches: peptide-based and transgene-based. Both enable the identification of antigen-specific TCRs; however, the transgene-based approach enables identification of TCRs that are specific to epitopes encoded by transgenes and presented by APCs following transfection and expression. This approach enables us to distinguish the subset of TCRs that respond to endogenously presented epitopes from those that only respond to exogenously loaded peptides. Binding or activation following a multimer stain or incubation with peptides is therefore not an indicator of whether a T cell is specific to an endogenously presented epitope. The underlying assumption for any immunological assay involving multimers or exogenously loaded peptides is that the epitope being tested is actually a presented epitope. For well-characterized epitopes this assumption is reasonable; however, when querying large numbers of novel epitopes from a novel virus (SARS-CoV-2, for example), the risk of false positives (defined as TCRs specific to a never-before-tested peptide that was exogenously loaded) is higher.
In total, the MIRA dataset includes more than 135,000 high-confidence SARS-CoV-2-specific TCRs. These data are made available as a set of downloadable files, "ImmuneCODE MIRA Release 002.zip", which can also be accessed through the immuneACCESS project page.
The dataset includes experiments from three MIRA panels. Two of these panels, named "minigene_Set1" and "minigene_Set2", targeted large protein sequences and were intended to narrow down which parts of the genome generally elicit an immune response. The third panel, named "C19_cI", targeted individual peptides or small groups of peptides. Most of the MIRA data included in this dataset correspond to the C19_cI panel.
Tables 4 through 9 describe the MIRA data included in the database, as follows. Table 4 (subject-metadata.csv) includes available metadata for each sample from subjects included in the MIRA experiments (both in the two minigene panels and in the peptide panel described above). HLA types are provided when available. Missing values are generally represented with "N/A", except for HLA types, where missing data are represented as an empty string. Note that the metadata contained in this file relate to the MIRA results and are distinct from the immunoSEQ-related metadata (i.e. "tags" in the tables above). Table 5 (orfs.csv) includes the genomic location of the MIRA targets as per GenBank11. Table 6 (minigene-hits.csv) contains counts of the number of unique TCRs that bound to targets within the "minigene_Set1" and "minigene_Set2" MIRA panels, while Table 7 (minigene-detail.csv) describes the identity of the TCRs bound per target for both minigene MIRA panels. Finally, Table 8 (peptide-hits.csv) contains counts of the number of unique TCRs that bound to targets within the "C19_cI" MIRA panel, while Table 9 (peptide-detail.csv) describes the identity of the TCRs bound per target for the "C19_cI" MIRA panel.
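The files described above can be combined offline. The following minimal sketch joins detail rows to subject metadata on the "Experiment" column and normalizes the two missing-value conventions ("N/A" and the empty string) to `None`. Only the "Experiment" column name comes from the table descriptions; the other column names in the toy data, and the assumption of one metadata row per experiment, are hypothetical.

```python
import csv
import io

MISSING = {"N/A", ""}  # "N/A" in general, empty string for HLA types


def load_rows(handle):
    """Read a CSV file object into dicts, mapping missing values to None."""
    rows = []
    for row in csv.DictReader(handle):
        rows.append({k: (None if v in MISSING else v) for k, v in row.items()})
    return rows


def join_on_experiment(detail_rows, metadata_rows):
    """Attach subject metadata to each detail row via 'Experiment'."""
    by_experiment = {m["Experiment"]: m for m in metadata_rows}
    joined = []
    for detail in detail_rows:
        merged = dict(by_experiment.get(detail["Experiment"], {}))
        merged.update(detail)  # detail columns win on name clashes
        joined.append(merged)
    return joined


# Toy stand-ins for peptide-detail.csv and subject-metadata.csv.
detail_csv = io.StringIO("TCR,Experiment\nCASSF,exp1\nCASSL,exp2\n")
metadata_csv = io.StringIO("Experiment,Subject,Age,HLA-A\nexp1,1,N/A,\n")

joined_rows = join_on_experiment(load_rows(detail_csv), load_rows(metadata_csv))
```

With the real files, the `io.StringIO` objects would be replaced by files opened with `open(path, newline="")`.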

Discussion
To assist in the understanding of the adaptive immune response to SARS-CoV-2, we generated the freely available ImmuneCODE database described herein, which includes a dataset of TCR rearrangements observed in individuals exposed to, infected with or recovered from COVID-19, and describes the ability of a subset of these TCRs to recognize SARS-CoV-2 epitopes. These data are provided to the scientific community at large with the goal of contributing to their research efforts to develop novel interventions to prevent and treat COVID-19.
In-depth understanding of the T-cell response to the COVID-19 causative agent may improve the accuracy of existing testing paradigms, and potentially provide an assessment of immunity. These immune response data may help to solve two of the key challenges we are facing in the current diagnostic paradigm, namely (1) detection of the virus in infected people who are asymptomatic, and (2) detection of past infections later than serology and in other cases where antibodies are not present.
Additionally, it is possible that identifying and tracking the T-cell response to the virus may provide insight as to the severity of a patient's illness, the length of any post-infection immunity period, the effect of the infection on individuals with cancer and other conditions conferring higher risk of severity, and the potential efficacy of vaccines in development.

ImmuneRACE experimental cohort and study approval
The ImmuneRACE study is a prospective, single group, multi-cohort, exploratory study of unselected eligible participants exposed to, infected with, or recovering from COVID-19 (NCT04494893). Participants, aged 18 to 89 years and residing in 24 different geographic areas across the United States, were consented and enrolled via a virtual study design. Cohorting was based on participant-reported clinical history following the completion of both a screening survey and study questionnaire.
Cohort 1 included participants exposed within 2 weeks of study entry to someone with a confirmed COVID-19 diagnosis, either based on positive PCR testing or clinician diagnosis. Cohort 2 participants included those clinically diagnosed by a physician or with positive laboratory confirmation of active SARS-CoV-2 infection via PCR testing. Cohort 3 included participants previously diagnosed with COVID-19 disease who have been deemed recovered based on two consecutive negative nasopharyngeal or oropharyngeal (NP/OP) PCR tests, clearance by a healthcare professional, or the resolution of symptoms related to their initial COVID-19 diagnosis. The ImmuneRACE study was approved by Western Institutional Review Board (WIRB reference number 1-1281891-1, Protocol ADAP-006). All participants were consented for sample collection and metadata use via electronic informed consent processes.
Whole blood, serum, and a nasopharyngeal or oropharyngeal swab were collected from participants by trained mobile phlebotomists. Blood samples were shipped frozen or at room temperature to Adaptive Biotechnologies for processing, including, but not limited to, DNA extraction and TCRβ analysis via the immunoSEQ Assay (Adaptive Biotechnologies, Seattle, WA) (Table 1). NP/OP swabs and serum were sent to Covance/Labcorp for further testing. An electronic questionnaire was administered to collect information pertaining to the participant's medical history, symptoms, and diagnostic tests performed for COVID-19 disease. Participants have the option to undergo additional blood draws and questionnaires over 2 months.

Global data collaborations
Whole blood samples were collected in K2EDTA tubes based on each institution's protocol and supervised by their respective Institutional Review Board. Samples were stored at the institution and sent to Adaptive as frozen whole blood, isolated PBMC or DNA extracted from either sample type for TCRβ analysis via the immunoSEQ Assay (see Table 1).

Sample analysis
A subset of the samples was processed for both T-cell receptor variable beta chain sequencing and MIRA, and another subset was processed by only one of these approaches. For each subject included in the dataset, SubjectID can be used to determine which assay or assays the samples were processed with.

T-cell receptor variable beta chain sequencing
Immunosequencing of the CDR3 regions of human TCRβ chains was performed using the immunoSEQ Assay as previously described6,7,8. In brief, extracted genomic DNA was amplified in a bias-controlled multiplex PCR, followed by high-throughput sequencing. Sequences were collapsed and filtered in order to identify and quantitate the absolute abundance of each unique TCRβ CDR3 region for further analysis.

Multiplexed Identification of TCR Antigen Specificity (MIRA)
To identify antigen-specific TCRs, T cells derived post-expansion from either of the above input cell types were used for the MIRA tool. Antigen-specific TCRs were identified as previously described9,10. Briefly, T cells were incubated overnight with MIRA peptide pools, and the antigen-specific subset was identified by CD137 upregulation. Following addition of peptides, cells were incubated at 37°C for ~18 hours. At the end of the incubation, replicate wells of cells were harvested from the culture, pooled, and then stained with antibodies for analysis and sorting by flow cytometry. Cells were then washed and suspended in PBS containing FBS (2%), 1 mM EDTA and 4′,6-diamidino-2-phenylindole (DAPI) for exclusion of non-viable cells. Cells were acquired and sorted using a FACS Aria (BD Biosciences) instrument. Sorted antigen-specific (CD3+CD8+CD137+) T cells were pelleted and lysed in RLT Plus buffer for nucleic acid isolation. Analysis of flow cytometry data files was performed using FlowJo (Ashland, OR).
RNA was isolated using AllPrep DNA/RNA mini and/or micro kits, according to manufacturer's instructions (Qiagen). RNA was reverse transcribed to cDNA using Vilo kits (Life Technologies). TCRβ amplification, sequencing and clonotype determination were performed as described in the 'T-cell receptor variable beta chain sequencing' section above.

MIRA tool design
T-cell populations were exposed to pooled peptides or transgenes in a combinatoric format, similar to the approach described in reference 10. According to the MIRA panel design, each antigen is strategically placed in a subset of K unique pools while being omitted from the remaining pools (total pools = N). This design assigns each antigen to one of the N choose K unique occupancies (also referred to as "addresses"), and allows for increased economies of scale as the number of pools (N) increases. In order to estimate an empirical false discovery rate and gauge assay quality, we purposefully left > 40% of the unique occupancies empty to assess the rate at which clones are spuriously sorted and detected in K pools with no query antigen present (hereinafter referred to as invalid TCR associations).
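The addressing scheme can be illustrated with a short sketch. It enumerates all N choose K pool subsets, assigns each antigen to one subset, and refuses any design that would leave 40% or fewer of the addresses empty. N = 11 matches the number of pools mentioned in the next section, while K = 5, the antigen names, and the simple "first available address" assignment are assumptions made for illustration; an actual panel design may place antigens differently.

```python
from itertools import combinations
from math import comb


def build_addresses(n_pools, k):
    """Enumerate every K-subset of N pools; each subset is one address."""
    return list(combinations(range(n_pools), k))


def assign_antigens(antigens, n_pools, k, min_empty_fraction=0.4):
    """Map antigens to addresses, keeping > min_empty_fraction empty."""
    addresses = build_addresses(n_pools, k)
    if len(antigens) / len(addresses) >= 1.0 - min_empty_fraction:
        raise ValueError("too many antigens for the requested empty fraction")
    occupied = dict(zip(antigens, addresses))  # naive in-order assignment
    empty = addresses[len(antigens):]          # addresses with no antigen
    return occupied, empty


def pool_contents(occupied, n_pools):
    """List the antigens that end up in each physical pool."""
    pools = [[] for _ in range(n_pools)]
    for antigen, address in occupied.items():
        for pool_index in address:
            pools[pool_index].append(antigen)
    return pools


N_POOLS, K = 11, 5  # K is a hypothetical choice
antigens = [f"antigen_{i}" for i in range(200)]
occupied, empty = assign_antigens(antigens, N_POOLS, K)
pools = pool_contents(occupied, N_POOLS)
```

Under these assumed values there are `comb(11, 5)` = 462 possible addresses, so 200 antigens leave 262 addresses empty for estimating the rate of invalid TCR associations.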

Matching clonotypes to antigens
T cells were aliquoted into 11 pools, and activated T cells were sorted using T-cell markers after overnight stimulation, as described previously10. These putative antigen-responding cells were set aside to characterize the T-cell clonotypes present in each sorted pool using the immunoSEQ Assay as described above. After immunosequencing, we examined the behavior of T-cell clonotypes by tracking the read counts of each unique TCRβ sequence across each sorted pool. True antigen-specific clones should be specifically enriched in a unique occupancy pattern that corresponds to the presence of one of the query antigens in K pools. We have previously reported on methods to assign antigen specificity to TCR clonotypes12; in addition, we developed a non-parametric Bayesian model to compute the posterior probability that a given clonotype is antigen specific. This model uses the available read counts of TCRs to estimate a mean-variance relationship within a given experiment, as well as the probability that a clone will have zero read counts due to incomplete sampling of low-frequency clones. The model takes the observed read counts of a clonotype across all N pools and estimates the posterior probability of the clone responding to each of the possible N choose K addresses, plus an additional hypothesis that the clone is activated in all pools (truly activated, but not specific to any of our query antigens). To define antigen-specific clones, we identified TCR clonotypes assigned to a query antigen by this model with a posterior probability >= 0.9.
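The final calling step can be sketched as follows. This example does not implement the Bayesian model itself: it takes per-clonotype posterior probabilities as given input and applies only the >= 0.9 threshold, separating calls at occupied addresses from calls at empty addresses (invalid TCR associations). The data structures, the `"ALL_POOLS"` label and the function name are hypothetical.

```python
ALL_POOLS = "ALL_POOLS"  # label for the "activated in all pools" hypothesis


def call_specific_clones(posteriors, address_to_antigen, threshold=0.9):
    """Apply the posterior threshold to precomputed model outputs.

    posteriors: {clonotype: {hypothesis: probability}}, where each
    hypothesis is an address (tuple of pool indices) or ALL_POOLS.
    Returns (calls, invalid): antigen calls and spurious-address calls.
    """
    calls = {}
    invalid = []
    for clonotype, hypotheses in posteriors.items():
        best, prob = max(hypotheses.items(), key=lambda item: item[1])
        if prob < threshold or best == ALL_POOLS:
            continue  # ambiguous, or activated without antigen specificity
        if best in address_to_antigen:
            calls[clonotype] = address_to_antigen[best]
        else:
            invalid.append(clonotype)  # confident call at an empty address
    return calls, invalid


# Toy example with three pools and K = 2.
address_to_antigen = {(0, 1): "pepA", (0, 2): "pepB"}
example_posteriors = {
    "clone1": {(0, 1): 0.95, (0, 2): 0.01, ALL_POOLS: 0.04},
    "clone2": {(0, 1): 0.50, (0, 2): 0.45, ALL_POOLS: 0.05},
    "clone3": {(1, 2): 0.97, ALL_POOLS: 0.03},
    "clone4": {(0, 1): 0.02, ALL_POOLS: 0.98},
}
calls, invalid = call_specific_clones(example_posteriors, address_to_antigen)
```

Counting the entries in `invalid` relative to the number of empty addresses is one simple way to gauge how often confident calls arise without any query antigen present.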

Declarations

Data and Software Availability
All immunosequencing data underlying this study are freely available for analysis and download from the Adaptive Biotechnologies immuneACCESS site under the immuneACCESS Terms of Use at https://clients.adaptivebiotech.com/pub/covid-2020.

The functional state of a rearrangement: in-frame (productive), out-of-frame, or containing a stop codon.

Rearrangement Type | rearrangement_type | string | The type of rearrangement process that generated a specific rearrangement.
Productive Frequency | productive_frequency | fraction (0.0 - 1.0) | The frequency of a specific productive rearrangement among all Productive Rearrangements within a sample. Calculated as the Templates for a specific rearrangement divided by the Sum of Productive Templates for a sample.
CDR1 Index | cdr1_start_index | integer | The index into the Extended Rearrangement string at which the CDR1 region begins.
CDR1 Rearrangement Length | cdr1_rearrangement_length | integer | The length (in characters) of the CDR1 region within Extended Rearrangement.
CDR2 Index | cdr2_start_index | integer | The index into the Extended Rearrangement string at which the CDR2 region begins.
CDR2 Rearrangement Length | cdr2_rearrangement_length | integer | The length (in characters) of the CDR2 region within Extended Rearrangement.
CDR3 Index | cdr3_start_index | integer | The index into the Extended Rearrangement string at which the CDR3 region begins.
CDR3 Length | cdr3_length | integer | The length of the CDR3 in nucleotides, starting from the first base of the codon for the conserved cysteine in the V gene through the last base of the codon for the conserved residue in the J gene that ends the CDR3.
V Index | v_index | integer | The index within the full nucleotide sequence that denotes the Cysteine beginning the CDR3.
N1 Index | n1_index | integer | The index within the full nucleotide sequence that denotes the start of the N1 (VD) region.
D Index | d_index | integer | The index within the full nucleotide sequence that denotes the start of the D region.
N2 Index | n2_index | integer | The index within the full nucleotide sequence that denotes the start of the N2 (DJ) region.
J Index | j_index | integer | The index within the full nucleotide sequence that denotes the start of the J region.

index_genome | Integer | The 1-based index of the first base of the ORF within the genome.
end_index_inclusive | Integer | The 1-based index of the last base of the ORF within the genome.

The ORF in which this target is located.
ORF Genebank ID | String | The identifier for the sequence from which the target was selected.
Amino Acid | String | The protein sequence of the target.

The unique TCRB sequence identified as binding to the target.

Experiment | String | The experiment in which the binding was observed (joins to the subject-metadata.csv file).
ORF | String | The ORF in which this minigene target is located.
ORF Genebank ID | String | The identifier for the sequence from which the target was selected.
Amino Acid | String | The protein sequence of the minigene target.
Start Index in Genome | Integer | The 1-based index of the first base of the target within the genome.
End Index in Genome | Integer | The 1-based index of the last base of the target within the genome.

The unique TCRB sequence identified as binding to the target.

Experiment | String | The experiment in which the binding was observed (joins to the subject-metadata.csv file).
String | The ORFs in which this target is located. Note that some targets sit on multiple ORFs.
String | The protein sequences that make up this target. Note that some targets include multiple peptides.