Systematic assessment of long-read RNA-seq methods for transcript identification and quantification

The Long-read RNA-Seq Genome Annotation Assessment Project (LRGASP) Consortium was formed to evaluate the effectiveness of long-read approaches for transcriptome analysis. The consortium generated over 427 million long-read sequences from cDNA and direct RNA datasets, encompassing human, mouse, and manatee species, using different protocols and sequencing platforms. These data were utilized by developers to address challenges in transcript isoform detection and quantification, as well as de novo transcript isoform identification. The study revealed that libraries with longer, more accurate sequences produce more accurate transcripts than those with increased read depth, whereas greater read depth improved quantification accuracy. In well-annotated genomes, tools based on reference sequences demonstrated the best performance. When aiming to detect rare and novel transcripts or when using reference-free approaches, incorporating additional orthogonal data and replicate samples is advised. This collaborative study offers a benchmark for current practices and provides direction for future method development in transcriptome analysis.

the reduced sequencing depth and higher error rates of long-read sequencing approaches may offset the improvements.

To evaluate long-read approaches for transcriptome analysis, we formed the Long-read RNA-Seq Genome Annotation Assessment Project (LRGASP) Consortium, modeled after the previous GASP [11], EGASP [12], and RGASP [9,10] efforts. For this project, we aim for an open community effort in order to be as transparent and inclusive as possible in evaluating technologies and computational methods (Fig 1).

The LRGASP Consortium will evaluate three fundamental aspects of transcriptome analysis. First, we will assess the reconstruction of full-length transcripts expressed in a given sample from a well-curated eukaryotic genome such as human and mouse. Second, we will evaluate the quantification of the abundance of each transcript. Finally, we will assess de novo reconstruction of full-length transcripts from samples without a high-quality genome, which would be beneficial for annotating genes in non-model organisms. These evaluations became the basis of the three challenges that comprise the LRGASP effort (Box 1).

Challenge 1: Transcript isoform detection with a high-quality genome
Goal: Identify which sequencing platform, library prep, and computational tool(s) combination gives the highest sensitivity and precision for transcript detection.

Challenge 2: Transcript isoform quantification
Goal: Identify which sequencing platform, library prep, and computational tool(s) combination gives the most accurate expression estimates.

Challenge 3: De novo transcript isoform identification
Goal: Identify which sequencing platform, library prep, and computational tool(s) combination gives the highest sensitivity and precision for transcript detection without a high-quality annotated genome.

The LRGASP Challenges will use data produced by the LRGASP Consortium Organizers (Fig 1b), including samples being sequenced on all platforms whose reads will not be released until after the end of the challenge. All samples were grown as biological triplicates, with the RNA extracted at one site, spiked with 5'-capped Spike-In RNA Variants (Lexogen SIRV-Set 4), and distributed to all production groups. After sequencing, reads for human and mouse samples were deposited at the ENCODE Data Coordination Center (DCC) for community access, including but not limited to the challenges. A single replicate of manatee whole blood transcriptome was generated for Challenge 3. For each sample, we performed different cDNA preparation methods, including an early-access ONT cDNA kit (PCS110), ENCODE PacBio cDNA, R2C2 [13] for increased sequence accuracy of ONT data, and CapTrap to enrich for 5'-capped RNAs. CapTrap is derived from the CAGE technique [14] and was adapted for lrRNA-seq (manuscript in preparation). We also performed direct RNA sequencing (dRNA) with ONT.

Participants may take part in any or all challenges (Submissions and timeline). We will compare solutions where only lrRNA-seq data was used and solutions that include additional publicly available data. Depending on the challenge, participants will submit either a GTF or quantification file, additional metadata, and a link to a repository (e.g., GitHub) where a working copy of the exact analysis pipeline used to generate their results can be downloaded. We expect to re-run analysis pipelines for well-performing submissions to help ensure reproducibility. The evaluation of the challenge will comprise both bioinformatics and experimental approaches.
SQANTI3 (https://github.com/ConesaLab/SQANTI3) will be used to obtain transcript features and performance metrics that will be computed on the basis of SIRV-Set 4 spike-ins, simulated data, and a set of undisclosed, manually curated transcript models defined by GENCODE [15]. Human models will further be compared to histone modification ChIP-seq, open chromatin, CAGE, and poly(A)-seq results. Experimental validation will be performed on a select number of loci with either high agreement or disagreement between sequencing platforms or analysis pipelines. Evaluation scripts and experimental protocols will be publicly available in advance of submission deadlines (Data and code availability).

Additional details of all protocols for library preparation and sequencing can be found at the ENCODE DCC and are linked to each dataset produced by LRGASP (Supplementary Table 1).

Capping SIRVs
Exogenous synthetic RNA references (spike-ins) are widely used to calibrate measurements in RNA assays, but they lack the 7-methylguanosine (m7G) cap structure that most natural eukaryotic RNA transcripts bear at their 5' end. This characteristic makes commercial spike-in mixes unsuitable for library preparation protocols involving 5' cap enrichment steps. Therefore, we enzymatically added the appropriate m7G structure to the SIRV standards used in this challenge. Specifically, the pp5'N structure present at the 5' end of the spike-in sequences was used as a substrate for the Vaccinia capping enzyme (catalog no. M2080S, New England BioLabs) to add the m7G structure to SIRV-Set 4 (Iso Mix E0 / ERCC / Long SIRVs, catalog no. 141.03, Lexogen). A total of ten vials of SIRV-Set 4 (100 µl) were employed to perform the capping reaction (final total mass of 535 ng). The reaction was performed following the recommendations of the manufacturer's capping protocol with two minor changes: 3.5 µl of

The genome of the Florida manatee Lorelei was sequenced using Nanopore and PacBio. Lorelei is the same individual manatee for which an Illumina-based genome assembly was released by the Broad Institute in 2012 [16]. An EDTA, -80 °C whole blood sample aliquot was used. gDNA was extracted from 1400 µl of blood using the DNeasy kit (QIAGEN, MD, USA) following the company's specifications for 100 µl aliquots of blood. Thawed blood was diluted 1:1 with RNase-free phosphate-buffered saline 1x (Gibco, UK), 20 µl of proteinase K (QIAGEN, MD, USA), and 200 µl of AL lysis buffer (QIAGEN, MD, USA) and vortexed immediately. It was incubated at 56 °C for 10 minutes. Then, we added 200 µl of 96% ethanol and mixed thoroughly. The mixture was added to the DNeasy mini spin-column and centrifuged at 6,000 x g for 1 minute.
The column was washed with 500 µl of AW1 solution (QIAGEN, MD, USA) and centrifuged at 6,000 x g for 1 minute, followed by a wash with 500 µl of AW2 (QIAGEN, MD, USA) and centrifugation at 20,000 x g for 3 minutes. gDNA was eluted twice with 100 µl of AE buffer added to the center of the column, incubated for 1 minute, and centrifuged at 6,000 x g for 1 minute. The first and second elutions from the DNeasy mini spin-column were pooled and concentrated using a speed vacuum for 20 minutes, in which each preparation was reduced from 200 to 50 µl. All gDNA tubes were pooled and the DNA was cleaned with AMPure magnetic beads (Beckman Coulter Life Sciences, IN, USA) at a ratio of 0.5:1, beads volume to gDNA volume (50 µl of beads to 100 µl of gDNA). gDNA bound to the beads was washed twice with 1 ml of 70% ethanol. Ethanol traces were brought to the bottom of the tube by a quick spin and removed with a pipette. Then, the beads were dried for 2 minutes and gDNA was eluted in 55 µl of EB buffer (QIAGEN, MD, USA) at 37 °C with a 10-minute incubation. This process was repeated twice. Quantification of gDNA was performed with a Qubit fluorometer (Thermo Fisher Scientific) and the quality of the gDNA was assessed using an Agilent genomic

transcripts without a cap. 2 µl of exonuclease-treated RNA were mixed with a priming reaction (RNase inhibitor, dNTPs, and water) and incubated at 72 °C for 3 minutes, then ramped down to 50 °C. While in the PCR block, we added oligo dT (stock concentration 10 nM) and incubated for 3 min at 50 °C. We then added a first-strand synthesis buffer (5x RT buffer, TS oligo, water) that had previously been incubated at 50 °C for one minute. The reaction was then incubated in the PCR block (extension at 50 °C for 90 min, 85 °C for 5 min, and a hold at 4 °C).
To the same reaction we added a mix for amplification (2x reaction buffer, IS primers at 20 nM stock, water, and SeqAmp polymerase). We then ran a PCR program to amplify the cDNA (95 °C for 1 min; then 10 cycles of 98 °C for 15 sec, 65 °C for 30 sec, and 68 °C for 13 min; followed by incubation at 72 °C for 10 min and a hold at 4 °C). The amplified products were purified using SPRI beads and checked for quality on a Bioanalyzer.

R2C2 preparation for ONT sequencing of human and mouse
For each biological replicate, two libraries were created: a regular (non-size-selected) library, and a size-selected library of cDNA over 2 kb in length to achieve higher coverage of longer transcripts. For each RNA sample, 400 ng was used to generate full-length single-stranded cDNA using an indexed oligo(dT) primer and a template-switching oligo (TSO). PCR was used to generate the second strand and amplify the library. The cDNA was then isolated by SPRI bead clean-up. For the size-selected libraries, cDNA was run on a 1% low-melt agarose gel. A smear in the range of 2-10 kb was excised from the gel and digested with beta-agarase, followed by SPRI bead clean-up. At this point, indexed cDNA from each biological replicate was pooled together equally. cDNA was circularized using a short DNA splint with sequence complementary to the cDNA ends by Gibson Assembly (NEBuilder, NEB) with a 1:1 cDNA:splint ratio (100 ng each). After Gibson assembly, a linear digestion (ExoI, ExoIII, and Lambda Exonuclease) was performed to eliminate non-circularized DNA. The circular Gibson assembly product was cleaned up using SPRI beads. The circularized library was used as template for rolling circle amplification (RCA) using Phi29 polymerase and random hexamer primers. Following the RCA reaction, T7 endonuclease was used to debranch the DNA product. A DNA clean-and-concentrator column was used to purify the DNA. Purified RCA product was size-selected using a 1% low-melt agarose gel. The main band just over the 10 kb marker was excised from the gel and digested with beta-agarase, followed by SPRI bead clean-up. The cleaned and size-selected RCA product was sequenced using the ONT 1D Genomic DNA by

Template-switching reaction and water. Amplified cDNA was purified by AMPure, one round at a 0.8-1.0:1 beads-to-sample ratio and one round at a 0.65:1.0 ratio.
The yield of amplified cDNA by this modified protocol (300-400 ng) was about 10-fold lower than the standard protocol (i.e., without globin removal). The average cDNA size was ~1400 bp. When increased amounts of cDNA were desired, the cDNA was amplified by 5 additional PCR cycles. Two preps obtained with the above-described protocol were pooled together and 500 ng was loaded on an electrophoretic lateral fractionation system (ELF, Sage Science). Fragments above 2.5 kb were collected, re-amplified (10 cycles), and re-pooled equimolarly with non-size-selected cDNA fragments. This re-pooled cDNA prep is referred to as "enriched cDNA_>2.5kb". Both non-enriched cDNA and enriched cDNA_>2.5kb were used for SMRTbell library construction, starting with 1 µg of cDNA as described (PacBio Iso-Seq protocol).

Full manual annotation will be undertaken on 50 selected loci on both the human and mouse reference genomes. Transcript models will only be annotated during this exercise based on their support from long transcriptomic datasets generated by the consortium specifically for LRGASP. That is, no transcript annotation will be based on transcriptomic data from externally produced datasets, although annotators will use any publicly available orthogonal data to aid interpretation of aligned consortium data. For example, FANTOM5 CAGE datasets will be used to help identify transcription start sites and transcript 5' ends, and RNA-seq-supported introns derived from high-throughput reanalysis pipelines such as Recount will be used to support putative introns identified in the alignments of long transcriptomic data.

Manual annotation will be performed according to the guidelines of the HAVANA (Human And Vertebrate Analysis aNd Annotation) group [15,27]. Transcriptomic data will be aligned to the human and mouse reference genomes using appropriate methods.
We will test the benefits of aligning the transcriptomic data using multiple methods to reduce the impact of alignment errors and artefacts.

Annotators will also take advantage of local alignment tools integrated into annotation software to give further alternative views of alignments and improve annotation accuracy. Transcript models will be manually extrapolated from the alignments by annotators using the otter annotation interface [28]. Alignments will be navigated using the Blixem alignment viewer [29,30] and, where required, visual inspection of the dot-plot output from the Dotter tool [31] will be used to resolve any alignment with the genomic sequence that is unclear or absent from Blixem. Short alignments (<15 bases) that cannot be visualized using Dotter will be detected using Zmap DNA Search [31] (essentially a pattern-matching tool). The construction of exon-intron boundaries will require the presence of canonical splice sites (defined as GT-AG, GC-AG, and AT-AC), and any deviations from this rule will be given clear explanatory tags (for example, non-canonical splice site supported by evolutionary conservation). All non-redundant splicing transcripts at an individual locus will be used to build transcript models, and all alternatively spliced transcripts will be assigned an individual biotype based on their putative functional potential.
Once the correct transcript structure has been ascertained, the protein-coding potential of the transcript will be determined on the basis of its context within the locus; similarity to known protein sequences; the sequences of orthologous and paralogous proteins; candidate coding regions (CCRs) identified by PhyloCSF; evidence of translation from mass spectrometry and Ribo-seq data; the presence of Pfam functional domains; the presence of possible alternative ORFs; the presence of retained intronic sequence; and the likely susceptibility of the transcript to nonsense-mediated mRNA decay (NMD). Although the annotation of transcript functional biotype and CDS is not required of submitters, it will be added to transcripts as a matter of routine manual annotation and may be used to investigate the detection or non-detection of groups of transcripts by submitters. Where necessary, annotations will be checked by a second annotator to ensure completeness and consistency of annotation between the genes annotated for LRGASP and the remainder of the Ensembl/GENCODE gene set.

Challenge 1 Evaluation: Transcript isoform detection
Four sets of transcripts will be used for evaluation of transcript calls made on human and mouse lrRNA-seq data:
1. Lexogen SIRV-Set 4 (SIRV-Set 3 plus 15 new long SIRVs with sizes ranging from 4 to 12 kb).
2. Comprehensive GENCODE annotation: human v39, mouse vM28. GENCODE human v38 and vM27 are available at the time of the LRGASP data release, and new versions of GENCODE will be released after the close of LRGASP submissions.
3. A set of transcripts from a subset of undisclosed genes which will be manually curated by GENCODE. These transcripts will thus be considered high-quality models derived from LRGASP data.
4. Simulated data for both Nanopore (NanoSim) and PacBio (IsoSeqSim) reads.

The rationale for including these different types of transcript data is that each set creates a different evaluation opportunity, but also has its particular limitations. For example, SIRVs and simulated data provide a clear ground truth that allows the calculation of standard performance metrics such as sensitivity, precision, or false discovery rate. Evaluation of SIRVs can identify potential limitations of both library preparation and sequencing, but the SIRVs themselves represent a dataset of limited complexity. Higher complexity can be generated when simulating long reads based on actual sample data. However, read simulation algorithms only capture some potential biases of the sequencing technologies (e.g., error profiles) and not of the library preparation protocols. In any case, both types of data approximate, but do not fully recapitulate, real-world datasets. Evaluation against the GENCODE annotation [15] represents this real dataset scenario, although in this case the ground truth is not entirely known.
This limitation will be partially mitigated by the identification of a subset of GENCODE transcript models that will be revised and deemed high-confidence by GENCODE curators, and by follow-up experimental validation for a small set of transcripts using semi-quantitative RT-PCR and quantitative PCR (qPCR) approaches. In this way, although an exhaustive validation of the real data is not possible, estimates of the methods' performance can be inferred. By putting together evaluation results obtained with all these different benchmarking datasets, insights will be gained on the performance of the library preparation, sequencing, and analysis approaches in both absolute and relative terms.
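Where a ground truth is available (SIRVs and simulated reads), the standard performance metrics mentioned above reduce to simple set operations. A minimal sketch, assuming transcripts are matched to the truth set by identifier (the actual LRGASP evaluation matches models structurally via SQANTI3):

```python
def detection_metrics(truth, detected):
    """Standard detection metrics given a ground-truth set of transcript
    identifiers (e.g., SIRVs or simulated isoforms) and a pipeline's calls.

    Sensitivity: fraction of true transcripts recovered.
    Precision:   fraction of detected transcripts that are true.
    FDR:         1 - precision (false discovery rate).
    """
    tp = len(truth & detected)  # true positives: correctly detected
    sensitivity = tp / len(truth) if truth else 0.0
    precision = tp / len(detected) if detected else 0.0
    return {"sensitivity": sensitivity,
            "precision": precision,
            "fdr": 1.0 - precision}
```

For example, a submission detecting two of four true isoforms plus one spurious call would score sensitivity 0.5 and precision 2/3.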

The evaluation of the transcript models will be guided by the use of SQANTI categories [20] (Fig 2a), implemented in the SQANTI3 software (https://github.com/ConesaLab/SQANTI3), and will incorporate additional definitions and performance metrics to provide a comprehensive framework for transcript model assessment (Table 2). The evaluation considers the accuracy of the transcript models both at splice junctions and at 3'/5' transcript ends. It will take into account external sources of evidence such as CAGE data, polyA annotation, and support by Illumina reads (Fig 2b). A number of novel transcripts detected by all or most pipelines, as well as pipeline-, platform-, or library-preparation-specific transcripts, will be selected for experimental validation and manual review by the GENCODE project. The evaluation script is provided to participants (Data and code availability).

In order to evaluate SIRVs, we will extract from each submission all transcript models that associate to SIRV sequences after SQANTI3 analysis. This not only includes FSM and ISM isoforms of SIRVs, but also NIC, NNC, antisense, and fusion transcripts mapping to SIRV loci. The metrics for SIRV evaluation are defined as follows.
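The core SQANTI structural categories used throughout this evaluation can be sketched as a junction-chain comparison against the reference annotation. This is a deliberately simplified, hypothetical rendering; SQANTI3's real classifier also handles mono-exon transcripts, strand, reference TSS/TTS distances, and the antisense/fusion/intergenic cases:

```python
def classify_transcript(query_junctions, reference):
    """Assign a simplified SQANTI-style category to a transcript.

    query_junctions: tuple of (donor, acceptor) splice junction coordinates.
    reference: dict mapping reference transcript id -> tuple of junctions.
    """
    known_junctions = {j for chain in reference.values() for j in chain}
    # FSM (full splice match): junction chain identical to a reference model
    if any(query_junctions == chain for chain in reference.values()):
        return "FSM"
    # ISM (incomplete splice match): consecutive sub-chain of a reference model
    n = len(query_junctions)
    for chain in reference.values():
        if any(query_junctions == chain[i:i + n]
               for i in range(len(chain) - n + 1)):
            return "ISM"
    # NIC (novel in catalog): only known junctions, but a new combination
    if all(j in known_junctions for j in query_junctions):
        return "NIC"
    # NNC (novel not in catalog): at least one junction absent from the reference
    return "NNC"
```

With a single reference model spanning three junctions, a matching chain is FSM, a consecutive fragment is ISM, a non-consecutive recombination of known junctions is NIC, and any unannotated junction yields NNC.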

Simulated Data
The simulated data contain both transcript models based on the current GENCODE annotation and a number of simulated novel transcripts that will result in true NIC and NNC annotations. Transcript models generated from simulated data will be analysed by SQANTI3, providing a GTF file that includes all simulated transcripts (GENCODE and novel) and excludes all transcripts for which reads were not simulated. The evaluation metrics for simulated data are defined as follows:

Comprehensive GENCODE annotation

Submitted transcript models will be analyzed with SQANTI3 using the newly released GENCODE annotation, and different metrics will be obtained for FSM, ISM, NIC, NNC, and Other models according to the scheme depicted below. Transcripts from new genes included in the latest annotation release will be catalogued as "Intergenic" initially, but considered FSM, ISM, NIC, or NNC with an updated GENCODE annotation. This will allow evaluation of gene and transcript discovery in unannotated regions.

We will evaluate transcript isoform quantification performance with both simulated and real sequencing data, which includes SIRV-Set 4. While the ground truth is known for the simulated data and SIRV-Set 4, we will experimentally quantify the abundances of transcript isoforms from select loci (genes) within the LRGASP samples. Specifically, we will interrogate the presence of specific transcript isoforms using qPCR measurements of isoform-specific regions, and will obtain such data using an aliquot of the exact same RNA that was used to generate the LRGASP datasets (human and mouse).

Evaluation metrics
We evaluate the quantification performance for different data scenarios (Figure 3). SCC evaluates the monotonic relationship between the estimation and the ground truth, which is based on the rank of transcript isoform abundance (Supplementary Fig. S1). Next, based on the ground truth values and a given threshold (e.g., 1 as below), we can define whether a transcript isoform is truly differentially expressed (positives) or not (negatives). Based on the estimated values, we can also obtain the "predicted positives" and "predicted negatives" with the same threshold. Therefore, we can identify true positives, true negatives, false positives, and false negatives to calculate the ROC-based statistics, including precision, recall, accuracy, F1-score, AUC, and pAUC, and also plot the ROC curve (Supplementary Fig. S2).
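As a sketch of these quantification metrics, the rank-based SCC and the threshold-based confusion statistics might be computed as follows. Ties in the ranking are ignored for brevity; a production evaluation would use a library implementation such as scipy.stats.spearmanr:

```python
from statistics import mean

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on ranks
    (no tie correction in this simplified sketch)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank + 1)
        return r
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

def de_confusion(true_lfc, est_lfc, threshold=1.0):
    """Precision, recall, and F1 for calling differential expression by
    thresholding the absolute (log) fold change, as described above."""
    tp = fp = fn = 0
    for t, e in zip(true_lfc, est_lfc):
        truly, pred = abs(t) >= threshold, abs(e) >= threshold
        if truly and pred:
            tp += 1
        elif truly:
            fn += 1
        elif pred:
            fp += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

AUC and pAUC then follow by sweeping the threshold and integrating the resulting recall/false-positive-rate curve.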

The above metrics will be used for SIRVs and a subset of isoforms whose abundances were experimentally determined. In the case of SIRV sequencing, we would not expect fold-change differences between conditions, as the SIRVs were spiked in at relatively the same concentration in all samples.

The reproducibility statistic characterizes the average standard deviation of abundance estimates among different replicates (Supplementary Fig. S3). A small value of this metric indicates that the method has high reproducibility. We can also plot the standard deviation versus average abundance to examine how the standard deviation changes with respect to abundance; the area under this curve is calculated as a secondary statistic.
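A minimal sketch of the reproducibility statistic, assuming abundance estimates are keyed by isoform identifier:

```python
from statistics import mean, pstdev

def reproducibility(abundances):
    """Average standard deviation of each isoform's abundance estimates
    across replicates; smaller values indicate higher reproducibility.

    abundances: dict mapping isoform id -> list of per-replicate estimates.
    """
    return mean(pstdev(vals) for vals in abundances.values())
```

Plotting each isoform's standard deviation against its mean abundance, as described above, then only requires pairing `pstdev(vals)` with `mean(vals)` per isoform.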

• Consistency
A good quantification method should characterize abundance patterns consistently across replicates. Here, we propose a consistency measure to examine the similarity of abundance profiles between mutual pairs of replicates (Supplementary Fig. S4).

Most methods for transcript isoform quantification assign sequencing coverage to isoforms; therefore, the exon-isoform structure of a gene is a key factor influencing quantification accuracy. Here, we use a statistic, the K-value (manuscript in preparation, Supplementary Fig. S6).

Challenge 3 will evaluate the applicability of lrRNA-seq for de novo delineation of transcriptomes in non-model organisms. The evaluation will assess the capacity of technologies and analysis pipelines both for defining accurate transcript models and for correctly identifying the complexity of expressed transcripts at genomic loci when genome information is limited. We will evaluate two different scenarios: a) a genome sequence is available but no gene annotation is available, and b) no genome assembly is available at all.

The challenge includes three types of datasets. The mouse ES transcriptome data (Table 1) will be used to request the reconstruction of mouse transcripts without making use of the available genome or transcriptome resources for this species. Models will be compared to the true set of annotations with the same set of parameters as in Challenge 1. While this dataset allows for a quantitative evaluation of transcript predictions in Challenge 3, it might deliver unrealistic results if analysis pipelines were somehow biased by information derived from prior knowledge of the mouse genome. To avoid this problem, a second dataset is used that corresponds to the whole blood transcriptome of the Florida manatee (Trichechus manatus). An Illumina draft genome of this organism exists (https://www.ncbi.nlm.nih.gov/assembly/GCF_000243295.1/), and the LRGASP consortium has generated a long-read genome assembly to support transcript predictions for this species. Additionally, Illumina data have been generated for this challenge, and an existing set of 454 transcriptome data will be used. Again, we will evaluate pipelines that obtain transcript models without genome annotation but with these draft genome sequences, and without genome assembly data at all. Since no curated gene models exist for the manatee, Challenge 1 metrics cannot be applied. Instead, the evaluation of this dataset will involve comparative assessment of the reconstructed transcriptomes and experimental validation. For comparative assessment, the following parameters will be calculated. We expect that well-performing pipelines will obtain longer transcripts, well supported by Illumina data, with a high mapping rate to the draft genomes, most of them coding, and with higher BUSCO completeness and Blast2GO annotation potential.
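Several of the comparative-assessment parameters are simple summary statistics of the reconstructed transcript set. For example, transcript length distributions are commonly summarized by an N50 value, sketched here (the exact parameter set used for the manatee evaluation may differ):

```python
def n50(lengths):
    """N50 of reconstructed transcript lengths: the length L such that
    transcripts of length >= L together contain at least half of all
    assembled bases. Longer N50 suggests more complete transcripts."""
    total = sum(lengths)
    acc = 0
    for length in sorted(lengths, reverse=True):
        acc += length
        if 2 * acc >= total:
            return length
    return 0
```

Mapping rate to the draft genome and BUSCO completeness would be taken directly from the aligner and BUSCO reports, respectively.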
Finally, the manatee long-read data also contain spiked-in SIRVs, which will be used to compute performance metrics for Challenge 3 analysis settings, using the same type of metrics as described for Challenge 1. qPCR assays have previously been developed for Florida manatee blood samples, specifically for interleukin (IL)-2, -6, and -10, interferon-gamma (IFN-gamma), and tumor necrosis factor-alpha (TNF-alpha), and these methods will be adopted for the development of isoform-specific assays.

Experimental validation of transcript models and expression estimates
Independent experimental validation will be performed to assess the accuracy of novel features and transcript isoforms characterized from the lrRNA-seq data in all challenges. In the evaluation of full-length transcripts, several local and long-range elements must be considered. Local elements include the 5' end of the transcript, splice sites and junctions, novel exons, retained introns, and polyA sites. Long-range elements include chained series of junctions. We will employ a suite of several assays in order to validate both the local and long-range elements.

Challenge 1 Evaluation: Transcript isoform detection
The goal of this challenge is to assess the comprehensive and reliable detection of all transcripts in biological samples. Similar to past studies that have employed lrRNA-seq approaches to characterize the transcriptome, we expect that participants in this challenge will produce a large number of novel isoforms. Therefore, the approaches to assess the accuracy of transcript isoforms that were previously described (e.g., SIRV standards, GENCODE manual annotation) will be complemented with experimental validation.

We will employ several high-throughput sequencing-based assays to validate local elements, such as novel 5' ends, splice junctions, and polyA sites, on a "global" scale. Note that these experimental assays have been or will be carried out using the same aliquot of total RNA as was used to generate the LRGASP datasets, minimizing differences in detected features due to biological or inter-laboratory variability. To validate novel 5' ends, we will use a recently generated deep-coverage CAGE dataset on the WTC-11 line. To validate novel splice junctions, we will use Illumina RNA-seq to validate novel junctions and, wherever possible, exons or series of connected exons. To validate novel polyadenylation sites, we will collect polyA-seq data using the Quant-Seq method from Lexogen, which can map polyA sites de novo. Additionally, in select cases, novel 5' ends will be further corroborated through chromatin-based functional information derived from ENCODE data, such as the presence of Pol II or histone marks that are indicative of active promoters.

Longer-range features within a transcript, such as chains of junctions, are difficult and sometimes impossible to detect through short-read sequencing approaches or traditional qPCR; therefore, we will employ targeted amplicon sequencing followed by ONT, PacBio, and Sanger sequencing.
We plan to select 96 targets from human WTC-11 cells and 96 targets from the mouse 129/Castaneus cells. Each target will comprise a sequence region 300 to 1500 bp long. Two replicates each from the WTC-11 and 129/Castaneus samples will be apportioned for a reverse-transcriptase reaction followed by target amplification using isoform-specific primers. We will conduct the assay in plate format to allow for high-throughput processing. All products following RT-PCR will be pooled and subjected to long-read sequencing for validation. A subset of these samples will be selected for Sanger sequencing.

Positive controls will be selected as subsegments of isoforms which are found in GENCODE human v39 and mouse vM28, all long-read datasets across the ONT and PacBio platforms, and a majority (>50%) of the computational pipelines. Negative controls will also be selected, involving isoforms that are detected in other human and mouse cell types (e.g., pancreas cells), but for which there is no evidence of expression across any of the long-read datasets in LRGASP.

An open question in the field is the accuracy of novel isoforms that are frequently detected on long-read platforms, and so we will devote substantial effort to the validation of novel isoforms. At least 12 targets will involve junction chains that are novel (not in GENCODE) but found across all lrRNA-seq library types. We also reserve resources to validate platform-specific isoforms, should they arise. Lastly, we reserve at least 24 targets for miscellaneous categories, such as the appearance of certain isoforms in specific computational pipelines.

For novel target selection, preference will be given to targets that correspond to the pre-selected 50 loci that will be manually annotated by GENCODE, and there will be close coordination between the working groups.

In addition to validation using a PCR-based approach, qPCR of 10-20 transcript models will be performed. Due to the difficulty of properly resolving and apportioning signals for short junctions or exons to the full-length transcript isoforms they arose from, we will choose isoforms with low and high K-values, representing various levels of identifiability. In some cases, we will increase the length of qPCR targets up to the 500-600 bp range so as to increase the resolution and specificity of isoform measurements. Internal standards will be spiked in for the highest accuracy and precision of isoform abundance estimates. Targeted amplicon sequencing with long-read platforms will also be performed on these transcript models to determine fold-change differences.

Due to the challenges of isoform-level quantification and the lack of a gold standard, we devised a mixture sample, in which an undisclosed ratio of two samples is mixed before sequencing. For validation, we sequenced H1 and H1-DE samples individually to establish the isoforms present in only one or the other sample before mixing.
In essence, the pre-mixed samples represent the "ground truth" of isoform expression before the mix. After the close of LRGASP submissions, the H1 and H1-DE long-read data will be released, and participants in Challenge 2 will need to provide transcript quantifications from these additional datasets. Libraries and computational pipelines can then be evaluated based on how well the transcript quantification in the H1:H1-DE mix sample reproduces the expected ratios determined from quantification of the individual cell lines.
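One way to score submissions against this mixture design is to estimate the implied mixing fraction from the individual-sample quantifications. A least-squares sketch under the assumption that mixed abundances are a convex combination of the per-sample abundances (the actual LRGASP evaluation procedure may differ):

```python
def estimate_mix_fraction(h1, h1de, mix):
    """Least-squares estimate of the H1 fraction alpha, assuming
    mix_i ~= alpha * h1_i + (1 - alpha) * h1de_i per isoform.

    h1, h1de, mix: per-isoform abundance vectors for the individually
    sequenced samples and the mixture. Minimizing the squared residual
    gives alpha = sum((mix - h1de) * d) / sum(d**2) with d = h1 - h1de.
    """
    num = den = 0.0
    for a, b, m in zip(h1, h1de, mix):
        d = a - b
        num += (m - b) * d
        den += d * d
    return num / den
```

A pipeline whose mix-sample quantifications imply a fraction close to the undisclosed mixing ratio, with small residuals, quantifies consistently across the three datasets.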

Challenge 3 Evaluation: De novo transcript isoform detection without a high-quality genome
Similarly to Challenge 1, the primary goal of experimental validation in this challenge is to confirm the identity of de novo assembled isoforms, many of which will be novel. A number of loci from well-studied immune-related genes will be selected for experimental PCR validation, as in the mouse/human data. To validate isoforms containing novel junction chains, we will employ a similar amplicon sequencing strategy as described for Challenge 1, in which up to 96 primer pairs will be used to amplify isoform-specific regions for subsequent detection on a sequencing platform. In addition, 454 sequencing data exist from these same samples, which can also be leveraged for orthogonal validation.

The following is an overview of the data used for each challenge and the result files that will be submitted (Supplementary Figure S7).