Assembly of the IG lambda chain locus
We first generated high-quality assemblies from six 1KGP individuals(24)(Fig. Methods Overview) (NA18956, NA19240, NA18555, NA12878, NA19129, NA12156) using fosmid clones and long-read sequencing data derived from selectively sequencing DNA enriched for IGL as described in the Materials and Methods. An average of 80 fosmid clones per individual (Table 1) whose end-sequences mapped to the IGL locus were sequenced using SMRT long-read sequencing on the Pacific Biosciences RSII and/or Sequel IIe systems. Each fosmid was uniquely assembled and then used to construct haplotype-resolved contigs spanning the IGL locus. These initial assemblies were further augmented using contigs assembled from targeted long-read sequencing to fill any remaining gaps from the fosmid-only assemblies. Finally, these targeted HiFi reads were also used to polish the assemblies. In the cases of NA19240 and NA12878, publicly available whole genome long-read sequencing data were also used to resolve any residual gaps. Using familial data, haplotype phasing accuracy was confirmed for NA19240, NA12878, NA19128, and NA12156.
Among the haplotypes assembled from 6 donors, we observed a range in locus coverage (20-100%; Table S1). The majority of identified gaps were the result of V(D)J recombination, as these samples are all derived from EBV-transformed B cell lymphoblastoid cell lines (Table S1). Seven haplotypes, including those from NA18956, NA19240, NA18555, NA12156, spanned the entire IGL locus with zero gaps. Locus coverage for the remaining assemblies ranged from 99.53% in NA19129 to 20.2% in NA12878 (overall mean=92.64%). The most significant example of this was for NA12878, in which a single V(D)J recombination resulted in the loss of sequence spanning IGLV3-6 to IGLC2 for the first haplotype and >80% of the locus from the second haplotype.
To further investigate sequence diversity in IGL we used IG-Capture to sequence 10 additional individuals from the 1KGP cohort (Fig. Methods Overview). In these individuals, average read depth across IGL ranged from 105x to 453x coverage (mean=183x; Table S2). Assembly contigs generated by IGenotyper(16) covered 97.8% to 99.9% of the IGL locus for individuals with an overall average coverage of 99%. Due to V(D)J recombination artifacts, gaps of 108 KB and 203 KB were observed in one haplotype each of HG02061 and NA10831, respectively (Table S2). Average phasing block size over IG regions ranged from 36 KB (HG01258) to 119 KB (NA18508) (mean=72 Kb). Samples without V(D)J artifacts had 62% (HG01258) to 94% (NA18508) of IG regions covered by phased blocks (mean=81.75%; Table S3).
Allelic Diversity in the IGLV Genes
We were able to haplotype resolve all functional and ORF alleles for five of the six hybrid assembly samples (Fig. 1A). Due to aforementioned V(D)J recombination associated artifacts, NA12878 was missing IGLV3-1 and IGLV4-3 on both haplotypes, and an additional 26 genes on one of the haplotypes (haplotype “2”; Fig. 1A). For the ten capture samples, we were able to fully resolve all genes from both haplotypes in eight samples. Again, due to artifacts associated with V(D)J recombination, HG02061 was missing nine genes on one haplotype and NA10831 was missing 14 genes on one haplotype (Fig. 1B).
The number of non-IMGT novel alleles identified in these assemblies ranged from three to ten per individual, with a total of 36 unique novel alleles found collectively across all samples. Of the 36 novel alleles identified, 12 were present in >1 individual (Fig. 1D). Relative to the closest IMGT allele, eight of the identified novel alleles were defined by only synonymous amino acid substitutions. Conversely, 25 novel alleles contained one or more amino acid (AA) substitutions, with 3 alleles containing 2 AA substitutions, and 2 alleles containing 3 AA substitutions. Interestingly, three identified alleles harbored insertion/frameshifts that resulted in non-productive coding sequences (Table S4, Fig. S1). IMGT annotated framework (FR) regions contained 18 of the 32 AA substitutions, and CDR regions contained 14 AA substitutions. Substitutions in the CDR regions were evenly distributed between CDR1, CDR2, and CDR3, with FR substitutions also evenly distributed between FR1, FR2, and FR3(Table S4).
We also observed high levels of heterozygosity across genes within the 16 individuals characterized. Excluding samples affected by V(D)J recombination, we found that the percent of IGLV genes at which heterozygosity was observed ranged from 22.2% to 44% (mean=33.1%) across individuals (Fig. 1A, 1B). Interestingly, seven IGLV genes were heterozygous in more than half of the individuals; specifically, the genes IGLV10-54, IGLV5-45, and IGLV2-23 were heterozygous in ≥ 12 individuals (Fig. 1C). Conversely, five IGLV genes were homozygous across all individuals (IGLV3-9, IGLV3-27, IGLV3-32, IGLV1-40, and IGLV9-49), with IGLV1-40 being notable as it is utilized at high frequencies in naive repertoires(11,25).
Structural Variation Across the IGLV Region
The IGLV region contains a single known V gene-containing SV(7,13). This SV spans a 9.1 KB region and includes the functional gene IGLV5-39, and two pseudogenes, IGLVII-41-1 and IGLV1-41(7,13,26). Notably, this region is deleted in the GRCh38 reference. Nine of the 16 individuals in our cohort carried at least one copy of the IGLV5-39 SV (Table S5). Every haplotype with the IGLV5-39 SV contained the same allele, IGLV5-39*01. The insertion breakpoints are consistent between individuals, with every individual examined demonstrating the same breakpoint (Fig. S2). To determine if the SV was due to non-allelic homologous recombination (NAHR) we searched for homologous sequence within a 1 KB region flanking the insertion breakpoints in individuals with the insertion, but did not identify putative homologous content. The lack of homology in the sequence surrounding the SV breakpoints suggests that NAHR is not the primary mechanism of IGLV5-39 deletion. Investigating the SV breakpoints reveals a 181 bp stretch of 87.3% A/T content that is 240 bp from the proximal breakpoint (Fig. S3). As suggested by others previously, it is possible that AT-rich regions can cause replication fork slippage, which may have resulted in the IGLV5-39 deletion haplotype(27,28).
To further investigate the origin of the IGLV5-39 SV, we compared the SV sequence to the 60 KB region surrounding the insertion breakpoint and found two proximal regions and one distal region with ~6.7 KB of sequence homology surrounding the genes IGLV5-35, IGLV5-45, and IGLV5-48 (86%, 87%, and 86% sequence similarity, respectively). Additionally, there is a ~14-20 KB region surrounding IGLV5-35, IGLV5-45, and IGLV5-48 that have 84-88% sequence similarity (Fig. S4). These regions surrounding IGLV5-35, IGLV5-45, and IGLV5-48 have been identified as an ancient segmental duplication event in GRCh38 by the tool SDquest(29) (Fig. S5). Using SDquest on the NA19240 haplotype containing the IGLV5-39 insertion, the insertion, surrounding region, and IGLV5-35, IGLV5-45, and IGLV5-48 were identified as ancient segmental duplications(Fig. S5). Segmental duplications are common in the IG loci and have been reported in IGL(3,6). It is likely that the IGLV5-39 SV is the result of an ancient segmental duplication involving the region surrounding IGLV5-45 and/or IGLV5-49 and was later lost, potentially due to replication fork slippage.
Additional SVs were observed in our cohort, including two separate ~6 KB LINE-insertion haplotypes, the first adjacent to IGLV1-47 and the second adjacent to IGLV4-60 and IGLV8-61 (Table S6). Further, we observed multiple regions with small SVs ranging in size from 175-320 bp (Table S7, Fig. 6S A-D). While some of these SVs are adjacent to functional genes, none are predicted to interfere with functionality.
Single Nucleotide Variation across the IGLV region and within IGLV genes
We then characterized SNVs across the assemblies by identifying nucleotide differences relative to the GRCh38 reference spanning the 920 KB IGL region (chr22:22,024,092-22,944,092). To better characterize variant calls, the IGL region was further divided into three regions, IGLV (chr22:22024092-22886736, 862,644 bp), a 594,267 bp subset of the IGLV region, deemed IGLV IG-only, which excludes non-IG gene regions (Fig. 2A), and IGLC (chr22:22886736-22944092, 57,356 bp) which was excluded from this analysis as we chose to focus specifically on the IGLV region. These non-IG gene regions are devoid of IGLV genes, and harbor the genes VPREB1, ZNF208A, ZNF208B, and GGTLC2. Three of the 16 individuals (NA12878, NA10831, HG02061) had moderate or significant V(D)J recombination associated artifacts affecting one or both haplotypes and were excluded from SNV analysis. NA18555 had the highest number of SNVs, with 3078 across the IGL locus, 1800 of which were within IGLV IG-only regions. HG01258 had the lowest SNV count with 1351 across the IGL locus and 868 in IGLV IG-only regions (Table S8). The majority of SNV positions were heterozygous in each sample, with a mean of 65.4% (54.3% for HG02059 and 80% for NA18517); remaining positions were homozygous for an alternative nucleotide (Table S8). Across the 13 samples, there were 7599 unique SNVs in IGLV with 4298 unique SNVs in IG-only regions. Of the 4298 unique IG-only SNVs, 939 (21.8%) were not found in the dbSNP “common” database, while 124 appeared to be completely novel and not found in any dbSNP database (Table S9).
We calculated the density of SNVs per 1 KB in the IGLV IG-only regions and compared this to the average background SNV density of chr22 within a 22.9 MB window (q11.21-q13.31). We chose this region specifically to represent basal chromosomal variance background, as it contained no assembly gaps in the GRCh38 reference assembly. Nine of the 13 samples had a higher density of SNVs in the IGLV IG-only region when compared to the chr 22 background (Fig. 2B). We further examined SNV density in the components of V gene segments, including leader part 1 (LP1), introns, V exons, and recombination signal sequences (RSS). Interestingly, LP1, introns, V exons, and RSS sequences on average have higher density of SNVs when compared to both the chr22 background and the IGLV intergenic space (Fig. 2C). On average, SNV density in gene components was 43% higher than IGLV intergenic sequence, and 70% higher than the chromosome 22 background.
IGLC Haplotype Diversity
The IGL constant region of the GRCh38 reference(6) contains seven J and C gene pairs (Fig. 3A). IGLJ genes are 38 bp in length, while IGLC genes vary from 304 bp to 322 bp; with all functional C genes being 318 bp. Each J and C gene pair is part of a larger cassette, referred to here as an IGLJ-C cassette. Cassettes IGLJ-C1, IGLJ-C2, IGLJ-C3, and IGLJ-C7 are functional, whereas IGLJ-C4, IGLJ-C5, and IGLJ-C6 are nonfunctional. Both IGLJ-C4 and IGLJ-C5 contain deletions that disrupt the RSS site of their respective J genes while IGLC5 contains an 11 bp deletion, which renders it nonfunctional. IGLJ-C6 is able to participate in recombination, but the IGLC6 gene contains a premature stop codon, resulting in a truncated protein. Each cassette breakpoint is flanked by homologous sequence, a characteristic feature of NAHR. IGLJ-C1, IGLJ-C2, and IGLJ-C3 cassettes range in size from 5300 bp to 5900 bp, while IGLJ-C4 through IGLJ-C7 are 3200 bp to 3900 bp. The GRCh38 reference contains one copy of each IGLJ-C cassette (Fig. 3A).
Previous studies have reported duplications of the IGLJ-C2 and IGLJ-C3 gene regions(13), with duplication cassettes having been individually Sanger sequenced using “gene walking” methodologies(14). Analysis of the IGLC haplotypes resolved from our cohort revealed significant variation in this region. Our haplotypes included one to three copies of the IGLJ-C3 cassette, which was further associated with ~5.4 KB and ~10.7 KB of additional sequence (Fig. 3B). Eight of the 22 IGLC haplotypes carried a IGLJ-C3 duplication. Considering both IGLJ-C3 IGLJ/IGLC allelic variants and copy number, we characterized nine unique IGLJ-C3 haplotypes, four with a single IGLJ-C3 copy and four with multiple IGLJ-C3 copies. All IGLJ-C3 duplication haplotypes had >1 unique IGLC3 allele containing one or more nonsynonymous nucleotide variants. Notably, NA18956 carried two different duplication haplotypes, including a single copy of the IGLC3*04 allele, and two copies each of the novel alleles, IGLC3*03-Novel and IGLC3*04-Novel (Fig. 3B). Across all resolved IGLJ-C3 cassettes, we observed two IGLJ3 alleles and four IGLC3 alleles, two of which were novel. Additionally, we identified two functional novel IGLC1 alleles harboring nonsynonymous substitutions, with one allele appearing in three individuals. Two functional novel IGLC2 alleles were also identified, one of which harbored a nonsynonymous substitution and the other matching the sequence of IGLC3*04. Finally, we identified five novel pseudogene alleles, one in IGLC4, two in IGLC5, and two in IGLC6. Overall we identified 11 novel IGLC alleles, six in functional genes and five in pseudogenes; this total represents a 50% increase in total IGLC alleles known to date.
We compared the similarity of each resolved IGLJ-C cassette sequence using multiple sequence alignment and phylogenetic analyses(30)(Fig. 3C & 3D). The phylogenetic tree in Fig. 3C revealed that the IGLJ-C2 and IGLJ-C3 cassettes are most similar, sharing ~85-90% sequence identity over ~5.5 Kb. While all IGLJ-C3 duplication blocks in each haplotype are highly similar (>98%)(Fig. 3D), there were two sub-clades identified within the larger IGLJ-C3 clade (Fig. S7). The IGLJ-C3 cassettes that are adjacent to a IGLJ-C2 cassette (proximal) form a sub-clade, while the distal IGLJ-C3 cassettes bordering the IGLJ-C4 cassette form a separate sub-clade. IGLJ-C3 cassettes within the proximal clade differ from one another by only 0-6 nt (>99.89%) across 5.6 Kb. IGLJ-C3 cassettes within the distal clade differ by 0-33 nt (>99.38%). This is in contrast to differences observed between members of the two sub-clades, which differ by 62-85 nt (98.03-98.84%)(Fig. S7). Beyond sequence identity, both IGLJ-C2 and IGLJ-C3 cassettes have more nucleotide diversity than the others. When comparing sequence similarity within each cassette group (e.g., comparing between only IGLJ-C2 cassettes), IGLJ-C2 and IGLJ-C3 have the largest disparity within cassette groups at 98.722% and 98.025%, respectively. Comparatively, all other cassettes have >99% sequence identity within their respective groups (Table S10).
Further, the IGLJ-C2 and IGLJ-C3 cassette groups were observed to have high sequence homology with each other at ~85-90%. Specifically, a ~2 KB high homology region stretching from 150 bp proximal of IGLJ2/3 to 190 bp distal of IGLC2/3 are >99% similar between the two cassette groups (Fig. S8). Due to the high degree of sequence similarity, it is possible that the IGLJ-C2/3 duplications are a result of NAHR(31), which has been reported in other IG regions harboring segmental duplications(32). When we applied phylogenetic analyses to this homology region, we found that homologous segments of similar duplication haplotype and location would primarily cluster together (Fig. S9). Additionally, elevated GC content is observed in this homology region (average 57.8%) that is significantly higher than the average for Chr22 (47%) and genomic average (41.9%)(33). High GC content is often a signature of repeated gene conversions due to GC-biased repair of A:C and G:T mismatches(34). Taken together, these findings suggest that a combination of NAHR and gene conversion may maintain the high homology between IGLJ-C2 and IGLJ-C3.
IGL SNV 1KGP comparison
Previously, Rodriguez et al., have demonstrated the accuracy of long-read assemblies in IG regions and highlighted discrepancies between SNVs called from long-read assemblies of these regions and those from short-read datasets, such as the 1KGP 30x high-coverage SNV call sets. To evaluate our data as compared to matched high-coverage SNV call sets, we compared capture and hybrid assembly IGL SNVs of the 13 individuals under examination to their respective 1KGP 30x high-coverage SNV call sets (Table 2-4)(15). Non-IG regions were masked for the IGLV comparisons and only regions which were covered by capture or hybrid assemblies were considered. Overall, 1KGP SNV calls differed greatly from capture and hybrid assembly calls. The highly accurate hybrid assemblies called between 87 (NA12156) and 145 (NA18555) unique SNVs in IG-only regions that were not present in their respective 1KGP datasets. Additionally, the 1KGP dataset falsely called between 27 (NA19129) and 433 (NA19240) SNVs in these regions. Analysis of capture-based assemblies revealed similar discrepancies with the 1KGP short-read based variant call sets. Capture assemblies had between 27 (NA18515) and 103 (HG02572) SNVs not found in the 1KGP dataset, while the 1KGP dataset called between 31 (NA18508) and 532 (HG01258) SNVs not present in the capture assemblies (Table 2). Notably, six of the eight individuals with >190 1KGP-unique SNVs carry the 9.1 KB IGLV5-39 insertion. Indeed, a large fraction of 1KGP-unique SNVs fell within and around the 14-20 KB homology regions spanning IGLV5-45, IGLV5-48, and the IGLV5-39 insertion regions annotated in Fig. S4. When we assessed the locations of false-positive SNVs in the 1KGP datasets, we found that each individual had at least one V exon harboring a false-positive SNV. Overall, 11 genes were affected across the 13 individuals. IGLV5-48 and IGLV5-45 have the highest number of erroneously called SNVs with 27 and 21, respectively; false SNVs in these two genes were observed in 10 and 7 individuals, respectively (Table 3). Additionally, IGLV5-45 and IGLV5-48 harbored 17 false-negatives, where the 1KGP datasets were unable to determine the true haplotype.
For the IGLC region, we compared SNV calls from hybrid assemblies to the 1KGP datasets for the same individuals. The discrepancy between the datasets was more prominent in the IGLC region, in which the false-negative and -positive calls comprised a mean of 19.4% and 23.6% of genotypes in the 1KGP dataset (Table 4). This elevated false-positive and -negative rate is likely due to the highly repetitive and structurally variable nature of the IGLC region (Fig. 3B & 3D). In addition, NA18555, NA18956, and NA12156 contain IGLJ-C3 duplications in the IGLC region which are not present in the GRCh38 reference assembly. These results suggest that the lack of a representative reference containing multiple IGLJ-C segmental duplications results in an abundance of miscalled 1KGP SNVs.