Tag SNP selection for improved genome-wide coverage
In JPA NEO, our updated version of the Japonica Array, we used the maximum number of markers that fit on a single array of the Axiom 96-array layout. The total of nearly 670,000 markers was divided into about 650,000 tag SNPs and tens of thousands of disease-related markers. The selection process for JPA NEO is essentially the same as that for previous versions of the Japonica Array. However, we selected these markers by using the latest version of our genome reference panel, which contains the genomes of 3,552 Japanese individuals (3.5KJPNv2) [6], about three times as many as in the panel used for the previous versions (JPAv1 and JPAv2). Of note, while the previous two versions of the Japonica Array used mutual information for tag SNP selection [19], in JPA NEO we changed the tag SNP selection method to one based on the standard protocol using pairwise r² [9] (Table 1). This has the advantage of allowing us to harmonize our data with those of other studies, which we consider of great importance for performing meta-analyses with other large-scale GWAS that utilize the same concept. A comparison of the design of JPA NEO with those of JPAv1 and JPAv2 is summarized in Table 1.
To optimize the selection of tag SNPs, we first selected tag SNPs from chromosome 10 of the 3.5KJPNv2 reference panel by using a greedy pairwise algorithm [9] with different combinations of two thresholds: MAF (≥ 0.005, ≥ 0.01, or ≥ 0.05) and pairwise r² as the LD measure (r² ≥ 0.5 or ≥ 0.8). Two metrics were used to evaluate tag SNP performance: 1) genomic coverage, which is the proportion of untyped sites that have at least one tag SNP with r² greater than a given threshold; and 2) the number of variants obtained by genotype imputation with an INFO score, an index of imputation accuracy, above a given threshold. When tag SNPs were selected by pairwise r² ≥ 0.8 and MAF ≥ 0.01, the genomic coverage with r² ≥ 0.8 and the number of variants imputed from the 2KJPN reference panel (2,049 Japanese genomes) with INFO >0.9 were better than or comparable to those of JPAv2 and Infinium Omni2.5-8 (Fig. 1). Based on these results, we decided to select tag SNPs with pairwise r² ≥0.8 and MAF ≥0.01 from the target set of autosomes and the X chromosome. For the design of JPA NEO, a substantial number (more than 1,000) of sex-chromosome SNPs in the two pseudoautosomal regions were newly selected, whereas only about 10 SNPs in these regions were available in JPAv1 and JPAv2.
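As an illustration only, the greedy pairwise selection and the genomic coverage metric described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the pipeline used for JPA NEO: r² is approximated here as the squared Pearson correlation of unphased 0/1/2 genotype dosages, each new tag is chosen among the still-untagged sites, and all function names and the toy genotypes are hypothetical.

```python
from itertools import combinations


def maf(genotypes):
    """Minor allele frequency from 0/1/2 alternate-allele dosages."""
    freq = sum(genotypes) / (2 * len(genotypes))
    return min(freq, 1 - freq)


def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (assumed LD proxy)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic site carries no LD information
    return sxy * sxy / (sxx * syy)


def greedy_tag_snps(genotypes, min_maf=0.01, min_r2=0.8):
    """Greedily pick tags until every site passing the MAF filter is captured."""
    candidates = [s for s, g in genotypes.items() if maf(g) >= min_maf]
    # captures[s] = sites (including s itself) in LD with s at r2 >= min_r2
    captures = {s: {s} for s in candidates}
    for a, b in combinations(candidates, 2):
        if r_squared(genotypes[a], genotypes[b]) >= min_r2:
            captures[a].add(b)
            captures[b].add(a)
    untagged = set(candidates)
    tags = []
    while untagged:
        # Choose the untagged site that captures the most untagged sites;
        # sorting makes tie-breaking deterministic.
        best = max(sorted(untagged), key=lambda s: len(captures[s] & untagged))
        tags.append(best)
        untagged -= captures[best]
    return tags


def genomic_coverage(genotypes, tags, min_r2=0.8):
    """Proportion of untyped sites with at least one tag at r2 >= min_r2."""
    tag_set = set(tags)
    untyped = [s for s in genotypes if s not in tag_set]
    if not untyped:
        return 1.0
    covered = sum(
        any(r_squared(genotypes[s], genotypes[t]) >= min_r2 for t in tags)
        for s in untyped
    )
    return covered / len(untyped)


# Toy data: six individuals, five hypothetical sites.
toy = {
    "snp_a": [0, 1, 2, 0, 1, 2],
    "snp_b": [0, 1, 2, 0, 1, 2],  # identical to snp_a
    "snp_c": [2, 1, 0, 2, 1, 0],  # perfectly anti-correlated with snp_a
    "snp_d": [0, 0, 1, 2, 2, 1],  # uncorrelated with snp_a
    "snp_e": [0, 0, 0, 0, 0, 0],  # monomorphic, removed by the MAF filter
}
toy_tags = greedy_tag_snps(toy)
toy_coverage = genomic_coverage(toy, toy_tags)
```

In this toy example two tags capture the four polymorphic sites, while the monomorphic site remains uncovered, so the coverage over all five sites is below 1.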
We also selected Y chromosomal markers from those used for the Y haplogroup classification of the International Society of Genetic Genealogy [20] and from those in JPAv1 and JPAv2, which had been selected using pre-existing Axiom arrays for Asian populations. Mitochondrial markers were extracted mainly from 3.5KJPNv2 by removing those with MAF <0.5% as well as those with multiple alleles. Most markers corresponding to the HLA and KIR regions were carried over from those adopted for JPAv1 and JPAv2.
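The mitochondrial filter described above (removal of markers with MAF <0.5% and of multiallelic sites) could be expressed roughly as below. The record layout, the field names, and the site identifiers are hypothetical and serve only to illustrate the rule.

```python
def filter_biallelic_by_maf(variants, min_maf=0.005):
    """Keep biallelic variants whose minor allele frequency is at least min_maf.

    Each variant is a dict with a hypothetical layout:
    {"id": str, "alt_freqs": [frequency of each alternate allele]}.
    """
    kept = []
    for variant in variants:
        alt_freqs = variant["alt_freqs"]
        if len(alt_freqs) != 1:
            continue  # more than one alternate allele: multiallelic site
        minor = min(alt_freqs[0], 1 - alt_freqs[0])
        if minor < min_maf:
            continue  # too rare in the reference panel
        kept.append(variant["id"])
    return kept


# Toy records; identifiers and frequencies are invented for illustration.
toy_variants = [
    {"id": "mt_site_1", "alt_freqs": [0.20]},
    {"id": "mt_site_2", "alt_freqs": [0.003]},       # MAF below 0.5%
    {"id": "mt_site_3", "alt_freqs": [0.10, 0.02]},  # multiallelic
    {"id": "mt_site_4", "alt_freqs": [0.998]},       # reference allele is the minor one
    {"id": "mt_site_5", "alt_freqs": [0.005]},       # exactly at the threshold, kept
]
kept_ids = filter_biallelic_by_maf(toy_variants)
```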
Selection of disease-related SNPs based on published evidence
For the selection of disease-related markers, we picked approximately 9,000 SNPs present in the Japanese population, mainly from among published lists of disease genes and GWAS-identified risk variants. The former includes known and candidate functional variants on gene lists from the American College of Medical Genetics and Genomics [21] and 1,866 pharmacogenomics markers in 38 genes, 18 of which were obtained from drug guidelines published by the Clinical Pharmacogenetics Implementation Consortium as of April 2020 [22]. The latter includes published risk variants for various complex diseases identified by GWAS of the Japanese population and a meta-analysis of East Asian populations. Representative examples are shown in a supplementary table [see Additional file 1], which includes 99 markers (96 genes) of type 2 diabetes [23], 100 markers (94 genes) of lipid metabolism, 45 markers (35 genes) of obesity, as well as 12 markers (7 genes) and 33 markers (24 genes) of late-onset Alzheimer’s disease identified by GWAS of the Japanese population and meta-analyses of European populations, respectively.
Moreover, approximately 13,000 and 12,000 markers were selected from the NHGRI GWAS catalog [24] and the UK Biobank Array [14], respectively. We used the 3.5KJPNv2 reference panel to extract the markers present in the Japanese population. The resulting novel Axiom SNP array specific to the Japanese population was developed as JPA NEO.
Japonica Array NEO has genome-wide coverage and contains disease risk SNPs
The developed JPA NEO contains a total of 666,883 SNPs; the number of markers in each category is shown in Table 2 in comparison with JPAv1/JPAv2. In JPA NEO, tag SNPs from autosomes and the X chromosome account for approximately 98% of the markers (>650,000 SNPs). In addition, nearly 8,500 SNPs from the Y chromosome (779 markers), mitochondria (409 markers), and the HLA and KIR regions (6,757 and 532 markers, respectively) were included to realize genome-wide coverage and genotyping of specific functional variants.
Although there is some overlap with the above SNPs, a total of 28,298 disease-related SNPs in 12 disease categories and pharmacogenomics are included as well (Table 3). These SNPs include risk alleles for complex diseases, including dementia, depression, and autism spectrum disorder among psycho-neurologic diseases (5,556 markers), type 2 diabetes and hyperlipidemia among metabolic diseases (2,948 markers), and asthma and atopic dermatitis among immunological diseases (6,426 markers). In addition, variants related to physical traits (height, blood protein levels, etc.), expression quantitative trait loci, and so on are categorized as ‘others.’
Of note, 3,472 markers (0.52%) in JPA NEO had MAF < 1% in 3.5KJPNv2 [see the supplementary table in Additional file 2]. This is due to the adoption of some disease-related markers regardless of their MAF in 3.5KJPNv2. We have compiled the full list of disease-related markers with keywords and disease categories as a supplementary table [see Additional file 3], which can be downloaded from the jMorp website [25].
High imputation performance of Japonica Array NEO
We changed the tag SNP selection for JPA NEO relative to the previous versions with the aim of improving the imputation coverage of the microarray. To verify this point, we compared the performance of JPA NEO with that of JPAv2 by genotyping the same 286 samples, none of which were included in the 3.5KJPNv2 reference panel, with both arrays. We found that the median call rates of JPAv2 and JPA NEO for all markers per sample were more than 99.6% and 99.8%, respectively [see the supplementary table in Additional file 4], indicating that the call rate of JPA NEO is slightly better than that of JPAv2.
More than 99% of markers were polymorphic in both JPAv2 and JPA NEO, as we intended (Table 4). Some microarrays are designed to cover a wide range of ethnicities, in contrast to the aim and scope of our Japanese-specific arrays. We hypothesized that such microarrays may perform less well than ethnic-specific ones. To address this point, we compared the performance of JPAv2 and the Infinium Asian Screening Array (ASA), which covers a wide range of Asian populations including the Japanese, by using the genomes of 191 Japanese individuals in the TMM cohorts. We found that more than 17% of markers were monomorphic in the ASA, whereas >99% were polymorphic in JPAv2 (Table 4), with a median call rate of >99% for both arrays [see the supplementary table in Additional file 4]. This observation supports our contention that ethnic-specific microarrays are critical for analyzing each ethnic population.
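For readers who wish to reproduce this kind of summary on their own data, the two quantities used above (median per-sample call rate and the fraction of monomorphic markers) can be computed along the following lines. This is a sketch under assumptions: genotypes are coded as 0/1/2 alternate-allele counts with None for a no-call, a marker is treated as monomorphic when its minor allele count among called genotypes is zero, and the sample names and values are toy data.

```python
from statistics import median


def sample_call_rates(calls):
    """Per-sample call rate: called genotypes divided by all markers."""
    return {
        sample: sum(g is not None for g in gts) / len(gts)
        for sample, gts in calls.items()
    }


def monomorphic_fraction(calls):
    """Fraction of markers showing a single allele among called genotypes."""
    markers = list(zip(*calls.values()))  # transpose to per-marker tuples
    n_mono = 0
    for gts in markers:
        called = [g for g in gts if g is not None]
        alt_count = sum(called)
        # Only the reference allele, or only the alternate allele, was seen.
        # A marker with no calls at all is also counted as monomorphic here.
        if alt_count == 0 or alt_count == 2 * len(called):
            n_mono += 1
    return n_mono / len(markers)


# Toy genotype calls: four samples by five markers.
toy_calls = {
    "s1": [0, 1, 2, 0, None],
    "s2": [0, 1, 1, 0, 2],
    "s3": [0, 2, 0, 0, 2],
    "s4": [0, None, 1, 0, 2],
}
toy_rates = sample_call_rates(toy_calls)
toy_median_rate = median(toy_rates.values())
toy_mono = monomorphic_fraction(toy_calls)
```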
When we closely inspected the MAF distribution of JPA NEO in comparison with that of JPAv2, we noticed that JPAv2 contained fewer markers with MAF of 15%–25% than JPA NEO (Fig. 2). We envisage that this may be due to the method previously used for selecting tag SNPs; our new selection method has substantially improved the marker distribution in this MAF range.
We performed genotype imputation of autosomes by using the haplotype reference panel of 3.5KJPNv2 and evaluated the imputation accuracy according to two metrics, imputation quality r² and INFO score. The mean r² and INFO score were more than 0.9 and 0.8, respectively, in the MAF bin of >2.5%–5% for both arrays (Fig. 3), indicating reliable imputation accuracy for both JPAv2 and JPA NEO. Importantly, however, we also noticed a marked decrease in mean r² in the region above MAF 20% in JPAv2. Although the precise reason for this decrease remains to be clarified, the decrease is no longer observed in JPA NEO.
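A per-bin summary of imputation quality such as the one described above could be computed as in the sketch below. Several points are assumptions rather than details given in the text: r² is taken as the squared Pearson correlation between imputed dosages and known genotypes at each marker, the bin edges are hypothetical, and the marker tuples are toy values.

```python
from collections import defaultdict

# Hypothetical upper edges of half-open MAF bins (lower, upper].
MAF_BIN_EDGES = [0.01, 0.025, 0.05, 0.1, 0.2, 0.5]


def maf_bin(maf, edges=MAF_BIN_EDGES):
    """Return a label such as '>0.025-0.05' for the bin containing maf."""
    lower = 0.0
    for upper in edges:
        if maf <= upper:
            return f">{lower:g}-{upper:g}"
        lower = upper
    raise ValueError(f"MAF {maf} exceeds the largest bin edge")


def dosage_r2(true_genotypes, dosages):
    """Squared Pearson correlation between true genotypes and imputed dosages."""
    n = len(true_genotypes)
    mx, my = sum(true_genotypes) / n, sum(dosages) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(true_genotypes, dosages))
    sxx = sum((a - mx) ** 2 for a in true_genotypes)
    syy = sum((b - my) ** 2 for b in dosages)
    if sxx == 0 or syy == 0:
        return 0.0  # undefined correlation is reported as zero here
    return sxy * sxy / (sxx * syy)


def mean_r2_by_maf_bin(markers):
    """markers: iterable of (reference MAF, true genotypes, imputed dosages)."""
    per_bin = defaultdict(list)
    for maf, truth, dosages in markers:
        per_bin[maf_bin(maf)].append(dosage_r2(truth, dosages))
    return {label: sum(vals) / len(vals) for label, vals in per_bin.items()}


# Toy markers: two in the >0.025-0.05 bin, one in the >0.2-0.5 bin.
toy_markers = [
    (0.03, [0, 0, 1, 0], [0.0, 0.0, 1.0, 0.0]),  # perfectly imputed
    (0.04, [0, 1, 0, 1], [0.0, 0.0, 1.0, 1.0]),  # uncorrelated with the truth
    (0.30, [0, 1, 2, 1], [0.0, 1.0, 2.0, 1.0]),  # perfectly imputed
]
toy_mean_r2 = mean_r2_by_maf_bin(toy_markers)
```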
As shown in Table 5, slightly but clearly more imputed markers with INFO >0.8 were obtained from genotyping data by JPA NEO than by JPAv2, especially those with MAF <1% (1.08-fold). We found that a total of >12 million markers were imputed in the small-scale analyses of the two arrays. These results indicate that while both JPA NEO and JPAv2 provide sufficient power for genotyping the Japanese population and for subsequent genotype imputation, JPA NEO shows better imputation performance without apparent bias across MAF bins. Thus, we conclude that JPA NEO is the most reliable imputation array developed to date for the Japanese population.
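The fold comparison reported above amounts to counting, for each array, the imputed markers that pass the INFO threshold within a MAF class and taking the ratio. A minimal sketch follows; the function names are hypothetical and the marker lists are toy values, not the data behind Table 5.

```python
def count_passing(markers, info_threshold=0.8, maf_upper=None):
    """Count (maf, info) pairs with INFO above the threshold.

    If maf_upper is given, only markers with MAF below it are counted.
    """
    return sum(
        1
        for maf, info in markers
        if info > info_threshold and (maf_upper is None or maf < maf_upper)
    )


def fold_change(markers_new, markers_old, **kwargs):
    """Ratio of passing-marker counts, new array over old array."""
    old = count_passing(markers_old, **kwargs)
    if old == 0:
        raise ValueError("no passing markers in the comparison array")
    return count_passing(markers_new, **kwargs) / old


# Toy (MAF, INFO) pairs for two hypothetical arrays.
toy_new = [(0.005, 0.85), (0.008, 0.90), (0.004, 0.81), (0.20, 0.99)]
toy_old = [(0.005, 0.85), (0.008, 0.70), (0.004, 0.81), (0.20, 0.99)]
toy_rare_fold = fold_change(toy_new, toy_old, maf_upper=0.01)
toy_all_fold = fold_change(toy_new, toy_old)
```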
Large-scale genotyping by Japonica Arrays in the TMM project
To establish a solid research infrastructure for genomic medicine in Japan, the TMM project aimed to generate as much genotype data as possible from its 150,000 participants. To this end, we have been genotyping TMM cohort participants using the Japonica Array since 2014. To complete such large-scale genotyping efficiently, we established an elaborate three-group system that covers sample selection through genotyping and connects to data qualification.
We prepared our own dedicated workflow for the ToMMo analysis, which ensures efficient and reliable sample processing and supports high-throughput measurement (Fig. 4). The first step is preparing target sample lists, each containing thousands of participants selected for a specific purpose, such as the TMM CommCohort participants with respiratory function data. The selection of participants and the availability of DNA samples or biospecimens are supported by a laboratory information management system (LIMS) at the TMM biobank [26]. This step is conducted by the Center for Genome Platform Projects. The second step is extracting the DNA and dispensing it into 96-well plates. To distribute samples across individual plates in a well-ordered and standardized manner, the correspondence between sample identifier (ID) and well position is fixed by creating a plate map before the DNA samples are dispensed. This step is conducted by the Group of Biobank. The final step is transporting the DNA plates and plate maps to the genotyping facility attached to the TMM Biobank, which is operated using LIMS by the Group of Microarray-based Genotyping Analysis. For security control, different sample IDs were used for sample collection, storage, and analysis [27].
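To make the plate-map step concrete, the correspondence between sample IDs and well positions on 96-well plates can be generated before dispensing, for example as sketched below. The column-wise filling order, the plate and sample naming schemes, and the absence of control wells are all assumptions made for illustration; they are not taken from the ToMMo workflow.

```python
from string import ascii_uppercase

ROWS = ascii_uppercase[:8]  # rows A to H of a 96-well plate
COLUMNS = range(1, 13)      # columns 1 to 12


def build_plate_maps(sample_ids, plate_prefix="PLATE"):
    """Assign samples to wells, filling each plate column by column.

    Returns {plate name: {well position: sample ID}}.
    """
    if len(set(sample_ids)) != len(sample_ids):
        raise ValueError("duplicate sample ID in the target list")
    wells = [f"{row}{col:02d}" for col in COLUMNS for row in ROWS]
    plate_maps = {}
    for start in range(0, len(sample_ids), len(wells)):
        plate_name = f"{plate_prefix}{start // len(wells) + 1:04d}"
        chunk = sample_ids[start:start + len(wells)]
        # zip stops at the shorter input, so a partial last plate is fine.
        plate_maps[plate_name] = dict(zip(wells, chunk))
    return plate_maps


# Toy target list of 100 hypothetical analysis IDs: one full plate plus four wells.
toy_ids = [f"S{i:04d}" for i in range(1, 101)]
toy_plate_maps = build_plate_maps(toy_ids)
```

Creating the map from the target list in a fixed order means the same list always yields the same well assignment, which is one simple way to keep dispensing and downstream data tracking consistent.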
Capitalizing on this workflow, as of May 2020 we had obtained JPA data of approximately 130,000 participants that met the criteria of quality control (QC) analysis using control markers. The dataset comprises data from approximately 2,000 JPAv1, 101,000 JPAv2, and 27,000 JPA NEO arrays (Table 6). We have already analyzed more than 63,000 samples from the TMM CommCohort by using JPAv2, whereas the TMM BirThree Cohort samples were analyzed by either JPAv2 or JPA NEO. With subsequent association analyses in mind, we are in the process of designing a strict protocol that would allow each family unit to be analyzed by the same JPA.
We have been using DNA samples obtained primarily from peripheral or cord blood. When samples from these sources were not available, mostly for the children of TMM BirThree Cohort participants, DNA from saliva samples was used and analyzed separately from blood-derived DNA. In our operation, the QC pass rate has been more than 99% for blood samples using both JPAv2 and JPA NEO. In contrast, that of saliva samples was approximately 90% using JPAv2, likely due to the presence of lower-quality samples. We believe that with this accomplishment, JPA NEO now has enough control data from the resident population to be an important and useful array for the entire Japanese population.