Patients and samples
The AA subjects were recruited from the department of general medicine, SMS Medical College and Hospital, Jaipur during 2016–’20 with ethics approval from the institutional ethics committee and Indian Council of Medical Research (ICMR)/Department of Health Research (DHR), Government of India, New Delhi. All methods were carried out in accordance with relevant guidelines and regulations. An informed consent was obtained from all the patients after fully explaining them about the study and process. The patients’ 2 ml blood sample was drawn through the peripheral vein for WES. The standard management procedures for collecting AA samples were judiciously followed with ethics clearance wherever stated. A total of 36 AA subject samples(21 male/ 15 female) and 41 healthy controls (25 male/16 female) with a mean age of 32 across two cohorts have been sequenced. Those subjects who responded to CsA and Danazol treatment were considered as CsA responders while those who underwent blood transfusion were non-responders ( Supplementary Table 1). The treatment regimen ranged from 6 months to 1 year on follow-up cases. All raw data is analysed in-house using our local 1 TB RAM/ 64 processors server.
Exome capture and sequencing
The WES was performed on 36 subjects using QIAamp® DNA Blood Mini Kit (Cat No :51104). The quality check (QC) of the genomic DNA (gDNA) was done using Qubit® 2.0 Fluorometer followed by agarose gel electrophoresis respectively. Briefly, we used 200 ng of gDNA to generate 300 bp to 350 bp fragments and performed end repair, adapter ligation and amplification, and adapter ligated DNA was hybridized using Agilent V5 + UTRs chemistry (Human All Exon 75 Mb kit) with paired end reads and an approximate 110× depth of coverage was obtained resulting in 6–8 GB of data per sample.
Quality control (QC) and variant calling
We obtained significant reads for 33 samples, as the samples were checked for QC following which human genome reference (hg38) was used to align through bowtie2 incorporated in an in-house pipeline for downstream analyses11. The VarScan prediction tool was employed to check variants filtered for false discovery rate (FDR). The quality assessment was done for all the samples using FastQC with raw reads checked for quality, GC bias and duplication levels. After bowtie2 mapping with hg38 reference, variant calling was performed using VarScan with mutations counted as heterozygous (“het”) through awk/bash scripts. The VarScan somatic command was used with the mpileup option meeting the minimum coverage of 10x to identify possible somatic variants. In this process, their corresponding genotypes for the samples were checked to infer the somatic status. Further predictions identifying “deleterious” mutations were screened for Sanger validation.
Telomere length analysis and Validation of SNPs
To check whether or not the AA subjects have shortened telomere length, we used Scicell’s Absolute Human Telomere Length Quantification qPCR Assay Kit (AHTLQ: Catalog #8918). A single copy reference (SCR) primer set was used to amplify a 100 bp-long region on human chromosome 17, and served as a reference for data normalization. The primer sets were validated by qPCR and the gel electrophoresis was run to check the amplification efficiency. All SNPs were mapped to the publicly available databases and the variant effect predictor (VEP) Ensembl suite was used to detect the RefSeq (rs) ids with for all the SNPs that have minor allele frequency (MAF) cutoff 0.05. Based on the WES data, the shortlisted variants were cross-checked using downstream validation databases, viz. Varsome, CADD, GERP scores and finally the common variants along with those select variants were visualized using Integrated Genome Viewer (IGV) browser and a set of SNPs that exhibited significant association with AA were validated using Sanger sequencing (Supplementary information).
Statistical analyses and population stratification
All statistical tests were done keeping the sample relationship checks. The bcf/vcf files with a MAF < = 0.05 and a minimum DP > = 5 were used for mapping the pathogenic variants. We had multiple instances of checking this wherein we first calculated the mean number of heterozygous rare (MAF < = 0.5%) variants observed in the entire datasets and then calculated the mean/variance of the means for these genes to tabulate.VerifyBamID was used to infer whether or not the reads are contaminated across the samples12. For developing an interaction network, we used Cytoscape Cytohubba tool to infer the clustering coefficient.30