Characterizing a focused landscape of familial acute respiratory distress syndrome

Background Acute respiratory distress syndrome (ARDS) affects approximately 190,600 patients per year in the United States, with mortality up to 45%. ARDS can occur as primary disease due to various factors (e.g. bacterial or viral pneumonia, gastric aspiration, lung contusion, toxic inhalation, and near drowning) or as secondary disease due to sepsis, pancreatitis, severe trauma, massive blood transfusion, and burn. We hypothesized that ARDS-affected individuals have patterns of variants in their physiological repertoire that can be tracked and utilized to complement clinical diagnosis and/or clinical monitoring. Methods The goals of this study were to: (1) characterize the landscape of variants within protein coding but we also studied UTR regions in ARDS using an Exome sequencing approach; (2), determine the variations in signaling pathways across ARDS; and (3) use computational approaches to explore the functional consequences of ARDS. Towards this we assessed an ARDS-affected individual in the context of unaffected individuals from the same family as well as unrelated ARDS cases, in order to elucidate underlying inheritance patterns of “private variants”. Private variants consist of variants shared by ARDS cases but not found in unaffected individuals. Results Whole exome sequencing yielded 3,516 variants represented by 2,354 genes. Of these, 128 variants were shared across all ARDS cases. Of these, there were 24 unique variants represented by 9 ARDS genes shared by the primary ARDS-case and unrelated individuals with ARDS. The overall genes identified and subsequent analysis, demonstrate that there are important biological pathways that distinguish ARDS cases from or from non-ARDS. These pathways include: cell-to-cell signaling interaction, cell growth and proliferation and cell morphology. The data also show a coordinated effort amongst biological processes such as liver hyperproliferation and cell death that underlie the pathogenesis of ARDS. Conclusions These in-silico discoveries demonstrate a role for private variants shared by

ARDS cases to be leveraged as biomarkers for clinical diagnosis and/or monitoring of ARDS.
Background Acute respiratory distress syndrome (ARDS) is a syndrome of hypoxic respiratory failure characterized by diffuse pulmonary infiltrates and accumulation of protein-rich pulmonary edema that cause reduction in lung compliance alveolar collapse and ventilation-perfusion mismatch [1][2][3][4][5][6]. ARDS affects approximately 190,600 patients per year in the United States, with mortality up to 45% [7]. Despite improvements in intensive care during the last fifteen years, ARDS is still the major cause of mortality and morbidity in intensive care [1,2,[5][6][7]. In fact, ARDS therapy has seen limited progress since its initial description in 1967 and management is still largely supportive, with no established therapies targeted at the primary disease processes [8]. Accordingly, there is a need for methods of early detection [9]. There has been recent recognition of the clinical and biological heterogeneity within ARDS [10][11][12] that reflects our incomplete understanding of the biology of ARDS. Additional contributions to the knowledge about inheritance of ARDS and/or pathogenesis will be of great benefit in moving forward with successful clinical translation of new diagnostic, preventive, and therapeutic strategies [13][14][15][16].
ARDS occurs within one week of a known clinical insult or after worsening of respiratory symptoms. It is a consequence of various risk factors including direct (e.g., bacterial or viral pneumonia, gastric aspiration, lung contusion, toxic inhalation, and near drowning) or indirect (e.g., sepsis, pancreatitis, severe trauma, massive blood transfusion, and burn) lung injury [1][2][3][4][5][6][17][18][19]. There is little knowledge on the temporal relationship between detectable inflammatory changes and the onset of increased lung density, which is the current radiographic diagnostic marker for ARDS. A better understanding of these key temporal and topographic processes may contribute to advance diagnostic and prognostic biomarkers and effective therapies [7,[20][21][22]. Genome sequencing studies on ARDS have concentrated on identification of plasma biomarkers that may facilitate diagnosis of ARDS which could, in theory, improve clinical care, enhance our understanding of pathophysiology, and be used to enroll more homogeneous groups of patients in clinical trials of new therapies, increasing the likelihood of detecting a treatment effect [2,[23][24][25][26][27][28].
Most recently, exome sequencing studies have been reported for ARDS as part of an outgrowth of the NHLBI's Exome Sequencing Project [29,30]. A potential limitation of these studies is that they have rarely been conducted on families with sample collection across several generations. An advantage of exome sequencing is that it allows for the analysis of "private" gene variants-variants that may have arisen de novo in one individual or family and thus would not be detected in another [30]. We hypothesized that ARDS-affected individuals have patterns of variants expressed in their physiological repertoire that can be tracked and utilized to complement approaches to clinical diagnosis and/or clinical monitoring. To address this hypothesis, we utilized an exome sequencing approach which focuses on just the protein coding but also UTRs sequences in a given sample. Our data indicates that, there are unique variants and signaling pathways in ARDS cases which differ from those observed in unaffected individuals; and that the variant expression patterns observed in the familial cohort are markedly different from that of unrelated ARDS cases.

Results
Variants identified from a family with one ARDS case and unrelated ARDS cases. Figure 1 shows the workflow for data collection, data filtering, and data analysis utilized in the study. The family pedigree is shown in Figure 2. The individual that was the primary focus of the study is denoted as GP7. GP1, GP2, and GP6 were the youngest family members. Variants from each sample were identified and their frequency was calculated The primary case, GP7, comprised of 3,516 highly enriched variants that were represented   by 2,354 genes. A summary of the annotated functions from the full list of variants yielded   the following characteristics: 343 variants were exonic non-synonymous, 8 exonic   frameshift, 16 exonic non-frameshift, 356 exonic synonymous, 16 exonic unknown, 66 ncRNA exonic, 2 exonic stopgain, 118 in the 5'UTR region, 782 in the 3'UTR, 1343 were intronic/ncRNA intronic, and the remaining variants were categorized as having upstream, downstream or intergenic functions (Table 1).
Heatmaps were generated from a ranked frequency occurrence assessment from the list of variants found to be shared between GP7 and each of the GP family samples ( Figure 3A) or JF unrelated ARDS samples ( Figure 3B). Figure 3A illustrates the variant expression patterns and clustered relationships between GP7 and each of the GP family members.
Whereas, Figure 3B illustrates the variant expression patterns between GP7 and each of the ten unrelated ARDS cases (JF9-18). The red color in each of the heatmaps denotes a presence of the same variant as found in GP7, and the green color denotes absence of the variant. Compared to the GP7 the next highest frequency of variants found amongst the family cohort was in the GP1 sample, which represented one of the younger family members ( Figure 3A). This was accompanied by hierarchical clustering, which identified GP1 as the closest in variant expression pattern to GP7 ( Figure 3A).
Additional hierarchical clustering assessments showed that there were 2 clades within the family cohort. The first clade was comprised of GP7, GP1, and GP4. The second clade was comprised of GP3, GP6, GP8, GP5 and GP2. The clustering profiles indicate that the most similar sample to GP7 from the second clade was GP2 ( Figure 3A). Furthermore, compared to GP7, JF14 was the highest scoring sample in terms of frequency of variants present as compared to any of the other unrelated controls ( Table 2). Quantification of the variant occurrence frequency showed that JF14 was ~85.5% similar to GP7 (Table 2). Clustering profiles also demonstrated that JF14 was the most similar to GP7 in variant expression pattern, thereby it was clustered closest to GP7 ( Figure 3B). For the unrelated ARDS samples, JF10 had the lowest detected frequency of variants and was clustered to the farthest left of the heatmap with respect to GP7 (Table 2 and Figure 3B). The variant expression patterns show that JF10, JF16, and JF18 were the least similar to GP7, as depicted by the green colored regions denoting absence of the variant. As a result, these were all clustered to the farthest left ( Figure 3B). Additional comparisons of the GP7 case and the unrelated ARDS showed that there were 128 variants shared between GP7 and all unrelated ARDS, which were represented by 104 genes (Table 3). From this 104 genes, we found there were 27 variants represented by 10 genes shared amongst GP7 and ARDS cases. Of these, there were 24 unique variants represented by 9 genes (Table 4) not found in any of the other groups shown in Table 3. The remaining 3 variants were also found in group B and were represented by the MYH14 gene.
Prediction of ARDS-related biological pathways and functions from ARDS case enriched variants We extracted all pathway predictions for gene lists from groups (A), (B), (C) and (D) ( Table   3). These were then compiled to review the statistical outcomes and determine unique pathways (Additional file 1). Variants were assessed for presence or absence in group (D) vs (A), (C) vs (A), and (B) vs (A), (Additional file 2). The outcomes from each assessment was a list comprised of variants identified along with their function and exonic information where appropriate (Additional file 2). Next, using the filtered genes from the GP7 case, we

Emergence of clusters of orthologous genes across ARDS cases
A series of data mining experiments were performed using the DAVID database resource.
We conducted 3 separate experiments with genes from groups (B), (C), and (D). Genes from group (A) were excluded as these were proprietary and we did not have access to the raw file. We clustered genes from list (B) and (C) using the search parameters applied for high stringency similarity threshold (85%). We successfully matched 308 and 956 genes from gene lists (B) and (C), respectively, based upon the availability of experimental and curated information in the DAVID database. The genes from (B) were categorized into 44 clusters (Additional file 5). The genes from (C) were categorized into 92 clusters (Additional file 6). In addition, a significant majority of group (B) (i.e. 67%) and group (C) (i.e. 85%) was comprised of genes that had previously been experimentally verified and identified as having polymorphisms. For group (D), 104 genes were utilized for our initial database search and 101 genes were successfully matched to having DAVID ids.
Using the same stringency parameter as previously, the genes were categorized into 8 clusters (Additional file 7). The largest cluster for group (D) was comprised of 28% of the genes (29 genes out of 101). This cluster was annotated as having functions in transmembrane composition. Additional information from these output suggested that

Discussion
Previous genomic studies in ARDS have focused on characterization of subjects unrelated to each other in an effort to identify global genomic patterns and signatures that could be informative for diagnosis or prognosis [20,22,[29][30][31]. Numerous studies have suggested that there is a delicate relationship between the gene and local environment in ARDS pathogenesis. It has also been noted that there is a physiological balancing act that plays a role in this dynamic process [3,4,7,14,20,32]. The relationship is such that the following two events must occur to give rise to the disease: First, an individual must have a genetic propensity and second their microenvironment must undergo a struggle to execute an appropriate response to local insult such as trauma, sepsis, or infectious trigger [1][2][3][4][5][6][17][18][19]. It is the combinatorial effect from these two events that preferentially activates the disease. We hypothesized that ARDS-affected individuals have patterns of variants in their physiological repertoire that can be tracked, and then these would complement clinical diagnosis and/or clinical monitoring. The goals here were (1) characterize the landscape of variants within protein coding but we also studied UTR regions in ARDS using an Exome sequencing approach; (2) The strategy undertaken in this study was to focus on a structured landscape. The structured landscape in this case was defined by the shared variant inheritance pattern represented across ARDS cases, and it consisted of two components: private variants and public variants. For ARDS, we believe the interaction between variant inheritance pattern and diagnosis defines the ARDS landscape in part. The primary motivation was to elucidate underlying inheritance patterns of "private variants" that could be hidden within a larger cohort. Private variants consist of variants shared amongst ARDS cases but not found in the cohort of related family members. This is in contrast to public variants, which would consist of variants shared by ARDS cases and not present in other individuals. The question posed was, does a focused landscape of ARDS display a distinct signature from person to person or is there a shared signature? An important consideration for this study, was that there was a documented presence of ARDS based on family history. This past medical history could be tracked across a generation as demonstrated by the family pedigree from Figure 2.
The variants identified support our hypothesis that there are important biological pathways that could help distinguish ARDS cases from one another. In order to place these variants in the context of the available curated literature, it was important to parse out subsets of gene lists from the larger data. We observed that there was an increase in significant pathways predicted for the GP7 case when the search window of genes was increased by the addition of groups (B) and (C) genes to the ARDS-literature curated genes. The 3,516 variants from the study contributed a much more robust pool of candidate genes to utilize in performing pathway predictions. The increase in sample size (i.e. # of genes included in search criteria) contributed to an increase in detection power.
This was due to the signal-to-detection ratio becoming much higher, thereby leading to observed increases in significant pathways predicted. The groups (B) and (C) genes served as representative subsets of the GP7 enriched variants. By using these subsets of variants, we captured a novel pathway prediction: liver hyperplasia and hyperproliferation, which we hadn't previously observed when searching solely within group A. Liver hyperplasia and hyper-proliferation is widely implicated in respiratory complications [33][34][35][36] and has also been documented in cases of secondary ARDS due to conditions such as acute pancreatitis [37,38]. Follow up studies would be important to better understand if the genes within this pathway persist broadly across other ARDS cases. In an effort to characterize variant patterns across all ARDS cases from this study, we parsed out the group (D) genes. The group (D) genes represented an important subset because it represented a much more condensed list of variants shared across ARDS cases Pathway assessments demonstrated that with the suppression of background noise and subsequent screening of variants highly enriched in the GP7 case, we observed some novel pathways that previously weren't recognized as statistically significant for ARDS such as C-C chemokine receptor type 3 (CCR3) signaling, phosphoribosyl pyrophosphate (PRPP) biosynthesis, inhibition of matrix metalloproteases, choline degradation, glutamine biosynthesis, actin cytoskeleton signaling and epithelial junction signaling. These molecules have documented roles in lung pathogenesis and physiological signaling [24, 27, [43][44][45]. This indicates that there are potentially several signaling molecules derived from GP7 that could be prevalent due to the disease process. Of the 2,354 genes that were highly enriched in the GP7 case, we were able to confirm some overlap with the preexisting ones in IPA based upon our assessments with gene list A. The observed increases in the pool of ARDS genes from gene lists B (~2-fold) and C (~4.8-fold) is remarkable and contributed to the identification of these novel pathways. Our data also confirm the presence of shared pathways with group A and GP7 enriched gene subsets, including many implicated in lung pathogenesis such as airway pathology, acute phase response signaling, and hepatic fibrosis. These findings are in line with current information on ARDS from the available curated literature genes [26, [45][46][47][48][49][50].
Interestingly, the family heatmap comparisons illustrate that the second highest frequency of variants found was in the GP1 sample, representing one of the younger family members. This finding was strengthened by hierarchical clustering analysis shown in Figure 3A, which identified GP1 as having the highest similarity in variant expression pattern to GP7. This was a particularly curious finding because GP1 was the youngest family member, and had never been diagnosed with ARDS [15,51]. Recent findings indicate that younger persons can also get ARDS [52]. However, it is unclear what the consequences are for the variant inheritance pattern we observed in GP1 as this individual has not been diagnosed with ARDS, however further studies would be important so as to better determine what this discovery means. Additionally, a closer examination of the clade structure based upon clustering analysis shown in Figure 3A, displays a vastly different image from that of Figure 3B. To quantify the differences observed in Figure 3B Figure   3A and 3B is such that there was a higher proportion of variants shared amongst the unrelated ARDS than within the family cohort. This indicates that basal expression patterns of variants exist, and these preferentially shared amongst ARDS cases. This presents an area of interest in order to better understand in the future whether this is predictive of clinical outcomes or disease severity for the outlier samples (i.e. samples clustered farther away from GP7 thereby denoting less similarity). Additional speculation about the clinical implications of this current observation however, is beyond the scope of this study.
In a step towards developing a more generalized model, we applied clustering analysis for relationships between the groups of ARDS genes. The assessment of clusters of orthologous genes (COGS) categories yielded pre-existing experimental information about the genes. The DAVID resource provided a streamlined approach for identifying the roles of previously discovered variants from ARDS-related genes, which was of benefit in our classification of the variants. Functional annotations retrieved showed that much of the genes, such as the myosin light chain genes, from groups A, B, and C had previously been identified as containing polymorphisms. These data are in line with previously published information while also highlight new possibilities for exploratory work and data sharing with the broader scientific community [53]. ARDS studies done on families with previous generations are very few [27]. As the genes identified carry specific variants, it will be important to understand the relationship of these variants across multiple ARDS cases.
Based upon our findings, it is important to validate the variants whose function is currently unknown. The OMIM database, which is the primary inclusion criteria for the OMIM ontology, had a lack of enough experimental information on genes from groups B, C and D as relates to ARDS,. This represents a gap in coverage within the literature for which one could contribute to experimental knowledge by validation and deposition of the variants from this study.

Conclusions
Taken together, the findings from this study promote the idea that there is a coordinated effort amongst signaling processes that underlie the pathogenesis of ARDS. This combinatorial signaling behavior has been well documented for lung pathogenesis and is implicated in ARDS. The study population represents an important demographic in helping understand the disease prevalence within a specific family in comparison to the unrelated ARDS controls. That said, the frequency of individuals sampled in the family cohort though not a large number, does show that there are some family members with a high degree of shared variant expression as compared to the primary case, GP7. We speculate that more work is needed to validate these additional variants and place them in the larger context of familial ARDS cases. The potential outcomes would contribute to efforts geared towards clinical diagnosis and/or monitoring of cases for which family history indicates the presence of a genetic inheritance pattern of ARDS.

Study information and Sample Collection
Family members and control cohort were recruited, enrolled after informed consent under a protocol approved by the Human Subjects Protection Office and Institutional Review Board from the Pennsylvania State University College of Medicine. Three sets of samples were collected: primary ARDS case, and related family members; a third set of samples from unrelated ARDS subjects used in the present study were collected as previously described [40,41]. Together, we analyzed a total of 18 samples.

Whole Exome sequencing analysis and data collection
The workflow in Figure 1 outlines the key steps involved in sample data collection, data filtering and the downstream data analysis processes that we applied to the sequencing output in order to perform comparisons. Variants for each sample were identified based on the GATK [54] best practices pipeline. The bwa v0.7.3a software were used to align the paired end exome sequences to the hg19 reference and the Picard v1.102 Mark Duplicates tool was used to remove duplicates. Local realignment around indels was performed by running the GATK Realigner Target Creator and Indel Realigner tools, using the Mills and 1000G Gold Standard indels as the known indels. Base quality score recalibration was performed using the GATK Base Recalibrator tool with dbSNP build 138 and the Mills and 1000G Gold Standard indels as known sites, followed by GATK PrintReads. Variants were called using the GATK Haplotype Caller tools with the following parameters: ERC GVCF, LINEAR variant index type and 128000 variant index parameter followed by the GATK Joint Genotyping tool. The ANNOVAR v2015-03-22 [55] was used to functionally annotate the genetic variants, including when applicable gene membership (e.g. intron), conservation, nonsynonymous amino acid substitution, SIFT prediction, Polyphen2 prediction, etc.

Filtering and Classification of Variants
Ingenuity Pathways Analysis (IPA) was used to determine genes associated with ARDS in the literature, and to identify significantly enriched canonical pathways, networks, diseases and biological functions and upstream regulators from amongst filtered list of variants [56]. For IPA, we applied a filter to extract the most promising variants based on one case (GP7), samples from the family, and ten unrelated ARDS cases, (JF9-18) as well as control samples GP3-5 and GP8 from the same family as the case (GP7). While samples GP1, GP2, and GP6 were disease free, we excluded them from our controls since these originated from younger individuals and we were focused on late onset of ARDS during adulthood. However, these samples were later utilized in all follow-up analysis done on family variant inheritance patterns. For IPA assessments, the list of pre-processed variants was further filtered based upon the following set of criterias: (1) that GP7, which was the Using the biological relationships from the IPA pathway and the top genes in the canonical pathway as references, we derived a correlation trend between 'within patient expression changes' (WPEC) and 'ordered categorical Multiple Organ Failure' (ocMOF) labels that were biologically driven for all genes in the same pathway [23]. If this trend is consistent with the one computed from the data, the gene is retained and used to compute the dominant trajectory. The IPA pathway provides a graphical representation of the biological relationships between genes in a canonical pathway, where nodes represent genes and edges represent the biological relationships. Each edge is supported by at least one reference from the literature, a textbook, or canonical information stored in the Ingenuity Pathways Knowledge Base, providing us a relationship summary between genes. For this study, a -log p value of 1.30 represents the cutoff for significance; any pathways with values less than this would not be considered as significant across the dataset.

Clustering of variants and heatmap assessments
The filtered variants were extracted and parsed for all samples from the study. These tables were then imported into an excel spreadsheet and calculations were done for the presence (assigned a 1) or absence (assigned a 0) of variants within a specific sample using VB scripting [57]. The final csv file was uploaded into R [58]. We then applied the 'Heatmap' and 'clustergram' scripts using R packages gplots, gdata, gtools, and rcolorbrewer to render a 2-d color image of the data showing the samples on the x-axis.
To organize these data and identify potential relationships among presence/absence of specific variants, we utilized a hierarchical clustering with Euclidean distance metric and average linkage to generate the hierarchical tree. This type of clustering enabled us to find the similarity or dissimilarity between every pair of objects in the data set, group the objects into a binary, hierarchical cluster tree, and determine where to differentiate the hierarchical tree into clusters.
Clustering of genes and annotation analysis DAVID (Database for Annotation, Visualization, and Integrated Discovery) is a Web-based application that provides a high-throughput and integrative gene functional annotation environment to systematically extract biological themes behind large gene lists. Highthroughput gene functional analysis with DAVID helps to provide important insights that allow investigators to understand the biological themes within their given genomic study [59][60][61]. We took gene lists from groups: (B), (C) and (D) ( Table 2) and performed clustering analysis using the DAVID tool. We were unable to perform this type of analysis for group A, as this gene list is proprietary information and is retained by Agilent Biotechnologies. We performed all searches using the default parameters, as referenced in the software manual, based upon the "highest" classification stringency cutoff to obtain functionally related gene groups. For the annotation analysis generated from DAVID, we searched all human sequence related metadata available in the database and we then extracted the data to explore gene cluster relationships for each of our gene lists. DAVID uses a set of fuzzy classification algorithms to group genes based on their co-occurrences in annotation terms and ranks the gene groups using an internal (EASE) score [62].

Consent for publication
Not applicable.

Availability of data and material
The datasets used and/or analyzed during the current study are available from the corresponding authors on reasonable request.

Competing interests
The authors declare that they have no competing interests.  All functions were categorized based upon the filtered list of variants derived from those highly enriched in GP7.   *Denotes the same gene represented by 3 variants which was also found present in group B genes (Table 3). All other genes and variants listed in the table were unique to ARDS cases (i.e. group D) and not found in any other group. The column labeled "CHROM" denotes chromosomal location; the column labeled "POS" denotes the coordinate for the position of the variant within the chromosomal location. Figure 1 Analysis Workflow. Outline of the workflow steps for data collection starting from the sample processing level, data filtering steps including pre-processing of the raw sequence data file and culminating in the final data analysis steps of assessments with the post-processed data. Family pedigree. The pedigree represents the family for our primary case, GP7.

Figures
Circles denote females and squares denote males. The empty circles and square are indicating those family members whose samples were not available for the study. GP7 is a female and represents our primary case. GP4, GP8, GP6, and GP3 are males. GP1 and GP2 are females. GP1, GP2, and GP6 are the offspring of GP7 and GP3 and are also the youngest family members as they were <30 years old at the study's inception.   Novel pathway function identified from variants. Liver hyperplasia/hyperproliferation was a novel function predicted for the IPA assessed groups C, B, and D (as represented in the bar graph by C, B, and D respectively), but was not a significant pathway predicted in group A, the ARDS literature gene list. The total number of genes predicted was highest in group C, which represents all genes from group B, plus genes found within the 3' UTR, 5' UTR, and non-coding RNAs. genes were found to be shared amongst groups A, B and C. Forty-one additional genes were identified from Group B + Group A assessments, and 155 additional genes were identified in Group C + Group A assessments. Each pair of bars represents the frequency of genes in the search criteria and the frequency of genes identified as having a function in cell death and survival, respectively.

Supplementary Files
This is a list of supplementary files associated with the primary manuscript. Click to download.