GeneLab spaceflight murine datasets
Five RNA-Seq datasets (OSD-238, OSD-239, OSD-240, OSD-241, OSD-254) were downloaded from the NASA OSDR (https://osdr.nasa.gov/bio/repo) via the API, and full dataset descriptions can be found on the dataset pages. These datasets are derived from murine skin from the MHU-2, RR-5, and RR-7 spaceflight experiments. For MHU-2, singly-housed male C57BL/6J mice were 9 weeks of age when flown on the ISS for 30 days; they were euthanized less than 1 day after return to Earth. Dorsal and femoral skin samples were extracted from MHU-2 mice, and 6 replicates in spaceflight microgravity and the matching 6 replicates in the 1G ground control were split into 2 sets of 3, with half fed a JAXA chow diet and the other half fed JAXA chow with supplemental FOS18. For RR-5, 30-week-old BALB/c mice in shared housing were flown on the ISS for 30 days; following return to Earth, mice were given 30 days to recover before euthanasia and extraction of dorsal and femoral skin tissue. Finally, for RR-7, 11-week-old C57BL/6J mice and C3H/HeJ mice were flown on the ISS for either 25 or 75 days before being euthanized on-orbit.
RNA-Seq analysis of rodent datasets
Raw counts data, derived via a previously reported pipeline69, were downloaded from the NASA OSDR. ERCC genes were filtered out, as were genes with low counts across all samples (<10). DGE analysis using the Wald significance test, and was performed on spaceflight microgravity and ground control samples using the R package DESeq270 (v1.32.0). For each data subset, microgravity spaceflight samples were contrasted with ground control samples and cook’s cut off and independent filtering were not used. The R Package biomaRt (v2.48.3) was used to map ENSEMBL gene ids to gene symbols and HGNC symbols. All heatmaps were generated using the R package ComplexHeatmap (v2.9.4) and circular heatmaps were created using a slightly modified version of the R package Circlize (v0.4.15), to support cosmetic changes.
Pathway analysis of rodent datasets
For ORA, the R package clusterProfiler (v4.0.5) was used on vectors of genes and bar plots were made via ggplot2 (v3.4.0). For GSEA, the R package FGSEA (v1.18.0) was used. GSEA was performed on all DEGs for all data subsets, with rank vectors consisting of HGNC symbols and corresponding t-scores in the data subset. For GSEA performed prior to derivation of the key genes, the skin health pathways and genes, and the DNA damage repair pathways, MSigDB human collections H, C2, and C5 were used (v7.4) and set sizes >15 were permitted. For the mitochondrial pathway analysis, human pathways from MitoPathways (v3.0) were used and set sizes >1 were permitted. To filter to pathways during derivations of the key genes, pathways that were highly significant (FDR ≤ 0.05) in at least 8/10 data subsets were selected (8/10 was chosen based on the experimental supplemented diet in the MHU-2 mission), and then leading-edge genes that were significant (FDR ≤ 0.1) in at least 2 datasets from different missions were accepted. All heatmaps were generated using the R package ComplexHeatmap (v2.9.4)
RNA-Seq data analysis on Twin Study samples
Longitudinal samples were collected from a male astronaut aboard the ISS and his identical twin on Earth during a 340 day mission including 6 months preflight and 6 months postflight follow-up, for a total of 19 timepoints for the flight subject and 13 timepoints for the ground subject. Blood was collected using CPT vacutainers (BD Biosciences Cat # 362760) per manufacturer's recommendations. For full details of sample separation and processing see15. Briefly, samples collected on ISS were either frozen in -80°C after separation of mononuclear cells by centrifugation (referred to as CPT), or returned to Earth in 4°C in a Soyuz capsule and sorted into CD4, CD8, CD19 populations and a lymphocyte depleted (LD) fraction. Samples collected on Earth were either frozen for mononuclear cells or processed when fresh for sorted cell populations. To correct for the effects of ambient temperature exposure on RNA (approximately 36 hours including landing and repatriation) control samples were created by simulating similar conditions to those that may occur during the ambient return and were compared to fresh blood collections from the same individual. RNA extraction, library prep and sequencing were completed per15 using both ribodepletion or polyA selection kits.
Generated sequences were trimmed using Trim Galore! (v0.4.1) and quantified to genes using kallisto71 on ENSEMBL transcripts. Differentially expressed genes were called using DESeq270 on each cell type separately by comparing preflight, inflight and postflight groups, controlling for the normal biological variance within the 24 months using the longitudinal data of the ground twin and using the simulated ambient control samples as another covariate for sorted cells 70.
JAXA Cell-Free Epigenome (CFE) Study RNA quantification data
Aggregated RNA differential expression data and study protocols were shared through the NASA OSDR with accession number: OSD-53072. Plasma cell-free RNA samples for RNA-seq analysis were derived from blood samples collected from 6 astronauts before, during, and after the spaceflight on the ISS. Mean expression values were obtained from normalized read counts of 6 astronauts for each time point. Heatmaps were made for the 21 genes key genes on the normalized values per time point using R package pheatmap (v1.0.12).
JAXA astronaut hair follicle data
Gene expression data from 10 JAXA astronauts’ hair follicles11 was downloaded from the NASA OSDR (OSD-174). Raw data for 60 total samples was processed using LIMMA with R/bioconductor73. Briefly, duplicate sample single-color Agilent microarray data was background corrected, filtered for low expression probes, and quantile normalized. Differential gene expression was measured between pre-flight, in-flight, and post-flight data points using p-values adjusted for False Discovery Rates (FDR) with the Benjamini–Hochberg method.
Inspiration4 (i4) astronaut sample collection
Four civilians, two males and two females, spent three days in Low-Earth Orbit (LEO) at 585 km above Earth. The mission launched from NASA Kennedy Space Center on September 15th, 2021 and splashed down in the Atlantic Ocean near Cape Canaveral on September 18th, 2021. Several human health and performance related experiments were carried out in collaboration with SpaceX, the Translational Research Institute for Space Health (TRISH) at Baylor College of Medicine (BCM), and Weill Cornell Medicine. Experiments were performed in accordance with the relevant guidelines at the principal investigators’ institutions. Moreover, the different study designs and the corresponding methods to collect and analyze the biological samples were approved by BCM IRB. All biological data derived from the Inspiration4 mission were collected pre and post flights. For this study, only data from blood samples and skin biopsies were used. Pre-flight samples were collected at L-92, L-44, and L-3 days prior to launch to space. Upon return, post-flight samples were collected at R+1, R+45, and R+82 days.
Blood samples were collected before (Pre-launch: L-92, L-44, and L-3) and after (Return; R+1, R+45, and R+82) the spaceflight. Chromium Next GEM Single Cell 5’ v2, 10x Genomics was used to generate single cell data from isolated PBMCs. Subpopulations were annotated based on Azimuth human PBMC reference74.
For skin spatial transcriptomics data, 4mm diameter skin biopsies were obtained from all Inspiration4 crew members, once before flight and as soon as possible after return (L-44 and R+1). These biopsies were flash frozen and processed with the NanoString GeoMx platform. Based on staining images using fluorescent antibodies, a total of 95 freeform regions of interest (ROIs) were profiled across outer epidermal (OE), inner epidermal (IE), outer dermal (OD) and vascular (VA) regions. GeoMx WTA sequencing reads from NovaSeq6000 were compiled into FASTQ files corresponding to each ROI and converted to digital count conversion files using the NanoString GeoMx NGS DnD Pipeline. From the Q3 normalized count matrix that accounts for factors such as capture area, cellularity, and read quality, the DESeq2 method was used to perform DGE analysis.
Astronaut Physiological data
Data are reported from three human subject experiments conducted on the International Space Station: Nutritional Status Assessment (2006-2012), Dietary Intake Can Predict and Protect Against Changes in Bone Metabolism During Space Flight and Recovery (Pro K) (2010-2015), and Biochemical Profile (2013-2018). All protocols were reviewed and approved by the NASA Institutional Review Board and all subjects provided written informed consent.
These missions were 4-6 months in length, and these studies included blood and urine collections before, during, and after flight, with analysis of an array of nutritional and biochemical markers. Blood and urine samples were collected 2 or 3 times before flight: approximately Launch minus (L-) 180 days and L-45 days. In some cases, a third blood sample was collected (typically along with the L-45 collection), and these tubes were centrifuged and frozen for aliquoting after flight batched with the samples collected inflight. Blood samples were collected inflight, at approximately Flight Day (FD) 15, 30, 60, 120, and 180. Postflight samples were collected in the first 24-h after landing (designated return+0, R+0) and again 30-d later (R+30). The R+0 samples were not necessarily fasting, given the time of day and nature of return from flight. Of the 59 crewmembers reported herein: 8 returned on the Space Shuttle, with blood collection 2-4 hours after landing; 51 landed in Kazakhstan, with 7 of them returning to Star City, Russia, with blood collection 8-10 hours after landing; 44 were transported directly back to the Johnson Space Center in Houston, with blood collection approximately 24-h after landing. Pre and postflight collections included two 24-h urine collections, and inflight collections included one 24-h urine collection. These collection techniques have been previously described75.
We report here vitamins and metabolites, oxidative stress and damage markers, inflammatory markers and cytokines, liver enzymes and endocrine indices. These were analyzed using standard techniques as previously reported76.
As of this writing, data were available for 59 crewmembers (47 male, 12 female). Age at launch was 47.0 ± 5.6 y, body mass at launch was 79.2 ± 11.8 kg (M: 83.3 ± 9.3; F: 63.0 ± 4.5). Body mass index was 25.5 ± 2.9 kg/m2 (M: 26.4 ± 2.6; F: 22.3 ± 1.5).
All available data are reported here, although the reported n for any given test or session varies for a number of reasons, including: not all experiments had all analytes included, mission length differences for some crewmembers, schedule or other issues occasionally precluded sample collection, and methods changes over time. Repeated measures analysis of variance was conducted to test for differences during and after flight compared to preflight, and comparisons among time points were made using a Bonferroni t-test. Multiple comparisons were accounted for, and only those tests with p<0.001 are reported. The data was plotted using R package ggplot2 (v3.3.5).
QLattice symbolic regression modeling
We used symbolic regression (QLattice v3.0.120) to construct both single-gene models and models involving combinations of synergistic genes which map from the gene expressions to spaceflight status. For these models, we only distinguish between mice that went to space and mice that didn't. Conventional statistical methods typically allow for calculating the effect of a single gene at a time through metrics such as p-values and false discovery rates. In contrast, symbolic regression models can reveal combinations of genes and modules that best predict spaceflight status. These could be linear combinations involving two or more genes that have previously also been shown to be statistically significant, or it could be non-linear combinations that reveal features that on their own were non-significant but in synergy with a second feature become highly predictive. In addition, known biological functions of genes included in models, as well as the resultant mathematical relationship between them, can potentially be interpreted to reveal regulations or interactions that are affected by spaceflight.
In biological pathway analysis, it is well-known that up- or down-regulation of one gene can have cascading effects such that the function of one gene becomes sensitive to that of another77. It has previously been demonstrated that parsimonious machine learning models are able to provide accurate outcome prediction in omics data, while preserving interpretability78,79. The interpretability often results from the fact that models might demonstrate otherwise opaque relations which become clear when combined effects are taken into account. An example could be an interaction where the regulation of a single gene is itself unimportant for functional changes, unless another gene is simultaneously regulated. This would indicate a synergistic compound effect, which naturally can expand to a 3-, 4- or 5-gene synergy and beyond. The combined effect of two or more individually insignificant gene expression levels may thus theoretically be more powerful than the effect of a single gene with low p-value. Thus, traditional statistical analysis masks such effects by the use of metrics related to the individual gene.
UMAP dimensional reduction and gene clustering
For the clustering of genes shown in Fig. 2A, we performed a uniqueness filtering by first dropping all genes where more than 1/4 of samples had identical expression levels. Subsequently, we filtered out all genes with a variance of less than 1.7, resulting in 2184 filtered genes with high variability. Similarly, prior to the clustering in Fig. 2D, we filtered out all genes with a variance of less than 2.5, to which we manually added all key genes that were not present in this set (all but 4), resulting in a final set of 1060 genes for clustering analysis. The rationale behind the stricter variance filtering in Fig. 2D was that it increases the relative frequency of key genes in the clustering, and allows for a higher resolution in the final plot. Thus, the 102 key genes comprise roughly a tenth of the genes included in Fig. 2D, allowing for a detailed analysis of the ontological themes present in the key gene set.
In both clustering figures, a dimensional reduction was performed subsequent to filtering by uniformly distributing the filtered data on a Riemannian manifold, using the Uniform Manifold Approximation and Projection for Dimension Reduction (UMAP). UMAP is a general purpose manifold learning and dimension reduction algorithm which is similar to t-SNE in that it predominantly preserves local structure. Yet, UMAP preserves more global structure, which makes it a more suited algorithm when the objective of the dimensional reduction is more than simple visualization (in this case the objective is clustering).
Genes were then clustered using the Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN)80. A primary advantage of HDBSCAN is that it is always deterministic for the same hyperparameters, and will thus always return the same clustering, all else being equal. In addition, comparable clustering algorithms such as k-means do not perform well unless clusters are of equal size and density with few outliers. With biological data such as gene expressions, we expect large variation in cluster size and density, making HDBSCAN the ideal choice.
Once genes were clustered, the gene sets belonging to each cluster were extracted and analyzed using three ontology databases in the Python implementation, GSEAPY, of the Gene Set Enrichment Analysis81 tool Enrichr82. We used the Elsevier Pathway Collection, the 2021 WikiPathway Collection, and the 2021 MGI Mammalian Phenotype Level 4 Collection. These enrichment analyses were used to provide context to each cluster by appending an annotation if a notable amount of hits showed up for a particular association. Cluster 2 in Fig. 2B, for example, consists of 59 genes, of which 10 are found as hits for associations to decreased IgG1 level in the MGI Mammalian Phenotype Level 4 database from 2021, as well as more than 20 hits for B-cell receptor signaling in the Elsevier Pathway Collection, and thus these annotations are appended in Fig. 2B. In contrast, Cluster 0, consisting of 47 genes, has no more than one hit for any association in any database, and is thus deemed to have no significant interpretation.
QLattice synergy network
For the models involved in constructing the QLattice (https://pypi.org/project/feyn/) synergy network in Fig. 2C, we generally used the "Area Under Curve" (AUC) measure as our performance gain indicator. The AUC can in this case be interpreted as the probability that a classifier will be able to correctly distinguish a spaceflight mouse from a ground mouse for two randomly chosen data samples; one of each type.
We used a high confidence (0.700) and all interaction sources besides textmining in StringDB, which allowed us to identify three primary biological biological pathways. To visualize the correspondence of this synergy network to these established biological relations, we finally color each gene node according to which of the pathways they belong to.
Regulatory Network Analysis
We used the expression data from all key genes across each comparison to perform an upstream regulator analysis. We identified the biological or chemical drugs in the QIAGEN IPA library of regulators, defined as having a P < 0.05 (Fisher’s Exact Test) and an absolute z-score > 1. Drugs from the data set that met these criteria and were associated with similar gene expression changes in the QIAGEN Knowledge Base were visualized by two-way hierarchical clustering of the unstandardized z-scores with Ward linkages in both dimensions.