Graph Fourier transform for spatial omics representation and analyses of complex organs

Spatial omics technologies can decipher the detailed components of complex organs and tissues at cellular and subcellular resolution. A robust, interpretable, and unbiased representation method for spatial omics is needed to enable novel investigations into biological functions, yet a mathematical foundation for such a representation is still lacking. We present SpaGFT (Spatial Graph Fourier Transform), which provides a unique analytical feature representation of spatial omics data and elucidates molecular signatures linked to critical biological processes within tissues and cells. It outperformed existing tools in spatially variable gene prediction and gene expression imputation across human/mouse Visium data. Integrating the SpaGFT representation into existing machine learning frameworks can improve the accuracy of spatial domain identification, cell type annotation, cell-to-spot alignment, and subcellular hallmark inference by up to 40%. SpaGFT identified immunological regions for B cell maturation in human lymph node Visium data, characterized secondary follicle variations from in-house human tonsil CODEX data, and detected extremely rare subcellular organelles such as the Cajal body and Set1/COMPASS. This method lays the groundwork for a new theoretical model in explainable AI, advancing our understanding of tissue organization and function.

Supplementary Table 1 | ... transcriptome datasets from the public domain.
The first column shows the data ID in the original paper or data source; the second column shows the use of the data (i.e., for grid-search optimization, independent test, or case study); the third column shows the sequencing platform; the fourth to the sixth columns show the sample information, including species, conditions, and tissue sources; the remaining columns show summary statistics for each dataset, including the number of spots, the number of genes, the number of total reads, the mean number of reads per spot, the standard deviation of the number of reads per spot, the mean number of genes per spot, and the standard deviation of the number of genes per spot.

Supplementary Table 2 | 849 SVG candidates collected from the public domain.
The table collects 849 unique cell-type- or layer-specific markers from five different literature sources. The first column records the mouse gene symbol. The second column records the paper source. The third column records the organism studied for each gene, where "M," "H," and "M&H" represent mouse, human, and both. The fourth column records the human gene symbol. The fifth column records the original source in the paper for each gene, either figures or supplementary files.

Supplementary Table 3 | 458 curated benchmarking SVGs validated by the Allen Brain Atlas.
The first six columns correspond to general information on gene identifiers, including gene symbol (mouse), gene symbol (human), UniqueID, probe name, plane, and the experiment ID in the ISH database. The ISH intensity in 12 brain regions is recorded in Columns G to R, respectively: Isocortex, Olfactory area (OLF), Hippocampal formation (HPF), Cortical subplate (CTXsp), Striatum (STR), Pallidum (PAL), Thalamus (TH), Hypothalamus (HY), Midbrain (MB), Pons (P), Medulla (MY), and Cerebellum (CB). All the records were downloaded from the ISH database. Column S records the mean ISH intensity across the 12 mouse brain regions. Column T records whether the gene is considered a curated benchmarking SVG in this paper.

Supplementary Table 4 | Grid-search of parameter combination for SVG prediction.
The table records the details of the performance comparison for the grid-search parameter optimization. The first four columns correspond to sample ID, tested software, sequencing technology, and parameter combinations. The rest of the columns record eight evaluation metrics: the Jaccard index, the Tversky index, the odds ratio of Fisher's exact test, precision, recall, F1 score, Moran's I, and Geary's C. If an element in this table is "NA," the software raised an error or ran out of time (running time was greater than 48 hours) during testing.
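The set-overlap measures named in this caption can be illustrated with a minimal sketch that compares a predicted SVG list with a ground-truth list. The function name `overlap_metrics`, the background gene `universe` used for the 2x2 contingency table, and the Tversky weights `alpha` and `beta` are assumptions made for illustration; the caption does not state how the study configured them.

```python
def overlap_metrics(predicted, truth, universe, alpha=0.5, beta=0.5):
    """Compare a predicted gene set with a ground-truth gene set.

    `universe` is the full list of tested genes (needed for the odds ratio).
    `alpha` and `beta` are hypothetical Tversky weights, not values from the paper.
    """
    predicted, truth, universe = set(predicted), set(truth), set(universe)

    # Cells of the 2x2 contingency table (predicted vs. ground truth).
    tp = len(predicted & truth)             # predicted and true
    fp = len(predicted - truth)             # predicted but not true
    fn = len(truth - predicted)             # true but not predicted
    tn = len(universe - predicted - truth)  # neither

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)

    # Jaccard index: intersection over union of the two sets.
    jaccard = tp / (tp + fp + fn) if tp + fp + fn else 0.0

    # Tversky index: asymmetric generalization of the Jaccard index.
    tversky_den = tp + alpha * fp + beta * fn
    tversky = tp / tversky_den if tversky_den else 0.0

    # Sample odds ratio of the contingency table (infinite if fp or fn is 0).
    odds_ratio = (tp * tn) / (fp * fn) if fp * fn else float("inf")

    return {
        "jaccard": jaccard,
        "tversky": tversky,
        "odds_ratio": odds_ratio,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy example with eight hypothetical genes.
scores = overlap_metrics(
    predicted=["A", "B", "C"],
    truth=["B", "C", "D"],
    universe=["A", "B", "C", "D", "E", "F", "G", "H"],
)
```

With `alpha = beta = 1` the Tversky index reduces to the Jaccard index, which is a convenient sanity check when choosing weights.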

Supplementary Table 5 | Running time of SpaGFT and other tools on the three grid-search test datasets.
The table records the running time and memory cost of SpaGFT, SPARK, SPARK-X, MERINGUE, SpatialDE, and SpaGCN on the HE-coronal, 151673, and Puck-200115-08 datasets. All tools and experiments were run in the same computing environment introduced in Methods. Columns A and B show tool names and sample names; Columns C and D record the running time in seconds (S) and as log10(S), respectively; Column E is the memory cost in megabytes. Any experiment that ran for more than 24 hours is labeled "NA."

Supplementary Table 6 | SVG prediction performance on 28 independent test datasets using default parameters.
The table records the details of the performance comparison for the independent test. The first column indicates the dataset ID, corresponding to the Dataset ID in Supplementary Table 1. The second column shows eight evaluation metrics: the Jaccard index, the Tversky index, the odds ratio of Fisher's exact test, precision, recall, F1 score, Moran's I, and Geary's C. The other columns are the software tools. If an element in this table is "NA," the software raised an error or ran out of time (running time was greater than 48 hours) during testing.
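Moran's I and Geary's C, listed among the evaluation measures above, are standard spatial autocorrelation statistics. The sketch below computes both from their textbook definitions for one gene. The binary adjacency matrix used as the spatial weights is an assumption for illustration; the caption does not say how the study built its weights.

```python
def morans_i(values, weights):
    """Moran's I: near +1 for smooth spatial patterns, near -1 for alternating ones."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_sum = sum(sum(row) for row in weights)
    # Weighted cross-products of deviations between pairs of spots.
    num = sum(weights[i][j] * dev[i] * dev[j]
              for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / w_sum) * num / den


def gearys_c(values, weights):
    """Geary's C: below 1 for positive spatial autocorrelation, above 1 for negative."""
    n = len(values)
    mean = sum(values) / n
    w_sum = sum(sum(row) for row in weights)
    # Weighted squared differences between pairs of spots.
    num = sum(weights[i][j] * (values[i] - values[j]) ** 2
              for i in range(n) for j in range(n))
    den = sum((v - mean) ** 2 for v in values)
    return ((n - 1) / (2 * w_sum)) * num / den


# Four spots on a line; neighbors are adjacent spots (hypothetical toy layout).
line_weights = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
gradient = [1.0, 2.0, 3.0, 4.0]     # smooth expression gradient
alternating = [1.0, 0.0, 1.0, 0.0]  # checkerboard-like expression

smooth_i = morans_i(gradient, line_weights)
smooth_c = gearys_c(gradient, line_weights)
rough_i = morans_i(alternating, line_weights)
rough_c = gearys_c(alternating, line_weights)
```

On this toy layout the gradient yields a positive Moran's I and a Geary's C below 1, while the alternating pattern yields the opposite, which is the behavior expected of a spatially variable versus a spatially noisy gene.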

Supplementary Table 7 | Summary of top 500 genes identified by SpaGFT and the five benchmarking tools.
The table records the unique and consistent SVGs among the top 458 SVGs identified by six tools for mouse brain data (HE-coronal). The first column is the gene name; Columns B, C, D, E, F, and G are software names. The values in Columns B to G indicate whether the gene is identified by that tool: a value of 1 means the gene is among the top 458 SVGs output by that tool, and otherwise it is not. Column H is the sum of the values in Columns B to G, indicating the consistency of the identified genes (the higher the value, the higher the consistency). A value of "1" in Column H means that the gene is uniquely identified by one of the tools in Columns B to G. Column I indicates whether the SVG is among the 458 ground-truth SVGs.

Supplementary Table 8 | Gene enhancement results.
The table demonstrates the performance of different tools, including Sprod, SAVER-X, scVI, netNMF-sc, MAGIC, and DCA, using 18 human datasets (2-3, 2-5, 2-8, 18-64, 1-1, T4857, 151507, 151508, 151509, 151510, 151669, 151670, 151671, 151672, 151673, 151674, 151675, and 151676). Samples 151510, 151672, and 151673 are used for grid search. The other 13 datasets are used for independent tests. The first column is the method name. The second column is the data usage. The third column is the sample ID. The fourth column is the parameter information for either grid search or independent test. The fifth column is the ARI score calculated from the ground-truth labels and the predicted labels.
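The ARI (adjusted Rand index) in the fifth column compares two labelings of the same spots and is invariant to how the clusters are named. A minimal sketch of the standard formula, written with the Python standard library only, is shown here; the function name `adjusted_rand_index` and the toy labels are illustrative and are not taken from the study's code.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(true_labels, pred_labels):
    """Adjusted Rand index between two labelings of the same items."""
    if len(true_labels) != len(pred_labels):
        raise ValueError("label lists must have the same length")

    n = len(true_labels)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: labelings agree trivially

    # Contingency counts: items sharing a (true, predicted) label pair.
    pair_counts = Counter(zip(true_labels, pred_labels))
    true_counts = Counter(true_labels)
    pred_counts = Counter(pred_labels)

    # Number of item pairs grouped together under each view.
    together_both = sum(comb(c, 2) for c in pair_counts.values())
    together_true = sum(comb(c, 2) for c in true_counts.values())
    together_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = together_true * together_pred / total_pairs
    max_index = (together_true + together_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (together_both - expected) / (max_index - expected)


# Same partition with renamed clusters versus a partition that splits every cluster.
ari_same = adjusted_rand_index([0, 0, 1, 1], ["b", "b", "a", "a"])
ari_split = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

A score of 1 indicates identical partitions, values near 0 indicate agreement no better than chance, and negative values indicate agreement worse than chance.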

Supplementary Table 9 | SVG clustering results in the human lymph node data.
The table records all SVGs predicted by SpaGFT on the human lymph node data. Column A is the gene name; Columns B-D are gene interpretations; Column E is the number of spots in which the SVG is expressed; Column F is the corresponding GFTscore; Column G is the ranking of the GFTscore; Columns H and I are the p-value and q-value of the SVG, respectively; Column J is the SVG cluster label. SVGs are arranged by GFTscore from high to low.

Supplementary Table 10 | Deconvolution results for human lymph node sample.
The table shows the proportions of 34 cell types calculated by cell2location. The first column is the spot ID of the sample. The rest of the columns are the proportions of the 34 cell types.