Development of quantitative direct prediction algorithm for the human target organ 1 similarity of human pluripotent stem cell-derived organoids and cells

1 Human pluripotent stem cell (hPSC)-derived organoids and differentiated cells have similar 2 characteristics, such as cell types, structure, and functions, to human organs and tissues. Thus, 3 in vitro human organoids and tissue-specific cells serve as a superior alternative to conventional 4 cell lines and animal models in drug development and regenerative medicine. However, since 5 hPSC-derived organoids and differentiated cells show fetal-like features, further differentiation 6 and maturation methods have been developed for the generation of high-quality in vitro models 7 of the corresponding human organs and tissues. Therefore, for a simple and reproducible 8 analysis of the quality of organoids and cells to compensate for the shortcomings of existing 9 experimental validation studies, a quantitative evaluation method should be developed. In this 10 study, using the GTEx database (a total of 8,555 samples in 53 tissues), we constructed a 11 quantitative calculation system (organ-specific panels and calculation algorithm) to assess the 12 similarity to the human lung, stomach, and heart and confirmed the algorithm using in-house 13 RNA-seq data (total RNA from 20 tissues). To evaluate our system, we generated hPSC- 14 derived lung organoids, gastric organoids, and cardiomyocytes and detected 33.4%, 51.7%, and 15 83.4% similarity, respectively, to the corresponding human target organs. To facilitate access 16 and use of our system for researchers, we developed the web-based user interface (Web-based 17 Similarity Analysis System, W-SAS; for liver, lung, stomach, and heart) presenting similarity 18 to the appropriate organs as percentages and specific gene expression patterns. Thus, the W- 19 SAS system could provide valuable information for the generation of high-quality and readily 20 available organoids/cells differentiated from hPSCs and a strategy to guide proper lineage- 21 oriented differentiation.


Introduction
Animal models have served as essential tools for elucidating the pathogenesis of human disease 2 and investigating potential therapeutic targets before entering clinical trials. However, animal 3 models cannot completely reproduce human pathophysiology and have frequently failed to 4 predict human responses because of several human-specific characteristics not present in 5 animal species, such as immune-related responses and pharmacokinetics 1,2 . Human primary 6 cells are considered the gold standard in vitro models for studying human physiology, disease, 7 and drug response 3,4 . However, isolation or in vitro expansion of primary cells is difficult, 8 limiting their use in research for human disease and drug development. Therefore, robust and 9 representative in vitro human organ/tissue models are urgently needed to overcome these 10 limitations.

11
Recent stem cell biology studies have focused on the in vitro generation of tissue-12 specific functional cells or tissue analogs by inducing a transition of cellular states 5, 6 . The 13 most common method is to differentiate human pluripotent stem cells (hPSCs) into tissue- 14 specific cells by regulating developmental signaling 5 . The advancement of stem cell 15 differentiation technology has led to the efficient differentiation of hPSCs into various cell 16 types, such as neurons 7 , cardiomyocytes (CMs) 8 , hepatocytes 9 , and lung airway epithelial 17 cells 10 . Alternatively, differentiated somatic cells such as fibroblasts and peripheral blood cells 18 can be directly converted into target cells by combinational expression of tissue-specific 19 transcription factors 11 . Furthermore, advances in three-dimensional (3D) culture systems have 20 enabled the development of complex organotypic models using 3D organoids that differentiate 21 from stem cells into tissue-like miniature analogs recapitulating complex tissue-specific cell 22 compositions, architectures, and functions 12,13,14 . These 3D organoid technologies provide an 23 opportunity to study human development and disease in depth by assessing cellular interactions, 1 location, and structural changes 15,16 . Overall, technologies to manipulate cell fate provide 2 various options that can produce in vitro models with similar properties to the corresponding 3 tissues, and these strategies can be broadly applicable for in vitro disease modeling, especially 4 human infectious disease modeling, an important topic recently, drug development, and cell-5 based therapies. 6 However, multiple studies of in vitro generation of tissue-specific cells or organoids 7 from hPSCs have demonstrated critical limitations of current technology, including immature 8 characteristics 17, 18 and variation in quality 19,20 . Transcriptome analysis of diverse tissue 9 organoids demonstrated that organoids from hPSCs have fetus-like characteristics even in long-10 term culture 20, 21,22 . In the case of lung-bud organoids, long-term culture up to 7 months results 11 in lung-like organoids composed of tissue-specific cells, but they persist in a fetus-like state 23 . 12 Thus, the development of additional maturation methods is required for advanced in vitro 13 human models that more closely mimic human tissue. Moreover, heterogeneous production 14 critically limits their utility in various biomedical applications 19,20 . 15 Currently, the development of tissue-specific differentiation methods and quality 16 control of differentiated cells/organoids rely on expression analysis of tissue-specific markers 17 by histology and gene expression analysis. Evaluation of differentiation using key tissue- 18 specific markers can be an efficient strategy for the design and optimization of differentiation 19 methods, but evaluating the similarity between human tissue and differentiated cells/organoids 20 is difficult because experimental validation is laborious and time consuming. Although 21 clustering analysis of the global transcriptome provides insight into the lineage markers and 22 molecular similarity of differentiated cells or organoids with their counterpart human tissue, 23 5 this method does not provide a quantitative and standardized assessment of the similarity 1 between these structures.

2
Previously, we developed a quantitative prediction system to assess the similarity to 3 liver (LiGEP; Liver-specific Gene Expression Panel) of hPSC-derived hepatocytes and liver 4 organoids to improve on qualitative quality assessment. Using the LiGEP algorithm, we can 5 calculate the similarity between liver organoids or differentiated hepatocytes and human liver 6 and provide valuable information for generating high-quality liver organoids and hepatocytes 7 to researchers 24, 25 . However, because the LiGEP algorithm can only calculate similarity to 8 liver, it is necessary to expand our study to other human organs. Additionally, we need to build 9 a web-based analytics platform that is readily available to researchers.

10
Here, we developed quantitative calculation systems to assess similar to organs based 11 on organ-specific gene panels using the GTEx public database (total of 8,555 samples, 53 12 tissues), including LuGEP for lung, StGEP for stomach and HtGEP for heart, and an analytical 13 algorithm for direct comparison to human organs. We validated this analytical system with in-14 house RNA-seq data of 20 total RNA samples from different tissues. Moreover, we obtained 15 the similarity (%) of hPSC-derived lung organoids, gastric organoids and CMs compared to the 16 corresponding human organs. Finally, we developed a web-based user interface named Web-17 based Similarity Analytics System (W-SAS) to provide an analytical algorithm, similarity 18 (percentage) and gene expression patterns for a direct comparison to human target organs (liver, 19 lung, stomach, and heart). Thus, our quantitative calculation system and W-SAS can provide a 20 useful platform for evaluating and improving the quality of differentiated organoids/cells.

1
Development of a quantitative calculation system to assess similarity of hPSC-derived 2 organoids and cells to organs 3 To assess the similarity of hPSC-derived organoids and cells to human organs at a quantitative 4 level, in this study, we developed the W-SAS program for the quantitative assessment of 5 similarity and quality of the hPSC-derived organoids and cells. First, the researcher will 6 perform the RNA-seq analysis. Using raw RNA-seq data (TPM, FPKM/ RPKM values), the 7 W-SAS program (organ-specific panels and algorithms) can calculate the similarity to the 8 appropriate organ and provide a quantitative organ similarity score (%) and information on the 9 gene expression patterns in organ-specific panels, directly comparing the target organs to the 10 hPSC-derived organoids and cells. Using our system, researchers can receive important 11 information for quality control of hPSC-derived organoids and cells (Fig. 1a). For selection organ-specific genes for each tissue (heart, lung, stomach), the analysis was 16 performed in three steps (t-test, confidence interval, quantile comparison).
Step 1: Gene 17 selection was performed by comparing the mean and variance between heart, lung or stomach 18 tissues and the remaining tissues. We performed paired t-tests to identify differentially 19 expressed genes between two tissues (the heart, lung, or stomach vs. one of 42 tissues) for all 20 possible cases. We defined a set of tissue-specific genes through the intersection of genes that 21 showed a significant difference (p-value < 0.05) in the t-test results. We defined 2,843 heart-22 specific genes, 1,049 lung-specific genes and 466 stomach-specific genes.
Step 2: Since the 23 genes acquired in the first step were chosen based on the difference in the mean and the variance 1 between tissues, these genes were not only expressed uniquely in the particular tissue but also 2 showed large variances in other tissues. Therefore, we used the confidence interval (CI) to 3 overcome this problem. The CI is an estimate calculated from the statistics of the observed data, 4 suggesting a range of possible values for the parameter; thus, it is used to identify genes that 5 are specifically highly expressed in particular tissues. We calculated the lower bound of the 99% 6 confidence interval (LCI) O Li (ith gene's LCI) of the genes obtained in the first step for each 7 tissue (heart, lung and stomach) and calculated the upper bound of the 99% confidence interval 8 (UCI) T 1Ui , ⋯ , T 42Ui for the remaining 42 tissues. Then, we extracted the genes that had a 9 higher LCI of the tissue (heart, lung and stomach) than the maximum UCI among the 42 tissues 10 (O Li > max (T 1Ui , ⋯ , T 42Ui )). As a result of CI filtering, we identified candidate genes that are 11 highly expressed in particular tissues (153 genes in heart, 189 genes in lung and 73 genes in 12 stomach).
Step 3: Although highly expressed tissue-specific genes of the three organs were 13 previously identified, among the rest of the tissues, some genes had expression values in the 14 top 25% of the samples that were higher than that of each tissue (heart, lung, stomach) because 15 of a large variation in the expression values. To eliminate these false positive results, we 16 performed quantile comparison between one of 3 tissues (heart, lung, stomach) and the 17 remaining 42 tissues. First, we set the top 25% RPKM value of each tissue (heart, lung, stomach) 18 as O qi and set the top 25% RPKM values of the remaining 42 tissues as T 1qi , ⋯ , T 42qi . Then, 19 we selected the genes that met the following conditions: O qi > t × max(T 1qi , ⋯ , T 42qi ) , t = 20 1.05. Through quantile comparison, we defined the final organ-specific expressed genes (143 21 genes of heart, 145 genes of lung and 73 genes of stomach). To construct a gene panel that can 22 reflect the characteristics and functions of each tissue, we added not only organ-specific 23 expressed genes but also genes related to tissue functions. As a result, three organ-specific gene 1 panels (heart-specific gene expression panel with 144 genes, lung-specific gene expression 2 panel with 149 genes and stomach-specific gene expression panel with 73 genes) were 3 constructed (Supplementary Table S3).

5
Validation and characterization of Organ-GEPs 6 In the heatmap, the expression of each panel presents the specificity of its own tissue compared 7 with other tissues. Moreover, in multigroup discriminant analysis (MDA), the expression 8 panels for lung, heart, and stomach are clearly divided from those of other tissues, suggesting 9 that we constructed organ-GEPs for calculation of similarity to the specific organ ( Fig. 1b and   10 1c). To confirm that each organ-GEP reflected the organ-specific functions and 11 characterizations, we performed Ingenuity Pathway Analysis (IPA) using each organ-GEP. The

12
LuGEP was associated with "inflammatory response, formation lung, inflammation of function. The HtGEP was related to "contraction of heart, heart rate, contraction of cardiac 16 muscle, cardiogenesis, organization of muscle cells, familial cardiomyopathy, muscle 17 contraction, and enlargement of heart." Finally, StGEP was associated with "secretion of gastric 18 acid, morphology of gastric mucosa, gastric lesion, secretion molecule, morphology of 19 digestive system, and peptic disorder" (Fig. 1d). Thus, to calculate the similarity of hPSC-20 derived organoids to organs, we constructed organ-specific gene expression panels, and each 21 panel reflected the functions of each organ. We established an organ-specific calculation algorithm for determining cut-off values 3 according to organ-specific genes via calculation of the distance based on the standard gene 4 expression vectors between organoids using organ-specific expression gene panels (see 5 Materials and Methods). To assess the algorithm, we used GTEx public data. In Figure 2a, 6 LuGEP was used to calculate 100% similarity to lung of 320 lung tissues (P > 0.001); other 7 tissues showed less than 20% similarity to lung, implying that the LuGEP algorithm could 8 specifically distinguish the lung tissues among all human tissues. In addition, similar to LuGEP, 9 HtGEP presented 100% similarity to 412 heart tissues (P > 0.001), but the HtGEP score showed 10 a high percentage in muscle. In the StGEP analysis, the similarity to stomach represented 100%  To validate the tissue similarity of human lung organoids by LuGEP and the algorithm, we 1 generated a LBO by a stepwise differentiation method that mimics the development process of 2 the human lung, as previously described (see Materials and Methods) 23 . First, undifferentiated 3 hESCs were differentiated into DE expressing the surface markers CXCR4 and C-kit (Fig. 3a).  To calculate the similarity of lung organoids to lung using the LuGEP algorithm, we 22 performed LuGEP analysis with human ES, DE, lung organoid (days 21, 56), and human lung 23 tissues. Figure 3D shows that human ES cells and lung tissues presented 2.7%~8.5% and 100% 1 similarity to lung, implying that LuGEP analysis could accurately distinguish lung tissue and 2 other cells. However, day 4 DE had 17.9% similarity, and 21-and 56-day-old lung organoids 3 presented 27.5% and 33.4% similarity, respectively, compared with the human lung (Fig. 3d). or HIF-2 alpha) and NAPSA (aspartic proteinase), which are related to the inflammatory 10 response, lung formation, and lung cancer, were gradually induced via IPA (Fig. 3e). Thus, we 11 could identify specific gene expression patterns of LuGEP during the formation of LBO 12 derived from hPSCs. To analyze the tissue similarity between the hGOs using StGEP, we generated hESC-derived 17 antrum-like GOs (see Materials and Methods). First, we differentiated hESCs into DE and 18 hESCs exposed to Activin A (Activin/nodal signaling activator) for 3 days, and we observed 19 the expression of endoderm-specific markers such as SOX17 and FOXA2 (Fig. 4a). 20 Additionally, in the presence of FGF4, CHIR99021 (WNT activator), Noggin (BMP inhibitor), 21 and treatment with retinoic acid (RA) for the last day, DE cells were differentiated into 22 posterior-foregut (PF) cells, and PF cells generated spherical shapes and expressed SOX2, 23 PDX1, HNF1β and HNF6 via qRT-PCR and immunocytochemical analysis ( Fig. 4a and b). 1 After 30~34 days, we observed 3D-cultured single spheroids, which transformed into spherical 2 organoids, including stomach antrum-like glandular morphologies in Matrigel (Fig. 4c). In  (Fig. 4d). Using hGO, we performed gastric similarity analysis using the StGEP 7 and the algorithm. Figure 4d shows that hESCs and PF cells presented 2% and 10.3% similarity, 8 respectively. However, hGO showed 51.7% and 33.9% similarity compared with the human 9 stomach. In addition, the expression patterns of the StGEP in hGOs and PF cells showed 10 different patterns compared with the human stomach (Fig. 4e). In PCA analysis, the expression 11 of TFF1/2 and GKN1 gradually increased between PF and GO. Moreover, as GO_1 and GO_2 12 showed 51.7% and 33.9% similarity to stomach, respectively, we could identify a gene 13 expression difference between GO_1 and GO_2. Thus, using the StGEP analysis results, we 14 quantitatively evaluated the quality of hGOs between each sample directly compared to the 15 human stomach. Additionally, in immunocytochemical analysis, we detected high expression of cardiac 1 transcription factors (NKX2.5), cardiac muscle marker (cTnT), ventricular CM markers 2 (MYL2), and atrial CM markers (MLC2a) in the differentiated CMs with BMP4 compared to 3 the control CMs (Fig. 5b). Moreover, in FACS analysis with cardiac muscle-specific markers 4 (cardiac troponin T, cTnT), hESCs and control CM showed 0.28% and 29% differentiated 5 yields. However, differentiated CMs with BMP4 presented a 90% differentiated yield (Fig. 5c). 6 Using hESCs and differentiated CMs, we performed HtGEP analysis. In Figure 4C In this study, we also established a similarity calculation system for hPSC-derived lung and 18 gastric organoids and CMs. To directly provide the calculation algorithm to the researcher, we 19 developed a user-friendly interface (W-SAS, http://kobic.re.kr/wsas) for the calculation of 20 similarity of organs to hPSC-derived organoids and cells (Fig. 6a). W-SAS is easy to use for 21 the calculation of similarity. Figure 6b shows the RNA-seq results of each hPSC-derived 22 organoid and cell uploaded to the W-SAS interface and the calculated and presented results of 23 similarity to the appropriate organ for each sample. 1 The W-SAS was constructed as a two-tiered server-client architecture. The server side 2 was developed using Java Spring Framework with Java Development Kit (JDK) 1.7, and the   1 We constructed a system for prediction of similarity to three organs and a user interface (W-2 SAS) for hPSC-derived organoids and cells based on RNA-seq analysis. After calculation of 3 the organoid data, researchers can receive information to improve the quality of organoids, 4 such as percent similarity (%) and gene expression patterns, via a direct comparison between 5 the target human organs and organoids. 6 The first advantage of W-SAS is a direct comparison between human target organs and 7 differentiated organoids/cells. After the generation of hPSC-derived organoids/cells, W-SAS 8 was used to calculate similarity to the target organ, showing a high percentage of similarity of 9 the differentiated organoids/cells with the corresponding target organs. Second, W-SAS 10 provides a researcher with an adequate number and names of target organ-specific genes 11 between target human organs and organoids. Using this information can help to develop high-12 quality organoids via regulation of insufficient target organ-specific genes, such as genes with 13 overexpression or depletion. The third advantage is the infinite expandability. For calculation 14 of the similarity to a target organ, a target-specific gene expression panel and algorithm are 15 needed. Here, we developed calculation algorithms and organ-specific gene-screening 16 pipelines using RNA-seq data and applied them to all human organs to calculate the similarity 17 with hPSC-derived organoids. Finally, the W-SAS interface is a user-friendly interface. We total RNA is not translated as proteins 28 . W-SAS is an algorithm to analyze whether organ-9 specific genes are acquired at each stage of differentiation from hPSCs. Thus, in this study, 10 organ development and function-associated genes were manually collected by keyword 11 searches of the PubMed database to compensate for the disadvantage of the W-SAS calculation 12 system. In addition, for organ-specific genes in Organ-GEPs, we included organ function-13 associated genes that show significant temporal expression in the stomach, heart, and lung and 14 that show core signatures enriched with functional genes. However, to verify the cell function 15 regrading similarity percentage, researchers must perform biochemical analyses to compensate 16 for transcriptome analysis shortcomings.

17
Researchers generally performed FACS analysis with CM-specific markers (cTnT) to 18 evaluate the CM differentiation rate derived from hPSCs 29 . In this study, the CM control and 19 CM+BMP4 groups showed 29% and 92% in FACS analysis, but HtGEP analysis showed 34.3% 20 and 83.4% (Fig. 5d). The heart is composed of various cell types, such as CMs, smooth muscle 21 cells, and endothelial cells, and mainly contains 70~85% CMs in the heart 30 . Therefore, we 22 could expect that the HtGEP score will not exceed 85% with CMs only. Although cTnT analysis 23 is an important method for the evaluation of CM differentiation derived from hPSCs, we 1 considered that a more accurate differentiation and similarity assessment will be needed for a 2 comprehensive analysis of the gene panel rather than measuring the differentiation with a single 3 gene. Because HtGEP is a gene panel made from heart-specific genes, we suggest that HtGEP 4 may reflect various features of heart-specific cell types. Therefore, the study of the remaining 5 17% similarity to heart will allow the generation of more heart-like heart organoids. In summary, to generate high-quality organoids derived from hPSCs, researchers can visit the W-2 SAS interface to assess the organ similarity of their own hPSC-derived organoids and will receive 3 information (similarity, heat map, PCA plot, expression patterns of panel) that can help generate high-4 quality organoids by engineering of gene expression. Thus, the feedback between W-SAS and 5 researchers can become a vital troubleshooting step in the organoid research field. Transcriptome data acquisition and data preprocessing 12 We developed organ-specific gene expression panels that can reflect the characteristics and 13 functions of each organ to develop a model that quantitatively predicts the differentiation of 14 heart, lung, and stomach tissues. A total of 8,555 RNA-seq datasets (transcript RPKM of GTEx 15 version 6) of 53 tissues were obtained through the publicly available GTEx database to select 16 genes that could reflect the characteristics and functions of each tissue. While the RNA-seq 17 data of the lung and stomach provide only one RPKM dataset each, the heart is divided into 18 two subtissues (atrial appendage and left ventricle). We used both heart subtissues as a single 19 heart sample, and all other subtissues of the remaining tissues were used as independent data.

20
For the classification of 52 tissues, the MDS plot was calculated with 56,238 RPKM values of 21 all genes, and the testis was excluded because it is clearly separated from other tissues. Genes 22 specifically expressed in testis can lead to false positive results when identifying organ-specific 1 genes of three tissues. In addition, gender-specific tissues such as the ovary, uterus, vagina, 2 fallopian tube and cervix were excluded. We also excluded whole blood and blood cell tissues 3 because they include peripheral blood leukocytes. Finally, we prepared a total of 43 tissues 4 (7,579 samples) to construct a panel of genes to reflect the characteristics and function of the 5 heart, lung and stomach, excluding 7 gender-specific tissues and 2 blood tissues. Data 6 preprocessing was performed before extracting the tissue-specific expression genes. In the data 7 preprocessing, the genes were limited to protein-coding genes, and nonexpressed genes and/or 8 genes with extremely low expression were filtered out. The gene sets of 43 tissues were 9 matched to compare the gene expression differences between tissues. First, for identification 10 of genes that could play a major role in tissue differentiation, 18,818 protein-coding genes were 11 extracted through Ensembl gene ID information provided by GTEx RNA-seq data. Then, we 12 removed 1,343 genes whose maximum value was less than 1 or third quantile value was less  of differentiation was measured quantitatively by calculating the distance between the organ 7 and the undifferentiated tissues. Using 73 heart samples and two undifferentiated organoid 8 samples, we calculated the standard gene expression vector that can determine the degree of 9 tissue development. First, each data point was normalized using log2 transformation. Using 10 undifferentiated organoid and heart samples, we calculated the Manhattan distance close to the 11 differentiated organoid. To calculate this, we performed the following steps. For n genes, the 12 RPKM value of the differentiated organoid sample is X = (x 1 , ⋯ , x n ), the jth RPKM value 13 of the undifferentiated organoid sample is U j = (u j1 , ⋯ , u jn ), ℎ j = 1,2 kth RPKM value 14 of the heart organ sample is H k = (h k1 , ⋯ , h kn ) , and k = 1, ⋯ , K , with K = 73 . The 15 minimum RPKM of the two undifferentiated organoid samples is U m = 16 (min (u 11 , u 21 ), ⋯ , min (u 1n , u 2n )) = (u m1 , ⋯ , u mn ) . Similarly, the minimum RPKM value 17 for each gene of the heart samples is set atH m = (min(h 11 , ⋯ , h K1 ), ⋯ , min(h 1n , ⋯ , h Kn )) = 18 (h m1 , ⋯ , h mn ). U m is the boundary of the undifferentiated organoid of the RPKM, and H m 19 is the minimum RPKM expression boundary of the heart. Using these parameters, we estimated 20 the status of differentiation by calculating the Manhattan distance among organs and 21 undifferentiated and differentiated organoids after adjusting the expression values to the score 22 boundaries. For quantitative assessment, we found that the score of the unknown sample 23 approached 100 when the distance of the differentiated organoid was similar to that of the organ 1 and approached zero if it was closer to the undifferentiated organoid. We compared 2 differentiated organoid X with H m for each gene and set the minimum value of X m = 3 (median(u m1 , x 1 , h m1 ) , ⋯ median(u mn , x n , h mn )) as the RPKM value for the differentiated 4 organoid. If X m = H m , the score value is set at 100%, and the closer the U m , the more the 5 score value decreases and the farther away it is from the heart. The score is as follows: The same expression value is used for the lung and stomach. For the lung, the RPKM value of 9 the k th lung organ sample is L k = (l k1 , ⋯ , l kn ) , k = 1, ⋯ , K , K = 8 and X m = 10 (min(x 1 , l m1 ) , ⋯ min(x n , l mn )). The score is as follows: