scMultiSim: simulation of multi-modality single cell data guided by cell-cell interactions and gene regulatory networks

Simulated single-cell data is essential for designing and evaluating computational methods in the absence of experimental ground truth. Existing simulators typically focus on modeling one or two specific biological factors or mechanisms that affect the output data, which limits their capacity to simulate the complexity and multi-modality in real data. Here, we present scMultiSim, an in silico simulator that generates multi-modal single-cell data, including gene expression, chromatin accessibility, RNA velocity, and spatial cell locations while accounting for the relationships between modalities. scMultiSim jointly models various biological factors that affect the output data, including cell identity, within-cell gene regulatory networks (GRNs), cell-cell interactions (CCIs), and chromatin accessibility, while also incorporating technical noises. Moreover, it allows users to adjust each factor’s effect easily. We validated scMultiSim’s simulated biological effects and demonstrated its applications by benchmarking a wide range of computational tasks, including cell clustering and trajectory inference, multi-modal and multi-batch data integration, RNA velocity estimation, GRN inference and CCI inference using spatially resolved gene expression data. Compared to existing simulators, scMultiSim can benchmark a much broader range of existing computational problems and even new potential tasks.


21
In recent years, technologies that profile the transcriptome and other modalities (multi-omics) of single cell have 22 brought remarkable advances in our understanding of cellular mechanisms [61]. For example, technologies 23 have enabled the joint profiling of chromatin accessibility and gene expression data [10; 8; 41], as well 24 as the measurement of surface protein abundance alongside transcriptome [56; 47]. Additionally, spatial 25 locations of cells can be measured together with transcriptome profiles using imaging-based [52; 19; 63] or 26 sequencing-based [55; 50] technologies. 27 The advent of single-cell multi-omics data has facilitated a more comprehensive understanding of cellular 28 states, and more importantly, allowed researchers to explore the relationships between modalities and the 29 causality across hierarchies [18]. Prior to the availability of single cell multi-omics data, gene regulatory 30 network (GRN) inference methods were developed using only single-cell RNA sequencing (scRNA-seq) data [48]. 31 However, these methods mainly focused on transcription factors (TFs) as the sole factor affecting gene 32 expressions. In reality, the observed gene-expression data is affected by multiple factors, such as the chromatin 33 accessibility of corresponding regions. Consequently, newer methods utilizing both scRNA-seq and scATAC-seq 34 data have been developed to infer GRNs [30;62;68]. Similarly, there has been a surge in the development 35 of other computational tools that harness multi-modality information. For instance, Cell-Cell Interaction (CCI) 36 inference methods seek to utilize both the gene expression and the spatial location modalities [16; 53; 5; 6] to 37 learn the interactions with a lower false-positive rate than those using only scRNA-seq data [4; 26; 29]. Data 38 integration methods combine multi-omics data to obtain a wholistic view of cells [58; 64; 1; 70; 36]. Moreover, 39 RNA velocity can be inferred from unspliced and spliced counts using scRNA-seq data to indicate the near-future 40 state of each cell [35; 3]. Recently, methods have also been proposed to infer RNA velocity from jointly profiled 41 chromatin accessibility and transcriptomics data [38]. 42 Overall, a large number of computational methods have been developed using scRNA-seq data or single cell 43 multi-and spatial-omics data [66]. However, the scarcity of ground truth in experimental data makes it difficult 44 to evaluate the performance of proposed computational methods. To address this, de novo simulators have 45 been widely used to evaluate the accuracy of computational methods by generating data that models biological 46 mechanisms and provides ground truth for benchmarking. SymSim [69], for example, provides ground truth cell 47 identity and gene identity and thus can benchmark clustering, trajectory inference and differential expression 48 detection. SERGIO

87
In the following sections, we will provide a brief overview of the core concepts and the simulation process of 88 scMultiSim. We will then demonstrate its capability to simulate multiple biological factors simultaneously by 89 validating the effects of each factor on the output data. Furthermore, we will showcase the applications of 90 scMultiSim by using it to benchmark a wide variety of computational tools. 91 scMultiSim overview 92 The kinetic model and control of intrinsic noise. In general, scMultiSim runs the simulation in two phases 93 (Fig. 1b). In the first phase, scMultiSim employs the widely-accepted kinetic model [46] to generate the true 94 gene expression levels in cells ("true counts"). In the second phase, scMultiSim introduces technical variations 95 (library preparation noise, batch effects, etc) and generate scRNA-seq and scATAC-seq data that are statistically 96 comparable to real data ("observed counts"). To model cellular heterogeneity and gene regulation effects, 97 scMultiSim introduces two main concepts: Cell Identity Factors (CIFs) and Gene Identity Vectors (GIVs) (Fig. 1b   98 (i, ii)). Biological factors, including cell population (cell identity), GRNs, and CCIs, are encoded in CIFs and GIVs 99 (Fig. 2a). Additionally, to model single-cell chromatin accessibility, we also introduce Region Identity Vectors 100 (RIVs, Fig. 1b(iii)). Further details on CIF, GIV and RIVs are provided in the next section. 101 When simulating single cell gene expression data, scMultiSim extends the idea of SymSim [69], where a kinetic 102 model with three major parameters k on , k off , s was used to determine the expression pattern of a gene in a cell 103 ( Fig. 1b (vi)). In the kinetic model, a gene can switch between on and off states, with k on and k off be the rates of 104 becoming on and off. When a gene is in the on state (which can be interpreted as promoter activation), mRNAs 105 are synthesized at a rate s and degrade at a rate d. It is common to fix d at 1 and use the relative values for the 106 other three parameters [43]. The kinetic parameters k on , k off , s are calculated from the CIF and GIV, as well as 107 the corresponding scATAC-seq data (because chromatin accessibility is considered to affect gene expression). 108 Since GIVs and CIFs encode information on cell identity, GRNs, and CCIs, the kinetic parameters thus capture 109 the four biological factors that affect gene expression: cell identity, chromatin accessibility, GRNs, and CCIs. 110 The kinetic model used in scMultiSim provides two modes for generating true counts from the parameters, 111 as shown in Fig. 1b (vii). The first mode is the full kinetic model, where genes undergo several cell cycles over 112 time with on/off state changes, and the spliced/unspliced RNA counts are calculated. This mode provides the 113 ground truth for RNA velocity since the RNA synthesis rate is known. The second mode is the Beta-Poisson 114 model, which is equivalent to the kinetic model's master equation [31], and is faster to run than the full kinetic 115 model. The Beta-Poisson model is recommended when RNA velocity is not needed. In the Beta-Poisson model, 116 scMultiSim also introduces an intrinsic noise parameter σ i that controls the amount of intrinsic noise caused by 117 the transcriptional burst and the snapshot nature of scRNA-seq data. This parameter allows users to examine 118 the influence of intrinsic noise on the performance of the computational methods. The two modes and the σ i 119 parameter are described in Methods.
Identity Vectors (GIVs) allows scMultiSim to encode cell identities and gene-level mechanisms (such as GRNs 122 and CCIs) into the kinetic parameters and thereby impact the gene expression levels. This design also provides 123 easy ways to adjust the effect of each factor on the output gene expression data. 124 The CIF of a cell is a 1D vector representing various biological factors that contributes to cellular heterogeneity, 125 such as the cell condition (e.g. treated or untreated), or the expression of key TFs. The GIV of a gene act as 126 the weights of the corresponding factors in the CIF, representing how strongly the corresponding CIF affect the 127 gene's expression (Fig. 2a, Methods). By multiplying the CIF and GIV matrices, scMultiSim therefore generates 128 a n cell ◊ n gene matrix, which is the desired kinetic parameter matrix with the cell and gene factors encoded. 129 Each CIF vector and GIV vector consists of four segments, each representing one type of extrinsic variation. 130 They encode biological factors including cell identity (cell population, i.e., the underlying cell trajectories or 131 clusters), GRNs, and CCIs (Figs. 2a, S1a-b). We introduce the four segments in the following. 132 (i) Non-differential CIFs (non-diff-CIF) model the inherent cellular heterogeneity. They represent various 133 environmental factors or conditions that are shared across all cells and are sampled from a Gaussian distribution 134 with standard deviation σ cif . 135 (ii) Differential CIFs (diff-CIF) control the user-desired cell population. These are the biological conditions that 136 are unique to certain cell types. These factors lead to different cell types in the data. For a heterogeneous cell 137 population, cells have different development statuses and types. Values for diff-CIFs are used to represent these 138 cell differential factors, which are generated based on the user-input cell differential tree. When generating data 139 for cells from more than one cell type, the minimal user input of scMultiSim is the cell differential tree, which 140 controls the cell types (for discrete populations) or trajectories (for continuous populations) in the output. The 141 process of generating diff-CIFs is described in Methods. 142 (iii) CIFs corresponding to Transcription Factors (tf-CIF) control the effects of GRNs. This segment, together 143 with the TF segment in the GIV, model how a TF can affect expression of genes in the cell (Methods). Its length 144 equals to the number of TFs. In other words, the GRN is encoded in the tf-CIFs and GIVs. 145 (iv) CIFs corresponding to ligands from neighboring cells (lig-CIF) control the effect of CCI. If CCI simulation 146 is enabled, this segment together with the ligand segment in the GIV of the receptor gene encodes the ground 147 truth CCI between two cells. This encoding ensures that a ligand and its interacting receptor have correlated 148 gene expression. A receptor can also interact with ligands of multiple neighbors (Fig. 2a (viii)). The GIV matrices 149 are generated carefully considering the nature of the kinetic model (Methods). 150 The simulation process. Fig. 1b shows an overview of the simulation process. The scATAC-seq data is generated 151 at first (Fig. 2b(iv)), because we consider that the chromatin accessibility of a cell affects its gene expression. 152 The scATAC-seq data also follows a pre-defined clustering or trajectory structure represented by the input cell 153 differentiation tree. Similar to the gene expression, we multiply the CIF with a Region Identity Vector (RIV) 154 matrix, which represents the effect of each CIF on the accessibility of chromatin regions. Details on generating 155 the scATAC-seq data are included in Methods. The scATAC-seq data affects scRNA-seq data through the k on 156 parameter, because chromatin accessibility controls the activated status of genes (Methods). 157 After obtaining all the kinetic parameters, scRNA-seq data can be generated in different modes: with or without 158 CCIs and spatial locations, and with or without outputting RNA velocity data ( Fig. 1b (vii, viii)). If the user specify 159 to generate RNA velocity, the full kinetic model is used, where cells undergo several cycles before the spliced 160 and unspliced counts are outputted (Methods). Otherwise, if the Beta-Poisson model is used, and the true counts 161 are sampled from the Beta-Poisson distribution. In this mode, RNA velocity and unspliced count data are not 162 outputted.

163
Simulating cell-cell interaction. If specified to generate spatial-aware single cell gene expression data including 164 cell spatial locations and CCI effects, scMultiSim uses a multiple-step approach that considers both time and 165 space ( Fig. 1b (viii), Fig. S1c). The simulation consists of a series of steps, with each step representing a time 166 point. Cells are placed in a grid ( Fig. 2a (ix), Fig. S1d), and one cell is added to the grid at each step, representing 167 a newborn cell. Users can use the parameter p n to control the probability for the newborn cell to locate with cells 168 of the same type (Methods). As experimental data cannot measure cells at previous time points, scMultiSim 169 outputs data only for cells at the final time point, which contains the accumulated CCI effects during the cells' 170 developmental process. 171 To simulate CCI, scMultiSim requires a user-inputted list of ligand-receptor gene pairs that can potentially 172 interact, which is called a ligand-receptor database. Users can input cell-type-level or single cell level CCI ground 173 truth. If users do not provide ground truth CCIs, scMultiSim can randomly generate the ground truth from the 174 ligand-receptor database. 175 Technical variations and batch effects. The steps described above belong to the first phase, which generates the 176 "true" mRNA counts (and unspliced counts if RNA velocity mode is enabled) in the cells. In the second phase, 177 scMultiSim simulates key experimental steps in wet labs that lead to technical noises in the data and output the 178 observed scRNA-seq data. Batch effects can also be added to simulate datasets from a user-specified number 179 of batches. Users can also control the amount of technical noise and batch effects between batches. These 180 procedures are described in Methods. Next, we show the various output of scMultiSim and validate the effects 181 present in the simulated data. 183 We have generated a comprehensive set of datasets using scMultiSim to demonstrate the effects of different to minimize potential bias and provide a more comprehensive benchmark of the computational methods. 192 As presented in Table 1, we label the main datasets with the following format: M{p}{c}{s}. The first letter M 193 denotes the main dataset, followed by a letter p oe{L, T, D} that specifies the cell population as linear trajectory, 194 tree trajectory or discrete, respectively. The number c oe [1, 12] denotes a particular configuration of σ cif , n cell , and 195 n gene , while the last lowercase letter s oe{a, b, c, d} represents random seed 1-4. For instance, the dataset MD5c 196 has a discrete cell population, σ cif =0.1, 800 cells, 200 genes and random seed 3. 197 We have also generated auxiliary datasets with fewer types of effects and presented them in Table 2. These 198 datasets allow us to explore the effect of other parameters and are compatible with computational methods 199 that impose additional constraints on the input. In the remaining, we will primarily use the main datasets M for 200 benchmarking and demonstration, while the auxiliary datasets will serve as additional and supplementary results. 201 scMultiSim generates multi-batch and multi-modality data from pre-defined clusters or trajectories 202 scMultiSim offers a key advantage in its ability to generate coupled scRNA-seq and scATAC-seq data while 203 allowing users to control the shape of trajectories or clusters. It is accomplished by offering various parameters 204 to control the structure of cell populations. First, the user can choose to generate "continuous" or "discrete" 205 populations, and input a tree that represents the cell trajectories (in the case of "continuous" populations) or 206 relationship between clusters (in the case of "discrete" populations). We name the tree "differentiation tree". 207 scMultiSim provides three example differentiation trees: Phyla1, Phyla3, and Phyla5, each having 1, 3, and 208 5 leaves, as illustrated in Fig. 2b. The main datasets were simulated using these trees (Table 1). From a 209 differentiation tree, scMultiSim is able to generate both discrete and continuous cell populations (Fig. 2c). Then, 210 users can use these three parameters: intrinsic noise σ i , CIF sigma σ cif and Diff-to-nonDiff CIF ratio r d , to 211 control how clean or noisy the population structure is in the data (Fig. 2c-e). 212 For the continuous population, we visualize a dataset MT3a generated using tree Phyla3 in Fig. 2c. We 213 can observe that the trajectories corresponding to the input differentiation tree are clearly visible for both the 214 scRNA-seq and the scATAC-seq modality. For the discrete population, we visualize dataset MD3a and MD9a 215 generated with tree Phyla5 in Fig. 2d. The parameter σ cif controls the standard deviation of the CIF, therefore 216 with a smaller σ cif , the clusters are tighter and better separated from each other. We then used the auxiliary 217 dataset A (Table 2) to explore the effect of the intrinsic noise parameter σ i and r d , the ratio of number of diff-CIF 218 to non-diff-CIFs. In Fig. 2e, we visualize the scRNA-seq modality generated using Phyla5 continuous mode 219 with the same σ cif . With a smaller Diff-to-nonDiff CIF ratio r d , the trajectory is vague and more randomness is 220 introduced, as the tree structure is encoded in the differential CIFs. With a smaller intrinsic noise σ i , a fraction 221 of the expression value is directly calculated from kinetic parameters without sampling from the Poisson model; 222 As a result, the trajectory is more prominent. These patterns are much cleaner than real data because real data 223 always has technical noise. We will show more results with technical noise in later sections and in Fig. S2. 224 Coupling between scATAC-seq and scRNA-seq data. In paired scATAC-seq and scRNA-seq data, these two 225 data modalities are not independent of each other, as it is commonly considered that a gene's expression 226 level is affected by the chromatin accessibility of the corresponding regions. If a gene's associated regions 227 are accessible, this gene is more likely to be expressed. This mechanism can be naturally modeled in scMultiSim 228 through the kinetic parameter k on (Methods). 229 We provide a user-adjustable parameter, the ATAC-effect E a , to control the extent of scATAC-seq data's effect 230 on k on (ranging from 0 and 1). In order to validate the connection between the scATAC-seq and scRNA-seq data, 231 we calculate the mean Spearman correlation between these two modalities for genes that are controlled by one 232 region in the scATAC-seq data. In The strength of scMultiSim also resides in its ability to incorporate the effect of GRN and CCI while preserving 256 the pre-defined trajectory structures. In this section, we show that the GRN and CCI effects both exist in the 257 simulated expression data. The main datasets (Table 1)  to each cell (Methods), for example, a newborn cell has a probability p n of staying with a cell of the same type. 277 Changing p n allows us to generate different tissue layouts. In real data, how likely cells from the same cell type 278 locate together depends on the tissue type, and scMultiSim provides p n to tune this pattern.  (Table S1). We compare the simulated data with real data in terms of the following properties: library size, zero 296 counts per cell, zero counts per gene, mean count per gene, variation per gene, and the ratio between zero count 297 and mean count per gene (Fig. 3g). 298 Fig. 3g shows that the library size, zero counts per cell, zero counts per gene and mean counts per gene 299 simulated by scMultiSim are closer to that of real data than the dyngen simulated data, and both scMultiSim and 300 dyngen are able to simulate data with realistic variation per gene. There is also usually a negative correlation 301 between zero counts and mean counts in real data, and scMultiSim is able to simulate this relationship, matching 302 well with the reference data.

303
Benchmarking computational methods using scMultiSim 304 We next show that scMultiSim can be used to benchmark a board range of computational tasks in single cell 305 genomics, including clustering, trajectory inference, multi-modal data integration, RNA velocity estimation, GRN 306 inference and CCI inference using spatially resolved single cell gene expression data. Using scMultiSim, we 307 studied the performance of several recent methods on each task, and also investigated the effect of particular 308 parameters for some of the benchmarks. As far as we know, scMultiSim is the only simulator that can benchmark 309 all these tasks. It is noteworthy that our intention is not to perform a comprehensive benchmarking analysis, but 310 rather to show evidence of scMultiSim's broad applications. We anticipate that these benchmarks can encourage 311 forthcoming researchers to discover more use cases of scMultiSim.

312
Benchmarking clustering and trajectory inference methods 313 We first applied scMultiSim to test methods for two classic problems: cell clustering and trajectory inference, 314 using the scRNA-seq modality in our discrete main datasets (MD,  (Fig. 4a). For each method and each dataset in 316 the main datasets, we vary the parameter "number of clusters". Since Seurat does not provide direct control over 317 the number of clusters, we varied the resolution parameter instead and plotted using the number of clusters 318 in the results. From Fig. 4a, all methods have the best performance when the cluster number is the true 319 value. In general, Seurat and SC3 have better performance than the others, which is consistent with previous The result is shown in Fig. 4c. Since Seurat-bridge does not output the latent embedding for the "bridge" 345 dataset (batch 1 in Fig 4d), only the two matrices from batches 2 and 3 (colored in Fig. 4d) were used for 346 evaluation. We observe that UINMF has the best performance in terms of all four measurements. Seurat-bridge 347 and Cobolt have comparable ARI and NMI but Cobolt has better batch mixing scores. When comparing the 348 ARI and NMI scores for σ cif =0.1 and σ cif =0.5, one can observe that these cell identity preservation scores 349 are higher with smaller σ cif . Comparing different cell population structures, we see that continuous populations 350 ("Linear" and "Tree") have lower ARI and NMI scores than discrete populations, potentially because that metrics 351 like ARI and NMI are better suited for discrete populations. 352 We then ran the integration methods on a large dataset with 3000 cells and visualized the integrated 353 latent embedding in Fig. S5, which helped us to understand each method's behavior. We noticed that while 354 Seurat-bridge has lower graph connectivity and ASW scores, different batches are located closely (but do not 355 overlap) in the visualized latent space. That the reference and query data in the latent space do not overlap can 356 cause the low batch mixing scores, but may not affect the ability of label transfer.

357
Benchmarking RNA velocity estimation methods 358 We demonstrate scMultiSim's ability of benchmarking RNA velocity estimation methods by running two 359 representative RNA velocity inference methods, scVelo [3] and VeloCyto [35], on the simulated data. We compare 360 all three models in scVelo package, including the deterministic, stochastic, and dynamical models. The auxiliary 361 dataset V (Table 2)  it is still a challenging task to infer RNA velocity from noisy scRNA-seq datasets.

379
Benchmarking GRN inference methods 380 Using scMultiSim, we benchmarked 11 GRN inference methods which were compared in a previous 381 benchmarking paper [48]. Using the predicted networks, we calculate the AUROC (area under receiver operating 382 characteristic curve) as well as the AUPRC (area under precision-recall curve) ratio, which is the AUPRC divided 383 by the baseline value (a random predictor). These metrics were also used in previous benchmarking work [48]. 384 We show results on the 144 main datasets in Fig. 5a. To further inspect the performance in a less-noisy 385 scenario, we also generated auxiliary datasets G (Table 2) with a linear trajectory and without CCI effect. 386 We benchmarked the methods using true counts and observed counts in G, respectively. The result of G is 387 shown in Fig. 5b indicating that GRN inference is still a challenging problem. 396 Notably, the ordering of the methods tested using true counts is generally consistent with the ordering reported  Fig. 5c. Since SpaTalk needs a minimum of 3 genes from the receptor to a downstream activated 408 TF, we also generated an auxiliary dataset C (Table 2) using an artificial GRN with long pathways to satisfy such 409 requirement (Fig. S6a). There are totally eight C datasets with 500 cells, 200 genes and a linear trajectory, and the 410 result is shown in fig. S6b  allows users to set the trajectory to any shape without affecting the GRN effects. Users can also let the GRN 446 control the trajectory by increasing the number of non-diff CIF. 447 We underline that scMultiSim's major advantage is its ability to encode various factors into a single versatile 448 model, thus creating a comprehensive multi-modal simulator that can benchmark an unprecedented range 449 of computational methods. More importantly, the coupled data modalities in the output jointly provide more 450 information than a single modality alone, making it ideal for designing and benchmarking new methods on 451 multi-omics data. We believe that scMultiSim has the potential to be a powerful tool for fostering the development 452 of new computational methods for single-cell multi-omics data. Moreover, as more benchmarks are conducted, it 453 can help researchers select the appropriate tool based on the type of data they are working with, leading to more 454 accurate and reliable analyses. The sampling process from the Beta-Poisson distribution to obtain x introduces intrinsic noise to the data, which corresponds to the intrinsic noise in real data caused by transcription burst. The theoretical mean of the kinetic model, which is ( k on k on +k off · s), corresponds to the gene expression level of the gene with no intrinsic noise. We introduced parameter σ i which controls the intrinsic noise by adjusting the weight between the random samples from the Poisson distribution and the theoretical mean: The intrinsic noise in scRNA-seq data is hard to reduce in experiments due to the snapshot nature of scRNA-seq 462 data. The parameter σ i allows users to investigate the effect of intrinsic noise on the performance of the 463 computational methods.

465
The length of the CIF and GIV, denoted by n cif , can be adjusted by the user. Overall, we have a n cell ◊ n cif CIF 466 matrix for each kinetic parameter (Fig. S1a), where each row is the CIF vector of a cell. Correspondingly, we also 467 have the n cif ◊ n gene Gene Identity Vectors (GIV) matrix, (Fig. S1b) where each column is linked to a gene, acting 468 as the weight of the corresponding row in the CIF matrix, i.e. how strong the corresponding CIF can affect the 469 gene. In short, CIF encodes the cell identity, while GIV encodes the strength of biological effects. Therefore, by 470 multiplying the CIF and GIV matrix, we are able to get a n cell ◊ n gene matrix, which is the desired kinetic parameter 471 matrix with the cell and gene effects encoded. Each cell has three CIF vectors corresponding to the three kinetic 472 parameters k on , k off , and s, and similarly for the GIV vectors ( Fig. S1a-b). 473 When generating data for cells from more than one cell type, the minimal user input of scMultiSim is the cell 475 differentiation tree, which controls the cell types (for discrete population) or trajectories (for continuous population) 476 in the output. The generated scRNA-seq and scATAC-seq data reflect the tree structure through the diff-CIF 477 vectors. The diff-CIF vectors are generated as follows: starting from the root of the tree, a Gaussian random walk 478 along the tree (Fig. 2a) is performed for each cell to generate the n diff-CIF dimension diff-CIF vector. Parameter 479 σ cif controls the standard deviation of the random walk, therefore a larger σ cif will produce looser and noisier i is the expression level of the i th TF in the t th cell. The corresponding tf-CIF for the root cell is sampled 499 randomly from the Gaussian distribution N cif supplied by the user. 500 The TF part of the GIV for a gene also has length of n TF (Fig. S1b). Considering all genes, we have a n gene ◊ 501 n TF matrix, which we call the GRN effect matrix. This matrix encodes the ground truth GRN that is supplied by 502 E lig-CIF and GIV encode cell-cell interactions the user. Naturally, the GRN effect matrix is included in the GIV when calculating the s parameter, where the 503 value at (i, j) is the regulation strength of TF j on gene i. Therefore, a larger regulation strength will lead to 504 higher s, and consequently, higher expressions for the target genes. For k on and k off , the tf-CIF vector is sampled 505 using N cif , assuming that the GRN does not affect the active state of a gene. However, in the scenario where it 506 is desired to model GRN effect also in k on and k off , similar GRN effect matrix for s can be used for k on and k off . ligand CIF for the current step is inherited from the previous one to make sure other factors stay the same. 547 We use the following steps to calculate the correlation between the expressions of neighboring cells in Fig. 3e. of the corresponding factors in the CIF, i.e., how strong the corresponding CIF can affect the gene (Fig. 2a). If we 558 have n gene genes, we obtain a GIV matrix of size n cif ◊ n gene . 559 It can be divided into four submatrices as shown in Fig. S1b. For k on and k off , the nd-CIF and diff-CIF are 560 sampled from distribution G as shown below: where p G 0 is a parameter specifying the probability of being zero, and N giv is a user-adjustible gaussian 562 distribution. tf-GIV and lig-GIV are all zeros since TF/ligands affect s only. For s, the tf-GIV submatrix is the 563 GRN effect matrix, i.e. a n TF ◊ n gene matrix where the entry at (i, j) is the regulation effect between TF i and 564 gene j. Similarly, the lig-GIV submatrix is the cell-cell interaction effect matrix. The nd-GIV submatrix is sampled 565 from G. For diff-GIV, we do the following steps to incorporate the connection between TFs and regulated genes: 566 G Simulating scATAC-seq data and relationship between scATAC-seq and scRNA-seq (1) Randomly select 2 GIV entries for each TF gene and give them a fixed small number.
(2) For every target 567 gene, it should use the same GIV vector as its regulators. If a gene has multiple regulators, its gene effects will 568 be the combination of that of the regulators. This is achieved by multiplying the n diff ◊ n TF GIV matrix in (1) and 569 the n TF ◊ n gene effect matrix. If a gene is both a TF and target, its GIV will be 0.5 · ((1) + (2)).

570
G. Simulating scATAC-seq data and relationship between scATAC-seq and scRNA-seq 571 Since scMultiSim incorporates the effect of chromatin accessibility in the gene expressions, the scATAC-seq data 572 is simulated before the scRNA-seq data. The cell types in the scATAC-seq data can follow the same differentiation 573 tree as in the scRNA-seq data (the scATAC-seq and scRNA-seq data share the same cells) or can follow a 574 different tree (to reflect the difference between modalities). 575 Similar to GIV, we use a randomly sampled Region Identity Vector (RIV) matrix to represent the chromatin 576 regions. Following the same mechanism, we multiply the CIF and RIV matrix, and obtained a "non-realistic 577 scATAC-seq" data matrix. Next, the scATAC-seq data matrix is obtained by scaling the "non-realistic" scATAC-seq 578 data to match a real distribution learned from real data. This is a step to capture the intrinsic variation of 579 the chromatin accessibility pattern, which we will also apply to the kinetic parameters when generating gene 580 expressions.

581
The RIV matrix is sampled from a distribution R similar to G: where p R 0 is the probability of being zero and N riv is a user-adjustable Gaussian distribution. With the CIF and 583 RIV matrices, the n cell ◊ n region scATAC-seq can be generated by (1) multiplying the CIF matrix by the RIV matrix, 584 (2) scale the matrix to match the real data distribution, and (3) adding intrinsic noise (sampled from a small 585 Gaussian) to the scATAC-seq data. In Step (2), we use the same rank-based scaling process as used for the 586 kinetic parameters as described in Section "Preparing the kinetic parameters" above, and the real scATAC-seq 587 data distribution is obtained from the dataset in [11]. 588 To incorporate the relationship between scATAC-seq and scRNA-seq data, we use the scATAC-seq data to 589 adjust the k on parameter that is used to generate the scRNA-seq data, considering that chromatin accessibility 590 affects the activated status of genes. First, a region-to-gene matrix (Fig.1b) is generated to represent the mapping respectively. The scATAC-seq data is also used to adjust k on as described in the following section. 596 The kinetic parameters, k on , k off and s are needed when generating single cell gene expression data (mRNA 597 counts) using the kinetic model or Beta-Poisson distribution (Fig. 1b). While the basic idea is to get the parameter 598 matrix using CIFs and GIVs (Fig. 1b), the three parameters go through different post-processing after the step of 599 CIF ◊ GIV. We first denote the result of CIF ◊ GIV for k on , k off and s as M 1 , M 2 and M 3 , respectively. 600 (i) k on . Since chromatin accessibility controls the activation of the genes, the scATAC-seq data is expected 601 to affect the k on parameter. We first prepare a n region ◊ n gene 0-1 region-to-gene matrix Z using r, where Z ij 602 indicates region i is associated with gene j (Z is outputted as the region-to-gene matrix). We multiply the  . We also provide an optional cell length factor η L parameter to scale the cycle length. The probabilities of gene switching on or off are then calculated with p on = k on m·L and p off = k off m·L . In each simulation step, we update the cell's current on/off state based on p on and p off , and generate the spliced/unspliced counts x s and x u . The spliced counts at step t are obtained by:

H. Preparing the kinetic parameters
and the unspliced counts are obtained by: if state is off The outputted x s and x u are the values at the final step t = m. The ground truth RNA velocity is calculated as: We obtain the KNN averaged RNA velocity by applying a Gaussian Kernel KNN on the raw velocity data, with

666
When evaluating the trajectory inference methods, we calculate the coefficient of determination R 2 and the kNN purity for all cells on each lineage. Given the cells' ground truth pseudotime vector t and the inferred pseudotimê t, the R 2 is equal to the square of the Pearson correlation coefficient: wheret is the mean of t. Given a cell i's kNN neighborhood Nt i int and its kNN neighborhood N t i in t, the kNN 667 purity K p for the cell is the Jaccard Index of N t i and Nt i .

668
The evaluation metrics used for multi-model data integration methods, Graph Connectivity and ASW, are 669 described as following. 670 M Details on running clustering methods Graph Connectivity is defined as: where C is all cell types, LCC(c) is in the largest connected component for cells of type c. 672 For the ASW: where M is the set of all cell types, and C j is all the cells of type j. We used the implementation in [40]. 674 When evaluating RNA velocity inference methods, we used the cosine similarity between the averaged estimated velocity and the ground truth. Calculating the average of estimated velocity vectors is commonly used to reduce local noise [7]. In dyngen [7], averaged RNA velocities were calculated across cells at trajectory waypoints weighted through a Gaussian kernel using ground truth trajectory; while in scMultiSim, we averaged the raw velocity values by kNN with a Gaussian kernel and k = n cells /50 to achieve a similar averaging effect.
Finally, cosine similarity is calculated as: where v i is the ground truth velocity vector for cell i, and u i is the predicted velocity vector.

O. Details on running data integration methods
We use all 144 main datasets. Technical noise and batch effects were added using default parameters (non-UMI, 691 α ≥N(0.1, 0.02), depth ≥N(10 5 , 3000), ATAC observation probability 0.3). All integration methods were run on 692 the scRNA and scATAC data with technical noise and batch effect. For Seurat-bridge, we followed the vignette 693 "Dictionary Learning for cross-modality integration" in Seurat 4.1.0 using the default parameters. For UINMF, 694 we used the latest GitHub release. We followed the "UINMF integration of Dual-omics data" tutorial and ran the 695 optimizeALS method using k = 12. For Cobolt, we used the GitHub version cd8015b, with 10 latent dimensions, 696 learning rate 0.005. If the loss diverged, we automatically retry with learning rate 0.001. The metrics, including 697 ARI, NMI, Graph Connectivity, and ASW were computed using the scib [40] package. 698 P. Details on running RNA velocity estimation methods 699 We use the datasets V to benchmark RNA velocity inference methods as shown in Table 2 703 We use the BEELINE [48] framework to benchmark GRN inference methods. Apart from the main datasets, The 704 dataset G (Table 2) was generated using the following configurations: The 100-gene GRN in Fig. 3, 1000 cells, 705 50 CIFs, r d =0.2, σ i =1, with other default parameters. Eight datasets were generated for random seed 1 to 8, 706 and technical noise and batch effect was added using default parameters. We ran the BEELINE GitHub version 707 79775f0. In order to resolve runtime errors, all docker images were built locally, except that we used the provided 708 images on Docker Hub for PIDC and Scribe. We use BEELINE's example workflow to infer GRN and calculate the 709 AUPRC ratio and AUROC for (a) true counts in the eight datasets, and (b) observed counts with batch effects in 710 the eight datasets. The AUPRC ratio is the AUPRC divided by the AUPRC of a random predictor, which equals to 711 the network density of the ground truth network. Eleven methods were benchmarked in total: PIDC, GRNBoost2, 712 GENIE3, Sincerities, PPCOR, LEAP, GRISLI, SINGE, GRNVBEM, Scribe and SCODE. 713 R. Details on running CCI inference methods 714 We generated 12 datasets using the following procedure. Apart from the main datasets, for each C dataset 715 (Table 2), we first construct the GRN (Fig. S6a) Connect gene 54-100 to gene 110-156. In this way, we can generate a GRN with reasonable edge density and 718 make sure that there are three downstream genes for each TF, which is required by SpaTalk. Then we construct 719 R Details on running CCI inference methods the ligand-receptor pairs: let the ligands be gene 101-106 and receptors be gene 2, 6, 10, 8, 20, and 30. We 720 divide a linear trajectory into 5 sections, corresponding to 5 cell types. Between each cell type pair (excluding 721 same-type pairs), we sample 3-6 ligand-receptor pairs and enable cell-cell interactions with them for the two cell 722 types. The dataset is then simulated using 160 genes in total, 500 cells, and 50 CIFs. We use the true counts to 723 benchmark the methods. 724 To run SpaTalk, we modify the original plot_lrpair_vln method to return the p-value from the Wilcoxon rank sum test directly, rather than drawing a figure. Before using the p-values to calculate the precision and recall, we adjusted them using Bonferroni correction:p

Q. Details on running GRN inference methods
where p is the p-value vector for all cell types and ligand-receptor pairs. For Giotto, we used the R package 1.  . Overview of scMultiSim. (a): The input, output, and use cases. The minimal required input is a cell differential tree describing the differentiation relationship of cell types. It controls the cell trajectory or clusters in the output. A user-input ground truth GRN is recommended to guide the simulation. Users can also provide ground truth for cell-cell interaction and control each simulated biological effects using various parameters. (b): The overall structure of scMultiSim. The scATAC-seq data (iv) is firstly generated using CIF (i) and RIV (iii). The kinetic parameters used to generate scRNA-seq data (vi) is prepared using GIV (ii), CIF (i) and the scATAC-seq data with (v) a region-to-gene matrix. Using the parameters, either the full kinetic model (when RNA velocity is required), or the Beta-Poisson model (when running speed matters) will be used to generate the scRNA-seq data (vii). scMultiSim uses a multiple-step approach that considers both time and space when CCI is enabled (viii). With the simulated true counts (viv), technical noise and batch effects can be added to obtain the observed counts (x).  Comparison between a real dataset and simulated data using multiple statistical measurements. Parameters were adjusted to match the real distribution as close as possible. results on benchmarking GRN inference methods using datasets G that does not contain CCI effects. We also tested the performance on observed counts with technical noise. (c) Benchmarking cell-cell interaction inference methods. Each curve in the ROC/PRC plots correspondings to one dataset.