Target-Oriented Reference Construction for supervised cell-type identification in scRNA-seq

Cell-type identification is the most crucial step in single-cell RNA-seq (scRNA-seq) data analysis, and supervised cell-type identification is a desirable solution because of its accuracy and efficiency. The performance of such methods depends heavily on the quality of the reference data. Although many supervised cell-type identification tools exist, there is no method for selecting and constructing reference data. Here we develop Target-Oriented Reference Construction (TORC), a widely applicable strategy for constructing a reference for a given target dataset in supervised scRNA-seq cell-type identification. TORC alleviates the differences in data distribution and cell-type composition between reference and target. Extensive benchmarks on simulated and real data demonstrate consistent improvements in cell-type identification from TORC. TORC is freely available at https://github.com/weix21/TORC.


Background
Computational cell-type identification (referred to as "cell-typing" hereafter) is the most fundamental step in single-cell RNA sequencing (scRNA-seq) analysis [1]. Supervised cell-typing has gained increasing popularity over unsupervised clustering, due to better accuracy and robustness [2].
Supervised cell-typing trains on a reference (training) sample of cells with known cell-type labels, and assigns labels to cells in a target (testing) sample using the trained classifier [3]. A key factor that determines a classifier's success is the quality of the training data, especially the similarity between the reference and the target populations [4]. A natural idea is to expand the sources of the reference and increase the reference size. However, the benefit of a larger reference sample is limited by its quality.
Much effort has been put into developing new cell-typing algorithms [5][6][7]. Not enough attention has been given to selecting and constructing reference data, which we argue is more fundamental than the choice of algorithm. To fill the gap, we develop a novel method named "Target-Oriented Reference Construction (TORC)". We first demonstrate the importance of reference quality and show that reference quality should be viewed with respect to the target; thus an appropriate reference is target-oriented. TORC provides a general strategy that minimizes the difference in cell-type composition as well as in cell-type-specific expression profiles between the reference and the target. We demonstrate the improvement from using TORC in extensive real data examples.

Algorithm Overview
TORC aims to construct a reference suitable for a given target dataset. It first uses an off-the-shelf supervised method to label the target cells, from which it estimates the cell-type composition of the target. TORC then adds target cells with high-confidence labels to the reference to form an expanded reference pool. Next, TORC resamples from the pool to construct a new reference according to the target composition. The reconstructed reference is used to build the final classifier (Fig. 1).
The main goal of TORC is to construct a reference that closely resembles the target. TORC employs a two-round prediction strategy. In the first round, a classifier is trained on a reference with known cell labels to predict the cell-type composition of the target. Using the predicted probability matrix from this initial prediction, an entropy is calculated for each cell. Cells with relatively low entropies are then added to expand the reference pool. A target-oriented reference is subsequently created by sampling from the expanded reference pool according to the estimated cell-type composition. The constructed reference is then used to retrain the classifier and update the cell labels of the target.
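The reference-construction steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the released TORC implementation: the function name `build_target_oriented_reference`, the `keep_fraction` entropy cut-off, the default reference size, and sampling with replacement are all assumptions made for the example. The first-round probability matrix is taken as an input, so any classifier can supply it.

```python
import math
import random
from collections import Counter, defaultdict


def entropy(probs):
    """Shannon entropy of one cell's predicted class probabilities."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def build_target_oriented_reference(ref_cells, ref_labels, target_cells,
                                    target_probs, classes,
                                    keep_fraction=0.5, n_ref=None, seed=0):
    """Hypothetical sketch: expand the reference pool, then resample it.

    target_probs[i][k] is the first-round probability that target cell i
    belongs to classes[k]. keep_fraction is an assumed cut-off: the share
    of target cells (lowest entropy first) added to the pool.
    """
    rng = random.Random(seed)

    # Round-one labels: the most probable class for each target cell.
    pred = [classes[max(range(len(p)), key=p.__getitem__)]
            for p in target_probs]

    # Estimated cell-type composition of the target.
    counts = Counter(pred)
    composition = {c: counts.get(c, 0) / len(pred) for c in classes}

    # Low-entropy (high-confidence) target cells join the reference pool.
    order = sorted(range(len(target_cells)),
                   key=lambda i: entropy(target_probs[i]))
    confident = order[:int(keep_fraction * len(order))]

    pool = defaultdict(list)
    for cell, label in zip(ref_cells, ref_labels):
        pool[label].append(cell)
    for i in confident:
        pool[pred[i]].append(target_cells[i])

    # Resample the pool so that class shares follow the estimated composition.
    n_ref = n_ref or len(ref_cells)
    new_cells, new_labels = [], []
    for c in classes:
        k = round(composition[c] * n_ref)
        if k > 0 and pool[c]:
            new_cells += rng.choices(pool[c], k=k)  # with replacement
            new_labels += [c] * k
    return new_cells, new_labels, composition
```

The returned cells and labels would then be used to retrain the classifier for the second round of prediction.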

Study design
A benchmarking study of supervised cell-typing for scRNA-seq [8] investigated key factors affecting performance: feature selection, prediction method, and choice of the reference dataset. It found that a multi-layer perceptron (MLP) [9] combined with F-test-based feature selection generally performs the best. Based on these observations, we focus on using MLP to demonstrate the results from TORC. Results based on ACTINN [10], scNym [11] and scANVI [12] are also included to show the generalizability of TORC. We use Accuracy as the primary assessment metric, which captures the overall percentage of correct cell-type assignments.
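For concreteness, the accuracy metric can be written as the fraction of cells whose predicted label matches the known label. The function name and the example labels below are illustrative only.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of cells whose predicted cell type equals the true one."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Illustrative labels, not taken from any dataset in this study.
example = accuracy(["T", "B", "NK", "T"], ["T", "B", "T", "T"])
```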

Datasets
All datasets used are listed in Table S1. We include multiple datasets of human peripheral blood mononuclear cells (PBMC) from the 10X sequencing platform. The datasets "Covid CN" [13], "Covid UK" [14], "Covid FMC" [15], "Lupus" [16] and "Protocol" [17] each include cells from multiple individuals. The "FACS" dataset [18] includes cells separated by fluorescence-activated cell sorting (FACS). Substantial inter-individual variability due to age, sex, and overall health is present in PBMC, allowing us to examine the impact of different target-reference sample relationships. S2 shows that, depending on the target, using cells from a single subject as reference can outperform using all 39 subjects from the entire study.
To investigate the impact of the reference, we compare the performance of three references on 21 targets.
Figure S1 shows a two-fold reference effect. First, some references are better in general. Second, a good general reference is not always the best choice for all targets. For example, reference "batch1 1079" has high accuracy in general, but it is not the best for target "control 1016". Therefore, choosing the best reference has to take the target into consideration.

Both domain shift and composition difference exist in real data
An implicit assumption in most learning scenarios is that the reference and the target follow the same distribution. Thus, in the scRNA-seq context, it is ideal if all training cells are from the same biological source as the target. In reality, deviation between the two populations always occurs. This discrepancy calls for a balance between quantity and quality. The reference quality is attributable to two main aspects: the similarity in the distribution of expression profiles and the similarity in cell-type composition.

Constructing a reference that reflects the target cell composition improves accuracy
We demonstrate the benefit of constructing a reference that reflects the target cell composition using a toy example. The "FACS" dataset comprises nine subtypes identified through FACS experiments and is commonly regarded as the "gold standard" dataset. The cytotoxic T and naive cytotoxic T cells are the most difficult to distinguish (Figure S3.A). We create a scenario where the odds of these two cell-types are reversed between the reference and target, while the proportions of other cell-types remain the same (Table S3). When the classifier is trained with all cells in the reference, the accuracy is 0.84. However, if cells are sampled from the reference pool according to the target cell composition to form the training set, the accuracy increases sharply to 0.9 (Fig. 2.A). When an estimated target cell composition is used in place of the true composition, the accuracy still shows a substantial gain. These improvements are due to the reduction in cytotoxic T cells misclassified as naive cytotoxic T cells (Figure S3.B, Figure S3.C).
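The resampling idea in this toy example can be illustrated with a short Python sketch. The cell counts and proportions below are made up for illustration and are not the values in Table S3; the function name `resample_to_composition` and sampling with replacement are likewise assumptions.

```python
import random
from collections import Counter


def resample_to_composition(labels, composition, n, seed=0):
    """Return indices into `labels` whose class mix follows `composition`.

    Sampling is done with replacement (an assumption), so a class that is
    rare in the reference can still reach its quota.
    """
    rng = random.Random(seed)
    by_class = {}
    for i, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(i)
    picked = []
    for lab, frac in composition.items():
        k = round(frac * n)
        if k and lab in by_class:
            picked += rng.choices(by_class[lab], k=k)
    return picked


# Hypothetical reference in which the two hard-to-separate types are 80:20,
# while the target mix is the reverse.
ref_labels = ["cytotoxic T"] * 80 + ["naive cytotoxic T"] * 20
target_mix = {"cytotoxic T": 0.2, "naive cytotoxic T": 0.8}

idx = resample_to_composition(ref_labels, target_mix, n=100)
new_mix = Counter(ref_labels[i] for i in idx)
```

After resampling, the training set mirrors the target proportions rather than those of the original reference.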
The FACS dataset includes cells from only one source. We next assess the performance of TORC when the reference and target include cells from multiple subjects from the same study (Table S4). The improvement brought by TORC is consistent (Fig. 2.A).
A reusability report of scBERT [19] indicates that the role of cell-type distribution is overlooked and that using a reference with balanced cell-type weights may improve prediction [20]. Though this approach ensures every cell-type is well represented in the reference, it does not reflect the target composition. TORC consistently surpasses the approach of equal-weighted sampling (Fig. 2.A), indicating that leveraging a target-oriented reference yields greater benefits than a uniform, target-blind one.
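The difference between equal-weighted and target-oriented sampling can be made concrete by comparing the per-class sampling quotas each scheme produces. The sketch below uses a made-up three-class composition and hypothetical helper names; total variation distance is used here only as a convenient way to quantify the mismatch with the target, not as a measure taken from this study.

```python
def equal_weight_quota(classes, n):
    """Target-blind: the same number of cells for every cell type."""
    return {c: n // len(classes) for c in classes}


def target_oriented_quota(composition, n):
    """Quota proportional to the (estimated) target composition."""
    return {c: round(f * n) for c, f in composition.items()}


def total_variation(quota, composition):
    """Total variation distance between the sampled mix and the target mix."""
    total = sum(quota.values())
    return 0.5 * sum(abs(quota[c] / total - composition[c])
                     for c in composition)


# Hypothetical estimated target composition.
estimated_mix = {"CD4 T": 0.7, "B": 0.2, "NK": 0.1}

balanced = equal_weight_quota(list(estimated_mix), 90)
oriented = target_oriented_quota(estimated_mix, 90)
```

In this example the balanced quota over-represents the two rarer classes relative to the target, while the target-oriented quota matches it.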

Expanding the reference pool reduces bias due to domain shift
Next, we apply TORC using three public COVID datasets and two non-COVID datasets as targets to examine the situation where the reference samples are from different studies. Since the reference cells are from biological samples in studies different from the target, we employ the reference expansion option in TORC to include some target cells with low prediction entropy (high confidence) in the expanded reference pool. In most reference-target pairs, using the reference constructed by TORC leads to increased accuracy (Fig. 2.B, Figure S4). Applying the same TORC strategy with ACTINN, scNym and scANVI also improves accuracy in most situations (Figure S5).
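Selecting target cells by prediction entropy might look like the following sketch. Normalizing the entropy by its maximum and the cut-off value `max_entropy=0.5` are assumptions made for illustration; the text only states that low-entropy cells are added.

```python
import math


def normalized_entropy(probs):
    """Shannon entropy scaled to [0, 1] by the maximum log(number of classes)."""
    k = len(probs)
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(k) if k > 1 else 0.0


def high_confidence_indices(prob_matrix, max_entropy=0.5):
    """Indices of cells whose normalized entropy is at most `max_entropy`.

    The threshold is a hypothetical choice, not a value from the paper.
    """
    return [i for i, p in enumerate(prob_matrix)
            if normalized_entropy(p) <= max_entropy]


# Illustrative probability rows for three target cells and three classes.
prob_matrix = [
    [0.98, 0.01, 0.01],   # confident prediction
    [0.40, 0.35, 0.25],   # ambiguous prediction
    [0.05, 0.90, 0.05],   # confident prediction
]
selected = high_confidence_indices(prob_matrix)
```

Only the confidently labeled cells are retained, so ambiguous predictions do not enter the expanded reference pool.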

Use MLP-based reference
In practical applications, if a researcher has a particular interest in a specific supervised classification method such as scANVI, apart from directly applying TORC with scANVI as the algorithm for both the reference construction and the ultimate classification, we provide a flexible, alternative approach. This involves using MLP in reference construction and, once a target-oriented reference is constructed, using the user's choice of classification method for cell-type identification. Using MLP in reference construction is computationally efficient (Figure S6), and algorithms such as ACTINN, scNym and scANVI benefit from MLP-based reference construction (Fig.

Discussion
TORC constructs a reference sample with the target in mind to address the common issue of dataset shift in scRNA-seq cell-typing. The essence of the algorithm is to consider the complexity in the relationship between reference and target. We find two major factors that affect the classification accuracy: the cell-type-specific expression profile and the cell-type composition.
In this paper, we aim to point out the importance of the reference, in addition to the choice of algorithm, in supervised scRNA-seq cell-type identification. We view the current TORC as a beginning and see many potential extensions as public data continue to accumulate. For example, available reference samples can be rated by their quality, reflected by the accuracy in classifying cells in other labeled references.

Conclusions
As scRNA-seq becomes increasingly widely applied, particularly in large-scale population-level studies, cell-type identification remains among the most crucial aspects. Even though there are many supervised cell-typing methods, no work has addressed the problem of selecting reference data. To fill this gap, we propose a widely applicable target-oriented reference reconstruction strategy and validate the effectiveness and practicality of TORC. Both simulated and real data analyses showcase the potential of this strategy and point to directions for future research. For example, a similar strategy can be applied to other single-cell assays such as scATAC-seq and spatial transcriptomics.

Declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.

Figures
Figure 1
Figure S2.A shows an example with obvious domain shift, while Figure S2.B shows an example with little domain shift but with large differences in cell-type composition.

Cells of different types within a reference sample may be associated with different quality and a good construction may sample cells from multiple sources.