Machine Learning-based Approaches for Identifying Human Cells Harboring Fetal Chromatin Domain Ablations

Two common hemoglobinopathies, sickle cell disease (SCD) and β-thalassemia, arise from genetic mutations within the β-globin gene. A 500-bp motif termed Fetal Chromatin Domain (FCD), upstream of human ϒ-globin locus, may function as a transcriptional regulatory element driving inhibition of the ϒ-globin gene. Here, we hypothesize that the removal of this motif using CRISPR technology may reactivate the expression of ϒ-globin and subsequently restore fetal hemoglobin functionality. In this work we present two different cell morphology-based machine learning approaches that can be used identify cells that harbor FCD genetic modications. Three candidate models from the rst, which uses multilayer perceptron algorithm (MLP 20–26, MLP26-18, and MLP 30 − 26) and ow cytometry-derived cellular data, yielded 0.83 precision, 0.80 recall, 0.82 accuracy, and 0.90 area under the ROC (receiver operating characteristic) curve when predicting the edited cells. In comparison, the candidate model from the second approach, which uses deep learning (T2D5) and DIC microscopy-derived imaging data, performed with less accuracy (0.80) and ROC AUC (0.87). We envision both assays could be valuable and complementary to currently available genotyping protocols for specic genetic modications which result in morphological changes in human cells. machine learning models (logistical regression) can be used to predict cells harboring ubiquitin-proteasome system-related genetic mutations with reasonably good performance Recently, we discovered a 500-bp motif upstream of human ϒ-globin locus 30 (named as Fetal Chromatin Domain, FCD, Supplementary Materials/FCD sequence), which harbors a DNase hypersensitive site and the histone 3 lysine 4, mono-methylated (H3K4Me1) enhancer mark. Our study showed that it is a transcriptional regulatory element of the ϒ-globin gene and the removal of


Introduction
The Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR) genome editing technology, which is adapted from an immune system analog found in archaea and prokaryotes, has been applied to exceedingly broad scienti c, industrial, and medical domains at an exceptional pace, since the rst demonstration in cells [1][2][3][4][5][6][7][8] . One particularly exciting application is CRISPR-based therapeutics, which as of today, have been extended to at least ve treatment areas: blood disorders, cancers, eye diseases, chronic infections, and protein-folding disorders 9 .
Two most common hemoglobinopathies, sickle cell disease (SCD) and β-thalassemia, arise from genetic mutations within the βglobin gene. These mutations result in de cient or absent β-globin synthesis, which in turn lead to oxygen being disassociated from the hemoglobin and eventually conformational changes in red blood cells 10,11 . No cure is available for these disorders except bone marrow transplantation (BMT) when a suitable donor is available, and most treatments are mainly aimed at relieving symptoms and preventing complications. Recently, the CRISPR technology has been used to reactivate the expression of fetal hemoglobin, which can take the place of defective adult hemoglobin, and shown remarkable results in improving the quality of life in such patients 12 .
Machine learning, which can yield models for pattern recognition, classi cation, and prediction from acquired data, has been widely used in biological studies ranging from protein folding prediction 13 to cancer prognosis 14 . There are two main types of machine learning methods: (1) supervised learning (e.g. random forest, support vector machine), which derive the relationship between a set of input variables (features) and a designated dependent variable (label) from training instances and subsequently can be used to predict on new instances, and (2) unsupervised learning (e.g. clustering), which infer patterns from data without known labels 15 .
More recently, deep learning, a collection of new machine learning techniques extended from classical neural networks, has gained popularity due to its better performance compared to existing best-in-class machine learning algorithms across several elds including linguistics 16 , high-energy physics 17 , computational chemistry 18 , and biology 19 .
One area that has received particular attention in recent years is classi cation of different cell types (e.g. different blood cell types) [20][21][22][23] , states (apoptotic and healthy cells) [24][25][26][27] , and genotypes 28 using machine learning approaches to provide novel insights for biological systems. In one study 29 , Suzuki and colleagues developed a convolutional neural network (CNN) that was at least 90% accurate in classifying whole blood cells, peripheral blood mononuclear cells, human colon cancer cells, and human T lymphoma cells, using imaging ow. Similarly, in our previous work 24 , we have shown that machine learning can be an e cient and cost-effective approach in identifying live and apoptotic human cells. Suzuki and colleagues also demonstrated that, using labelfree, bright eld (BF) microscopy images, machine learning models (logistical regression) can be used to predict cells harboring ubiquitin-proteasome system-related genetic mutations with reasonably good performance (ROC AUC = 0.773) 28 .
Recently, we discovered a 500-bp motif upstream of human ϒ-globin locus 30 (named as Fetal Chromatin Domain, FCD, Supplementary Materials/FCD sequence), which harbors a DNase hypersensitive site and the histone 3 lysine 4, mono-methylated (H3K4Me1) enhancer mark. Our study showed that it is a transcriptional regulatory element of the ϒ-globin gene and the removal of this motif using CRISPR system reactivates the expression of fetal hemoglobin. Herein, we explore cell morphology-based machine learning approaches to classify cells with or without such genetic modi cations within the FCD domain.

Results
Generation and characterization of heterozygous FCD-de cient KU-812 cell model (FCD-HT) The CRISPR/SpCas9 technology was used to generate the FCD-de cient model in KU-812 cells, which were established from the peripheral blood of a patient with chronic myelogenous leukemia 31 . Brie y, the parental cells were transiently transfected with CRISPR/SpCas9 complex which targets both left and right genomic regions anking the FCD motif, along with a homologous recombination repair template containing a puromycin resistance gene transcript ( Figure 1A). The polyclonal stable cell line was then established after 2 weeks of puromycin selection (2µg/mL). Subsequently, to avoid potential interference with the transcriptional activities of the globin genes, the puromycin resistance gene transcript (~2.2kb), which was anked by ippase recognition target (FRT) sites, was removed using ippase. Finally, monoclonal stable cells were established using FACS single cell sorting.
To characterize the stable monoclonal cell line, genomic DNA was isolated using DNeasy Blood&Tissue Kit (Qiagen), and subsequently the transcript containing the FCD motif or FRT site was ampli ed using primers P13 and P14 (Supplementary Table   1). The PCR products were then subjected to gel electrophoresis and as shown in Figure 1B, the monoclonal cell line yielded two distinct bands corresponding to both the wild type (806 bp) and FCD-knockout (341 bp) alleles, con rming its heterozygous status (named as FCD-HT). Both bands were further extracted and subjected to Sanger sequencing, which con rmed that the FCD sequence was successfully removed in the FCD-Knockout allele (Supplementary Figure 1). To determine how the FCD-removal affects the ϒ-globin expression, total RNAs were extracted using the RNeasy Mini Kit from KU-812 and FCD-HT cells and the relative expression of ϒ-globin transcript was determined using quantitative reverse transcription-PCR (qRT-PCR). As shown in Figure 1C, the mRNA level of ϒ-globin signi cantly increased in FCD-HT cells (2.87-fold compared to its parental KU-812, named as HCT-WT), which is consistent with our hypothesis that the FCD motif may serve as a transcriptional repressor domain within human globin locus.  We rst compared the absolute readings among the 6 features using box plotting. As shown in Figure 2A, the means of these features varied signi cantly with the maximal ratio larger than 2.0-fold (mean FSC-A /mean SSC-H = 2.47), indicating that standardization of the original training and testing datasets are required (standardized training and testing datasets in Supplementary Table 5 and 6, respectively). Subsequently, the standardized training dataset was subjected to two dimensionality reduction algorithms, principal component analysis (PCA) and t-distributed stochastic neighbor embedding t-SNE (t-SNE). As shown in Figure 2B (PCA) and Figure 2C  Cell morphology-based machine learning models using ow cytometry-derived data A general work ow as described in our previous study was adopted to build and test various cell morphology-based machine learning models using ow cytometry-derived data 24 . In total, ve (5) supervised learning algorithms (logistical regression, random forest, k-nearest neighbor, support-vector machine, and multilayer perceptron) were included (model hyperparameters in Supplementary Table 7).
First, using tenfold cross-validation, we screened all models with the standardized training dataset, and adopted the ltering conditions as (1) the mean accuracy > 0.80, and (2) the standard deviation of accuracy < 0.10. In total, one (1) logistic regression model (Supplementary Table 8), 94 random forest models (Supplementary Table 9), 96 k-nearest neighbor models (Supplementary Next, all selected models (1,111) were trained using the training dataset, and subsequently applied to the standardized testing dataset and subjected to secondary ltering conditions as (1) precision when predicting FCD-HT cells > 0.80, and (2) recall when predicting FCD-HT cells > 0.80. As shown in Supplementary Table 14, only 533 MLP models survived this additional lter.
Cell morphology-based machine learning models using microscopy-derived data In addition to ow cytometry, cell morphology information can also be directly assessed using imaging 28  Next, deep learning-based convolutional neural networks (CNNs) were used to construct genotype-predictive models. Two general CNN architectures were explored: (1) Type 1 (T1): (Conv-Conv-Pool) n , which was based on the VGG design 32 , and (2) Type 2 (T2): (Conv-Pool) n , which contained a single convolution layer for each repeat. For each type, different number of convolution numbers were tested (two, four and six layers for T1, and two, three, four, ve layers for T2) until the nal feature map reaches a dimension of zero. Since our image inputs have a relatively small size (100 pixels by 100 pixels), we xed the lter size at 3 and when applicable, the Maxpooling pool size at 2. The detailed architectural designs were included in Supplementary Table 15.
As an example, for Type 2/5 layers (T2D5, Figure 4), the numbers of layers at the feature extraction step were 32, 64, 92, 100 and 128 for each successive layer, and recti ed linear unit (ReLU) was used as the activation function. Additionally, a Maxpooling layer was included after each convolution layer. Next, the outputs from convolutional layers were subjected to global average pooling and converted into a 1-dimensional vector (Flatten) for a fully connected layer (dense, 1028 nodes). Finally, a Softmax classi er, which applies a categorical cross-entropy loss function, was used, together with the adaptive moment estimation (ADAM) optimization algorithm.
First, all 7 candidate architectures were subjected to tenfold cross-validation using the training dataset. As shown in Supplementary   Table 16, models from Type 2 showed better performance compared to those from Type 1. Speci cally, the best-performing model of Type 2 (T2D5) showed a mean of accuracies from 10 cross-validation tests at 67.3% (Supplementary Figure 5), while the bestperforming model of Type 1 (T1D4) yielded a mean of accuracies at 58.3%.
We further trained models using all candidate architectures and the training dataset, and subsequently applied them to the testing dataset. As shown in Table 2, the architectures T2D5 displayed the best predictive outcome. Speci cally, for FCD-HT cells, precision was 0.84, recall was 0.76, accuracy was 0.80 and AUC was 0.87 ( Figure 5).

Discussion
In addition to ve supervised machine learning algorithms, we subjected our ow cytometry-derived dataset (the standardized training dataset) to two unsupervised clustering algorithms (k-means clustering and Gaussian mixture clustering) in parallel 33 . As shown in Supplementary Fig. 6 (SSC-A vs. FSC-A), the predicted distributive pattern for two subpopulations from k-means algorithm differed drastically from real genotype labels (Supplementary Fig. 2). Speci cally, even with the best-performing labeling schema (green: FCD-WT, red: FCD-HT), the model yielded poor predictive performance when predicting FCD-HT cells (precision: 0.52, recall: 0.66, accuracy: 0.53). Similarly, as shown in Supplementary Fig. 7, the predictive model derived from Gaussian mixture clustering performed poorly, with precision at 0.61, recall at 0.25 and accuracy at 0.55 when predicting FCD-HT cells (green: FCD-WT, red: FCD-HT). It is interesting to note that in our previous study, which focused on predicting cell states using cell morphological information, supervised learning also performed signi cantly better than unsupervised learning. These results may be because compared to supervised learning, unsupervised learning uses less information (unknown outputs/labels), and thus may be less accurate when applied to data from complex systems such as human cells.
In this study, we have investigated two alternative approaches in predicting cell genotypes: (1) numeric data based on ow cytometry assay, and (2) imaging data based on DIC microscopy. Our analysis showed that the best performing models from Typically, to establish and con rm a mammalian stable cell line, polyclonal cells need to be sorted into single cells, commonly on 96-well plates. Next, the single cells are allowed to propagate until su cient genomic material can be extracted and subjected to PCR-based genotyping assays (Fig. 1). While the protocol is well established, the full pipeline can become time-consuming (up to several weeks for cell propagation step dependent on the cell proliferation rates) and labor-intensive (hundreds of monoclones may be needed for acquiring desirable genotypes). On the other hand, sophisticated imaging ow cytometry techniques, which record extensive physically measurable quantities (features) of the cells, have been used to identify cell subpopulations with speci c traits (e.g. cell types, apoptotic states) 26,29,36,37 . However, most of these instruments are complex, expensive, and may not be yet available to most research labs. In comparison, our ow cytometry and DIC microscopy-based machine learning approaches provide a unique balance between e cacy and availability, and theoretically can be applied to any engineered cells with genetic modi cations known to introduce cell morphological changes.
In conclusion, we demonstrated the feasibility to use ow cytometry-based cellular information (FSC-A, FSC-H To pass the cells, the con uent cell culture was diluted in fresh medium at a ratio of 1:10. When applicable, 2 µg/mL puromycin (ThermoFisher Scienti c, catalog number: A1113803) was added to the growth medium.

Generation of FCD-HT monoclonal stable cell line
To generate the FCD-HT monoclonal stable cell line, ~10 million of the human KU-812 cells were seeded onto a 10 cm petri dish. 16 hours later, the cells were transiently transfected with 4.5 μg of PCMV-SpCas9-U6-sgRNA-L, 4.5 μg of PCMV-SpCas9-U6-sgRNA-R, and 1 μg of the donor plasmid using the JetPEI reagent (Polyplus Transfection). 48 hours later, puromycin was added at the nal concentration of 2 μg/mL. The selection lasted ~2 weeks, after which the surviving clones were pooled to generate the polyclonal stable cells. Next, to remove the puromycin resistance gene-T2A-mKate cassette, ~10 million of the polyclonal stable cells were seeded onto a 10 cm petri dish, and after 16 hours were transfected with 10 μg of EF1-Flpase (unpublished data) using the JetPEI reagent. 48 hours later, single cells were isolated using ow cytometry. The established monoclonal stable cell line was con rmed to be heterozygous by genotyping and further expanded and maintained in the complete growth medium.

Genotyping of FCD-HT monoclonal stable cell line
The genomic DNAs were isolated from FCD-HT monoclonal stable cells using DNeasy Blood&Tissue Kit (Qiagen). The transcripts containing the CRISPR-targeting region was ampli ed with primers P13 and P14. The PCR products were then subjected to gel electrophoresis and Sanger sequencing using primers P13 and P14. Approximately 50,000 FCD-WT or FCD-HT cells were seeded on 12-well plates (Greiner Bio-One) in the complete medium. Cells were imaged using an Olympus IX81 microscope in a Precision Control environmental chamber. The images were captured using a Hamamatsu ORCA-03 Cooled monochrome digital camera. The lter set was Differential Interference Contrast (DIC) with magni cation at 40X. After obtaining the images, Adobe Photoshop was used to isolate individual cells with xed size at 100 pixels by 100 pixels.

Machine learning model training and testing
For ow cytometry-derived dataset, a Dell desktop computer (Intel Core i7-10700 CPU @ 2.90 GHz, Windows 10 enterprise 64-bit OS and 32 GB RAM) was used to conduct the machine learning modeling. Scikit-Learn, a free Python machine learning library, was used to conduct all model training and testing procedures. For DIC microscopy-derived dataset, a Lenovo Laptop (Intel Core i7-10510 CPU @ 1.80 GHz, Ubuntu 20.04 OS and 16 GB RAM) was used to conduct the deep learning modeling. The Keras library in TensorFlow was used to conduct all model training and testing procedures. Other Python libraries, including NumPy, Pandas, and Matplotlib, were also included for data analysis and presentation.

Performance metrics
Performance of different models was evaluated using threshold dependent and independent metrics, which include: (1) precision: this parameter measures how accurate a model is when predicting cells being at live state. Precision = TP/(TP + FP), where TP refers to correctly predicted live cells and FP refers to falsely predicted live cells. Generation and characterization of heterozygous FCD-de cient KU-812 cell model (FCD-HT). (A) Schematic illustration of the CRISPR/SpCas9 homologous recombination process to remove the FCD motif within human globin locus. The polyclonal stable cell line was established using 2 weeks of puromycin selection (2 µg/mL). Subsequently, the puromycin resistance gene transcript, which was anked by ippase recognition target (FRT) sites, was removed using ippase. Finally, monoclonal stable cells were established using FACS single cell sorting. (B) Genotyping of FCD-HT monoclonal stable cell. Genomic DNAs were harvested from FCD-WT and FCD-HT cells and the transcript containing the FCD motif or FRT site was PCR ampli ed and subsequently subjected to gel electrophoresis. The stable cell line yielded two bands corresponding to both the wild type (806 bp) and FCD-knockout (341 bp) alleles, con rming its heterozygous status. (C) Quantitative reverse transcription-PCR (qRT-PCR) assay showed that compared to FCD-WT cells, the mRNA level of ϒ-globin signi cantly increased in FCD-HT cells (2.87-fold). *** denotes p-value < 0.001.   The T2D5 deep learning architecture. The model contained ve convolutional layers (the numbers of each layer: 32, 64, 92, 100 and 128). Additionally, a Maxpooling layer was included after each convolution layer. Next, the outputs from convolutional layers were subjected to global average pooling and attened for a fully connected layer (1028 nodes). Finally, a Softmax classi er, which applies a categorical cross-entropy loss function, was used.