Generation and characterization of heterozygous FCD-deficient KU-812 cell model (FCD-HT)
The CRISPR/SpCas9 technology was used to generate the FCD-deficient model in KU-812 cells, which were established from the peripheral blood of a patient with chronic myelogenous leukemia31. Briefly, the parental cells were transiently transfected with CRISPR/SpCas9 complexes targeting the left and right genomic regions flanking the FCD motif, along with a homologous recombination repair template containing a puromycin resistance gene cassette (Figure 1A). A polyclonal stable cell line was then established after 2 weeks of puromycin selection (2 µg/mL). Subsequently, to avoid potential interference with the transcriptional activities of the globin genes, the puromycin resistance gene cassette (~2.2 kb), which was flanked by flippase recognition target (FRT) sites, was removed using flippase. Finally, monoclonal stable cell lines were established by FACS single-cell sorting.
To characterize the stable monoclonal cell line, genomic DNA was isolated using the DNeasy Blood & Tissue Kit (Qiagen), and the region containing the FCD motif or the residual FRT site was amplified using primers P13 and P14 (Supplementary Table 1). The PCR products were then subjected to gel electrophoresis and, as shown in Figure 1B, the monoclonal cell line yielded two distinct bands corresponding to the wild-type (806 bp) and FCD-knockout (341 bp) alleles, confirming its heterozygous status (hereafter FCD-HT). Both bands were extracted and subjected to Sanger sequencing, which confirmed that the FCD sequence was successfully removed from the FCD-knockout allele (Supplementary Figure 1). To determine how the FCD removal affects γ-globin expression, total RNA was extracted from KU-812 and FCD-HT cells using the RNeasy Mini Kit, and the relative expression of the γ-globin transcript was determined using quantitative reverse transcription-PCR (qRT-PCR). As shown in Figure 1C, the mRNA level of γ-globin significantly increased in FCD-HT cells (2.87-fold relative to the parental KU-812 line, hereafter FCD-WT), consistent with our hypothesis that the FCD motif may serve as a transcriptional repressor element within the human globin locus.
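The study does not state how relative expression was computed from the qRT-PCR readings; a common approach is the 2^-ΔΔCt method, sketched below with invented Ct values (the reference gene, Ct numbers, and the resulting fold change are illustrative only, not the study's data).

```python
# Hypothetical illustration of the 2^-ΔΔCt method often used to report
# relative qRT-PCR expression; all Ct values below are invented.

def fold_change(ct_target_test, ct_ref_test, ct_target_ctrl, ct_ref_ctrl):
    """Relative expression of a target gene via the 2^-ΔΔCt method."""
    delta_ct_test = ct_target_test - ct_ref_test  # normalize to reference gene
    delta_ct_ctrl = ct_target_ctrl - ct_ref_ctrl
    ddct = delta_ct_test - delta_ct_ctrl          # compare test sample to control
    return 2 ** (-ddct)

# Made-up Ct values: target vs. reference gene in test and control cells.
fc = fold_change(22.0, 18.0, 23.5, 18.0)
print(round(fc, 2))  # 2.83, i.e. a ~2.8-fold increase
```

A smaller Ct means the transcript crossed the detection threshold earlier, so a negative ΔΔCt corresponds to higher expression in the test sample.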
Flow cytometry-based data collection and visualization for FCD-WT and FCD-HT cells
To build cell morphology-based predictive models that differentiate FCD-WT and FCD-HT cells, we first used a flow cytometry assay to record six features (FSC-A, FSC-H, FSC-W, SSC-A, SSC-H, and SSC-W). In total, 192,772 FCD-WT cells (labeled as 0) and 185,544 FCD-HT cells (labeled as 1) were included (ratio of label 0 to label 1 = 1.04, Supplementary Table 2). This initial dataset was then randomly split into training and testing datasets at a ratio of 80:20 (training:testing). Specifically, the training dataset contains 302,652 cells (label 0: 154,180 cells, label 1: 148,472 cells, Supplementary Figure 2 and Supplementary Table 3), and the testing dataset contains 75,664 cells (label 0: 38,592 cells, label 1: 37,072 cells, Supplementary Figure 3 and Supplementary Table 4).
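The splitting step can be sketched with scikit-learn's `train_test_split`; the tooling and the stratification choice are assumptions (the paper says only "randomly split"), and the data below are random stand-ins for the six cytometry features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Stand-in for the six flow cytometry features (FSC-A/H/W, SSC-A/H/W);
# random values here, not the actual measurements.
X = rng.normal(size=(1000, 6))
y = rng.integers(0, 2, size=1000)  # 0 = FCD-WT, 1 = FCD-HT

# 80:20 split as described; stratify=y keeps the label ratio similar in
# both subsets (an assumption, since the paper does not specify this).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
print(X_train.shape, X_test.shape)  # (800, 6) (200, 6)
```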
We first compared the absolute readings among the six features using box plots. As shown in Figure 2A, the means of these features varied substantially, with a maximal ratio larger than 2.0-fold (meanFSC-A/meanSSC-H = 2.47), indicating that standardization of the original training and testing datasets is required (standardized training and testing datasets in Supplementary Tables 5 and 6, respectively). Subsequently, the standardized training dataset was subjected to two dimensionality reduction algorithms, principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE). As shown in Figure 2B (PCA) and Figure 2C (t-SNE), the two cell subpopulations (FCD-WT: green, FCD-HT: yellow) demonstrated distinct distributive patterns and were partially separable.
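A minimal sketch of this preprocessing and visualization pipeline, assuming scikit-learn's `StandardScaler`, `PCA`, and `TSNE` (the paper does not name its tooling, and the data here are mock values, not the cytometry measurements):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(1)
# Mock 6-feature data with deliberately unequal means/scales per feature.
X = rng.normal(loc=[5.0, 3.0, 4.0, 2.0, 2.0, 3.0], scale=1.0, size=(300, 6))

# Standardize each feature to zero mean / unit variance so that features
# with larger absolute readings (e.g. FSC-A vs SSC-H) do not dominate.
X_std = StandardScaler().fit_transform(X)

# Project to 2D with each of the two dimensionality reduction algorithms.
X_pca = PCA(n_components=2).fit_transform(X_std)
X_tsne = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_std)
print(X_pca.shape, X_tsne.shape)  # (300, 2) (300, 2)
```

The 2D coordinates would then be scatter-plotted with one color per genotype, as in Figures 2B and 2C.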
Cell morphology-based machine learning models using flow cytometry-derived data
A general workflow as described in our previous study was adopted to build and test various cell morphology-based machine learning models using flow cytometry-derived data24. In total, five (5) supervised learning algorithms (logistic regression, random forest, k-nearest neighbors, support vector machine (SVM), and multilayer perceptron (MLP)) were included (model hyperparameters in Supplementary Table 7).
First, using tenfold cross-validation, we screened all models on the standardized training dataset, applying the filtering conditions (1) mean accuracy > 0.80 and (2) standard deviation of accuracy < 0.10. In total, one (1) logistic regression model (Supplementary Table 8), 94 random forest models (Supplementary Table 9), 96 k-nearest neighbor models (Supplementary Table 10), two (2) SVM models with a linear kernel (Supplementary Table 11), 25 SVM models with a Gaussian kernel (Supplementary Table 12), and 893 MLP models (Supplementary Table 13) were selected.
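The cross-validation screen can be sketched as follows, assuming scikit-learn's `cross_val_score`; the candidate models and synthetic data are placeholders for the study's hyperparameter grid, but the two filtering thresholds match those stated above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 6-feature binary dataset standing in for the cytometry data.
X, y = make_classification(n_samples=500, n_features=6, random_state=0)

# Two placeholder candidates; the study screened a much larger grid.
candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "knn-5": KNeighborsClassifier(n_neighbors=5),
}

# Tenfold CV screen with the paper's two filters:
# mean accuracy > 0.80 and standard deviation of accuracy < 0.10.
selected = []
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
    if scores.mean() > 0.80 and scores.std() < 0.10:
        selected.append(name)
print(selected)
```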
Next, all 1,111 selected models were trained using the training dataset, then applied to the standardized testing dataset and subjected to secondary filtering conditions: (1) precision when predicting FCD-HT cells > 0.80, and (2) recall when predicting FCD-HT cells > 0.80. As shown in Supplementary Table 14, only 533 MLP models survived this additional filter.
Finally, we chose the three MLP models with the largest AUC values (Table 1: MLP 20-26, MLP 26-18, and MLP 30-26, where the two numbers denote the node counts of the first and second hidden layers, respectively), and plotted both the receiver operating characteristic (ROC) curves (Figure 3A) and precision-recall curves (Figure 3B). The three models displayed essentially identical performance when predicting FCD-HT cells (precision: 0.83, recall: 0.80, accuracy: 0.82, and AUC: 0.90).
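Both curve types are derived by sweeping a decision threshold over the model's predicted class-1 probabilities; a minimal sketch with scikit-learn (the scores below are invented stand-ins for an MLP's outputs):

```python
import numpy as np
from sklearn.metrics import roc_curve, precision_recall_curve, roc_auc_score

# Invented scores standing in for an MLP's predicted probability of
# class 1 (FCD-HT) on eight test cells.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9])

fpr, tpr, _ = roc_curve(y_true, y_score)                # ROC curve points
prec, rec, _ = precision_recall_curve(y_true, y_score)  # PR curve points
auc = roc_auc_score(y_true, y_score)                    # area under ROC
print(round(auc, 3))  # 0.875
```

The `(fpr, tpr)` and `(rec, prec)` arrays are what would be plotted as in Figures 3A and 3B.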
Cell morphology-based machine learning models using microscopy-derived data
In addition to flow cytometry, cell morphology information can also be directly assessed using imaging28. Using differential interference contrast (DIC) microscopy, we prepared 1,594 images of individual FCD-WT cells and 1,695 images of individual FCD-HT cells (Supplementary Figure 4; ratio of FCD-WT to FCD-HT = 0.94). This starting dataset was then randomly split into training and testing datasets at a ratio of 90:10 (training:testing). The final training dataset contains 2,956 images (FCD-WT: 1,433 images, FCD-HT: 1,523 images), and the testing dataset contains 333 images (FCD-WT: 161 images, FCD-HT: 172 images).
Next, deep learning-based convolutional neural networks (CNNs) were used to construct genotype-predictive models. Two general CNN architectures were explored: (1) Type 1 (T1), (Conv-Conv-Pool)n, based on the VGG design32, and (2) Type 2 (T2), (Conv-Pool)n, which contains a single convolution layer per repeat. For each type, different numbers of convolution layers were tested (two, four, and six for T1; two, three, four, and five for T2), up to the depth at which the final feature map would shrink to a dimension of zero. Since our image inputs are relatively small (100 pixels by 100 pixels), we fixed the filter size at 3 and, where applicable, the max-pooling pool size at 2. The detailed architectural designs are included in Supplementary Table 15.
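The depth limit can be checked with a back-of-the-envelope shape calculation. The sketch below assumes unpadded ("valid") 3×3 convolutions, which trim one pixel per side; the paper fixes the filter size at 3 and the pool size at 2 but does not state the padding, so this is an illustration of the shrinkage logic rather than the exact models.

```python
# Trace how a 100x100 input shrinks through the two architecture families,
# assuming unpadded 3x3 convolutions and 2x2 max-pooling (padding is an
# assumption; the paper does not specify it).

def shrink(size, pattern, repeats):
    """Apply a (conv.../pool) pattern `repeats` times.
    Returns the sequence of spatial sizes, or None if the map collapses."""
    sizes = [size]
    for _ in range(repeats):
        for layer in pattern:
            if layer == "conv":
                size -= 2        # 3x3 valid convolution trims 1 pixel per side
            else:
                size //= 2       # 2x2 max-pooling halves the map
            if size <= 0:
                return None      # feature map collapsed: depth not feasible
            sizes.append(size)
    return sizes

print(shrink(100, ("conv", "pool"), 5))          # T2, 5 conv layers: ends at 1x1
print(shrink(100, ("conv", "pool"), 6))          # None: a 6th repeat collapses
print(shrink(100, ("conv", "conv", "pool"), 3))  # T1, 6 conv layers: ends at 9x9
```

Under these assumptions a sixth (Conv-Pool) repeat is infeasible, consistent with T2 being tested only up to five convolution layers.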
As an example, for Type 2 with 5 layers (T2D5, Figure 4), the numbers of filters at the feature extraction step were 32, 64, 92, 100, and 128 for the successive convolution layers, with the rectified linear unit (ReLU) as the activation function. A max-pooling layer followed each convolution layer. The convolutional outputs were then subjected to global average pooling and flattened into a one-dimensional vector, which was fed to a fully connected layer (dense, 1,028 nodes). Finally, a softmax classifier with a categorical cross-entropy loss function was used, together with the adaptive moment estimation (Adam) optimization algorithm.
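The classifier head described above reduces to two simple operations, shown numerically below with invented logits (the values are illustrative, not outputs of the actual model):

```python
import numpy as np

def softmax(z):
    """Convert logits to class probabilities."""
    z = z - z.max()              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def categorical_cross_entropy(probs, one_hot):
    """Loss minimized during training: -sum over classes of y * log(p)."""
    return -np.sum(one_hot * np.log(probs))

logits = np.array([2.0, 0.5])    # invented dense-layer outputs (FCD-WT, FCD-HT)
probs = softmax(logits)
loss = categorical_cross_entropy(probs, np.array([1.0, 0.0]))  # true class: FCD-WT
print(probs.round(3), round(loss, 3))  # [0.818 0.182] 0.201
```

During training, the Adam optimizer adjusts the network weights to reduce this loss averaged over the training images.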
First, all 7 candidate architectures were subjected to tenfold cross-validation using the training dataset. As shown in Supplementary Table 16, models from Type 2 outperformed those from Type 1. Specifically, the best-performing Type 2 model (T2D5) achieved a mean cross-validation accuracy of 67.3% (Supplementary Figure 5), while the best-performing Type 1 model (T1D4) yielded a mean accuracy of 58.3%.
We further trained models using all candidate architectures on the training dataset and subsequently applied them to the testing dataset. As shown in Table 2, the T2D5 architecture displayed the best predictive performance. Specifically, for FCD-HT cells, precision was 0.84, recall was 0.76, accuracy was 0.80, and AUC was 0.87 (Figure 5).