MIM-CyCIF: Masked Imaging Modeling for Enhancing Cyclic Immunofluorescence (CyCIF) with Panel Reduction and Imputation

CyCIF can quantify multiple biomarkers, but panel capacity is limited by technical challenges. We propose a computational panel reduction approach that can impute the information content from 25 markers using only 9 markers, learning co-expression and morphological patterns while concurrently increasing speed and panel content and decreasing cost. We demonstrate strong correlations in predictions and generalizability across breast and colorectal cancer, illustrating applicability of our approach to diverse tissue types.

selection and training a multi-encoder variational autoencoder (ME-VAE) 6 to reconstruct the full 25-plex CyCIF images at the single cell level.Wu et al. proposed a three-step method using a concrete autoencoder and convolution neural network to reduce CODEX markers and predict intensity via a linear regression model 11 .Sun et al. iteratively trained a U-Net to reconstruct patch-level images, aiding marker selection for a reduced panel 12 .In contrast to prior research, where panel selection is separate from full panel reconstruction, our method integrates iterative marker selection within a pre-trained model, streamlining panel reduction and reconstruction.Unlike xed-size reduced panels, our method optimizes marker selection during inference, enhancing e ciency, reliability, and practicality for model training.
Inspired by the success of masked language modeling in natural language processing 13 , the concept of masked image modeling (MIM) has gained traction in computer vision 14,15 .MIM-based models resemble denoising autoencoders 16 , utilized for data restoration and model pre-training.Nevertheless, the utilization of masked image modeling for missing data imputation tasks has been minimally investigated.Employing a self-supervised trained masked autoencoder (MAE), we reconstruct masked CyCIF image channels at the single-cell level and identify optimal reduced panel sets for complete panel reconstruction.Through the architecture and masked token prediction task outlined in Fig. 1A, we demonstrate successful imputation of CyCIF image channels at the single-cell level through 'channel in-painting'.Our model takes in a collection of single-cell images, each containing 25 channels representing individual CyCIF marker stains.During training, we set a xed ratio of channels that will undergo random masking for each sample (Fig. 1B left).Our model is then tasked with reconstructing these masked channels (Fig. 1B right).
After model training, the selection of masked channels becomes feasible to determine the optimal unmasked channel combination for accurate reconstruction of masked counterparts.This strategy is harnessed to progressively curate an enhanced marker panel (Fig. 1C).In each iteration, the marker selection is the one maximizing the Spearman correlation between actual and predicted mean intensities for the remaining held-out markers (Methods: Iterative Panel Selection).Subsequently, this iterative process of re ning panels establishes an order of markers that roughly represents predictive value regarding other markers (Fig. 2A).Visualized in columns, the chosen markers' in uence on predicting withheld markers is depicted, while rows illustrate the corresponding improvements in prediction for each withheld marker.We additionally assess the structural similarity index measure (SSIM) (Methods: Model's Performance Evaluation), a widely adopted metric capturing image similarity as perceived by the human visual system 17 , between real and predicted single-cell expression at the pixel level (Supplementary Fig. 1).
In our prior study 10 , we pre-selected optimally reduced panels comprising 3, 6, 9, 12, 15, and 18 markers through Spearman Correlation.The subset of markers exhibiting the highest correlation with the remaining withheld markers was chosen as the reduced panel.Figure 2B illustrates the enhancement in prediction resulting from the replacement of ME-VAE with MAE.With the same 9-marker reduced panel, MAE improves the average correlation of withheld marker predictions by 0.22 (yellow violin plot).A further enhancement is achieved by utilizing the reduced panel generated via iterative selection, yielding an additional 0.01 improvement (green violin plot).We also demonstrate similar effectiveness on the colorectal cancer (CRC) TMA dataset (Supplementary Figs. 2 and 3).
Moreover, the model generalizes well to unseen data, demonstrated by conducting 5-fold cross-validation across the CRC TMA cores (Supplementary Fig. 4).Additionally, we tested our model trained on the CRC TMA on a full Whole-Slide Image (WSI) stained with the same panel but in different batches (Supplementary Fig. 5).Although the model exhibits a modest performance reduction (0.26 reduction in Spearman correlation using the same 9 marker panel), these results hold promise, particularly given potential limitations in generalizability stemming from ROI selection bias and a single TMA reference for batch correction.This nding aligns with prior ndings on small TMA cores, emphasizing the enhanced representation provided by randomly selected multiple TMA cores in capturing tumor or immune contexture as compared to WSI 18,19 .
Our study highlights the e cacy of utilizing MAE to generate high-plex CyCIF data from only a few experimental measurements, signi cantly reducing the required biomarkers to interrogate a sample.The ability to identify a biomarker subset and perform in silico prediction offers several advantages.Our method empowers users to access a more extensive set of biomarkers beyond those experimentally measured.Additionally, it enables the allocation of resources for the exploration of novel biomarkers, thereby enhancing cell type differentiation and disease characterization.Furthermore, it can manage instances of assay failures such as low-quality markers, technical noise, and/or potential tissue loss in later CyCIF rounds.It also has the ability to arti cially up-sample and incorporate additional panel markers.In future work, we will explore different normalization strategies [20][21][22] to reduce marker intensity variability across batches.Additionally, we will explore a more diverse training dataset, incorporating WSIs in different batches, effectively mitigating TMA sampling bias and batch effects.

CyCIF Image Dataset
Our data consists of CyCIF images from two TMA datasets, one containing breast cancer (BC) tissue and the other containing colorectal cancer (CRC) tissue.The tissue microarrays (TMAs) CyCIF imaging data are available via HTAN (https://humantumoratlas.org/).The breast cancer (BC) TMA contains 88 cores representing 6 cancer subtypes.The colorectal cancer (CRC) TMA contains 332 cores.Biomarker panels for the two TMAs are shown in Supplementary Table 1.

Image Preprocessing
The original CyCIF marker intensities (16 bit) were rescaled to the 8 bit image ([0,255] range).For the BC TMA, we simply used preprocessed data in our previous work 10 .For the CRC TMA, any core containing a channel with a mean intensity beyond 2 standard deviations from the mean intensity of the entire TMA for that channel was dropped.This resulted in 12 cores being removed.Individual cores were then segmented using MESMER 23 .We use the whole cell masks generated using the max projection of the PanCK and CD45 channels as the membrane marker to crop each cell down to a 32x32 pixel region.The background of each single-cell image is then zeroed out, and the polar axis and center of mass are aligned.This resulted in 742,169 cells for the CRC TMA and 691,893 cells for the BC TMA.We use 90% of the cells randomly selected as the training set, and withhold 5% for validation/panel selection and 5% for testing.

Masked Image Modeling
We modify the patch-wise masking strategy in MAE 14 to a channel-wise masking strategy.This involves resizing the 32x32x25 multichannel single-cell image (32x32) to a 5 x 5 grid format (Fig. 1B), creating a resulting image size of 160x160 (i.e., (32x5)x(32x5)) with a single channel.Consequently, the patch size for MAE is adjusted to 32x32, where each patch now corresponds to an individual channel within the original multi-channel image.
We use a Vision Transformer (ViT) encoder and a ViT decoder as in MAE, both set to 8 heads and 6 layers, and a 2048-dimension multilayer perceptron layer.The embedding dimensions were 1024 for the encoder and 512 for the decoder.We train the model using 8 Nvidia A40 GPUs for 300 epochs using a batch size of 512, Adam optimizer and a learning rate of 1e-3.
Although the trained model works on a range of reduced panel sizes, during training the number of channels to be masked is set to a xed ratio.We evaluate different masking ratios for training by assessing the performance of different reduced panel sizes in inference on the BC TMA dataset.For testing, we choose the optimal reduced panels identi ed in Ternes et al. 10 , which have sizes of 3,6,9,12,15, and 18 markers (88%, 76%, 64%, 52%, 40%, and 28% masking ratios, respectively).We train three models using a xed masking ratio of 25%, 50%, and 75%, and nd that the 50% masking ratio results in the best overall performance across different panel sizes in inference (Supplementary Fig. 6).

Iterative Panel Selection
To obtain optimal reduced panel sets, we leverage the trained model to determine which markers are most informative.To do this, we iteratively determine an ordering of markers such that the rst markers result in the best reconstruction of the remaining markers, measured by the Spearman correlation of the predicted mean intensity at the single cell level.We start with , setting the rst marker in the order to be DAPI, as nuclear staining is important for downstream analysis such as registration as well as determining cell morphology.We then iterate through the remaining 24 markers to determine which marker, along with DAPI, results in the best reconstruction of the remaining 24 markers (Fig. 1C and 2A).We repeat this process until we nd the best marker panel: Where is the marker channel being considered for inclusion into , is the set of ground truth masked channels, is the set of unmasked channels, is the trained MAE model parameterized by , which returns the reconstructed masked channels, and is the Spearman correlation between the mean intensities of and the output of .

Model's Performance Evaluation
We evaluate the model's performance using Spearman correlation and SSIM.We assess the agreement between actual and the predicted mean marker intensities within the cell boundaries.This analysis is crucial as it quanti es speci c cellular component expression levels across markers, characterizing cellular phenotypes.We avoid directly comparing and classifying cell types in subsequent analysis to avoid oversimplifying complex cellular phenotypes.It is important to note that cell type determination in MTI settings involves diverse methodologies and can be in uenced by factors like imperfect cell segmentation, marker selection, data preprocessing and normalization [20][21][22] , and algorithm choice 24 .
5-Fold Cross-validation Cross-validation was performed on the CRC TMA dataset to evaluate model performance on unseen data.We divide the dataset at the TMA core level by separating the cores into 5 sets of 64 TMA cores for the test sets and the remaining TMA cores are used to train 5 separate models.As TMAs typically encompass multiple patients, this test effectively demonstrates the model's generalizability.The performance of the 5 models on different reduced panel sizes is shown in Supplementary Fig. 4.

WSI Testing and Batch Effect
To further demonstrate model generalizability, we test the CRC model on a whole-slide image (WSI) stained using the same panel.Segmentation produced a dataset consisting of 742,799 cells in the WSI.
To mitigate batch-to-batch staining variations, the intensity distributions for each channel in the WSI were normalized using histogram matching with 3 cores from the CRC TMA -obtained from the same tissue section as the WSI.Although this approach allows us to address batch effects while preserving intrapatient intensity distributions, a potential limitation is sample bias introduced during TMA core selection.
A histopathologist would heavily favor tumor regions for core punchouts, whereas the full WSI contains a more heterogeneous tissue region.Therefore, markers that are not expressed in tumor cells are potentially underrepresented in our training set.Supplementary Fig. 5 shows performance of the model, trained on the CRC TMA with reduced panels selected from the TMA, on this dataset.

Declarations Data Availability
As part of this paper all images at full resolution, all derived image data (e.g.segmentation masks), and all cell count tables will be publicly released via the NCI-recognized repository for Human Tumor Atlas Network (HTAN; https://humantumoratlas.org/) at Sage Synapse.Note to reviewers: this data resource is undergoing nal review for Private Health Information and requires a (free) Synapse account; public access will be provided as soon as possible.An anonymous "reviewer only" link can be provided prior to that by requesting it from the monitoring editor. Figures

Figure 2 Model
Figure 2