Patient cohort
This study received approval from the Institutional Review Board at Peking University People's Hospital (IRB NO.2021PHB182-001), and all patients provided written informed consent. The study enrolled patients who underwent complete resection for pathologic stage I-III LUAD and subsequent next-generation sequencing (NGS) for cancer-related genes on their primary tumor between 2018 and 2020 using 363 (HR363, Berry Oncology, Beijing, China), HR457 (Berry Oncology, Beijing, China), or RS520 (Burning Rock Biotech, Guangzhou, China). All patients underwent lobectomy with lymph node (LN) dissection to ensure the pathological assessment of LN. The exclusion criteria comprised induction therapy, wedge resection or segmentectomy, and minimally invasive adenocarcinoma (MIA).
We summarized the clinicopathologic characteristics of the broad-panel NGS cohort (n = 257), including age, sex, smoking status, tumor size, radiologic type, histological subtype, and pathological stage (adjusted based on the 8th edition of the AJCC Cancer Staging Manual). We performed propensity score matching (PSM) using a 1:1 ratio to match LN metastasis and non-metastasis cases in the NGS cohort. High inflammation versus low inflammation was the matching indicator, and the variables were age, sex, comorbidity, smoking history, histology, clinical stage. Pairs of high and low inflammation with a nearest propensityscore were matched 1:1 with a caliper width of 0.2 of standard deviation. The matching factors included age, gender, smoking, and tumor size on CT. Subsequently, we obtained a cohort of 92 patients for mIHC testing.
Tumor Genomic Analyses
We conducted target capture for the broad-panel NGS using commercial panels that included 363 (HR363, Berry Oncology, Beijing, China), 457 (HR457, Berry Oncology, Beijing, China), or 520 (RS520, Burning Rock Biotech, Guangzhou, China) cancer-related genes, covering 1.21 megabases (Mb) of the human genome. The protocols for genomic DNA extraction, targeted/whole-exome sequencing library preparation, and variant calling were described in previous publications [56].
We assessed the genomic alteration count in the broad-panel NGS cohort, defined as the sum of nonsynonymous somatic mutations, copy number variations (CNVs), and gene fusions. Additionally, we calculated the mutant-allele tumor heterogeneity (MATH) score based on the dispersion of variant allele frequency (VAF) distribution. A higher MATH score indicated a tumor with higher intratumor heterogeneity (ITH).
We applied the MutSigCV and dNdScv algorithm to infer significantly mutated genes (q < 0.1 in any caller and nonsilent mutations n ≥ 5) in the broad-panel NGS cohort (n = 257), including 53 pN positive and 204 pN negative cases. For oncogenic pathway analyses, we retained only functional alterations labeled as oncogenic, likely oncogenic, or predicted oncogenic in the OncoKB database, discarding variants of unknown significance. Therapeutic actionability information was also annotated using the OncoKB database, and each genomic alteration was stratified into one of four levels according to its clinical implication.
Multiplex Immunofluorescence (mIHC) Section Preparation
The mIHC staining was performed by Alphaxbio Biotechnology Co., Ltd. (Beijing, China), using a panel of six markers: PANCK, CD4, CD8, FOXP3, CD68, and PD-L1.
The 4-µm thick FFPE NSCLC tissue slides were subjected to deparaffinization in xylene for 30 mins followed by rehydration in absolute ethyl alcohol for 5 mins (twice), 95% ethyl alcohol for 5 mins, and 75% ethyl alcohol for 2 mins sequentially. The slides were washed three times with distilled water. Heat-induced epitope retrieval was performed using a microwave oven, and the slides were immersed in boiling EDTA buffer (PH9.0; ZLI-9069; Zsbio, Beijing, China) for 15 mins. Antibody Diluent/Block (72424205; Akoya Biosciences, DE, USA) was used for blocking before the detection of each panel, which included a total of 6 markers: PANCK, CD4, CD8, FOXP3, CD68, and PD-L1. The mIHC staining was carried out at Alphaxbio Biotechnology Co., Ltd. (Beijing, China).
Primary antibodies including FOXP3 (ab20034; Abcam, Cambridge, UK), PANCK (ab7753; Abcam, Cambridge, UK), CD8 (ab237709; Abcam, Cambridge, UK), CD68 (ab192847; Abcam, Cambridge, UK), CD4 (ab133616; Abcam, Cambridge, UK), and PDL1 (ab237726; Abcam, Cambridge, UK) were incubated at 37°C for 1 h. The slides were then incubated with Opal Polymer HRP Ms + Rb (2414515, Akoya Biosciences, USA) for 10 min at 37°C, followed by visualization using the Opal Seven-Color IHC Kit (NEL797B001KT, Akoya Biosciences, USA), with the correspondence between primary antibodies and fluorophores as XTSA520, XTSA540, XTSA570, XTSA620, XTSA650, and XTSA690. After each cycle of staining, heat-induced epitope retrieval was performed to remove all antibodies, including primary antibodies and Opal Polymer HRP Ms + Rb. Finally, the slides were counterstained with DAPI for 5 mins and enclosed in Antifade Mounting Medium (I0052; NobleRyder, Beijing, China).
mIHC Analysis and Cellular Metaclusters Identification
To analyze the tissue sections, mIHC images with high resolution in the image dataset were randomly selected as training samples to build an algorithm of tissue segmentation, cell segmentation, phenotyping tool, and positivity score using inForm Advanced Image Analysis software (inForm 2.5.0; Akoya Biosciences, USA). The algorithm was then applied in batch analysis of all the images. These procedures generated a single-cell dataset. PANCK, CD4, CD8, FOXP3, CD68, and PD-L1 were used for the subclass division of tumor cells and immune cells. In order to reduce the dimensionality of the data for visualization, Uniform manifold approximation, and projection (UMAP) was employed. Additionally, classical markers were utilized for the manual annotation of cell metaclusters.
Cellular Neighborhood (CN)Analysis
To determine the CN of each cell, we employed a methodology where windows consisting of the 20 nearest neighboring cells, based on the Euclidean distance between their X/Y coordinates[20]. These windows were then clustered based on their compositions regarding all cell metaclusters, using K-means clustering. Setting k = 10, each cell's CN was subsequently assigned according to the CN of its surrounding window[8, 57]. To validate the CN assignment, we overlaid the Voronoi allocation graphs onto the original tissue mIHC images.
Univariable and Multivariable Logistic Regression Analysis
Initially, we conducted univariate analyses regressing the LN status (pN positive or negative) on the clinicopathologic features and genomic alterations. The formula for univariable logistic regression was expressed as:
$$\:logit\left(pr\right)={\beta\:}_{0}+{\beta\:}_{1}X$$
where: pr is the probability of the LN metastasis occurring, β0 is the intercept term, and β1 is the coefficient for the predictor clinicopathologic or genomic variable.
The top clinicopathologic or genomic factors associated with LN metastasis in univariate analysis were included as covariates in multivariate logistic regression. The formula for multivariable logistic regression was expressed as:
The ImGene Model Construction
To predict the LN stage, we implemented each classifier using the scikit-learn (1.0.1, https://scikit-learn.org/stable/) library in Python. The coefficients were calculated as shown in Fig. 6B. We selected the top 10 coefficients from both the mIHC image features (ImFeatures) and the genomic features (GeneFeatures) to construct the models. After removing samples with unmatched ImFeatures, we obtained a dataset consisting of 83 patients with mIHC and NGS data. Among the classifiers, ImGene Support Vector Machines (SVM) demonstrated the highest performance with an AUC of 0.86, F1 Score of 0.83, and Accuracy of 0.82, surpassing Random Forest, ImFeatures SVM, GeneFeatures SVM, and Gradient Boosting.
Where, \(\:\:Z=3.51+({\beta\:}_{Im}\times\:ImFeatures)+({\beta\:}_{Gene}\times\:GeneFeatures)\)
\(\:{\beta\:}_{Im}\) is the coefficient for the ImFeatures, \(\:{\beta\:}_{Gene}\) is the coefficient for the Genefeatures (Fig. 6C).
To construct individual models based on IHC image and genomic features, SVM was implemented using three parameters: (1) C, which represents the penalty coefficient, (2) the Kernel function, and (3) Gamma. The best combination of these parameters was determined using Leave-One-Out Cross-Validation with the training dataset. The cutoff value for classification was set at the point with the highest accuracy in the validation set. In order to obtain the optimal model, logistic regression models were constructed using the predictive scores from the individual models as input features. This integration of the performance of each model was based on both the discovery and validation datasets. The probability of the LN status in a patient was calculated as follows:
$$\:Pr\left(ImGene\right)=exp\left(Z\right)/(1+exp(Z\left)\right)$$
Validation dataset
An independent external dataset comprising 61 patients treated with complete resection approved by ethics number IRB NO.2021PHB182-001 was collated. All patients provided informed consent for sample collection and all participants consented to the publication of research results. These cases were selected due to the availability of NGS and mIHC data. Clinical details for these 61 cases are summarized in Supplementary Table 4.
Statistical Analyses
The clinicopathologic features were summarized using either the median with IQR or frequency with percentage. To compare these features among the three radiological subtypes, we employed the Kruskal-Wallis test for continuous variables and the Chi-squared exact test or chi-square test for categorical variables. When it came to genomic variables, we utilized the Chi-squared test with FDR adjustment to address multiple testing. To identify the independent predictive features for LN metastasis, both univariate and multivariate negative binomial regression models were constructed, taking into account clinicopathologic and tumor genomic characteristics.
The data obtained from mIHC were imported into GraphPad software, and statistical tests were chosen based on the data distribution and variability. Two-tailed Student's t-test was used to compare two groups, while a one-way analysis of variance with Bonferroni corrected significance threshold was used for comparisons among three or more subgroups. The data are expressed as means ± SEMs (standard error of the mean). The Spearman rank correlation test was used to investigate the relationship between spatial and amount variables. The t-test and Kruskal-Wallis H test were used to compare continuous variables between two and more groups respectively. For categorical data, the Chi-squared test and Fisher's exact test were employed. For multiple clusters, the final category label was defined by combining the group information for all clusters.
All statistical analyses were performed using R 4.0.2 (R Core Team, Vienna, Austria). A p-value of < 0.05 (or FDR q < 0.1) was considered statistically significant.