Study population
This was a multicenter cross-sectional study based on the high-incidence regions of esophageal and gastric cancer established by the cancer early diagnosis and early treatment project in China [12]. In 2017, a new screening study of upper gastrointestinal cancer was launched in five high-risk regions of China: Hebei, Henan, Shandong, Shanxi, and Gansu Provinces. The main purpose of this project was to identify the population at high risk of upper gastrointestinal cancer and to establish a cancer risk prediction model to support the prevention of upper gastrointestinal cancer.
The inclusion criteria were as follows: (1) local permanent residents in the selected regions, (2) no history of endoscopic examination during the last 3 years, (3) no history of cancer, mental disorder, or any contraindication for endoscopy, (4) signed informed consent, and (5) agreement to complete the entire survey and examination, including endoscopy. The participant selection process is shown in Figure 1. We recruited participants from April 2017 to December 2018. The final analysis included 34,707 residents aged 40-69 years. Among these participants, 81 had ESCC, 251 had HGIN, 1,413 had LGIN, 3,883 had esophagitis, and 29,079 had a normal esophagus and served as controls.
The study was approved by the Capital Medical University, Chinese Academy of Medical Sciences and Peking Union Medical College. The experimental protocol involving humans was in accordance with the guidelines of the Declaration of Helsinki.
Diet and symptoms assessment
Comprehensive questionnaire information was collected by trained investigators through face-to-face interviews and entered directly into a laptop-based data entry system. The data entry software was designed to reduce missing items and logical errors. A questionnaire typically took 35-45 minutes to complete. Dietary intake items were selected from this questionnaire, including livestock meat, poultry meat, seafood, eggs and their products, vegetables, fruits, bean products, scallions, ginger and garlic, pickles and nuts. All variables were categorical. Foods consumed more frequently were divided into three categories: every day, 1-6 days per week, and less than 1 day per week. Foods consumed infrequently were divided into two categories: at least 1 day per week and less than 1 day per week. Typical symptom items were also selected from the questionnaire, including number of lost teeth, frequent bleeding of gums, dysphagia, bloating, heartburn, acid reflux, nausea, vomiting, belching and epigastric pain. The number of lost teeth was categorized into three groups: none, 1-3 teeth and more than 4 teeth. The other variables were categorized as yes or no.
Outcome assessment
The endoscopic examinations were carried out by physicians at local hospitals. Procedures followed the clinical guidelines for cancer screening and early diagnosis and treatment in China. Lugol’s iodine staining was used to identify suspicious tissues, which were then biopsied. To grade severity, the esophageal mucosa was classified into 5 categories: normal esophageal mucosa, minor mucosal changes, esophagitis, esophageal squamous simple hyperplasia (ESSH) and esophageal squamous dysplasia (ESD) [13]. ESD was further classified into 3 levels: mild, moderate, and severe. According to the WHO histological classification of tumors, mild and moderate ESD were combined as LGIN, and severe ESD and squamous cell carcinoma in situ were considered HGIN. If diagnoses were inconsistent, a third pathologist was consulted and the discrepancy was resolved through discussion. For participants with multiple lesions, the most severe biopsy diagnosis was reported. In this study, we divided the participants into 4 groups: normal control, esophagitis, LGIN and HGIN/ESCC.
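The grouping rule above (most severe biopsy diagnosis per participant, then collapse into four groups) can be illustrated with a minimal Python sketch. The diagnosis labels and the `outcome_group` helper are hypothetical and are not the study's data dictionary. The text does not state how minor mucosal changes and ESSH were assigned, so they are left out here, and treating a participant without biopsies as a normal control is an assumption.

```python
# Hypothetical diagnosis codes mapped to a severity rank (assumed ordering).
SEVERITY = {
    "normal": 0,
    "esophagitis": 1,
    "mild_ESD": 2,           # mild and moderate ESD are combined as LGIN
    "moderate_ESD": 2,
    "severe_ESD": 3,         # severe ESD and carcinoma in situ are HGIN
    "carcinoma_in_situ": 3,
    "ESCC": 3,               # analysed together with HGIN
}

GROUPS = {0: "normal control", 1: "esophagitis", 2: "LGIN", 3: "HGIN/ESCC"}


def outcome_group(biopsy_diagnoses):
    """Return the analysis group from the most severe biopsy diagnosis."""
    if not biopsy_diagnoses:
        # Assumption: no biopsy means no suspicious lesion was seen.
        return GROUPS[0]
    worst = max(SEVERITY[d] for d in biopsy_diagnoses)
    return GROUPS[worst]
```

For example, a participant with one mild and one severe dysplastic lesion would be placed in the HGIN/ESCC group under this sketch.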
Statistical methods
We characterized dietary patterns and symptom patterns by performing latent class analysis (LCA) on the observed item responses. The patterns were assumed to be unobserved, mutually exclusive classes, each with its own probability distribution over the item responses.
LCA identified latent classes of participants based on the ten dietary variables and six symptom variables. Estimation used robust maximum likelihood with the expectation-maximization algorithm [14]. Statistical fit indexes were used to assess model fit and to decide the final number of latent classes. The best-fitting model was selected by a combination of the following criteria: (a) the lowest Akaike information criterion (AIC), (b) the lowest Bayesian information criterion (BIC), (c) a significant Lo–Mendell–Rubin likelihood ratio test (LMR), (d) a significant Lo–Mendell–Rubin adjusted likelihood ratio test (ALMR), and (e) an entropy of 0.6 or greater [15]. Next, we fitted an unconditional multivariable logistic regression to identify sociodemographic and risk factors that predicted class membership. Before conducting the analysis, we performed a collinearity diagnosis among the independent variables. We calculated adjusted odds ratios (ORs) and 95% confidence intervals (CIs) from models that simultaneously included age, gender, education, body mass index (BMI), smoking and drinking. We also built nomograms that modeled each disease stage separately against normal controls. The models were evaluated using calibration curves and decision curves. For descriptive statistics, categorical variables are presented as percentages and continuous variables as means and standard deviations.
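As an illustration of the fit indexes listed above, the following Python sketch computes AIC, BIC and relative entropy from quantities that LCA software normally reports (the log-likelihood and the posterior class-membership probabilities). It uses the standard textbook definitions and is not a reproduction of the Mplus output; the function names are hypothetical, and the parameter count assumes an unrestricted model with categorical items.

```python
import math


def n_free_parameters(n_classes, categories_per_item):
    """Free parameters of an unrestricted LCA with categorical items."""
    class_weights = n_classes - 1
    item_probs = n_classes * sum(c - 1 for c in categories_per_item)
    return class_weights + item_probs


def aic(log_lik, n_params):
    """Akaike information criterion (lower is better)."""
    return -2.0 * log_lik + 2.0 * n_params


def bic(log_lik, n_params, n_obs):
    """Bayesian information criterion (lower is better)."""
    return -2.0 * log_lik + n_params * math.log(n_obs)


def relative_entropy(posteriors):
    """Relative entropy in [0, 1]; 1 means perfectly separated classes.

    posteriors: one list of class-membership probabilities per participant.
    """
    n = len(posteriors)
    k = len(posteriors[0])
    if k == 1:
        return 1.0
    # Terms with p == 0 contribute nothing (limit of p * log p is 0).
    total = -sum(p * math.log(p) for row in posteriors for p in row if p > 0)
    return 1.0 - total / (n * math.log(k))
```

In practice these values would be tabulated for models with an increasing number of classes and read alongside the LMR and ALMR tests, which compare each k-class model with the (k-1)-class model.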
LCA was conducted in cases and normal controls combined. An analysis restricted to the normal controls was performed to check the robustness of this solution. As the dietary patterns and symptom severity classes identified in controls were consistent (in the number and characteristics of the patterns) with those obtained from the overall dataset, we based all our analyses on the overall dataset. To assess the internal reproducibility of the chosen solution, the analysis was repeated several times, each time separately in two randomly selected subsets of the original data.
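The repeated random-subset check can be sketched as follows. This minimal Python example only generates the subsets; it assumes that the two subsets are complementary halves of the data, which the text does not specify, and the `random_halves` helper and its parameters are hypothetical. The LCA would then be refitted in each subset and the resulting classes compared.

```python
import random


def random_halves(n_participants, n_repeats, seed=0):
    """Return n_repeats pairs of disjoint index lists covering all participants."""
    rng = random.Random(seed)  # fixed seed so the splits can be reproduced
    indices = list(range(n_participants))
    mid = n_participants // 2
    splits = []
    for _ in range(n_repeats):
        shuffled = indices[:]
        rng.shuffle(shuffled)
        # First half and second half of the shuffled order form the two subsets.
        splits.append((sorted(shuffled[:mid]), sorted(shuffled[mid:])))
    return splits
```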
Statistical analyses were performed using Mplus (version 8.1) and R (version 3.6.3) software. All tests were two-sided and had a significance level of 0.05.
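For reference, an adjusted OR and its 95% CI follow from a logistic regression coefficient and its standard error by exponentiation, with the two-sided 0.05 level giving a critical value of about 1.96. The Python sketch below shows this standard Wald calculation; the function name and inputs are hypothetical, and whether the study used Wald or another type of interval is not stated.

```python
import math
from statistics import NormalDist


def odds_ratio_ci(beta, se, alpha=0.05):
    """Return (OR, lower, upper) for a logistic coefficient and its SE."""
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return (
        math.exp(beta),
        math.exp(beta - z * se),
        math.exp(beta + z * se),
    )
```

A CI that excludes 1 corresponds to a two-sided Wald test that is significant at the chosen level.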