The County of Kent:
Kent is the largest county in England, with a population of 1.6 million23. It spans an exceptional range from affluence to extreme poverty. Before COVID, a life expectancy gap of almost 20 years already existed between the least and most deprived wards24. Some of the largest groups experiencing extreme health inequalities are asylum seekers, migrants and refugees; Gypsy, Roma and Traveller communities; veterans; looked-after children; and seasonal agricultural workers. Kent faces a range of key health challenges. Widening inequalities in health and wellbeing are observed both across geographical areas and among people with different vulnerabilities, influenced by a range of wider determinants of health. A ‘coastal excess or effect’ in health inequalities exists across its numerous coastal and rural communities25.
Dataset description:
Data for this study were taken from the KID22, which contains a wide range of patient-level, pseudonymised, integrated health and care data. The KID is overseen by a steering group, the Kent & Medway Shared Health and Care Analytics board (SHcAB), which includes representatives of Kent County Council, local health commissioners and information governance leads. The SHcAB considers issues such as information governance, development of the dataset and applications for use of the data. The Kent and Medway data warehouse team provides day-to-day administration and project management. The SHcAB granted the first author access for the duration of the study through established due process. Patients can opt out of contributing data to the KID by informing their GP surgery that they do not want their data to be shared with external organisations. Because the KID is a pseudonymised person-level dataset, the data are not in the public domain. We established a project oversight group, supported by the Kent & Medway cancer alliance, which included cancer clinicians, service managers, Public Health physicians, epidemiologists and AI experts. Regular stakeholder engagement, involving patient and public representatives, took place throughout the study.
The KID held a six-year longitudinal record (2014-2019) of health and care data for 1,865,382 residents. We first excluded those under 18 years of age (n=599,866), which reduced the cohort to 1,265,516. We then removed a further 10,532 patients (0.8% of the remaining cohort) because of incomplete or missing records, which took the cohort size to 1,254,984. The final dataset therefore contained 1,254,984 patients, of whom 6,053 were diagnosed with a primary lung cancer during this period; these were included within the scope of this investigation. The lung cancer cohort encompassed only lung cancers that originated from a primary malignant tumor site, excluding benign tumors and secondary metastases caused by other types of cancer. To ensure comprehensive capture of all patients meeting the criteria, we assessed both primary and secondary healthcare records using the relevant SNOMED or ICD-10 codes respectively. Patients with lung cancer included all confirmed diagnoses regardless of the care setting in which the diagnosis was made, staging at the time of diagnosis, disease progression, or onward treatment options and outcomes. Core dimensions of data used within this study are shown below:
• Patient Demographics
• Primary Care (Events, Consultations, Long term condition registers, Medications, Deaths)
• Secondary Care (A&E, Inpatient Spells and Outpatients, Critical Care Bed Days)
• Mental Health (Inpatient and Outpatient History)
• Community Care (Contacts, Appointments, Minor Injuries Units and Walk In Centers)
• Wider Determinants of Health including Housing, Education, Occupation, Economic and Deprivation
• Environmental Datasets - Pollution, Radon ground levels
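The cohort derivation reported before the list above can be checked with a few lines of arithmetic. The sketch below uses only the counts stated in the text; the variable names are illustrative and are not taken from the study code.

```python
# Counts as reported in the text; variable names are illustrative.
total_residents = 1_865_382
under_18 = 599_866
incomplete_records = 10_532
lung_cancer_cases = 6_053

# Step 1: exclude residents under 18 years of age.
adults = total_residents - under_18

# Step 2: remove patients with incomplete or missing records.
final_cohort = adults - incomplete_records

# Share of the adult cohort removed in step 2, and the crude proportion
# of the final cohort with a primary lung cancer diagnosis.
share_removed = incomplete_records / adults
crude_proportion = lung_cancer_cases / final_cohort

print(adults, final_cohort)
print(f"{share_removed:.1%} removed; {crude_proportion:.2%} with primary lung cancer")
```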
Data Pre-processing:
The dataset contained missing values mainly in the attribute named ‘ethnicity’, as shown in Table 1, despite considerable effort to capture ethnicity coding from various sources. We therefore excluded this attribute from the model, as we did not consider it appropriate to impute it with an average value or a synthetic derivative, as is commonly done. The other attributes had few or no missing or outlier values, so no further transformations were made to the remainder of the dataset.
The data attributes are grouped into life history, symptoms, diagnostics, treatment and end of life care, based on the stage at which the data are collected, as depicted in Figure 1. To prepare the data for modelling patients' risk ratios, we extracted only the essential attributes from the dataset. These columns were selected for their potential to provide valuable predictive information. We focused specifically on data concerning the pathways leading to the diagnosis of lung cancer, as these held valuable insights into the associated causes and symptoms. We omitted attributes related to the cancer diagnosis itself: two-week wait urgent referrals, appointments to see an oncologist, chest X-rays and Low-Dose Computed Tomography (LDCT) scans for confirming diagnosis, treatment options such as chemotherapy and radiotherapy, and mortality. These attributes were excluded because they were deemed non-predictive elements that did not offer significant insight into the risk of a positive diagnosis of lung cancer. We excluded the above diagnostic and treatment elements up to 12 months before the date of diagnosis.
Relative risks (RR) were calculated for all variables and were used to determine the important attributes and for categorisation. Relative risk is the ratio of the incidence of an event (lung cancer) with an exposure (e.g., smoking) to the incidence of the same event without the exposure. For example, the relative risk of developing lung cancer in smokers (the exposed group) versus non-smokers (the non-exposed group) is the probability of developing lung cancer for smokers divided by the probability of developing lung cancer for non-smokers. All characteristics of the individual datasets, such as medications, events, tests, demographic qualities and wider determinants of health, were tested and risk scored using this methodology. To reduce the number of categories, we collapsed them into meaningful groupings, informed by the higher relative risk of related variables. For instance, respiratory disorders such as COPD and asthma, each of which has numerous diagnosis codes, were built up into a simple three-state option: Yes, No or Has Familial History. Other high-dimensional features, such as smoking history and activity, were ranked into similar groups by creating scores.
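The relative risk calculation and the three-state grouping described above can be sketched as follows. This is a minimal illustration, not the study code: the function names and the counts are hypothetical, and the precedence used when a patient has both their own diagnosis and a familial history is an assumption.

```python
def relative_risk(exposed_cases, exposed_total, unexposed_cases, unexposed_total):
    """Risk of the event in the exposed group divided by risk in the unexposed group."""
    if exposed_total <= 0 or unexposed_total <= 0:
        raise ValueError("group sizes must be positive")
    risk_exposed = exposed_cases / exposed_total
    risk_unexposed = unexposed_cases / unexposed_total
    if risk_unexposed == 0:
        raise ValueError("relative risk is undefined when the unexposed risk is zero")
    return risk_exposed / risk_unexposed


def collapse_to_three_states(has_own_diagnosis, has_family_history):
    """Collapse many diagnosis codes into Yes / Has Familial History / No.

    Assumption: a patient's own diagnosis takes precedence over family history.
    """
    if has_own_diagnosis:
        return "Yes"
    if has_family_history:
        return "Has Familial History"
    return "No"


# Made-up counts for illustration only; these are NOT figures from the KID.
rr_smoking = relative_risk(
    exposed_cases=300, exposed_total=10_000,      # smokers
    unexposed_cases=100, unexposed_total=20_000,  # non-smokers
)
print(round(rr_smoking, 2))  # about 6: 3% risk in smokers against 0.5% in non-smokers
```

A relative risk above 1 indicates that the event is more frequent in the exposed group; attributes with higher values are the ones that would inform the groupings.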
Model development:
We used feature encoding to reduce the number of states, simplify model development and enhance performance. One-hot encoding and standard scaling were used for the feature encoding26. Because we needed a scalar risk score to help prioritise patients at greatest risk of developing lung cancer within a screening pool, logistic regression and other categorical models were ruled out. Traditional linear regression was selected as an initial candidate model, with the aim of detecting lung cancers early and thereby improving outcomes over and above the current screening protocol for lung cancer in the UK.
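The two encoding steps named above can be illustrated with a short standard-library sketch. The source does not state which library or settings were used, so the helper functions, the example values and the choice of the population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def one_hot(values):
    """Encode each categorical value as a 0/1 indicator vector, one slot per category."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded


def standard_scale(values):
    """Centre on the mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # a constant column carries no information
    return [(v - mu) / sigma for v in values]


# Hypothetical toy columns: a three-state respiratory flag and a smoking score.
respiratory = ["Yes", "No", "Has Familial History", "No"]
smoking_score = [3, 0, 1, 0]

categories, encoded = one_hot(respiratory)
scaled = standard_scale(smoking_score)

# One model-ready row per patient: indicator columns followed by the scaled score.
rows = [indicators + [score] for indicators, score in zip(encoded, scaled)]
print(categories)
print(rows[0])
```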
Sixteen attributes were identified using a combination of methods: analysis of the data, proposals from clinical experts and published literature27,28. For a given set of n attributes, where n could be anywhere from 2 to 16, we took the entire population data and split it into 70% training and 30% validation datasets29. We then used the 70% dataset to build a linear regression model on these n attributes. We developed a loop in Python30 to work through all possible combinations of the 16 attributes and assess their ability to detect lung cancer. Each model for n attributes was applied to the 30% validation population to produce an output: the number of lung cancer cases detected. This was repeated one hundred times (Figure 2) to create multiple outputs that could be averaged to test the models’ repeatability and for onward evaluation. We then employed bootstrapping31 to test how well the model generalises across randomised populations. In each run, both the 70% training set and the 30% validation set were randomised again to guard against potential biases or chance influences. This randomisation also aimed to provide comprehensive average performance statistics for all models. In each model run, the TLHC eligibility criteria were applied and the number of cancers counted. This was compared with the number found among the highest risk-scored patients identified by the prediction model, keeping the two screening cohort sizes equal.
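The search over attribute combinations and the repeated random 70/30 splits described above might be organised along the following lines. This is a sketch under stated assumptions rather than the authors' loop: the attribute names are placeholders, the patient identifiers are synthetic, and the model fitting step is deliberately left out.

```python
import random
from itertools import combinations
from math import comb

# Placeholder names; the sixteen real attributes are not listed in this section.
ATTRIBUTES = [f"attr_{i}" for i in range(1, 17)]


def attribute_subsets(attributes, min_size=2):
    """Yield every combination of attributes from min_size up to the full set."""
    for size in range(min_size, len(attributes) + 1):
        yield from combinations(attributes, size)


def random_split(patient_ids, train_fraction, rng):
    """Shuffle the identifiers and cut them into training and validation parts."""
    shuffled = list(patient_ids)
    rng.shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


# Number of candidate attribute sets when n ranges from 2 to 16.
n_subsets = sum(comb(16, k) for k in range(2, 17))
print(n_subsets)

# One hundred independent 70/30 splits of a synthetic population of 1,000 ids.
rng = random.Random(0)  # fixed seed so the illustration is repeatable
splits = [random_split(range(1000), 0.7, rng) for _ in range(100)]

# In a full run, each subset would be fitted on every training part and scored
# on the matching validation part, and the per-run counts would be averaged.
```

With sixteen attributes there are 65,519 combinations of size 2 to 16, which is why an exhaustive loop is feasible for a simple model such as linear regression.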
Model evaluation:
Standard evaluation metrics (e.g., R²) were not appropriate for evaluating the algorithm, given the desired scalar output of the model: our objective was to identify a cohort at high risk of lung cancer. Instead, we reasoned that if the algorithm were working efficiently, it should find more lung cancer cases within a screening pool of the population than the current screening pilots ongoing in England. To baseline our evaluation, we therefore compared the output of the algorithm against the current screening population for the TLHC32 programme. Patients meeting the following three criteria are invited for screening:
- are over 55 but younger than 75 years old
- are registered with a GP in the area the scheme is operating
- have ever smoked, and this is recorded with the GP.
The number of cases found using the TLHC criteria was then compared with the number of cases identified by the linear regression model using the top-performing combination of attributes.
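The comparison described above, with both screening cohorts held to the same size, can be sketched as follows. The `Patient` record, its field names and the toy data are hypothetical, and reading "over 55 but younger than 75" as a strict inequality at both ends is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    age: int
    registered_in_area: bool  # registered with a GP where the scheme operates
    ever_smoked: bool         # smoking history recorded with the GP
    risk_score: float         # scalar output of a fitted model (hypothetical)
    lung_cancer: bool         # observed outcome


def tlhc_eligible(p):
    """Apply the three TLHC criteria; strict age bounds are an assumption."""
    return 55 < p.age < 75 and p.registered_in_area and p.ever_smoked


def compare_cohorts(patients):
    """Count cancers in the TLHC cohort and in an equally sized top-scored cohort."""
    tlhc = [p for p in patients if tlhc_eligible(p)]
    ranked = sorted(patients, key=lambda p: p.risk_score, reverse=True)
    top_scored = ranked[:len(tlhc)]
    return sum(p.lung_cancer for p in tlhc), sum(p.lung_cancer for p in top_scored)


# Toy population for illustration only; not drawn from the KID.
patients = [
    Patient(60, True, True, 0.2, False),
    Patient(70, True, True, 0.9, True),
    Patient(50, True, True, 0.8, True),
    Patient(80, True, False, 0.1, False),
    Patient(65, True, False, 0.7, True),
]

tlhc_cases, model_cases = compare_cohorts(patients)
print(tlhc_cases, model_cases)
```

In this toy population two patients meet the criteria, so the model cohort is also limited to the two highest-scored patients, and the two case counts are directly comparable.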