Patients
Data from patients affected by invasive carcinoma of the cervix, FIGO stage IB2-IVa, treated between 2015 and 2018 were extracted from our institutional data lake. The following Electronic Health Record (eHR) items were considered for analysis:
- Staging Magnetic Resonance (MR) report;
- Gynecologic examination under general anesthesia (EUA) report;
- Staging Positron Emission Tomography–Computed Tomography (PET-CT) report.
Other relevant patient data (e.g., demographics, laboratory tests, body mass index, drugs, comorbidities) were collected for further analysis.
Methods
A two-step model was applied to set up the MTB Virtual Assistant:
(i) automated extraction of the relevant eHR sets that capture the patient’s data before diagnosis, followed by Natural Language Processing (NLP) to analyze and categorize all information, transforming the source text into structured data;
(ii) development of A.I. methods to support the clinical staff in the decision process with regard to tumor staging confirmation, and to help identify the most complex cases, where deeper analysis and discussion are needed (e.g., due to conflicting information coming from different exams).
A first subset of patients with pre-validated staging and diagnosis was used as the training set for both steps.
Once steps (i) and (ii) had been completed and successfully tested on patient subsets with pre-validated staging and diagnosis (the ‘training set’), we developed an integrated toolset to support the MTB diagnostic process. Each time a new patient is selected for staging and treatment decision-making and enters the workflow, her eHRs are automatically processed to provide structured clinical features (e.g., presence/absence of specific disease features in the tumor region, tumor activity).
The A.I. algorithm then delivers an assessment of the tumor stage with an associated degree of reliability, reported on screen as a percentage of accuracy. If needed, the MTB staff can go deeper into the characterization of the information, performing further analyses of clinical data patterns from different sources and comparing the content of different eHRs. The depth and complexity of this information, combined with the A.I.-empowered multi-dimensional analyses, allow a robust consensus on the clinical decision to be reached.
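The two-step flow can be sketched as a minimal interface. This is an illustrative Python sketch, not the production system (which was built on SAS tooling): the names `StructuredFeatures` and `predict_stage` are hypothetical, and a trivial stand-in rule replaces the trained staging model.

```python
from dataclasses import dataclass

# Hypothetical structured output of step (i): one binary flag per
# clinical feature, for one diagnostic modality.
@dataclass
class StructuredFeatures:
    modality: str   # "MR", "EUA" or "PET-CT"
    features: dict  # e.g. {"bladder_involved": True, ...}

def predict_stage(feature_sets):
    """Placeholder for step (ii): return a stage with a confidence score.

    A trivial rule (bladder involvement in any modality suggests stage IVa)
    stands in for the trained model, purely to illustrate the interface:
    the confidence reported to the MTB reflects how far the modalities agree.
    """
    bladder = [fs.features.get("bladder_involved", False) for fs in feature_sets]
    if any(bladder):
        # confidence = fraction of modalities that agree on the finding
        return "IVa", sum(bladder) / len(bladder)
    return "IB2", 1.0
```

The key design point illustrated here is that the output is not a bare stage label but a (stage, reliability) pair, so that low agreement across exams surfaces as low confidence on screen.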
Step (i): Natural Language Processing: extracting clinical data from text-based medical reports
The first step is the extraction of clinically relevant information from MR, EUA, PET-CT reports and other eHRs. The challenge with these data sources was first to transform the unstructured information into discrete, categorical data able to define a clear, robust and actionable framework of clinical and pathological features related to the loco-regional tumor morphology.
The output of this transformation is a pattern of structured clinical features that describe the patient’s disease in detail; these data constitute the source information for the integrated A.I.-empowered analysis.
In terms of the computer algorithm used, the NLP method to transform text into data is based on a hybrid approach combining rules and annotations derived from medical guidelines with A.I. (machine learning); in this work, it was developed using the SAS Visual Text Analytics® environment [12,13]. Pre-processing steps such as segmentation, boundary detection and tokenization, and word normalization (stemming, spelling correction, expansion of abbreviations) were performed to achieve a higher degree of accuracy. Thereafter, syntactic and semantic analysis was performed with the support of an algorithm that builds a network of words, showing the occurrence of links between pairs of words and providing an enhanced approach to natural language understanding. Finally, the sequence of steps above yielded the relevant NLP features enabling data extraction from real-life medical reports.
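The pre-processing stages can be illustrated with a minimal sketch. This is not the SAS Visual Text Analytics pipeline; it is a simplified Python analogue, and the abbreviation map is an invented placeholder for the guideline-derived lexicon.

```python
import re

# Hypothetical abbreviation map; a production system would derive this
# from medical guidelines and a curated clinical lexicon.
ABBREVIATIONS = {"pt": "patient", "ca": "carcinoma", "lt": "left", "rt": "right"}

def preprocess(report_text):
    """Minimal pre-processing: boundary detection, tokenization, normalization."""
    # Sentence boundary detection on terminal punctuation.
    sentences = re.split(r"(?<=[.;])\s+", report_text.strip())
    tokenized = []
    for sent in sentences:
        # Tokenization plus lowercasing (word normalization).
        tokens = re.findall(r"[a-zA-Z]+", sent.lower())
        # Expansion of abbreviations.
        tokens = [ABBREVIATIONS.get(t, t) for t in tokens]
        tokenized.append(tokens)
    return tokenized
```

For example, `preprocess("Pt with ca of cervix. No bladder invasion.")` yields normalized token lists in which "pt" and "ca" have been expanded, ready for the downstream syntactic and semantic analysis.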
By using these NLP steps, the medical reports were processed and free-text diagnostic information was transformed into categorical or quantitative clinical data classifying the clinical features resulting from each of the three exams (MR, EUA, PET-CT). The selection of the relevant clinical features that characterize the diagnosis, and most importantly tumor staging, was performed by the multidisciplinary clinical team and constitutes the basis for the ontology of the study.
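The rule-based side of the hybrid approach can be sketched as keyword rules with simple negation handling. The feature names, trigger keywords, and negation cues below are invented examples; the real rules come from the clinical team’s ontology and the guideline-derived annotations.

```python
import re

# Hypothetical rule set: each clinical feature is triggered by a keyword,
# and a simple negation cue flips the value to "absent".
FEATURE_KEYWORDS = {
    "bladder_involved": "bladder",
    "rectovaginal_septum_involved": "rectovaginal septum",
    "parametrium_involved": "parametri",  # matches parametrium / parametria
}
NEGATIONS = ("no ", "not ", "without ", "free of ")

def extract_features(report_text):
    """Turn free text into binary clinical features (None = not mentioned)."""
    features = {name: None for name in FEATURE_KEYWORDS}
    for sentence in re.split(r"(?<=[.;])\s+", report_text.lower()):
        negated = any(cue in sentence for cue in NEGATIONS)
        for name, keyword in FEATURE_KEYWORDS.items():
            if keyword in sentence:
                features[name] = not negated
    return features
```

Features never mentioned in a report remain `None`, which preserves the distinction between "assessed as absent" and "not assessed by this exam" needed for the per-exam tables below.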
Therefore, the result of this data discovery process is, for each patient, a table showing how detailed clinical features in the tumor region are diagnosed by each of the three exams, as shown in Table 1a. Each clinical feature is inspected and reported as present or absent within the framework of the three exam types. Categorical morphological variables (i.e., whether or not a specific region is involved) are mostly extracted from MR and EUA, while PET-CT clinical features provide additional levels of tumor (metabolic) activity.
Therefore, after the automated eHR reading and the subsequent NLP step, the patient’s clinical features are collected in a summarized pattern, as shown in Table 1b (a specific instance of the table for one patient case); this view shows, for each clinical feature, whether it has been identified as positive (i.e., whether that region is involved in the tumor progression) or not. For example, Table 1b indicates bladder involvement detected by both MR and EUA, while the rectovaginal septum appears involved according to the EUA but not the MR. Such conflicting outcomes may indicate uncertainty in the staging assessment, which is typically reflected in the predictive model results, as explained in step (ii) below.
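Cross-exam disagreement of the kind just described can be detected mechanically once the features are structured. The sketch below is a hypothetical illustration (the function name and data layout are ours, not the system’s): it flags any feature on which two exams that both assessed it disagree.

```python
def find_conflicts(per_modality):
    """Flag features where modalities disagree (e.g. EUA positive, MR negative).

    `per_modality` maps modality name -> {feature: True / False / None};
    None means the exam did not assess that feature.
    """
    conflicts = {}
    all_features = {f for feats in per_modality.values() for f in feats}
    for feature in all_features:
        # Keep only modalities that actually assessed this feature.
        values = {m: feats.get(feature) for m, feats in per_modality.items()
                  if feats.get(feature) is not None}
        if len(set(values.values())) > 1:
            conflicts[feature] = values
    return conflicts
```

On the Table 1b example, a rectovaginal-septum finding positive on EUA but negative on MR would be returned as a conflict, which the MTB can then inspect before accepting the predicted stage.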
This transformation from unstructured to structured data is the mainstay of the input to the prediction and clustering subsequently executed by the A.I. (machine learning) models.
Step (ii): Assessment of Tumor Staging through Statistical Learning
To create a system that supports the MTB in disease staging, the first step was to apply a supervised learning technique to the training set, where the tumor stage was known a priori for each patient. This was achieved by applying clustering methods to classify patients based on similarity in their clinical feature pattern (the summary view as in Figure 1) and in their diagnosed staging. Applying clustering algorithms to each of the three diagnostic methods separately (MR, EUA, PET-CT) generated seven groups per diagnostic method, with a good degree of discrimination. Once the clusters had been created in the training set, a machine learning algorithm was used to build a predictive model for staging based on the composition of the clusters. “Decision Tree” algorithms were adopted, using the SAS Viya® analytics and modelling features.
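The cluster-then-classify scheme can be sketched with open-source equivalents. The study used SAS Viya, so the following is only an analogous illustration with scikit-learn, toy binary feature data, two clusters instead of seven, and a single modality.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

# Toy binary feature matrix for one modality (e.g. MR): one row per patient,
# one column per clinical feature. Values and labels are illustrative only.
X_mr = np.array([
    [1, 1, 0, 0],  # patients with a proximal disease pattern
    [1, 1, 1, 0],
    [0, 0, 1, 1],  # patients with distal involvement
    [0, 1, 1, 1],
])
stages = ["IB2", "IB2", "IVa", "IVa"]  # pre-validated staging (training labels)

# Cluster patients by similarity of their feature pattern (per modality).
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_mr)

# Use cluster membership alongside the raw features to train a decision
# tree that predicts the validated stage.
X_train = np.column_stack([X_mr, clusters])
tree = DecisionTreeClassifier(random_state=0).fit(X_train, stages)
```

In the real system this is repeated per diagnostic method (MR, EUA, PET-CT), and the tree learns the stage from the composition of the resulting clusters.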
Finally, a validation step was performed on a new set of patients, predicting their staging with the trained Decision Tree model in order to test the validity of the model.
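The validation step amounts to scoring the trained model on held-out patients with known staging. Continuing the hedged scikit-learn sketch (toy data, invented values, not the SAS Viya pipeline):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Illustrative training patients (binary clinical features) and their
# pre-validated stages.
X_train = np.array([[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1], [0, 1, 1, 1]])
y_train = ["IB2", "IB2", "IVa", "IVa"]

# Held-out "new" patients whose true staging is known, used only to
# test the model's validity.
X_val = np.array([[1, 0, 0, 0], [0, 1, 1, 1]])
y_val = ["IB2", "IVa"]

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
accuracy = accuracy_score(y_val, tree.predict(X_val))
```

Reporting accuracy on patients the model has never seen, rather than on the training set, is what makes this step a genuine test of the model.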