One of the main features of Doc’EDS is easy and fast access to semantic expansion. This function searches for synonyms and related terms of the original simple query in order to increase the number of documents retrieved (lexical variations, acronyms, etc.). In addition, the number of documents matching each term is computed on the fly. The expansion relies on the HeTOP server, which exposes a web service to query the CoMUF lexicon. For each concept found, its synonyms, hyponyms and related terms are fetched, then automatically submitted as subqueries to Doc’EDS. When a term matches at least one document, it is proposed in the module to the end user, who can modify, add or delete terms dynamically (see Fig. 1, with an example for “osteogenesis imperfecta”).
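The expansion logic described above can be illustrated with a minimal sketch. It does not reproduce the HeTOP web service or the Doc’EDS search engine: the functions `toy_fetch_related` and `toy_count_documents`, the toy lexicon and the toy corpus are all hypothetical stand-ins, and document matching is reduced to a case-insensitive substring test.

```python
from typing import Callable, Dict, Iterable

def expand_query(
    term: str,
    fetch_related: Callable[[str], Iterable[str]],
    count_documents: Callable[[str], int],
) -> Dict[str, int]:
    """Map the original term and its related terms to their document counts.

    Terms that match no document are dropped, mirroring the rule that a term
    is only proposed to the user when it matches at least one document.
    """
    candidates = [term]
    for related in fetch_related(term):
        if related not in candidates:
            candidates.append(related)

    proposals = {}
    for candidate in candidates:
        n_docs = count_documents(candidate)  # one subquery per candidate term
        if n_docs > 0:
            proposals[candidate] = n_docs
    return proposals

# Hypothetical stand-ins for the lexicon service and the document index.
TOY_LEXICON = {
    "osteogenesis imperfecta": ["brittle bone disease", "Lobstein disease"],
}
TOY_CORPUS = [
    "Patient followed for osteogenesis imperfecta since childhood.",
    "History of brittle bone disease with multiple fractures.",
    "Osteogenesis imperfecta type III, treated with bisphosphonates.",
]

def toy_fetch_related(term: str) -> Iterable[str]:
    return TOY_LEXICON.get(term, [])

def toy_count_documents(term: str) -> int:
    needle = term.lower()
    return sum(needle in doc.lower() for doc in TOY_CORPUS)

proposals = expand_query(
    "osteogenesis imperfecta", toy_fetch_related, toy_count_documents
)
print(proposals)
```

In this toy setting, the term “Lobstein disease” is fetched from the lexicon but not proposed, because no document contains it.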
Document visualization
The end user can navigate through the documents retrieved by Doc’EDS. The document content, displayed on the right side of the screen, corresponds to the transformed text. Query words are highlighted, irrelevant content is shaded, and segments and tags are visible. This allows the reader to quickly assess relevance and to adjust the query accordingly.
Automatic analysis
The numbers of documents and patients are displayed for each query. An advanced tool using structured data provides statistics as aggregated data (tables and charts) computed over all the retrieved documents: demographic data (age pyramid, male/female ratio), lists of ICD-10/CCAM codes, dates and types of documents, and medical units. This tool has two purposes: (1) it can help the user to refine the query (e.g. exclude a specific medical unit from the query) and (2) it can provide direct quantitative information (e.g. what is the median age of patients who had an appendectomy?).
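As an illustration of this kind of aggregation, the following sketch computes a few of the statistics mentioned above on toy records. The record layout (`patient_id`, `age`, `sex`, `unit`) and the values are assumptions made for the example; they do not reflect the actual CDW schema.

```python
from collections import Counter
from statistics import median

# Hypothetical structured data attached to the documents retrieved by a query.
records = [
    {"patient_id": 1, "age": 12, "sex": "F", "unit": "Pediatric surgery"},
    {"patient_id": 2, "age": 34, "sex": "M", "unit": "Digestive surgery"},
    {"patient_id": 2, "age": 34, "sex": "M", "unit": "Emergency"},
    {"patient_id": 3, "age": 27, "sex": "F", "unit": "Digestive surgery"},
    {"patient_id": 4, "age": 51, "sex": "M", "unit": "Emergency"},
]

def summarize(documents):
    """Aggregate document-level records into query-level statistics."""
    # Demographics are computed per patient, so each patient is counted once.
    patients = {}
    for doc in documents:
        patients.setdefault(doc["patient_id"], doc)

    return {
        "n_documents": len(documents),
        "n_patients": len(patients),
        "median_age": median(p["age"] for p in patients.values()),
        "sex_counts": dict(Counter(p["sex"] for p in patients.values())),
        "documents_per_unit": dict(Counter(d["unit"] for d in documents)),
    }

summary = summarize(records)

# Refining the query, e.g. by excluding one medical unit.
refined = summarize([d for d in records if d["unit"] != "Emergency"])

print(summary)
print(refined)
```

Counting demographics per patient rather than per document avoids over-weighting patients who have several documents in the result set.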
In order to analyze text content related to a given query, Doc’EDS relies on the ECMT tool [23]. ECMT is an automatic semantic annotation program that identifies concepts from terminologies and ontologies (hosted on the HeTOP server) in unstructured text. ECMT relies on the “bag-of-words” algorithm and also on pattern matching designed for discharge summaries, procedure reports or laboratory results, which contain symbolic data (presence or absence) and numerical data.
Doc’EDS embeds ECMT as a text mining tool to analyze the corpus returned by a query. It is therefore possible to identify frequent concepts in a specific corpus (e.g. most related diseases or most prescribed drugs).
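The general idea of bag-of-words concept matching followed by frequency counting can be sketched as follows. This is not the ECMT implementation: the concept labels, the codes (`C001` to `C003`) and the matching rule (all words of a label must appear in the document, in any order) are simplifying assumptions chosen for the example.

```python
import re
from collections import Counter

# Hypothetical mini-terminology: label -> concept code.
TOY_CONCEPTS = {
    "acute appendicitis": "C001",
    "diabetes mellitus": "C002",
    "paracetamol": "C003",
}

def tokenize(text):
    """Lower-case the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def annotate(text, concepts):
    """Return the codes whose label words all occur in the text (any order)."""
    bag = tokenize(text)
    return {code for label, code in concepts.items() if tokenize(label) <= bag}

def frequent_concepts(corpus, concepts):
    """Count, for each concept, the number of documents in which it is found."""
    counts = Counter()
    for document in corpus:
        counts.update(annotate(document, concepts))
    # Sort by decreasing frequency, then by code for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

toy_corpus = [
    "Appendicitis, acute, treated by appendectomy. Paracetamol given.",
    "Known diabetes mellitus. Paracetamol as needed.",
    "Mellitus type diabetes follow-up, no acute complaint.",
]

ranking = frequent_concepts(toy_corpus, TOY_CONCEPTS)
print(ranking)
```

Because word order is ignored, “Appendicitis, acute” is matched to the label “acute appendicitis”. The sketch handles neither negation nor numerical values.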
H. Access rules and Security
Doc’EDS is only accessible within the Rouen University Hospital, by DBMI team experts and developers. Moreover, even though documents are deidentified, each document is linked to a unique patient number in the CDW. Currently, this data warehouse is composed of two distinct databases hosted on two different servers: nominative elements are stored in a small encrypted database, and the rest (clinical data) is stored in the other database. This type of architecture is compliant with GDPR application rules since it explicitly (physically) separates nominative data from deidentified data. Nevertheless, patients can be re-identified from their patient numbers through a complex decryption mechanism protected by a password known only to the DBMI experts.
I. Formal evaluation of tags and segmentation
The aim of this first formal evaluation was to compute the precision, recall and F-measure of each tag (negation, hypothesis, and family medical history), the true positive (TP) percentage of the segmentation functionality, and the tag occurrence among clinical concepts and documents, with their 95% confidence intervals. Random documents were drawn from hospitalisation reports, consultation reports and procedure reports until a sufficient number of tags was obtained, according to the estimated ratio of each document type based on structured data. Clinical concepts were manually extracted from the documents and analysed by two public health residents (TPL and PB), whose annotations served as the reference. For each relevant clinical concept collected, tags were categorized into four modalities. Deleterious cases comprise inappropriate tags with an impact on the documents found by the query (FP: false positive) and deleterious missing tags (FN: false negative). Non-deleterious cases comprise appropriate tags (TP) and appropriately missing tags (TN: true negative). To obtain the percentage of well-segmented concepts, and because segmentation involves 19 different categories based on the “same rules” of detection, the segmentation evaluation was common to all segmentation categories. A false positive for segment “A” could be a false negative for segment “B”; hence recall and precision are not computable. The segmentation evaluation was based on hospitalisation reports. Inter-expert agreement (Kappa) was computed for each tag and for segmentation.
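For readers unfamiliar with these measures, the sketch below shows how they can be computed from TP/FP/FN counts and from two annotators' labels. The text does not state which confidence interval method was used; the normal-approximation (Wald) interval here is an assumption, and the counts in the usage lines are made-up values, not results from the evaluation.

```python
from math import sqrt

def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F-measure (F1) from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

def wald_ci(successes, total, z=1.96):
    """95% normal-approximation interval for a proportion (assumed method)."""
    p = successes / total
    half_width = z * sqrt(p * (1 - p) / total)
    return max(0.0, p - half_width), min(1.0, p + half_width)

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1:
        return 1.0  # both annotators used a single identical category
    return (observed - expected) / (1 - expected)

# Made-up counts, for illustration only.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
ci_low, ci_high = wald_ci(successes=80, total=100)
kappa = cohen_kappa([1, 1, 0, 0], [1, 0, 0, 0])
print(precision, recall, f1, (ci_low, ci_high), kappa)
```

For small samples or proportions close to 0 or 1, an exact or Wilson interval is generally preferable to the Wald approximation.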
J. Doc’EDS complementarity with PMSI: use cases
In hospitals, patient retrieval is usually performed with the PMSI (Programme de médicalisation des systèmes d’information), the French healthcare claims data (DRG). Patients are identified with ICD-10 and/or CCAM codes. Queries are limited by the lack of codes (e.g., new practices or rare diseases), the use of an inappropriate code [24, 25], the absence of a code when there is no financial valuation (e.g. medical history), or even code evolution [26]. With a CDW, retrieval is potentially wider. The improvement in retrieval over DRG data was illustrated by two use cases, demonstrating the added value of Doc’EDS.