Objective: We aimed to automate the end-to-end extraction of cohorts of similar patients with systemic diseases from electronic health records.
Materials and Methods: Our multistep algorithm comprises a named-entity recognition step, a multilabel classification using the Medical Subject Headings ontology, and the computation of patient similarity. Cohorts of similar patients were then selected for a priori annotated phenotypes. Six phenotypes were chosen for their clinical significance: P1-osteoporosis, P2-nephritis in systemic lupus erythematosus, P3-interstitial lung disease in systemic sclerosis, P4-lung infection, P5-obstetric antiphospholipid syndrome, and P6-Takayasu stroke. We used a training set of 151 clinical notes and an independent validation set of 256 clinical notes, with annotated phenotypes, both extracted from the Assistance Publique-Hôpitaux de Paris data warehouse. For each phenotype, we evaluated the 3 patients closest to the index patient using the precision-at-3, and we also computed the recall and the average precision.
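As an illustration of the three evaluation metrics named above, the following minimal sketch computes them from a list of patients ranked by decreasing similarity to an index patient. The function names, the toy ranking, and the use of a rank cutoff for recall are assumptions made for this example; the abstract does not specify how the retrieved cohort was delimited, and this is not the authors' code.

```python
from typing import Sequence


def precision_at_k(ranked_labels: Sequence[bool], k: int = 3) -> float:
    """Fraction of the k top-ranked patients that carry the phenotype."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(ranked_labels[:k]) / k


def recall_at_k(ranked_labels: Sequence[bool], k: int) -> float:
    """Fraction of all phenotype-positive patients found in the top k.

    Assumption: the cohort is the top-k of the ranking; the source does
    not state how the cohort boundary was chosen.
    """
    total_positive = sum(ranked_labels)
    if total_positive == 0:
        return 0.0
    return sum(ranked_labels[:k]) / total_positive


def average_precision(ranked_labels: Sequence[bool]) -> float:
    """Mean of precision-at-rank over the ranks of phenotype-positive patients."""
    hits = 0
    precisions = []
    for rank, is_positive in enumerate(ranked_labels, start=1):
        if is_positive:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


# Hypothetical ranking: True means the neighbour shares the index patient's phenotype.
ranking = [True, True, False, True, False, False]
p_at_3 = precision_at_k(ranking, k=3)   # 2 of the 3 closest patients are positive
rec_at_3 = recall_at_k(ranking, k=3)    # 2 of the 3 positive patients are retrieved
avg_prec = average_precision(ranking)   # (1/1 + 2/2 + 3/4) / 3
```

Precision-at-3 only looks at the three nearest neighbours, whereas average precision rewards rankings that place every phenotype-positive patient early, which is why the two can differ noticeably for the same phenotype.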
Results: For P1-P4, the precision-at-3 ranged from 0.85 to 0.99, the recall from 0.53 to 0.83, and the average precision from 0.58 to 0.88. The P5 and P6 phenotypes could not be analysed because of the limited number of occurrences of these phenotypes.
Conclusion: Using a method close to clinical reasoning, we built a scalable and interpretable end-to-end algorithm to extract cohorts of similar patients.