Coded diagnoses from Swiss general practice are difficult to obtain but are necessary for training NLP models. In this study, we developed a set of 105 diagnostic codes, applied them to a moderately sized dataset of about 26,000 LoFT, and measured both the frequency and the reliability of the codes. Over a third of the codes achieved both a frequency above 100 and almost perfect IRR, and are thus suitable for training NLP models on this dataset. The most promising codes in this regard are those not easily identified from other data in the electronic medical record (such as laboratory tests or disease-specific medications), so that LoFT are the only data source; examples include musculoskeletal conditions, cancer and tobacco use.
We developed the diagnostic codes with the a priori intention of generating training data for NLP models. To this end, we limited the granularity of the coding system to around 100 items in order to avoid over-dispersion, in which rarely occurring codes lack sufficient frequency to train NLP models on moderately sized datasets. Within the set of coded LoFT, 51 codes were assigned at least 100 times by both raters and are therefore potential candidates for exploring the feasibility of NLP. Interestingly, however, more than half of the LoFT were coded as 'no diagnosis', suggesting that GPs use this space for additional information that does not amount to a specific diagnosis. This is consistent with other studies analysing the content of LoFT, which found that non-specific or insufficient information is common in medical documentation (33-36); in our study, however, it substantially reduced the yield of LoFT for obtaining coded diagnostic data. Specifically, ambiguous acronyms or abbreviations (37-39), unstructured information (39-41), and physicians' and institutional stylistic preferences all contribute to non-diagnostic information in free-text diagnoses (42). Raters in our study were notably challenged by non-diagnostic information in LoFT, reflected in an IRA of only 93%, whereas all other codes had an IRA ≥98%. We expect these difficulties to carry over to the NLP modelling process, and methods will be needed to deal not only with false positive identifications but also with ambiguity within the LoFT itself. Third-party review and arbitration can be used to further process the training data, but such human arbitration is arguably not a perfect gold standard and may introduce bias in addition to that introduced when the LoFT was created.
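The selection criterion described above (codes assigned at least 100 times by both raters) amounts to a simple frequency filter over the two raters' label sets. The sketch below illustrates this; the function name, code labels and counts are illustrative only and are not the study's actual data or implementation.

```python
from collections import Counter

def candidate_codes(rater_a, rater_b, min_count=100):
    """Return codes assigned at least `min_count` times by BOTH raters."""
    ca, cb = Counter(rater_a), Counter(rater_b)
    return sorted(c for c in ca.keys() & cb.keys()
                  if ca[c] >= min_count and cb[c] >= min_count)

# Illustrative labels only (not study data): one code per LoFT.
rater_a = ["dorsopathy"] * 150 + ["hypertension"] * 120 + ["rare_code"] * 5
rater_b = ["dorsopathy"] * 140 + ["hypertension"] * 95 + ["rare_code"] * 7

candidate_codes(rater_a, rater_b)  # hypertension falls below 100 for rater B
```

Requiring the threshold from both raters independently, rather than from a pooled count, ensures that a candidate code's frequency does not depend on which rater's labels are later used for training.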
This chain of fundamental validity issues highlights important limitations of NLP-identified diagnoses and calls into question the feasibility of fully automated coding in settings where very high accuracy is required.
Unsurprisingly, the most frequently assigned diagnostic codes were those for the most common chronic or recurrent conditions, particularly those of the musculoskeletal and cardiovascular systems (43). Several of these diagnoses were already identifiable in the FIRE database through algorithms applied to routine data, such as prescribed medications (e.g., antidiabetic drugs to identify diabetes) or results of clinical or laboratory tests (e.g., body mass index for obesity) (44). However, sufficiently specific identification criteria based on routine data are lacking for several important and common diagnoses, including musculoskeletal conditions, cancer, tobacco use, depression, sleep disorders and many others, all of which are important targets of research in general practice. These diagnoses represent the area where we expect NLP to add the most value for research using the FIRE database.
With regard to code frequency, the ranking of the codes was plausible when compared with the ranking of the corresponding disease prevalence estimates in the Swiss population. Specifically, dorsopathies, followed by essential hypertension and hyperlipidemia, are the most common chronic diseases in this setting according to external studies (45-50). Moreover, the frequencies in our study closely resemble those of a study measuring reasons for encounter in general practice, in which diseases of the musculoskeletal and cardio-circulatory systems were by far the most prevalent, further adding to the plausibility of our results (51-53).
With regard to IRR, we observed almost perfect agreement (Kappa ≥0.810) for two thirds of the codes and substantial agreement for another quarter. Taken together, more than 90% of the codes reached at least substantial agreement with raters who had completed medical school but had no further training. These findings compare favourably with similar studies using inexperienced raters (21, 54, 55) and are on a par with studies using experienced raters (56). Depending on the research question and the target diseases to be coded, Kappa values ≥0.500 are generally deemed sufficient (31, 54, 57); by this criterion, the codes we developed performed adequately. Previous studies have shown that code frequency is associated with IRR (58, 59). We replicated this finding: all of the 20 most frequent codes reached almost perfect or substantial IRR, while the 20 least frequent codes had a Kappa ≤0.600.
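The two agreement metrics discussed above, raw agreement (IRA) and Cohen's kappa, can be computed per code by treating each LoFT as one item labelled by both raters (for a single code, binary labels: assigned or not). The following is a minimal sketch of the standard kappa formula; the function and the example labels are illustrative and do not reflect the study's actual implementation or data.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items.

    For per-code kappa, labels can be binary (code assigned / not assigned).
    """
    n = len(labels_a)
    assert n == len(labels_b) and n > 0
    # Observed agreement: the proportion of items both raters labelled
    # identically (this is the raw IRA).
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each rater's marginal label frequencies.
    categories = set(labels_a) | set(labels_b)
    p_e = sum((labels_a.count(c) / n) * (labels_b.count(c) / n)
              for c in categories)
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)

# Illustrative binary labels (1 = code assigned), not study data:
kappa = cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0])
```

Because kappa corrects observed agreement for chance agreement, a rare code can show a high IRA yet a modest kappa, which is one reason the least frequent codes in our data tended toward lower kappa values.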
Strengths and limitations:
This research project describes the design and reliability testing of a custom coding framework for training NLP models. It can serve as a template for similar research, which will become increasingly important given the growing role of AI in medicine and the associated need for local training data tailored to local factors such as language and use case. The use of LoFT from diagnosis lists in general practice is a very prominent use case in this regard, and our study provides estimates of code frequencies based on a moderately sized dataset, achievable with a small investment in manual coding labour. The methods used are highly feasible and provide transparent metrics that aid interpretation of NLP modelling results, especially the IRR of the human raters labelling the training data.
The moderate size and local origin of the dataset may be a major limitation. We aimed to include LoFT from a representative sample of Swiss GPs, but the sample comprised only 27 GPs, nested in 10 different medical practices. The local jargon of these GPs may limit the applicability of NLP models trained on these data, and such models need to be tested within, but more importantly beyond, this dataset.