Coded diagnoses from Swiss general practice are difficult to obtain but are necessary for training NLP models. In this study, we developed a set of 105 diagnostic codes, applied them to a moderately sized dataset of about 26,000 LoFT, and measured both the frequency and the reliability of the codes. Over a third of the codes achieved both a frequency above 100 and almost perfect IRR, and are thus suitable for training NLP models on this dataset. The most promising codes in this regard are those not easily identified from other data in the electronic medical record (such as laboratory tests or disease-specific medications), so that LoFT are the only data source; examples include musculoskeletal conditions, cancer and tobacco use.
We developed the diagnostic codes with the a priori intention of generating training data for NLP models. To this end, we limited the granularity of the coding system to around 100 items in order to avoid over-dispersion, in which rarely occurring codes lack sufficient frequency to train NLP models on moderately sized datasets. Within the set of coded LoFT, 51 codes were assigned at least 100 times by both raters and are therefore potential candidates for exploring the feasibility of NLP. Interestingly, however, more than half of the LoFT were coded as 'no diagnosis', suggesting that GPs use this space for additional information that does not amount to a specific diagnosis. This is consistent with other studies analysing the content of LoFT, which found that non-specific or insufficient information is common in medical documentation (33-36); in our study, however, it substantially reduced the yield of LoFT for obtaining coded diagnostic data. Specifically, ambiguous acronyms or abbreviations (37-39), unstructured information (39-41), and physicians' and institutional stylistic preferences all contribute to non-diagnostic information in free-text diagnoses (42). Raters in our study were notably challenged by non-diagnostic information in LoFT, reflected in an IRA of only 93%, whereas all other codes had an IRA ≥98%. We expect these difficulties to carry over to the NLP modelling process, and methods will be needed to deal not only with false positive identifications but also with ambiguity within the LoFT itself. Third-party review and arbitration can be used to further process the training data, but such human arbitration is arguably not a perfect gold standard and may introduce bias in addition to that introduced when the LoFT was created.
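The selection criterion described above (codes assigned at least 100 times by both raters) amounts to a simple frequency filter over the two raters' label sets. The sketch below illustrates this; the function name, code labels and counts are illustrative only and are not the study's actual data or implementation.

```python
from collections import Counter

def candidate_codes(rater_a, rater_b, min_count=100):
    """Return codes assigned at least `min_count` times by BOTH raters."""
    ca, cb = Counter(rater_a), Counter(rater_b)
    return sorted(c for c in ca.keys() & cb.keys()
                  if ca[c] >= min_count and cb[c] >= min_count)

# Illustrative labels only (not study data): one code per LoFT.
rater_a = ["dorsopathy"] * 150 + ["hypertension"] * 120 + ["rare_code"] * 5
rater_b = ["dorsopathy"] * 140 + ["hypertension"] * 95 + ["rare_code"] * 7

candidate_codes(rater_a, rater_b)  # hypertension falls below 100 for rater B
```

Requiring the threshold from both raters independently, rather than from a pooled count, ensures that a candidate code's frequency does not depend on which rater's labels are later used for training.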
This chain of fundamental validity issues highlights important limitations of NLP-identified diagnoses and calls into question the feasibility of fully automated coding in settings where very high accuracy is required.
Unsurprisingly, the most frequently assigned diagnostic codes were those for the most common chronic or recurrent conditions, particularly those of the musculoskeletal and cardiovascular systems (43). Several of these diagnoses were already identifiable in the FIRE database through algorithms applied to routine data, such as prescribed medications (e.g., antidiabetic drugs to identify diabetes) or results of clinical or laboratory tests (e.g., body mass index for obesity) (44). However, sufficiently specific identification criteria based on routine data are lacking for several important and common diagnoses, including musculoskeletal conditions, cancer, tobacco use, depression, sleep disorders and many others, all of which are important targets of research in general practice. These diagnoses represent the area where we expect NLP to add the most value for research using the FIRE database.
With regard to code frequency, the ranking of the codes was plausible when compared with the ranking of the corresponding disease prevalence estimates in the Swiss population. Specifically, dorsopathies, followed by essential hypertension and hyperlipidemia, are the most common chronic diseases in this setting according to external studies (45-50). Moreover, the frequencies in our study closely resemble those of a study measuring reasons for encounter in general practice, in which diseases of the musculoskeletal and cardio-circulatory systems were by far the most prevalent, further adding to the plausibility of our results (51-53).
With regard to IRR, we observed almost perfect agreement (Kappa ≥0.810) for two thirds of the codes and substantial agreement for another quarter. Taken together, more than 90% of the codes reached at least substantial agreement with raters who had completed medical school but had no further training. These findings compare favourably with similar studies using inexperienced raters (21, 54, 55) and are on a par with studies using experienced raters (56). Depending on the research question and the target diseases to be coded, Kappa values ≥0.500 are generally deemed sufficient (31, 54, 57); by this criterion, the codes we developed performed adequately. Previous studies have shown that code frequency is associated with IRR (58, 59). We replicated this finding: all of the 20 most frequent codes reached almost perfect or substantial IRR, while the 20 least frequent codes had a Kappa ≤0.600.
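The two agreement metrics discussed above, raw agreement (IRA) and Cohen's kappa, can be computed per code by treating each LoFT as one item labelled by both raters (for a single code, binary labels: assigned or not). The following is a minimal sketch of the standard kappa formula; the function and the example labels are illustrative and do not reflect the study's actual implementation or data.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items.

    For per-code kappa, labels can be binary (code assigned / not assigned).
    """
    n = len(labels_a)
    assert n == len(labels_b) and n > 0
    # Observed agreement: the proportion of items both raters labelled
    # identically (this is the raw IRA).
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each rater's marginal label frequencies.
    categories = set(labels_a) | set(labels_b)
    p_e = sum((labels_a.count(c) / n) * (labels_b.count(c) / n)
              for c in categories)
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)

# Illustrative binary labels (1 = code assigned), not study data:
kappa = cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0])
```

Because kappa corrects observed agreement for chance agreement, a rare code can show a high IRA yet a modest kappa, which is one reason the least frequent codes in our data tended toward lower kappa values.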
Strengths and limitations:
This research project describes the design and reliability testing of a custom coding framework for training NLP models. It can serve as a template for similar research, which will become increasingly important given the growing role of AI in medicine and the associated need for local training data tailored to local factors such as language and use case. The use of LoFT from diagnosis lists in general practice is a very prominent use case in this regard, and our study provides estimates of code frequencies based on a moderately sized dataset, achievable with a small investment in manual coding labour. The methods used are highly feasible and provide transparent metrics that aid interpretation of NLP modelling results, especially the IRR of the human raters labelling the training data.
The moderate size and local origin of the dataset may be a major limitation. We aimed to include LoFT from a representative sample of Swiss GPs, but the sample comprised only 27 GPs, nested in 10 different medical practices. The local jargon of these GPs may limit the applicability of NLP models trained on these data, and such models need to be tested within, but more importantly beyond, this dataset.