We identified several broad categories of under-specification observed across a set of 34 narrative phenotype algorithms, and we have presented a taxonomy of these observations. Overall, our findings suggest that while narrative descriptions of phenotype logic are a suitable mechanism for disseminating phenotype definitions, under-specification leads to ambiguity and vagueness, and it occurs often enough to pose an impediment to efficient development and correct implementation of phenotype algorithms.
We note three important considerations. First, ambiguity and vagueness as a result of under-specification were identified in all of the phenotypes reviewed, which were developed by multiple phenotype authors at 9 distinct institutions. This indicates that the issue is not isolated and that we can expect it to be prevalent in other narrative phenotype algorithm definitions.
Second, vague and under-specified phenotype algorithms required additional effort to resolve, thus increasing the overall implementation time for the phenotype algorithm at an institution. Within eMERGE, the use of PheKB served as a central location for the collaborative network to pose questions and allowed subsequent implementing sites to review and learn from the clarifications made (if they were not directly reflected in the phenotype definition). Requests for clarification are not always made publicly, however; an implementer may, for example, e-mail an author directly, and in these instances each implementing institution may need to request the same clarification. Hence, we can also assume that the issues of under-specification are greater than what was uncovered in this study.
The third consideration is that there may exist instances of ambiguity and vagueness that were not recognized by any implementer. While this is a speculative issue in that our data would not always have uncovered these occurrences, we recognize they can exist and highlight an additional area where misinterpretation may occur. This is especially risky because such instances, particularly vagueness, can be subconsciously overlooked. An illustrative case is one of the narrative phenotype algorithms we examined, where the original validators of the algorithm missed an ambiguity: the algorithm did not specify whether all available BMI values were needed or only values at a specified time point. Other implementers of the algorithm later identified this ambiguity and sought clarification.
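To illustrate how an ambiguity of this kind can change who qualifies, the following minimal Python sketch applies two readings of a hypothetical criterion ("BMI of at least 30") to the same hypothetical patient record. The measurements, the threshold, the index date, and the function names are all assumptions made for illustration; none are drawn from the algorithms reviewed in this study.

```python
from datetime import date

# Hypothetical BMI measurements for one patient: (measurement date, BMI value).
bmi_records = [
    (date(2015, 3, 1), 28.4),
    (date(2017, 6, 15), 31.2),
    (date(2019, 9, 30), 29.1),
]

THRESHOLD = 30.0  # assumed cutoff, for illustration only


def meets_any_value(records, threshold):
    """Reading 1: any recorded BMI at or above the threshold qualifies."""
    return any(value >= threshold for _, value in records)


def meets_at_time_point(records, threshold, index_date):
    """Reading 2: only the most recent BMI on or before index_date counts."""
    prior = [(d, v) for d, v in records if d <= index_date]
    if not prior:
        return False
    _, latest_value = max(prior, key=lambda rec: rec[0])
    return latest_value >= threshold


index_date = date(2020, 1, 1)  # hypothetical reference time point
any_value_result = meets_any_value(bmi_records, THRESHOLD)
time_point_result = meets_at_time_point(bmi_records, THRESHOLD, index_date)
print(any_value_result, time_point_result)
```

Under these assumptions the same patient qualifies under the first reading and not under the second, which is the kind of divergence an unstated time point can produce across implementing sites.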
Issues of linguistic ambiguity, vagueness, and uncertainty are not specific to the realm of phenotype algorithm development. As phenotype algorithm definitions specify the process for software implementation, we note similar issues identified with requirements specification in the field of software engineering. That work not only describes ambiguity within software requirement documents, including under-specification and vagueness(12–14), but also covers considerations and tools for automated detection of these linguistic constructs(15, 16). Requirement specifications are not directly equivalent to phenotype definitions; requirements typically describe the objectives of what should be built, whereas the phenotype is more a representation of what has been built and should be replicated. However, similar approaches to detecting under-specification may be applicable and warrant further investigation.
Within the healthcare domain, the use of “hedge terms” (intentional expressions of uncertainty) within clinical notes has been reported, including a review of the literature identifying 313 hedge phrases, and an analysis revealing the 30 most prevalent hedge phrases used in a clinical note corpus(17). These are artifacts of the uncertainty of medicine and the diagnostic process, which could be simple phrases such as “possible”, “likely”, and “unlikely” or more complex group concepts such as “clinically significant infection”, which require further specification using contextual knowledge. Hedge terms typically represent a different source of vagueness that, although more frequent in documenting the clinical process, could still occur in phenotype algorithm definitions.
Similarly, the classification of ambiguity and vagueness within clinical practice guidelines (CPGs) has illustrated complementary findings that intersect the previously mentioned study on hedge terms in clinical notes, as well as the work described here on phenotype algorithm definitions(8). In this work, the authors conducted a literature search and developed a 3-axis model to classify CPG ambiguity and under-specification. Axis 1 includes linguistic definitions of ambiguity, vagueness, and under-specification, and aligns with our described model. Axis 2 considers if a vague statement is potentially deliberate, and Axis 3 looks at the affected portion of the CPG - both of which are irrelevant to phenotype algorithms.
Our findings provide insight into the issue of vague and under-specified phenotype definitions, and we believe this heightened awareness can be used to guide phenotype algorithm developers to mitigate its detrimental effect. We propose potential solutions, based on our findings, that would mitigate the risk of vague and under-specified phenotype definitions.
First, we believe that explicit enumeration of categories of under-specification in phenotype algorithms raises awareness of these potential issues amongst phenotype algorithm developers. By becoming familiar with ambiguity and vagueness caused by under-specification, developers can be more mindful when writing future narrative phenotype algorithms and more attentive to these issues. In particular, a list of categorized ambiguities to avoid can serve as a handy checklist when composing and reviewing an algorithm definition.
Second, additional resources are needed (including methods, tools, and standard terminologies) to further assist in reducing ambiguity and vagueness from under-specification. This includes approaches for identification and detection of “red flags” like hedge terms. Once developed, narrative phenotype algorithms could be cross-checked by hand and potentially supplemented by computable means before completion. This allows the developers to identify potential issues prior to validation or implementation. Several categories of under-specification are due to the lack of quantification for qualitative terms, such as not having a numeric threshold for “obese”. A similar check for qualitative descriptors and attributes of variables used by the algorithm would be beneficial for reducing ambiguity and vagueness.
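As a rough sketch of what such a computable cross-check might look like, the Python example below flags sentences that contain a hedge term, or a qualitative descriptor with no number in the same sentence. The term lists, the function name, and the "number in the same sentence" heuristic are assumptions chosen for illustration; they are not a validated lexicon or an existing tool, and the output is intended only to direct a human reviewer's attention.

```python
import re

# Hypothetical, deliberately small term lists; a real list would need curation.
HEDGE_TERMS = ["possible", "likely", "unlikely", "clinically significant"]
QUALITATIVE_TERMS = ["obese", "elevated", "recent"]

# Heuristic: a qualitative term counts as quantified if any number
# appears in the same sentence.
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def flag_sentences(text):
    """Return (sentence, term, kind) tuples for a human reviewer to inspect."""
    findings = []
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    for sentence in sentences:
        lowered = sentence.lower()
        for term in HEDGE_TERMS:
            if re.search(r"\b" + re.escape(term) + r"\b", lowered):
                findings.append((sentence, term, "hedge"))
        for term in QUALITATIVE_TERMS:
            found = re.search(r"\b" + re.escape(term) + r"\b", lowered)
            if found and not NUMBER.search(sentence):
                findings.append((sentence, term, "unquantified"))
    return findings


# Hypothetical narrative text, not taken from any reviewed phenotype.
sample = (
    "Include patients who are obese. "
    "Exclude patients with a possible infection. "
    "Cases are obese, defined as BMI >= 30 kg/m2."
)
findings = flag_sentences(sample)
for sentence, term, kind in findings:
    print(f"[{kind}] '{term}' in: {sentence}")
```

The heuristic is intentionally crude: any digit in a sentence suppresses the "unquantified" flag, so it would miss a sentence in which the number refers to something else. It is meant to show the shape of a check, not its final form.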
Third, as noted in the software engineering space(15, 16), semi-automated approaches may assist phenotype authors in detecting these issues, in addition to provided guidelines. Natural language processing (NLP) and natural language understanding (NLU) can be used to process the text and discern relationships between the entities found in it. For instance, with “BMI at age 21”, NLU can establish the relationship between BMI and age. Systems such as Criteria2Query have demonstrated great progress in this area and could be further adapted for this purpose in the future(18). While NLP/NLU is not a panacea, such tools can be designed to assist and train phenotype algorithm developers to have better awareness of under-specification. Again drawing from the software engineering domain, this could be considered a “linter”, a tool that aids a developer in identifying both errors and potential issues.
Lastly, the potential to introduce ambiguity and vagueness through under-specification is mitigated in part with the use of common data models (CDMs) and harmonized terminologies(6, 19–22). For example, the eMERGE network has more recently begun transitioning to the Observational Medical Outcomes Partnership (OMOP) CDM. CDMs can facilitate the representation of the phenotype algorithm in a computable format, which increases portability of phenotype algorithms while reducing implementation times as it obviates the need for human interpretation of a narrative.(23) It is important to still consider, as computable phenotypes are often the result of a process like BQM, that issues stemming from under-specification could unintentionally creep into the final definition. For example, within the Clinical Quality Language (CQL), it is possible to constrain a population based on age using an expression like AgeInYears() >= 40. This is a convenient shorthand to express the patient’s age in years as of today (which changes each time the definition is run) and evaluate whether that value is >= 40 years(24). That statement is not vague from the standpoint of a system that executes CQL, as there are agreed semantics in the interpretation of this expression. However, the author of the CQL expression may not have considered the implication of this expression, where a more expressive statement such as AgeInYearsAt(Today()) >= 40 explicitly describes that the author intended the age to be evaluated in the context of “today” each time the CQL expression is run. Therefore, it is important to ensure computable phenotype definitions are still reviewed.
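The practical consequence of an implicit "as of today" can be illustrated outside CQL. The following Python sketch is not CQL and does not attempt to reproduce CQL semantics exactly; it only shows, for a hypothetical birth date and hypothetical run dates, that an age criterion evaluated as of the run date can give different answers on different runs, whereas one anchored to a fixed index date cannot. The function names and all dates are assumptions for illustration.

```python
from datetime import date


def age_in_years_at(birth_date, as_of):
    """Whole years elapsed between birth_date and as_of."""
    had_birthday = (as_of.month, as_of.day) >= (birth_date.month, birth_date.day)
    return as_of.year - birth_date.year - (0 if had_birthday else 1)


def meets_age_criterion(birth_date, as_of, minimum_age=40):
    """True if the patient is at least minimum_age years old on as_of."""
    return age_in_years_at(birth_date, as_of) >= minimum_age


birth_date = date(1984, 7, 1)  # hypothetical patient

# Evaluated "as of today" on two different (hypothetical) run dates.
first_run = meets_age_criterion(birth_date, as_of=date(2024, 6, 30))
second_run = meets_age_criterion(birth_date, as_of=date(2024, 7, 1))

# Anchored to a fixed (hypothetical) index date: stable across runs.
anchored = meets_age_criterion(birth_date, as_of=date(2020, 1, 1))
print(first_run, second_run, anchored)
```

In this sketch the patient is excluded on the first run and included one day later, although nothing in the definition changed. Whether that behavior is intended is exactly the kind of question a review of a computable definition should surface.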
Limitations
This study has a few limitations. First, we limited our analysis to the comments posted within one specific research network, which may not mimic the processes used by other phenotype authors or consortia. However, the phenotypes we reviewed represented multiple institutions involved in eMERGE over several years, during which the network adjusted its process based on lessons learned. Second, other examples of ambiguity and vagueness may exist in the full phenotype definition, which we did not review, or may have been expressed via alternate communications to the phenotype author. Third, although these phenotypes were developed at different institutions, they were developed as part of a collaborative network where authors were exposed to previous phenotype algorithms. We cannot rule out the possibility that this may have generally informed how future phenotypes were written. Given these factors, we recognize our codebook is likely not comprehensive in that it may not cover every possible case of ambiguity and vagueness. However, we believe the analyzed set is a reasonably representative sample, given the number of phenotype authors and diseases covered across all of the phenotypes in this study.