The language of regulatory experts is often dense with formal terms such as purpose, requirements, and specification. The difficulty we observed through our workshops is that the specifics of this language differ between application domains. Through a comparative approach, we present a light-weight standardization which can be used to align the different domains, thus enabling participants to develop products together. We focus in particular on the emerging technology popularly called artificial intelligence (AI), which consists largely of machine learning (ML).
In every validation process there is an object which is being validated, e.g. a piece of software. Since this article concerns the validation of AI contained in tools or products, and AI today is implemented primarily in software, unless otherwise specified, we will focus on software-implemented products as the object throughout. Other examples of objects which may be validated include the entire medical device and an individual machine learning model. These examples will be discussed in the relevant sections below.
Then there is a goal, or target, of the validation. This relates to construction and functionality of the object. The most common example of a validation target is: does the object achieve the product goals in terms of its intended use?
Finally, there is a set of methods or practices used to validate that the object conforms to the goals of its validation. Methods are specific to the goal, or target, and depend strongly upon the phase in the software development life-cycle during which they take place.
We break our results down into four principal sections. We begin by bringing together existing definitions of validation, based upon the various applicable regulatory standards and perspectives. From these we highlight a useful distinction, which is already partially evident in the regulatory literature, through which we resolve a key regulatory difference between USFDA and EU regulations. We then introduce an overview of the key validation methodologies applied with particular attention to methodologies used in developing software with applications in human-health specific industries. Finally, we examine validation through the perspectives of the two major legally-defined domains of application: medical/in-vitro devices (MD/IVD) and pharmaceutical industry practices. For the pharmaceutical approach, we further distinguish between drug development (R&D) and manufacturing processes.
Table 1 summarizes the main results from this section, in a translation-friendly format, examining the definitions of, approaches to, and requirements for validation across five domains of interest.
Table 1
Five application domains and their perspectives on validation.
| ML components / models | Software systems / computerized systems | Medical Device and IVD | Software used in pharmaceutical manufacturing automation | Software used in pharmaceutical drug development (R&D) |
Definition of validation. | Use of a sample of data to provide an unbiased evaluation of a model fit, on the training dataset, when tuning model hyperparameters. | Confirmation, through the provision of objective evidence, that the requirements for a specific intended use or application have been fulfilled (ISO 9000:2015) [1]. | Confirmation, through the provision of objective evidence, that the specified users in the specified context of use achieve the specified goals (i.e. intended purpose). | The software systems definition (second column) applies equally here. | The software systems definition applies particularly in the regulatory phase of development. |
What is being validated? i.e. the object or scope of the test | An ML model, trained on a specific set of data (training set) utilizing a specific set of hyperparameters. | Two competing scopes: 1. Entire software system (as black box) (narrow validation). 2. Entire software system as well as its units and their interactions (broad validation). See section, “Narrow vs broad validation.” | Entire medical device / IVD. See section, “The MD/IVD view on validation.” | Computerized system in the target hard- and software environment. See section, “Manufacturing automation.” | Entire software system used to drive a clinical development process including referenced databases. See section, “Drug development (R&D).” |
What is the goal of the validation? i.e. the target of the test | Achievement of most appropriate model hyperparameter selection, and thereby generalizable performance, of model. | Confirmation that software fulfills intended purpose and meets all quality criteria. e.g. according to ISO 25010 [25]: - functionality - performance - compatibility - usability - reliability - security - maintainability - portability | Confirmation of safety, performance and clinical benefit of the product (narrow validation). For IVDs this validation includes clinical performance, scientific validity, and technical performance. | Similar to software systems validation (second column). Risk based confirmation of conformity with GMP requirements and that software fulfills intended purpose. | Similar to software systems validation (second column). Risk based confirmation of conformity with GMP requirements and that software fulfills intended purpose. |
Validation methods. i.e. the methods used to validate | Exposing the model to a set-aside validation set containing data which has not been used for training the model. Evaluating the performance of the model as expressed by metrics such as sensitivity, specificity, accuracy etc. | From the product life cycle diagram we see examples such as: - software unit tests - integration tests - system tests. Depending on the test objective these tests include stress and load tests, penetration tests, black box tests etc. | Clinical investigations and evaluations. Usability testing. | PQ, IQ, OQ. The precise split of responsibilities is defined in GAMP5 - GxP. See Software systems column for full list of software validation methods applied. | Lab assays, as a form of narrow validation of software performance. Software aspects may be reprogrammed from scratch for output comparison (cleanroom implementation). See Software systems column for full list of software validation methods applied. |
Selected legal, best-practices and guidance documents with respect to validation. | Upcoming EU AI regulation [27]. ISO 24028 [28]. ISO/IEC CD TR 29119-11 [29]. Human-health context: - FDA GMLP guiding principles [30]. - AI guideline by Johner Institute [31] - OnRAMP technical best-practices [10]. | See the context specific columns. Software used in quality systems: - FDA 21 CFR part 11 (electronic records) [24] - FDA 21 CFR part 820.70 (QS production and process controls) [32] - FDA guidance “software validation” [4]. - ISO 13485, e.g. Chapter 4.1.6, Chap. 7.5.6 [17]. - AAMI TIR 36 [33]. - ISO/TR 80002-2 [34]. | FDA 21 CFR part 820.30 (MD design controls and product validation) [32]. EU MDR/IVDR Annex I and II [16, 35]. ISO 13485, Chap. 7.3.7 [17] IEC 62366-1 [15]. MEDDEV 2.7/1 rev. 4 [36]. | FDA 21 CFR part 11 (electronic records) [24]. FDA 21 CFR part 210 and part 211 (GMP) [37, 38] EU Directive 2001/83/EC [39]. GAMP5 - GxP Computerized Systems [18]. FDA guidance “software validation” [4]. | FDA 21 CFR part 11 (electronic records) [24]. EU Directive 2001/83/EC [39]. GAMP5 - GxP Computerized Systems [18]. FDA guidance “software validation” [4]. |
Typical development phase. | SW component testing, or unit testing (i.e. the software unit containing the model). | See section, “Narrow vs broad validation.” | For narrow validation the focus is at the end of development, before market release. | See section, “Narrow vs broad validation.” | Throughout, with particular focus on the ‘regulatory phase’ where processes may lead to impacts on human health. |
Definition and goals of validation
We begin by looking at the common underpinning definition of validation, and the related term verification, from ISO 9000 [1]. From these definitions we develop two new terms, namely narrow and broad validation, and relate these to conflicting perspectives on validation between USFDA and EU regulations. Resolving this regulatory difference, through the introduction of these terms, is a key first result.
Defining Validation
Many regulatory practices reference and are derived from ISO 9000. According to ISO 9000 the twin terms, verification and validation, are defined as follows:
Verification
confirmation, through the provision of objective evidence, that specified requirements have been fulfilled.
Validation
confirmation, through the provision of objective evidence, that the requirements for a specific intended use or application have been fulfilled.
Unfortunately, these definitions are frequently not followed. A recent IMDRF working group has highlighted this same issue [2]. Worse, the EU medical device directive (MDD) [3] even uses the term validation with two different meanings within a single sentence:
For devices which incorporate software or which are medical software in themselves, the software must be validated according to the state of the art taking into account the principles of development lifecycle, risk management, validation and verification.
Instead of referring to one set of definitions, the exact meaning of the term validation rather depends on the context and the point in the product life-cycle at which it is used.
Narrow vs broad validation
As the EU medical device directive’s requirement cited above makes clear, the term validation is used with two different meanings, which we describe as follows:
1. Narrow validation, or validation in the narrower sense: Here the target of the medical software validation is limited to evaluating whether the right product was built, i.e. whether the intended purpose / use is met. In this case, the intended purpose / use depends on and frequently also specifies the users, the context of use, and where relevant the patient population and the medical purpose. This validation is typically performed following a black box methodology and is carried out towards the end of the product development life cycle. This is validation according to the ISO 9000 definition.
2. Broad validation, or validation in the broader sense: Broad validation includes narrow validation but encompasses a much wider set of activities which ultimately end in narrow validation. In this case, the target of software validation is understood as a synonym for software quality assurance over the entire software development process, i.e. that the software was developed and built applying best-practices and methods. These validation methods also make use of knowledge of the inner workings of the software system. This means that the software is designed and evaluated as a white box, and that validation is carried out throughout the development process. In other words, broad validation does not only analyze (test) the object (e.g. a developed software) in order to find errors or to make sure that the intended use is actually met. It also contributes to the prevention of these errors right from the beginning.
The FDA’s guidance document on software validation covers the entire software development process [4]. It describes activities such as requirements analysis, software architecture and software testing, but does not describe in detail how to evaluate that the intended use is met nor how to perform usability testing. In other words, whereas the EU’s MDD combines both broad and narrow validation, the FDA guidance document uses the term software validation exclusively in the broader sense.
Validation frequently includes verification
Importantly, software validation in the broader sense includes the software verification activities. For example, software unit tests evaluate whether a software unit performs as specified; reviews of documentation serve a similar purpose. By definition, such a confirmation that specified requirements are met is a verification activity. Integration and software system tests that test against given requirements are also verification activities, even though they clearly form part of software validation in the broader sense. It is important, then, to understand that software verification and software validation are only disjoint, as terms, when software validation is understood in the narrower sense.
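The distinction can be made concrete with a small sketch. The unit `body_mass_index` and the requirements REQ-1 and REQ-2 below are hypothetical and were invented purely for illustration; they do not come from any standard cited here. The check confirms only that the unit meets its specified requirements, which is a verification activity and, at most, a contribution to broad validation.

```python
import math


def body_mass_index(weight_kg: float, height_m: float) -> float:
    """Hypothetical software unit with an assumed specification.

    REQ-1: return weight_kg / height_m ** 2
    REQ-2: raise ValueError if weight or height is not positive
    """
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / height_m ** 2


def verify_unit() -> bool:
    """Unit-level verification: compare behaviour against REQ-1 and REQ-2."""
    # REQ-1: the computed value matches the specified formula.
    req1 = math.isclose(body_mass_index(70.0, 1.75), 70.0 / 1.75 ** 2)

    # REQ-2: invalid input is rejected as specified.
    try:
        body_mass_index(70.0, 0.0)
        req2 = False
    except ValueError:
        req2 = True

    return req1 and req2


verification_passed = verify_unit()
print("unit verified against specification:", verification_passed)
```

A passing result says nothing about whether a product built on this unit achieves its intended use for the specified users and context of use. That question belongs to narrow validation and cannot be answered by a test of this kind.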
Figure 1 illustrates some examples of activities undertaken as part of broad and narrow validation. Notably, both verification and narrow validation describe subsets of broad validation activities.
Principal validation methodologies and practices
Software validation methodologies and practices describe activities which take place at various stages throughout the product development and subsequent deployment. This section provides a common methodological basis for understanding the validation of ML containing applications across multiple domains.
The product life-cycle describes different phases in the development of a product. IEC 62304 (Medical device software - Software life-cycle processes) [5] describes a framework for medical device software processes. The focus is on the stages beginning from planning and ending with a software version release and post-market activities. IEC 82304 (Health software - Part 1: General requirements for product safety) [6] complements this by further covering life-cycle phases such as installation and end-of-life disposal, though only for health software products. IEC 82304 incorporates validation in the narrow sense, but without giving further guidance as to methodology. The V-Model represents a software development life-cycle approach to documentation (see Fig. 2) [7]. In a modern, agile development process, the entire sequence is passed through multiple times before reaching production.
A number of methodologies are of particular relevance for the validation of products destined for use in industries dedicated to human health. We highlight them in the following subsections. Their application roughly encompasses a spectrum, shifting focus from broad validation towards narrow validation, i.e. from building things correctly to having built the correct product. We begin by introducing machine learning (ML) best practices.
ML has a long history of development outside of regulated fields and, as such, has a completely different meaning for the word validation from common usage in the fields of human health. Software quality assurance (QA), which encompasses aspects of ML development, begins firmly in broad validation but also tapers to narrow validation later in the product life-cycle. Finally, user testing and clinical evaluations are necessary to observe how users will interact with a product and whether the product is both safe and effective in the real-life use context.
Confusion in medical AI validation
One of the most important results from our workshops is to highlight that the use of the term validation in machine learning is almost completely disjoint from how the term is used in regulatory affairs. Even within the ML community there was a historic discord, since largely resolved, as to the correct use of the term validation [8]. Therefore, it is essential when communicating with machine learning teams to specify which type of validation is being referenced.
Validation of machine learning (ML) models is technically understood to be an approach to algorithm tuning [9]. Validation offers an approach for choosing algorithm learning parameters, so-called hyperparameters, such that the subsequently trained model is less likely to overfit the training data. The goal of validation, for an ML practitioner, is an unbiased evaluation of model performance in order to demonstrate appropriate learning algorithm parameter choice [10]. ML validation, then, is categorically not a narrow validation. Rather, it is an ML-specific aspect of broad validation. From a compliance perspective, it is important to note that having carried out ML-specific validation is no guarantee of having delivered upon intended use. The correct interpretation of validation in ML is purely that specific best-practices have been followed.
Figure 3 illustrates the ML view on testing vs validation. The loop in the Train and Validate cycle can be performed as many times as desired without invalidating subsequent performance evaluation on the (set-aside) Test set.
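The train, validate and test cycle can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a recommended pipeline: the synthetic one-dimensional data, the k-nearest-neighbour classifier, the candidate values of k and the 60/20/20 split are all choices made for this example only.

```python
import random


def make_data(n, rng):
    """Synthetic 1-D binary classification data (illustrative assumption)."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        x = rng.gauss(1.0 if label else -1.0, 1.0)
        data.append((x, label))
    return data


def knn_predict(train, x, k):
    """Majority vote among the k nearest training points (k is odd)."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0


def accuracy(train, evaluation, k):
    """Fraction of evaluation points classified correctly."""
    correct = sum(knn_predict(train, x, k) == y for x, y in evaluation)
    return correct / len(evaluation)


rng = random.Random(0)
data = make_data(300, rng)

# Three disjoint subsets: training, validation and (set-aside) test.
train_set, validation_set, test_set = data[:180], data[180:240], data[240:]

# Train-and-validate loop: may be repeated as often as desired, because
# only the training and validation sets are consulted.
candidate_ks = [1, 3, 5, 9, 15]
validation_scores = {
    k: accuracy(train_set, validation_set, k) for k in candidate_ks
}
best_k = max(validation_scores, key=validation_scores.get)

# Testing: the set-aside test set is used once, after the hyperparameter
# has been fixed, to estimate generalizable performance.
test_accuracy = accuracy(train_set, test_set, best_k)
print("selected k:", best_k, "test accuracy:", test_accuracy)
```

Nothing in this procedure examines intended use, users or context of use, which is why ML validation in this sense remains an aspect of broad validation.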
Recent developments in the medical academic literature linguistically confuse matters. In embracing ML, the medical literature has adopted the definition of the term validation as defined for statistics [11], and applied it to ML models [12, 13], thus supplanting the ML-appropriate term testing. Whereas ML engineers validate on validation sets and test on test sets, the medical ML literature largely leaves its approaches to learning parameter choice undocumented and refers to testing on the ML test set as validation. The use of a separately acquired test set is increasingly referred to as external validation in the medical literature. Clearly then, the medical ML literature largely eschews broad validation and uses the term validation of an ML model exclusively in the sense of a narrow validation.
Software quality assurance and validation
Software quality assurance is the set of practices through which quality is ensured in the software industry [14]. Quality assurance (QA) consists of constructive and analytical quality assurance. Constructive QA has the goal to prevent errors. Analytical QA has the goal to detect errors in products, components, documents and processes.
Constructive QA frequently takes the form of processes, methods and procedures, and tools which prevent the introduction of faults, or defects, during the software development process.
Analytical QA is often assessed via tests, inspections and reviews, and audits. Both of these forms of QA tend to incorporate verification as a form of validation, i.e. broad validation. Importantly, the target of the analytical QA extends to evaluating whether the constructive QA was appropriately planned, and whether the goals of the product are achieved (i.e. narrow validation).
With regard to machine learning, software QA is considered best-practice by ML engineers, but is seen as a completely separate domain from validation as applied to model construction and potential performance. In practice this means that professional ML engineers engage in all relevant forms of broad validation by default, i.e. both software QA and ML model construction best-practices, but are overly focused on ML model construction and tend not to be involved in the narrow validation of entire product goals such as clinical benefit and safety.
Usability testing and clinical evaluation
Usability testing and clinical evaluations are two approaches which are particularly relevant to products subject to human health regulations. They both form steps on the road to evaluating real-world performance of a product.
Software is tested not just in terms of code correctness but also in the context of product usability. IEC 62366-1 (Medical Devices - Part 1: Application of usability engineering to medical devices) [15] defines a set of requirements a usability engineering process must fulfill in order to achieve product usability. In particular, it focuses on potential safety issues and hazard-related use scenarios. In order to achieve a quality user interface (UI), the UI is first specified and then undergoes both formative and summative evaluations. Whereas summative evaluation is post-hoc, according to predefined goals at the end of product development, formative evaluation is carried out throughout product development and has the target to resolve usability issues before a product is released to the market. Usability testing is the method for summative evaluation and, as such, is a form of narrow validation. Other methods, such as inspections and cognitive walkthroughs are forms of broad validation.
The EU Medical Device Regulation (MDR) [16] defines clinical evaluation as a
systematic and planned process to continuously generate, collect, analyse and assess the clinical data pertaining to a device in order to verify the safety and performance, including clinical benefits, of the device when used as intended by the manufacturer.
What this means is that evidence must be acquired for the purposes of evaluating the product, with a particular focus on a state of the art clinical effectiveness and safety profile. In this context, proof of performance and benefits are the proof that intended use is achieved. Thereby clinical evaluation is validation, as stated explicitly in ISO 13485 [17]. The evidence must be collected as part of an ongoing process of clinical investigations. Clinical evaluations of products do not end when a product receives certification, rather they then continue as post-market clinical follow-up, which is part of post-market surveillance. This can lead to the withdrawal of a product if the safety profile is found to have altered or is no longer state of the art. Clearly, although individual aspects of clinical investigations can contribute to broad validation, their overall focus is on narrow validation.
The MD/IVD view on validation
Medical devices (MD) and in-vitro diagnostics (IVD) are sometimes separately regulated in law. But in practice, they follow highly overlapping development processes and are frequently evaluated in a similar manner. Particularly in the area of software as a MD/IVD (SaMD / SaIVD) the requirements related to validation are mostly identical.
Validation of the product
The object of a MD/IVD product validation is the product as a whole i.e. the entire MD/IVD. The goal of the validation is to demonstrate clinical benefit, and achievement of intended use, alongside appropriate safety and performance. Validation, then, is built upon clinical evaluations and investigations and a summative usability evaluation. The concept of usability evaluations has already been discussed in the methodologies section above.
For software as a medical device (SaMD), the clinical evaluation is part of validation in the narrower sense. ISO 13485 [17] makes clear that clinical evaluations are part of validation:
As part of design and development validation, the organization shall perform clinical evaluations [...]
Clinical validation of IVDs uses slightly different language, namely: Performance evaluation and clinical evidence. It includes scientific validity, analytical performance and clinical performance. However, the overall approach remains comparable to that of a MD.
Furthermore, since IEC 62304 [5] is both a USFDA recognised consensus standard and an EU harmonized standard, it is clear that software best-practices must be followed throughout software-based MD/IVD development. It is worth noting that IEC 62304 specifically claims that it does not cover validation of the MD, even when it consists only of software. This can be interpreted as not covering narrow validation - similarly to the USFDA’s software guidelines - since clearly the descriptions in IEC 62304 describe software validation in the broader sense.
Validation activities inside the organization
In addition to the validation of the products - the MD/IVD - other elements of the manufacturers’ organization must also be validated, such as:
1. Computerized systems validation: Software and computerized systems used in the context of the quality system must be validated, e.g. software to collect and preprocess training data, software development tools, etc. (see ISO 13485:2021 Chap. 4.1.6).
2. Validation of infrastructure and measuring equipment: Any elements of the infrastructure, measuring equipment and tools that might have an impact on the safety and performance of medical devices have to be validated (see ISO 13485:2021 Chaps. 6.3 and 7.6).
3. Process validation: Processes are also within the scope of validation, such as processes to test devices, to build and distribute software, and to perform post-market surveillance (see ISO 13485:2021 Chap. 7.5).
Indeed, the paragraphs from ISO 13485 cited here are almost word-for-word identical to one another. Moreover, these three areas of validation are not necessarily disjunct. For example, the process of data collection, pre-processing and training has to be validated. This includes the validation of software and software tools. And some of these tools, e.g. the tool to perform static code analysis, are in turn measuring equipment.
Device validation versus validation of machine learning models
Within the domain of MD/IVD development, the validation of a device and the validation of a model have different goals and should not be confused. Firstly, even with a well-performing model, the usability of the device may still be so poor that the device is not fit for purpose. Secondly, even a successfully validated model, i.e. a model that exactly meets the specified quality metrics such as sensitivity, may still not deliver towards the intended purpose of the device.
An illustrative example, based upon model performance issues, might be a model which should detect a disease from a blood test. If this model is part of an IVD with an intended use for mass screening, then the specificity is a crucial parameter, as otherwise many patients would suffer from false positive results (fear, unnecessary additional diagnostic testing or even treatments). If this test, however, is used to identify highly infectious patients, then false negative results can hardly be accepted. That is, the sensitivity must rather be maximized, while accepting that this will result in a proportionately higher number of false positive results.
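The trade-off described above can be illustrated numerically. The scores, labels and the two decision thresholds below are invented for this sketch and do not represent any real assay or device; the only point is that the same model yields different sensitivity and specificity depending on the operating point chosen for the intended use.

```python
def confusion_counts(scores, labels, threshold):
    """Count true/false positives and negatives at a decision threshold."""
    tp = fp = tn = fn = 0
    for score, diseased in zip(scores, labels):
        predicted_positive = score >= threshold
        if predicted_positive and diseased:
            tp += 1
        elif predicted_positive and not diseased:
            fp += 1
        elif not predicted_positive and diseased:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def sensitivity_specificity(scores, labels, threshold):
    """Return (sensitivity, specificity) at the given threshold."""
    tp, fp, tn, fn = confusion_counts(scores, labels, threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical model outputs for ten blood samples (1 = diseased).
scores = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]
labels = [0, 0, 0, 0, 1, 0, 1, 0, 1, 1]

# Operating point favouring specificity, e.g. for mass screening.
screening = sensitivity_specificity(scores, labels, threshold=0.8)

# Operating point favouring sensitivity, e.g. for infectious patients.
infectious = sensitivity_specificity(scores, labels, threshold=0.4)

print("screening  (sens, spec):", screening)   # (0.5, 1.0)
print("infectious (sens, spec):", infectious)  # (1.0, 0.666...)
```

With these invented numbers, the higher threshold removes all false positives at the cost of missing half of the diseased samples, while the lower threshold detects every diseased sample but flags two of the six healthy ones. Which operating point is acceptable follows from the intended use of the device, not from the model metrics alone.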
Manufacturers must derive the model requirements from the intended purpose / use. That is, from what the device shall contribute to diagnosing, treating, preventing, predicting, monitoring or alleviating, diseases and injuries. Furthermore, the intended user profile (e.g. what the user can be expected to understand) and intended context of use (e.g. physical parameters such as brightness and noise, and social parameters such as stress and shift operation) play important roles in product validation here. These, again, come back to the issues of usability testing and clinical evaluation.
Manufacturers additionally must take into account the “state of the art” and must maximize the benefit-risk-ratio (e.g. MDR Annex I (1) [16]). This means:
1. They have to at least meet the benefit-risk-ratio of alternatives such as competitive products which do not use ML.
2. They have to provide evidence that the chosen model (architecture, hyperparameters) outperforms, or is at least equivalent to, other alternative ML models / architectures.
Finally, it is worth noting that ML best-practices [10] must also be applied throughout the development of a ML model for application in a MD/IVD. This includes validation in the narrower sense towards the intended use of the device using the ML model, and also validation in the broader sense as to whether engineering best-practices were applied. Naturally, this extends to include the verification definition of validation which evaluates whether the ML model meets specific requirements.
The pharmaceutical industry view on validation
The pharmaceutical industry has two very different aspects, namely R&D and production, both of which require validation. As an industry grounded in research and development (R&D), there is a clear necessity to validate potential drug targets, investigative molecules, and ultimately the drugs themselves before bringing products to market. In addition, the manufacture of medications must follow strict quality standards, which include validation. Software has been used for decades across these processes, with varying requirements for validation.
In the pharmaceutical industry software is primarily a tool and not the product itself. For example, in R&D it may be used to analyze data, whereas in manufacturing it may control production processes.
Historically, software applications were purchased to fill specific purposes. In this context, the goal of software validation was equivalent to validation in the narrower sense, i.e. black box testing.
As the use of software in pharma has expanded, and the impact of software on drugs has increased, pure black box verification (and analytical QA in general) has become insufficient to demonstrate that the software fulfills its intended use. The need to customize applications, or indeed to develop new applications for internal use, has also grown. In these cases, all of the activities related to software quality which a software development company would perform have to be performed by the pharmaceutical company developing the software instead. In this context, the goal of software validation in pharma today frequently covers the entire software development life-cycle and includes validation in the broader sense, which is the subject of GAMP5 - GxP [18].
Drug development (R&D)
The drug development process as a whole is governed by local legal implementations of the International Council for Harmonisation (ICH) guidelines [19]. The core values of these guidelines are: quality (Q), safety (S) and efficacy (E). Quality refers largely, though not entirely, to the manufacturing process. Safety and efficacy are often jointly evaluated.
The legal requirements for software validation in R&D are largely enforced via an emphasis on data integrity for the drug licensing process. Good data and record management practices (GDRP), combined with a requirement that data used to license a drug for market follow the ALCOA principles (Attributable, Legible, Contemporaneous, Original, and Accurate), oblige the software used to be developed following engineering best practices [20, 21]. This data-oriented legislative approach effectively leads to a natural split in terms of software requirements.
Software used early in the research phase of R&D is frequently ‘pre-regulatory’ and thus has a lower burden of validation associated [22]. Such software is most often used in scientific exploration (i.e. research) and the consequent risk that its use will lead to any human harm is much lower than that of software used in the later stages of drug development and human testing.
Once the software is being used to develop actual drug candidates, however, the ‘regulatory’ phase begins, and the legal burden to demonstrate the goals of ALCOA and GDRP quickly leads to much higher requirements in terms of software validation. Software whose outputs may be included in the eventual regulatory dossier, and indeed may steer clinical trials (e.g. PK/PD dose safety estimates), should be accompanied by a full quality system (QS) approach and risk evaluation [18, 23, 24].
From the perspective of scientists involved in the R&D process, particularly statisticians and data scientists, software validation requirements are somewhat opaque. This is despite the fact that many of them are required to develop their own software. A good rule, in this case, is that software engineering best-practice methodologies should always be followed, i.e. broad validation. The complete requirements in terms of validation, also in the narrow sense, depend on an evaluation of the potential for human harm, and for large cost overruns due to failed development paths. Artificial intelligence, as a form of software, follows the same processes and requirements as any other software used in R&D.
Manufacturing automation
Pharmaceutical manufacturing automation largely follows GAMP5 - GxP [18] standards. Here the entire software system in the intended hard- and software environments is the primary object of validation. ISO/IEC 25010 [25] proposes eight quality attributes which must be validated in order to ensure software product quality: functionality (e.g. suitability, completeness), performance, compatibility, usability, reliability, security, maintainability and portability.
Particularly important protocols to be followed in manufacturing automation are installation, operational and performance qualification (IQ, OQ and PQ respectively). The validation methods include: installation tests, load tests, compatibility with neighboring systems, and blackbox evaluation methods.
The GAMP5 - GxP model delineates how quality assurance responsibilities have to be split between the organization developing the software and the organization that installs, configures and uses the software. Typically, the software development company will be responsible for verification methods, whereas the pharmaceutical manufacturer will be responsible for validation that the product functions as intended in the context of the assembly line (i.e. narrow validation). Additionally, the pharma company typically installs, configures and parametrizes the software, and sometimes even adds scripts or calculations. All of this has to follow protocols, i.e. constructive QA, and is broad validation.