The Blind Review: A quality assessment measure for review standardization of Institutional Review Boards

Background: Standardization of IRB reviews has become increasingly important with the rise in multinational trials. Although some inconsistency is inevitable because opinions on ethics vary, understanding and standardizing the differences between review results is required to maintain high IRB review quality. Therefore, we aimed to develop a quality assessment measure for IRBs, named “blind review,” in which the same research protocols are reviewed by multiple IRB panels. We then analyzed the differences between the panels to understand the mechanism of IRB standardization. Methods: Based on the Human Research Protection Program (HRPP) Standard Operating Procedures (SOPs), eight blind review results from January 2010 to December 2018 at a single institution with multiple panels were analyzed, using the Severance Hospital HRPP database as the source. The review scores ranged from 0 to 60 points, including good clinical practice (GCP) requirements and protocol issues. Panel agreement was estimated by observed multiple rater agreement. Differences between review scores according to member expertise and IRB member duration were analyzed using the Wilcoxon rank-sum test and Kruskal-Wallis test. Results: The observed multiple raters’ agreement increased from 0.444 (95% CI: 0.167-1.000) in 2010 to 0.479 (95% CI: 0.271-0.708) in 2014-2018, as IRB reviewer experience increased. To analyze the review mechanism, three GCP requirements and three protocol issues were scored (range 0 to 60). The mean values for GCP requirements and protocol issues were 19.25±8.21 and 18.40±9.04, respectively. The mean score of the panels in which experts participated (n=16, 28.13±10.47) was higher than that of the control group (n=32, 25.16±10.96) (p=0.93).
According to IRB members’ experience, the score for the group whose career spanned less than 3 years was 25.0±10.0 (n=14), that for the group whose career spanned 3-5 years was 26.3±9.6 (n=23), and that for the group whose career spanned more than 5 years was 27.3±14.2 (n=11). These results were statistically significant (p


Background
The Institutional Review Board (IRB), also known as the Independent Ethics Committee or Ethical Review Board, plays an important role in protecting the rights and welfare of human research subjects [1]. IRBs review the consent process, recruitment procedure, compensation arrangement, and the scientific validity of submitted documents, as well as conduct risk and benefit analysis [2]. In recent decades, medical knowledge and research techniques have advanced remarkably. In response, laws and regulations related to clinical research ethics have become ever more complex [3,4]. As multinational clinical trials increase in number, both domestic laws and international regulations are becoming more important, and whenever there is any discrepancy, the more rigorous of them should be followed in each IRB review [5]. Such changes to the research environment create problems for both IRB members and investigators. These include unintentional non-compliance through incomprehension of the up-to-date regulations, which requires frequent good clinical practice training to resolve [6]. There have been discrepancies between IRB decisions regarding multicenter studies, even when following the same protocol [7]. These divergences may be due to different interpretations of Good Clinical Practice (GCP), different local regulatory requirements, or the individual characteristics of IRB members.
In 1982, Veatch demonstrated the inconsistency in IRB review results for the first time [8]. Goldman and Katz [9] also analyzed IRB review results using the same protocols. The inconsistency in IRB decisions is a long-standing concern for both IRB members and investigators, presenting an additional burden for review. Several investigations have analyzed the inconsistencies in IRB panels' results [10-14]. Further, consistent IRB results are associated with high-quality IRB discussions and determinations, based on appropriate attention to important issues and fulfilling the expectations of researchers [15,16]. However, because of the characteristics of the IRB and its reviews, including its diverse members and complex study designs, inconsistencies are inevitable. Lynch et al. [17] recently demonstrated that a broader concept of IRB and HRPP quality includes (1) effectiveness, (2) procedures and structures likely to promote effectiveness, and (3) features unrelated to effectiveness but nonetheless essential. Therefore, we must analyze why such inconsistencies occur [13], based on their details. However, research on review mechanisms or their associated factors is lacking. Most studies have focused only on inconsistencies using one or two protocols. Therefore, we developed "blind review" as a quality measure that compares the review results of different IRB panels following the same protocol to determine the degree of standardization as an empirical evaluation. We evaluated the effectiveness of this method, as well as factors affecting reviews, especially those relating to IRB members' experience. This study piloted the blind review and provides information on its effectiveness based on data from a single institution.

Blind review
We developed the blind review methodology as a part of quality assurance (QA) activities within human research protection programs (HRPPs) to analyze consistency among IRB panel decisions. In this process, one protocol selected for blind review was submitted to every IRB panel except the one that originally conducted the review. Three HRPP staff members with more than two years' experience in the Human Research Protection Center (HPC) selected the protocol. Although there were no specific criteria for selection, we considered current research trends and specific issues whose review guidelines are still a work in progress, such as artificial intelligence software studies, verbal consent, and germline mutation studies. Thus, we included various research types to compare the decisions made by different IRB panels, and did not exclude poorly designed protocols. Identical protocols were submitted to each panel simultaneously via a convened meeting, and IRB members were not notified about the blind review. After the blind review, HPC QA staff assessed the results and shared them in the IRB member workshop. However, the investigators were not notified about the blind review results and thus had no obligation to respond to any comments from the blind review. The IRB's decision regarding the protocol was made by the original review panel. A detailed description of the blind review process is shown in Figure 1.

Materials
As a QA measure, the HPC at Severance Hospital, Yonsei University College of Medicine, Seoul, Korea, has conducted a blind review annually, according to the HRPP standard operating procedures (SOPs). We retrospectively analyzed the blind review records of eight protocols conducted between January 2010 and December 2018. The first three results, from 2010, were pilot tests across four panels; the other five, one for each year from 2014 to 2018, were across seven panels and aligned with the SOP. No blind review was conducted between 2011 and 2013 because our institution was moving from a paper-based review process to an e-IRB system. We collected IRB member designation logs from the HPC database to assess IRB member factors, such as years of IRB experience and major specialty. We did not enroll any human research subjects or use their personal data. Therefore, IRB approval was not required for this study.
Review criteria specified by the Severance Hospital IRB
Severance Hospital follows domestic regulations for research involving human participants, such as the Pharmaceutical Affairs Act, Bioethics and Safety Act, Korean GCP guidelines, and the Medical Device Act. However, if an issue is not specified in domestic laws, or if the following regulations are stricter than those specified in domestic laws, internal regulations are established by referring to the International Council for Harmonization (ICH) GCP guidelines, US Department of Health and Human Services regulations, federal and local laws, Food and Drug Administration regulations (21 CFR 50, 56, 312, and 812), etc. The criteria for IRB review and approval are as follows:
(1) Approved: The research activity conforms to the criteria for approval defined by the related law, and no modification of the research is recommended.
(2) Approved with modifications: The research activity meets the criteria for approval, and the modifications required by the IRB are only minor and of minimal risk.
(3) Deferred: The study does not meet the criteria for approval as defined in the relevant laws and regulations, lacks sufficient information to conduct an adequate review, and/or requires substantial revisions to the protocol.
(4) Disapproved: The study is found to be unethical, without scientific or scholarly merit, and/or does not meet the criteria for approval as defined in the relevant laws and regulations.
(5) Tabled: The study cannot be reviewed at the meeting because of lack of time, lack of quorum, and/or extenuating circumstances.

Scoring method for IRB review results
To assess the review quality and standardization, HPC staff developed "essential review points," based on GCP and HRPP guidelines, and "protocol review points" that IRB members should detect and discuss during the review. As a first step, the QA team staff analyzed the IRB minutes and set out the discussion points. The second step was to collect the IRB minutes of all panels and confirm the details of the discussion results. As in a systematic review or meta-analysis, we searched the minutes for each specific term and similar terms to ensure that all relevant discussions were included. Then, the QA team discussed and reached a consensus on the content of the discussion and whether it validated the blind review score. Thereafter, we developed a 3-point scale with the values 0, 5, and 10, with 10 indicating that both reviewers addressed the review point and 5 indicating that the primary reviewer addressed it but the secondary reviewer did not. Points were awarded even when an incorrect comment was made in the IRB review, because knowing what to point out is as important as making accurate comments. Thus, even if an accurate point was not made, if any IRB member discussed the topic, we marked 5 points for the question. A score of 0 was allocated when none of the IRB members discussed or mentioned the issue. For GCP requirements, we identified "risk and benefit analysis" (ICH-GCP 2.2), "determining the continuing review interval" (ICH-GCP 3.3.4), and "research resources" (ICH-GCP 3.1.3, 3.1.9) as mandatory review points based on GCP regulations [18]. In addition, we assessed whether the panels identified the issues that required discussion, given the information provided in each protocol. The consent process, vulnerability of participants, waivers for stored specimens, concerns regarding the confidentiality of participants, randomization methods, and the appropriateness of placebo procedures for the control group formed the majority of these considerations.
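The scoring scheme described above can be illustrated with a minimal Python sketch. It assumes, as stated in the abstract, three GCP review points and three protocol review points per protocol, each scored 0, 5, or 10. The function names and the example review-point labels are hypothetical and are not taken from the HRPP database.

```python
from typing import Dict

# Allowed values on the 3-point scale described in the text.
VALID_POINTS = {0, 5, 10}


def category_score(points: Dict[str, int]) -> int:
    """Sum the scores of one category (three review points, so 0 to 30)."""
    for name, value in points.items():
        if value not in VALID_POINTS:
            raise ValueError(f"{name}: score must be 0, 5, or 10, got {value}")
    return sum(points.values())


def total_score(gcp: Dict[str, int], protocol: Dict[str, int]) -> int:
    """Total score for one panel on one protocol (0 to 60)."""
    return category_score(gcp) + category_score(protocol)


# Hypothetical panel: labels and values are illustrative only.
example_gcp = {
    "risk_benefit_analysis": 10,
    "continuing_review_interval": 5,
    "research_resources": 0,
}
example_protocol = {
    "consent_process": 10,
    "vulnerability_of_participants": 10,
    "confidentiality": 5,
}
example_total = total_score(example_gcp, example_protocol)
```

In this hypothetical case the GCP category contributes 15 points and the protocol category 25 points, for a total of 40 out of 60.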

Statistical Analysis
Descriptive data were expressed as mean ± standard deviation. Differences between review scores, according to member expertise and IRB experience, were analyzed using the Wilcoxon rank-sum test and Kruskal-Wallis test [19,20]. Because Fleiss' kappa can take paradoxically low values despite a high proportion of agreement when the raters' marginal totals are imbalanced, the observed multiple rater agreement was used instead of Fleiss' kappa to assess the agreement of IRB results among panels [21]. The observed multiple rater agreement was estimated by summing all pairwise agreement tables from the panels, and a 95% confidence interval (CI) was obtained from 1,000 bootstrap resamples [22]. All statistical analyses were conducted using SPSS statistical software version 23 (SPSS Inc., Chicago, IL, USA), SAS software version 9.4 (SAS Institute Inc., Cary, NC), and R software version 3.6.3 (R Foundation for Statistical Computing, Vienna, Austria). A two-sided p-value <0.05 was considered statistically significant.
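As an illustration of the agreement measure, the following Python sketch computes the proportion of agreeing panel pairs pooled over protocols and a percentile bootstrap interval. It is a sketch under stated assumptions, not the analysis code used in the study (which was run in SPSS, SAS, and R): the text does not state the resampling unit, so resampling whole protocols with replacement is our assumption, and the decision codes in the toy data are hypothetical.

```python
import random
from itertools import combinations
from typing import List, Sequence, Tuple


def pairwise_counts(decisions: Sequence[int]) -> Tuple[int, int]:
    """Return (agreeing pairs, total pairs) among the panels for one protocol."""
    pairs = list(combinations(decisions, 2))
    agree = sum(1 for a, b in pairs if a == b)
    return agree, len(pairs)


def observed_agreement(ratings: Sequence[Sequence[int]]) -> float:
    """Pool pairwise agreement over all protocols and return the proportion."""
    agree = total = 0
    for decisions in ratings:
        a, n = pairwise_counts(decisions)
        agree += a
        total += n
    if total == 0:
        raise ValueError("need at least one protocol rated by two or more panels")
    return agree / total


def bootstrap_ci(
    ratings: List[Sequence[int]],
    n_boot: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap CI; resampling protocols is an assumption."""
    rng = random.Random(seed)
    stats = sorted(
        observed_agreement([rng.choice(ratings) for _ in ratings])
        for _ in range(n_boot)
    )
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


# Hypothetical decisions (1 = approved, 2 = approved with modifications,
# 3 = deferred) from four panels on three protocols; not the study data.
toy_ratings = [
    [2, 2, 2, 2],  # all six panel pairs agree
    [2, 2, 3, 3],  # two of six pairs agree
    [1, 2, 3, 2],  # one of six pairs agrees
]
toy_agreement = observed_agreement(toy_ratings)
toy_ci = bootstrap_ci(toy_ratings)
```

With four panels there are six panel pairs per protocol, so the toy data contain 18 pairs, of which 9 agree. With only a few protocols, the bootstrap interval is necessarily wide, which is consistent with the wide interval reported for the three-protocol pilot.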

Baseline Characteristics of protocols and IRB
We executed eight different blind review protocols, from March 2010 to October 2018. Among the eight protocols, 75% (six out of eight) had originally been reviewed in a convened meeting and 25% (two out of eight) in an expedited review before the blind review, based on review type criteria (Convened meeting/Expedited Review). All eight protocols selected for blind review were reviewed via a convened meeting, regardless of the original review type. Overall, single-center studies (75%) and interventional studies (62.5%) prevailed among protocols included in the blind review. The study population and notable points of discussion are described in Table 1.

Agreement among the panels on each protocol
The IRB review results of different panels were scored on a five-point scale, with "Approved" = 1, "Approved with modification" = 2, "Deferred" = 3, "Disapproved" = 4, and "Tabled" = 5. The correspondence of IRB results among panels according to the protocol is visualized in Figure 2. We analyzed IRB agreements across two groups: the pilot assessment with three protocols and the main assessment with five protocols. The observed multiple rater agreement was 0.444 (95% CI: 0.167-1.000) with protocols 1-3 in 2010, while an agreement of 0.448 (95% CI: 0.362-0.514) was observed between 2014 and 2018 with seven panels. To include recent blind review results, we extracted the results of panels 1-4 from all the reviewed protocols (pilot and main), which showed that agreement increased to 0.479 (95% CI: 0.271-0.708) from 0.444 (pilot) and 0.448 (main) (Table 2).

Scores and associated factors regarding IRB review results
Scores ranged from 0 to 30 points for each of the two categories (GCP requirements and protocol issues), so the total score ranged from 0 to 60 points. We analyzed the mean score and standard deviation and compared these between protocols (Table 3). The mean values of the two categories, GCP requirements and protocol-specific issues, were 19.25±8.21 and 18.40±9.04, respectively. In terms of GCP requirements, the scores gradually increased from the pilot studies in 2010 through to the recent group (Protocol 1=15±12.25, Protocol 2=15±9.13 vs. Protocol 7=22.8±6.98, Protocol 8=20±8.16). In contrast to the GCP scores, the scores for protocol review issues remained close to the average. A comparison of the scores produced by the different protocols is shown in Figure 3, using a graph with a standard deviation range. Detailed scores and review issues are described in Supplementary Tables 1 and 2. To assess the impact of IRB members' characteristics on review results, we collected information on the experience of the IRB members who participated in the convened meeting. We also investigated whether any IRB members had specific insights regarding the protocols that were applied in the blind review. For example, protocol 2 focused on pulmonary hypertension; thus, an IRB panel that featured a member who was familiar with cardiology was regarded as containing a member specialized in that protocol. We excluded the IRB members who had disclosed conflicts of interest regarding the protocol, as well as those panels that had no specialized members. The differences in the review results based on the inclusion/exclusion of expert IRB members are shown in Table 4. The mean score of the panels with experts was higher than that of the control group, but this result was not statistically significant (p=0.93). According to the career duration of IRB members, review scores of the groups whose members' careers spanned less than 3 years or 3-5 years were generally lower than those of the groups whose members' careers spanned more than 5 years (p=0.09).
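The three-group comparison by career duration relies on the Kruskal-Wallis test. The sketch below shows how the test statistic can be computed by hand for exactly three groups, including the tie correction, which matters here because scores built from 0/5/10 items produce many tied values. It is an illustrative reimplementation, not the SPSS, SAS, or R routine used in the study. The p-value uses the chi-square approximation with 2 degrees of freedom, which can be rough for small groups, and the function names and sample values are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> Dict[float, float]:
    """Map each distinct value to its average (mid) rank in the pooled sample."""
    counts = Counter(values)
    ranks: Dict[float, float] = {}
    start = 1
    for v in sorted(counts):
        c = counts[v]
        ranks[v] = start + (c - 1) / 2  # mean of ranks start .. start + c - 1
        start += c
    return ranks


def kruskal_wallis_3(groups: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """Tie-corrected H statistic and approximate p-value for three groups."""
    if len(groups) != 3 or any(len(g) == 0 for g in groups):
        raise ValueError("this sketch handles exactly three non-empty groups")
    pooled = [x for g in groups for x in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    rank_term = sum(sum(ranks[x] for x in g) ** 2 / len(g) for g in groups)
    h = 12 / (n * (n + 1)) * rank_term - 3 * (n + 1)
    # Tie correction: divide by 1 - sum(t^3 - t) / (n^3 - n).
    ties = Counter(pooled)
    correction = 1 - sum(t ** 3 - t for t in ties.values()) / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all values are identical")
    h /= correction
    # Chi-square survival function with 2 degrees of freedom is exp(-h / 2).
    p_value = math.exp(-h / 2)
    return h, p_value


# Hypothetical, fully separated groups without ties; not the study data.
h_stat, p_val = kruskal_wallis_3([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
```

Restricting the sketch to three groups keeps the p-value in closed form; a general implementation would need the chi-square distribution for other degrees of freedom.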

Discussion
Inconsistencies in IRB decisions have been observed since the 1980s, largely due to an increase in the number of multicenter clinical trials [9,14]; furthermore, the fact that IRB panels sometimes review the same protocol in different ways remains a major issue. IRBs aim to conform to the protocol and ensure sound research design, both scientifically and ethically. Thus, in terms of ethical considerations, differing opinions are an undisputed consequence. In addition, consistency among IRBs cannot ensure a correct decision, but major inconsistencies can affect the reliability of the IRB in terms of the wider research environment. Therefore, HRPP QA activities must verify whether IRBs identify inappropriate designs or miss essential review points.
This study analyzed the IRB minutes of panels at the same institution, following the same IRB regulations, to observe the consistency of review results. Several studies have shown that IRBs are relatively consistent in terms of their final decision but are often inconsistent regarding the reasons for their decisions [11,12,23,24]. Taylor et al. [25] suggested a process for measuring the ethical quality of local IRBs. They defined seven points (scientific value, assessment of risk, benefit, acceptable risk/benefit ratio, fair selection of subjects, adequately informed consent process, and adequate mechanisms for respecting enrolled subjects) and confirmed how many points were satisfied when one protocol was reviewed by multiple IRBs. A revised Common Rule was also announced recently regarding the use of a single IRB, which aims to reduce administrative costs. As of January 25, 2018, the National Institutes of Health (NIH) has required the use of a single IRB of record for most domestic NIH-funded multisite research studies [26]. There is growing interest in single IRBs, and one study has developed a process for measuring IRB inconsistency called "mystery shopper" and a scoring system using discussion themes [13].
Although many studies continue to find inconsistencies, it is still unclear why such inconsistencies occur in multiple IRBs. Our analysis sought to clarify this inconsistency by considering both the issue of discussion and the characteristics of IRB members. We assessed the consistency of seven IRB panel results within a single institution. Partial consistency was observed, although we have only analyzed the results in an exploratory manner. Agreement was shown to be poor, though we focused on the review process rather than on the review results. In addition, to resolve the disagreements in each panel, HPC provided IRB workshops to discuss the inconsistent blind review results and review process as an HRPP QA activity. Furthermore, to overcome the limitations of previous studies with a scoring system, HRPP staff in this study comprehensively discussed and determined the review points. The total score can be considered a direct measuring tool for IRB QA activity. Based on our data, compliance with the GCP requirements during an IRB review showed gradual improvement over time, while protocol-specific reviews were influenced by the characteristics of each protocol, making it relatively difficult to achieve consistency or standardization. This indicates that IRB reviews are commonly affected by specific protocols. However, both the observed multiple raters' agreement and the review scores increased with an increase in the experience of panel members. Therefore, IRB members' career spans and compatibility are factors that affect IRB review processes, and active and systematic training of IRB members is recommended to improve the quality of IRB reviews. IRBs are not a permanent system; rather, they should be considered a regular interaction space for experts and non-experts. There is a significant need to support IRBs at the institutional level because laws change continuously and often become stricter. Our data show that the continued involvement of IRB members is essential to ensure and improve the quality of IRB reviews. Continuous monitoring of IRB reviews is necessary for agreed standardization, excluding extreme and odd results rather than viewing them as being of equal merit. We recommend that the blind review process be integrated into the HRPP's activities for assessing the quality of an IRB review. Notably, proxy outcomes with review standardization can lead to participant protection within the HRPP, while IRBs were originally instituted to secure participants' rights, welfare, and well-being. Our results can be applied practically while the regulatory compliance and proportionality of multiple IRBs is integrated into the advances made in IRB and HRPP effectiveness [17]. Furthermore, standard review results and IRB recommendations from many different IRBs can accelerate the initiation of multicenter clinical trials, boosting global healthcare innovation by increasing patients' opportunities to access novel therapeutic agents.
This study has several limitations. First, it was a retrospective study that was based on IRB minutes from a single tertiary hospital. This limits the generalizability of our data to the HRPP environment. However, we report our experience as a preliminary study: we provide basic information for the blind review and the monitoring process for IRB reviews, and hope that this will be explored in further studies. In addition, keeping the reviewers blinded to the process can make the results more scientific and less biased. Second, only a small number of protocols were included in the blind review. However, as our center has operated HRPP since 2010, and maintains 8 years of experience in IRB review quality assessment, we believe that any degree of bias was minimal. Third, the review scores were assigned at the discretion of HRPP staff, and the review points were set by staff consensus rather than by validated scientific criteria. Further studies should investigate and validate this as a multicenter study involving a larger number of protocols.

Conclusion
Blind review is an effective method for overseeing and ensuring the quality of IRB reviews, as well as overall GCP compliance. The usefulness of the blind review process has not been established because of insufficient evidence. However, according to our study, this process seems to be an effective monitoring method in the highly challenging environment of clinical research ethics.

Figure 1 Flow

Figure 1 Changes

Table 1. Baseline characteristics of protocols included in the blind review (n=8)

Table 2. Observed multiple raters' agreement on blind review results

Table 3. Comparison of the review scores according to the protocols

Table 4. The difference in review results depending on the presence of experts on each IRB panel