Accuracy, Precision, And Agreement Statistical Tests For Bland-Altman Method

Background: The Bland and Altman plot method is a widely cited graphical approach to assess the equivalence of quantitative measurement techniques. Perhaps because of its graphical output it has been widely applied, but it is often misinterpreted for lack of inferential statistical support. To compare data sets obtained from two measurement techniques, researchers may apply Pearson's correlation, ordinary least-squares linear regression, or the Bland-Altman plot method, failing to locate the weakness of each measurement technique. We aim to develop and distribute a statistical method in R that adds robust and suitable inferential statistics of equivalence. Methods: Three nested tests based on structural regressions are proposed to assess the equivalence of structural means (accuracy), the equivalence of structural variances (precision), and concordance with the structural bisector line (agreement in measurements of data pairs obtained from the same subject), in order to reach statistical support for the equivalence of measurement techniques. Graphical outputs illustrating these three tests were added to follow Bland and Altman's principles of easy communication. Results: Statistical p-values and a robust approach by bootstrapping, with corresponding graphs, provide objective, robust measures of equivalence. Five pairs of data sets were analyzed in order to critically reassess previously published articles that applied Bland and Altman's principles, thus showing the suitability of the present statistical approach. One case showed strict equivalence, three cases showed partial equivalence, and one case showed poor equivalence. A package containing open source code and data is available, with installation instructions, on SourceForge for free distribution. Conclusions: Statistical p-values and a robust approach assess the equivalence of accuracy, precision, and agreement for measurement techniques. Decomposition into three tests helps to locate any disagreement as a means to fix a new technique.


Background
The seminal paper of J. Martin Bland and Douglas G. Altman [1] introduced a graphical approach to detect equivalence of two measurement techniques. It has been applied in several medical areas to compare a variety of data [2,3,4,5,6,7,8].
Bland and Altman proposed in 1986 what would become their famous method. In a nutshell, Bland-Altman plots assess the 95% limits of agreement (LoA), given by the mean difference of individual measurements provided by two techniques plus or minus 1.96 times the standard deviation of these differences, plotted as horizontal lines. When the LoA are smaller than a given tolerance (differences small enough to be considered clinically unimportant) the two techniques are assumed to have acceptable equivalence [9,10,11]. More recently, confidence intervals were added around the upper and lower LoA [12,13,14,15]. Although these intervals provide a little more room for the tolerance and may be regarded as a statistical test for the band limits, they do not provide any decision about the techniques' equivalence.
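The computation of the limits of agreement described above can be sketched as follows. This is a minimal illustration in Python (standard library only), independent of the R package distributed with this work; the function name, the paired readings, and the tolerance are hypothetical and serve only to show the arithmetic.

```python
from statistics import mean, stdev

def limits_of_agreement(a, b, z=1.96):
    """Return (bias, lower LoA, upper LoA) for paired measurements a and b."""
    if len(a) != len(b):
        raise ValueError("a and b must contain paired measurements")
    diffs = [ai - bi for ai, bi in zip(a, b)]
    bias = mean(diffs)   # mean difference between techniques
    sd = stdev(diffs)    # sample standard deviation of the differences
    return bias, bias - z * sd, bias + z * sd

# Hypothetical paired readings, invented for illustration only
technique_a = [10.0, 12.0, 11.0, 14.0, 13.0]
technique_b = [11.0, 11.0, 12.0, 13.0, 15.0]
bias, lower, upper = limits_of_agreement(technique_a, technique_b)

# Hypothetical tolerance: the largest clinically unimportant difference
tolerance = 3.0
acceptable = (-tolerance <= lower) and (upper <= tolerance)
```

Note that the final comparison against the tolerance is a threshold chosen by the researcher, not a statistical test.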
The Bland-Altman plot method is, therefore, subjective [16]. The threshold of clinical importance is set by the researcher, a situation equivalent to accepting a null hypothesis by visual inspection of the graph, without any measurement of the amount of equivalence and without inferential statistical support.
Perhaps due to the lack of this support, the equivalence evaluation gave room to misunderstandings and anecdotal interpretation of data, sometimes even against the original authors' recommendation. It is frequently misinterpreted as "two exams are equivalent when the majority of all data are located within the band limits" [16,17], a condition that always holds (between 73% and 100% of the data lie within the limits, independently of the data distribution, according to Chebychev's inequality theorem [18,19]), or as "the points inside the band must be uniformly distributed" (never stated by the original authors). The Bland-Altman plot method is insufficient in the sense that it does not carry more than a visual decision.
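The distribution-free lower bound mentioned above follows from Chebychev's inequality, which guarantees that at least 1 - 1/k² of any distribution with finite variance lies within k standard deviations of its mean. A minimal sketch in Python (the function name is hypothetical) evaluates it for k = 1.96:

```python
def chebyshev_lower_bound(k):
    """Minimum fraction of any finite-variance distribution within k SDs of the mean."""
    if k <= 1.0:
        raise ValueError("the bound is informative only for k > 1")
    return 1.0 - 1.0 / (k * k)

# Fraction guaranteed inside mean +/- 1.96 SD, whatever the distribution
bound = chebyshev_lower_bound(1.96)
```

The value is a little under 74%, in line with the lower figure quoted above, so finding most points inside the band says nothing about equivalence.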
The present work applies a three-step statistical decision allowing the researcher to determine if there are enough elements to reject equivalence of two techniques. The solution applies three nested tests with p values and robust statistical decisions by bootstrapping.

Methods
This investigation proposes the addition of statistical criteria to Bland and Altman's plot method [1]. Since it is a purely theoretical approach, it was not submitted to any ethics committee. The method is implemented in R to be freely distributed to applied researchers. A package containing open source code and data is available, with installation instructions, on SourceForge for free distribution [20].

Rationale
Three steps to claim strict equivalence between measurement techniques are proposed, respectively checking equivalence of structural means (equality of accuracy), structural variances (equality of precision), and agreement with the structural bisector line (equal measurements obtained from the same subject). Full equivalence may be assumed when there is non-rejection of equivalence in all three tests.
Regressions applied to all three tests are not crude regressions, but functional procedures whose results are estimators of true equivalence supported by theorems published and cited below. For analytical tests 1 and 2 the usual significance level of 5% was adopted. For test 3, the linear regression estimates two parameters; since the statistical decision about significance is taken separately, it is necessary to control probability of type I error, for which Bonferroni correction applies and the effective significance level is 2.5% for each parameter (α for intercept and β for slope).
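The decision rule described above can be sketched in a few lines. This is an illustration in Python under stated assumptions: the function name and the dictionary layout are hypothetical, and the p-values are those reported for case 1 in the Results.

```python
def nested_decision(p_accuracy, p_precision, p_intercept, p_slope, alpha=0.05):
    """Return rejection flags for the three nested tests."""
    alpha_each = alpha / 2.0  # Bonferroni: intercept and slope judged separately
    rejected = {
        "accuracy": p_accuracy < alpha,
        "precision": p_precision < alpha,
        "agreement": (p_intercept < alpha_each) or (p_slope < alpha_each),
    }
    rejected["any"] = any(rejected.values())
    return rejected

# p-values reported for case 1 (peak expiratory flow rate)
case1 = nested_decision(0.4782, 0.6525, p_intercept=0.6456, p_slope=0.6726)
full_equivalence_not_rejected = not case1["any"]
```

With the Bonferroni correction, a parameter of test 3 with p = 0.03 would not lead to rejection, whereas p = 0.02 would.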
In addition to the analytical tests, graphically expressed bootstrapping is applied to compute confidence intervals or bands in order to accompany the analytical decisions. Bootstrapping is a statistical method for robust estimation of confidence intervals, which is independent of sample size and variable distribution [21]. In our application, bootstrapping is graphically represented by shadowed areas containing 95% of all simulated regressions from resampling with replacement, thus assumed to be the area containing the population regression between measurements from the techniques of interest.
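The resampling idea can be illustrated with the simplest possible statistic. The sketch below, in Python, computes a percentile bootstrap interval for a mean difference; it is not the implementation in the distributed package, which resamples whole regressions, and the function name, the data, the number of resamples, and the seed are all assumptions made for illustration.

```python
import random
from statistics import mean

def bootstrap_mean_ci(values, n_boot=2000, level=0.95, seed=1):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and recompute the statistic each time
    stats = sorted(mean(rng.choices(values, k=n)) for _ in range(n_boot))
    tail = (1.0 - level) / 2.0
    lower = stats[int(tail * (n_boot - 1))]
    upper = stats[int((1.0 - tail) * (n_boot - 1))]
    return lower, upper

# Hypothetical differences between two techniques, invented for illustration
differences = [-1.0, 1.0, -1.0, 1.0, -2.0, 0.5, 0.0, -0.5]
ci_low, ci_high = bootstrap_mean_ci(differences)
zero_inside = ci_low <= 0.0 <= ci_high
```

The same loop applies to a regression: each resample yields one fitted line, and the band is the region containing 95% of those lines.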
This proposal is not an unnecessary complication of a heretofore simple method. It adds power to a graphical judgement by building a coherent statistical decision mechanism from results scattered through the scientific literature in unrelated papers published from 1879 to 2015, some of them as parts of statistical theory without practical applications [12,22,23,24,25,26,27,28]. The obscure term 'structural' in this context refers to functional statistical approaches necessary to purge observed measures from measurement errors [29,30], thus allowing the comparison between their true values. Interestingly, the integration of these three tests goes in reverse chronological order of their publications: accuracy was derived by Hedberg and Ayers in 2015 [22], precision was derived by Shukla in 1973 [23], and the agreement with the true bisector line depends on Deming regression [12,16], whose oldest application we could locate is in Kummel, 1879 [27]. These authors worked independently, and contemporary authors were very possibly unaware of the work of Bland and Altman and vice versa.
The statistical approach may seem somewhat convoluted because researchers have only observed data, while decisions depend on true, non-observable values. For that reason structural null hypotheses must correspond to functional procedures with observed variables and conclusions are taken from estimated regression parameters according to the mathematical theorems that connect them back to the true values. In the following we explain all structural hypotheses and their functional correspondence for the three proposed tests.
Measurements provided by a reference technique A and a candidate technique B under assessment (each technique applied once to each subject) are, according to the physics error theory, given by:
$$y_i = Y_i + \delta_i, \qquad x_i = X_i + \epsilon_i, \qquad i = 1, \ldots, n,$$
where $y_i$ and $x_i$ are independent pairs of observed measurements, $Y_i$ and $X_i$ are the corresponding true measurements, and $\delta_i$ and $\epsilon_i$ are independent measurement errors with null average. These error terms appear because all measurement techniques have a certain degree of imprecision. Assuming that $Y$ and $\delta$, and $X$ and $\epsilon$, are also statistically independent and that these errors have no preferential direction (null averages, $E[\delta] = 0$ and $E[\epsilon] = 0$), the mean of all observed values is equal to the mean of true values ($\bar{y} = \bar{Y}$ and $\bar{x} = \bar{X}$), demonstrated by
$$E[y] = E[Y + \delta] = E[Y] + E[\delta] = E[Y], \qquad E[x] = E[X + \epsilon] = E[X] + E[\epsilon] = E[X].$$
Consequently, the observed mean difference between techniques is also equal to the structural bias ($\bar{y} - \bar{x} = \bar{Y} - \bar{X}$). These equalities allow the correspondence between functional computation and structural hypotheses, reducing all three nested tests to two ordinary least-squares linear regressions and one Deming regression. Connections of structural and functional tests are explained below.

Test 1: Accuracy
This test verifies whether two measurement techniques provide the same values on average. Hedberg and Ayers [22] applied a covariate with measurement error in analysis of covariance (ANCOVA) in order to test structural mean equality for this repeated-measures design.
Therefore, the comparison of repeated measures in the same subjects, each one with its own measurement error, has the null hypotheses:
structural: $H_0: E[Y] - E[X] = 0$
functional: $H_0: \alpha = 0$, for the regression $y_i - x_i = \alpha + \beta (x_i - \bar{x}) + \nu_i$
in which $\nu_i$ is the error term. Hedberg and Ayers [22] demonstrated that the intercept of this regression line is different from zero when mean true values differ. This statistical artifice takes the intercept $\alpha$ as the mean difference when a linear regression takes $x_i - \bar{x}$ as the independent variable; with the subtraction of $\bar{x}$, the slope of the regression line is not affected (therefore it is disregarded), but the line is displaced in such a way that the intercept coincides with the mean difference between measurement techniques, which is our goal.
Analytically, it suffices to find whether the intercept is statistically different from zero to reject the null hypothesis of equal measurement means. Graphically, rejection of the null hypothesis is represented by the ordered pair (0, 0) lying outside the 95% confidence interval defined by bootstrapping.
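The property used by test 1, that centering the independent variable makes the intercept coincide with the mean difference, can be checked numerically. The following Python sketch uses a hypothetical helper `ols` and invented data; it computes only the point estimates, and the p-value for the intercept is omitted because it requires the t distribution.

```python
from statistics import mean

def ols(u, v):
    """Ordinary least squares of v on u: return (intercept, slope)."""
    u_bar, v_bar = mean(u), mean(v)
    suu = sum((ui - u_bar) ** 2 for ui in u)
    suv = sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v))
    slope = suv / suu
    return v_bar - slope * u_bar, slope

def accuracy_regression(x, y):
    """Regress y - x on x - mean(x); the intercept estimates the bias."""
    x_bar = mean(x)
    centered = [xi - x_bar for xi in x]
    diffs = [yi - xi for xi, yi in zip(x, y)]
    return ols(centered, diffs)

# Hypothetical paired measurements, invented for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.5, 1.0, 4.0, 3.0, 6.5]
intercept, slope = accuracy_regression(x, y)
mean_difference = mean(yi - xi for xi, yi in zip(x, y))
```

Because the centered regressor has zero mean, the fitted line passes through (0, mean difference), which is why the intercept alone carries the accuracy test.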

Test 2: Precision
Verification of equal variability of measurement errors is based on Shukla [23] and was also independently adopted by Oldham [25], without widespread application. The null hypotheses are:
structural: $H_0: \lambda = 1$
functional: $H_0: \beta = 0$, for the regression $y_i - x_i = \alpha + \beta (x_i + y_i) + \theta_i$
in which $\theta_i$ is the error term.
Here, the structural null hypothesis takes $\lambda$ as the ratio between the variances of the measurement errors $\delta$ and $\epsilon$. If the variability of errors is similar ($\lambda = 1$) then precisions are similar.
According to Shukla [23], the slope of a regression of $y - x$ on $x + y$ will differ from zero when the variances of the two measurement errors differ. Here we observe that the axes proposed by Shukla are the same ones used in Bland and Altman's original concept [1], with the difference between measures represented on the y-axis and the sum (or the average, which does not change the test) on the x-axis. It becomes evident that this arrangement can only compare variance between measurement errors and not full equivalence as Bland and Altman suggested.
Therefore, if the true $\lambda$ is a value other than 1, the functional correlation is not null and the slope of the regression is also not null, thus leading to rejection of the null hypothesis of equal precisions. Graphically, rejection of this null hypothesis corresponds to a horizontal line that cannot be contained in the 95% confidence band defined by the functional regression.

Test 3: Bisector line agreement
This test applies Deming regression in order to verify if two measurement techniques measure the same values in the same subjects [12,24,26,27,28].
When true values measured by two techniques coincide, ordered pairs of these measures follow the true bisector line. Therefore, the null hypotheses are:
structural: $H_0: Y_i = \alpha + \beta X_i$, where $\alpha = 0$ and $\beta = 1$
functional: $H_0: \alpha = 0$ and $\beta = 1$, for the regression $y_i = \alpha + \beta x_i + (\delta_i - \beta \epsilon_i)$
While ordinary least-squares regression (OLSR) treats the independent variable $x$ as free of measurement error, Deming regression takes into account the measurement error in both measurement techniques. In addition, observe that $\epsilon$ is simultaneously the measurement error of $x$ and part of the overall regression error term ($\delta - \beta \epsilon$). It follows that $x$ is correlated with the combined error, which prevents the use of OLSR [29,30,31,32,33]. Linnet [26] studied several regression methods, showing that the Deming regression method is robust and performs better than OLSR. Deming regression also depends on $\lambda$, which was estimated following Shukla [23]. Another situation occurs when each technique is applied to each subject more than once, for which the calculation of $\lambda$ was implemented according to the NCSS Manual [34].
The null hypothesis is rejected when the intercept differs from 0 or slope differs from 1, considering Bonferroni correction. However, the independent appraisal of intercept and slope is the weakest way to decide, for it decreases test power.
Graphically, two progressively stronger alternative statistical approaches were implemented for bisector line agreement: (1) the assessment of the confidence ellipse by bootstrapping that can jointly test intercept and slope and (2) the assessment of regression by bootstrapping that can take into account the whole confidence band, thus rejecting the null hypothesis when it is not possible to accommodate the bisector line inside the 95% functional regression confidence band.
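For readers unfamiliar with Deming regression, the point estimates can be sketched with the usual closed-form solution. This Python illustration is not the routine of the distributed package: the function name is hypothetical, `lam` is assumed here to be the error variance of y divided by the error variance of x (other sources use the reciprocal convention), and the data are invented. Confidence bands and the confidence ellipse would come from repeating the fit on bootstrap resamples.

```python
from math import sqrt
from statistics import mean

def deming(x, y, lam=1.0):
    """Deming regression of y on x: return (intercept, slope).

    lam is assumed to be var(error of y) / var(error of x).
    """
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    if sxy == 0:
        raise ValueError("slope is undefined when the sample covariance is zero")
    d = syy - lam * sxx
    slope = (d + sqrt(d * d + 4.0 * lam * sxy * sxy)) / (2.0 * sxy)
    return y_bar - slope * x_bar, slope

# Points lying exactly on y = 2 + 3x recover that line for any lam
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0 + 3.0 * xi for xi in x]
intercept, slope = deming(x, y, lam=1.692)

# Points lying exactly on the bisector give intercept 0 and slope 1
b_intercept, b_slope = deming(x, x)
```

With noisy data the estimates depend on `lam`, which is why the precision test and the agreement test are linked.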

Translations
All three tests are performed with two statistical strategies: analytical (decision by p-value) and graphical (decision by bootstrapping). In some cases the analytical approach did not reject the null hypothesis while the graphical approach showed reference lines outside the confidence bands for the precision and bisector line agreement tests, because of a difference in accuracy.
Translations are proposed because we observed that this discordance between the analytical and graphical approaches was due to the bias of the candidate technique. That is to say, the analytical tests seem to be somewhat purer than the graphical approach, and they were able to detect the absence of difference in precision and agreement despite biased means.
Translation of lines by the amount of bias computed in the accuracy test is a simple correction. The removal of the mean difference displaces the reference lines in such a way that the analytical approach becomes coherent with the graphical approach, with the lines positioned inside the confidence band obtained by bootstrapping. Specifically, for the precision test, non-rejection of the null hypothesis occurs when some horizontal line displaced by the bias can be located inside the 95% confidence band. Similarly, Deming regression applied to the bisector agreement test shows non-rejection of the null hypothesis when lines parallel to the bisector line, translated within the bias range, can be located inside the 95% bootstrapping confidence regression band.
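The translation check for the bisector test can be expressed as an interval intersection. The Python sketch below assumes, for illustration only, that the confidence band is available as lower and upper values on a grid of x values and that the bias range is an interval; the function name, this representation, and the numbers are hypothetical, and checking only at grid points is an approximation.

```python
def feasible_bisector_shift(grid, band_lower, band_upper, bias_low, bias_high):
    """Return the interval of shifts b within [bias_low, bias_high] for which
    the line y = x + b stays inside the band at every grid point, or None."""
    # At each grid point g the shift must satisfy lower - g <= b <= upper - g
    low = max([lo - g for g, lo in zip(grid, band_lower)] + [bias_low])
    high = min([up - g for g, up in zip(grid, band_upper)] + [bias_high])
    return (low, high) if low <= high else None

# Hypothetical band that lies above the bisector line
grid = [0.0, 1.0, 2.0]
band_lower = [0.5, 1.4, 2.3]
band_upper = [1.5, 2.6, 3.7]

# No translation allowed beyond a small range around zero: the bisector does not fit
without_translation = feasible_bisector_shift(grid, band_lower, band_upper, -0.2, 0.2)
# Translation within a hypothetical bias range of 0.6 to 1.2 fits inside the band
with_translation = feasible_bisector_shift(grid, band_lower, band_upper, 0.6, 1.2)
```

The same function covers the precision test if the band and the horizontal reference line are expressed relative to zero instead of the bisector.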

Results
We revisited five data sets: one from Bland and Altman [1] (case 1), three others from Bland and Altman [2] (case 2), and one provided by Videira and Vieira [35] (case 3).

Case 1
Bland and Altman's graphical plot method was originally proposed to assess peak expiratory flow rate (PEFR) measured with two instruments: the Wright Peak Flow meter (taken as reference) and the Mini Wright Peak Flow meter [1]. Seventeen subjects were measured with each instrument. Both flow meters may be assumed to be strictly equivalent measurement techniques, as shown in Figure 1. More than one approach is represented here. The p-values are analytical statistical tests (accuracy, p = 0.4782; precision, p = 0.6525; bisector concordance test, p slope = 0.6726 and p intercept = 0.6456). Shadowed areas are structural regression bands obtained by bootstrapping: accuracy shows the null hypothesis, represented by the diamond symbol at the coordinates (0, 0), inside the 95% confidence interval; precision shows the null hypothesis, represented by a horizontal line, inside the 95% confidence band defined by the structural regression; the bisector concordance test (λ = 1.692) shows the null hypothesis, represented by the structural bisector, accommodated in the 95% confidence band defined by Deming regression. In addition, the insert (right panel) shows the 95% confidence ellipse as an alternative to jointly test slope and intercept (the null hypothesis assumes, simultaneously, slope equal to 1 and intercept equal to 0).

Case 2
Bland and Altman provided three other application examples of their graphical method [2].
(a) The first example is a comparison between systolic blood pressure measurements taken by an observer (J) and by an automatic machine (S), detecting a systematic bias towards the machine (n = 85). After removal of outliers, the authors concluded that the interval was too wide to assume equivalence between observer and machine. Our comparison showed the same bias but did not need exclusion of outliers, and the data passed the precision and bisector line agreement tests, thus supporting that there is no strict equivalence but that, by discounting the bias, observer and machine may be interchangeable (Figure 2). Here, the structural bias (overestimation by S) is represented by the 95% confidence interval above the diamond. However, by translation, precision shows that horizontal lines can be inside the 95% confidence band, and agreement is represented by bisector lines also contained inside the 95% confidence band. Coherently, the translated intercept is inside the 95% confidence ellipse. Note that the analytical test shown at the header of test 3 rejected the null hypothesis due to an intercept different from zero, a consequence of the bias detected by the accuracy test that cannot be corrected by this traditional analytical approach.
(b) The second example is the estimated percentage of plasma volume in blood provided by the Nadler and Hurley techniques (n = 99). Based on their canonical plot, the original authors showed increasing bias toward Nadler's technique with greater average values. In order to verify equivalence between techniques, the authors proposed two strategies: application of a logarithm transformation and scaling of one of the measurements (multiplying Hurley measurements by 1.11). Figure 3 shows our approach. The upper panels confirm no equivalence between methods in any of the three tests. The second row of panels shows that Bland and Altman's proposition of a logarithm transformation of both scales does not solve the structural bias, but leads to equivalence in the precision and agreement line tests. The third row represents the multiplication of Hurley values by 1.11, which is a more successful strategy, although with marginal failure for accuracy. Finally, fine-tuning this factor guided by our approach, we found strict equivalence when multiplying the Hurley technique's values by approximately 1.1038, with accuracy equivalence and improved precision and agreement line tests in the lower panels of Figure 3.
(c) In the third example Bland and Altman compared fat content in human milk measured by enzymic hydrolysis of triglycerides and by the standard Gerber technique (n = 45). Values obtained by enzymic hydrolysis are overestimated for smaller values and underestimated for greater values. Therefore, these authors provide a long discussion to readjust their traditional horizontal lines into a slanting band formed by two straight lines in order to accommodate these differences. This is not required with our proposal (Figure 4), since the precision test naturally produced a slanted band. Our result opposes the authors' original conclusion: these techniques cannot be assumed to be strictly equivalent in precision and agreement line.

Case 3
Videira and Vieira [35] compared, through questionnaires, anesthesiologists' self-perception and their peers' perceptions of skills in deciding on the use of neuromuscular blocking drugs (n = 88). These authors concluded by the Bland and Altman plot method that both perceptions did not match: subjects overestimated themselves in comparison with the opinion of their colleagues. Our approach (Figure 5) shows that the accuracy test supports the bias described by those authors as the well-documented "above-average effect" (tendency to consider oneself better qualified). The precision test, however, even with translation of the horizontal line, shows that self-perception and others' opinions are not equivalent. The null hypothesis of agreement with the bisector line was not rejected due to lambda compensation (if lambda were assumed to be 1, then the band would be narrower and the null hypothesis would be rejected; not shown), but this equivalence may be meaningless because these measurement techniques are not comparable in accuracy and precision according to our method.

Discussion
Bland-Altman's analysis emphasizes clinical significance and their plots largely ignore statistical inference, relying on visual inspection to draw what is considered by Watson and Petrie as subjective conclusions [16]. Our contribution adds an objective statistical inference and locates causes of non-equivalence by taking apart accuracy, precision, and bisector agreement.
Altman and Bland stated that "the use of correlation is misleading" [36], showing that Pearson's correlation is an insufficient method. Their original study [1] was motivated by the need to overcome correlation indices between clinical measurements and came to be used for this purpose because of its easier communication by means of graphical outputs [11]. However, Altman and Bland also stated that "comparability of techniques of measurement is an estimation problem: statistical significance is irrelevant" [37] and that "these are questions of estimation, not significance tests, and show how confidence intervals can be found for these estimates" [38], with which we respectfully disagree. In fact, we look for a statistical treatment to compare any two related measurement techniques and a proper method to compute confidence intervals, instead of non-informative Chebychev's intervals, with or without additional relaxation of the LoA or adaptations to create slanted limits of agreement that these authors erroneously proposed [38].
Here we tested five published data sets. In all these examples the Bland-Altman plot method was applied and, like any other paper published since, there were no associated statistical tests. These previous papers served as reference to compare the original authors' conclusions with those provided by our method. Besides partially supporting published conclusions, our proposed method is better at locating the source of non-equivalence between techniques. Peak flow meters [1] showed strict agreement in accuracy, precision, and agreement line. The other three data sets [2] are examples of solvable equivalence between methods, but it was shown that our three-step tests can add statistical support to decisions and provide solutions in an easier way. The last data set [35] is a case of non-equivalence in which our graphical output showed the conceptual importance of test nesting. In this case there is a mean difference (test 1) that should be solved by translation. However, although the measurements are equivalent according to Deming regression (test 3), precision (test 2) does not comply. Without the nested nature of the proposed approach, one could accept equivalence just by correcting bias, for a meaningless comparison. It is problematic to compare pairs of measures if they come from techniques exhibiting precision discrepancies, because greater differences of precision detected by test 2 cause deviations of λ (i.e., λ ≠ 1, which is computed along test 2) which, in turn, affect the confidence band of test 3.
In order to better guide a researcher, the concept of line translations was created. This concept allows the verification of precision and bisector agreement even when two measurement techniques do not provide equality of means. Even being biased, a surrogate technique which provides equal precision and agreement may deserve a chance for practical use: if biased but equally precise and showing a constant disagreement, a simple arithmetic correction or calibration can fix the technique; if less precise but providing the same average measurement, it could be eligible as a screening step; if this imprecision imposes risks to patients, then the technique must be reviewed. Anyhow, the decomposition in accuracy, precision and agreement with bisector line analysis should be helpful to a researcher deciding where to put energy to fix a new technique when full equivalence is not obtained.
On the other hand, non-rejection of the null hypothesis is not enough. When one pursues the comparison of two techniques, the acceptance of equivalence (the acceptance of the null hypothesis) is necessary to allow replacement of a given technique by another, which depends on the probability of type II error in order to establish equivalence with power of 90%. Its computation a posteriori from a sample is meaningless [39]; therefore, a priori planning of sample size during study design is crucial. The guidelines from the Clinical and Laboratory Standards Institute [40] state that at least 100 observations are necessary to claim consistency of a candidate measurement procedure applicable to different populations (item 6.3, page 12), down to 40 observations under more controlled laboratory conditions in which samples originate from a single population (item 7.2, page 15). However, this same source handles more than one measurement of each technique from the same patient by taking the average or median, with which we disagree: it affects the computation of λ, wastes information and, consequently, raises an ethical problem when invasive techniques are under assessment. Linnet approached this issue for Deming regression, stating that sample sizes between 40 and 100 usually are to be reconsidered [41]; this author states that the ideal number depends on the quotient between maximum and minimum measurements, proposing numbers from small sample sizes up to numbers in the order of 500 pairs of measurements (with mention of extreme numbers in the thousands). Considering this controversial discussion related to sample sizes, some classic Bland and Altman examples applied here and many other published studies may be below the limit and allow only the rejection or non-rejection of null hypotheses, without enough power to define true equivalence along the three statistical steps presented here.

Conclusions
It is possible to test whether two techniques may have full equivalence, preserving graphical communication according to Bland-Altman's principles, but adding robust and suitable inferential statistics of equivalence. Statistical p-values and a robust approach decompose the equivalence into accuracy, precision, and agreement for measurement techniques in such a way that, when full equivalence does not hold, this decomposition may help to locate the source of the problem in order to fix a new technique. Applications of the selected statistical methods using R provide automation and standardization of an otherwise complex calculation for better communication among researchers.

Declarations
-Ethics approval This investigation is purely theoretical, thus it was not submitted to any ethics committee.
-Consent for publication All authors read and approved the final version of the manuscript.
-Availability of data and materials The datasets analysed during the current study are available in the SourceForge repository, https://sourceforge.net/projects/eirasba/files/ [20]. These data can be independently downloaded from the repository (in Rdata format) but are easier to extract with the installation of the R package that was developed for the current work for free distribution. This repository contains:
• a compiled, ready-to-install file in folder PackageArchiveFile. The current version to date is eirasBA 1.0.0.tar.gz. It may be replaced by higher numbers when improvements are implemented, but it will be located in the same folder.
• all source codes are in folder SourceCode/R/
• all documentation is in folder SourceCode/man/
• all raw data are in folder SourceCode/data/
The last three folders are only necessary for developers who may be interested in modifying this R package. For researchers interested in applying our procedures "as is" it suffices to install the package archive file (e.g., eirasBA 1.0.0.tar.gz), which also comprises all raw data, R routines and documentation. This installation is simple under RStudio (current version 1.4.1717):
• download the package archive file to a local computer,
• access Tools → Install Packages...,
• change the option Install from: to Package Archive File,
• browse to find the local package archive file (tar.gz),
• proceed to the installation.
We recommend starting with the examples that replicate the full analysis presented in this paper. These examples are found by:
    library(eirasBA)
    ?all.structural.tests
In addition, all data analysed along this study are included in supplementary information files:
• eirasBA 1.0.0.tar.gz: ready-to-install file for eirasBA (R package) containing raw data and function documentation.
-Competing interests There are no competing interests to declare. The authors certify that they have NO affiliations with or involvement in any organization or entity with any financial interest (such as honoraria; educational grants; participation in speakers' bureaus; membership, employment, consultancies, stock ownership, or other equity interest; and expert testimony or patent-licensing arrangements), or non-financial interest (such as personal or professional relationships, affiliations, knowledge or beliefs) in the subject matter or materials discussed in this manuscript.
-Funding Authors have no financial support related to this theoretical investigation.
-Authors' contributions All authors contributed equally to the method development and to the final written report. PSPS designed and implemented the R routines and performed the applied statistics and simulations, transforming the statistical method into graphical communication with statistical output. He also contributed to the final manuscript with emphasis on introduction, results, and discussion. JEV is the practitioner who presented the problem and realized that it would be possible to transform the Bland and Altman plot method into a set of statistical tests. He mainly contributed to the final manuscript with emphasis on introduction and discussion, thus providing context for the present work. AAF took part in the discussion of the project and contributed to the final manuscript with emphasis on introduction and discussion. JOS provided the theoretical support and solution for the statistical approach with corresponding R packages, providing equations to compute critical steps of the statistical tests when not available from packages, and reviewed all implemented routines. He contributed to the final manuscript with emphasis on methods, results, and discussion.

Figure 1
Graphical representation from accuracy, precision, and bisector concordance tests showing that peak flow measurements from Wright and Mini PEFR are strictly equivalent. See text, case 1.

Figure 2
Comparison of Systolic blood pressure measured by a human observer J and an automatic machine S showing a structural bias (overestimation by S) at accuracy test, and concordance at the precision test and bisector test. See text, case 2(a).