Easy Computation of the Bayes Factor to Fully Quantify Occam’s Razor: Least-squares Fitting and Policy Decisions

The Bayes factor is the gold-standard figure of merit for comparing fits of models to data, for hypothesis selection and parameter estimation. However, it is little used because it has been considered to be subjective, and to be computationally very intensive. A simple computational method has been known for at least 30 years, but has been dismissed as an approximation. We show here that all three criticisms are misplaced. The method should be used with all least-squares fitting, because it can give very different, better outcomes than classical methods. It can discriminate between models with equal numbers of parameters and equally good fits to data. It quantifies the Occam's Razor injunction against over-fitting, and it demands that physically-meaningful parameters rejected by classical significance testing be included in the fitting, to avoid spurious precision and incorrect values for the other parameters. It strongly discourages the use of physically-meaningless parameters, thereby satisfying the Occam's Razor injunction to use existing entities for explanation rather than multiplying new ones. More generally, being a relative probability, the Bayes factor combines naturally with other quantitative information to guide action in the absence of certain knowledge.


Introduction
"If your experiment needs statistics, you ought to have done a better experiment" (attributed to E. Rutherford). Nevertheless, almost every practising scientist, engineer, economist, etc., uses least-squares (LS) statistical methods to fit analytic expressions to data points. This is done for parameter estimation (uncertainties as well as values) and for hypothesis or model selection.1 However, LS fitting poses questions. How to know if the fit is as good as may be?
How to choose between models which all fit well? How to detect over-fitting and under-fitting? These questions require quantitative tests based on statistical theory. There are well-known statistical tools (significance tests) such as the traditional p-value or the 3σ test, and the more recent AIC and BIC. Such tools are however inadequate, because they do not use the prior knowledge that we have.2 The Bayes factor, derived from Bayes' theorem, does do this, and so has been described as the gold-standard figure of merit for comparing models.
However, it is rarely used, not least because it can be computationally very demanding. Here we present an easy way of calculating it, so that it can be routinely used with all least-squares fitting. We demonstrate its use, and its usefulness, on three datasets from the literature.
Outcomes can be very different from those both of significance testing and of the BIC.
Moreover, when considering not merely whether a theory (a model) is true or not but, as a practical matter, deciding what action should be taken given the outcomes of the fitting, the Bayes factor can quantitatively support intuition.
Bayes' Theorem explicitly includes prior knowledge in its calculation of the probability of a hypothesis given data. It was an unexceptionable part of probability theory in the nineteenth century. However, the increasing formalisation of probability theory and statistics in the twentieth century led to its sidelining, on the grounds that it introduces a subjective element: our state of knowledge, or grounds for belief, about future events. It was considered that probabilities should be purely objective. Jeffreys' seminal book in 1939 began the rehabilitation of Bayesian statistics.3 This has been slow and controversial. For an entertaining historical survey, see the article by Leonard,4 and for a non-technical discussion see Jaynes.5 For an early technical account, see Kass and Raftery.6

Occam's Razor ("Entities should not be postulated without necessity"), in the context of least-squares fitting, demands that we should not use more fitting parameters than are necessary. That is, we should not overfit data. Classical (twentieth-century) statistics scarcely quantifies this. In 1974, Akaike introduced the AIC (Akaike Information Criterion), which quantified this issue by preferring the model with the highest log-likelihood (see equation (2) in Section 2) less a penalty of n, the number of parameters.7 This has now been largely supplanted by the BIC (Bayesian Information Criterion, or Schwarz criterion,8 SBIC), like the AIC except that the penalty for n parameters is ½ n ln m, where m is the number of data points. (Both the AIC and the BIC are usually presented after multiplication of these definitions by -2.) The BIC is now widely used.9-11 Thousands of papers per year now cite BIC values to justify the choice of one model rather than another, e.g. in ecology.12 However, the AIC and the BIC and many related criteria (DIC, FIC, ..., WAIC) are gross approximations to the Bayes factor.
Indeed, despite its name, the BIC is not Bayesian, and nor are the various related criteria, because they do not take into account the prior probabilities of the models. The Bayes factor does. In so doing, it quantifies two further intuitions, or corollaries, of Occam's Razor. The first is that fits to data that use physically-meaningful parameters are preferable, if they fit, to fits that use physically-meaningless parameters such as coefficients in a polynomial or Fourier series: the latter introduce new entities, while the former use entities that already exist. The second, closely related intuition is that a model that is not capable of fitting all possible datasets (that does not span the data space), yet does fit the actual dataset, is preferable to a model that could fit any data presented (that does span the data space).
Despite being the gold standard, the Bayes factor is little known and less used. It has been considered to be computationally massively intensive.6,8-11 Except in simple problems of models with one fitting parameter, evaluating the Bayes factors of the models has required multi-dimensional integrals over parameter space. Fitting, for example, a multi-peak spectrum with tens of parameters then requires computationally-heavy techniques such as Markov-chain Monte-Carlo integration. Because of the taint of subjectivism in its use of what we know, many Bayesians have preferred to avoid prior knowledge and to use in its place information obtained from the data, such as unit information priors.6,13,14 Yet this concern is misplaced. What we know before analysing data is as objective, in the usual scientific sense, as the data themselves.
Here we present a formula for easy calculation of the Bayes factor after every LS fit, with much less computational effort than the fit itself. This formula has been known since at least 1992,14 and perhaps earlier. Its use in routine LS fitting has not been widely advocated, because of the subjectivity issue, and because it bypasses the computational difficulties of the Bayes factor by what has been widely thought to be an approximation, the Laplace approximation,15 although MacKay already in 1992 recognised it as exact in most situations.14 Perhaps, also, the value of the Bayes factor in quantifying the two further intuitions of Occam's Razor mentioned above, and its value as a guide to action, have not been sufficiently appreciated. We present the method in Section 2. In Section 3 we briefly discuss the underlying theory, and in the Supplementary Information (SI §4) we give a derivation of the formula which we hope makes the underlying ideas clearer than they were in the older literature. Then in Section 4 we apply it to three examples of data-fitting in which the use of the Bayes factor leads to very different, and better, outcomes than traditional methods. Finally, in Section 5, we discuss the main outcomes, and consider the relevance of the Bayes factor to two live controversies. On significance (p-values etc.) in fitting, we find that reliance on significance, and the consequent rejection of physically-meaningful parameters that do not pass significance tests, will normally give incorrect results. On vitamin D and Covid-19, we see how the Bayes factor can combine with other information to provide quantitative support for actions that lack significant evidential support.

Methods
A least-squares fitting routine normally returns the fitted parameter values and their uncertainties, the fit residuals ri and their standard deviation σr, and perhaps the parameter covariance matrix Covp, the BIC, etc. The formula we apply calculates the marginal likelihood integral (MLI) for each LS fit (see Section 3):

    MLI = Lmax (2π)^(n/2) √(det Covp) / Vprior,    (1)

where n is the number of parameters and Vprior is the prior parameter volume (Section 3).14 The Bayes factor between two models is then the ratio of their MLI values. The first step in applying equation (1) is to calculate the quantity Lmax, which is the value of the likelihood L at the fitted parameter values, whether LS or ML fitting is used. L is the product of the probability densities of all the m datapoints given the fit. If it is not returned by an LS routine, it is readily calculated (see SI §S1). With perhaps hundreds of datapoints, L can be a very large or a very small number, depending on the value of the standard deviation of the residuals, σr, so it is more convenient to work with the log-likelihood, lnL. For a Gaussian distribution of residuals with standard deviation σ,

    lnL = -(m/2) ln(2πσ²) - SSR/(2σ²),    (2)

(see SI equation (S1)), so that maximising lnL is equivalent to minimising the sum of the squares of the residuals (the SSR). If the LS routine returns the SSR, then it is particularly easy to calculate lnL. When we have the MLI values for two or more fits, their ratios give the relative probabilities of the models given the data: the Bayes factors (BF) between the models. It is more convenient to work with the logarithms, and then it is the difference of the lnMLI values, lnBF, which matters. Jeffreys and many subsequent authors have given verbal descriptions of the meaning of values of lnBF, in terms of the strength of the evidence in favour of the model with the higher lnMLI: below 1, barely worth considering; 1 to 2, substantial; 2 to 5, strong evidence; above 5, decisive.3,6 More important than the verbal descriptions is that the Bayes factor simply expresses the relative probabilities of the models.
The lnBF value corresponds to odds of e^lnBF to 1 on the preferred model, or against the other model. The descriptions and the odds also apply to comparing models by differences in lnLmax between models with the same number of parameters, and by the Schwarz BIC (SBIC = -½BIC, which we use here for easy comparison with lnL, lnMLI and lnBF).
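These steps, computing lnL from the SSR and then the MLI of equation (1), take only a few lines after any LS fit. A minimal Python sketch, assuming flat priors whose ranges multiply to the prior volume (the function names and the illustrative numbers are ours, not those of any particular fitting library):

```python
import numpy as np

def ln_likelihood_gaussian(ssr, m, sigma=None):
    """lnL for Gaussian residuals, equation (2); if sigma is None,
    use the maximum-likelihood estimate sigma^2 = SSR/m."""
    sigma2 = ssr / m if sigma is None else sigma**2
    return -0.5 * m * np.log(2 * np.pi * sigma2) - ssr / (2 * sigma2)

def ln_mli(ln_lmax, cov_p, prior_ranges):
    """ln of the marginal likelihood integral, equation (1):
    MLI = Lmax * (2*pi)^(n/2) * sqrt(det Cov_p) / V_prior,
    with V_prior the product of the flat prior ranges."""
    n = len(prior_ranges)
    ln_vprior = np.sum(np.log(prior_ranges))
    sign, ln_det = np.linalg.slogdet(cov_p)   # robust ln(det Cov_p)
    return ln_lmax + 0.5 * n * np.log(2 * np.pi) + 0.5 * ln_det - ln_vprior

# Example: one parameter, uncertainty 0.2 (Cov_p = [[0.04]]), prior range 10
lnmli_example = ln_mli(0.0, [[0.04]], [10.0])
```

lnBF between two fits is then simply `ln_mli_B - ln_mli_A`, corresponding to odds of e^lnBF to 1.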

Theory
The theory is best explained through Gull's story of Mr A and Mr B.16 Mr A has a theory with no adjustable parameter; Mr B has the same theory but with one adjustable parameter λ, which he fits to the data D. Bayes' theorem then gives the posterior odds

    Pr(B|D) / Pr(A|D) = [Pr(B)/Pr(A)] × [LB(λ0)/LA] × [√(2π) δλ / (λmax - λmin)],

where Gull explains the first term on the RHS as having nothing to do with the theories or the data; it will normally be unity. Perhaps slightly tongue-in-cheek, Gull proposed that it could be adjusted to reflect the past performances of Mr A and Mr B. We take this term as unity here, but we return to it in Section 5. The second term on the RHS is the ratio of the maximum likelihoods, which will normally favour B because adding fitting parameters will normally improve the fit to data; for B, it is the likelihood evaluated at the fitted value λ0. The third term on the RHS is the Occam factor, which provides the penalty for the extra parameter in B. As Gull explains it, Mr B had to spread his probability Pr(B) over the prior range of possible values of λ that he will have specified, from λmin to λmax, with a pdf that is assumed to be flat from λmin to λmax.6,14,16 When the data are given, the probability of λ is concentrated into the posterior uncertainty δλ, and generalising from one parameter to n gives equation (1). See SI §4 for an explanation.
The ranges define a volume in parameter space, known as the prior parameter volume.
Similarly, the square root of the determinant of the covariance matrix defines another, smaller volume in the same space, the posterior parameter volume. The ratio of these two volumes is termed the Occam factor.16-18 Our equation (1) is well known in the literature; for example, it is equation (6) of Ref. 14. However, many authors choose the prior ranges not from judgements of which parameter values are physically realistic, but from the data (e.g. unit information priors). Indeed, that is the key step in using equation (1) to derive the BIC,14 and is the reason the BIC treats all parameters alike. Gull16 discusses the selection of the volume in the special case of one fitting parameter only, where the covariance matrix is not needed. Sivia and Skilling2 also consider it, but in the context of maximum-likelihood fitting and apparently much more complicated calculations, in which the simplicity of our equation (1) is lost.
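Gull's one-parameter case can be checked numerically. The sketch below (all numbers are our assumptions, chosen for illustration) compares the marginal likelihood of Mr B's model computed by direct integration over the flat prior with the Occam-factor (Laplace) form; when the likelihood is exactly Gaussian in λ, as for a linear model with Gaussian residuals, the two agree essentially exactly, illustrating why the "approximation" is in fact exact in most LS situations:

```python
import numpy as np
from scipy.integrate import quad

# Toy version of Gull's Mr A / Mr B story, one parameter lam.
lam0, dlam = 1.5, 0.5            # fitted value and posterior uncertainty (assumed)
lam_min, lam_max = -10.0, 10.0   # Mr B's stated prior range (assumed)

def likelihood_B(lam, Lmax=1.0):
    # Gaussian likelihood in lam: exact for a linear model, Gaussian residuals
    return Lmax * np.exp(-0.5 * ((lam - lam0) / dlam) ** 2)

# Marginal likelihood of B: average of L over the flat prior
mli_direct, _ = quad(likelihood_B, lam_min, lam_max)
mli_direct /= (lam_max - lam_min)

# Occam-factor (Laplace) form: Lmax * sqrt(2*pi)*dlam / (lam_max - lam_min)
mli_laplace = likelihood_B(lam0) * np.sqrt(2 * np.pi) * dlam / (lam_max - lam_min)
```

Here `mli_direct` and `mli_laplace` agree to numerical precision, because the prior range comfortably contains the likelihood peak.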

How many parameters best describe data in muon spectroscopy?
Here we find that the Bayes factor demands the inclusion of more physically-meaningful parameters than the BIC or significance tests. We present some data that might reasonably be fitted with as few as three or as many as 22 physically-meaningful parameters. We find that the Bayes factor encourages the inclusion of all these parameters until the onset of over-fitting. Even though many of them have fitted values that fail significance tests, their omission distorts the fitting results severely.

Fig.1a shows an anti-level-crossing spectrum observed in photo-excited muon-spin spectroscopy24 of an organic molecule.25 These spectra are expected to be Lorentzian peaks. Theory permits optical excitation to affect the peak position, the width and the strength (photosensitivity). In the field region over which the measurements are carried out, there is a background from detection of positrons, which has been subtracted from the data presented.25 Wang et al.25 did not attempt to fit the data; they did report a model-independent integration of the data, which demonstrated a change in area. Fitting a single peak, with and without its photosensitivity parameters (the open or small data points for n = 3 to 7 in Fig.1b), at no stage does any figure of merit show any substantial increase. There is no evidence here for photosensitivity. The weak peak around 7050 mT does not seem worth including in a fit, as it is evidenced by only two or three data points and is scarcely outside the error bars. Fitting with two peaks (P1 ~ 7210 mT, P2 ~ 7150 mT; solid or large data points for n = 5 to 16 in Fig.1b) gives substantial increases in the SBIC and lnMLI, further increased when the photosensitivity parameters LP2, LW2 and LA2 are included. The SBIC reaches its maximum here, at n = 10, and then decreases substantially. Additional parameters fail the significance tests as well as decreasing the SBIC (Fig.1b).
Conventionally, the n = 10 fit would be accepted as best. The outcome would be reported as two peaks, with significant photosensitivities LP2, LW2 and LA2 for all three of the 7150 mT peak parameters, but no photosensitivity for the 7210 mT peak (Table I).

Table I. Photosensitivity results of fitting the data of Fig.1a with 10, 16 and 19 parameters. Parameter units as implied by Fig.1a.

Table I shows that at n = 16, LP2 is the only photosensitivity parameter to pass significance tests. LA2, which had the highest significance level at n = 10, is now the parameter most consistent with zero.
The other four are suggestively (about 1.5σ) different from zero.
Since the Bayes factor has already radically changed the outcome by encouraging more physically-meaningful parameters, it is appropriate to try the 7050 mT peak parameters in the fit. With only 28 data points, we should be alert to over-fitting. We can include P3 and A3 (n = 18), and LP3 (n = 19), but W3 and LA3 do cause over-fitting. A striking feature of the n = 19 fit in Table I is that the uncertainties on the n = 16 parameters have decreased markedly. This is due to the better fit, with a substantial increase in lnL corresponding to reduced residuals on all the data. The 7210 mT peak now has photosensitivities on all its parameters, significant to at least the 2σ or p-value ~0.05 level.
And the photosensitivities LW2 and LA2, both so significant at n = 10 and already dwindling in significance at n = 16, now take values quite consistent with zero. In the light of Table I, we see that stopping the fit at n = 10 gives completely incorrect results: misleading fitted values, with certainly false uncertainties.
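The workflow of this section, refitting with an increasing number of peaks and parameters and comparing lnL, SBIC and lnMLI at each stage, can be sketched on synthetic data (the spectrum, prior ranges and numbers below are toy assumptions, not the muon data):

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(1)
x = np.linspace(-10, 10, 200)

def lorentz(x, a, x0, w):
    return a / (1 + ((x - x0) / w) ** 2)

# Synthetic "spectrum": two Lorentzian peaks plus Gaussian noise
y = (lorentz(x, 1.0, -2.0, 1.0) + lorentz(x, 0.5, 3.0, 1.5)
     + rng.normal(0, 0.05, x.size))

def figures_of_merit(model, p0, prior_ranges):
    """Fit, then return (lnL, SBIC, lnMLI) using equations (1) and (2)."""
    popt, pcov = curve_fit(model, x, y, p0=p0)
    m, n = x.size, len(popt)
    ssr = np.sum((y - model(x, *popt)) ** 2)
    lnL = -0.5 * m * (np.log(2 * np.pi * ssr / m) + 1)   # sigma^2 = SSR/m
    sbic = lnL - 0.5 * n * np.log(m)
    sign, lndet = np.linalg.slogdet(pcov)
    lnmli = (lnL + 0.5 * n * np.log(2 * np.pi) + 0.5 * lndet
             - np.sum(np.log(prior_ranges)))
    return lnL, sbic, lnmli

one = lambda x, a1, x1, w1: lorentz(x, a1, x1, w1)
two = lambda x, a1, x1, w1, a2, x2, w2: (lorentz(x, a1, x1, w1)
                                         + lorentz(x, a2, x2, w2))

# Flat prior ranges (assumed): amplitude 2, position 20, width 5 per peak
res1 = figures_of_merit(one, [1, -2, 1], [2, 20, 5])
res2 = figures_of_merit(two, [1, -2, 1, 0.5, 3, 1.5], [2, 20, 5, 2, 20, 5])
# lnBF = res2[2] - res1[2] decisively favours the two-peak model here
```

In a real analysis the same loop runs over every candidate parameter set, and the fit with the highest lnMLI, not the highest lnL or SBIC, is preferred.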

Fitting nearly linear data for the pressure dependence of the GaAs bandgap.
The main purpose of this example is to show how the Bayes factor can be used to decide between two models which have equal goodness of fit to the data (equal values of lnL and BIC, as well as p-values, etc). This illustrates the distinction it makes between physically-meaningful and physically-meaningless parameters. This example also shows how ML fitting can be used together with the Bayes factor to obtain better results. For details, see SI §7. Here, using the Bayes factor, we obtain the same result from a single dataset, that of Goñi et al.26 The two fits are shown in Fig.2. They are equally good, with values of lnL and SBIC the same to 0.01. The key curvature parameters, c and B′, are both returned as non-zero by 13.5σ (SI §7, Table S1), consequently both with p-values less than 10^-18. However, c is a physically-meaningless parameter. The tightest constraint we have for setting its range is the spread of values previously reported, ranging from 0 to 60 μeV kbar^-2, so we use a prior range Δc = 100 μeV kbar^-2.
In contrast, B′ is known for GaAs to be 4.49.29 For many other materials, and from theory, the value is close to this. When we fit the Perlin data, the Murnaghan fit returns B′ = 6.6 ± 2.4. This is outside the range, and indicates that this dataset cannot give a reliable value; attempting it is over-fitting.
However, it is good to fit this data together with the Goñi data. The Perlin data, very precise but at low pressures only, complement the Goñi data with their lower precision but large pressure range. We notice also that the Perlin data contain a proportion of outlier data points.
Maximum-likelihood fitting handles both issues. We construct lnL using different pdfs P(r) for the two datasets, with a non-Gaussian pdf for the Perlin data (see SI §7, equation (S6)).
Fixing B′ at 4.49, fitting with the same pressure coefficient b for both datasets returns 11.42 ± 0.04 meV kbar^-1. Separate coefficients for the two datasets give an increase in lnL of 4.6, with values 11.28 ± 0.06 and 11.60 ± 0.04 meV kbar^-1, a difference in b of 0.32 ± 0.07 meV kbar^-1, which is significant at 4.5σ. This difference could be due to systematic error, e.g. in pressure calibration. Or it could be real. Goñi et al.26 used absorption spectroscopy to measure the band-gap; Perlin et al.27 used photoluminescence. The increase of the electron effective mass with pressure might give rise to the difference. In any case, it is clear that high-pressure experimentation is much more accurate than previously thought, and that ML fitting exploits the information in the data much better than LS fitting.
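The mechanics of such an ML fit are simple: minimise the negative log-likelihood built from whichever residual pdf suits each dataset. A minimal sketch on synthetic linear data with injected outliers, using a Student-t pdf as a stand-in for the fat-tailed pdf of SI equation (S6) (all data and names here are our toy assumptions, not the Goñi or Perlin measurements):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm, t as student_t

rng = np.random.default_rng(0)

# Toy "bandgap vs pressure" data: E = 1519 + 11.4*P meV, noise 0.5 meV
P = np.linspace(0, 10, 40)
E = 1519.0 + 11.4 * P + rng.normal(0, 0.5, P.size)
E[::13] += 5.0                       # inject a few outliers

def negloglike(theta, logpdf):
    e0, b, s = theta
    if s <= 0:
        return np.inf                # keep the scale parameter positive
    return -np.sum(logpdf(E - (e0 + b * P), s))

gauss = lambda r, s: norm.logpdf(r, scale=s)               # = least squares
heavy = lambda r, s: student_t.logpdf(r, df=2, scale=s)    # fat-tailed pdf

fit_g = minimize(negloglike, [1519, 11, 1], args=(gauss,), method="Nelder-Mead")
fit_t = minimize(negloglike, [1519, 11, 1], args=(heavy,), method="Nelder-Mead")
# fit_t is pulled far less by the outliers than the Gaussian (LS) fit
```

Different pdfs for different datasets are combined by simply summing their log-likelihood contributions in `negloglike`.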

Carbon nanotube Raman spectrum
This example demonstrates how the Bayes factor provides a quantitative answer to the question of whether we should accept a lower quality of fit to the data if the parameter set is intuitively preferable. It also provides a simple example of a case where the Bayes factor calculated by equation (1) is in error and can readily be corrected (see SI §8, Fig.S3).
The dataset is a Raman spectrum of the radial breathing modes of a sample of carbon nanotubes under pressure, previously fitted in the traditional way.30 The key issue in the fitting was to get the intensities of the peaks as accurately as possible, to help understand the evolution with pressure. Here, we take a part of the spectrum recorded at 0 GPa, fitted with seven pseudo-Voigt peaks over a background modelled in three conventional ways: by a Fourier series (Fourier model), by a polynomial (Poly model), or by additional broad peaks (Peaks model). However, a fourth model was motivated by the observation that the three backgrounds all look as if they are related to the sharp peaks, rather like heavily broadened replicas (see Fig.3a). Accordingly, in this fourth model, we use no background, and instead modify the peak shape, giving it stronger, fatter tails than the pseudo-Voigt peaks (Tails model). This was done by adding to the Lorentzian peak function a second Lorentzian term with much greater width or, still better, a smooth function approximating to exponential tails on both sides of the peak position (for details, see SI §8), with widths and amplitudes as fitting parameters. What is added may be considered as background and is shown in Fig.3a. This model, at the stage shown, already has the highest lnMLI. All models can be taken further, with more fitting parameters: more Fourier or polynomial terms or more peaks, and, for the Tails model, more parameters distinguishing the tails attached to each of the seven Lorentzian peaks. In this way, the three background models can improve to lnL ~ -20; the Tails model does not improve above lnL ~ -50. However, as seen in Fig.3b, the MLIs worsen with too many parameters, except when over-fitting occurs, as seen for the Poly model at 35 parameters. The Tails model retains its positive lnBF > 10 over the other models.
The other models can have an indefinite number of additional parameters (more coefficients or more peaks) to fit any dataset. It is in this sense that they span the data space.
The actual number used is therefore itself a fitting parameter, with an uncertainty perhaps of the order of ±1, and a range from 0 to perhaps a quarter or a half of the number of data points d. We may therefore add to their lnMLIs a factor ~ln(4/d) ~ -5 for a few hundred data points. This takes Tails to lnBF > 15 over the other models: overwhelmingly decisive.
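The arithmetic of this correction is easily checked. The sketch below takes d and the lnBF value from the text, and treats the number of terms of a spanning model as a parameter with posterior width ±1 and prior range d/4:

```python
import numpy as np

# Occam factor for the "number of terms" parameter of a spanning model:
# posterior width ~ +/-1, prior range ~ d/4 terms for d data points.
d = 600                              # a few hundred data points (assumed)
span_penalty = np.log(4 / d)         # ln(posterior/prior) ~ -5 here

# lnBF ~ 10 of the Tails model over the spanning models, from the fits;
# the penalty applies to the spanning models, so it widens the gap.
ln_bf = 10.0
ln_bf_corrected = ln_bf - span_penalty   # ~ 15: overwhelmingly decisive
```

The corresponding odds are e^lnBF, i.e. of order a few million to one here.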
This quantifies the intuition that a model not guaranteed to fit the data, but which does fit, is preferable to a model that certainly can fit the data because it spans the data space. It quantifies the question of how much worse a quality of fit we should accept for a model that is intuitively more satisfying; here we accept -30 on lnL for a greater gain in the Occam factor.
It quantifies the argument that the Tails model is the most worthy of further investigation, because the fat tails probably have a physical interpretation worth seeking. (In this context, it is interesting that in Fig.3a tails have been added only to the 255, 265 and 299 cm^-1 peaks; adding tails to the others did not improve the fit. A full analysis and interpretation is outside the scope of this paper.) In the Peaks model it is not probable (though possible) that the extra peaks would have physical meaning. In the other two models it is certainly not the case that their coefficients have physical meaning.

Discussion and Conclusions:
The most surprising outcome of Section 4 is the desirability of including in models some parameters that fail significance tests, and reporting the outcomes. This is relevant to the controversy about significance tests such as p-values.
In the story of Mr A and Mr B, the two models are explicitly given equal a priori probabilities. On including the extra parameter, correlations between parameters result in Model B returning fitted values B = (p′i0 ± σ′i, λ0 ± δλ), defining a different posterior parameter volume VB. The uncertainties σ′i will generally be larger than the σpi, and the values p′i0 will generally be different from the pi0. For illustration, suppose that λ0 is non-zero but fails significance tests, being perhaps just 1σ or 2σ away from zero, and that the MLIs come out equal (i.e. the improvement in lnL in Model B is offset by the Occam factor, and lnBF remains at zero). Now to reject λ and to report only the fit to Model A is to assert that the true values pi each have a ⅔ chance of lying within VA, within the 1σ ranges σpi. However, that assertion is conditional on λ actually having the value zero; that is, it is conditional on the truth of the null hypothesis A. And that is a condition that we do not know to be true. The failure of B to attain significance is often mistakenly described as evidence for the null hypothesis A. Amrhein et al. report that around half of a large number of articles surveyed in five major journals make this mistake.31 It is not just a scientific mistake;10 it can be a disastrous guide to action.
According to the Bayes factor, the models A and B have equal probabilities, ½, and so what we know is that the parameters of Model A each have a ⅓ chance of lying within their 1σ ranges σpi around pi0, and a ⅓ chance of lying within the 1σ ranges σ′i around p′i0. In fact, in this situation (and especially if a significant non-zero λ0 would be an exciting result; see Ref. 32 and the discussion below for a current example) the usual reaction to finding that λ0 is 2σ away from zero is to repeat the experiment, to take more data. Of course, that has some chance of finding a λ0 closer to zero, but it also has a good chance of confirming a non-zero λ0. So the Bayes factor is a guide to action; the significance test is not. The issue of bleach and homeopathy is readily dealt with. With an unlimited number of putative actions based on models B′ to consider, their a priori probabilities should be rated as very small, except when there is evidence for them that is rated as not insubstantial.
Then the prior factor Pr(B′)/Pr(A) will outweigh, negatively, the likelihood factor in favour of B′. In conclusion, calculation of Bayes factors should be a routine part of all data fitting.
It gives advice that is the opposite of much standard practice, but which satisfies Occam's Razor.

Competing Interests: The authors declare no competing interests.

Fig.1. (b) The open or small data points from three to seven parameters are for a single peak. The solid or large data points from five to 16 parameters are for two peaks, and from 18 to 20 parameters for three peaks.

Fig.2. Data from Goñi et al.26 (■) and from Perlin et al.27 are shown after subtraction of the straight line E0 + 9.5P to make the curvature more visible. The Perlin data are expanded ×10 on both axes for clarity. Two least-squares fits to the Goñi data are shown, polynomial (dashed red line) and Murnaghan (solid blue line).

Fig.3. Carbon Nanotube Raman Spectrum.
In (a), the carbon nanotube Raman spectrum is plotted (black data points) with a fit (cyan solid line) using the Fourier model. The residuals for four good fits are shown, ×10 and displaced successively downwards (Fourier, Polynomial, Peaks and Tails; see text). In (b), the evolution of the MLIs is shown against the number of parameters for these four models.

Supplementary Files
This is a list of supplementary files associated with this preprint. Click to download: BayesSciRepSuppInfosubmission.docx