Remark on the Swarm data residual distribution

Abstract


Introduction
Recall that appropriate statistical assumptions are typically made about the distribution of the uncertainties σ_i affecting the data used. The statistical properties of these uncertainties, however, are not always well characterized. In such circumstances, assuming that uncertainties follow a Gaussian distribution would a priori make sense, since such a distribution often arises naturally as a consequence of the central limit theorem when errors act in an additive manner (see, e.g., Feller 1971). Relying on this assumption, and provided that s_i is an adequate measure of the error affecting the datum γ_i, standard statistical estimation is then used to infer the model. The normalized residuals (γ_i − γ̂_i)/s_i (where γ̂_i is the datum value predicted by the model) are then expected to follow a standard normal distribution.
Yet, residuals often display a sharper distribution, sometimes much closer to the so-called Laplace distribution (e.g., Jackson et al. 2000; Walker & Jackson 2000; Panovska et al. 2012, 2015). We showed in Khokhlov & Hulot (2017) that residuals may be incorrectly normalized, so that their common statistical distribution is a mixture of Gaussian distributions (Barndorff-Nielsen et al. 1982); this is, generally speaking, not at all new. In particular, we demonstrated in Khokhlov & Hulot (2017) several examples of variability in the determination of σ that indeed leads to the non-Gaussian shape of the histogram. Thus we assume that the observable residual θ is a mixture of individual Gaussian random variables with zero expectation and random variances β². A computational approach (the EM algorithm) can, in principle, recover the distribution of the variances; however, this algorithm is not perfect and is too sensitive to errors in the data. In the present note we argue that the distribution of the random variable β can be well approximated by the lognormal distribution with pdf

g_β(y) = 1/(√(2π) s y) exp(−(ln y − ln σ)² / (2s²)),  y > 0,

for suitable parameters σ and s. We also provide a method that recovers the value of s in the real-data case.
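As a minimal illustration of this claim (our own sketch, not part of the original analysis; the value s = 0.4 is only chosen to be of the order of the estimates obtained later in the note), the following Python snippet shows that Gaussian residuals whose standard deviations are themselves lognormally distributed have a markedly sharper, heavier-tailed distribution than a pure Gaussian:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Pure Gaussian residuals: excess kurtosis close to 0.
gauss = rng.standard_normal(n)

# Mixture: each residual is Gaussian, but its standard deviation
# beta is itself random and lognormally distributed.
# s = 0.4 is an illustrative value, not taken from the data.
s = 0.4
beta = np.exp(s * rng.standard_normal(n))
mixed = beta * rng.standard_normal(n)

def excess_kurtosis(x):
    """Sample excess kurtosis: ~0 for a Gaussian, > 0 for sharper peaks."""
    x = x - x.mean()
    return np.mean(x**4) / np.mean(x**2) ** 2 - 3.0

print(excess_kurtosis(gauss))  # near 0
print(excess_kurtosis(mixed))  # clearly positive: sharper than Gaussian
```

For this model the theoretical excess kurtosis of the mixture is 3(e^{4s²} − 1) ≈ 2.7, which the sample estimate reproduces.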

Mixture model
The informal interpretation. The mixture model is appropriate when the data are inhomogeneous, for instance when they come from several locations such that each region slightly perturbs the assumed data distribution law, i.e. the corresponding distributions differ slightly in their parameters. In practice we often face an even simpler situation: each regional dataset is Gaussian with zero mean, but the corresponding σ-values depend on the region. However, we can rarely select a region with absolutely homogeneous data, so we better simulate this situation by means of sequential small perturbations of an initially homogeneous Gaussian population. Can the limit distribution be described when the intermediate perturbations are very small?

Version of the general formula
If ζ is an arbitrary random variable with density f_ζ, then for fixed y_0 > 0 the ratio ζ/y_0 has density f_ζ(x y_0) y_0 (see, e.g., Feller 1971). Let now the denominator be not fixed but a positive random variable η with density g_η; then the ratio ζ/η has pdf

f_{ζ/η}(x) = ∫_0^∞ f_ζ(x y) y g_η(y) dy.

We may now compare this with the mixture of unbiased Gaussian distributions (i.e. with pdf f_α = N(0, σ²)) obtained by randomizing the standard deviation using a random variable β > 0:

f_θ(x) = ∫_0^∞ (1/β) f_α(x/β) g_β(β) dβ.

Obviously f_θ can be interpreted as the pdf of the ratio θ = α/η with η = 1/β.
For instance, recall the following example of Khokhlov & Hulot (2017): the uniform mixture of unbiased Gaussian distributions with standard deviations varying between 0 and 1, i.e. with mixing pdf g_β(β) = 1 for 0 < β < 1. We may treat that mixture as the ratio of a standard Gaussian α divided by η, the inverse of a uniformly distributed variable, with density g_η(y) = 1/y² for y > 1.
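A minimal numerical check of this identity (our own sketch; seed and sample size are arbitrary choices): drawing the mixture directly and drawing the ratio α/η produce samples with matching quantiles.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000

# Mixture construction: draw sigma ~ U(0, 1), then x ~ N(0, sigma^2).
sigma = rng.uniform(0.0, 1.0, n)
mixture = sigma * rng.standard_normal(n)

# Ratio construction: a standard Gaussian alpha divided by eta = 1/U,
# which is the same as multiplying alpha by an independent U(0, 1) draw.
ratio = rng.standard_normal(n) * rng.uniform(0.0, 1.0, n)

# The two samples should follow the same law: compare a few quantiles.
q = np.array([0.1, 0.25, 0.5, 0.75, 0.9])
print(np.quantile(mixture, q))
print(np.quantile(ratio, q))
```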

Sequential small mixtures
A small multiplicative randomization is described in terms of a random β > 0 whose pdf is concentrated near 1, with spread of order some small ε; we may then write β = e^δ, where the expectation E(δ) ∼ 0 and the variance D(δ) ∼ ε².
For sequential small mixtures (with independent β_i) we then get the ratio θ = σ α ∏_i β_i = σ α exp(∑_{i=1}^m δ_i). But under mild conditions the distribution of ∑_{i=1}^m δ_i rapidly converges to a Gaussian distribution N(a, s²) with a ∼ 0 and s² ∼ ∑_{i=1}^m ε_i²; thus the limit pdf for sequential arbitrary, but small, mixtures can be approximated by

f_θ(x) = ∫_0^∞ (1/(√(2π) σ y)) exp(−x²/(2σ²y²)) · (1/(√(2π) s y)) exp(−(ln y − a)²/(2s²)) dy   (2)

for a = 0 and some suitable parameters s and σ.
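The convergence of the accumulated log-perturbations to a Gaussian (and hence of the product of the β_i to a lognormal) can be sketched numerically; the uniform choice for the δ_i and the values of n, m and ε below are illustrative assumptions only:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 200_000, 50

# m independent small multiplicative perturbations beta_i = exp(delta_i);
# each delta_i here is uniform (so individually non-Gaussian) with
# mean 0 and small variance eps^2.
eps = 0.06
half_width = eps * np.sqrt(3.0)  # uniform on [-a, a] has variance a^2/3
delta = rng.uniform(-half_width, half_width, size=(n, m))

# Log of the product prod_i beta_i = exp(sum_i delta_i).
log_beta = delta.sum(axis=1)

def excess_kurtosis(x):
    x = x - x.mean()
    return np.mean(x**4) / np.mean(x**2) ** 2 - 3.0

s2 = m * eps**2  # predicted variance s^2 = sum_i eps_i^2
print(log_beta.mean())            # ~ 0
print(log_beta.var())             # ~ s2 = 0.18
print(excess_kurtosis(log_beta))  # ~ 0: the sum is close to Gaussian
```

Even though each δ_i is uniform, after m = 50 steps the sum is practically Gaussian with the predicted mean and variance, so the product of the β_i is practically lognormal.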

Real data application
Here we use the same data as in Khokhlov & Hulot (2017): we consider the absolute scalar data acquired by two of the Swarm satellites (Satellites Alpha and Bravo) at quasi-dipole latitudes ranging between +55° and −55°, and compute residuals with respect to the so-called VFM model of Vigneron et al. (2015); for the array ST_1 of one-day standard deviations of the residuals, see Fig. 1, borrowed from that article.
The satellite scalar data (Vigneron et al. 2015) cover a little less than a year (between November 29, 2013 and September 25, 2014) and were further selected following a number of criteria, among which magnetically quiet and night-time conditions, to ensure that as little non-modeled external signal as possible is included in the data. This resulted in 42 160 data for the Alpha satellite and 42 175 for the Bravo satellite. These data can be expected to reflect the signal of the field of internal origin the model aims at describing, any other source of signal being treated as a source of noise acting on top of the very low instrumental and satellite noise (less than 0.3 nT, see Léger et al. 2015; Olsen et al. 2015; Fratter et al. 2016). The datasets used and analysed during the current study are available from the author on reasonable request.

Method and Numerical results
Rescale this array ST_1 as r → r/σ_1 = y, where σ_1 = mean(ST_1) and r ∈ ST_1; by virtue of eq. 2 the array {y_i} is expected to obey the lognormal distribution with parameter s_1, so let us directly calculate σ_1 and s_1. Now repeat all these computations for the arrays ST_0.25, ST_0.5, ST_0.75 (i.e. corresponding to time intervals of 0.25 to 0.75 day); here are the results:

ST_1: Satellite A σ_1 = 2.40, s_1 = 0.36
ST_0.75: Satellite A σ_0.75 = 2.34, s_0.75 = 0.36; Satellite B σ_0.75 = 2.35, s_0.75 = 0.39
ST_0.5: Satellite A σ_0.5 = 2.22, s_0.5 = 0.39; Satellite B σ_0.5 = 2.21, s_0.5 = 0.46
ST_0.25: Satellite A σ_0.25 = 1.99, s_0.25 = 0.43; Satellite B σ_0.25 = 1.96, s_0.25 = 0.49

As often happens, a limited amount of lognormal data cannot provide stable statistical estimates, so what are the "true values" of s and σ? To answer this question let us use the following well-known method of statistical moments of θ, namely:

Eθ² = σ² e^{2s²},   E|θ| = σ √(2/π) e^{s²/2}.

Thus we get the explicit expressions for the unknown parameters:

s² = ln( 2 Eθ² / (π (E|θ|)²) ),   σ² = Eθ² e^{−2s²}.   (3)

In practice we recover from real data the estimates of the moments Eθ², E|θ|, and then get (the estimates of) the unknown parameters.
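A sketch of this moment-based recovery on synthetic data (the generative model θ = σ e^{sδ} α with independent standard normal α and δ, and the "true" parameter values used here, are illustrative assumptions of the order of the values reported above):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000

# Synthetic residuals theta = sigma * exp(s * delta) * alpha with
# independent standard normal alpha and delta (the model behind eq. 2);
# the "true" parameters are illustrative only.
sigma_true, s_true = 2.3, 0.4
theta = (sigma_true * np.exp(s_true * rng.standard_normal(n))
         * rng.standard_normal(n))

# Moment estimates.  Under the model:
#   E theta^2 = sigma^2 exp(2 s^2)
#   E|theta|  = sigma sqrt(2/pi) exp(s^2 / 2)
m2 = np.mean(theta**2)
m1 = np.mean(np.abs(theta))

# Explicit inversion of the two moment relations (formula 3).
s2_hat = np.log(2.0 * m2 / (np.pi * m1**2))
sigma_hat = np.sqrt(m2 * np.exp(-2.0 * s2_hat))

print(np.sqrt(s2_hat), sigma_hat)  # close to (0.4, 2.3)
```

With a million samples the two moments pin down s and σ to within about one percent, which illustrates why the moment method is preferable to fitting the histogram directly.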

Conclusions
Hereby we add quantitative details of the data distribution to the qualitative analysis published in Khokhlov & Hulot (2017): namely, using formula 3, we may now recover estimates of the parameters s and σ (the latter can be treated as an estimate of the "inner precision" of the measurements).

Supplementary Files
This is a list of supplementary files associated with this preprint. Click to download.

GraphAbstract.pdf

Fig. 2 actually confirms that this close-to-Laplacian distribution can indeed be represented as the result of a lognormal mixture according to formula 2.
