Test Set Sizing Via Random Matrix Theory

This paper uses techniques from Random Matrix Theory to find the ideal training-test split for a simple linear regression with m data points, each an independent n-dimensional multivariate Gaussian. It defines "ideal" as satisfying the integrity metric: the empirical model error should equal the actual measurement noise, so that the reported error fairly reflects the model's value (or lack thereof). This paper is the first to solve for the training and test set sizes of any model in a way that is optimal in this sense. The number of data points in the training set is the root of a quartic polynomial derived in Theorem 1, which depends only on m and n; the covariance matrix of the multivariate Gaussian, the true model parameters, and the true measurement noise all drop out of the calculations. The critical mathematical difficulties were recognizing that the problems herein belong to the theory of the Jacobi Ensemble, a probability distribution describing the eigenvalues of a known random matrix model, and evaluating a new integral in the style of Selberg and Aomoto. The mathematical results are supported by thorough computational evidence. This paper is a step toward automatic choices of training and test set sizes in machine learning.


Introduction
The problem of determining a training and test set is almost as old as machine learning, and its importance is clear to any practitioner. Yet the problem of finding the optimal training-test division has never been solved analytically or asymptotically for any model using an entirely correct metric of success and only parameters available to the modeler. Typically, the sizes of training and test sets are determined by gut and tradition. This paper takes a substantial step towards changing this state of affairs. It measures success by how well the test set error matches the actual error in measurement, i.e. the training-test division maximizes integrity. This metric describes how useful a model would be in a real business setting: the extent to which it should be relied upon, or justifiably not relied upon. It should be used for any model for which it is available; a major result of this paper is that, for a linear model, integrity can be maximized without knowledge of the true error in measurement (or of the covariance of the data or the true model parameters). This paper finds the optimal training/test split for a linear regression, assuming the data points are i.i.d. draws from a Gaussian with some covariance and zero mean. This model can be analyzed using Random Matrix Theory, and studying it leads to the surprising result that the number of training points should be O(m^{2/3}) if m ≫ n, where m is the number of points in the data set and n is their dimension. The leading-order term is (2n + n^2)^{1/3} m^{2/3}, although convergence is slow (this assumes n is small and fixed); a better approximation is given in Corollary 8. For m nearer to n, the training set size can be found by solving the simple quartic equation described in Theorem 1, which in the case that m is near n can have surprising results. In the medium-data case, where m is not near n but not enormous, this analysis confirms the conventional wisdom that using approximately half of the data for training is best.
Assuming prior distributions are picked appropriately, the amount of data required for modeling should only decrease, so the O(m^{2/3}) leading-order term can be interpreted as an upper bound. Since it is easy to solve a quartic equation numerically, finding the ideal a priori training, validation, and test set sizes for a given model may soon be possible to do quickly and accurately. Future work will attempt to remove the normality assumptions on the random variables.
This problem has been studied in the literature in both discrete and continuous contexts. Some authors have found more general results, but none are as philosophically appealing as those that maximize the integrity metric. [12] finds that the training set grows as O(m^{2/3}), as do we, using the integrity metric on a simpler problem. [13] and [2] find that the training set should grow as O(m) for different metrics than this paper's, in the former using a linear regression and in the latter using a much more general setup. This paper's contribution is to find a training-test split that results in a loss that is what it theoretically should be. [9] and [10] look at the question of determining the test set size in the case of discrete labels (this paper's are continuous) and try to minimize test-set error. The former also finds an O(m) answer for the test set size. Both versions require tuning parameters that specify accuracy needs a priori; this work improves on theirs by requiring no such parameters and finding a training-test split that provides an error estimate that is, again, what it should be. [11] studies the training-test split question using the VC dimension, also with the intent of minimizing test-set error on a discrete problem, and also finds a very different answer than this paper's, again reliant on a tuning parameter describing the complexity of the target function to be learned.
The three major ensembles, or probability distribution functions, in finite-dimensional Random Matrix Theory are the Hermite, Laguerre, and Jacobi Ensembles, each of which is the eigenvalue distribution of a certain simple random matrix. There are versions of them for random matrices over the reals, complex numbers, and quaternions, and they have been generalized even further; [8] provides a clear introduction. This paper focuses on the real Jacobi Ensemble, since certain known integrals over it (Selberg [16], and Aomoto [4]), together with one new one proven in this paper, provide the traces of random matrices that arise in the forthcoming mathematical development. They are used to prove the following theorem:

Theorem 1 For a plain vanilla linear regression with m data points, each assumed to be an n-dimensional Gaussian with some covariance and mean zero, the number of data points p that should be in the training set satisfies a quartic equation, δ_{m,n}(p) = 0, rounded to the nearest integer, giving p = O(m^{2/3}) for m ≫ n, with leading-order term (2n + n^2)^{1/3} m^{2/3}. p is best in the sense that the expected test set error is as near as possible to the true error in measurement, i.e. the solution has the most integrity. Note that to find p, the true covariance matrix of the Gaussians, the true model parameters, and the true error in measurement do not need to be known.
A more precise statement is given in subsection 2.3.

Background from Random Matrix Theory
Let .
Proposition 2 Selberg's Integral states [16]:

S_n(α, β, γ) = ∫_0^1 · · · ∫_0^1 ∏_{i=1}^n x_i^{α−1} (1 − x_i)^{β−1} ∏_{1≤i<j≤n} |x_i − x_j|^{2γ} dx_1 · · · dx_n = ∏_{j=0}^{n−1} [Γ(α + jγ) Γ(β + jγ) Γ(1 + (j + 1)γ)] / [Γ(α + β + (n + j − 1)γ) Γ(1 + γ)].

These integral identities are relevant to finding the eigenvalue distribution of several ensembles of random matrices.
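Proposition 3, referenced below, appears to be Aomoto's extension of Selberg's integral; its standard statement (with the same weight and normalization, for 1 ≤ k ≤ n) is:

```latex
\frac{1}{S_n(\alpha,\beta,\gamma)}
\int_{[0,1]^n} \Bigl(\prod_{i=1}^{k} x_i\Bigr)
\prod_{i=1}^{n} x_i^{\alpha-1}(1-x_i)^{\beta-1}
\prod_{1\le i<j\le n} |x_i - x_j|^{2\gamma}\, dx_1 \cdots dx_n
= \prod_{j=1}^{k} \frac{\alpha + (n-j)\gamma}{\alpha + \beta + (2n-j-1)\gamma}.
```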

Proposition 4
Let X ∈ R^{p×n} and Y ∈ R^{(m−p)×n} have i.i.d. standard Gaussian entries. The matrix model X^t X (X^t X + Y^t Y)^{−1} has eigenvalue probability distribution function equal to w_{n,α,β,γ}(x)/S_n(α, β, γ), also called the Jacobi Ensemble, where w_{n,α,β,γ}(x) = ∏_{i=1}^n x_i^{α−1}(1 − x_i)^{β−1} ∏_{i<j} |x_i − x_j|^{2γ}, with α = (p − n + 1)/2, β = (m − p − n + 1)/2, and γ = 1/2. (Different values of γ are used for Gaussian random matrices over different algebras, the complex numbers and the quaternions. It is retained to preserve generality; 2γ is sometimes called β in the literature, and is the dimension of the algebra.) X^t X (X^t X + Y^t Y)^{−1} is the matrix model for the Jacobi Ensemble. [8] provides a clear introduction to the Jacobi Ensemble and its matrix models in context.
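A minimal numerical check of this matrix model, assuming (as in the proposition as read here) that X is p × n and Y is (m − p) × n with standard Gaussian entries. The predicted mean eigenvalue p/m follows from the mean of the matrix-variate Beta distribution, and agrees with Aomoto's k = 1 moment under the (α, β) identification above:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, p = 40, 3, 15        # illustrative sizes: m total points, n dimensions, p training rows
trials = 2000

# Monte Carlo estimate of the average eigenvalue of X^t X (X^t X + Y^t Y)^{-1}.
total = 0.0
for _ in range(trials):
    X = rng.standard_normal((p, n))
    Y = rng.standard_normal((m - p, n))
    A = X.T @ X
    J = A @ np.linalg.inv(A + Y.T @ Y)
    total += np.trace(J) / n   # trace/n = mean eigenvalue; eigenvalues lie in (0, 1)
mean_eig = total / trials

print(mean_eig)   # should land near p/m
```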
Next, Aomoto's result is generalized to calculate more moments of the Jacobi Ensemble. Following the logic of Aomoto's proof as described in [3], substituting a = 1 yields one identity; similarly, substituting a = 2 yields another. Consider the rightmost averages in these two equations. The latter changes sign if the indices 1 and j are swapped, so it is zero; the former is obtained by factorization and cancelling. Subtracting the a = 2 equation from the a = 1 equation, and then applying Propositions 2 and 3, proves Theorem 5. Furthermore, by Propositions 2 and 3, Lemma 6 evaluates the expected negative moment x^{−1}.
Notice that Theorem 5 and Lemma 6 above can be used to compute expected values of negative moments over the eigenvalues of the Jacobi Ensemble, where the conversion between (α, β) and (m, n, p) is given in the description of the Jacobi distribution in Proposition 3 above.

A few simple identities
Let M be any a × b matrix, let S be any a × a positive definite symmetric matrix, and let e and f be vectors of i.i.d. standard Gaussians of lengths a and b respectively. Then E[e^t S e] = trace(S), E[(e^t S e)^2] = trace(S)^2 + 2 · trace(S^2), and E[(e^t M f)(f^t M^t e)] = trace(M^t M).
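These three Gaussian trace identities are easy to confirm by Monte Carlo; a quick sketch with illustrative sizes a = 4, b = 3:

```python
import numpy as np

rng = np.random.default_rng(1)
a, b = 4, 3
M = rng.standard_normal((a, b))
B = rng.standard_normal((a, a))
S = B @ B.T + np.eye(a)            # positive definite symmetric

trials = 200_000
e = rng.standard_normal((trials, a))
f = rng.standard_normal((trials, b))

q = np.einsum('ti,ij,tj->t', e, S, e)          # e^t S e, one value per trial
mean_q = q.mean()                               # ~ trace(S)
mean_q2 = (q ** 2).mean()                       # ~ trace(S)^2 + 2 trace(S^2)
cross = np.einsum('ti,ij,tj->t', e, M, f) ** 2  # (e^t M f)(f^t M^t e)
mean_cross = cross.mean()                       # ~ trace(M^t M)

print(mean_q, np.trace(S))
print(mean_q2, np.trace(S) ** 2 + 2 * np.trace(S @ S))
print(mean_cross, np.trace(M.T @ M))
```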

Primary results
We restate our main theorem.
Theorem 1 Let X and ε be i.i.d. standard Gaussians in R^{m×n} and R^m, let b be in R^n, let Σ be positive definite in R^{n×n} with Cholesky decomposition R^t R, and let y = XRb + σε, so the rows of XR have covariance Σ. Let b̂ = arg min_{b_p} ||X_{1:p,:} R b_p − y_{1:p}|| (all norms in this paper are Euclidean). Then p satisfies the quartic equation δ_{m,n}(p) = 0 as nearly as possible for an integer, giving p = O(m^{2/3}) with leading-order term (2n + n^2)^{1/3} m^{2/3} for m ≫ n.
Proof b̂ = (R^t X^t_{1:p,:} X_{1:p,:} R)^{−1} R^t X^t_{1:p,:} y_{1:p} = (R^t X^t_{1:p,:} X_{1:p,:} R)^{−1} R^t X^t_{1:p,:} (X_{1:p,:} R b + σε_{1:p}), and y_{p+1:m} = X_{p+1:m,:} R b + σε_{p+1:m}. So, plugging these expressions for b̂ and y_{p+1:m} into the expression in Theorem 1, the task is to find the arg min_p of the expected value of the norm of X_{p+1:m,:} R b + σε_{p+1:m} − X_{p+1:m,:} R (R^t X^t_{1:p,:} X_{1:p,:} R)^{−1} R^t X^t_{1:p,:} (X_{1:p,:} R b + σε_{1:p}). After cancellation within the norm (without expanding it), this task becomes finding the arg min_p of the expected value of an expression in ε_{1:p} and ε_{p+1:m} alone, where A = X_{p+1:m,:} (X^t_{1:p,:} X_{1:p,:})^{−1} X^t_{1:p,:}. Notice that b and R cancel out, σ factors out, and they all become irrelevant. Expanding, and using Lemma 7, the problem reduces to finding the arg min_p of the expected value of:
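The cancellation step can be made explicit. Writing N = X_{1:p,:}^t X_{1:p,:} and using R(R^t N R)^{-1} R^t = N^{-1} for invertible R, the test residual collapses to a pure noise expression (this reconstruction matches the definition of A in the proof):

```latex
\begin{aligned}
y_{p+1:m} - X_{p+1:m,:}R\hat{b}
  &= X_{p+1:m,:}Rb + \sigma\epsilon_{p+1:m}
     - X_{p+1:m,:}R\,(R^{t}NR)^{-1}R^{t}X_{1:p,:}^{t}
       \bigl(X_{1:p,:}Rb + \sigma\epsilon_{1:p}\bigr) \\
  &= \sigma\epsilon_{p+1:m}
     - \sigma\,X_{p+1:m,:}N^{-1}X_{1:p,:}^{t}\,\epsilon_{1:p}
   = \sigma\bigl(\epsilon_{p+1:m} - A\,\epsilon_{1:p}\bigr),
\end{aligned}
```

with A = X_{p+1:m,:} N^{-1} X_{1:p,:}^t; b, R, and (after factoring) σ have visibly dropped out.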
Here the relevant matrix is C = X^t_{1:p,:} X_{1:p,:} (X^t X)^{−1}, whose eigenvalues follow the Jacobi Ensemble with α = (p − n + 1)/2, β = (m − p − n + 1)/2, and γ = 1/2. Let x_1, . . . , x_n be the eigenvalues of C. By Proposition 4, the next step is to find the arg min_p of the expected value of an expression in the negative moments of the x_i. Using Theorem 5 and Lemma 6 above, which compute expected values of negative moments over the Jacobi Ensemble, it remains to minimize over p the resulting rational function of m, n, and p. Differentiating with respect to p and setting the derivative equal to zero gives δ_{m,n}(p) = 0.
The order of p as a function of m is found by the method of dominant balance. First consider p = O(m^q) where q ∈ (0, 1]; the p = O(m^0) case is handled later. (Also, solving the equation δ_{m,n}(p) = 0 analytically in Mathematica yields a complicated, unsimplified expression with no logarithmic terms; this is possible because the equation is quartic in p.) Assume q = 1, so p ~ cm. Then 0 = lim_{m→∞} δ_{m,n}(p)/m^4 = c^4, so the coefficient c in the O(m) term would be zero, a contradiction. So for now it is assumed that q ∈ (0, 1). Then the only terms that matter in δ_{m,n}(p) are p^4 and −m^2 n(2 + n)p; taking lim_{m→∞} for the various values of q makes all the other terms vanish while preserving these as potentially nonzero. Setting the sum of these two terms equal to zero and ignoring the two complex roots gives p = O(m^{2/3}) with leading-order term (2n + n^2)^{1/3} m^{2/3}, which checks out since p^4 and m^2 n(2 + n)p then have the same order, 8/3.
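The dominant-balance step can be checked numerically. Note this solves only the two-term balance p^4 − m^2 n(2 + n)p = 0 named in the text, not the full quartic δ_{m,n}(p) = 0:

```python
import numpy as np

n = 5
results = []
for m in (10**3, 10**5, 10**7):
    # Roots of p^4 - m^2 * n * (2 + n) * p = 0: zero, one positive real, two complex.
    roots = np.roots([1, 0, 0, -m**2 * n * (2 + n), 0])
    p_balance = max(r.real for r in roots
                    if abs(r.imag) < 1e-6 * max(1.0, abs(r)) and r.real > 0)
    p_closed = (n * (2 + n)) ** (1 / 3) * m ** (2 / 3)   # leading-order term from the text
    results.append((m, p_balance, p_closed))
    print(m, p_balance, p_closed)
```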
Equation (1) is convex for p > n + 3. This means that any O(m^0) root of δ_{m,n}(p) would have to be a maximum of (1), since the other three roots are accounted for. Note that (1) simplifies to a ratio whose numerator is a quadratic in p and whose denominator is (m − p)(p − n − 3)(p − n − 1). The numerator is convex in p, since it is a quadratic and the coefficient of p^2 is positive; it suffices to show that the reciprocal of the denominator is convex as well. The factors 1/(m − p), 1/(p − n − 3), and 1/(p − n − 1) are each positive and convex (indeed log-convex) in p for n + 3 < p < m, so their product is log-convex and hence convex.
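The convexity of the reciprocal of the denominator is easy to confirm numerically via second differences on a uniform grid (illustrative m = 40, n = 5):

```python
import numpy as np

m, n = 40, 5
# Reciprocal of the denominator (m - p)(p - n - 3)(p - n - 1) on n + 3 < p < m.
p = np.linspace(n + 3 + 0.5, m - 0.5, 200)
g = 1.0 / ((m - p) * (p - n - 3) * (p - n - 1))

# On a uniform grid, strictly positive second differences indicate strict convexity.
second_diff = g[:-2] - 2 * g[1:-1] + g[2:]
print(second_diff.min())
```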
Proof By Mathematica: first, use it to solve the quartic to obtain p*; then subtract some of the terms in this sequence from it and find the limit as m → ∞ of the difference times an integer power of m^{1/3}.

Remark 1
The leading-order term (n(2 + n))^{1/3} m^{2/3} in p*(m, n), for fixed n and large m, converges slowly: it is not within 1% of the expression for p*(m, n) above, without the lower-order terms, until m is twenty-seven million for n equal to ten. However, the expansion in Corollary 8 converges quickly, as is seen in Fig. 3 and Fig. 4.

Computational evidence
The plots Fig. 1 and Fig. 2 show the optimal p*(m, n), where δ_{m,n}(p*(m, n)) = 0, for a given m, with n = 5 and σ = 1 in Fig. 1 and n = 10 and σ = 0.1 in Fig. 2, obtained by solving the quartic equation δ_{m,n}(p) = 0. Doing so for different values of m creates a continuous curve, p as a function of m, in Fig. 1 and Fig. 2. The x's that lie nearly on the curve are computed by simulation: first, for a given m, n, and σ, pick b and Σ randomly. Then do the following a million times: pick X and ε as i.i.d. standard Gaussians in R^{m×n} and R^m and compute the loss criterion for every value of p. Find the value of p that minimizes it, and plot it as an "x" over its corresponding value of m (n is fixed for any given figure), which lands near the curve described above. Note that the figures are approximately lines. These simulations verify Theorem 1.
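A scaled-down sketch of this simulation (far fewer trials and a smaller m than the figures use; the quantity minimized is taken here to be the mean squared deviation of the per-point test error from σ², one concrete reading of the integrity criterion, since the exact loss display did not survive extraction):

```python
import numpy as np

rng = np.random.default_rng(7)
m, n, sigma, trials = 60, 5, 1.0, 300

# Fixed "true" parameters, chosen randomly once as in the text.
b = rng.standard_normal(n)
G = rng.standard_normal((n, n))
R = np.linalg.cholesky(G @ G.T + np.eye(n)).T   # Sigma = R^t R, positive definite

ps = np.arange(n + 4, m - 1)
dev = np.zeros(len(ps))
for _ in range(trials):
    X = rng.standard_normal((m, n))
    eps = rng.standard_normal(m)
    y = X @ R @ b + sigma * eps
    for i, p in enumerate(ps):
        bhat, *_ = np.linalg.lstsq(X[:p] @ R, y[:p], rcond=None)
        resid = y[p:] - X[p:] @ R @ bhat
        test_err = resid @ resid / (m - p)       # per-point empirical test error
        dev[i] += (test_err - sigma**2) ** 2     # squared deviation from the truth
dev /= trials

p_best = int(ps[np.argmin(dev)])                 # simulated optimum, analogous to one "x"
print(p_best)
```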
The plots Fig. 3 and Fig. 4 show the log-ratio of p computed from δ_{m,n}(p) = 0 to p approximated by Corollary 8, with the same parameters as before. Convergence is fast.

Analysis of the method on real data
Given detrended real data X and labels y (in both, the averages of each column are removed), the optimal p and the loss for dividing the data into X_{1:p,:} and X_{p+1:m,:} for training and testing are found. Also found are the losses for the standard values p = (1/2)m and p = (3/4)m. Each loss is calculated by finding b̂ = arg min_{b_p} ||X_{1:p,:} b_p − y_{1:p}|| and then computing the test-set loss for ten thousand permutations of the rows of X and y, and taking the average. If there is more than one variable to predict, y is set to be the first one and the others are ignored. Also ignored are columns with date information and rows with NaNs; columns with missing values indicated by "?" are interpolated as the average value of the column. X and y come from the data sets [7], [6], [5], [1], [14], and [15], which are all available at https://archive.ics.uci.edu/ml/datasets.php. Table 1 reports the average loss on each data set over ten thousand permutations of the rows of X and y for p = (1/2)m, p = (3/4)m, and p = p*(m, n) as defined above, and also gives p*(m, n)/m.
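A sketch of the permutation-averaged loss computation. Synthetic stand-in data replaces the UCI files, which are not bundled here; the per-point squared test loss is an assumption, as the exact loss display was lost in extraction:

```python
import numpy as np

def permutation_loss(X, y, p, n_perm=200, rng=None):
    """Average test loss over random row permutations, training on the first p
    rows and testing on the rest (a scaled-down version of the ten-thousand-
    permutation procedure described in the text)."""
    if rng is None:
        rng = np.random.default_rng(0)
    m = X.shape[0]
    # Detrend: remove column means from X and the mean of y.
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    total = 0.0
    for _ in range(n_perm):
        idx = rng.permutation(m)
        Xp, yp = Xc[idx], yc[idx]
        bhat, *_ = np.linalg.lstsq(Xp[:p], yp[:p], rcond=None)
        resid = yp[p:] - Xp[p:] @ bhat
        total += resid @ resid / (m - p)   # assumed per-point squared test loss
    return total / n_perm

# Synthetic stand-in for a detrended UCI data set.
rng = np.random.default_rng(3)
m, n = 120, 6
X = rng.standard_normal((m, n))
y = X @ rng.standard_normal(n) + 0.5 * rng.standard_normal(m)
loss_half = permutation_loss(X, y, p=m // 2, rng=rng)
print(loss_half)
```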
First observe that the training set sizes are very small in the big-data case (the first two columns), very large in the small-data case, where m is near n (the last two columns), and in the realm one would expect in the case in between (the third and fourth columns). In the small-data case the loss can be far off if p is chosen poorly.

Discussion
There may soon come a time when machine learning practitioners, faced with m data points of dimension n, will plug (m, n) into an algorithm that suggests a good size for the training, validation, and test sets for their models in a matter of seconds, instead of relying on gut and hearsay. A natural generalization of this paper's results would be to linear models with more complex features. It would be interesting if O(m^{2/3})-sized training sets turned out to be a general law for large data sets.

Lemma 7 E[e^t S e] = trace(S), E[(e^t S e)^2] = trace(S)^2 + 2 · trace(S^2), and E[(e^t M f)(f^t M^t e)] = trace(M^t M).

Proof E[e^t S e] = Σ_{i=1}^a Σ_{j=1}^a E[e_i e_j] S_{i,j} = Σ_{i=1}^a E[e_i^2] S_{i,i} = trace(S). For the second identity, E[(e^t S e)^2] = Σ_{i,j,k,l} E[e_i e_j e_k e_l] S_{i,j} S_{k,l}; since E[e_i^4] = 3 and E[e_i^2 e_j^2] = 1 for i ≠ j, this sum equals trace(S)^2 + 2 · trace(S^2). For the third, E[(e^t M f)(f^t M^t e)] = Σ_{i,j,k,l} E[e_i e_k] E[f_j f_l] M_{i,j} M_{k,l} = Σ_{i,j} M_{i,j}^2 = trace(M^t M).


Table 1
This table shows the empirical losses on several data sets for given training-test divisions.The last two rows show the ratio used for training and the size of the data set.