A Natural Representation of Functions for Exact Learning

We present a collection of mathematical tools and emphasise a fundamental representation of analytic functions. Connecting these concepts leads to a framework for 'exact learning', where an unknown numeric distribution could in principle be assigned an exact mathematical description. This is a new perspective on machine learning with potential applications in all domains of the mathematical sciences, and the generalised representations presented here have not yet been widely considered in the context of machine learning and data analysis. The moments of a multivariate function or distribution are extracted using a Mellin transform, and the generalised form of the coefficients is trained assuming a highly generalised Mellin-Barnes integral representation. The fit functions use many fewer parameters than contemporary machine learning methods, and any implementation that connects these concepts successfully will likely carry across to non-exact problems and provide approximate solutions. We compare the equations for the exact learning method with those for a neural network, which leads to a new perspective on understanding what a neural network may be learning and how to interpret the parameters of those networks.


Introduction
Machine learning (ML) is, simplistically speaking, a form of function fitting, and ML methods often try to recreate a set of training observations by fitting a function to, or 'learning from', example data points. For a given problem, the most appropriate ML method to use will depend on the application. In general, great success has been seen using this kind of methodology on a range of problems across all domains of science and industry. Most current ML methods recreate the input data by an approximation or interpolation of a restricted sample of training points. Any learning that happens by these methods is then also in some sense restricted, or limited to the domain of the training set or the conditions in which the data were collected.
The form chosen to fit the data (the model) is usually selected for convenience, convergence properties, speed, or simplicity and interpretability, and the coefficients/weights/parameters learned are likely to be somewhat arbitrary, not necessarily represent anything fundamental, and become much harder to interpret as the method gets increasingly complex [1]. In order to reach these high precision approximations, many parameters are used, sometimes millions or billions, especially in the case of neural networks and deep learning methods [2].
The trained model from the above methods is unlikely to be 'physically meaningful', by which we mean unlikely to lead to transparent and understandable insight that can be condensed down into a human readable form. The outputs of the model can still be very useful, for example to interpolate between known physical results [3], but the model itself is not necessarily fundamental. This makes it hard for current ML methods to assist in the collection of fundamental things, or 'natural facts', specifically mathematical facts.
True learning, in a mathematical sense, is timeless. If an exact solution to a problem exists, for example a solution to a particular differential equation, this solution is a permanent solution which can be defined, collected, and documented in a human readable form (subject to the language required to document the function existing). If the same equation is written years in the future the same solution will apply. Learning such a fact correctly would then be an example of 'exact learning'. This notion is more relevant to problems from mathematics and physics where there is an unchanging 'ground truth' defined by nature, for example laws of physics.
We will consider how, in a broad sense, 'machine-learning-like techniques' (i.e. advanced function fitting) can help with these 'exact learning' type problems. Specifically for this work, the problem description is: A) 'Given a high precision numeric output of an unknown multivariate function or distribution, write the human readable equation for that function as output, where it is possible to do so.' The concession we are willing to take on this quite general goal is: B) 'Given a high precision numeric output of an unknown multivariate distribution, write the unique fingerprint of the distribution.'
The key change in goal B) is the introduction of a 'fingerprint' of an arbitrary function as a well behaved and meaningful intermediate that can identify a distribution. The word fingerprint is local to this text and should not be considered a general term. The goals A) and B) can be seen as reverse engineering processes for functions. It might be helpful to first consider the analogy of reverse engineering for a simpler object such as a number, something that is performed by tools such as Simon Plouffe's 'inverse symbolic calculator' [4] and other 'inverse equation solvers' [5]. In these tools a user can enter a high precision number, e.g. 14.687633495788080676, and the algorithm would return a plausible closed form solution, e.g. π^2 + e√π.
Once the user has the solution, the hope is that a human expert could use that extra information to come up with further insights which would assist in the scientific understanding of both the solution and the problem. For a list of problems that might be interesting to solve using this method, please refer to the supplementary information.
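As a concrete illustration of this number-level analogue, the mpmath library ships a small inverse symbolic search. A minimal sketch, assuming the list of candidate constants is supplied by the user for illustration:

```python
# A minimal sketch of inverse symbolic search on a number using mpmath;
# the candidate constant list is an assumption supplied for illustration.
from mpmath import mp, mpf, identify

mp.dps = 20  # working precision in decimal digits

x = mpf('14.687633495788080676')

# Search for a rational-coefficient combination of the supplied constants;
# identify returns a string expression, or None if nothing is found.
print(identify(x, ['pi**2', 'e*sqrt(pi)']))  # expect (pi**2 + e*sqrt(pi))
```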

Brief Disclaimer
This is intended to be a foundational paper that sparks an idea and lays the concepts on the table, but stops short of connecting them together. This avoids invoking (an explanation of) the technical requirements that are needed for an implementation. Many of the tools described here are treated in a highly rigorous mathematical way in the literature. While this is necessary for the correctness of the underlying mathematics, we feel it reduces the accessibility of the concepts, innovation, and creativity. We will introduce the necessary fundamentals with a high level overview and point to more technical references where possible.
There are a number of limitations and potential 'engineering problems' that may need to be overcome when training networks that are constructed using the concepts introduced below. For a further discussion of limitations please see the supplementary information. Further work is undoubtedly required to overcome these issues and some of these are discussed with the conclusions. The key message of this work is: 'Let us turn our attention to a special hierarchy of functions and associated methods while keeping machine learning in mind. What algorithms can be developed from this?'.

Figure 1: An example of the exact learning process, in this case with the kernel of the massless Feynman bubble calculation given by Gonzalez et al. [6]. (Top left) A high precision numerical function is found as the solution to a problem that likely has an analytic solution, but the closed form is hard to derive mathematically. A numeric Mellin transform is made from the 'function domain' (yellow) into the 'fingerprint domain' (blue) for a number of chosen exponent vectors s = (s_1, · · · , s_D). A highly generalised closed form for the fingerprint is chosen and then machine learning (function fitting) is used to track down the parameters of the function fingerprint (a, b, c, d). With further analysis and interpretation of the solved parameters, the fingerprint could be used to derive the closed form for the function that generated the original numeric data using the generalised Ramanujan master theorem (GRMT).

Background and Method
We will go through the background theory and arrangement of concepts required to exactly learn functional forms from high-quality numeric data. As with other types of machine learning, fitting and optimisation problems, it is convenient to define a loss function that should be minimised to solve the problem [7]. If this loss function takes a standard form, there are many available minimisation methods, for example using the gradient of the loss function with respect to network parameters [8]. The training process for a method built using the loss function would then resemble backpropagation for a (deep) neural network [9,2], and the completed algorithm is more likely to be implementable with existing software packages such as TensorFlow [10] and PyTorch [11]. Other learning strategies may prove more efficient or robust [12].
For the time being, this conceptual loss function will be the squared difference between the 'experimental fingerprint' of the numerical solution, after the transformation into fingerprint space, and the 'theoretical fingerprint', i.e. the model we are trying to fit to the data. We then answer the questions: 'What is the fingerprint of a function?' and 'How do we measure this fingerprint both experimentally and theoretically?'.
The fingerprint we choose in this work is the so-called Mellin transform of a function or distribution [13],

ϕ(s) = M[f](s) = ∫_0^∞ x^{s−1} f(x) dx.   (1)

To understand the rationale behind this choice we first explain a pattern in the generalised representation of analytic functions which we believe to be fundamental.
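The transform in equation 1 is easy to evaluate numerically for a first orientation. A minimal sketch, using the textbook fact that the fingerprint of e^{−x} is Γ(s):

```python
# Numeric Mellin transform phi(s) = int_0^inf x^(s-1) f(x) dx;
# for f(x) = exp(-x) the fingerprint should be exactly Gamma(s).
import numpy as np
from scipy.integrate import quad
from scipy.special import gamma

def mellin(f, s):
    return quad(lambda x: x**(s - 1.0) * f(x), 0.0, np.inf)[0]

f = lambda x: np.exp(-x)
for s in (1.0, 1.5, 2.5):
    print(s, mellin(f, s), gamma(s))  # the last two columns should agree
```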

Nature's Language for Analytic Functions
Many different types of mathematical function are used across the mathematical sciences. Because the language of mathematics is an ad hoc set of notation which has evolved with time, it can be hard to keep a sense of order on named functions. Many people will be familiar with the functions exp(x) and log(x), or trigonometric functions such as sin(x). In engineering and physics additional special functions are developed for convenience, e.g. Bessel functions, which relate to cylindrical harmonics, along with various special (orthogonal) polynomials, and these functions and identities appear in large catalogues [14].
The subset of analytic functions have power series expansions which allow easier comparison between functions. Many of these series expansions can be represented in terms of so-called hypergeometric series.
A core component of this description is the gamma function, the continuous analogue of the factorial function n! [13,14],

Γ(s) = ∫_0^∞ t^{s−1} e^{−t} dt.

The gamma function will be pivotal for our definition of a function's fingerprint. We also note that s should generally be considered a complex number.
Functions that can be described by a hypergeometric series can be represented as a limiting case of the so-called hypergeometric function, which can be written in terms of ratios of gamma functions. We can write the hypergeometric function (with a negative argument) as

pFq(a_1, · · · , a_p; b_1, · · · , b_q; −x) = Σ_{k=0}^∞ [ ∏_{i=1}^p Γ(a_i + k)/Γ(a_i) ] [ ∏_{j=1}^q Γ(b_j)/Γ(b_j + k) ] (−x)^k / k!,   (2)

where the notable component is the collection of gamma functions in the series. In this work we define the fingerprint of the hypergeometric function as the so-called Mellin transform [13] of the function, which is related to the series coefficients in equation 2,

M[pFq(a; b; −x)](s) = Γ(s) ∏_{i=1}^p Γ(a_i − s)/Γ(a_i) ∏_{j=1}^q Γ(b_j)/Γ(b_j − s).   (3)

For a large set of analytic functions used in science and mathematics, the fingerprint is described as a product of gamma functions and their reciprocals.
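The gamma-product pattern in equation 3 can be checked numerically. A minimal sketch for the confluent case p = q = 1 (mpmath's hyp1f1), where the fingerprint reduces to Γ(s)Γ(a − s)Γ(b)/(Γ(a)Γ(b − s)) on the strip 0 < Re s < a:

```python
# Check the gamma-product fingerprint of 1F1(a; b; -x) on 0 < Re s < a.
from mpmath import mp, quad, hyp1f1, gamma, inf

mp.dps = 25
a, b, s = 2.0, 3.0, 0.75

numeric = quad(lambda x: x**(s - 1) * hyp1f1(a, b, -x), [0, inf])
closed = gamma(s) * gamma(a - s) * gamma(b) / (gamma(a) * gamma(b - s))
print(numeric, closed)  # should agree to high precision
```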
This scratches the surface of the fundamental pattern we will exploit to allow machine learning methods to adapt to unknown functions. For more examples of the Mellin transform and for a list of example functions from science and mathematics that can be described by hypergeometric series and increasingly generalised series see the supplementary information.
Many analytic functions cannot be described by the simple hypergeometric series in equation 2 [15].
There are further extensions to the series definition which begin to describe these additional functions. This process of generalisation has been iterated numerous times over the course of history [16,17,18,15], each time capturing successively more of the functions which are not described by the previous iterations. It is this neat and hierarchical way of organising this huge space of functions which we will use as a language for 'exact learning'. The latest and most complex iterations have provided functions which can describe complex phenomena from physics, such as the free energy of a Gaussian model of phase transition [15]. We will briefly describe the language used to define these highly generalised functions.

A Language to Define Analytic Functions
The Mellin transform converts a function into its fingerprint and the inverse Mellin transform converts the fingerprint back into the function. The inverse Mellin transform is defined by a contour integral that uses the residue theorem to recreate the series expansion of the function [13,19]. The more generalised functions considered in this work are defined in terms of this contour integral [15,20,21], which is often called a Mellin-Barnes integral [19]. For the hypergeometric function (equation 2) a corresponding definition through a Mellin-Barnes integral is

pFq(a; b; −x) = (1/2πi) ∫_L Γ(−s) ∏_{i=1}^p Γ(a_i + s)/Γ(a_i) ∏_{j=1}^q Γ(b_j)/Γ(b_j + s) x^s ds.   (4)

There are technical reasons for this choice of language over a simpler series representation. For example, a different choice of contour can lead to different representations of a series [15], and other methods to compute Mellin transforms find multiple ways of getting to the same answer [6]. We will not dive into the technical details, but present equation 4 to observe that the contents of this Mellin-Barnes integral is simply the fingerprint, with a change of sign s → −s.

A Fundamental Duality
The duality between a function and its Mellin transform has proved useful in numerous advanced mathematical applications including quantum field theory [6], cosmology [22], and string theory [23]. Schwinger and Feynman parametrisations for the calculation of loop integrals from Feynman diagrams make heavy use of Mellin transforms [24], and the resulting 'Mellin space' has been viewed as a 'natural language' for attempts to align quantum gravity and quantum field theory in AdS/CFT methods [25]. It is possible to convert the regular Fourier duality of quantum mechanics (position and momentum, and time and frequency, form conjugate pairs of variables) into a 'Mellin duality', which leads to wavefunctions associated with the zeta functions commonly seen in number theory [26]. Aspects of particle physics have also previously been noted in the analytically continued coefficient space of zeta type functions [27]. It is therefore not out of the question that the Mellin transform representation of the 'fingerprint' is also a natural choice for exact learning. This representation is the frame in which functions are defined by their zeroes and poles, in the same way that many properties of prime numbers are defined by the zeroes of the Riemann zeta function.
There is a simple relationship between the Fourier transform F, the (two-sided) Laplace transform L, and the Mellin transform M: under the change of variables x = e^{−t} the Mellin transform becomes a two-sided Laplace transform, M[f](s) = L[f(e^{−t})](s), and the Fourier transform is the Laplace transform evaluated on the imaginary axis, F[g](ω) = L[g](iω). Dualities in these other two transform spaces have long been known to statistics, including the relationship between the characteristic function and the distribution in terms of a Fourier transform [28], and the moment generating function and a distribution in terms of the Laplace transform [29]. However, in terms of fingerprint equivalents, Laplace and Fourier transforms do not display the regularity of the Mellin transform in terms of patterns of gamma functions alluded to in the above sections. This can be seen by reviewing extensive tables of each type of transform [30]. Many of the identities for Fourier transforms relate to trigonometric functions and Bessel functions, and many of the Laplace transform identities relate to algebraic expressions and exponential terms. Both Fourier and Laplace transforms make extensive use of special functions such as error functions, exponential integrals and more complicated, niche special functions.
On the other hand, the Mellin transform identities mostly revolve around gamma functions and polygamma functions, hypergeometric functions, and terms such as beta functions and other shorthands which can themselves be expressed again as gamma functions, e.g. π csc(πs) = Γ(s)Γ(1 − s). Loosely speaking, these 'elements' that form the Mellin transforms all have something to do with gamma functions.
The Mellin transform connects a distribution to its moments, and learning using moments has been considered before, including the generalised method of moments [31]. These methods do not consider an underlying structure to the moment space, only that estimators of statistical quantities phrased in terms of moments can be advantageous. In general, integral transforms link strongly to inner products in vector spaces, and further work would be required to establish a deep connection between 'exact learning' as presented here and well studied mathematical formalisms such as reproducing kernel Hilbert spaces [32], specifically those with a monomial kernel x^s as seen in the Mellin transform. Further, much deeper mathematical connections are apparent, including the generalisation of the Mellin transform in terms of a Gelfand transform, and the analogies between hypergeometric functions and 'basic hypergeometric', 'q-hypergeometric' and elliptic variants [33].
To summarise, the Mellin transform can mediate a duality between a distribution and its moments, or an analytic function and its coefficients, in the same way that a Fourier transform can mediate a duality between time and frequency, or position and momentum. In terms of function fitting, an analytic function represented as a power series has infinitely many coefficients to fit; however, by applying the Mellin transform the coefficients are not trained individually, but the whole set of coefficients is trained simultaneously.

Generalisation of Generalised Functions
For the generalised functions beyond the hypergeometric function, the product of gamma functions that forms the fingerprint begins to have numerous terms and it is helpful to define a concise notation for a 'product gamma' operation, Ξ[·], which flattens a vector or matrix and takes a product of the gamma function over the elements. The use of this custom operation will make the pattern between the functions much easier to see. Table 1 shows how the operation works for vector and matrix arguments and includes an optional vector v (or matrix V) of exponents of equal size to realise 'products of powers of gamma functions'. This can be easily extended for any array or tensor of higher dimensions.
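A minimal sketch of the Ξ[·] operation in Python, assuming numpy conventions for flattening; the helper name xi is illustrative:

```python
# Product-gamma operation Xi[.]: flatten, apply gamma elementwise, reduce
# with a product; an optional exponent array V of equal shape realises
# products of powers of gamma functions.
import numpy as np
from scipy.special import gamma

def xi(A, V=None):
    g = gamma(np.ravel(np.asarray(A, dtype=complex)))
    if V is None:
        return np.prod(g)
    return np.prod(g**np.ravel(np.asarray(V)))

print(xi([0.5, 1.5]))               # Gamma(1/2)*Gamma(3/2) = pi/2
print(xi([[0.5, 1.5]], [[2, -1]]))  # Gamma(1/2)^2 / Gamma(3/2)
```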
The hierarchy of the highly generalised series that go beyond the simple hypergeometric series is presented in table 2, with the function name and notation along with the associated Mellin transform (i.e. the fingerprint). In order to make the similarities clear, as well as using the compact Ξ notation, we have slightly altered the notation for input parameters on some of the functions. In all cases beyond the hypergeometric function the inputs are vector quantities and the scalar indices p, q, m, n denote the lengths of input vectors according to the function definition. The progression in terms of complexity, and therefore flexibility to represent more functions, advances down the table. We start with the hypergeometric function [14,13] and its generalisation [14], the Fox-Wright series and its normalised counterpart [16], the highly flexible Meijer-G function [30] and its analogous extension the Fox-H function [17]. We include two more recent, extremely general extensions, the Inayat-Hussain-H [18] and the Rathie-I function [15]. We note that Rathie has also extended to even more generalised functions [20], which are briefly discussed in section 'Analogy to Deep Networks'. Table 2 is not necessarily exhaustive; for example, functions such as the MacRobert-E function [34], which bridge the gap between hypergeometric and Meijer-G type functions, are not represented. The most general functions come from adding vector power parameters i, j, k, l, which begin to describe non-trivial physical and number theoretic functions [15].
All of the functions in table 2 assume an analytic series expansion. For certain sets of parameters, the series expansions for many commonly used functions in physics and mathematics can arise. A small list is given in the supplementary information. Many of these functions have constraints on the combinations of arguments that can be used. For the sake of focus and scope in this text we will not mention any regions of convergence or the necessary analytic continuations in the following sections, but these should be considered carefully when such equations are implemented numerically and solved fingerprints are interpreted. A careful treatment is given by the following references [35,23,36] as well as the reference for each function in table 2.
Work has already been done to harness the flexibility of this relationship to cover a wide space of functional forms. Geenens derives a particular kernel density estimator [35] using the Meijer-G function and its Mellin transform, which can capture as limiting cases a number of common statistical distributions over a one dimensional domain, including beta prime, Burr, chi, chi-squared, Dagum, Erlang, Fisher-Snedecor, Fréchet, gamma, generalised Pareto, inverse gamma, Lévy, log-logistic, Maxwell, Nakagami, Rayleigh, Singh-Maddala, Stacy and Weibull distributions [35]. This is already a testament to the potential power behind this representation. Geenens also gives an excellent summary of the mathematical properties of the Mellin transform and its applications to probability functions, and of some of the more technical details and important considerations surrounding the Meijer-G function itself [35].
The take away point is that all of the functions in this natural hierarchy can be defined as a Mellin-Barnes integral in the form

f(x) = (1/2πi) ∫_L φ̂(s) x^{−s} ds,

for a suitable definition of φ̂ [15]. L is a special contour path which is covered in detail in the references defining each function. The fundamental pattern is that for all of these generalised functions φ̂ is expressed as a product of gamma functions and their reciprocals, optionally raised to powers, with scale and shift parameters in their arguments.

Ramanujan Master Theorem (RMT)
We have identified the Mellin transform as the method of extracting the fingerprint of a function. The goal of this work was to identify an unknown function from numeric data. Now we can assume a generalised theoretical fingerprint of a form from table 2 to define the desired level of complexity. Then we train the parameters in the theoretical fingerprint to reveal the function's definition. In order to do this we will elaborate on the following additional steps: 1) Numerically perform the Mellin transform to convert numeric function data to numeric 'fingerprint' data.
2) Fit the function fingerprint (machine learning techniques).
3) Reconstruct the function from its fingerprint.

Point 1) is relatively easily handled for distributions by interpreting the definition of the Mellin transform as the 'moments' of a probability distribution. For a set of N well sampled data points x_i we can approximate

ϕ(s) = E[x^{s−1}] ≈ (1/N) Σ_{i=1}^N x_i^{s−1},

and where this is not applicable for reasons of bias and sampling, there is existing theory of unbiased estimators that can be invoked to extract moments [31]. The key point here is that we must sample the moments at a number of values of the exponent s. For advanced treatments these may need to be real, fractional or even complex numbers, because the fingerprints are often defined for inputs from the complex plane. For functions which are not distributions, we require an equivalent process of extracting the coefficients numerically; for this, numerous numeric Mellin transform algorithms exist.
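A sketch of this sampling step, assuming a Gamma(k, 1) toy distribution for which the exact moments E[x^{s−1}] = Γ(k + s − 1)/Γ(k) are known:

```python
# Estimate the fingerprint at (possibly complex) exponents s from samples.
import numpy as np
from scipy.special import gamma

rng = np.random.default_rng(0)
k = 2.5
x = rng.gamma(k, 1.0, size=1_000_000)

for s in (0.5, 1.5 + 0.5j, 2.0 - 1.0j):
    estimate = np.mean(x**(s - 1))       # complex powers of positive samples
    exact = gamma(k + s - 1) / gamma(k)  # known answer for this toy case
    print(s, estimate, exact)
```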
For point 2) there already exist a number of strategies for training models and networks. There are gradient methods such as those used in deep learning [2,37,8], stochastic sampling methods over parameters, and potentially advanced constructive techniques [12]. One would minimise the square difference between the numeric fingerprint extracted using point 1) and the theoretical fingerprint selected from table 2. Due to the rapidly growing nature of gamma functions it is useful to instead minimise the squared difference of the logarithms of the numeric and theoretical fingerprints, as sketched below. The resulting sum of log-gamma functions has an interesting interpretation and mathematical properties, and this is discussed in section 'Comparison to a Neural Network'.
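A minimal sketch of that loss, assuming both fingerprints are supplied as callables over an array of exponent points:

```python
# Squared log-difference loss between numeric and theoretical fingerprints,
# summed over the chosen exponents s_k; |.|^2 handles complex differences.
import numpy as np

def log_fingerprint_loss(phi_numeric, phi_theory, s_points):
    diff = np.log(phi_numeric(s_points)) - np.log(phi_theory(s_points))
    return np.sum(np.abs(diff)**2)
```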
For point 3), reconstruction relies on the Ramanujan master theorem. If a function admits a series expansion of the form

f(x) = Σ_{k=0}^∞ φ(k) χ_k x^k,   χ_k = (−1)^k / k!,

where χ_k can be considered an 'alternating exponential symbol', the RMT simply states that the Mellin transform of the function is given by

ϕ(s) = M[f](s) = Γ(s) φ(−s),

where Γ(s) is the gamma function. This relationship sometimes relies on the analytic continuation of the coefficient function φ(s) to accept negative (and potentially complex) arguments. Often the Γ(s) factor will directly cancel against a reciprocal gamma term in the coefficient function, such as in the hypergeometric style functions in the top half of table 2.

Conclusions from Functions of a Single Variable
In conclusion, we can take a numerical representation of a function or probability distribution and apply one of a suite of algorithms to extract high-quality estimates of moments, potentially even at fractional and complex values. By assuming a very generalised fingerprint of the function, which corresponds to an extremely generalised function, we can fit the fingerprint to the estimates of the moments. Because of the well behaved and ordered nature of the fingerprints as gamma functions and their products, we can easily interpret the meaning of the fitted coefficients. If the fitted coefficients are simple (e.g. integers or rational numbers) to a high degree of confidence, they could be replaced with an exact mathematical form. The inverse Mellin transform can then be taken either by using the Ramanujan master theorem in reverse (to avoid contour integration), or by using the residue theorem; both methods allow the reconstruction of the function from its fingerprint. Work will be required on developing techniques for recognising exact constants appearing in the parameters of the fingerprint.
In practice, real data cover multiple dimensions, and the functions for which it is hardest to solve equations are multivariate functions. It was instructive to introduce the concepts in terms of functions of a single variable; we now extend all of the above in a similar fashion. We require the generalised Ramanujan master theorem and multivariate generalised functions, for example multivariate hypergeometric series, which are defined in terms of multiple Mellin-Barnes integrals.

Multiple Dimensions
There are many more mathematical technicalities in the multivariate case. As before, for the sake of focus, and to introduce the core concepts only, we will temporarily sweep these under the rug, but encourage the reader to seek out the details in the references provided when implementing these methods. However, the concept of 'uniqueness' must be addressed for the multivariate case. The Mellin transform alone does not uniquely define a multivariate distribution [39]; the fingerprint should be considered alongside a 'region of convergence', the pair being a better description [35].

Multivariate Moment Functional
Firstly, for convenience of notation we define a multivariate moment functional, which can be seen as a vectorised power operation that acts on two length D vectors a and b as

a^b_Π = ∏_{i=1}^D a_i^{b_i},

which returns a scalar value. The subscript Π symbol is a reminder to reduce the result using a product. This operation will conveniently vectorise the ensuing equations. It is clear to see by the basic rules of exponentiation that a^b_Π a^c_Π = a^{b+c}_Π. In this notation the multivariate moment of a vector of variables x and a vector of exponents s is simply x^s_Π.
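A sketch of the moment functional and the corresponding multivariate moment estimate over sampled rows; the helper names are illustrative:

```python
# a^b_Pi = prod_i a_i^(b_i), and the sample estimate of E[x^(s-1)_Pi].
import numpy as np

def moment_functional(a, b):
    return np.prod(np.asarray(a, dtype=complex)**np.asarray(b))

def multivariate_moment(X, s):
    # X has shape (N, D); s is a length D (possibly complex) exponent vector
    return np.mean(np.prod(X**(np.asarray(s) - 1.0), axis=1))

print(moment_functional([2.0, 3.0], [2.0, 1.0]))  # 2^2 * 3^1 = 12
```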

Multivariate Mellin Transform
In analogy to the multivariate extensions of the Fourier and Laplace transforms one can define a multivariate Mellin transform [40]. For x ∈ R^D with positive components we define the multivariate Mellin transform as the integral transform

ϕ(s) = M[f](s) = ∫_{[0,∞)^D} x^{s−1}_Π f(x) dx,

where dx = dx_1 · · · dx_D, the key observation being that f and ϕ are now functions of vectors. The multivariate Mellin transform is a tool to extract the multivariate moments from a multivariate probability distribution, but will also generate the multivariate fingerprint (or simply fingerprint) of a function in the same way as before. The inverse transform takes a multiple Mellin-Barnes representation [23], but an exact form will not be needed due to the generalised Ramanujan master theorem (GRMT). If a multivariate residue theorem is required, keywords include Grothendieck residue and residue current [23,41].

Generalised Ramanujan Master Theorem
The GRMT addresses the analogous problem of solving the multivariate Mellin transform of a multivariate function which admits an analytic series expansion. The GRMT and associated 'method of brackets' [42] were pioneered in a series of works from Gonzalez et al. [6], who cover the definition and historical developments, the application of the GRMT to special functions from Gradshteyn and Ryzhik [14,43,44], and the solution of laborious and complicated integrals from quantum field theory [42]. If a function f(x) admits a series expansion (compacted using a vector multi-index k), with multivariate coefficient function φ(k), a D × D matrix of exponent weights W and a length D vector of exponents b, with an analogous multivariate alternating exponential symbol χ_k = ∏_{i=1}^D (−1)^{k_i}/k_i!,

f(x) = Σ_k φ(k) χ_k x^{Wk+b}_Π,

then the multivariate Mellin transform of f(x) is given by the expression

ϕ(s) = M[f](s) = (1/|det W|) φ(k*) Ξ[−k*],

where k* is the solution to the linear equation Wk* + b + s = 0 [6]. For an example derivation of this result see the supplementary information section 'Walkthrough of GRMT on Generalised Probability Distribution'.
Although there are additional complications from the introduction of a determinant and a linear equation, the use of this relationship is essentially the same as in the single variable application. Now that we are equipped with our multivariate tools, we review the definitions of the theoretical fingerprints in the case of many variables.
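A hedged sketch of evaluating the GRMT expression numerically; the convention Wk* + b + s = 0 follows the statement above and should be checked against reference [6] before serious use:

```python
# Solve the linear system for k*, then combine the determinant, coefficient
# function and gamma factors of the GRMT expression.
import numpy as np
from scipy.special import gamma

def grmt_fingerprint(phi, W, b, s):
    W = np.asarray(W, dtype=float)
    k_star = np.linalg.solve(W, -(np.asarray(b) + np.asarray(s)))
    return phi(k_star) * np.prod(gamma(-k_star)) / abs(np.linalg.det(W))

# 1D sanity check: f(x) = exp(-x) has phi(k) = 1, W = [[1]], b = [0],
# so the fingerprint should reduce to Gamma(s).
print(grmt_fingerprint(lambda k: 1.0, [[1.0]], [0.0], [0.75]), gamma(0.75))
```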

Multidimensional Hypergeometric Series
Here we consider hypergeometric series analogues which extend into multiple dimensions. Many such functions have been investigated throughout history. Two dimensional series include the Horn hypergeometric series and their multivariate analogues [45], the four Appell hypergeometric series [46], and the Kampé de Fériet function, which extends the generalised hypergeometric function to two variables [47]. For three or more dimensions there are also Lauricella functions [46] and numerous further generalisations. There are many possible definitions of the hypergeometric series in general for more than one variable. From table 2 we can draw insight into how a general multivariate series should look in order to remain compatible with the goals of exact learning while retaining a suitably flexible and general form.
For the purposes of this work we will generalise in the following way. Consider the most general function in table 2, the Rathie-I function, which (in one variable) has a fingerprint of the form

ϕ(s) = η^s Ξ[a + αs, i] Ξ[b + βs, j] / ( Ξ[c + γs, k] Ξ[d + δs, l] ).

If we pick out a single Ξ term and expand it in terms of gamma functions we have

Ξ[g, v] = ∏_{l=1}^N Γ^{v_l}(g_l).

For the conversion to a multivariate case the exponent s becomes a vector s = [1, s_1, · · · , s_M], where we have deliberately included the value 1 in the first element. Extend the argument of each gamma function to a full linear combination of exponents with a constant, such that g_l → g_l = [c_l, g_{l1}, · · · , g_{lM}]; now we can write

Ξ[Gs, v] = ∏_{l=1}^N Γ^{v_l}(g_l · s),

where G now represents some N × M matrix of scale and shift coefficients packaged together. The end result will be a general theoretical fingerprint which can be fitted to the numeric data of the form seen in figure 1.
Note that it is important to include a multivariate analogue of the scalar scale parameter η^s included in table 2, which with the moment functional becomes η^s_Π. We note that if an additional element of 1 is joined to the multivariate vector s as above, and hence the constant shift parameters are absorbed into the scale matrices for all Ξ terms, we have a compact form for the multivariate extension to this Rathie-I type fingerprint,

ϕ(s) = η^s_Π Ξ[As, i] Ξ[Bs, j] / ( Ξ[Cs, k] Ξ[Ds, l] ).

When negative powers are also considered and ⊕ represents vector concatenation, the vectors a = k ⊕ −j and b = i ⊕ −l allow a further compactification of this fingerprint to

ϕ(s) = η^s_Π Ξ[Bs, b] / Ξ[As, a],

for new parameter matrices and vectors. Further ideas and examples are given in the supplementary materials on the generalisation to multivariate series. Further work may find better concise and meaningful representations for these functions. The take away point here is that the moments of the multivariate generalised functions have a simple underlying generating procedure, which is a flexible product of gamma functions and their reciprocals whose arguments contain generalised linear equations, and each gamma function can optionally be raised to a power.

Comparison to a Neural Network
We now briefly discuss how this framework could be interpreted as a form of neural network, or a generalised layer within one. Taking the logarithm of the compact fingerprint above gives

log ϕ(s) = ∆ + w · log Γ(Ms + v),

where M and v are some new arbitrary matrix and vector parameters and ∆ = s · log η absorbs the scale terms. We can compare this to the equations for a neural network with input vector x, a matrix of weights A and vector bias b, one activation layer σ and a weighted sum pooling operation providing a single output,

y = c + w · σ(Ax + b);

we see a mapping between c → ∆, σ(·) → log Γ(·), A → M, b → v, and x → s. The final weight layer w represents the powers of the gamma functions (and reciprocal gamma functions for negative weights).
We then conclude that one possible interpretation of the general form of a neural network is an attempt to approximate the output variable as the log-moments of a multivariate distribution. There are other interpretations as well, such as a weighted sum of basis functions, and this is a very general mathematical form, so a thorough study would be required to test this hypothesis.
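A minimal sketch of this correspondence as a forward pass, assuming scipy's complex-capable loggamma as the activation; the parameter values are illustrative:

```python
# One 'layer' with log-gamma activation evaluated at a complex exponent
# vector s: y(s) = Delta + w . logGamma(M s + v), mirroring c + w . sigma(A x + b).
import numpy as np
from scipy.special import loggamma

def log_fingerprint(s, M, v, w, delta=0.0):
    return delta + w @ loggamma(M @ s + v)

M = np.array([[1.0, 0.5], [0.0, 1.0]])
v = np.array([0.25, 1.0])
w = np.array([1.0, -1.0])  # negative weights encode reciprocal gamma factors
s = np.array([0.5 + 0.5j, 1.5 - 0.25j])
print(log_fingerprint(s, M, v, w))
```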
The above comparison prompts us to consider the existence of a special 'activation function', namely the log-gamma function log Γ(·). The key requirement is that s is a complex vector and y(s) is a sum of log-gamma functions.

A Top Down Solution
Using the above, we can then plausibly start from the top down and construct a general (vectorised) form for the log theoretical fingerprint, which is just

y(s) = ∆ + Σ_{k=1}^{N_+} w_k log Γ(α_k · s + β_k) − Σ_{k=1}^{N_−} w̃_k log Γ(γ_k · s + δ_k),

for new vectorised parameters α, β, γ, δ, where N_+ and N_− control the number of terms of each form. If the weights are restricted to w_k ∈ {−1, 1}, then the solution does not invoke the parameters analogous to i, j, k, l in table 2; if the corresponding scale parameters α_k and γ_k are set to zero then the solution is restricted to generalised hypergeometric series. By minimising the squared difference between the log theoretical fingerprint and the log numeric fingerprint, the parameters within the log-gamma functions are solved for.
If the settings are right it will be possible to write a concise (multivariate) series expansion for the fitted function by interpreting it in terms of the GRMT. The benefit of this top down form is that weights can approach zero if there are too many log-gamma terms, although this may require additional regularisation terms in the loss function, and further development of best practices and strategies for training. A toy sketch of this fitting loop is given below.
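A toy end-to-end sketch under strong assumptions: one log-gamma term, real exponent points, and exponential samples whose true fingerprint is Γ(s):

```python
# Fit log Gamma(alpha*s + beta) to numeric log-moments of exponential samples;
# the fit should recover alpha ~ 1, beta ~ 0 since the true fingerprint is Gamma(s).
import numpy as np
from scipy.special import loggamma
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = rng.exponential(1.0, size=500_000)

s_points = np.linspace(1.2, 3.0, 12)
log_numeric = np.log([np.mean(x**(s - 1)) for s in s_points])

def loss(params):
    alpha, beta = params
    z = alpha * s_points + beta
    if np.any(z <= 0):
        return 1e10  # keep the gamma arguments on the positive real axis
    return np.sum((loggamma(z) - log_numeric)**2)

result = minimize(loss, x0=[0.5, 0.5], method='Nelder-Mead')
print(result.x)  # should approach [1.0, 0.0]
```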

Gradients
The log-gamma function is (functionally) well suited to extracting gradients and higher order derivatives, having a compact expression for derivatives of arbitrary order, with the n-th derivative given by

d^n/dx^n log Γ(x) = ψ^{(n−1)}(x),   n ≥ 1,

where ψ^{(k)}(x) is the polygamma function. This means that closed forms could plausibly be written for high order differential routines. Once again this may be subject to optimisation over complex numbers, and polygamma functions extend to the complex plane, especially under complex input vectors s. Currently, vectorised implementations of general order polygamma functions that handle complex inputs are hard to find.
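Where arbitrary order and complex arguments are needed, mpmath's psi is one of the few readily available options; a sketch using scalar (non-vectorised) calls:

```python
# d^n/dz^n log Gamma(z) = psi^(n-1)(z); mpmath's psi(m, z) accepts complex z.
from mpmath import mp, mpc, psi, loggamma, diff

mp.dps = 20
z = mpc(1.5, 0.5)

print(diff(loggamma, z), psi(0, z))     # first derivative vs digamma
print(diff(loggamma, z, 3), psi(2, z))  # third derivative vs psi^(2)
```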

Analogy to Deep Networks
We very briefly consider an extension of the analogy to neural networks with more than one layer, so-called 'deep neural networks', while also highlighting a potential pitfall for the method presented so far.
Unfortunately, some functions have complicated fingerprints which do not consist of only gamma functions and their reciprocals. Take for example the surprisingly simple function f(x_1) = e^{−x_1^2/3 − x_1}, whose Mellin transform is given by

M(s_1) = (3/4)^{s_1/2} Γ(s_1) U(s_1/2, 1/2, 3/4),

where U(a, b, x) is the Tricomi hypergeometric function. The nice property of the Mellin transform is that this hypergeometric function is still related to gamma functions. If we include a 'virtual' parameter x_2, such that when x_2 = 1 the previous expression is preserved, for example f(x_1, x_2) = e^{−x_1^2/3 − x_1 x_2}, we could perform a second Mellin transform such as

M(s_1, s_2) = (1/2) 3^{(s_1−s_2)/2} Γ(s_2) Γ((s_1 − s_2)/2),

where we now see the right-hand side is comprised of gamma functions and powers of constants, in a way that is representable as a two-dimensional Mellin transform. Using the above iterated process we can construct an analogy to a deeper neural network, where each (hidden) layer requires another Mellin transform with respect to a virtual parameter. If this could be generalised and implemented seamlessly, the algorithm would be able to detect a much wider range of functions and distributions. However, for data driven problems the analogy cannot be fully realised, because each Mellin transform is equivalent to a sampling operation from some distribution, and the new intermediate virtual distribution is not well defined from the input numeric data.
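A numerical check of the single-variable fingerprint quoted above (the closed form here is a reconstruction and worth verifying); mpmath's hyperu is the Tricomi function:

```python
# Verify M(s) = (3/4)^(s/2) Gamma(s) U(s/2, 1/2, 3/4) for f(x) = exp(-x^2/3 - x).
from mpmath import mp, mpf, quad, exp, gamma, hyperu, inf

mp.dps = 25
s = mpf('1.3')

numeric = quad(lambda x: x**(s - 1) * exp(-x**2 / 3 - x), [0, inf])
closed = (mpf(3) / 4)**(s / 2) * gamma(s) * hyperu(s / 2, mpf(1) / 2, mpf(3) / 4)
print(numeric, closed)  # should agree to high precision
```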
In this vein, Rathie et al. have developed a Y-function, where the gamma functions are replaced by Tricomi hypergeometric functions [20]. The significance of this is that these hypergeometric functions can themselves be expressed through an inverse Mellin transform. These advanced functions may be closer in analogy to an iterated, or second, layer, requiring a Mellin transform of the fingerprint to give the fingerprint of the fingerprint. Rathie goes on to define functions which replace the gamma functions with other generalised functions, including incomplete gamma functions and Fox-H functions. We leave the development of this idea for future work, but if these further representations could be harnessed, many more functions could be recognised. It may be possible to fit not only log-gamma functions, but also log-hypergeometric functions to the moments. The downside is that high order hypergeometric functions are harder to fit due to varying domains of convergence and analytic continuation, and they can be numerically expensive to evaluate.

Conclusions
We have presented a framework for representing and working with unknown functions, with the aim of applying machine learning techniques and tools to mathematically classify unknown distributions. We call this process exact learning, because if the resulting parameters of the fitting process are understood in the right context it may be possible to reconstruct an exact formula for the unknown distribution. We developed a few ideas within this framework, including the notion of a 'fingerprint' of a function in a transformed space which is easier to traverse when attempting to identify the function. We used the Mellin transform and its inverse as the tool to switch between the representation spaces, and we noted that an extensive hierarchy of functions exists in the literature whose fingerprints are comprised of gamma functions, with shift, scale and power parameters along with constants and scaled constant terms. We believe these functions to be a promising starting point to build exact learning algorithms, interpret the outcomes in a human readable way, and categorise and document unknown numeric solutions. We have shown that multivariate analogues exist for these functions and methods, and indicated that the exact learning method may scale to high dimensional numeric data while requiring relatively few parameters to learn the fingerprint.
We found that a multivariate extension of the most generalised function can take a relatively compact form, and that the log of the theoretical fingerprint begins to resemble an unsupervised network. A natural analogy exists with a single layer neural network with a 'log-gamma' activation function or basis function.
There are also possible extensions to deeper networks with more layers and this new perspective may be a step towards explaining how neural networks work and interpreting the coefficients and weights in the same way that the hierarchy of functions is developed. The log-gamma activation function has particularly nice derivatives, but a complex branch structure will undoubtedly require new techniques for efficient training and optimisation.
Further developments could include the addition of polygamma terms as well as gamma terms to the fingerprint which would relate to series expansions containing powers of logarithms as well as monomial terms.
One could also imagine a larger network which is comprised of multiple exact learning 'units' containing sums of terms, products of terms and compositions of functions. Sums of functions would be trivial to implement, but further mathematical developments would be required to train such networks for products, including Mellin convolution, and the development of an equivalent of the Ramanujan master theorem for arbitrary trees of binary operations on functions.
The supplementary materials contain extended discussion of future work and the limitations of the methods presented here. We also implemented some basic versions of the exact learning algorithm, along with a basic method for fitting the fingerprints using complex numbers. The key limitations of the method presented here are that functions are currently restricted to the domain [0, ∞)^n, and must be representable by the hierarchy of functions described by a Mellin-Barnes integral, which as we showed does not cover even some simple examples of functions. The method used for sampling moments for these prototypes is crude, and the results are currently demonstrable for bounded 'distribution'-like functions for which the whole domain has been sampled. With numeric Mellin transforms, and advanced sampling and training methods, we hope to see 'exact learning' evolve into a more sophisticated method which makes useful discoveries in many fields of science. We believe the representation presented here is a good starting point for further developments.