Machine learning the derivative discontinuity of density-functional theory

Machine learning is a powerful tool to design accurate, highly non-local, exchange-correlation functionals for density functional theory. So far, most of these machine-learned functionals have been trained for systems with an integer number of particles. As such, they are unable to reproduce crucial and fundamental aspects of the theory, such as the explicit dependence of the functional on the particle number or the infamous derivative discontinuity at integer particle numbers. Here we propose a solution to these problems by training a neural network as the universal functional of density-functional theory that (i) depends explicitly on the number of particles, with piecewise linearity between the integers, and (ii) reproduces the derivative discontinuity of the exchange-correlation energy. This is achieved by using an ensemble formalism, a training set containing fractional densities, and an explicitly discontinuous formulation.


INTRODUCTION
In their now famous paper, Hohenberg and Kohn proved that the electron density ρ(r) suffices to compute all observables of a system of interacting electrons [1]. Due to a remarkable balance of computational cost and numerical precision, first-principles modeling of electronic systems based on this density functional theory (DFT) is nowadays a daily practice, with great impact in materials science, quantum chemistry, and condensed-matter physics [2]. The success of DFT is to a large extent based on the Kohn-Sham formulation, which utilizes a system of non-interacting electrons that has the same density as the interacting one [3]. The main ingredient of this formulation is E_xc[ρ], the universal exchange-correlation (xc) functional, whose functional derivative provides an effective external potential for the non-interacting particles. Yet, while the Hohenberg-Kohn theorem proves the uniqueness of such a functional, it does not give any indication regarding its specific form. To circumvent this issue, a very large number of approximate functionals have been developed over the last decades [4,5], often combining empirical knowledge, exact mathematical conditions, and a great deal of ingenuity.
Inspired by the success of machine learning (ML) in various technological applications, including image and speech recognition [6], the last couple of years have seen the development of several neural-network-based approximations to E_xc[ρ]. Indeed, it is now firmly established that machine learning offers a new generation of accurate, highly non-local, xc functionals [7]. While these functionals are designed to perform tasks of different degrees of complexity, all share the aim of learning one of the maps of DFT: either the Hohenberg-Kohn map between the external potential v(r) and the density ρ(r) [8-12], or the Kohn-Sham map between the density ρ(r) and the xc functional E_xc[ρ] together with its functional derivative v_xc[ρ](r) = δE_xc[ρ]/δρ(r) [13-15].
The functionals delivered by machine-learning DFT (ML-DFT) are in general non-local, in the sense that they use multiple density points as input, and can be efficiently trained with data from reference methods. Yet, since ML-DFT functionals are mostly trained in Hilbert spaces with an integer number of particles, they are still unable to reproduce some critical and fundamental aspects of DFT. For instance, it is known that any satisfactory definition of the energy functional must depend explicitly on the particle number [16-18]. Furthermore, the derivative of the xc energy with respect to the number of particles exhibits a discontinuity that plays a crucial role in the description of electronic band gaps [19-22], charge-transfer excitations [23,24], molecular dissociation [25-28], or even Mott insulators [29], to name but a few examples.
Systems with a non-integer (fractional) number of electrons N + ω are defined as statistical mixtures of systems with integer numbers of particles [20,30]. As such, the density ρ_{N+ω}(r) and the total energy E(N+ω) are piecewise linear functions of ω, namely

ρ_{N+ω}(r) = (1 − ω) ρ_N(r) + ω ρ_{N+1}(r),
E(N + ω) = (1 − ω) E(N) + ω E(N+1),

with 0 ≤ ω ≤ 1. At integer N (i.e., when ω = 0), the derivatives of the density and of the energy exhibit a discontinuity, and the xc potential v_xc(r) jumps by a finite value [31]. The difference between the slopes of the total energy to the left and to the right of an integer value is equal to the fundamental gap [20]:

E_gap = I − A = [E(N−1) − E(N)] − [E(N) − E(N+1)],

where I is the ionization energy and A the electron affinity. Yet, in practice, standard approximations to the xc functional that depend explicitly on the electronic density, such as the local-density (LDA) and generalized-gradient (GGA) approximations, are continuously differentiable functions of N and therefore lack a derivative discontinuity. Meta-GGAs can exhibit a discontinuity due to their dependence on the kinetic-energy density, but it is usually too small or even negative [32]. Due to their dependence on the Kohn-Sham orbitals, orbital functionals are discontinuous [33], but this comes at the price of a much higher computational effort. In addition to the discontinuity, a universally useful approximation to the xc functional must be "N-electron self-interaction-free" for all positive integer N [34], meaning that the total energy of a system with N + ω electrons in the range (N, N + 1) should vary linearly with ω. For attractive interactions the energy is a convex function, with straight lines joining subsets of ground-state energies [35]. Yet, approximate functionals deviate from this correct behavior. It has been shown that semi-local density functionals are in general convex, with perhaps small concave pieces [36]. Even Hartree-Fock theory leads to piecewise concave curves between the integers [36].
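As a quick numerical illustration of these relations, the piecewise-linear ensemble energy and the slope-jump form of the fundamental gap can be checked in a few lines. The energies below are made-up numbers for illustration, not data from this work:

```python
# Illustrative check: ensemble energy E(N + omega) is piecewise linear in
# omega, and the fundamental gap equals the jump in slope at integer N.
E = {1: -0.6, 2: -1.1, 3: -1.3}  # hypothetical ground-state energies (a.u.)

def E_frac(N, omega):
    """Ensemble energy for N + omega electrons, 0 <= omega <= 1."""
    return (1.0 - omega) * E[N] + omega * E[N + 1]

I = E[1] - E[2]          # ionization energy at N = 2
A = E[2] - E[3]          # electron affinity at N = 2
gap = I - A              # fundamental gap

# The gap is the difference between the right and left slopes of E(N) at N = 2:
slope_left = E[2] - E[1]
slope_right = E[3] - E[2]
assert abs((slope_right - slope_left) - gap) < 1e-12
```

Since each linear segment interpolates its integer endpoints, any approximate functional whose E(N + ω) curves between integers (convex or concave) violates this exact condition.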
We note that the relatively well-defined curvature of these curves is ultimately the reason for the success of the Slater half-occupation scheme [37] and of the LDA-1/2 method [38]. In fact, these schemes use the derivative at the midpoint (i.e., at N − 0.5), which can be shown to be equal to the slope of the straight line between E(N − 1) and E(N) if the curvature is constant, irrespective of its sign [39]. The centrality of these pressing issues in DFT can be further highlighted by the fact that a rigorous description of the delocalization error can be related to the energy curve of the xc functionals lying below the straight energy lines [36,40].
In this work, we propose a way to train a neural network as the ensemble universal functional of a system with a fractional number of electrons that correctly describes the derivative discontinuity and the piecewise linear behavior. The ML functionals we present explicitly contain the physics of the derivative discontinuity of DFT, are highly non-local, and are trained for systems with fractional densities. For this reason, our functionals can potentially address the well-known delocalization and static-correlation errors of DFT [41-43] simultaneously.

RESULTS AND DISCUSSION
Inspired by the neural-network topology proposed in Ref. [13], our neural network takes an electronic density as input and returns the corresponding xc energy, whose functional derivative can in turn be used to solve the Kohn-Sham equations. The network is a sliding-window convolution (SWC) network. For a 1D system of discrete spatial points {r_1, ..., r_W}, a window with a certain kernel size κ scans each data point ρ_σ(r_j) and its κ − 1 nearest neighbors, η_σ(r_j, κ) = {ρ_σ(r_{j−(κ−1)/2}), ..., ρ_σ(r_{j+(κ−1)/2})}, with σ = ↑, ↓, to calculate a local energy θ_loc[η_↑(r_j, κ), η_↓(r_j, κ)]. The total xc energy is calculated by summing over the local energies:

E_xc^θ[ρ_↑, ρ_↓] = Σ_j θ_loc[η_↑(r_j, κ), η_↓(r_j, κ)].

Here, θ denotes the trainable parameters. The input channels can be either the total electronic density ρ = ρ_↑ + ρ_↓ or the spin densities ρ_↑ and ρ_↓. The corresponding xc potentials can be computed using automatic differentiation, as shown in Ref. [13]. The parameters of the neural network are updated according to the loss function

L(θ) = α MSE(E_xc^θ, E_xc^exact) + β MSE(v_xc^θ, v_xc^exact),   (4)

where α and β are fixed weights that can be adjusted to expedite convergence, and MSE denotes the mean squared error. A more detailed description of our networks can be found in the Methods section. We improve the performance of this architecture by (i) training our neural network with non-integer densities, (ii) introducing the jump of the xc potential at integer numbers into the loss function, and (iii) adding an explicit discontinuity at integer electron numbers. In the following we discuss the details of these new approaches.
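To make the sliding-window decomposition concrete, the following toy sketch implements the window extraction and the sum over local energies in plain Python. The single tanh unit `theta_loc` is a stand-in for the trainable network (the actual model is a PyTorch multilayer network), and the densities and weights are made up:

```python
import math

def windows(rho, kappa):
    """Zero-padded windows of odd size kappa centered on each grid point."""
    half = (kappa - 1) // 2
    padded = [0.0] * half + list(rho) + [0.0] * half
    return [padded[j:j + kappa] for j in range(len(rho))]

def theta_loc(window, weights):
    """Toy local-energy model: one tanh unit as a placeholder for an MLP."""
    return math.tanh(sum(w * x for w, x in zip(weights, window)))

def exc(rho, kappa, weights):
    """Total xc energy: sum of local energies over all windows."""
    return sum(theta_loc(win, weights) for win in windows(rho, kappa))

rho = [0.1, 0.4, 0.7, 0.4, 0.1]   # toy 1D density on 5 grid points
w = [0.2, -0.1, 0.3]              # toy weights for kernel size kappa = 3
E = exc(rho, 3, w)
```

Because every grid point is scanned with the same local model, the output length matches the input length and the functional is translationally equivariant by construction.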

Fractional particle numbers
In general, neural networks do not extrapolate well outside the distribution of the samples used for their training. Consequently, we cannot expect that machines trained solely on integer densities (as usually done) will exhibit the correct linear behavior of the energy. To illustrate this behavior we plot, in Fig. 1, the total energy calculated with a neural network trained solely with integer densities (model A in the figure) as a function of the number of particles for a 1-dimensional system. Besides the fact that model A is far from linear, we can also note that the sign of the curvature is not constant, with both concave and convex parts. This is to be expected, as the network, in contrast to the usual xc functionals, only incorporates physical knowledge through the training examples with integer densities. As such, we can easily see that approaches such as LDA-1/2 are bound to fail in this case. As a first strategy to solve this problem, we decided to include samples calculated within ensemble DFT at fractional densities in our training. To obtain this data we created a set of total energies and electronic densities for a series of 1-dimensional exact calculations. We then constructed ensemble densities and inverted the ensemble Kohn-Sham equations, in order to compute the exact xc energy and potential for these systems. We used an inversion algorithm based on Ref. [44] that we extended both to spin-DFT [45] and to ensemble systems. As a result, we created exact training and testing data with particle numbers between 1 and 3 electrons, which we used to train our models.
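The construction of fractional training densities amounts to a pointwise convex mixture of the neighboring integer densities. A minimal sketch (with toy densities on four grid points, not data from our set):

```python
# Build an ensemble density rho_{N+omega} = (1-omega) rho_N + omega rho_{N+1}
# from two integer-N densities given on the same grid.
def ensemble_density(rho_N, rho_Np1, omega):
    assert 0.0 <= omega <= 1.0
    return [(1.0 - omega) * a + omega * b for a, b in zip(rho_N, rho_Np1)]

rho2 = [0.2, 0.8, 0.8, 0.2]   # toy density summing to 2 "electrons"
rho3 = [0.4, 1.1, 1.1, 0.4]   # toy density summing to 3
rho25 = ensemble_density(rho2, rho3, 0.5)
```

By linearity, the particle number of the mixture interpolates as well: the mixed density integrates to 2.5 electrons, exactly as required by the ensemble formalism.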
In Fig. 2 we compare the mean absolute error (MAE) of the total energy for a functional trained with fractional densities (blue) and a functional trained only on integer densities (red), averaged over a test set of 100 external potentials. As we can see, the model trained at integer densities yields an excellent prediction for the total energy at those integers, but exhibits a considerably larger error at fractional numbers. Remarkably, by simply adding fractional densities to the training set, the error decreases by more than one order of magnitude, and the MAE over the entire range [1, 3] remains below 2 × 10^−3 a.u. We note in passing that we trained the networks both for spin-restricted and spin-unrestricted DFT, with similar results. As such, we decided to use the spin-restricted formalism in the following.
While this strategy resolved, to a large extent, the many-electron self-interaction error, a problem still remains in the vicinity of the integer particle numbers. In fact, our network is fully differentiable, and does not (in fact, cannot) exhibit a true derivative discontinuity (in a mathematical sense) as a function of N.

Jump in the xc potential
We can easily relate the discontinuity of the total energy at integer particle numbers to a uniform shift in the potential. Indeed, the exact uniform shift Δ_xc^N of v_xc at integer particle number N obeys the relation [46]

Δ_xc^N = E(N+1) + E(N−1) − 2E(N) − ε_s^N,

where ε_s^N is the Kohn-Sham gap, i.e., the difference between the lowest unoccupied (LUMO) and the highest occupied (HOMO) molecular-orbital energies. Notably, E(N) and E(N ± 1), as well as the eigenvalues corresponding to the LUMO and HOMO, can be computed while creating the training sets.
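A small numerical example of extracting the exact shift from the quantities available during training-set generation. All numbers are made up for illustration:

```python
# Delta_xc at integer N is the fundamental gap minus the Kohn-Sham gap:
# Delta_xc = [E(N+1) + E(N-1) - 2 E(N)] - eps_s.
E = {1: -0.6, 2: -1.1, 3: -1.3}            # hypothetical total energies (a.u.)
eps_homo, eps_lumo = -0.50, -0.25          # hypothetical KS eigenvalues at N = 2

eps_s = eps_lumo - eps_homo                # Kohn-Sham gap
fundamental_gap = (E[1] - E[2]) - (E[2] - E[3])   # I - A
delta_xc = fundamental_gap - eps_s
```

Since both the total energies and the KS eigenvalues are by-products of the exact calculations and inversions, the target shift comes at no extra cost.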
Our second strategy consists of computing, as part of our learning process, both the densities ρ_{N^±} just above and below each integer and the exact shift v_xc(N^+) − v_xc(N^−). The corresponding mean squared error is then used to extend the loss function in Eq. (4):

L′(θ) = L(θ) + λ MSE(v_xc^θ(N^+) − v_xc^θ(N^−), v_xc(N^+) − v_xc(N^−)),

where λ is an additional hyperparameter. We expect this extended loss function to help the functional learn the correct shift in the derivative. Unfortunately, training our basic SWC network with this loss function failed for every learning rate we tested. This can be understood from the fact that our network is still fully differentiable, and cannot be forced to learn a discontinuous function. It is clear that, to resolve this issue, we have to explicitly allow for discontinuous behavior in the neural-network topology.
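The extended loss can be sketched as follows. The function below acts on plain lists with toy stand-ins for the predicted and exact quantities (the real implementation acts on PyTorch tensors and is differentiated automatically):

```python
def mse(a, b):
    """Mean squared error between two sequences of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def loss(E_pred, E_ref, v_pred, v_ref, jump_pred, jump_ref,
         alpha=1.0, beta=1.0, lam=1.0):
    """Energy and potential terms plus a lambda-weighted xc-jump term."""
    return (alpha * mse(E_pred, E_ref)
            + beta * mse(v_pred, v_ref)
            + lam * mse(jump_pred, jump_ref))
```

The jump term vanishes only when the model's left/right potential difference at the integer matches the exact shift, which is what a fully differentiable network can never achieve exactly.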
Incorporating the discontinuity

An intuitive way to introduce a discontinuity in the derivatives of the neural network is to use non-differentiable activation functions (e.g., the rectified linear unit [47]). But there is no obvious reason why a non-differentiability at integer particle numbers would appear, and such networks would most likely become non-differentiable with respect to the density ρ. To overcome this problem we take an alternative route: we define an auxiliary function (AF), which we force to be non-differentiable at all integer particle numbers:

u[ρ_↑, ρ_↓] = ( a + b Σ_j u_loc^θ[η_↑(r_j, κ), η_↓(r_j, κ)] ) |sin(π ñ)|,

where ñ = ∫ (ρ_↑ + ρ_↓) dr is the (fractional) number of particles and a and b are arbitrary positive constants. Notice that the non-differentiability of the AF comes from the non-differentiability of the function |sin(πñ)|. The local functions u_loc^θ are obtained by using another SWC neural network. We then replace the functional E_xc^θ[ρ_↑, ρ_↓] by E_xc^θ[ρ_↑, ρ_↓, u], where the additional channel u has been appended to the spin channels. Here each point carries the (same) information of the value of the non-differentiable AF, as illustrated in Fig. 3. As we show below, this procedure ensures that the machine can learn the non-differentiable function in an efficient manner.

FIG. 3. Integration of the non-differentiable auxiliary function u[ρ_↑, ρ_↓] as a third channel of the basic SWC unit. The input data contains two channels for the spin densities ρ_↑ and ρ_↓, respectively.

FIG. 4. Comparison of the derivatives of the energy with respect to the particle number N for three different models evaluated at a randomly chosen external potential: the first model uses only fractional densities, the second incorporates the exact xc shift in the loss function, and the third uses a non-differentiable AF in addition to the exact xc shift and the fractional densities.
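The essential property of the AF is the slope jump of |sin(πñ)| at every integer ñ, which can be verified numerically with one-sided finite differences around ñ = 2:

```python
import math

def kink(n):
    """Non-differentiable factor of the auxiliary function."""
    return abs(math.sin(math.pi * n))

# One-sided difference quotients around the integer n = 2: the left slope
# tends to -pi and the right slope to +pi, so the derivative jumps by 2*pi.
h = 1e-6
left = (kink(2.0) - kink(2.0 - h)) / h
right = (kink(2.0 + h) - kink(2.0)) / h
```

Because this factor depends on the density only through the single number ñ, the network stays differentiable with respect to ρ(r) at fixed particle number, which is exactly the property that non-differentiable activation functions cannot guarantee.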
To study the performance of our approach, Fig. 4 presents the predicted derivative of the total energy for three different training models: (1) a model trained only with fractional densities, (2) a model using the AF in the network architecture and trained with fractional densities, and (3) a model using the AF, the shift in the loss function, and fractional densities. First, we note that the inclusion of the AF solved the training problems discussed in the previous section. As expected, the first model does not predict the correct derivative discontinuity, despite yielding very good errors at fractional particle numbers. While the second model already demonstrates major improvements, especially at N = 2, our most important finding is that the third model exhibits remarkable agreement with the exact results.
In Fig. 1 the energy as a function of the particle number is displayed for two models: (A) a model trained with integer densities only, and (B) a model using fractional densities, the shift in the loss function, and the AF. The situation is clearly much better for model B, as the energy is very close to linear between integers and shows a cusp at N = 2. As shown in Fig. 5, this model is also able to reproduce the correct jump in the xc potential. There we display the xc potentials and densities for a randomly selected external potential at 2 and 2.01 electrons. In the left panel of Fig. 5 we can see that the xc potential of the basic model A is correct at 2 electrons but barely changes when going from 2 to 2.01 electrons. Model B, on the other hand, shows the correct uniform shift in its xc potential.

CONCLUSIONS
We trained a neural network as an exchange-correlation functional that (i) depends explicitly on the number of particles, (ii) yields total energies that are piecewise linear between the integers, and (iii) reproduces the infamous derivative discontinuity of the exchange-correlation energy with remarkable accuracy. To do so, we extended the sliding-window convolution algorithm to systems with a fractional number of particles, and developed a non-differentiable auxiliary function that allows the network to learn the derivative discontinuity correctly. The most efficient way to train a model yielding highly accurate predictions for the energy over the entire range of particle numbers is (i) adding multiple fractional densities to the training data, (ii) training for the correct shift of the xc potential at integer particle numbers, and (iii) incorporating an auxiliary function whose derivative with respect to the particle number is discontinuous at the integers.
Our work pushes forward the ongoing research on machine-learning functionals by incorporating the correct physics into the training process, thereby improving the path towards an exact functional [48]. As an outlook, we think it will be important to extend and test our approach for realistic 3D systems. We also expect that our results can stimulate research on similar problems in ensemble DFT (understood as mixtures of ground and excited states), orbital-free DFT, or functional theories of reduced density matrices [49-53].

METHODS
In this section we present all methods and computational details necessary to arrive at our results. First we discuss the generation of the training data and second the details of the network implementation.

Exact calculations
To train and test our models we created a set of 1-dimensional exact calculations and Kohn-Sham inversions. We sampled 1500 external (Coulomb) potentials and computed the exact electronic ground-state densities, as well as the corresponding energies, by solving the eigenvalue problem for the electronic Hamiltonian

H = Σ_{i=1}^{N} [ −(1/2) d²/dr_i² − Σ_{j=1}^{K} Z_j / √((r_i − R_j)² + α²) ] + Σ_{i<i'} 1 / √((r_i − r_{i'})² + α²),   (7)

where K is the total number of nuclei, R_j and Z_j denote the position and charge of the jth nucleus, respectively, N ∈ {1, 2, 3} is the total number of electrons, and r_i is the position of the ith electron. We solve the exact ground-state problem with Octopus [54], using a grid spacing of 0.1 a.u. and a box size of 23 a.u. (leading to a grid with 231 points). To circumvent the integrability problem of the bare Coulomb interaction in 1D we used the softened interactions appearing in Eq. (7), with softening parameter α. The total number of nuclei K was set to 1, 2, or 3, such that the individual charges satisfy Σ_k Z_k = 3. The nuclear positions were randomly distributed with |R_k| ≤ 4 a.u.
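A minimal sketch of building the softened potentials on the grid described above. The softening parameter is set to 1 a.u. here purely as an assumption (the text only states that a softened interaction was used), and the single nucleus of charge 3 is an illustrative choice:

```python
import math

def v_ext(x, nuclei, alpha=1.0):
    """Softened external potential from nuclei [(R_j, Z_j), ...] at x."""
    return -sum(Z / math.sqrt((x - R) ** 2 + alpha ** 2) for R, Z in nuclei)

def v_int(x1, x2, alpha=1.0):
    """Softened electron-electron repulsion."""
    return 1.0 / math.sqrt((x1 - x2) ** 2 + alpha ** 2)

# 231-point grid: spacing 0.1 a.u. over a 23 a.u. box centered at the origin.
grid = [-11.5 + 0.1 * i for i in range(231)]
nuclei = [(0.0, 3.0)]   # one nucleus at the origin; charges sum to 3
vext = [v_ext(x, nuclei) for x in grid]
```

The softened form removes the 1D Coulomb singularity: both potentials stay finite at coalescence, while recovering the −Z/|x| and 1/|x| tails at large separation.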

Spin densities
As Octopus [54] directly provides spin densities and wave functions only for two-particle systems, we now discuss the problem of obtaining the spin densities for the three-particle systems.
We start by noticing that solving the eigenvalue problem ĤΦ = EΦ for the Hamiltonian (7) yields all many-particle solutions, including both fermionic and bosonic states. A spin-adapted fermionic solution of the form Ψ(r_1 σ_1, ..., r_N σ_N) can be obtained by projecting any spatial solution Φ onto the Young diagrams belonging to given spin quantum numbers (S, M), with S the total spin and M = Σ_i σ_i [55,56]. For a detailed description we refer to Refs. [55,57,58]. By applying the antisymmetrizer A = (1/√N!) Σ_P (−1)^P P to the product Φ X(N, S, M; i) of the spatial part and a spin eigenfunction, one obtains a sum of products of spatial and spin functions, as follows:

Ψ_i = (1/√N!) Σ_P (−1)^P (P_r Φ) (P_σ X(N, S, M; i)),   (9)

where P_r and P_σ denote that the permutation P operates on the spatial and spin coordinates, respectively, and

P_σ X(N, S, M; i) = Σ_j U_ji^(P)S X(N, S, M; j).

The expansion coefficients U_ji^(P)S can be calculated using the orthogonality of the X(N, S, M; j) [59]. For the ground state of N = 3, two linearly independent spin eigenfunctions can be chosen:

X(3, 1/2, 1/2; 1) = (↑↓↑ − ↓↑↑)/√2,
X(3, 1/2, 1/2; 2) = (2 ↑↑↓ − ↑↓↑ − ↓↑↑)/√6.

The (normalized) spatial parts in Eq. (9) and the corresponding spin densities ρ_↑(x) and ρ_↓(x) can then be found.

Kohn-Sham inversion
For the inversion of the densities we used the optimization algorithm proposed in Ref. [44], which casts the inverse-DFT problem of finding the v_xc(r) that yields a given density ρ(r) as a constrained optimization problem. The conjugate-gradient method was used to update the xc potentials v_xc^σ(r), and a constant weight function w ≡ 1 was used.
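The quantity minimized in such an inversion can be sketched as a weighted density mismatch with w ≡ 1. Here `rho_model` stands in for the density obtained from a full Kohn-Sham solve at the current trial potential, which is not reproduced in this toy sketch:

```python
# Density-mismatch objective for the inverse DFT problem: vary v_xc until
# the Kohn-Sham density rho_model matches the target density rho_target.
def objective(rho_model, rho_target, dx, w=1.0):
    """Discretized integral of w * (rho_model - rho_target)^2."""
    return sum(w * (a - b) ** 2 for a, b in zip(rho_model, rho_target)) * dx
```

In practice the gradient of this objective with respect to the potential drives the conjugate-gradient updates, and a convergence threshold on the residual decides whether a sample is kept.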
The densities of N = 1, 2, 3 particles were mixed, allowing us to generate fractional densities for each external potential. For instance, the set {1, 1.5, 2, 2.5, 3} contains the additional ω = 0.5 fractional densities besides the integer ones. If the inversion algorithm did not converge to a MSE below 1.5 × 10^−7 for a given density, we removed all samples corresponding to that external potential. The computed xc potentials were shifted by a constant so as to be in agreement with Koopmans' theorem [60].
In Fig. 6 we show a few examples of such spin densities and inverted xc potentials. For the fractional densities plotted in Fig. 7, the shift of the xc potential caused by the ∆_xc jump is clearly visible.

FIG. 7. Densities and corresponding xc potentials for a given external potential at different fractional particle numbers.

Network implementation
The neural networks were implemented in PyTorch [61], using PyTorch Lightning [62] to simplify the training process. As discussed above, we used the sliding-window convolution proposed in Ref. [13]. For each network, the number of zero-paddings at the system's boundaries was set to (κ − 1)/2 (κ is an odd number in our calculations). For all models presented in this paper, we chose a kernel size of κ = 201, corresponding to highly non-local functionals. Each local density was fed into a fully connected network with SiLU [63] activation functions, and the output layer returned the local functions described in the previous sections. The use of SiLU activation functions guarantees the smoothness of the exchange-correlation potential and of its higher-order derivatives. All hidden-layer sizes were set to 32; including the window convolution, we used 4 hidden layers. The additional SWC unit for the AF shared the same hyperparameters. The network weights were optimized with Adam, using a cyclic learning-rate scheduler, a learning rate of 7 × 10^−4, and a batch size of 30. The only exception were the models trained with the xc jump in the loss function: for those, each batch consisted of all fractional densities belonging to one (randomly chosen) external potential, i.e., the batch size was effectively reduced to a single potential, while the learning rate was kept unchanged. The models were trained for 30 000 epochs, and the model with the best validation loss was selected for testing.
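The preference for SiLU over, e.g., ReLU can be illustrated directly: SiLU, x·σ(x), has a continuous derivative at x = 0 (with value 1/2), whereas ReLU's one-sided slopes there differ (0 vs 1), which would propagate kinks into the xc potential obtained by automatic differentiation:

```python
import math

def silu(x):
    """SiLU activation: x times the logistic sigmoid of x."""
    return x / (1.0 + math.exp(-x))

def relu(x):
    """ReLU activation, non-differentiable at x = 0."""
    return max(0.0, x)

def num_deriv(f, x, h=1e-5):
    """Central finite-difference derivative."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

# SiLU's derivative is continuous through 0 and equals 1/2 there.
d_silu_at_0 = num_deriv(silu, 0.0)
```

Smooth activations are what make the learned E_xc^θ (and hence v_xc^θ) differentiable everywhere with respect to the density, pushing all deliberate non-differentiability into the auxiliary function alone.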