Self-supervised learning method for solving the inverse problem of the Fokker-Planck equation

The Fokker-Planck equation (FPE) has been used in many important applications to study stochastic processes through the evolution of the probability density function (pdf). Previous studies on FPE mainly focus on solving the forward problem, which is to predict the time evolution of the pdf from the underlying FPE terms. However, in many applications the FPE terms are unknown and can only be roughly estimated, which makes solving the forward problem challenging. In this work, we take the opposite approach of starting with the observed pdfs and recovering the FPE terms using a self-supervised machine learning method. This approach, known as the inverse problem, has the advantage of requiring minimal assumptions on the FPE terms and allows data-driven scientific discovery of unknown FPE mechanisms. Specifically, we propose an FPE-based neural network (FPE-NN) which directly incorporates the FPE terms as neural network weights. By training the network on observed pdfs, we recover the FPE terms. Additionally, to account for noise in real-world observations, FPE-NN is able to denoise the observed pdfs by training the pdfs alongside the network weights. Our experimental results on various forms of FPE show that FPE-NN can accurately recover the FPE terms and that denoising the pdfs plays an essential role.


Introduction
The Fokker-Planck equation (FPE) is an important tool to study stochastic processes commonly used to model complex systems. The time evolution of variables in a stochastic process is affected by random fluctuations, which makes it impossible to derive a deterministic trajectory. However, the randomness can be accounted for by characterizing the process with the probability density function (pdf) of the random variable. As a class of partial differential equations (PDEs), FPE governs how the pdf evolves under drift and diffusion forces. FPE has a wide range of applications in many disciplines. For example, the master equation can be approximated by FPE, which is easier to solve numerically 1 . FPE has been used to link the Langevin equation to Monte Carlo (MC) methods in stochastic micromagnetic modeling, where the drift and diffusion terms of FPE derived from the MC method have been proven to be analytically equivalent to the stochastic Landau-Lifshitz-Gilbert (LLG) equation of Langevin-based micromagnetics 2 . The eigenfunctions of FPE have been used to approximate the eigenvectors of the graph Laplacian, providing a diffusion-based probabilistic interpretation of spectral clustering 3 . Beyond theory, FPE is also a common tool to model real-world phenomena. For example, it has been used to determine the first passage time from one protein conformation to another 4 , to describe the weights of deep networks when updating them using stochastic gradient descent 5 , to characterize the behavior of driver-assist vehicles in freeway traffic 6 and to model the distribution of personal wealth in socio-economics 7 . In all these applications to real-world pdfs, the FPE terms are proposed first and then validated by comparing the calculated pdf with the experimental data, which can be seen as a forward problem.
To date, many numerical approaches have been proposed to solve the forward problem of FPE, which is to calculate the pdf from given drift and diffusion terms, including the finite element method [8][9][10] , the finite difference method 11,12 , and the path integral method 13 .
Solving the forward problem of FPE can often be challenging. For many complex systems, the form and parameters of FPE are unclear. For instance, in the field of socio-economics, the coefficient of money transactions is treated as a constant, which may be an oversimplification 7 . In DNA bubble dynamics, the free energy terms have to be approximated 14 . As such, successful applications of forward modeling of FPE are limited to a few simple FPE forms. For example, the most commonly used FPE is derived from the Ornstein-Uhlenbeck process 15 , which has a linear drift term and a constant diffusion term. However, FPE can adopt more complicated forms, in which case forward modeling becomes a tedious trial-and-error process of proposing reasonable FPE terms and validating them against experimental data. Alternatively, it can be more efficient to solve the inverse problem, which is to derive the FPE terms directly from the experimental data. Besides higher efficiency in many cases, the inverse problem can be seen as a data-driven approach for scientific discovery: it requires no prior knowledge except assuming the observed data satisfies FPE, enabling us to uncover unknown mechanisms hidden behind stochastic processes.
Machine learning techniques have been promising in solving various inverse problems. In particular, in recent years, neural network models have been trained to solve inverse problems using observed data as inputs and outputs. In the field of reinforcement learning and imitation learning, the value iteration network has been proposed 16 , where the network is trained on optimal policies to recover the approximate reward and value functions of a planning computation. To the best of our knowledge, there has been one prior work using neural networks to solve the inverse problem of FPE 17 , which is based on the physics-informed neural network (PINN) technique 18 . The PINN-based method learns from observed samples of the system's pdf at various time points by incorporating FPE into the network loss function. However, this method requires prior knowledge of the form of the drift terms. Furthermore, it is applicable to only two specific cases, the Brownian process and the Lévy process. More broadly, PDE-Nets have been proposed to uncover the underlying PDE models of complex systems in a supervised learning manner 19,20 . However, PDE-Nets do not specifically handle noise in the experimental pdfs, which is an inevitable source of error. Although the noise is partially removed in PDE-Nets by smoothing the pdfs with the Savitzky-Golay (SG) filter 21 , the remaining noise is still a major error source for the uncovered PDE.
In this paper, we propose a novel neural network model to solve the inverse problem of FPE. Specifically, we directly embed FPE into the neural network architecture, where the drift and diffusion terms are represented by the trainable network weights. Our network, named the FPE-based neural network (FPE-NN), is trained in a self-supervised manner: the network predicts the distributions at neighboring time points from the input distribution at a single time point. Additionally, to denoise the experimental pdfs, we train both the network weights (representing the FPE terms) and the input (representing the experimental pdfs) in an alternating way. Our FPE-NN has two key advantages over prior works. First, FPE-NN does not assume any form of the FPE terms, making it more flexible than the PINN-based method. Second, FPE-NN can denoise the pdfs in experimental data. In fact, we find that the denoising of experimental pdfs is essential to accurately recover the FPE terms. Our experiments on simulated data show that FPE-NN is able to recover unknown FPE terms, denoise the experimental pdfs, and predict future distributions accurately.

Methods
In this work, we focus on the one-dimensional, time-independent FPE:

∂P(x,t)/∂t = ∂/∂x [g(x) P(x,t)] + ∂²/∂x² [h(x) P(x,t)]    (1)

where P : R × R → R₊ ∪ {0} is the pdf over the observable x and the time t, g : R → R is the drift term, and h : R → R is the diffusion term. In the actual implementation, P(x,t), ∂P(x,t)/∂t, g(x), and h(x) are discretized over the variable x and denoted as P(t), ∂_t P(t), g, and h, respectively. To solve the inverse problem, we focus on the scenario in which a pdf is assumed to satisfy FPE and its values are measured over some time period. However, the measured pdf, denoted as P_noisy(t), is corrupted with multiplicative noise, and the exact forms of g and h are unknown. Hence, we propose to train an FPE-based neural network model, FPE-NN, in a self-supervised way, based on P_noisy(t) only, to find g, h, and the true pdf, which is denoted as P_clean(t). We currently focus on the one-dimensional FPE and will expand to the multi-dimensional FPE in future work.
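To make the discretization concrete, the finite-difference derivative matrices and the resulting discrete right-hand side can be sketched as follows. This is our own minimal illustration in numpy, not the authors' code; the stencil choices (central differences in the interior, one-sided first derivatives at the boundary rows) are assumptions, since the paper's exact matrices are given in its Supplementary.

```python
import numpy as np

def diff_matrices(nx, dx):
    """Finite-difference matrices for d/dx and d2/dx2 on a uniform grid.

    Central differences in the interior, one-sided first derivatives at
    the two boundary rows (an assumed stencil choice).
    """
    Dx = np.zeros((nx, nx))
    Dxx = np.zeros((nx, nx))
    for j in range(1, nx - 1):
        Dx[j, j - 1], Dx[j, j + 1] = -0.5 / dx, 0.5 / dx
        Dxx[j, j - 1], Dxx[j, j], Dxx[j, j + 1] = 1 / dx**2, -2 / dx**2, 1 / dx**2
    # one-sided first derivatives at the boundaries
    Dx[0, 0], Dx[0, 1] = -1 / dx, 1 / dx
    Dx[-1, -2], Dx[-1, -1] = -1 / dx, 1 / dx
    return Dx, Dxx

def fpe_rhs(P, g, h, Dx, Dxx):
    """Discrete FPE right-hand side: dP/dt = Dx @ (g*P) + Dxx @ (h*P)."""
    return Dx @ (g * P) + Dxx @ (h * P)
```

With g = 0 and constant h = D, the right-hand side reduces to D times the discrete second derivative of P, which can be checked against the analytic second derivative of a Gaussian.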
In the following sections, we first give an overview of how to find g and h and recover P_clean(t) through an alternating training process. Next, we introduce the architecture of the network FPE-NN. This is followed by training details such as initialization, the alternating training steps, the stopping criterion, and an application of FPE-NN for predicting future distributions.

Overview of the FPE-NN training process
Generally, FPE-NN takes P(t) at one time point as the input and uses g and h as trainable weights to predict P(t) at neighboring time points. The input and the weights are trained in an alternating way. To distinguish them from the true values, the training input and weights are denoted as P̂(t), ĝ, and ĥ, respectively. As shown in Fig. 1, each iteration of the training consists of two steps. The first is the gh-training step, where we keep the updated P̂(t) constant to train ĝ and ĥ. The second is the P-training step, where we keep the updated ĝ and ĥ constant to train P̂(t). The alternating training process is repeated until the stopping criterion is met. The input and weights are not trained simultaneously because the data samples must be fed in differently when training them: ĝ and ĥ are time-independent, so the samples can be fed in batches, whereas P̂(t) is time-dependent, so the samples have to be fed one-by-one.

Figure 1.
Flowchart of using FPE-NN to train the input P̂(t) and the weights ĝ, ĥ in an alternating way. Each iteration consists of two steps, first the gh-training step and then the P-training step. The counting of iterations starts from 1, as P̂_0(t), ĝ_0, and ĥ_0 are the initialized values. In the k-th iteration, the input and weights updated in the previous iteration are P̂_{k-1}(t), ĝ_{k-1}, and ĥ_{k-1}, respectively. We first keep P̂_{k-1}(t) constant to improve the accuracy of the weights in the gh-training step, and save the trained weights as ĝ_k and ĥ_k. Then we keep ĝ_k and ĥ_k constant to improve the accuracy of the input in the P-training step, and save the trained input as P̂_k(t). Subsequently, P̂_k(t), ĝ_k, and ĥ_k are used for the next iteration.

Architecture of FPE-NN
Here we introduce our FPE-based neural network model (FPE-NN). It consists of two major parts (Fig. 2a). The central part, named the FPE Core, is designed based on FPE and takes the input P̂(t) at a certain time point t to calculate the corresponding derivative ∂_t P̂(t) with weights ĝ and ĥ. The remaining part is designed based on Euler's method and uses the input P̂(t) and the derived ∂_t P̂(t) to predict the corresponding distributions at multiple neighboring time points. The network architecture of the FPE Core follows the discrete form of FPE over the variable x at fixed time t:

∂_t P̂(t) = ∂_x (ĝ ∘ P̂(t)) + ∂_xx (ĥ ∘ P̂(t))    (2)

where ∘ is the element-wise multiplication, and ∂_x and ∂_xx are matrices that compute derivatives with a finite difference method 19 . The element-wise multiplication of two vectors is implemented by locally connecting two layers (Fig. 2b), and the multiplication of a vector by a matrix is achieved by fully connecting two layers (Fig. 2c). The weights of the locally connected layers are the FPE terms ĝ and ĥ, which are trainable. The weights of the fully connected layers are non-trainable and have fixed values that perform the derivative calculations ∂_x and ∂_xx (details are provided in the Supplementary). These two kinds of connections have no biases, only weights. Overall, the FPE Core computes the derivative ∂_t P̂(t) of the input P̂(t) at a certain time point t based on the given weights ĝ and ĥ.
In the second part, we use the input P̂(t) and the derived ∂_t P̂(t) from the FPE Core to predict the distribution at time point t + iΔt using Euler's method, where i ∈ {−n, · · · , n} and n is a hyperparameter to be tuned case-by-case; Δt is the time gap between adjacent time points. As shown in Fig. 2a, a few layers are added to perform the following calculation:

P̂_pred(t + iΔt) = P̂(t) + iΔt ∂_t P̂(t)    (3)

where P̂_pred(t + iΔt) is the predicted distribution at time point t + iΔt.
In summary, the input of the whole network FPE-NN is P̂(t), the trainable weights are ĝ and ĥ, and the outputs are the predicted distributions P̂_pred(t + iΔt), i ∈ {−n, · · · , n}. In order to form a loss function for training, P̂_pred has to be compared with target distributions. However, as the true pdf P_clean is unknown, we can only compare P̂_pred with selected noisy target distributions. We describe how the loss functions are formulated from the noisy target distributions together with the details of the alternating training steps. Furthermore, in order to alleviate the noise effect of the target distributions, FPE-NN is designed to make predictions at 2n + 1 time points, usually with n > 1. Detailed explanations are provided in the Supplementary.
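The forward computation described above (FPE Core plus Euler prediction heads) can be sketched in plain numpy. The class and variable names are ours, and the derivative matrices are assumed to be supplied externally:

```python
import numpy as np

class FPENNForward:
    """Sketch of the FPE-NN forward pass (our illustration, not the authors' code).

    The FPE Core computes dP/dt = Dx @ (g*P) + Dxx @ (h*P); the Euler layers
    then predict the 2n+1 neighbouring distributions P(t + i*dt), i = -n..n.
    """
    def __init__(self, g, h, Dx, Dxx, dt, n):
        self.g, self.h = g, h          # trainable weights (the FPE terms)
        self.Dx, self.Dxx = Dx, Dxx    # fixed, non-trainable derivative matrices
        self.dt, self.n = dt, n

    def forward(self, P):
        dPdt = self.Dx @ (self.g * P) + self.Dxx @ (self.h * P)  # FPE Core
        # one Euler jump per output head: P_pred(t + i*dt) = P + i*dt*dPdt
        return {i: P + i * self.dt * dPdt for i in range(-self.n, self.n + 1)}
```

Note that all 2n + 1 heads share the single derivative ∂_t P̂(t) computed by the FPE Core, so the predictions are antisymmetric Euler jumps around the input.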

Initialization
The initialized input P̂(t), denoted as P̂_0(t), is derived by smoothing P_noisy(t) with the Savitzky-Golay (SG) filter 21 , which partially removes the noise. The initialized weights, denoted as ĝ_0 and ĥ_0, are calculated from P̂_0(t) using a linear least-squares method (details are provided in the Supplementary). In the following two subsections, we explain how the input and weights are trained and updated separately in the k-th iteration.
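As a rough sketch of this initialization, one plausible realization (ours; the paper's exact least-squares formulation is in its Supplementary) smooths the sequence with scipy's Savitzky-Golay filter, estimates ∂_t P by central differences in time, and solves the resulting linear system for g and h:

```python
import numpy as np
from scipy.signal import savgol_filter

def init_g_h(P_seq, Dx, Dxx, dt, window=11, polyorder=3):
    """Initialization sketch: SG-smooth the noisy pdfs, then fit g and h by
    linear least squares. P_seq has shape (T, nx): pdfs at T consecutive
    time points. This is one plausible realization, not the paper's exact one.
    """
    # Savitzky-Golay smoothing along x partially removes the noise
    P0 = savgol_filter(P_seq, window, polyorder, axis=1)
    rows, targets = [], []
    for t in range(1, len(P0) - 1):
        P = P0[t]
        # dP/dt by central differences in time
        targets.append((P0[t + 1] - P0[t - 1]) / (2 * dt))
        # dP/dt = Dx @ diag(P) @ g + Dxx @ diag(P) @ h, linear in [g; h]
        rows.append(np.hstack([Dx * P, Dxx * P]))  # Dx * P == Dx @ diag(P)
    A, y = np.vstack(rows), np.concatenate(targets)
    sol, *_ = np.linalg.lstsq(A, y, rcond=None)
    nx = P_seq.shape[1]
    return P0, sol[:nx], sol[nx:]   # smoothed pdfs, g_0, h_0
```

Stacking several time points is what makes the system over-determined; at points where the pdf is close to zero the corresponding columns are nearly degenerate, which is the boundary-point issue discussed later.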

The gh-training step in the k-th iteration
At the beginning of the k-th training iteration (k = 1, 2, . . . ), P̂(t), ĝ, and ĥ have been updated in the previous iteration; the updated values are denoted as P̂_{k-1}(t), ĝ_{k-1}, and ĥ_{k-1}, respectively. For the first iteration (k = 1), we use the initialized input P̂_0(t) and weights ĝ_0 and ĥ_0. We first run the gh-training step to train ĝ and ĥ (yellow blocks in Fig. 1). Specifically, the input of FPE-NN is P̂_{k-1}(t), which is kept constant in this step, and the weights ĝ and ĥ are trainable with initial values ĝ_{k-1} and ĥ_{k-1}, respectively. The output P̂_pred is compared with P̂_{k-1}, hence one data sample consists of a single distribution P̂_{k-1}(t) as input and multiple distributions at neighboring time points P̂_{k-1}(t + iΔt) as target output, where
i ∈ {−n, · · · , n}. The loss function is:

L_gh^k(ĝ, ĥ | P̂_{k-1}(t)) = Σ_t Σ_{i=−n}^{n} ‖ P̂_{k-1}(t) + iΔt [∂_x(ĝ ∘ P̂_{k-1}(t)) + ∂_xx(ĥ ∘ P̂_{k-1}(t))] − P̂_{k-1}(t + iΔt) ‖₂²    (4)

where the superscript k indicates the k-th iteration. Because ĝ and ĥ are time-independent and shared by all data samples, we can train FPE-NN in batches by randomly picking a certain number of samples, which corresponds to the first summation (over t) in Eq. 4. The interpretation of L_gh^k is that we assume P̂_{k-1}(t) satisfies FPE and search for the optimal terms. The optimized ĝ and ĥ are saved as the updated weights of the k-th iteration:
ĝ_k, ĥ_k = argmin_{ĝ, ĥ} L_gh^k(ĝ, ĥ | P̂_{k-1}(t))    (5)

where P̂_{k-1}(t) is kept constant.
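For a single data sample, the gh-loss can be written in a few lines of numpy. This is a sketch in our notation; batching over samples and the optimizer are omitted:

```python
import numpy as np

def loss_gh(P_seq, t_idx, g, h, Dx, Dxx, dt, n):
    """Gh-loss for one data sample: squared error between the Euler
    predictions from P(t) and the stored distributions at t + i*dt, i = -n..n.
    In training this is summed over a random batch of t_idx values.
    """
    P = P_seq[t_idx]
    dPdt = Dx @ (g * P) + Dxx @ (h * P)       # FPE Core output
    loss = 0.0
    for i in range(-n, n + 1):
        pred = P + i * dt * dPdt              # Euler prediction head i
        loss += np.sum((pred - P_seq[t_idx + i]) ** 2)
    return loss
```

If the stored sequence is exactly consistent with the Euler predictions for some g and h, the loss vanishes at those values, which is the sense in which minimizing it searches for the FPE terms.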

The P-training step in the k-th iteration
After the gh-training step of the k-th iteration, the latest updated input and weights are P̂_{k-1}(t), ĝ_k, and ĥ_k, respectively. Now we run the P-training step to train P̂(t) (green blocks in Fig. 1). Specifically, FPE-NN has a trainable input P̂(t) with initial value P̂_{k-1}(t), and the weights ĝ_k and ĥ_k are kept constant in this step. The output P̂_pred is compared with P̂_0, which is the smoothed P_noisy. Therefore, one data sample has a single distribution P̂_{k-1}(t) as input and multiple distributions P̂_0(t + iΔt) as target output, where i ∈ {−n, · · · , n}. The corresponding loss function is:

L_P^k(P̂(t) | ĝ_k, ĥ_k) = Σ_{i=−n}^{n} ‖ P̂(t) + iΔt [∂_x(ĝ_k ∘ P̂(t)) + ∂_xx(ĥ_k ∘ P̂(t))] − P̂_0(t + iΔt) ‖₂²    (6)

where the superscript k indicates the k-th iteration. Because the input P̂(t) is time-dependent and specific to each sample, we can only train the samples one-by-one. The interpretation of L_P^k is that we search for the optimal P̂(t) which has the minimum difference to P̂_0(t), and hence to P_noisy(t), when it evolves according to FPE with terms ĝ_k and ĥ_k. The optimized P̂(t) is saved as the updated input of the k-th iteration:

P̂_k(t) = argmin_{P̂(t)} L_P^k(P̂(t) | ĝ_k, ĥ_k)    (7)

where ĝ_k and ĥ_k are kept constant.

Training Stopping Criterion
As described previously, the alternating FPE-NN training process uses the updated P̂(t) to improve the accuracy of ĝ and ĥ, and uses the updated ĝ and ĥ to improve the accuracy of P̂(t). Eventually, FPE-NN aims to recover the true values of P̂(t), ĝ, and ĥ. However, because the true values are unknown, an indirect metric must be chosen to evaluate the training process and used as a criterion to stop the training iterations. In this study we use the sum of the two loss functions, L_gh + L_P, as the stopping criterion: the whole training process stops when this sum converges or fails to decrease within a certain number of successive iterations.

Predicting future distributions
Once the whole training process is finished, the well-trained weights ĝ and ĥ can be used to predict the future evolution of a different pdf, denoted as P_noisy(t), which satisfies the same FPE but is not used in FPE-NN training. Specifically, the task is to use the distributions of P_noisy(t) at time points {t − rΔt, · · · , t} to predict the distributions at time points {t + Δt, · · · , t + f Δt}, where r and f are user-defined parameters. In this study, we choose r = n and f = 5. The prediction process is similar to the P-training step, which keeps the FPE-NN weights constant. First, the given distributions P_noisy(·) are smoothed with the SG filter, and the smoothed distributions are denoted by P̃_noisy(·). We then use P̃_noisy(·) as the target output to train the input P̂(t) using FPE-NN. The corresponding time distance set is {−rΔt, · · · , 0}, which is different from the one used in the FPE-training process (Fig. 1). Hence the corresponding loss function becomes:

L_app(P̂(t) | ĝ, ĥ) = Σ_{i=−r}^{0} ‖ P̂(t) + iΔt [∂_x(ĝ ∘ P̂(t)) + ∂_xx(ĥ ∘ P̂(t))] − P̃_noisy(t + iΔt) ‖₂²    (8)

where ĝ and ĥ are the well-trained weights and kept constant. By minimizing the loss value we derive the optimal P̂(t):

P̂(t) = argmin_{P̂(t)} L_app(P̂(t) | ĝ, ĥ)    (9)

Finally, the optimized P̂(t) is fed into FPE-NN again with another time distance set {Δt, · · · , f Δt}, to predict the distributions at time points {t + Δt, · · · , t + f Δt}:

P̂_pred(t + iΔt) = P̂(t) + iΔt [∂_x(ĝ ∘ P̂(t)) + ∂_xx(ĥ ∘ P̂(t))]    (10)

where i ∈ {1, · · · , f}.
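Because each Euler prediction is linear in P̂(t), the fitting step of this prediction procedure can be carried out with linear least squares instead of backpropagation. The sketch below uses this shortcut, which is our implementation choice rather than the authors':

```python
import numpy as np

def predict_future(P_past, g, h, Dx, Dxx, dt, f):
    """Fit P_hat(t) to the r+1 smoothed past distributions, then roll
    forward f Euler jumps.

    P_past: array (r+1, nx), smoothed observations at t - r*dt .. t.
    Each prediction (I + i*dt*M) @ P_hat is linear in P_hat, so the
    minimizer of the fitting loss is a linear least-squares solution
    (our shortcut; the paper trains P_hat by backpropagation).
    """
    r_plus1, nx = P_past.shape
    M = Dx @ np.diag(g) + Dxx @ np.diag(h)   # M @ P == Dx@(g*P) + Dxx@(h*P)
    blocks, targets = [], []
    for i in range(-(r_plus1 - 1), 1):       # i = -r .. 0
        blocks.append(np.eye(nx) + i * dt * M)
        targets.append(P_past[i + r_plus1 - 1])
    A = np.vstack(blocks); y = np.concatenate(targets)
    P_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
    # forward Euler jumps for i = 1 .. f
    return np.array([P_hat + i * dt * (M @ P_hat) for i in range(1, f + 1)])
```

When the past distributions are exactly consistent with the Euler model for some P̂(t), the least-squares fit recovers it and the forward predictions follow directly.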

Evaluation metrics
The training process can be evaluated with up to six metrics. The first two are the loss values, which quantify the training result based on the observation P_noisy(t). In particular, L_gh measures how well P̂(t) satisfies FPE, while L_P measures how close P̂(t) is to the measured pdf P_noisy(t).
In the cases of validating our methods with simulated data, where the corresponding true values are known, three more metrics can be used to quantify the normalized errors of P̂(t), ĝ, and ĥ:

E_P = ‖P̂(t) − P_clean(t)‖₂ / ‖P_clean(t)‖₂,  E_g = ‖ĝ − g‖₂ / ‖g‖₂,  E_h = ‖ĥ − h‖₂ / ‖h‖₂    (11)

where P_clean(t), g, and h are the true values. According to the form of the loss function L_gh (Eq.4), the gradients of ĝ and ĥ are both proportional to the corresponding P̂(t) at the same point of x. The values of P̂(t) at the boundary points of x are usually very small, hence the corresponding ĝ and ĥ at these points are difficult to train well with backpropagation. On the other hand, the boundary points are less important due to their small P(t) values. Hence, we also measure the errors of ĝ and ĥ excluding the boundary points, denoted as Ẽ_g and Ẽ_h, respectively. The boundary area is determined by applying Student's t-test 22 at each point of x: the x points whose pdf values are not significantly higher than 0.01 × max(P_noisy(t)) (p > 0.95) are classified into the boundary area. The last metric is the ability of the trained ĝ and ĥ to predict future distributions, the detailed process of which has been described previously. The normalized error of the predicted future distribution is denoted as E_test:

E_test = ‖P̂_pred(t + iΔt) − P_target(t + iΔt)‖₂ / ‖P_target(t + iΔt)‖₂    (12)

where i ∈ {1, · · · , f} indicates the number of time gaps ahead of the last known distribution, P̂_pred(t + iΔt) is the predicted distribution given in Eq. 10, and P_target(t + iΔt) is the ground truth.

Simulated data
We validate our method with simulated data. First, we solve the forward problem of FPE with chosen FPE terms and initial conditions, to generate several distribution sequences at multiple time points. Then, we assume the FPE terms are unknown and train FPE-NN to solve the inverse problem of FPE, which is to find the FPE terms based on the generated distribution sequences. The simulated data is generated from three selected FPE examples which have been used to model real-world data in different areas. For each FPE example, we generate 120 distribution sequences, and each sequence consists of distributions at 50 time points with a uniform time gap Δt. 100 distribution sequences are used to train FPE-NN and the remaining 20 sequences are used to test the ability to predict future distributions.

Example 1: Heat flux (OU process)
Our first FPE example comes from meteorology, where Lin and Koshyk 23 used FPE in climate dynamics to describe how the heat flux at the sea surface evolves:

∂P(x,t)/∂t = θ ∂/∂x [x P(x,t)] + D ∂²P(x,t)/∂x²    (13)

The process described by Eq.13 is the Ornstein-Uhlenbeck (OU) process 15 , where D and θ are the diffusion and drift coefficients, respectively. It was originally used to describe Brownian motion in molecular dynamics and has since been applied widely in various fields [24][25][26] . The forward problem of Eq.13 has an analytical solution, a time-varying Gaussian distribution:

P(x,t) = 1/(σ√(2π)) exp(−(x − μ)²/(2σ²))    (14)

where the mean μ and the standard deviation σ are:

μ = x₀ e^{−θt},  σ² = (D/θ)(1 − e^{−2θt})    (15)

where x₀ is the initial state of x at t = 0. We set D = 0.0013 and θ = 2.86, which are the empirical parameters from sea measurements 27 .
x is discretized into 110 bins with support [−0.01, 0.1]. The initial conditions are uniformly sampled from x₀ ∈ [0.04, 0.08] and t_init ∈ [0.03, 0.05], with a random pair (x₀, t_init) for each distribution sequence. We adjust the time gap Δt to 0.001 in order to maximize the differences between the distributions at different time points while keeping the P(x,t) values at the boundary points close to zero. The time gaps of the next two examples are also tuned based on the same criteria.
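The simulated sequences for this example can be generated directly from the analytical OU solution (a Gaussian whose mean relaxes toward zero and whose variance saturates at D/θ); a sketch using the parameter values from the text:

```python
import numpy as np

def ou_pdf(x, t, x0, D=0.0013, theta=2.86):
    """Analytical OU solution: a Gaussian whose mean relaxes to 0 as
    x0*exp(-theta*t) and whose variance saturates at D/theta."""
    mu = x0 * np.exp(-theta * t)
    var = (D / theta) * (1.0 - np.exp(-2.0 * theta * t))
    return np.exp(-(x - mu)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)
```

A 50-point sequence is then obtained by evaluating `ou_pdf` on the x grid at t_init + kΔt for k = 0, ..., 49.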

Example 2: DNA bubble (Bessel process)
The next FPE example is from biology, where FPE has been used to model the dynamics of DNA bubbles formed due to thermal fluctuations 14 . The observable x(t) is the bubble length, and its pdf satisfies Eq. 16, which describes a generalized Bessel process with many other applications 28 . In the DNA bubble study, μ is estimated as 1 and ε is approximated as 2(T/T_m − 1), where T is the environment temperature and T_m is the melting temperature, which is determined by the DNA sequence. In this work, we set μ = 1 following the DNA bubble study and arbitrarily choose ε = 0.2. Eq. 16 has a very complicated analytical solution 28 , so for convenience we generate data by solving the forward problem with the 4th-order Runge-Kutta method for temporal discretization. x is discretized into 100 bins with support [0.1, 1.1]. The initial condition is a normal distribution whose mean μ and deviation σ are randomly sampled from [0.4, 0.8] and [0.05, 0.1], respectively. We adjust the time gap Δt to 0.001, based on the same criteria as mentioned before.
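The 4th-order Runge-Kutta time stepper used here for data generation is the classical one; a minimal sketch for a spatially discretized right-hand side (the `rhs` callable would be the discrete FPE operator):

```python
import numpy as np

def rk4_step(P, rhs, dt):
    """One classical 4th-order Runge-Kutta step for dP/dt = rhs(P),
    where rhs is the semi-discretized FPE right-hand side."""
    k1 = rhs(P)
    k2 = rhs(P + 0.5 * dt * k1)
    k3 = rhs(P + 0.5 * dt * k2)
    k4 = rhs(P + dt * k3)
    return P + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
```

Repeated application from the sampled initial Gaussian yields the distribution sequence; the 4th-order accuracy keeps the temporal truncation error negligible compared with the added measurement noise.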

Example 3: Agent's wealth
The final FPE example comes from economics, where Cordier et al. 7 use FPE to describe the wealth distribution of agents. An agent is a term in economics referring to an individual, company, or organization that influences the economy through producing, buying, or selling. The observable x(t) is the agent's wealth and its pdf satisfies Eq. 17. Eq. 17 has no analytical solution, and we solve the forward problem using the 4th-order Runge-Kutta method. We set λ = 0.4 and m = 1 following Cordier's report 7 .
x is discretized into 100 bins with support [0, 1.0]. Similar to the DNA bubble example, the initial condition is a normal distribution whose μ and σ are randomly sampled from [0.3, 0.7] and [0.05, 0.1], respectively. The time gap Δt is adjusted to 5 × 10⁻⁴, based on the same criteria mentioned in the heat flux example.

Noise addition
After we generate the distribution sequences (denoted as P_clean) by solving the forward problem of FPE as described previously, noise is added to the distributions to model real-world scenarios:

P_noisy(x,t) = P_clean(x,t)(1 + W ξ)    (18)

where ξ is a random number sampled from the standard normal distribution N(0, 1), and W is a constant which is adjusted to ensure that the corresponding E_P of P_noisy(x,t) is approximately 0.01. Simulated data with higher corresponding E_P is also tested using our method, and the results are provided in the Supplementary.
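A sketch of the multiplicative noise addition; the closed-form calibration of W to hit a target E_P exactly for the drawn noise is our own convenience trick, whereas the paper simply tunes W:

```python
import numpy as np

def add_noise(P_clean, target_EP=0.01, rng=None):
    """Add multiplicative Gaussian noise P*(1 + W*xi) and calibrate the
    amplitude W so that the normalized error E_P equals target_EP
    (the calibration is our own convenience; the paper tunes W)."""
    if rng is None:
        rng = np.random.default_rng()
    xi = rng.standard_normal(P_clean.shape)
    noise = P_clean * xi                      # multiplicative noise direction
    W = target_EP * np.linalg.norm(P_clean) / np.linalg.norm(noise)
    return P_clean * (1.0 + W * xi)
```

Since P_noisy − P_clean = W · P_clean ∘ ξ, this choice of W makes ‖P_noisy − P_clean‖₂ / ‖P_clean‖₂ equal to target_EP by construction.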

Results
As described previously, we generate the simulated data by solving the forward problem of FPE with given FPE terms and initial conditions. After that, we assume the FPE terms are completely unknown, and train FPE-NN to solve the inverse problem, which is to find the FPE terms based on the simulated data. A typical training process with the data from the agents' wealth example is shown in Fig. 3. L_P decreases rapidly in the first few iterations, while L_gh decreases at a slower rate. Both curves become almost flat after the 25th iteration and decrease very slowly. The curve of E_P generally coincides with that of L_P, with a turning point at the 25th iteration. In contrast, E_g and E_h decrease continuously during the iterations. The result demonstrates that with the alternating training process, the input and weights help each other gradually recover the true values over the iterations. We also found that further improvements in the accuracy of ĝ and ĥ beyond a certain point have a minor effect on the accuracy of P̂(t); on the other hand, ĝ and ĥ are more sensitive to minor improvements in P̂(t). We also test E_test at several selected iterations, as shown in Fig. 3c. The result suggests that predictions over a larger time gap are more sensitive to the accuracy of ĝ and ĥ. However, further improvement in ĝ and ĥ after a certain point has little effect on future distribution prediction. Overall, the learning curves demonstrate that our self-supervised method can train P̂(t), ĝ, and ĥ in an alternating way and let them gradually recover the true values.
The training results of the three different FPE examples are summarized in Table 1. Generally, P̂(t), ĝ, and ĥ have been well trained using FPE-NN on all three simulated datasets. The noise in the trained P̂(t) is significantly suppressed, as the final noise ratio E_P can reach below 0.002, whereas the corresponding E_P for P_noisy(t) is around 0.01. The trained ĝ and ĥ also fit well with their true values. Fig. 4 shows that the final values of ĝ and ĥ (red diamonds), excluding the boundary points, are quite close to the true values (black line). In contrast, if we do not train the input but just smooth P_noisy(t) with the SG filter, the optimized weights ĝ_1 and ĥ_1 (blue crosses) have much higher error. Hence, further reduction of the noise in P_noisy(t) is essential to improving the accuracy of the trained ĝ and ĥ.
We also examined how the training result is affected by the number of output distributions, i.e. the hyperparameter n in Eq.4 and Eq.6. As demonstrated in the Supplementary, n > 1 helps to suppress the noise effect on P̂(t). However, it is not always better to use more neighboring time points, because the local truncation error from Euler's method may become too large to be ignored. Therefore n is a hyperparameter which needs to be tuned case-by-case. In this study, we tested four different values of n for each FPE example. As shown in Table 1, L_gh and L_P always increase when n increases, because of the larger truncation error caused by the larger time distance. However, the errors of P̂(t), ĝ, and ĥ are not monotonically increasing or decreasing, and different datasets have different best n within the four options.

Figure 4. ĝ_1 and ĥ_1 are the weights trained in the first iteration; they are the optimal FPE terms that can be found based on the smoothed P_noisy(t), i.e. P̂_0(t). The figure shows that ĝ_1 and ĥ_1 are significantly different from their true values, suggesting that merely smoothing P_noisy(t) is insufficient; further reducing the noise by training P̂(t) with FPE-NN is essential for optimizing ĝ and ĥ.

Discussion
In this study, we proposed a new method to find the FPE terms from measured noisy pdfs, without any prior knowledge except assuming the pdfs satisfy FPE. Solving such an inverse problem of FPE can help us study stochastic processes more efficiently. We designed our network FPE-NN based on FPE and Euler's method. The input of FPE-NN is the distribution at a time point, the trainable weights are ĝ and ĥ representing the FPE terms, and the outputs are the distributions at several neighboring time points. In an alternating way, we use FPE-NN to train the input and the weights separately, forcing them to recover their true values. By training the input we successfully reduce the noise in the measured pdfs, which is a key factor for finding accurate FPE terms. We validated our method with simulated data from three typical FPE examples. The results show that the final trained input and weights all agree well with the true values.
Although we use a uniform grid to generate the simulated data in this paper, FPE-NN can also be used, with minor changes, in cases where the spatial gap Δx and the time gap Δt are not uniform. As shown in the Supplementary, the matrices that perform the derivative calculations ∂_x and ∂_xx can be easily modified to accommodate a non-uniform Δx. For a non-uniform Δt, we could treat the time distance set (Fig. 1) as part of the input, making it flexible and specific to each data sample.
Currently, we focus on the one-dimensional and time-independent FPE, but our method can be extended to higher spatial dimensions. In future work, we will try to solve the inverse problem of the time-dependent FPE. Furthermore, our method performs well when the initial noise ratio (E_P) is around 0.01, but the recovered FPE terms become less accurate for data with a higher noise ratio. Hence, another possible future direction is to improve the noise robustness of the proposed method.