Deep neural networks for high harmonic spectroscopy in solids

Neural networks are a prominent tool for identifying and modeling complex patterns, which are otherwise hard to detect and analyze. While machine learning and neural networks have been finding applications across many areas of science and technology, their use in decoding ultrafast dynamics of quantum systems driven by strong laser fields has been limited so far. Here we use deep neural networks to analyze simulated noisy spectra of the highly nonlinear optical response of a 2-dimensional gapped graphene crystal to intense few-cycle laser pulses. We show that a computationally simple 1-dimensional system provides a useful "nursery school" for our neural network, allowing it to be easily retrained to treat more complex systems, recovering the band structure and spectral phases of the incident few-cycle pulse with high accuracy, in spite of significant amplitude noise and phase jitter. Our results both offer a new tool for attosecond spectroscopy of quantum dynamics in solids and also open a route to developing all-solid-state devices for complete characterization of few-cycle pulses, including their nonlinear chirp and the carrier envelope phase.

Electrons provide the fundamental first step in the response of matter to light. The feasibility of shaping light pulses at the scale of individual oscillations, from mid-IR to UV [1,2], offers rich opportunities for controlling electronic response to light on sub-cycle timescales (e.g. [3][4][5][6][7][8][9][10][11][12][13]), leading to a variety of fascinating phenomena such as the optically induced anomalous Hall effect [14][15][16], topological phase transitions with polarization-tailored light [13], or the topological resonance [7]. Over multiple laser cycles, control of electron dynamics with light also enables so-called Floquet engineering, the tailored modification of the cycle-averaged properties of a light-dressed system; see e.g. [17] for a recent review.
In this context, starting with the pioneering work [18], high harmonic spectroscopy has developed into a powerful tool for exploring laser-driven electron dynamics in solids, see e.g. recent reviews [19][20][21]. Examples include identification of the common physical mechanisms underlying high harmonic generation in atoms, molecules and solids (e.g. [10,22]), observation of Bloch oscillations [23], resolving interfering pathways of electrons in crystals with about 1-fsec precision [24], inducing [13] and monitoring topological [25][26][27] and Mott insulator-to-metal [28] phase transitions, resolving coherent oscillation of electronic charge density order [29], identifying the van Hove singularities in the conduction bands [30], and reconstructing effective potentials seen by the light-driven electrons with picometer accuracy [9].
Here we apply machine learning to the analysis of high harmonic generation from a crystal, which allows us to 'kill two birds with one stone': reconstruct the band structure of the crystal and fully characterize incident few-cycle laser pulses, including both their nonlinear chirp and the phase of the carrier oscillations under the envelope (CEP).
The fundamental role of the CEP in nonlinear light-matter interaction has been understood theoretically in [31][32][33][34], stimulating first experiments in the gas phase [35]. Powerful gas-phase methods for characterizing few-cycle pulses have been developed, including stereo above-threshold ionization (stereo-ATI) [36][37][38], the attosecond streak camera and its modifications [39][40][41][42][43], and half-cycle high harmonic cutoffs [44]. Using the nonlinear response of solids for characterizing the CEP has also been pursued [45][46][47]. Yet, all-optical, all-solid-state characterization of few-cycle pulses, including their CEP, remains a challenge. We hope to change this situation. Particularly relevant to our work are the earlier proposal on using interference patterns in spectrally overlapping regions of even- and odd-order harmonics in solids [48] and the use of two-color high harmonic spectroscopy for all-optical reconstruction of the band structure [49] from the two-dimensional high harmonic spectra, recorded as a function of the harmonic frequency and the two-color delay.
Two-dimensional spectra of the nonlinear-optical response may provide sufficient information to recover the pulse. One prominent example is frequency-resolved optical gating, which uses the second-order optical response recorded as a function of the time delay between the two incident pulses, the target pulse and the auxiliary gate pulse (e.g. [41,50,51]). Extending this analysis to highly nonlinear optical response remains an open problem. The crucial importance of addressing this problem stems from the fact that such analysis would allow one to characterize the laser pulse directly in the interaction region.
In special cases, such as the case of the two-color high harmonic spectroscopy of attosecond pulses using fundamental and the second harmonic (e.g. [52]), the 2D harmonic spectra recorded as a function of the two-color phase and the harmonic frequency may carry sufficient information for reconstructing attosecond pulses as they are produced, directly in the interaction region. Such reconstruction does, however, require detailed understanding of the physics of the microscopic quantum response. The possibility of solving a full reconstruction problem in a general case, recovering both the pulse and the quantum system, remains completely unexplored.
When there is no simple and/or well-known functional dependence between the response data and the parameters one wishes to reconstruct, the problem is well suited for neural networks. Such networks aim to find a smooth analytical function f_θ(x) which connects the inputs x_i and the desired outputs y_i. If this is successful, one can conclude that the data x indeed contain the information y. Pertinent examples include applications to solving the Schrödinger equation, where neural networks can output highly accurate results [53,54].
Our results show that, given a sufficient training set, the 2D spectra of the high harmonic response as a function of the nonlinear response frequency and the CEP of an unknown driving pulse allow for simultaneous reconstruction of both the pulse and the unknown crystal band structure, even in the presence of substantial noise, including CEP jitter.
To demonstrate the method, we assume no a priori knowledge about the incident pulse and use rather limited knowledge about the nonlinear medium. For the quantum system, we begin with a modified Rice-Mele model [55] with nearest neighbor, next nearest neighbor, etc. hoppings, see Figure 1(a) (and Supplementary information for further details). Both the on-site energies and the couplings are assumed to be unknown. The reconstruction procedure is expected to output both the parameters of the pulse and of the lattice. We are thus faced with a nonlinear optimization problem with a very large input dimension, which requires finding an optimal interpolation between the existing trial samples.
Such a regression tool is provided by deep neural networks (DNNs) [56], already used for such diverse applications as boosting the signal-to-noise ratio in LHC collision data [57], establishing a fast mapping between galaxy and dark matter distribution [58], and constructing efficient representations of many-body quantum states [59]. The inherent resilience of neural networks to noise is an important asset for pulse shape characterization. The emergence of photonic implementations of feed-forward neural networks [60] outlines a perspective of implementing this regression scheme in an all-optical way.
The vector potential of the incident laser field, A(t), is generated in the frequency domain with quadratic and cubic spectral phases that are unknown to the neural network: A(ω) ∝ exp(−µ(ω − ω_0)²/2) exp(−iλ(ω − ω_0)³) e^{iϕ}, where ω_0 is the carrier frequency. The complex parameter µ is defined in such a way that, in the absence of the cubic chirp, the pulse has a temporal width σ ≡ T_0: µ ≡ (σ² − iα)/(1 + α²/σ⁴), where α is the chirp parameter. The chirp parameters α and λ are expressed via dimensionless quantities, with α = π(σ/2)²β for the quadratic phase and an analogous dimensionless parameter for the cubic phase. Finally, ϕ sets the CEP of the pulse, also unknown to the neural network and chosen randomly. The system evolution is simulated for 40 laser cycles; an additional Gaussian cutoff is introduced at the leading and trailing edges of the pulse for numerical stability (see Supplementary information).
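This frequency-domain construction can be sketched numerically. The following Python/NumPy snippet is illustrative only (the paper's actual code is written in Julia, and all function names and parameter values here are hypothetical): it builds a Gaussian spectrum with quadratic and cubic spectral phases, using the complex parameter µ defined above, and inverse-Fourier-transforms it to the time domain.

```python
import numpy as np

def vector_potential(omega0=1.0, sigma=40.0, alpha=0.0, lam=0.0, phi=0.0,
                     n=4096, dt=0.5):
    """Synthesize A(t) from a Gaussian spectrum with quadratic (alpha) and
    cubic (lam) spectral phases and CEP phi. Returns the complex analytic
    signal; the physical field is its real part."""
    t = (np.arange(n) - n // 2) * dt
    omega = 2 * np.pi * np.fft.fftfreq(n, d=dt)
    dw = omega - omega0
    # Complex Gaussian parameter combining width and quadratic chirp,
    # chosen so that |A(t)| keeps temporal width sigma for any alpha.
    mu = (sigma**2 - 1j * alpha) / (1 + alpha**2 / sigma**4)
    spec = np.exp(-mu * dw**2 / 2 - 1j * lam * dw**3 + 1j * phi)
    return t, np.fft.fftshift(np.fft.ifft(spec))
```

By construction, the temporal envelope width is unchanged by the quadratic chirp, which can be checked by computing the second moment of |A(t)|² for chirped and unchirped parameters.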
The typical pulses we have used for reconstruction are shown in Fig. 2, both in the time and frequency domains. The latter shows the spectral phases (red curves) alongside the spectral amplitudes (blue curves), as a function of ω/ω_0, where ω_0 is the carrier. Our typical simulated "measurement" assumes that one can systematically vary the (unknown) CEP. Thus, we perform calculations by varying ϕ + ∆ϕ, with ∆ϕ spanning the full range ∆ϕ ∈ [0, 2π). For each ∆ϕ, we measure the absolute value of the spectral amplitude of the laser-induced current, |j(ω)|, which is given by the Fourier transform of the calculated current j(t).
The input data is composed of the absolute values of the integer harmonic amplitudes |j(Nω, ϕ + ∆ϕ)|. The resulting 2D map as a function of ∆ϕ and ω is used as the input into the neural network. The network must then infer the (randomly chosen in each trial) intraband hoppings {t_j}, the unknown initial CEP ϕ, and the pulse parameters α and λ.
The values of σ, the carrier frequency ω_0, and h_1 are kept constant throughout. For more information about the inputs used, see Supplementary information.
In the reconstruction procedure, we have used two separate neural networks, one for recovering the band parameters, and another for recovering the CEP and pulse parameters.
Both are constructed using the same architecture, consisting of 4 convolutional layers and two fully-connected ones, see Fig. 3. Constructing them as a single neural network with a split fully-connected layer impedes the overall performance of the scheme. To speed up training, we train networks to work on more complex datasets (e.g. with chirped pulses) by starting from a network pre-trained on a simpler dataset (e.g. with no chirp). The rest of the training procedure remains the same, but the overall number of epochs is reduced to 50.
The recovery results are demonstrated in Fig. 4. The number of unknown band parameters is set to 4. The agreement between the actual and the reconstructed data, both for the system and for the pulse, is excellent. Tables with detailed analysis of the average performance are given in the supplementary material.
After demonstrating that our neural network recovers the band structure and pulse parameters of the simple source (Rice-Mele) model with high accuracy, we apply our approach to more complex systems. As an example, we used the two-dimensional gapped graphene system. Its parameters (on-site energy and first-neighbor hopping) were generated in the vicinity of the experimental values for hBN, within 4.0 to 7.27 eV and 0.08 to 0.16 a.u., respectively. We simulated its response by integrating the semiconductor Bloch equations. To simulate experimental conditions, we applied noise using the same procedure as described above.
Due to the greater computational cost, we could only use 1280 samples as opposed to 65536 for the source problem. Such a dataset is too small to train an entire new network.
Instead, we applied the transfer learning approach by using the networks pre-trained on datasets with no quadratic or cubic phase, varying the CEP, and 4 unknown band parameters, and partially retraining them (see Supplementary). We discovered that, in spite of the greater complexity of the underlying physical system, such a setup recovers the CEP with a precision comparable to the original problem (see Figs. 5), and achieves good relative accuracy on the band parameters.
This adds new significance to the achieved results. Indeed, we have now demonstrated that the spectra from the initial model, while only resembling a real-world setup in a qualitative sense, provide a useful training ground for the neural network, allowing it to acquire useful abstract concepts (parameterized by the deeper layers) which it can later apply to more practical problems, for which generating large numbers of training samples is harder.
Our results demonstrate that solid-state HHG spectra contain more information than one extracts with conventional methods: not only the pulse parameters, but also the parameters of the quantum system are robustly reconstructed. Our approach replicates the advantages of gas-phase HHG spectroscopy, namely, the ability to resolve the CEP of laser pulses and the complex pulse shapes with polynomial spectral phase nonlinearities. At the same time, it requires neither XUV pulses nor photoelectron spectroscopy, such as stereo-ATI. Moreover, it allows for an all-optical solid-state implementation. In terms of required observables, it is closest to the method based on measuring the half-cycle cutoffs in gas-phase high harmonic generation [44]. However, it also allows one to deal with very strong chirps. Applying neural networks to the analysis of half-cycle cutoffs and high harmonic generation spectra in the gas phase could be very interesting, especially in molecules, where multiple coupled harmonic generation channels present challenges for unravelling the underlying laser-driven multi-electron dynamics [62].
One could apply the developed neural network design to standard gas-phase experiments, to analyse the possibility of resolving the spectral phase and the CEP for short pulses by processing HHG spectra generated by known inert gases (Ar, Ne, etc.). In this case the neural network can be trained using TDSE simulations of the necessary responses before being applied to real experimental data.
The key difficulty of using high harmonic spectroscopy in solids is that, without a priori knowledge of the band structure, one lacks closed-form solutions for electron dynamics similar to those available in the gas phase. Our method circumvents this difficulty. A pulse characterization device implementing our principles could be tabletop, all solid-state, and capable of operating at ambient conditions.
Another interesting direction to pursue would be to apply novel physics-informed neural network architectures [63,64] to resolve Hamiltonians of systems with many degrees of freedom (such as molecules) using time-resolved HHG spectra, such as those obtained from solids driven by mid-IR fields [24]. Neural networks can also be used for processing the sets of harmonic spectra connected by other relations, such as being measured for different angles between the crystal axes and the driving field, to uncover effective laser-modified potentials for the charge motion, extending the pioneering work in Ref. [9] to recover effective potentials of active band electrons. Here, once again, one can take advantage of our idea of using neural networks to extend a method of processing analytically-tractable systems to intractable ones, recovering effective structures.

III. DISCLOSURES
The authors declare no competing interests.

IV. DATA AVAILABILITY
The generated datasets, trained neural network weights, and reconstructed values are available in Dataset 1 (Ref. [65]). The codes running the TDSE simulation and neural network reconstruction of parameters were written in the Julia language [66] and are available publicly on GitHub [67].

I. MODIFIED RICE-MELE MODEL
The system is described by the Hamiltonian

Ĥ = Σ_n Σ_{α=A,B} Σ_{j=1}^{N} s_α t_j ( ĉ†_{α,n+j} ĉ_{α,n} + h.c. ) + h_1 Σ_n ( ĉ†_{A,n+1} ĉ_{B,n} + h.c. ).

Here s_α = −1 for α = A and s_α = +1 for α = B, ĉ†_{α,n} creates a particle on sublattice α in cell n, and N is the maximum hopping order. The Hamiltonian also includes a term proportional to h_1, which describes hopping between sites A and B. This term explicitly breaks the inversion symmetry of the model, so that its nonlinear-optical response allows us to distinguish between pulses with CEPs differing by π. We set h_1 to be much smaller than the typical values of t_j; we used h_1 = 0.01 eV and h_1 = 0.05 eV. This simplifies modelling relaxation. This system allows one to generate a rich variety of band structures and band gaps ε_g(k), see Fig. 1 (b-d) for some examples.
When we set the lattice constant a = 1, the k-space period is k_c = 2π, and the field-free Hamiltonian transforms to

Ĥ(k) = ε(k) σ_z + h_1 (σ_x cos k + σ_y sin k).

The field is introduced via the Peierls substitution, which transforms the laser-driven Hamiltonian in momentum space as Ĥ(k, t) = Ĥ(k + A(t)). The system is thus decomposed into an ensemble of non-interacting two-level systems, characterized by the canonical momentum k. We evolve each TLS using the master equation with a decoherence term:

ρ̇ = −i[Ĥ(k + A(t)), ρ] − ρ_od / T_2.

Here ρ_od denotes the off-diagonal part of the density matrix in the Hamiltonian eigenstate basis, and T_2 is the dephasing time, set to T_0/2, where T_0 is a single fundamental laser period.
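A minimal Python sketch of this two-level decomposition follows (illustrative only — the paper's code is in Julia, and the explicit form ε(k) = 2 Σ_j t_j cos(jk) for the σ_z coefficient is our assumption for this example):

```python
import numpy as np

sx = np.array([[0, 1], [1, 0]], complex)
sy = np.array([[0, -1j], [1j, 0]], complex)
sz = np.array([[1, 0], [0, -1]], complex)

def hamiltonian(k, t_hop, h1):
    """Field-free Bloch Hamiltonian; eps(k) = 2 * sum_j t_j cos(j k) is our
    illustrative choice for the sigma_z coefficient."""
    eps = 2.0 * sum(tj * np.cos((j + 1) * k) for j, tj in enumerate(t_hop))
    return eps * sz + h1 * (np.cos(k) * sx + np.sin(k) * sy)

def master_rhs(rho, H, T2):
    """Right-hand side of drho/dt = -i[H, rho] - rho_od / T2, where rho_od is
    the off-diagonal part of rho in the eigenbasis of H."""
    _, v = np.linalg.eigh(H)
    rho_eig = v.conj().T @ rho @ v
    od = rho_eig - np.diag(np.diag(rho_eig))
    return -1j * (H @ rho - rho @ H) - (v @ od @ v.conj().T) / T2
```

Note that the dephasing term leaves the trace of ρ invariant and vanishes for any eigenstate projector of Ĥ, as it must for a pure decoherence term.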

II. DRIVING PULSES IN TIME DOMAIN
In the time domain, the pulse shape corresponding to the frequency-domain representation in the main text can be written in closed analytical form in terms of the Airy function: the cubic spectral phase turns the Gaussian envelope into an Airy-type envelope. The normalizing factor in the envelope function is introduced to keep the peak absolute value of the envelope at 1 when there is no cubic phase, and to conserve the total pulse power at all parameter values.
For numerical stability, we introduce a cutoff function that ensures that A(t) = 0 at the beginning and end of the pulse; the numerical vector potential is then the product of the analytical form and this cutoff function.

III. EVALUATION OF THE AIRY FUNCTION
We have approximated the Airy function required for computing the pulse shape in (7) with the Maclaurin series [1, Eq. 9.4.1] and the asymptotic series [1, Eq. 9.7.5, 9.7.9] for small and large arguments, respectively. For z < 3.5 we used the Maclaurin series up to N = 25; for z ≥ 3.5, the asymptotic expansion for N = 5 was used. This approximation keeps the maximum relative error below δ = 3·10⁻⁴.
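A sketch of this two-branch evaluation in Python (the switch-over point follows the text; the Maclaurin recurrence a_{n+3} = a_n/((n+3)(n+2)) follows from the Airy equation y'' = zy, and the number of asymptotic correction terms here is illustrative):

```python
import numpy as np

def airy_ai(z, nterms=25):
    """Ai(z) for z >= 0: Maclaurin series for z < 3.5, asymptotic series above."""
    if z < 3.5:
        # From y'' = z*y: a_{n+3} = a_n / ((n+3)(n+2)); only the chains
        # seeded by a_0 = Ai(0) and a_1 = Ai'(0) are nonzero.
        a0, a1 = 0.3550280538878172, -0.2588194037928068
        t0, t1 = a0, a1 * z
        total = t0 + t1
        for k in range(nterms):
            n0, n1 = 3 * k, 3 * k + 1
            t0 *= z**3 / ((n0 + 3) * (n0 + 2))
            t1 *= z**3 / ((n1 + 3) * (n1 + 2))
            total += t0 + t1
        return total
    # DLMF 9.7.5: Ai(z) ~ exp(-zeta)/(2 sqrt(pi) z^(1/4)) * sum (-1)^k u_k / zeta^k
    zeta = (2.0 / 3.0) * z**1.5
    s, u = 1.0, 1.0
    for k in range(1, 5):
        u *= (6 * k - 5) * (6 * k - 3) * (6 * k - 1) / ((2 * k - 1) * 216.0 * k)
        s += (-1) ** k * u / zeta**k
    return np.exp(-zeta) * s / (2.0 * np.sqrt(np.pi) * z**0.25)
```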

IV. SOLUTION OF THE TIME DEPENDENT SCHRÖDINGER EQUATION.
We discretize the 1D Brillouin zone of the crystal into 32 k points (a finer discretization does not improve the NN's performance). For each point, we initialize the system in the pure ground state of the Hamiltonian (2) for A(t_0) = 0, then evolve this state according to (5).
The dephasing time T_2 = T_0/2 is selected phenomenologically. For all computations, we record the current 80 times per laser cycle, computed as ⟨j⟩_k(t) = tr(ρ_k(t) Ĵ(k + A(t))) (11). In the above equation, we neglect the overall 2π/N_k (N_k = 32) factor.
Each "recording" interval is separated into 2 integration steps, and each integration step consists of ν = 4 iterations of the propagation procedure. Within each integration step, the field A(t) is kept constant. The matrix exponential is computed numerically exactly under the assumption that Ĥ(k) is traceless.
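For a traceless Hermitian 2 × 2 matrix, the exponential has a two-term closed form, exp(−iĤδt) = cos(hδt)·1 − i sin(hδt)·Ĥ/h with h = √(tr Ĥ²/2), since Ĥ² = h²·1. A Python sketch of this property (assuming it is what the exact evaluation relies on; the paper's code is in Julia):

```python
import numpy as np

def expm_traceless(H, dt):
    """exp(-i H dt) for a traceless Hermitian 2x2 matrix H. Since H^2 = h^2 * I
    with h = sqrt(tr(H @ H) / 2), the exponential series truncates to two terms."""
    h = np.sqrt(np.real(np.trace(H @ H)) / 2.0)
    if h == 0.0:
        return np.eye(2, dtype=complex)
    return np.cos(h * dt) * np.eye(2) - 1j * np.sin(h * dt) * H / h
```

The result is exactly unitary up to floating-point error, which is what makes this evaluation "numerically exact".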

V. NEURAL NETWORKS
Here we first give a very brief general introduction into neural networks, and then discuss applications to our particular problem.
Suppose we search for a model that best describes a known experimental data set (x_i, y_i) with K total samples. Generally, the workflow is: 1. Define the model as a trial function f_θ(x) that depends on a vector of parameters θ. 2. Define a divergence (loss) function D that quantifies the mismatch between the model predictions f_θ(x_i) and the data y_i. 3. Optimize the parameters θ to minimize the divergence D.
The most basic example of such an approach is linear regression: we define the model as the linear function f_θ(x) = Wx + b and minimize the mean squared error. In this simple case the optimal parameters can also be obtained analytically, and, obviously, the result of the numerical optimization must be the same, which can be verified directly.
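This equivalence is easy to check numerically. A toy Python example (hypothetical data) compares gradient descent on the quadratic loss with the analytic least-squares solution:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=100)
y = 3.0 * x + 1.0 + rng.normal(0.0, 0.1, 100)   # noisy line, slope 3, offset 1

# Analytic least-squares solution
A = np.vstack([x, np.ones_like(x)]).T
w_analytic, *_ = np.linalg.lstsq(A, y, rcond=None)

# Gradient descent on the same mean-squared-error loss
w = np.zeros(2)
for _ in range(5000):
    w -= 0.05 * 2.0 * A.T @ (A @ w - y) / len(y)
```

Both routes converge to the same slope and offset, as the loss surface is a single quadratic bowl.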
Evidently, linear regression cannot model most data sets. The simple nonlinear regression methods generally rely on fitting the experimental data to an ansatz selected by hand. Such methods include logistical regression, polynomial regression, and the more exotic expansions over various bases such as the Legendre polynomials and harmonic functions. However, selecting the correct nonlinear regression requires extensive trial-and-error testing of various ansätze, which is generally not possible except for simple cases with known symmetries and properties of the experimental data set. A convenient solution to this complication is using neural networks, which serves as a universal ansatz by being capable of approximating any continuous finite function to an arbitrary precision [2]. This property is the key advantage of neural networks, which allows them to be successfully applied even to data sets where the very existence of a connection between the input and the output is not evident.
We will now discuss the core principles of a neural network. A neural network is constructed of building blocks called neurons. The neuron, demonstrated schematically in Fig. 1(a), is defined as follows: it receives an input vector x, which is multiplied by a vector of weights W and shifted by a bias b; the output is computed as a = f(Wx + b), where f is a nonlinear activation function. In a neural network, the neurons are arranged in layers, where each neuron has the same input vector x and the same activation function f. For such a layer, the bias b is expressed by a vector, the weights Ŵ by a matrix, and the output is given by a = f(Ŵx + b), where f is applied individually to each argument. The simplest neural network, called the single-layer perceptron, consists of only two such layers, and is schematically demonstrated in Fig. 1(b). The first (hidden) layer contains all the nonlinearities, and gets the raw data as the input. The second (output) layer may or may not be linear, i.e. have an identity activation function; in the linear case it simply applies a learnable linear transform to the outputs of the hidden layer.
The universal approximation theorem [2] states that any continuous finite function can be approximated by a perceptron with a sufficiently large hidden layer with a non-polynomial activation function. However, for many cases this turns out to be computationally nonoptimal. A more efficient representation that requires fewer parameters can be achieved by using stacked layers, with the output of the k-th layer being the input for the (k+1)-th. A network constructed in such a way is commonly named "fully connected". For an N-layer deep fully connected network, the output reads a = f^[N](Ŵ^[N] f^[N−1](… f^[1](Ŵ^[1] x + b^[1]) …) + b^[N]). The general heuristic states that the higher layers process higher-level features, and thus can learn highly nonlinear and nonlocal patterns. A 'dual' version of the universal approximation theorem [3,4] demonstrates that an arbitrary continuous function can also be approximated by a neural network with a fixed (but not arbitrarily small) number of neurons per layer and a sufficiently large number of layers. Figure 2 shows the general scheme of a deep fully connected neural network. Deep neural networks can be trained in an optimal way thanks to a procedure called 'backpropagation' [5], which allows one to efficiently differentiate the loss function with respect to each of the parameters θ ≡ {Ŵ^[1], b^[1], …, Ŵ^[N], b^[N]}. This enables optimizing them with the gradient descent procedure, which consists of updating θ against the gradient of the divergence function, called the 'loss' for neural networks: δθ ∝ −∇_θ Σ_i L(f_θ(x_i), y_i), ideally yielding a function that minimizes the error for each sample simultaneously. However, 'naive' gradient descent is vulnerable to noise and local minima, and in practice more sophisticated methods are employed, such as ADAM [6]. ADAM corrects the gradient descent update using adaptive momenta of the parameters, which helps prevent them from getting stuck in a local minimum.
To the same end, the input data is separated into 'batches' that consist of a defined number of samples. The parameters are updated after processing each batch. The process of optimizing the FFNN parameters, or 'training' the FFNN, therefore consists of the following steps.
1. Select a batch of data as input.
2. (forward pass) Evaluate the output of each consecutive layer to obtain the NN output.
3. Compute the loss function between the produced output and the desired output.

4. (backward pass) Calculate the gradient of the loss function with respect to each parameter by propagating the loss gradient backwards from the last to the first layer.
5. Compute the update to the parameters using a parameter update procedure (gradient descent, ADAM, ADAGrad, ADAMax, etc) and apply it.
The above steps are repeated for the entire data set. Processing the entire data set according to the training procedure is called a 'training epoch'. A neural network can be trained for any number of epochs, ranging from tens to thousands.
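The five steps above can be sketched end-to-end with a toy single-hidden-layer network in plain NumPy (all sizes, the learning rate, and the target function are illustrative, and a single batch covers the whole toy data set):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: learn y = sin(x); here one "batch" is the whole data set.
x = np.linspace(-np.pi, np.pi, 256)[:, None]
y = np.sin(x)

W1, b1 = rng.normal(0.0, 1.0, (1, 32)), np.zeros(32)
W2, b2 = rng.normal(0.0, 0.1, (32, 1)), np.zeros(1)

lr = 0.1
for epoch in range(5000):
    # 1-2. forward pass through the hidden and output layers
    h = np.tanh(x @ W1 + b1)
    out = h @ W2 + b2
    # 3. mean-squared-error loss; err is dL/d(out)
    err = out - y
    # 4. backward pass: propagate the gradient from the last layer to the first
    gW2, gb2 = h.T @ err / len(x), err.mean(0)
    dh = err @ W2.T * (1.0 - h**2)        # tanh'(z) = 1 - tanh(z)^2
    gW1, gb1 = x.T @ dh / len(x), dh.mean(0)
    # 5. plain gradient-descent update (ADAM etc. would go here)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1

mse = float(np.mean((np.tanh(x @ W1 + b1) @ W2 + b2 - y) ** 2))
```

After a few thousand epochs the fit error drops well below the variance of the target, illustrating that even this minimal setup learns a nonlinear function.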
Therefore, deep neural networks can be used to model complex dependencies in a computationally optimal way, often requiring several orders of magnitude less computational time than exact methods. However, their application is not limited to regression problems.
The neural networks have been implemented using the Flux.jl [7,8] library.

VI. CONVOLUTIONAL NEURAL NETWORKS
In practice, pure fully connected networks are rarely used on datasets where the existence of local features in the input data can be assumed (e.g. real-life photographs.) When dealing with a dataset where spatial coordinates can be assigned to each element of the input data, a fully connected layer cannot take advantage of the information about the relative position of pixels. Hence, such a layer assigns as much meaning to nonlocal features as to local ones -but nonlocal features are more likely to be spurious, making the network prone to overfitting. The networks designed to remedy this problem are called convolutional neural networks (CNNs for short). The basic unit of a CNN is a convolutional layer constructed as described below.
Suppose the input image I is a 3-dimensional W × H × C tensor, where W is the image width, H the height, and C the number of channels (which can be, for example, the red, green, and blue channels of a photograph). A convolutional layer is defined by its convolutional filter, bias, and activation function. A convolutional filter Ŵ is then a 4-dimensional d_1 × d_2 × C × C_out tensor, where d_1 × d_2 are the spatial dimensions of the receptive field of the filter (usually d_1 = d_2 = d), and C_out is the number of channels in the output feature map, which can be arbitrarily high. The filter is applied by sliding its receptive field across the image: each output element is the sum of the products of the filter weights with the corresponding d_1 × d_2 patch of the input over all C input channels. After this, a C_out-dimensional bias b is added, and a nonlinear activation function f is applied, as is done for fully connected neural networks. Just like fully connected layers, convolutional layers can be stacked upon each other to extract increasingly complex and nonlocal features with each subsequent layer.
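A naive Python implementation of this operation (no padding or stride, loop-based for clarity; library implementations are heavily optimized):

```python
import numpy as np

def conv2d(image, W, b, f=np.tanh):
    """Naive 'valid' convolution. image: H x W x C; W: d1 x d2 x C x C_out."""
    d1, d2, C, C_out = W.shape
    H_out = image.shape[0] - d1 + 1
    W_out = image.shape[1] - d2 + 1
    out = np.zeros((H_out, W_out, C_out))
    for x in range(H_out):
        for y in range(W_out):
            patch = image[x:x + d1, y:y + d2, :]       # receptive field
            out[x, y] = np.tensordot(patch, W, axes=3) + b
    return f(out)
```

With a 1 × 1 identity filter and an identity activation, the layer reproduces its input, which makes the indexing easy to verify.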
In practice, convolutional layers are almost universally alternated with pooling layers, which reduce the amount of computation required. A pooling layer with a receptive field of p × p replaces each p × p patch of its input with a single value, for example the maximum over the patch (maximum pooling), reducing each spatial dimension by a factor of p. Usually, the number of channels in the convolutional filters is doubled after each maximum pooling operation to make the nonlocal features more expressive than local ones.
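In NumPy, maximum pooling reduces to a reshape followed by a maximum (sketch; spatial dimensions are assumed divisible by p):

```python
import numpy as np

def max_pool(x, p=2):
    """p x p maximum pooling; spatial dimensions must be divisible by p."""
    H, W, C = x.shape
    return x.reshape(H // p, p, W // p, p, C).max(axis=(1, 3))
```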
CNNs are also commonly regularized with batch normalization layers. For each channel i of the input data, a batch normalization layer first normalizes the channel data to mean 0 and variance 1 across the entire data batch, then shifts it to have a new mean β_i and variance γ_i, where β and γ are trainable parameters. Although the usefulness of batch normalization in stabilizing and accelerating deep neural network training is demonstrably significant [9], the reason for this is still debated [10].
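A training-mode batch normalization layer can be sketched as follows (a simplification of the description above; in this sketch γ rescales the standard deviation rather than the variance, as in common library implementations):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Training-mode batch normalization: per-channel normalization over the
    batch axis, followed by a learnable scale (gamma) and shift (beta)."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta
```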
An architecture demonstrating the described principles can be seen in Fig. 3; see [11] for detailed review.
In our work, we aimed to first train our neural network on a source problem (the Rice-Mele model dataset), and then retrain it for a target problem (the gapped graphene dataset).
We found that a fully connected neural network was able to attain performance comparable to a CNN on the source problem. To do so, it required as input the 2D frequency-phase spectra concatenated with their Fourier transforms along the phase axis. However, as FCNNs are prone to overfitting, they turned out to be ill-suited for retraining and performed poorly on the target problem. CNNs, on the other hand, can be retrained by only retraining the fully connected layers, without affecting the convolutions. Following this procedure, we attained good performance on both the source and the target problems (see Figs. 4, 5).

VII. INPUT DATA
The phase ∆ϕ was discretized into 32 steps. Per each laser cycle, 80 time points were recorded, allowing us to recover integer harmonics up to N = 40; thus a single 2D plot has 41 × 32 points of the form log₁₀|j(Nω, ∆ϕ)|. These were normalized to a mean value of 0 and a standard deviation of 1. The normalized images were then presented as 41 × 32 × 1 tensors, i.e. 2D images with a single channel.
We have generated data sets that consist of 65536 samples for pulses with no cubic phase, and 262144 samples for pulses with cubic phase. Next, 1% of the samples from each dataset were set apart to form the test set. The required outputs were the band parameters, the CEP in the form of (cos ϕ, sin ϕ), and the dimensionless chirp and cubic phase.
During the training process, the data underwent random circular shifts: before a training step, all 2D plots |j(Nω, ∆ϕ)| in each batch were transformed by a circular shift along the ∆ϕ axis. This allowed us to train with smaller datasets and helped to combat overfitting.
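This augmentation can be sketched as follows (Python illustration; note that if the CEP itself is among the targets, its label would have to be rotated consistently with the shift — an assumption on our part, as the text does not detail the label handling):

```python
import numpy as np

def augment_batch(batch, rng):
    """Random circular shift of each sample along the CEP (delta-phi) axis.
    batch: (n_samples, n_harmonics, n_phases)."""
    out = np.empty_like(batch)
    for i, img in enumerate(batch):
        out[i] = np.roll(img, rng.integers(img.shape[1]), axis=1)
    return out
```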

VIII. SOURCE PROBLEM PERFORMANCE
The average performance of the selected neural network on the test set of the Rice-Mele model dataset for different unknown parameters is given below. For parameters that do not change sign, the average relative error is also indicated.

IX. MODELING HIGH-HARMONIC RESPONSE OF GAPPED GRAPHENE
To demonstrate that the concepts learned by the neural network on the original task can be applied to more difficult problems, we consider the case of hexagonal boron nitride (hBN), a two-dimensional material described by a Hamiltonian equivalent to gapped graphene. The parameters of this material, the mass m (on-site energy) and the first-neighbor hopping, are taken within the ranges quoted in the main text, and the responses are simulated using the code introduced in [12,13]. The initial state is a mixed state with no coherence between the eigenstates, given by the fully-filled valence band. The time-dependent interaction is treated in the length gauge and in the dipole approximation, Ĥ(k, t) = Ĥ_0(k) + E(t)·r, where Ĥ_0(k) is the field-free Hamiltonian in Eq. 17 and E(t) is the time-dependent electric field. The position operator is given by r = i∂_k + A(k), where A(k) is the Berry connection [14]. The current is then obtained as j(t) = (1/N_k) Σ_k tr(ρ_k(t) v_k), where N_k is the number of k-points, and v_k = −i[r, Ĥ_0(k) + E(t)·r] = −i[r, Ĥ_0(k)] is the velocity operator.
We checked that the high harmonic spectrum was converged for a k-grid of n_kx = n_ky = 200 points and a time step of ∆t = 0.8 a.u. As before, for each crystal sample, consisting of a mass, a first-neighbor hopping, and an initial CEP, we record the spectral amplitudes of the current between harmonic orders H0 and H40.

X. TRANSFER LEARNING
The method of transfer learning is applied to problems for which there isn't enough data to train a neural network (commonly named "target" problems), but which have a similar problem ("source" problem) with sufficient training data available. The first stage consists of training the model (in our case, a neural network) to solve the source problem. The deeper layers of our network learn more abstract representations of the input data. However, as the information flow approaches the output, the concepts learned by the layers become less abstract and more specialized on the exact problem. Thus, in the second stage, the neural network is partially retrained for the target problem, with only the last layers being retrained. The retrained network may or may not be used as part of a larger neural network which modifies its inputs or uses the outputs of its hidden layers.
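The two-stage idea can be illustrated with a toy Python example in which a "feature extractor" standing in for the pretrained convolutional stack is kept frozen, and only the final layer is retrained on a small target data set (all shapes, values, and names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)

# Pretrained 'feature extractor' (frozen) and task-specific 'head' (retrained).
W_feat = rng.normal(0.0, 0.5, (8, 16))   # stands in for the pretrained layers
W_head = rng.normal(0.0, 0.1, (16, 1))   # stands in for the retrained head

def features(x):
    # Frozen part of the network: no gradient updates are applied to W_feat.
    return np.tanh(x @ W_feat)

# Small 'target problem' dataset whose labels are realizable by some head
x = rng.normal(size=(64, 8))
y = features(x) @ rng.normal(size=(16, 1))

mse_init = float(np.mean((features(x) @ W_head - y) ** 2))
for _ in range(3000):
    h = features(x)                      # forward pass through frozen layers
    err = h @ W_head - y
    W_head -= 0.1 * h.T @ err / len(x)   # gradient step on the head only
mse_final = float(np.mean((features(x) @ W_head - y) ** 2))
```

Because only the head is updated, each retraining step is cheap and the small target data set cannot corrupt the features learned on the source problem.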
Alternatively, transfer learning may be used for two similar problems even when there is sufficient data for the target problem. A network pre-trained on a source problem may be retrained completely for the target, but its pre-trained parameters still prove to be a better initial guess than random initialization.
We thus use the idea of transfer learning in two senses. First, for the source (TDSE) problem, networks trained on datasets with fewer unknown nonlinear phase parameters are used as initial guesses to train networks to recover more complex spectral phases: i.e., the networks designed to recover the CEP and band parameters for unchirped pulses are retrained to process chirped pulses, after which they are retrained on the dataset with a cubic phase.
We also use transfer learning to train our neural network to recover pulses and band parameters from gapped graphene spectra. After pretraining it on the source dataset we use 1280 gapped graphene responses as target dataset. Due to its small size, we use 20% of