On the learnability of quantum neural networks

We consider the learnability of the quantum neural network (QNN) built on the variational hybrid quantum-classical scheme, which remains largely unknown due to the non-convex optimization landscape, the measurement error, and the unavoidable gate errors introduced by noisy intermediate-scale quantum (NISQ) machines. Our contributions in this paper are fourfold. First, we derive the utility bounds of QNN towards empirical risk minimization, and show that large gate noise, few quantum measurements, and deep circuit depth lead to poor utility bounds. This result also applies to variational quantum circuits with gradient-based classical optimization, and can be of independent interest. Second, we prove that QNN can be treated as a differentially private (DP) model. Third, we show that if a concept class can be efficiently learned by QNN, then it can also be effectively learned by QNN even with gate noise. This result implies the same learnability of QNN whether it is implemented on noiseless or noisy quantum machines. Finally, we show that the quantum statistical query (QSQ) model can be effectively simulated by noisy QNN. Since the QSQ model can tackle certain tasks with runtime speedup, our result suggests that the modified QNN implemented on NISQ devices will retain the quantum advantage. Numerical simulations support the theoretical results.


Introduction
The deep neural network (DNN) has substantially impacted the field of machine learning in the past decade [1]. Most real-world applications, such as object detection [2,3], question answering [4,5], and social recommendation [6], among many others, can be accomplished by DNN-based learning algorithms with state-of-the-art performance because of powerful computational hardware and the flexible architecture of DNN. As shown in Fig. 1 (a), DNN adopts a multi-layer scheme. The inputs are processed through the feature embedding layers $F_x(\cdot)$, followed by the fully-connected layers $W(\cdot)$, where the choice of each layer and the combination rule can be tailored to various learning tasks. Training DNN is a process of uncovering the intrinsic relation between the input and the output of the given dataset. A huge amount of effort has been dedicated to understanding and explaining the learnability of DNN, namely, the capabilities and limitations of DNN learning models, from the perspective of convergence and generalization [7,8,9,10,11].
Quantum machine learning is a central application of quantum computing [12]. With the aim of solving real-world problems beyond the reach of classical computers, firm and steady progress has been made during the past decade [13,14,15,16]. Among these breakthroughs, a quantum extension of DNN, i.e., the quantum neural network (QNN), which was separately proposed in [17,18,19,20], has received great attention due to the huge success of DNN and the superior computational power of quantum machines. As shown in Fig. 1 (b), QNN also adopts the multi-layer architecture, where the inputs are converted into the corresponding quantum states by the encoding quantum circuit $U_x$, followed by the trainable quantum circuits $U(\theta) = \prod_{l=1}^{L} U_l(\theta)$, where $\theta$ denotes the adjustable parameters of the quantum gates, and a classical optimizer. There is a close correspondence between DNN and QNN: the feature embedding layers $F_x(\cdot)$ of DNN correspond to the encoding quantum circuit $U_x$ of QNN, while the fully-connected layer $W_l(\cdot)$ of DNN corresponds to the trainable quantum circuit $U_l(\theta)$ of QNN. Owing to the stronger power of quantum circuits to prepare classical distributions [21,22], QNN could possess a stronger expressive power than its classical counterparts [23] and accelerate a wide range of machine learning problems.
Figure 1: (a) For DNN, the feature embedding layers $F_x(\cdot)$, which contain a sequence of operations in arbitrary combination such as convolution and attention, map the input '0' to the feature space. $W_l(\cdot)$ is the $l$-th fully-connected layer. (b) For QNN, an encoding quantum circuit $U_x$ maps the classical input '0' to the quantum feature space. $U_l(\theta)$ is the $l$-th trainable quantum circuit. Classical information for optimization is extracted by quantum measurements.
Despite the promising prospects, theoretical results about QNN remain largely unknown. The difficulties mainly come from two sides. First, the versatile structures of QNN and their non-convex optimization landscapes, similar to those of DNN, heavily challenge the analysis. Second, due to the nature of quantum mechanics, the classical optimizer only receives statistical information estimated from a finite number of measurements, and the error accumulates as the number of iterations increases. Although some studies have partially overcome these difficulties from the aspects of vanishing gradients [24,25], robustness [26,27], information scrambling [28], memory capacity [29], and the no-free-lunch theorem [30], the fundamental question, namely, 'What is the learnability of QNN?', is left open.
The importance of exploring QNN's learnability is further increased in the noisy intermediate-scale quantum (NISQ) era [31,32], since QNN can be easily built on NISQ machines and its performance is robust against gate noise. Empirical studies have shown that QNN can accomplish various supervised learning tasks, e.g., classification [33,18,20] and regression [19,34]. However, no theoretical results can conclude any quantum advantage from these outcomes. To theoretically explain the empirical observations, exploring the learnability of QNN under the empirical risk minimization (ERM) framework [35] could be very fruitful, because ERM underpins many core results in statistical learning theory and offers learning guarantees for a wide range of supervised learning tasks. Furthermore, it is unclear how gate noise affects the learnability of QNN. The answer substantially affects the feasibility of using QNN on NISQ machines to pursue quantum merits.
Problem setup. We follow the convention in statistical learning theory, and examine the learnability of QNN under the framework of empirical risk minimization (ERM) [36] as a first step. In this way, analyzing the learnability of QNN amounts to checking the utility bounds generated by QNN.
Let $z = \{z_j\}_{j=1}^{n} \in \mathcal{Z}$ be the given dataset with $\mathcal{Z}$ being the sample domain, where the $j$-th sample $z_j = (x_j, y_j)$ includes a feature vector $x_j \in \mathbb{R}^D$ and a label $y_j \in \mathbb{R}$. ERM aims to find the optimal $\theta^* \in \mathbb{R}^d$ by minimizing the objective function $L$ within the constraint set $\mathcal{C} \subseteq \mathbb{R}^d$, i.e.,
$$\theta^* = \arg\min_{\theta \in \mathcal{C}} L(\theta, z) := \frac{1}{n} \sum_{j=1}^{n} \ell(y_j, \hat{y}_j) + r(\theta),$$
where $\hat{y}_j$ is the predicted label that is determined by $\theta$ and $x_j$, $\ell$ is the loss function that measures the disparity between the true labels $\{y_j\}_{j=1}^{n}$ and the predicted labels $\{\hat{y}_j\}_{j=1}^{n}$, and $r(\cdot)$ is a regularizer. To ease the discussion, throughout the paper, we consider the square loss $\ell$, and use $r(\theta) = \lambda \|\theta\|_2^2 / 2$ with $\lambda \ge 0$. Note that our analysis can be easily generalized to other loss functions that satisfy the $S$-smooth and $G$-Lipschitz properties as discussed in Sec. 3.
The common optimization rule to tackle ERM is the batch gradient descent method [1]. Depending on the available resources, the sample indices are divided into $B$ disjoint batches $\{B_i\}_{i=1}^{B}$. The optimization rule at the $t$-th iteration updates $\theta^{(t)}$ to $\theta^{(t+1)}$ by a gradient step with learning rate $\eta$, where the gradient $\nabla L(\cdot)$ is computed from $Y_i$ and $\hat{Y}_i$, the averages of the true labels and the predicted labels over the $i$-th batch $B_i$, respectively. When no confusion will occur, we use $L(\theta^{(t)})$ and $L_i(\theta^{(t)})$ instead of $L(\theta^{(t)}, z)$ and $L(\theta^{(t)}, B_i)$ in the rest of this study.
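The ERM objective and the batch gradient descent rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: the predictor is a hypothetical linear model standing in for QNN, the square loss is taken as $(y - \hat{y})^2$, the constraint set is all of $\mathbb{R}^d$, and the update averages the per-batch gradients before taking a step of size `eta`. The function names are ours and do not come from the paper.

```python
import random


def predict(theta, x):
    # Hypothetical linear predictor standing in for the QNN output.
    return sum(t * xi for t, xi in zip(theta, x))


def objective(theta, data, lam):
    """Empirical risk with square loss plus the regularizer lam * ||theta||_2^2 / 2."""
    risk = sum((y - predict(theta, x)) ** 2 for x, y in data) / len(data)
    return risk + lam * sum(t * t for t in theta) / 2


def batch_gradient(theta, batch, lam):
    """Gradient of the regularized square loss averaged over one batch."""
    grad = [lam * t for t in theta]
    for x, y in batch:
        residual = predict(theta, x) - y
        for j, xj in enumerate(x):
            grad[j] += 2 * residual * xj / len(batch)
    return grad


def train(data, num_batches, eta, lam, steps):
    """Batch gradient descent: average the per-batch gradients, then step."""
    dim = len(data[0][0])
    theta = [0.0] * dim
    # Assumed batching scheme: deal the samples into disjoint batches.
    batches = [data[i::num_batches] for i in range(num_batches)]
    for _ in range(steps):
        grads = [batch_gradient(theta, b, lam) for b in batches]
        avg = [sum(g[j] for g in grads) / num_batches for j in range(dim)]
        theta = [t - eta * g for t, g in zip(theta, avg)]
    return theta
```

In QNN the quantity returned by `predict` would instead be estimated from quantum measurements, which is where the estimation error discussed in the following paragraphs enters.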
The training of QNN is similar to that of DNN. In particular, QNN also generates an average of the predicted labels, based on $\theta$ and $B_i$, after the measurement component in Fig. 1 (b). However, the major difference between the gradient-based optimization of QNN and DNN is as follows. In DNN, the gradient in Eqn. (2) can be easily obtained via backpropagation [1]. In contrast, due to the nature of quantum mechanics, the gradient of a quantum unitary operator (e.g., the trainable quantum circuit layer $U_l(\theta)$) is, in general, no longer a legitimate quantum operator [37]. To overcome this shortcoming, the parameter shift rule [19,37] was proposed to estimate the gradients of a quantum unitary operator using $K$ measurements. Difficulties nevertheless arise, since only approximations of $\hat{Y}_i^{(t)}$ and $\partial \hat{Y}_i^{(t)} / \partial \theta^{(t)}$ are available due to the finite number of measurements, and the precision deteriorates as more iterations occur. The detailed steps will be discussed in Sec. 3.
Furthermore, we would like to incorporate the unavoidable gate noise of the trainable quantum circuit $U(\theta)$ in our studies. This can be done by considering the worst-case scenario, i.e., modeling the gate noise at each circuit depth as the quantum depolarization noise $\mathcal{N}_p$ [38]. Intuitively, if a quantum state passes through $\mathcal{N}_p$, then with probability $1 - p$ the output remains unchanged; otherwise, all information of the input is lost and the output is the maximally mixed state. Note that the achieved results can be easily extended to a more general noisy channel (see Appendix I for details).
We adopt two standard utility metrics to quantify the performance (learnability) of QNN:
$$R_1(\theta^{(T)}) := \mathbb{E}\big[\|\nabla L(\theta^{(T)})\|^2\big], \qquad R_2(\theta^{(T)}) := \mathbb{E}\big[L(\theta^{(T)})\big] - L(\theta^*),$$
where $\theta^{(T)}$ is the output of QNN after $T$ iterations and $\nabla L(\cdot)$ denotes the gradient of the function $L(\cdot)$. The metric $R_1$ evaluates how far QNN is away from the stationary point, $\|\nabla L(\theta^{(T)}, z)\|^2 = 0$, in expectation [39,40]. The utility metric $R_2$ evaluates the expected excess empirical risk [41,42]. Due to the hardness of finding the global optimum in a non-convex landscape, $R_2$ can only be applied to some special non-convex objective functions, i.e., objective functions that satisfy the Polyak-Łojasiewicz (PL) condition [43,44]. We will show that, under a mild assumption, the objective function of QNN also meets the PL condition, and $R_2$ can be employed to analyze its performance.
Contributions. The main contributions of this study are as follows. Our first contribution is deriving QNN's utility bounds for ERM. As mentioned above, the non-convex optimization landscape, the accumulated estimation error due to quantum measurements, and the inevitable gate noise heavily challenge the analysis of QNN's utility bounds. To the best of our knowledge, this is the first study towards understanding the learnability of QNN with a provable guarantee.
Theorem 1. QNN outputs $\theta^{(T)} \in \mathbb{R}^d$ after $T$ iterations with utility bounds on $R_1$ and $R_2$ expressed in terms of $K$, $L_Q$, $p$, and $B$, where $K$ is the number of quantum measurements, $L_Q$ is the quantum circuit depth, $p$ is the gate noise, and $B$ is the batch size.
Theorem 1 indicates that a larger number of measurements $K$, a smaller gate noise rate $p$, a shallower circuit depth $L_Q$, and a smaller number of trainable parameters $d$ yield better utility bounds for both $R_1$ and $R_2$. We remark that the achieved utility bounds $R_1$ and $R_2$ are very general, and cover various types of encoding quantum circuits $U_x$ and trainable quantum circuits $U(\theta)$. In particular, our results cover all typical encoding circuits, e.g., amplitude encoding [45,46,47], kernel mapping [18,19,20], the dimension reduction method [48], and basis encoding methods [49,17], as well as diverse architectures of the trainable quantum circuit, as long as it is composed of parameterized single-qubit gates and two-qubit gates [50].
Note that variational hybrid quantum-classical learning models have also been empirically applied to explore fundamental properties of physical systems, e.g., ground energy approximation and thermal average computation [51,52]. These problems are generally more sensitive to the global minimum than machine learning problems. Therefore, the utility bounds in Theorem 1 can serve as a powerful tool to support those results.
A central topic in classical machine learning is exploring whether noise affects the learnability of a given learning task. A notable example is that the class of parity functions is probably approximately correctly (PAC) learnable; however, learning parity with noise is thought to be computationally hard [53]. Here we lift this essential question from the classical scenario to the quantum scenario: whether there exists any concept class that separates the learnability of noiseless QNN from that of noisy QNN, i.e., noiseless QNN uses polynomially many samples to learn this concept class, while exponentially many samples are needed in the NISQ case.
Our second contribution is providing a negative answer towards the above question.
Theorem 2. If QNN with noiseless gates PAC learns a concept, then there exists a modified QNN with certain types of noisy gates that can also learn this concept using polynomial samples.
The result of Theorem 2 indicates the same sample complexity between noiseless QNN and noisy QNN to learn a specific concept class. This implies that if QNN achieves certain learning tasks with quantum advantages, then we can implement QNN on NISQ machines with a simple modification to preserve advantages as well.
The key technique used to achieve Theorem 2 is differentially private (DP) learning [54,55,56,57]. The intuition for employing DP is as follows. The behavior of QNN with gate noise resembles both learning with noise and DP learning, where a certain type of noise is injected into the learning model. However, DP learning differs from learning with noise [58], in that the former can effectively tackle some tasks that are computationally hard for the latter. Hence, it is beneficial to explore whether QNN with gate noise is a DP learning model rather than an instance of learning with noise. The exploration of the DP property of QNN leads to our third contribution.
Lemma 1. The QNN with gate noise can be treated as an $(\epsilon, \delta)$-DP model with $\delta \ge 0$ and a corresponding privacy parameter $\epsilon$.
Together with the fact that non-private and DP algorithms share the same learnability in terms of sample complexity [58], we complete Theorem 2.
Last, we explore the learnability of QNN implemented on NISQ machines. In particular, we aim to find certain tasks that can be achieved by the original QNN and the modified QNN with quantum advantages. To reach this goal, we explore whether the quantum statistical query (QSQ) learning model can be efficiently simulated by these two learning models, since QSQ can efficiently tackle parity learning, juntas learning, and DNF (disjunctive normal form) learning problems with quantum advantages, whereas these problems are computationally hard for classical SQ models [59]. Our fourth contribution is exhibiting that QSQ can only be efficiently simulated by QNN with gate noise, which establishes a computational separation between the original QNN and the modified QNN.
Theorem 3. A QSQ learning model can be efficiently simulated by QNN with gate noise using polynomial samples.

Related work
Previous quantum machine learning literature related to our work can be divided into two groups: quantum learning theory and quantum neural networks. We emphasize that none of the studies listed below concerns the learnability of QNN.
For the first group, the studies [60,61,62,63] showed that the sample complexities of quantum and classical probably approximately correct (PAC) (or agnostic) learning models are equal up to a constant factor under the distribution-independent setting. A recent study [59] generalized the classical statistical query (SQ) model to the quantum statistical query (QSQ) model and compared the learnability among the SQ learner, the (noisy) quantum PAC learner, the QSQ learner, and the PPAC learner. However, how to use these results to analyze the learnability of QNN is unclear.
For the second group, beyond the hybrid scheme discussed in this study, there are different schemes and platforms to implement QNN. Specifically, several studies have investigated how to implement QNN on noiseless quantum machines [64,65,66], quantum reservoirs [67], and quantum annealers [68]. Since these proposals adopt distinct frameworks, they are incomparable with our results.

Preliminary
We unify the notation used throughout the paper. We denote $D$ as the feature dimension ($x \in \mathbb{R}^D$) and $d$ as the number of trainable parameters ($\theta \in \mathbb{R}^d$). Define $N$ as the number of qubits and $n$ as the number of training examples. Denote the set $\{1, 2, ..., m\}$ as $[m]$. A random variable $X$ that follows the Bernoulli distribution is denoted as $X \sim \mathrm{Ber}(p)$, i.e., $\Pr(X = 1) = p$ and $\Pr(X = 0) = 1 - p$. With a slight abuse of notation, we denote $\ell_b$ as the $b$-norm, while $\ell$ (without subscript) is the loss function. We use $O(\cdot)$ (or $\tilde{O}(\cdot)$) to denote the complexity bound (hiding poly-logarithmic factors). See Appendix A for details.
Quantum computing. We introduce the basics of quantum computing. A quantum state lives in a complex Hilbert space $\mathcal{H}$. Let $|0\rangle = (1, 0)^\top$ and $|1\rangle = (0, 1)^\top$ be the standard basis states for $\mathbb{C}^2$. A quantum bit (qubit) lives in the two-dimensional Hilbert space spanned by $|0\rangle$ and $|1\rangle$. Multi-qubit basis states follow the tensor product rule, e.g., $|0\rangle \otimes |1\rangle \equiv |0\rangle|1\rangle \in \mathbb{C}^4$ describes a basis state of a 2-qubit system. A pure state $|a\rangle$ of $N$ qubits follows $|a\rangle = \sum_{i=1}^{d} a_i |i\rangle$ with $d = 2^N$ and $\|a\|_2 = 1$, where the basis states $|i\rangle \in \{|0\rangle, |1\rangle\}^{\otimes N}$ are also called the computational basis. $|a\rangle$ is in superposition if $\|a\|_0 > 1$. The conjugate transpose of $|a\rangle$ is denoted as $\langle a|$. We use the density matrix to describe more general quantum states. Given a mixture of $m$ pure states $\{p_i, |\psi_i\rangle\}_{i=1}^{m}$, the density matrix is $\rho = \sum_{i=1}^{m} p_i |\psi_i\rangle\langle\psi_i|$ with $\sum_{i=1}^{m} p_i = 1$ and $\mathrm{Tr}(\rho) = 1$. There are two main types of quantum operations in quantum computation. The first one is the quantum channel, which is a completely positive trace-preserving map, e.g., applying a channel $\mathcal{N}$ to a density matrix $\rho \in \mathbb{C}^{d \times d}$ generates the state $\mathcal{N}(\rho) = \sum_a M_a \rho M_a^\dagger$ with $\sum_a M_a^\dagger M_a = I_d$. Note that a quantum gate, which is a unitary matrix, is a special quantum channel. The second one is the quantum measurement, which extracts classical information from a quantum state. An $m$-outcome measurement, a.k.a. positive-operator-valued measure (POVM), is modeled by $m$ positive semidefinite matrices $\{\Pi_b\}_{b=1}^{m}$ with $\sum_{b=1}^{m} \Pi_b = I$. Given $\rho$, the probability of obtaining outcome $b$ is $p_b = \mathrm{Tr}(\Pi_b \rho)$.
Definition 1 (Depolarization channel). Given a quantum state $\rho$, the depolarization channel $\mathcal{N}_p$ acting on a $D$-dimensional Hilbert space is defined as $\mathcal{N}_p(\rho) = (1 - p)\rho + pI/D$.
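Definition 1 can be made concrete with a small numerical sketch. The code below applies the depolarization channel to a density matrix stored as nested lists, and checks a simple consequence of the definition: applying the channel `depth` times in a row is the same as applying it once with rate $1 - (1-p)^{\text{depth}}$. The helper names (`depolarize`, `apply_layers`, `effective_rate`) are hypothetical and chosen for illustration only; the sketch ignores the unitary gates between noise layers, which does not change the effective rate for this channel because $U (I/D) U^\dagger = I/D$.

```python
def depolarize(rho, p):
    """Apply N_p(rho) = (1 - p) * rho + p * I / D to a D x D density matrix."""
    dim = len(rho)
    return [[(1 - p) * rho[r][c] + (p / dim if r == c else 0.0)
             for c in range(dim)]
            for r in range(dim)]


def apply_layers(rho, p, depth):
    """Apply the depolarization channel once per circuit layer."""
    for _ in range(depth):
        rho = depolarize(rho, p)
    return rho


def effective_rate(p, depth):
    """Rate of the single depolarization channel equal to `depth` repeated ones."""
    return 1 - (1 - p) ** depth


def trace(matrix):
    return sum(matrix[i][i] for i in range(len(matrix)))
```

For instance, with the single-qubit state $|0\rangle\langle 0|$ given as `[[1.0, 0.0], [0.0, 0.0]]`, `apply_layers(rho, 0.0025, 5)` and `depolarize(rho, effective_rate(0.0025, 5))` agree up to floating-point error, and the trace stays equal to one.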
Differential privacy. Differential privacy (DP) is a rigorous and standard notion of data privacy, which aims to train an accurate learning model without exposing precise information about any individual training example, e.g., genomic data and medical records of patients [55].
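For reference, the standard textbook form of $(\epsilon, \delta)$-differential privacy is recalled below. The symbols $\mathcal{A}$ (a randomized learning algorithm), $z, z'$ (two datasets that differ in a single training example), and $S$ (a set of possible outputs) are our notation and may differ from the formal definition used in the appendix.

```latex
% A randomized algorithm \mathcal{A} is (\epsilon, \delta)-differentially private
% if, for all neighboring datasets z and z' and every set of outputs S,
\Pr\big[\mathcal{A}(z) \in S\big] \;\le\; e^{\epsilon}\, \Pr\big[\mathcal{A}(z') \in S\big] + \delta .
```

A smaller $\epsilon$ and a smaller $\delta$ mean that the output distribution changes less when one training example is replaced, i.e., a stronger privacy guarantee.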

Utility bounds of quantum neural network towards ERM
A well-known result in the study of ERM is that the utility bounds of a given learning model depend strongly on the type and the amount of error contained in its gradient [69,54,70]. Specifically, when the gradient is perturbed by a sufficiently large amount of noise, the optimization may not converge and the utility bound is poor [71]. Meanwhile, empirical and theoretical evidence has corroborated that injecting certain types of noise into the gradient does not affect, or can even accelerate, the convergence [54,72,73]. A similar issue arises when optimizing QNN. In particular, when QNN is realized on quantum chips, the presence of sampling error and gate error means that the classical optimizer only has access to an estimated gradient instead of the analytic gradient. However, theoretical results about how the estimation error of the gradient affects the optimization remain largely unknown. Moreover, the heuristic study [74] showed that conclusions based on certain quantum learning models, which are built under the ideal setting that omits the gate error or sampling error, may not be applicable to experiments. Therefore, it is crucial to establish the analytical relation between the estimated and analytic gradients, since this relation is not only the precondition for analyzing the utility bounds of QNN towards ERM, but can also be used to quantify how hybrid classical-quantum learning schemes perform on real quantum devices, which is of independent interest.
We first elaborate on the workflow of QNN. As shown in Fig. 1 (b), QNN first employs a state preparation unitary $U_x$ to encode the classical inputs $\{x_j \mid j \in B_i\}$ into quantum states, followed by the quantum circuit $U(\theta)$ with tunable parameters $\theta$ to produce the state $\gamma_{B_i}$. We refer the interested reader to Appendix B for implementation details of $U_x$ and $U(\theta)$. Finally, a two-outcome POVM measurement $\Pi$ is applied to the state $\gamma_{B_i}$ and produces an outcome $V_k$ (in the $k$-th repetition) that can be viewed as a binary random variable with the Bernoulli distribution $\mathrm{Ber}(\hat{Y}_i)$, where $\hat{Y}_i := \mathrm{Tr}(\Pi \gamma_{B_i})$. Denote the obtained statistic, i.e., the sample mean, by $\bar{Y}_i = \frac{1}{K} \sum_{k=1}^{K} V_k$ after repeating the above procedure $K$ times. The law of quantum mechanics ensures $\bar{Y}_i \to \hat{Y}_i$ when $K \to \infty$. However, in reality, only a finite number of measurements is allowed, and this results in the sampling error (measurement error).
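The sampling error described above can be illustrated with a short classical simulation. The sketch assumes only what the text states: each measurement outcome is a Bernoulli random variable with success probability $\hat{Y}_i$, and the statistic is the mean of $K$ such outcomes. The function names and the use of a pseudo-random generator in place of a quantum device are ours.

```python
import random


def sample_mean(y_hat, shots, rng):
    """Mean of `shots` simulated two-outcome measurement results V_k ~ Ber(y_hat)."""
    outcomes = [1 if rng.random() < y_hat else 0 for _ in range(shots)]
    return sum(outcomes) / shots


def estimator_variance(y_hat, shots):
    """Variance of the sample mean of `shots` independent Ber(y_hat) outcomes."""
    return y_hat * (1 - y_hat) / shots
```

The variance of the sample mean decays as $1/K$, which is why a larger number of measurements gives a more precise estimate of $\hat{Y}_i$, while the estimate remains unbiased for any $K$.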
In addition, the quantum gates in NISQ machines, which are used to implement $U_x$ and $U(\theta)$, are prone to errors [32]. The gate noise can be modeled by applying certain quantum channels to each quantum circuit layer, and we use the depolarization channel $\mathcal{N}_p$ in Definition 1 in the following analysis. Note that our analysis works for more general channels, as discussed in Remark 1. Specifically, when $\mathcal{N}_p$ is applied to each layer of the quantum circuit, the quantum state before the measurement becomes a noisy counterpart of $\gamma_{B_i}$.
Recall the updating rule of QNN at the $t$-th iteration. In order to obtain the gradient $\nabla_j L(\theta^{(t)})$, the parameter shift rule was developed [19,37], since the gradient of a quantum unitary operator may not be a legitimate quantum operation and cannot be realized on quantum circuits. Specifically, the parameter shift rule proceeds by separately feeding the tunable parameters $\theta^{(t)}$ and $\theta^{(t,\pm_j)} := \theta^{(t)} \pm \frac{\pi}{2} e_j$ to the trainable circuit $U(\theta)$, where $e_j$ is the basis vector with the $j$-th entry being 1 and zero otherwise. Following the notation used above, we denote $\hat{Y}_i^{(t)}$ and $\hat{Y}_i^{(t,\pm_j)}$ as the expectation values of quantum measurements when feeding the parameters $\theta^{(t)}$ and $\theta^{(t,\pm_j)}$ into the trainable quantum circuit $U(\theta)$ in the noiseless scenario, from which the corresponding analytic gradient is obtained. However, in practice, QNN can only generate the statistics $\bar{Y}_i^{(t)}$ and $\bar{Y}_i^{(t,\pm_j)}$, i.e., sample means that estimate the expectation values of quantum measurements when feeding the parameters $\theta^{(t)}$ and $\theta^{(t,\pm_j)}$ into the noisy trainable quantum circuit $U(\theta)$. This leads to the estimated gradient.
Our main technical contribution here is showing that the estimated gradient, whose error is caused by the gate noise and the sampling error, can be related to the analytic gradient, and that this relation can be explicitly formulated. An informal result is summarized below (see Theorem D.1 in Appendix D for details).

Theorem 4. The estimated gradient can be expressed through the analytic gradient, a term $C_{j,1}^{(i,t)}$, and a random variable $\varsigma_i^{(t,j)}$, where $C_{j,1}^{(i,t)}$ only depends on $Y_i$, $\theta^{(t)}$, and $\tilde{p}$, and $\varsigma_i^{(t,j)}$ follows the distribution $P_Q$ that is formed by $Y_i$, $\theta^{(t)}$, the number of measurements $K$, and $p$, with zero mean.

The result in Theorem 4 indicates that the estimated gradient is biased by the term $C_{j,1}^{(i,t)}$ and perturbed by a random variable $\varsigma_i^{(t,j)}$. This enables us to quantitatively measure how far the estimated gradient is away from the analytic gradient, which is the precondition for leveraging optimization theory to analyze the performance of QNN. Moreover, the result of Theorem 4 implies that, compared with the finite number of measurements, the gate error is more harmful to QNN's optimization and may lead to divergence. In particular, the term $C_{j,1}^{(i,t)}$, which is independent of $K$, always exists and induces a biased optimization direction when $\tilde{p} \neq 0$. In the worst case, with $\tilde{p} = 1$, the analytic gradient information is completely lost. In contrast, $K$ only determines the variance of the zero-mean distribution $P_Q$, for which the classical and quantum literature [75,73] has provided convergence guarantees even if $K = 1$.
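The different roles of gate noise and finite measurements can be seen in a toy single-qubit example. The sketch below is not the setting of Theorem 4: it assumes a hypothetical circuit whose noiseless expectation value is $(1 + \cos\theta)/2$, models the noise as a single depolarization channel with an assumed effective rate `p_eff` (a stand-in for $\tilde{p}$, so that the measured probability is mixed with $1/2$), and applies the parameter shift rule with shift $\pi/2$ to the expectation value itself rather than to the loss.

```python
import math
import random


def expectation(theta, p_eff=0.0):
    """Toy expectation value; depolarization mixes it with Tr(Pi * I/2) = 1/2."""
    ideal = (1 + math.cos(theta)) / 2
    return (1 - p_eff) * ideal + p_eff / 2


def estimate(theta, p_eff, shots, rng):
    """Sample mean of `shots` simulated two-outcome measurements."""
    prob = expectation(theta, p_eff)
    return sum(rng.random() < prob for _ in range(shots)) / shots


def parameter_shift(theta, p_eff=0.0, shots=None, rng=None):
    """Parameter shift derivative, exact if shots is None, else from samples."""
    plus, minus = theta + math.pi / 2, theta - math.pi / 2
    if shots is None:
        return (expectation(plus, p_eff) - expectation(minus, p_eff)) / 2
    return (estimate(plus, p_eff, shots, rng)
            - estimate(minus, p_eff, shots, rng)) / 2
```

In this toy model the noise rescales the derivative by $(1 - \tilde{p})$, a systematic effect that no number of measurements removes and that erases the derivative entirely at $\tilde{p} = 1$, whereas a finite number of shots only adds a zero-mean fluctuation around the noisy value.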
Besides the effects of the gradient error, the utility bounds also heavily depend on the properties of the objective function. In the following, we show that $L$ used in QNN satisfies the $S$-smooth, $G$-Lipschitz, and PL conditions. The formal definitions of these concepts and the achieved results are given below.
The proof of Lemma 2 is provided in Appendix C. The analysis of the utility bounds $R_1$ and $R_2$ of QNN towards ERM, which are summarized in Theorem 1, can be effectively conducted by leveraging Theorem 4 and Lemma 2. Theorem 1 provides the following theoretical guidance for designing QNN-based learning algorithms: a larger number of measurements $K$, a larger batch size $B$, a smaller depolarizing error $p$, a smaller parameter space $d$, and a shallower quantum circuit depth $L_Q$ yield better utility bounds $R_1$ and $R_2$.
The full proof of Theorem 1 is provided in Appendix E. The proof strategy of Theorem 1 is as follows. Recall that the utility bound $R_1$ measures how far the trainable parameter of QNN is away from the stationary point. A well-known result in optimization theory [76] is that the stationary point of a function can be efficiently located by a simple analytic gradient-based algorithm, once the function satisfies the smooth property. Hence, to achieve $R_1$, we can utilize the smooth property of $L$ and the result of Theorem 4, which expresses the estimated gradient in terms of the analytic gradient, to analyze the stationary convergence of QNN. The key component for achieving $R_2$ is the PL condition. Recall that the utility bound $R_2$ evaluates the disparity between the expected and the optimal empirical risk. The study [43] indicates that, if a non-convex function satisfies the PL condition, then every stationary point is a global minimum. In other words, PL relates the stationary point to the optimal empirical risk, which is determined by the global minimum. Hence, by leveraging the PL condition and the result of $R_1$, we can obtain the utility bound $R_2$.
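To make the role of the three properties in this proof strategy concrete, their standard textbook forms are recalled below, together with the way the PL condition turns a bound on the gradient norm into a bound on the excess empirical risk. The PL constant $\mu > 0$ and the shorthand $L^* := L(\theta^*)$ are our notation; the constants proved in Lemma 2 are not reproduced here, and the last line treats an exact gradient step, unlike the noisy estimated gradient analyzed in Theorem 1.

```latex
% S-smooth, G-Lipschitz, and PL (with constant \mu > 0), for all \theta, \theta':
\|\nabla L(\theta) - \nabla L(\theta')\| \le S \|\theta - \theta'\|, \qquad
|L(\theta) - L(\theta')| \le G \|\theta - \theta'\|, \qquad
\|\nabla L(\theta)\|^2 \ge 2\mu \big(L(\theta) - L^*\big).

% PL converts stationarity into excess risk:
\mathbb{E}\big[L(\theta^{(T)})\big] - L^*
  \;\le\; \frac{1}{2\mu}\, \mathbb{E}\big[\|\nabla L(\theta^{(T)})\|^2\big].

% One exact gradient step \theta^{+} = \theta - \eta \nabla L(\theta) with \eta \le 1/S:
L(\theta^{+}) \le L(\theta) - \frac{\eta}{2} \|\nabla L(\theta)\|^2
\quad\Longrightarrow\quad
L(\theta^{+}) - L^* \le (1 - \eta\mu)\big(L(\theta) - L^*\big).
```

The second line shows why a vanishing gradient forces the objective to its global minimum under PL, and the third line shows the resulting linear decrease of the excess risk in the noiseless case.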

The learnability of quantum neural networks with noisy gates
In the analysis of ERM, we showed that the gate noise of QNN strongly affects its utility bounds. In this section, we aim to understand how noise in QNN affects its learning capabilities in terms of the sample complexity; namely, whether any concept class that can be probably approximately correctly (PAC) learned by QNN with noiseless gates can also be PAC learned by QNN with noisy gates. If the answer is negative, this concept class is unlikely to be efficiently learned on NISQ devices. Moreover, it would demonstrate the inequivalent learnability of noiseless QNN and noisy QNN. In the classical literature, the class of parity functions serves as an excellent example that separates the learnability of the PAC learning model from that of the statistical query (SQ) learning model [53]. Furthermore, there is an even more pressing need to understand what kinds of concept classes can be efficiently learned by QNN with quantum advantages. Towards this question, we explore whether any concept class that is learnable in the quantum statistical query (QSQ) model [59] is also learnable by noiseless and noisy QNN, motivated by the fact that the QSQ model can tackle certain learning tasks on which it outperforms its classical counterpart.
In order to answer the above questions, we attempt to relate noisy QNN to the differentially private (DP) learning model [77], driven by the observation that DP models behave similarly to noisy QNN. Specifically, analogous to QNN, DP models involve certain types of noise to achieve the privacy guarantee. If noisy QNN is also a DP model, then we can conclude the same learnability of QNN and noisy QNN, since a concept class that is learnable by a (non-private) algorithm with polynomial sample complexity can also be learned privately using a polynomial number of samples [58]. Lemma 1 provides an affirmative answer, showing that QNN with noisy gates can be treated as a DP learning model (the proof details are given in Appendix F).
The learnability of DP models [58] has been extensively explored in the literature. Two studies [60,59] separately proved that the sample complexities of classical and quantum (differentially private) PAC learning are equal, up to constant factors. Combining this with the fact that PAC = PPAC [58], we can conclude that the sample complexities of PAC, PPAC, quantum PAC, and quantum PPAC learning are equivalent. This conclusion together with Lemma 1 allows us to achieve Theorem 2, i.e., if noiseless QNN PAC learns a concept class, then QNN with gate errors can also learn this concept class using a polynomial number of samples. We present the full proof of Theorem 2 in Appendix G.
The result of Theorem 2 indicates that there does not exist a concept class that can be efficiently learned by noiseless QNN, while it is computationally hard for noisy QNN. This result provides a theoretical guarantee to realize QNN on NISQ chips to seek potential quantum advantages.
We further utilize the theoretical results of the QSQ model to quantify what kinds of learning problems can be tackled by noisy QNN with quantum advantages. The study [59] shows that QSQ can efficiently tackle parity, juntas, and DNF learning tasks, which are provably hard to learn by the classical statistical query (SQ) models [53]. We prove in Theorem 3 that the QSQ model can be efficiently simulated by QNN; the proof is given in Appendix H. Therefore, we conclude that these tasks can also be accomplished by QNN with quantum advantage.
All results in this section assume depolarization gate noise; however, they can be extended to a more general model of gate noise. See Appendix I for details.

Numerical simulations
We employ the UCI ML hand-written digits dataset [78] to validate the correctness of the utility bounds $R_1$ and $R_2$ of QNN, as achieved in Sections 3 and 4. In the rest of this section, we first introduce the employed dataset and the required preprocessing steps. We then elaborate on the parameterized quantum circuits used in QNN. Finally, we present our numerical simulation results.
The employed dataset includes in total 1797 hand-written digits images with 10 labels, where each label refers to a digit and each image has 64 attributes. The data preprocessing has three steps.
First, we clean the dataset and only collect images with labels 0 and 1. After cleaning, the total number of images is 360, where the number of examples with label 0 (label 1) is 178 (172). Some collected examples are shown in the left panel of Fig. 2. Alternatively, our simulation focuses on the binary classification task. Second, we utilize a feature dimension reduction technique, i.e., principal component analysis (PCA) [79], to reduce the feature dimension of each data example from 64 to 3. The middle panel of Fig. 2 exhibits the reconstructed hand-written digit images using the reduced data features. Such a step aims to balance the relatively high dimension features of the data example and the limited quantum resources available in present-day. After applying PCA, we denote the employed dataset as is the i-th data feature and y i ∈ {0, 1} is the i-th label. The last step is uniformly and randomly splitting the dataset z into two groups, i.e., the training dataset z t and the test dataset z p . The size of the training dataset z t and the test dataset z p is 280 and 80, respectively. The construction of parameterized quantum circuit, i.e., the data encoding circuit U x and the trainable unitary U (θ), follows the proposal [18]. In particular, the data encoding circuit U x uses the kernel encoding method, and the architecture of trainable unitary U (θ) follows the layer structure. The right panel of Fig. 2 illustrates the implementation of data encoding circuit and trainable circuit used in QNN. Three qubits are employed to build such two circuits. The data encoding circuits U x is composed of Hadamard gates H = 1 1 0 . We now employ the preprocessed hand-written digits dataset and quantum circuits as described above to study the learnability of QNN under depolarization noise. Specifically, we apply depolarization channel N p to every quantum circuit depth, where the depolarization rate is set as p = 0.0025. 
The depth of trainable circuits U (θ) is set as L = 5 and L = 20, respectively. The corresponding number of trainable parameters is 15 and 60, respectively. We also train QNN without noisy channels N p under the setting L = 5, 20, which aims to estimate the optimal parameter θ * and the minimized objective function L * . The number of iterations for all numerical simulations is set as T = 400. For QNN, the number of measurements to estimate the expectation value is set as K = 20.
The simulation results are shown in Fig. 3. We now elaborate on how the numerical simulations accord with our theoretical results. The two red arrows indicate the gap between the optimal result $\mathcal{L}^*$ and the achieved result $\mathcal{L}(\theta^{(T)})$. As the circuit depth L increases, the gap becomes larger for QNN. This phenomenon agrees with our theoretical result, where a larger $d$ and $\tilde{p}$ lead to a poorer utility bound $R_2$.
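To make the two quantities that drive the utility bound concrete, the short sketch below evaluates the number of trainable parameters $d$ and the merged noise rate $\tilde{p} = 1 - (1-p)^{L_Q}$ for the two simulated depths. It assumes one trainable rotation per qubit per layer (consistent with the reported 15 and 60 parameters) and, as a simplification, takes $L_Q = L$, i.e., it ignores the depth of the encoding circuit.

```python
def effective_noise(p: float, depth: int) -> float:
    """Merged depolarization rate p_tilde = 1 - (1 - p)**depth."""
    return 1.0 - (1.0 - p) ** depth


N_QUBITS = 3      # qubits used in the simulation
P_GATE = 0.0025   # depolarization rate applied at every circuit depth

summary = {}
for depth in (5, 20):
    # Assumption: one trainable rotation per qubit per layer, so d = 3 * L.
    d = N_QUBITS * depth
    # Assumption: encoding depth ignored, so the circuit depth L_Q equals L.
    summary[depth] = (d, effective_noise(P_GATE, depth))

for depth, (d, p_tilde) in summary.items():
    print(f"L={depth:2d}  d={d:2d}  p_tilde={p_tilde:.4f}")
```

Both $d$ and $\tilde{p}$ grow roughly fourfold when the depth goes from 5 to 20, which matches the direction of the widening gap in Fig. 3.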

Conclusion
In this study, we explore the learnability of QNN from the perspective of the ERM framework and sample complexity. The achieved utility bounds towards ERM indicate that more measurements, lower noise, and shallower circuit depth contribute to a better performance of QNN. Building on the conclusion that the noiseless QNN and QNN with noisy gates have the same learnability, we obtain theoretical evidence that supports implementing QNN on NISQ chips to pursue quantum advantages. Moreover, we demonstrate that QNN with noisy gates can efficiently learn parities, juntas, and DNFs with quantum advantages.
Our work also generates plausible new directions for NISQ study that we plan to explore in the future. First, we will use other advanced DP results to analyze various variational hybrid models on NISQ machines with provable guarantees. Second, we aim to tackle private learning tasks with quantum merits because the gate noise of NISQ machines will benefit the design of quantum DP mechanism.
The organization of the appendix is as follows. In Appendix A, we unify the notations used in the whole appendix. In Appendix B, we elaborate the implementation details of the quantum encoding circuit $U_x$ and the trainable quantum circuit $U(\theta)$ used in QNN. In Appendix C, we present the proof of Lemma 2, which quantifies the properties of the objective function with respect to optimization theory. Then, in Appendix D, we exhibit the proof of Theorem 4, as the precondition to achieve the utility bounds of QNN. In Appendix E, we exhibit the proof details of Theorem 1, which achieves the utility bounds of QNN towards ERM. The following four sections explain the learnability of QNN from the perspective of sample complexity. Specifically, in Appendix F, we provide the proof details of Lemma 1. Next, in Appendices G and H, we separately prove Theorem 2 and Theorem 3. Finally, in Appendix I, we generalize all achieved results to a more general quantum channel.

A The summary of notations
Here we unify the notations used in the appendices. A random variable X that follows the Delta distribution is denoted as $X \sim \mathrm{Del}(x_0)$, i.e., $\Pr(X = x_0) = 1$ and $\Pr(X \neq x_0) = 0$. A random variable X that follows uniform distribution is denoted as We denote the $\ell_p$ norm of $v$ as $\|v\|_p$. In particular, $\|v\|$ refers to the $\ell_2$ norm.

B Implementation details of encoding circuit and trainable circuit of QNN
The selection of encoding circuits U x and trainable circuit U (θ) is flexible in QNN. We now separately explain the implementation details of these two circuits supported by QNN.
Encoding circuit $U_x$. The typical encoding circuits can be divided into four categories. A common feature of these encoding methods is that their implementation only costs low circuit depth, driven by the restricted quantum resources. The first category is the direct amplitude encoding [80,45,46,47]. Specifically, the encoder circuit satisfies U x : b,j . This method requires a low feature dimension D, since the quantum gate complexity to build $U_x$ is O(D). The second category is the kernel mapping [18,19,20], where $B_i$ is encoded into a set of single-qubit gates with a specified arrangement, e.g., The third category is the dimension reduction method proposed by [48]. Specifically, instead of encoding $B_i$, the amplitude or kernel encoder circuit $U_x$ is exploited to encode the projected features $g(B_i) \in \mathbb{R}^{B_s \times D'}$, where $g(\cdot)$ is a predefined function and $D' < D$. The fourth category is the basis encoding [60,59,17], which is broadly used in quantum learning theory. Specifically, the encoding circuit $U_x$ is employed to prepare a quantum example $|\psi\rangle = \sum_{x \in \{0,1\}^N} \sqrt{D(x)}\, |x, c(x)\rangle$, where $D(x)$ is the data distribution over x and $c(x)$ corresponds to the label of the bit-string x [60,61]. In most cases, the distribution $D(x)$ is uniform. Hence, the state $|\psi\rangle$ can be efficiently prepared by setting B = 1 and applying Hadamard gates and controlled-NOT gates [38] to the initial state $|0\rangle^{\otimes (N+1)}$.
Trainable quantum circuits $U(\theta)$. The trainable quantum circuits, a.k.a. parameterized quantum circuits [50,23], used in QNN can be written as a product of layers of unitaries in the form $U(\theta) = \prod_{l=1}^{L} U_l(\theta_l)$, where $U_l(\theta_l)$ is composed of parameterized single-qubit gates and fixed two-qubit gates. Each trainable layer can be decomposed into $U_l(\theta_l) = (\bigotimes_{k=1}^{N} U_{l,k}(\theta_l)) U_{eng}$, where $U_{l,k}(\theta_l)$ represents the composition of trainable single-qubit gates and $U_{eng}$ refers to the entanglement layer that contains two-qubit gates. Depending on the detailed architecture, the implementation of $U_l(\theta_l)$ can be categorized into three classes. The first class is the hardware-efficient circuit architecture, where $U_{l,k}(\theta_l)$ and $U_{eng}$ are selected according to the given NISQ machine, which has a specific sparse qubit-to-qubit connectivity and a specified set of quantum gates [24,81,25]. The second class is the tensor network inspired architecture, in which the layout of quantum gates follows different tensor networks, e.g., the matrix product state, the tree tensor network, and the multi-scale entanglement renormalization ansatz (MERA) [82]. The third class is the Hamiltonian based architecture, where the entanglement layer $U_{eng}$ refers to a specific Hamiltonian, e.g., the study [19]. Notably, almost all quantum approximate optimization algorithms follow the Hamiltonian based architecture [83].
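The layered form $U(\theta) = \prod_l U_l(\theta_l)$ with $U_l(\theta_l) = (\bigotimes_k U_{l,k}(\theta_l)) U_{eng}$ can be sketched numerically as below. This is an illustration only: the choice of RY rotations as the single-qubit gates, the chain of CNOT gates as the entanglement layer, and the ordering of the layers are assumptions, not the circuit of [18]. NumPy is assumed to be available.

```python
import numpy as np

N_QUBITS, N_LAYERS = 3, 5


def ry(theta: float) -> np.ndarray:
    """Single-qubit rotation about the Y axis (assumed trainable gate)."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]], dtype=complex)


def cnot(control: int, target: int, n: int) -> np.ndarray:
    """CNOT on n qubits as a permutation matrix (qubit 0 = most significant bit)."""
    dim = 2 ** n
    u = np.zeros((dim, dim), dtype=complex)
    for basis in range(dim):
        bits = [(basis >> (n - 1 - q)) & 1 for q in range(n)]
        if bits[control]:
            bits[target] ^= 1
        out = sum(b << (n - 1 - q) for q, b in enumerate(bits))
        u[out, basis] = 1.0
    return u


# Assumed entanglement layer U_eng: CNOT(0 -> 1) followed by CNOT(1 -> 2).
U_ENG = cnot(1, 2, N_QUBITS) @ cnot(0, 1, N_QUBITS)


def layer(thetas: np.ndarray) -> np.ndarray:
    """One layer: tensor product of single-qubit rotations times U_eng."""
    rotations = np.array([[1.0 + 0j]])
    for t in thetas:
        rotations = np.kron(rotations, ry(t))
    return rotations @ U_ENG


def trainable_unitary(theta: np.ndarray) -> np.ndarray:
    """Product of layers; theta has shape (N_LAYERS, N_QUBITS)."""
    u = np.eye(2 ** N_QUBITS, dtype=complex)
    for thetas in theta:
        u = layer(thetas) @ u
    return u


THETA = np.random.default_rng(1).uniform(np.pi, 3 * np.pi, (N_LAYERS, N_QUBITS))
U = trainable_unitary(THETA)
```

With three qubits and five layers this toy circuit has 15 trainable angles, drawn here from the range $[\pi, 3\pi]$ used in Appendix C.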

C Proof of Lemma 2
Lemma 2 indicates that the objective function $\mathcal{L}(\theta)$ used in QNN is S-smooth and G-Lipschitz. Moreover, it satisfies the Polyak-Lojasiewicz (PL) condition under the assumption λ ≥ 1/π. To ease the discussion, we first formulate the explicit form of $\mathcal{L}(\theta)$. Without loss of generality, we set B = n, where each batch $B_i$ only contains the i-th input $x_i$. Denote the prepared quantum states as refers to the prediction of QNN given the i-th input $x_i$, $U(\theta)$ is the trainable circuit, Π is the employed two-outcome POVM, and $y_i$ is the true label of the i-th input. Moreover, since the tunable parameters θ in QNN refer to rotation angles, we set their range as $\theta \in [\pi, 3\pi]^d$.
Proof of Lemma 2. We employ the three lemmas presented below to prove Lemma 2, whose proofs are given in the following subsections.
with S > 0. In other words, to obtain S, we need to upper bound the second derivative of $\mathcal{L}(\theta)$, i.e., $S \geq \|\nabla^2 \mathcal{L}(\theta)\|_\infty$. Following the notation used in Eqn. (4), the gradient for the parameter $\theta_j$ is where $\hat{y}_i^{(\pm j)} = \mathrm{Tr}(\Pi U(\theta \pm \frac{\pi}{2} e_j)\rho_{B_i} U(\theta \pm \frac{\pi}{2} e_j)^\dagger)$, the second equality employs the conclusion of the parameter shift rule with ∂ŷ i ∂θ j =ŷ 37], and the last inequality uses the facts $\pi \leq \theta_j \leq 3\pi$, ∂θ j ∂θ k can be derived using the results of Eqn. (6). In particular, where the first equality comes from the last equality of Eqn. (6), and the last inequality employs supported by the parameter shift rule and $\hat{y}_i, \hat{y}_i^{(\pm j)} \in [0, 1]$. The result of Eqn. (7) implies that $\|\nabla^2 \mathcal{L}\|_\infty \leq \frac{3}{2} + \lambda$. In conjunction with Eqn. (5), the objective function is S-smooth with $S = \frac{3}{2} + \lambda$.
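The parameter shift rule used in this proof can be checked on a toy example. The sketch below assumes a single qubit, $U(\theta) = R_Y(\theta)$, input state $|0\rangle\langle 0|$ and $\Pi = |0\rangle\langle 0|$, so that $\hat{y}(\theta) = \cos^2(\theta/2)$. It uses the standard form of the rule for Pauli rotations, $\partial \hat{y}/\partial\theta = (\hat{y}^{(+)} - \hat{y}^{(-)})/2$; the normalization in the original equation is not recoverable here and may differ.

```python
import math


def prediction(theta: float) -> float:
    """y_hat = Tr(Pi U rho U^dagger) for the assumed single-qubit toy model."""
    return math.cos(theta / 2.0) ** 2


def parameter_shift_gradient(theta: float) -> float:
    """Gradient from two shifted evaluations at theta +/- pi/2."""
    plus = prediction(theta + math.pi / 2.0)
    minus = prediction(theta - math.pi / 2.0)
    return 0.5 * (plus - minus)


def analytic_gradient(theta: float) -> float:
    """d/dtheta cos^2(theta/2) = -sin(theta)/2."""
    return -0.5 * math.sin(theta)


# Compare both gradients on a grid inside the parameter range [pi, 3*pi].
grid = [math.pi + 2.0 * math.pi * k / 50 for k in range(51)]
max_error = max(abs(parameter_shift_gradient(t) - analytic_gradient(t)) for t in grid)
print(f"largest deviation on the grid: {max_error:.2e}")
```

Because the shifted predictions lie in $[0, 1]$, the gradient obtained this way is bounded in magnitude by $1/2$, which is the kind of fact the smoothness bound relies on.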
Moreover, the mean value theorem gives that, if f : Combining Eqn. (8) and (9), the G-Lipschitz condition in Eqn. (8) is equivalent to We now replace f, b, and a used in Eqn. (10) with $\mathcal{L}$, $\theta^{(1)}$, and $\theta^{(2)}$ to prove that the objective function $\mathcal{L}$ is G-Lipschitz. Specifically, we need to find a real value G that satisfies where $\theta \in (\theta^{(2)}, \theta^{(1)})$.
The upper bound of the term $\langle \nabla \mathcal{L}(\theta), \theta^{(1)} - \theta^{(2)} \rangle$ is In conjunction with Eqn. (11) and (12), the G-Lipschitz property of $\mathcal{L}$ requires By leveraging the result of Eqn. (6) with $\nabla_j \mathcal{L}(\theta) \leq 1 + 3\lambda\pi$, we obtain the upper bound of the left side in Eqn. (13) is This leads to the objective function $\mathcal{L}$ of QNN being G-Lipschitz with $G = d(1 + 3\pi\lambda)$.

C.3 Proof of PL condition, Lemma 5
Proof of Lemma 5. Recall that the Polyak-Lojasiewicz condition, as formulated in Definition 3, requires that the objective function $\mathcal{L}$ satisfies where $\mathcal{L}^* = \min_{\theta \in \mathcal{C}} \mathcal{L}(\theta)$. We first derive a lower bound of $\|\nabla \mathcal{L}(\theta)\|^2$. In particular, we have The lower bound of $\max_j (\nabla_j \mathcal{L}(\theta))^2$ as shown in Eqn. (16) follows where the last inequality is achieved by exploiting the second-to-last line of Eqn. (6), and the facts $\theta_j \in [\pi, 3\pi]$ andŷ i , y i ,ŷ Combining the assumption λ ≥ 1/π and the above results, the lower bound of Eqn. (16) satisfies We then derive the upper bound of the term $(\mathcal{L}(\theta) - \mathcal{L}^*)$ in Eqn. (15). In particular, we have where the first inequality comes from the definitions of $\mathcal{L}^*$, i.e., , and the second inequality employs the definition of $\mathcal{L}(\theta)$ with and $\frac{\lambda}{2}\|\theta\|^2 \leq \frac{\lambda}{2} d \|\theta\|_\infty^2 = (3\pi)^2 \lambda d/2$. By combining Eqn. (17) and (18) with Eqn. (15), we obtain the following relation The above relation indicates that the objective function $\mathcal{L}(\theta)$ satisfies the PL condition with

D Proof of Theorem 4
Theorem 4 establishes the relation between the analytic gradient $\nabla_j \mathcal{L}_i(\theta^{(t)})$ and the estimated gradient $\nabla_j \bar{\mathcal{L}}_i(\theta^{(t)})$ of QNN. Its formal description is as follows.
Theorem 5 (The formal description of Theorem 4). Denote $\tilde{p} = 1 - (1 - p)^{L_Q}$ with $L_Q$ being the quantum circuit depth. At the t-th iteration, we define five constants with , K refers to the number of quantum measurements, and $\hat{Y}_i^{(t)}$ and $Y_i$ are the averages of the predicted and true labels for the i-th batch $B_i$. The relation between the estimated and analytic gradients follows
The intuition to achieve Theorem 5 is as follows. As explained in the main text, the discrepancy between the estimated gradient $\nabla_j \bar{\mathcal{L}}_i(\theta^{(t)})$ and the analytic gradient $\nabla_j \mathcal{L}_i(\theta^{(t)})$ is caused by the difference between the estimated resultsȲ  ), due to the involved depolarization noise $\mathcal{N}_p$ and the finite number of measurements K. Specifically, the noisy channel $\mathcal{N}_p$ shifts the expectation values, and the finite number of measurements K turns the output of the quantum circuit from deterministic to random. Under the above observation, the estimated gradient $\nabla_j \bar{\mathcal{L}}_i(\theta^{(t)})$ can be treated as a random variable that is formed by three random variablesȲ . Therefore, to explicitly build the relation between $\nabla_j \bar{\mathcal{L}}_i(\theta^{(t)})$ and $\nabla_j \mathcal{L}_i(\theta^{(t)})$, we should first formulate the distribution of the estimated gradients using $\bar{Y}_i^{(t)}$ and $\bar{Y}_i^{(t,\pm j)}$, and then connect the obtained distribution with the analytic gradients. The following lemma summarizes the distribution of the estimated gradients usingȲ , whose proof is given in Subsection D.1.
The mean ν (t,± j ) i and variance (σ Proof of Theorem 5. We now utilize the established relations as shown in Lemma 6 to obtain the relation between the estimated and the analytic gradients. Recall that, at the t-th iteration, given the input B i and K measurements, the estimated gradient for j-th parameter θ j of noisy QNN is The term ∆ (t,j) i . Following the notations used in Lemma 6, the mean and variance of the term

supported by the definition of moments and the independent relation betweenȲ
By leveraging the explicit form of ν (t,± j ) i , the random variable ∆ (t,j) i can be rewritten as where ξ (t,j) is a random variable with zero mean and variance (σ Following the notations used in Lemma 6, an equivalent representation of where ξ (t) is a random variable with zero mean and variance (σ Combining the above equation and the explicit expression of ξ (t) and ξ (t,j) , we obtain the relation between the estimated and the analytic gradients. Specifically, the estimated gradient can be formulated as and the last two constants, which separately correspond to the variance (σ Tr(Π)
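The two effects discussed in this appendix, namely the shift of expectation values caused by depolarization and the randomness caused by a finite number of measurements K, can be illustrated on the parameter-shift difference alone. The sketch below reuses an assumed single-qubit toy model with $\hat{y}(\theta) = \cos^2(\theta/2)$ and $\mathrm{Tr}(\Pi)/D = 1/2$, and it only covers the term $(\bar{Y}^{(+)} - \bar{Y}^{(-)})/2$, not the full gradient of the objective function. In this toy model the mean of the sampled difference equals $(1-\tilde{p})$ times the noiseless one, since the constant offset $\tilde{p}\,\mathrm{Tr}(\Pi)/D$ cancels.

```python
import math
import random


def prediction(theta: float) -> float:
    """Noiseless expectation value of the assumed single-qubit toy model."""
    return math.cos(theta / 2.0) ** 2


def noisy_expectation(y_hat: float, p_tilde: float, tr_pi_over_d: float = 0.5) -> float:
    """Expectation after a merged depolarization channel with rate p_tilde."""
    return (1.0 - p_tilde) * y_hat + p_tilde * tr_pi_over_d


def sample_mean(prob: float, k: int, rng: random.Random) -> float:
    """Average of k Bernoulli(prob) measurement outcomes."""
    return sum(rng.random() < prob for _ in range(k)) / k


def estimated_shift_difference(theta: float, p_tilde: float, k: int, rng: random.Random) -> float:
    """(Y_bar_plus - Y_bar_minus) / 2 from k noisy measurements per shifted circuit."""
    plus = sample_mean(noisy_expectation(prediction(theta + math.pi / 2), p_tilde), k, rng)
    minus = sample_mean(noisy_expectation(prediction(theta - math.pi / 2), p_tilde), k, rng)
    return 0.5 * (plus - minus)


rng = random.Random(0)
THETA, P_TILDE, K, REPEATS = 4.0, 0.0488, 20, 2000   # illustrative values
estimates = [estimated_shift_difference(THETA, P_TILDE, K, rng) for _ in range(REPEATS)]
empirical_mean = sum(estimates) / REPEATS
predicted_mean = (1.0 - P_TILDE) * (-0.5 * math.sin(THETA))
print(f"empirical mean {empirical_mean:.4f}, predicted mean {predicted_mean:.4f}")
```

Individual estimates with K = 20 scatter widely around the predicted mean, which is the finite-measurement variance the theorem accounts for.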

D.1 Proof of Lemma 6
To achieve Lemma 6, we first simplify the learning model of QNN with the depolarization noise. In particular, all noisy channels $\mathcal{N}_p$, which are separately applied to each quantum circuit depth, can be merged together at a specific circuit depth and represented by a new depolarization channel $\mathcal{N}_{\tilde{p}}$.
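The merging claim can be checked numerically on a single qubit. The sketch below applies $\mathcal{N}_p(\rho) = (1-p)\rho + p\, I/D$ after each of five layers and compares the result with one channel of rate $\tilde{p} = 1 - (1-p)^{L}$ applied after the noiseless circuit. The layer unitaries (an RY rotation followed by an RZ rotation) and their angles are hypothetical choices made only for this illustration.

```python
import cmath
import math


def matmul(a, b):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)] for i in range(2)]


def dagger(a):
    """Conjugate transpose of a 2x2 matrix."""
    return [[a[j][i].conjugate() for j in range(2)] for i in range(2)]


def layer_unitary(alpha: float, beta: float):
    """Hypothetical layer: RZ(beta) applied after RY(alpha)."""
    c, s = math.cos(alpha / 2), math.sin(alpha / 2)
    ry = [[c + 0j, -s + 0j], [s + 0j, c + 0j]]
    rz = [[cmath.exp(-0.5j * beta), 0j], [0j, cmath.exp(0.5j * beta)]]
    return matmul(rz, ry)


def evolve(u, rho):
    """Unitary evolution rho -> u rho u^dagger."""
    return matmul(matmul(u, rho), dagger(u))


def depolarize(rho, p: float):
    """Depolarization channel (1 - p) * rho + p * I / 2 on one qubit."""
    return [[(1 - p) * rho[i][j] + (p / 2 if i == j else 0) for j in range(2)] for i in range(2)]


ANGLES = [(0.3, 1.1), (2.0, -0.7), (1.4, 0.5), (-0.9, 2.2), (0.6, 0.1)]
P_GATE = 0.0025
rho0 = [[1 + 0j, 0j], [0j, 0j]]

noisy, ideal = rho0, rho0
for alpha, beta in ANGLES:
    u = layer_unitary(alpha, beta)
    noisy = depolarize(evolve(u, noisy), P_GATE)   # noise after every layer
    ideal = evolve(u, ideal)                       # noiseless circuit

p_tilde = 1 - (1 - P_GATE) ** len(ANGLES)
merged = depolarize(ideal, p_tilde)                # one merged channel at the end
max_diff = max(abs(noisy[i][j] - merged[i][j]) for i in range(2) for j in range(2))
print(f"p_tilde = {p_tilde:.6f}, largest entry-wise difference = {max_diff:.2e}")
```

The agreement relies on the maximally mixed state being invariant under every unitary, so each layer only rescales the weight of the noiseless state by $(1-p)$.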

Lemma 7. Let $\mathcal{N}_p$ be the depolarization channel. There always exists a depolarization channel
Proof of Lemma 7. Denote $\rho^{(k)}$ as $\rho^{(k)} = \prod_{l=1}^{k} U_l(\theta)\rho U_l(\theta)^\dagger$. Applying $\mathcal{N}_p$ to $\rho^{(1)}$ gives where D refers to the dimension of the Hilbert space that $\mathcal{N}_p$ acts on. Supported by the above equation, applying $U_2(\theta)$ to the state $\mathcal{N}_p(\rho^{(1)})$ gives Then applying $\mathcal{N}_p$ to the state $U_2(\theta)\mathcal{N}_p(\rho^{(1)})U_2(\theta)^\dagger$ gives By induction, suppose that at the k-th step the generated state is Then applying $U_{k+1}(\theta)$ followed by $\mathcal{N}_p$ gives According to the formula of the depolarization channel, an immediate observation is that the noisy QNN is equivalent to applying a single depolarization channel $\mathcal{N}_{\tilde{p}}$ at the last circuit depth L, i.e.,
We then use the simplified QNN given by Lemma 7 to explore the relation between the generated statistic $\bar{Y}_i^{(t)}$ and the expectation value $\hat{Y}^{(t)}$ (the same rule applies to connect $\bar{Y}_i^{(t,\pm j)}$ with $\hat{Y}^{(t,\pm j)}$). At the t-th iteration, given the tunable parameters $\theta^{(t)}$ and inputs $B_i$, the ensemble corresponding to the generated state of QNN before taking quantum measurements is {p l , γ i,2 = I D /D. After applying a two-outcome POVM Π to measure such an ensemble K times, the generated statistics (sample mean) isȲ is a random variable that satisfies Fact 1.
k is a random variable that follows the distribution P Q (V (t) Fact 1 implies that the mean and variance of V respectively. Moreover, since each outcome V (t) k follows the distribution P Q , the mean ν Following the same routine, where the mean ν (t,± j ) i and the variance (σ

E Proof of Theorem 1
Theorem 1 quantifies the utility bounds $R_1$ and $R_2$ of QNN under the depolarization noise towards the ERM framework. For ease of illustration, we restate Theorem 1 below.
Theorem 6 (Restatement of Theorem 1). QNN outputs $\theta^{(T)} \in \mathbb{R}^d$ after T iterations with utility bounds where K is the number of quantum measurements, $L_Q$ is the quantum circuit depth, p is the gate noise, and B is the number of batches.
The high level idea to achieve the utility bounds R 1 and R 2 is as follows. Recall that R 1 measures how far the trainable parameter of QNN is away from the stationary point. A well-known result in optimization theory [76] is that when a function satisfies the smooth property, its stationary point can be efficiently located by a simple gradient-based algorithm. By leveraging this observation and the relation between the estimated and analytic gradients as achieved in Theorem 5, we can quantify how the estimated gradients of QNN converge to the stationary point, which corresponds to the utility bound R 1 .
Recall that the utility bound R 2 evaluates the disparity between the expected empirical risk and the optimal risk that is determined by the global minimum. To achieve R 2 , we utilize the result of the study [43], which claims that if a non-convex function satisfies PL condition, then every stationary point is the global minimum. Since the objective function used in QNN satisfies PL condition as shown in Lemma 2, we can effectively combine the PL condition with the result of R 1 to obtain the utility bound R 2 .
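The role of the PL condition can be illustrated in the noiseless setting with a standard toy function. The sketch below runs plain gradient descent on $f(x) = x^2 + 3\sin^2(x)$, which is non-convex yet has the single minimizer $x = 0$ with $f^* = 0$. The smoothness constant 8 follows from $f''(x) = 2 + 6\cos(2x)$; the PL constant $\mu = 1/32$ is an assumed value that the code checks on a grid. The function, the constants, and the step size $1/S$ are illustrative choices and are unrelated to the QNN objective and its noisy gradients.

```python
import math

SMOOTHNESS = 8.0          # upper bound on |f''(x)| = |2 + 6 cos(2x)|
PL_CONSTANT = 1.0 / 32.0  # assumed PL constant, verified on a grid below


def f(x: float) -> float:
    return x * x + 3.0 * math.sin(x) ** 2


def grad(x: float) -> float:
    return 2.0 * x + 3.0 * math.sin(2.0 * x)


def gradient_descent(x0: float, steps: int) -> list:
    """Return the optimality gaps f(x_t) - f* (with f* = 0) along the run."""
    x, gaps = x0, [f(x0)]
    for _ in range(steps):
        x -= grad(x) / SMOOTHNESS      # step size eta = 1 / S
        gaps.append(f(x))
    return gaps


# Grid check of the PL inequality 0.5 * |f'(x)|^2 >= mu * (f(x) - f*).
pl_holds = all(
    0.5 * grad(x) ** 2 >= PL_CONSTANT * f(x) - 1e-12
    for x in (k / 100.0 for k in range(-1000, 1001))
)

gaps = gradient_descent(3.0, 50)
rate = 1.0 - PL_CONSTANT / SMOOTHNESS
# Under S-smoothness and PL, gaps[t] <= rate**t * gaps[0] for every t.
bound_holds = all(gaps[t] <= rate ** t * gaps[0] + 1e-12 for t in range(len(gaps)))
print(f"PL on grid: {pl_holds}, linear-rate bound: {bound_holds}, final gap: {gaps[-1]:.2e}")
```

The run passes through a region where $f$ is locally concave and still reaches the global minimum, which is the behavior the PL argument exploits.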
Proof of Theorem 6. We employ the following two theorems to achieve Theorem 6, whose proofs are given in Subsections E.1 and E.2, respectively.
Theorem 7. Given the dataset z, QNN outputs $\theta^{(T)}$ after T iterations with utility bound
Theorem 8. Given the dataset z, QNN outputs $\theta^{(T)}$ after T iterations with utility bound
As for $R_1$, with setting $T \to \infty$ and after simplification, the utility bound as shown in Theorem 7 follows As for $R_2$, with setting and after simplification, the utility bound as shown in Theorem 8 follows

E.1 Proof of Theorem 7: The utility bound $R_1$
The proof of Theorem 7 employs the following lemma, whose proof is given in Subsection E.3.
2 with S being the smooth parameter is upper bounded by
Proof of Theorem 7. Recall that the optimization rule of noisy QNN at the t-th iteration follows Since the objective function $\mathcal{L}(\theta)$ is S-smooth, as indicated in Lemma 2, we have Combining the above two equations and setting η = 1/S, we have Recall the definition of the estimated gradient is ∇ jL (θ (t) ) = 1 ) and the explicit expression of ∇ jLi (θ (t) ) is In other words, the gradient for the j-th parameter ∇ jL (θ (t) ) follows Combining Eqn. (39) with Eqn. (40) and taking expectation over ξ , we obtain The first inequality uses the result of Eqn. (40). The second inequality uses E[ξ ] = 0 as shown in Theorem 5, and $-G/d \leq \nabla_j \mathcal{L}(\theta^{(t)}) \leq G/d$ supported by the G-Lipschitz property.
By leveraging Lemma 8, Eqn. (41) can be further simplified as The first inequality comes from Lemma 8, and the second inequality employs $\frac{(1-\tilde{p})^4}{2SB} \leq \frac{(1-\tilde{p})^2}{2S}$ and the following result where the first inequality uses the upper bound of C and the second inequality uses $(1 - \tilde{p})^2 \leq 1$.
An equivalent representation of Eqn. (42) is By induction, with summing over t = 0, ..., T − 1 and taking expectation of Eqn. (44), we obtain where the second inequality uses L(

E.2 Proof of Theorem 8: The utility bound $R_2$
Proof of Theorem 8. The proof of Theorem 8 is similar to that of Theorem 7. In particular, following the same routine, we obtain the result of Eqn. (42), i.e., Then, we invoke the PL condition as formulated in Lemma 2 and acquire An equivalent reformulation of Eqn. (47) is By induction, with summing over t = 0, ..., T and taking expectation, we obtain where the second inequality uses $\mathcal{L}(\theta^{(0)}) - \mathcal{L}^* \leq 1 + 90\lambda d$ and $1 + x \leq e^x$ for all real x.

E.3 Proof of Lemma 8
Proof of Lemma 8. As shown in Theorem 5, the explicit formula of the estimated gradient is By using the above result, we obtain The first and second inequalities use C and

F Proof of Lemma 1
As shown in Theorem 5, the estimated gradient is centered around the analytic gradient and is perturbed by the random noise $\varsigma_i^{(t,j)}$ that follows a certain distribution. This behavior resembles a class of differentially private (DP) learning algorithms [55], where a certain type of noise is attached to the gradients to achieve privacy and utility guarantees. Driven by the similarity between noisy QNN and DP models, here we investigate whether noisy QNN can be treated as a DP learning model.
The proof of Lemma 1 leverages the composition property of DP model as summarized below.
Proof of Lemma 1. Recall that, for noisy QNN, the estimated gradient of the j-th parameter at the t-th iteration is The composition property of DP as shown in Proposition 1 indicates that, if the mechanism $\mathcal{M}(\theta_j^{(t)}, B_i)$ that corresponds to the quantum circuits shown in Fig. 1 (b), which is used to output $\bar{Y}_i^{(t)}$ and $\bar{Y}_i^{(t,\pm j)}$, satisfies the DP property, then QNN with noisy gates also achieves the DP promise. In other words, to guarantee the DP property of QNN, we should prove that the random mechanism $\mathcal{M}(\theta_j^{(t)}, B_i)$ is a DP model. Without loss of generality, here we focus on the setting with B = 1 and $B_i = z$, since the privacy remains unchanged when we vary B from 1 to N.
As explained in Theorem 5, the randomness of the mechanism M(θ where $q = (1 - \tilde{p})\mathrm{Tr}(\hat{Y}_i^{(t)}) + \tilde{p}\,\mathrm{Tr}(\Pi)/D$. In conjunction with Eqn. (54) and the definition of DP as formulated in Definition 2, the random algorithm $\mathcal{M}(\theta_j^{(t)}, z)$ is DP if the following relation is satisfied, i.e., where $\bar{Y}_i^{\prime(t)}$ refers to the sample mean of QNN given the tunable parameters $\theta^{(t)}$ and the neighboring dataset $z'$. Combining Eqn. (54) and Eqn. (55), we obtain where $q' = (1 - \tilde{p})\mathrm{Tr}(\hat{Y}_i^{\prime(t)}) + \tilde{p}\,\mathrm{Tr}(\Pi)/D$, and the inequality uses the facts $q^{Ky}(1 - q)^{K - Ky} \leq q$ and $q'^{Ky}(1 - q')^{K - Ky} \geq (q'(1 - q'))^{K}$.
By replacing $q$ and $q'$ with their explicit expressions, Eqn. (56) can be further simplified as where the numerator employs Tr(Ŷ The result achieved in Eqn. (56) indicates that the mechanism M(θ We then use the DP property of the mechanism $\mathcal{M}$ to derive the privacy parameter of QNN at the t-th iteration. By leveraging Proposition 1, the privacy parameters ( , δ ) of QNN to generate the estimated gradient of the j-th parameter $\nabla_j \mathcal{L}_i$ is Since the d trainable parameters of $\nabla \mathcal{L}_i$ are independent of each other, the definition of DP requires that, given two neighboring input datasets $z$ and $z'$, the following relation should be satisfied at the t-th iteration, In conjunction with Eqn. (59) and Eqn. (60), we obtain = d , and , δ = dδ .
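The idea that noise in the measured statistic limits how well two neighboring datasets can be told apart can be illustrated numerically. The sketch below is not the derivation of the proof: it assumes that the number of "1" outcomes among K measurements is binomial with success probability $q = (1-\tilde{p})\hat{y} + \tilde{p}\,\mathrm{Tr}(\Pi)/D$, takes two hypothetical noiseless predictions for neighboring datasets, and reports the largest log-ratio between the two outcome distributions (a pure-DP style quantity with $\delta = 0$). The helper `compose_over_parameters` applies a linear scaling over $d$ coordinates, i.e., the basic composition rule, as an assumed reading of the composition step.

```python
import math


def shifted_probability(y_hat: float, p_tilde: float, tr_pi_over_d: float = 0.5) -> float:
    """Success probability of one measurement after merged depolarization."""
    return (1.0 - p_tilde) * y_hat + p_tilde * tr_pi_over_d


def binomial_pmf(k: int, n: int, q: float) -> float:
    return math.comb(n, k) * q ** k * (1.0 - q) ** (n - k)


def worst_case_epsilon(q: float, q_prime: float, n: int) -> float:
    """Largest |log ratio| between Binomial(n, q) and Binomial(n, q_prime)."""
    return max(
        abs(math.log(binomial_pmf(k, n, q) / binomial_pmf(k, n, q_prime)))
        for k in range(n + 1)
    )


def compose_over_parameters(eps: float, delta: float, d: int) -> tuple:
    """Basic composition over d coordinates: parameters scale linearly."""
    return d * eps, d * delta


K = 20
Y_HAT, Y_HAT_NEIGHBOR = 0.9, 0.8     # hypothetical noiseless predictions

eps_by_noise = {}
for p_tilde in (0.05, 0.5):
    q = shifted_probability(Y_HAT, p_tilde)
    q_prime = shifted_probability(Y_HAT_NEIGHBOR, p_tilde)
    eps_by_noise[p_tilde] = worst_case_epsilon(q, q_prime, K)

eps_total, delta_total = compose_over_parameters(eps_by_noise[0.5], 0.0, 15)
print(eps_by_noise, eps_total, delta_total)
```

In this toy setting the stronger noise rate gives the smaller worst-case log-ratio, in line with the intuition that gate noise acts like a privacy-preserving perturbation.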
Since the mechanism of QNN that is used to generate $\nabla \mathcal{L}_i$ at the t-th iteration satisfies the above DP property, we can utilize Proposition 1 again to show that QNN with T iterations is also an $(\epsilon, \delta)$-DP model, i.e.,

G Proof of Theorem 2
The key ingredients to achieve Theorem 2 are classical and quantum differentially private (DP) learning techniques [54,84,55,57], and quantum PAC and QSQ learning models [60,59]. The intuition to employ DP is as follows. The behavior of QNN with gate noise resembles DP learning, where a certain type of noise is injected into the learning model. Moreover, a recent study [59] proved that if a learning problem is quantum PAC learnable, then it is also quantum privately PAC (PPAC) learnable. Such an observation implies that if QNN with gate noise belongs to the DP learning model, then we can conclude that noiseless QNN and QNN with gate noise have the same learnability.
To incorporate the achieved result of QNN with other quantum learning theory conclusions, the quantum examples discussed below concentrate on a specific type as formulated in Definition 4, which is broadly employed in quantum PAC learning and quantum statistical query (QSQ) learning. Note that the quantum encoding circuit $U_x$ can efficiently prepare such quantum examples, as explained in Appendix B.
Proof of Theorem 2. Given access to quantum examples $|\psi_{c^*}\rangle$ as formulated in Definition 4, we can leverage the results of quantum learning theory [60,59] to explore the learnability of noiseless QNN and QNN with noisy gates. In particular, the two studies [60,59] proved quantum PAC = PAC and quantum PPAC = PPAC. Since a well known classical result [58] is PAC = PPAC, we obtain the following relationship in terms of sample complexity, i.e., quantum PAC = quantum PPAC = PAC = PPAC. (63)
Eqn. (63) indicates that the learnable concept classes for the non-private learning model and the DP learning model are the same. Consequently, if a concept is PAC learnable by a QNN, then such a concept is also PAC learnable by a DP learning model, i.e., QNN with noisy gates, whose DP property has been proved in Lemma 1.

H Proof of Theorem 3
Theorem 3 quantifies the required query complexity of QNN with noisy gates to simulate one query of the QSQ model. The definition of the quantum statistical query (QSQ) learning model and its relevant theoretical results [59] are shown below. The key technique to achieve Theorem 3 is the concentration inequality, which bounds the deviation of a random variable that corresponds to the output of QNN from a certain number. In particular, treating the output α in the QSQ model as the sample mean of QNN, the relation $|\alpha - \langle\psi_{c^*}|M|\psi_{c^*}\rangle| \leq \tau$ asks how likely it is that the distance between the sample mean α and its expectation $\langle\psi_{c^*}|M|\psi_{c^*}\rangle$ is within τ. Such a question can be effectively answered by using a concentration inequality.
Lemma 9 (Modified from Lemma 4.2, 4.3, and 4.5 in [59]). Let C be the concept class of parities, k-juntas, or poly(n)-sized DNFs (Disjunctive Normal Forms), then there exists a poly(n) queries QSQ algorithm with tolerance τ =Õ(ε) that ε-learns C under the uniform distribution. All of these concepts are computationally hard for SQ models.
Proof of Theorem 3. Following the notations used in Definitions 4 and 5, suppose that the encoding circuit $U_x$ prepares the quantum example $|\psi_{c^*}\rangle$ and the trainable unitary $U(\theta)$ is the identity $I_2^{\otimes (N+1)}$. Then, applying the observable M to the generated state of QNN, the expectation value of quantum measurements under the depolarization noise $\mathcal{N}_{\tilde{p}}$ follows $\tilde{\nu} = (1 - \tilde{p})\nu + \tilde{p}\,\mathrm{Tr}(M)/2^{N+1}$ with $\nu = \langle\psi_{c^*}|M|\psi_{c^*}\rangle$, supported by Lemma 7. The measurement outcome $V_k$ is a random variable that satisfies $V_k \sim \mathrm{Ber}(\tilde{\nu})$.
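The sampling step just described can be illustrated as follows. The sketch draws Bernoulli outcomes with mean $\tilde{\nu}$ and uses the textbook Hoeffding inequality for variables in $[0, 1]$, $\Pr(|\frac{1}{K}\sum_k V_k - \tilde{\nu}| \geq \delta) \leq 2\exp(-2K\delta^2)$, to choose K. The constants of this textbook form are an assumption and may differ from those of the bound used in the proof; the numerical values of $\nu$, $\tilde{p}$, and $\mathrm{Tr}(M)/2^{N+1}$ are hypothetical.

```python
import math
import random


def shifted_expectation(nu: float, p_tilde: float, tr_m_over_dim: float) -> float:
    """nu_tilde = (1 - p_tilde) * nu + p_tilde * Tr(M) / 2**(N + 1)."""
    return (1.0 - p_tilde) * nu + p_tilde * tr_m_over_dim


def measurements_needed(delta: float, b: float) -> int:
    """Smallest K with 2 * exp(-2 * K * delta**2) <= b (textbook Hoeffding)."""
    return math.ceil(math.log(2.0 / b) / (2.0 * delta ** 2))


def sample_mean(prob: float, k: int, rng: random.Random) -> float:
    """Average of k Bernoulli(prob) measurement outcomes V_1, ..., V_k."""
    return sum(rng.random() < prob for _ in range(k)) / k


NU, P_TILDE, TR_M_OVER_DIM = 0.7, 0.0488, 0.5   # hypothetical values
DELTA, B = 0.05, 0.01                           # deviation and failure probability

nu_tilde = shifted_expectation(NU, P_TILDE, TR_M_OVER_DIM)
k_needed = measurements_needed(DELTA, B)

rng = random.Random(0)
TRIALS = 200
failures = sum(
    abs(sample_mean(nu_tilde, k_needed, rng) - nu_tilde) >= DELTA
    for _ in range(TRIALS)
)
print(f"nu_tilde={nu_tilde:.4f}  K={k_needed}  failures={failures}/{TRIALS}")
```

The sample mean concentrates around the shifted value $\tilde{\nu}$ rather than $\nu$, which is why the proof additionally has to control the distance between the two.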
By the Chernoff-Hoeffding bound for real-valued variables, we obtain the relation between the sample mean $\frac{1}{K}\sum_{k=1}^{K} V_k$ with K measurements and the target result $\tilde{\nu}$, i.e., Moreover, the distance between the target result ν and the shifted expectation value $\tilde{\nu}$ follows In conjunction with the above two equations, we obtain, with probability at least $1 - 2\exp(-\delta^2 n/2)$ Note that, to guarantee that the term $\tilde{p}\nu + \mathrm{Tr}(M)/2^{N+1} + \delta/2$ is upper bounded by τ, the parameter $\tilde{p}$ should satisfy Under the assumption of Eqn. (67), with setting $\delta \leq 2(\tau - \tilde{p}\nu - \mathrm{Tr}(M)/2^{N+1})$, QNN simulates the QSQ model as formulated in Definition 5, i.e., Under this setting, the relation between the number of measurements K and the failure probability b obeys After simplification, we conclude that, when $\tilde{p} \leq$ , with success probability at least 1 − b, the required number of measurements to attain 1

I Generalizing the results to more general quantum channels
In this section, we generalize the results achieved in the main text from the depolarization channel to a more general channel $\mathcal{E}_{p_1}$, i.e., where $\rho, \kappa \in \mathbb{C}^{D \times D}$, κ is a mixed state that can either be correlated or uncorrelated with ρ, and $p_2 + p_3 = p_1$ with $p_1, p_2 \geq 0$ and $p_3 > 0$. It is worth noting that the quantum channel $\mathcal{E}_{p_1}$ is sufficiently universal, covering most Pauli channels associated with the depolarization channel [38,27]. The outline of this section is as follows. In Subsection I.1, we discuss the utility bounds of QNN under the general channel setting. Then, in Subsection I.2, we analyze the DP property of QNN when it is perturbed by the general channel. Last, in Subsection I.3, we quantify the learnability of QNN under the general channel setting from the perspective of sample complexity.

I.1 Utility bounds of QNN
Analogous to the depolarization channel setting, we first simplify the noisy QNN model to ease the analysis. Specifically, after applying $\mathcal{E}_{p_1}$ to each circuit depth, the generated state follows where $(1 - p_1)^{L_Q} + p_2 + p_3^{L_Q} = 1$, and κ is a mixed state that can either be correlated or uncorrelated with $(U(\theta)U_x)\rho(U(\theta)U_x)^\dagger$. Without confusion, we set $\tilde{p} = 1 - (1 - p_1)^{L_Q}$.
We now employ the simplified model, i.e., the right hand side of Eqn. (71), to establish the relation between the estimated gradients $\nabla_j \bar{\mathcal{L}}_i(\theta^{(t)})$ and the analytic gradients $\nabla_j \mathcal{L}_i(\theta^{(t)})$. Recall that ∇ jLi (θ (t) ) = (Ȳ (t,± j ) k /K refer to the sample means when feeding $\theta^{(t)}$ and $\theta^{(t,\pm j)}$ into the trainable circuit. As with the depolarization channel, the sample mean $\bar{Y}_i^{(t)}$ or $\bar{Y}_i^{(t,\pm j)}$ is a random variable that follows a certain distribution. In particular, following the notations used in Theorem 5, the mean and variance ofȲ . By expanding the sample means using their explicit forms as shown above, we obtain the relation between the estimated and analytic gradients, i.e., ∇ jLi (θ (t) ) = (1 −p) 2 ∇ j L i (θ (t) ) + C (i,t) where ς t,j i = C − Y i )(Tr(Πκ (t,+ j ) ) − Tr(Πκ (t,− j ) )) + p 2 Tr(Πκ (t) ) + . We next use the relation between the estimated and analytic gradients to separately quantify the utility bounds $R_1$ and $R_2$ of QNN under the noisy channel $\mathcal{E}_{p_1}$ setting.
Utility bound $R_1$. As with Eqn. (41), with taking expectation over $\xi_i^{(t)}$ and $\xi_i^{(t,j)}$, we obtain where the inequality employs E[ξ where the first and second inequalities use C In conjunction with the above two equations, we achieve where the inequality uses $C^{(i,t)}_{j,1} \leq 5 + 3(1 - (1 - \tilde{p})^2)\lambda\pi$. After rewriting and taking induction, we have With setting $T \to \infty$, we achieve the utility bound $R_1$, i.e.,
Utility bound $R_2$. Combining Eqn. (75) and the PL condition, we obtain After rewriting and induction, we have , the utility bound is

I.2 Differential privacy of QNN
Analogous to QNN with depolarization noise, the DP property of QNN perturbed by the channel $\mathcal{E}_{p_1}$ is determined by the DP property of the mechanism $\mathcal{M}(\theta_j, z)$ to output $\bar{Y}_i^{(t)}$, supported by the composition property as shown in Proposition 1.
As discussed in Subsection I.1, the distribution of the sample mean $\bar{Y}_i^{(t)}$ is similar to the depolarization case, where the only difference is that the values of the mean and variance of the random variable are varied. In other words, the mechanism $\mathcal{M}(\theta_j, z)$ used in QNN perturbed by the channel $\mathcal{E}_{p_1}$ is also DP. This observation promises that QNN perturbed by the channel $\mathcal{E}_{p_1}$ is a DP learning model.

I.3 Learnability of QNN
The generalization of Theorem 2. Owing to the DP property of QNN as discussed above, we can effectively generalize the result of Theorem 2 to the noisy channel $\mathcal{E}_{p_1}$ setting, i.e., if QNN with noiseless gates PAC learns a concept, then QNN perturbed by the noisy channel $\mathcal{E}_{p_1}$ can also learn such a concept using polynomial samples.
The generalization of Theorem 3. Analogous to the depolarization noise setting, the distance between the target result $\nu = \mathrm{Tr}(M (U(\theta)U_x)\rho(U(\theta)U_x)^\dagger)$ and the shifted expectation value $\tilde{\nu} = (1 - \tilde{p})\nu + p_2\,\mathrm{Tr}(M\kappa) + p_3^{L_Q}\,\mathrm{Tr}(M)/D$ of QNN under the noisy channel $\mathcal{E}_{p_1}$ follows $|\nu - \tilde{\nu}| \leq \tilde{p}\nu + p_2 + p_3^{L_Q}/D$. Then by employing the Chernoff-Hoeffding bound, we achieve, with probability at least $1 - 2\exp(-\delta^2 n/2)$, , with setting $\delta = 2(\tau - \tilde{p}\nu - p_2 - p_3^{L_Q}/D)$, the relation between the number of measurements K and the failure probability b obeys After simplification, we conclude that, when $\tilde{p} \leq$ , with success probability at least 1 − b, the required number of measurements to attain 1