Quasi-Equivalence of Width and Depth of Neural Networks

While classic studies proved that wide networks allow universal approximation, recent research and the successes of deep learning demonstrate the power of deep networks. Based on symmetry considerations, we investigate whether the design of artificial neural networks should have a directional preference, and what the mechanism of interaction is between the width and depth of a network. Inspired by the De Morgan law, we address this fundamental question by establishing a quasi-equivalence between the width and depth of ReLU networks in two aspects. First, we formulate two transforms for mapping an arbitrary ReLU network to a wide network and a deep network, respectively, for either regression or classification, so that essentially the same capability of the original network can be implemented. Then, we replace the mainstream artificial neuron type with a quadratic counterpart, and utilize the factorization and continued fraction representations of the same polynomial function to construct a wide network and a deep network, respectively. Based on our findings, a deep network has a wide equivalent, and vice versa, subject to an arbitrarily small error.


Introduction
Over the past years, deep learning (Fan et al., 2018a) has become the mainstream approach of machine learning and achieved state-of-the-art performance in many important tasks (Dahl et al., 2011; Kumar et al., 2016; Wang, 2016). One of the key reasons for the success of deep learning is the increased depth, which allows a hierarchical representation of features. A number of papers have been dedicated to explaining why deep networks are better than shallow ones, and encouraging progress has been made along this direction. The basic idea in showing the superiority of deep networks is to find a special family of functions that are very hard to approximate with a shallow network but easy to approximate with a deep network, or to show that a deep network can express complicated functions that a wide network cannot (Szymanski and McCane, 2014; Cohen et al., 2016; Mhaskar and Poggio, 2016; Eldan and Shamir, 2016; Montufar et al., 2014; Bianchini and Scarselli, 2014). For example, in Eldan and Shamir (2016), a special class of radial functions was constructed so that a one-hidden-layer network needs an exponential number of neurons to obtain a good approximation, whereas a two-hidden-layer network only requires a polynomial number of neurons for the same purpose. With the number of linear regions as the complexity measure, Montufar et al. (2014) showed that the number of linear regions grows exponentially with the depth of a network but only polynomially with its width. In Bianchini and Scarselli (2014), a topological measure was utilized to characterize the complexity of functions, and it was shown that deep networks can represent more complex functions than their shallow counterparts can express.
Besides, width-bounded but depth-unbounded universal approximators were also developed (Lu et al., 2017;Lin and Jegelka, 2018;Fan et al., 2018c) in analogy to the depth-bounded but width-unbounded universal approximators (Funahashi, 1989;Hornik et al., 1989).
Recently, the effects of width have been discussed in more and more studies (Chen and Liu, 2017; Zagoruyko and Komodakis, 2016). Since width and depth are the most basic topological measures of a neural network, exploring their roles is a problem of strong interest and importance. Currently, there exist both width-bounded and depth-bounded universal approximators. Since both width-bounded and depth-bounded networks can represent any function, they can represent each other as well, which suggests the width-depth equivalence of neural networks. Nevertheless, how a neural network learns a mapping is quite different from the way used in proving universal approximation. Moreover, the core of the width-depth conversion is to employ a network to learn another network instead of an arbitrary function. Therefore, the width-depth conversion based on universal approximation falls short of capturing the relationship between width and depth.
Specifically, we argue that the width-depth conversion via universal approximation is simplistic, inefficient, and lacking in insight: 1) (Simplistic) As mentioned earlier, the way used in enabling universal approximation is to divide a target function into many functions over tiny hypercubes. In practice, a network usually does not do so; e.g., a ReLU network divides the space into polytopes. 2) (Inefficient) The published universal approximation analyses do not consider the characteristics of a target function, and just inefficiently divide the function into many pieces over tiny hypercubes. However, in the width-depth transformation, the problem is how to express a wide (or deep) network using a deep (or wide) network. Thus, the properties of networks should be used to make a much more efficient transformation.
3) (Lacking in insight) Universal approximation theoretically guarantees the representation capability of neural networks, but it gives no new insight into the difference in consumed computational resources, nor a practical procedure for the transformation from a wide (or deep) network to a deep (or wide) counterpart. In summary, utilizing universal approximation theorems for the width-depth transformation does not accurately characterize the width-depth conversion. We still have not fully resolved essential questions on width and depth, e.g., how to non-trivially transform between a wide and a deep network, and what the mechanism of interaction is between the width and depth of a network.
To fill this gap, inspired by the De Morgan law, here we demonstrate from two perspectives that the width and depth of neural networks are quasi-equivalent. The first perspective leverages the fact that a ReLU network is a piecewise linear function over polytopes, while the second perspective utilizes the nested structure of deep networks and the parallel structure of wide networks. Specifically, in the first perspective, we revisit the De Morgan law:

A_1 ∨ A_2 ∨ · · · ∨ A_n = ¬((¬A_1) ∧ (¬A_2) ∧ · · · ∧ (¬A_n)),    (1)

where A_i is a propositional rule (e.g., IF input ∈ [a_i, b_i]^m, THEN input belongs to some class), and such rules are disjoint. A neural network can be linked to a rule-based system such as a collection of propositional rules. Straightforwardly, we can construct either a deep network to realize a union of propositional rules (left side) or a wide network that realizes the complement of the intersection of the complemented rules (right side). As a result, the constructed deep and wide networks are equivalent to each other. Furthermore, we elaborate the quasi-equivalence of general regression and classification networks by constructing two transforms mapping an arbitrary ReLU network to a wide network and a deep network, respectively, thereby verifying a general quasi-equivalence of the width and depth of ReLU networks. Our constructive scheme is largely based on the fact that a ReLU network partitions the space into polytopes (Chu et al., 2018). This enables us to form a simplicial complex in the space and then to establish a quasi-equivalence of networks using the essential building blocks, fan-shaped (more generally, hyper-cone-shaped) functions, in the form of modularized ReLU networks. In the second perspective, we replace the mainstream artificial neuron type with a quadratic counterpart and extend our first perspective by utilizing the factorization and continued fraction representations of the same univariate polynomial to construct wide and deep networks, respectively.
Specifically, a univariate polynomial function can be expressed both as a product of quadratic factors, a ∏_{j=1}^{N} (r_j x² + s_j x + t_j) with a ≠ 0, and as a continued fraction with coefficients b_l and c_l (Eq. (2)), where r_j, s_j, t_j and b_l, c_l are determined by the coefficients of the polynomial. Previously, inspired by neuronal diversity, our group designed the quadratic neuron (Fan et al., 2018a), which replaces the inner product in a conventional neuron with a quadratic function. Due to the merits of this idea, networks based on quadratic neurons have been increasingly studied and applied (Bu and Karpatne, 2021; Ji et al., 2021; Mantini and Shah, 2021; Xu et al., 2022). Here, we can construct a wide quadratic network and a deep quadratic network to implement the left side and the right side of Eq. (2), respectively. This establishes the equivalence between wide and deep quadratic networks. Finally, we generalize such an equivalence to a multivariate setting based on the Kolmogorov-Arnold theorem (Kolmogorov, 1956). Our main contribution is the establishment of the width-depth quasi-equivalence of neural networks. We summarize our main results on ReLU networks (the first perspective) and quadratic networks (the second perspective) in Tables 1 and 2, respectively. Specifically, Table 1 lists the width and depth of the wide and deep networks constructed in our first perspective, where the complexity measure M is the minimum number of simplices needed to cover the polytopes formed by a ReLU network. Table 2 shows the width and depth of the wide and deep quadratic networks constructed in our second perspective, where the complexity measures K_1 and K_2 are the degrees of polynomials that represent a function of interest. Clearly, given a complexity measure, the width of the constructed wide network is greater than its depth, while the depth of the constructed deep network is greater than its width. D is the input dimension, and K_1 and K_2 are the complexity measures of a function class.
[Table 2: Width and depth of the constructed wide and deep quadratic networks under the extension of the De Morgan law (Theorem 13); the recoverable entries include max{K_1, K_2}, log(K_1 K_2), and 2(2D + 1).]

To put our contributions in perspective, we would like to mention relevant studies. Jacot et al. (2018) proposed the theory of the neural tangent kernel (NTK), which provides a useful lens to understand a network when its width goes to infinity. Kawaguchi et al. (2019) analyzed the effect of width and depth on the quality of local minima, showing that the quality of local minima improves toward the global minima as depth and width become larger. Levine et al. (2020) revealed the width-depth interplay in a self-attention network. We discuss the width of neural networks as related to the NTK in Supplementary Information I. To the best of our knowledge, our study is the first work to reveal the width-depth quasi-equivalence of neural networks.

Preliminaries
For convenience, we use σ(x) = max{0, x} to denote the ReLU function. We mainly discuss ReLU networks in this work; thus, all networks in the rest of this paper are referred to as ReLU networks unless otherwise specified. Moreover, we focus on fully-connected ReLU networks.
Definition 1 A regression network is a network with continuous outputs, while a classification network is a network that produces categorical outputs (for example, {0, 1, · · · , 9} for digit recognition). The classification network is obtained by thresholding a ReLU network in the last layer. In this study, we investigate a classification network with binary labels without loss of generality.
Definition 2 (Width and depth of a feedforward network (Arora et al., 2016)) For any number of hidden layers k ∈ N and input and output dimensions w_0, w_{k+1} ∈ N, a R^{w_0} → R^{w_{k+1}} feedforward network is given by specifying a sequence of k natural numbers w_1, w_2, . . . , w_k representing the widths of the hidden layers, a set of k affine transformations T_i : R^{w_{i−1}} → R^{w_i} for i = 1, . . . , k, and a linear transformation T_{k+1} : R^{w_k} → R^{w_{k+1}} corresponding to the weights of the hidden layers. The function f : R^{w_0} → R^{w_{k+1}} computed or represented by this network is

f = T_{k+1} ∘ σ ∘ T_k ∘ σ ∘ · · · ∘ σ ∘ T_1,

where ∘ denotes function composition and σ is an activation function applied componentwise. The depth of a ReLU DNN is defined as k + 1. The width of a ReLU DNN is max{w_1, . . . , w_k}.
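As a concrete reading of Definition 2, the following minimal sketch (our own illustration in NumPy; all names are ours, not from the paper) computes f = T_{k+1} ∘ σ ∘ T_k ∘ · · · ∘ σ ∘ T_1 and reads off the width max{w_1, . . . , w_k} and depth k + 1:

```python
import numpy as np

def relu(z):
    # sigma(z) = max{0, z}, applied componentwise
    return np.maximum(0.0, z)

def feedforward(x, weights, biases):
    """Compute f = T_{k+1} o sigma o T_k o ... o sigma o T_1.

    weights/biases define k affine maps T_1..T_k (each followed by ReLU)
    plus the final linear map T_{k+1}."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(W @ h + b)          # sigma o T_i
    return weights[-1] @ h + biases[-1]   # T_{k+1}

def width_and_depth(weights):
    hidden = [W.shape[0] for W in weights[:-1]]   # w_1 .. w_k
    return max(hidden), len(hidden) + 1           # (width, depth = k + 1)
```

For example, hidden widths (3, 5) give width 5 and depth 3 regardless of the weight values.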
Definition 3 (Width and depth of a shortcut network) Given a shortcut network Π, we delete a minimum number of links such that the resultant network Π′ is a feedforward network without any isolated neuron. Then, we define the width and depth of Π as the width and depth of Π′, respectively.
Over the past several years, increasingly diversified network architectures, such as randomly wired networks (Xie et al., 2019), networks with stochastic structures (Deng et al., 2020), etc. are used as backbones for deep learning. Our definitions for width and depth are applicable to many unusual network configurations. Moreover, they are also natural extensions of the conventional width and depth definitions and can make sense for common networks.

Definition 4 (Simplicial complex) A simplicial complex S is a set of simplices satisfying: 1) every face of a simplex from S is also in S; 2) the non-empty intersection of any two simplices S_1, S_2 ∈ S is a face of both S_1 and S_2.
Proposition 1 Suppose that f (x) is a function represented by a ReLU network, then f is a piecewise linear function that splits the space into polytopes, where each polytope is convex and associated with a linear function.
Proof Inspired by the idea in (Chu et al., 2018), the proof here covers all ReLU networks, including networks using shortcuts. Let a vector C = {c_1, ..., c_N} denote the firing states of all neurons in the network, where N is the total number of neurons and c_i ∈ {0, 1}; c_i = 0 means that the i-th neuron is not fired, and vice versa. The firing state of every neuron is determined by the input x. We denote the set of inputs that share the same collective firing state C_h as the polytope P_h. For x ∈ P_h, the output of neuron i is a linear function, denoted as n^(i)(x), and the firing state of neuron i is controlled by a linear inequality (n^(i)(x) > 0 or n^(i)(x) ≤ 0). In total, C(x) = C_h is equivalent to a set of N linear inequality constraints, indicating that P_h is a convex polytope.
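The argument is easy to check numerically. In this illustrative sketch (a hypothetical one-hidden-layer network with names of our choosing), fixing the firing state turns each ReLU into a 0/1 diagonal matrix, so the network collapses to a single affine map on the corresponding polytope:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

rng = np.random.default_rng(1)
W1, b1 = rng.standard_normal((4, 2)), rng.standard_normal(4)
W2, b2 = rng.standard_normal((1, 4)), rng.standard_normal(1)

def net(x):
    return W2 @ relu(W1 @ x + b1) + b2

def pattern(x):
    # firing state C(x): which hidden neurons are active at x
    return (W1 @ x + b1 > 0).astype(float)

def affine_on_pattern(c):
    # With the firing state fixed, ReLU is the diagonal 0/1 matrix diag(c),
    # so the network collapses to one affine map A x + d on that polytope.
    D = np.diag(c)
    return W2 @ D @ W1, W2 @ D @ b1 + b2

x = np.array([0.3, -0.7])
A, d = affine_on_pattern(pattern(x))
assert np.allclose(net(x), A @ x + d)   # linear on the polytope containing x
```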
Definition 5 We define the complexity of the function represented by a ReLU network as the minimum number M of simplices needed to cover each and every polytope that supports the function of the ReLU network. Here, we elaborate why M is a good measure. Because a deep network with piecewise linear activation is a piecewise linear function, the number of linear regions (polytopes) has been intensively studied as a complexity measure of a neural network. For example, Montufar et al. (2014) and Serra et al. (2018) estimated the upper and lower bounds of the number of linear regions with respect to the number of neurons at each layer, and bounds for convolutional neural networks have also been computed. Park et al. (2021) proposed neural activation coding to maximize the number of linear regions to improve model performance. Despite these results, we find that there exists a problem with the number of linear regions as a complexity measure: simple and complex networks may realize the same number of regions for a given task. As shown in Figure 1, two networks divide the space into two linear regions to separate concentric rings, but one network uses three neurons to define a triangular domain, while the other uses six neurons to form a hexagon. In this example, according to the number of linear regions, the two networks have the same complexity, yet the six-neuron network is apparently more complex than the other. To address this problem, the complexity of a linear region should be taken into account as well. We argue that the number of simplices a linear region comprises indicates how complex the region is. Therefore, we propose the number of simplices as a legitimate complexity measure. In Figure 1, counting simplices, the complexities of the two networks are 2 and 4, respectively, which is a better characterization.
Let us estimate the lower bound of M. To this end, we take advantage of the lower bound of the number of polytopes. Bounds on the number of polytopes (N_p) formed by a feedforward ReLU network were estimated in (Montufar et al., 2014; Serra et al., 2018; Serra and Ramalingam, 2020). One result in (Montufar et al., 2014) states that, letting n_i, i = 1, · · · , L, be the number of neurons in the i-th layer and D the dimension of the input space, N_p is lower bounded by

(∏_{i=1}^{L−1} ⌊n_i/D⌋^D) Σ_{j=0}^{D} C(n_L, j).

The polytopes constructed therein are hypercubes. Because the minimum number of simplices that fill a D-dimensional hypercube grows with D, multiplying this number by the above bound yields a lower bound of M (Eq. (4)).

Definition 6 We define a wide network and a deep network as follows. Let us assume a function that can be sufficiently complex and yet can be represented by a network. As such a function becomes increasingly complex, the structure of the representing network must also become increasingly complex, depending on the complexity of the function. We call a network wide if its width is larger than its depth by at least an order of magnitude in M, e.g., width/depth ≥ M^α for some α > 0, where M is the complexity measure. Similarly, we call a network deep if its depth is larger than its width by at least an order of magnitude in M.
It is underscored that we use two different concepts: the complexity of the function class represented by networks and the structural complexity of a network. The former measures the complexity of the function, while the latter measures the topological structure of a network. In our transformation scheme, the structures of constructs/networks are determined by the complexity of the function of interest.
Definition 7 We say a wide network N_1 : Ω → R is equivalent to a deep network N_2 : Ω → R if N_1(x) = N_2(x), ∀x ∈ Ω. We say a wide network N_1 is quasi-equivalent to a deep network N_2 if, for an arbitrarily small δ > 0, m({x ∈ Ω | N_1(x) ≠ N_2(x)}) < δ, where m is the Lebesgue measure defined on Ω.

Motivating Example
An important school of neural network interpretability research (Adadi and Berrada, 2018) is to extract interpretable rules from a network (Thrun, 1995; Setiono and Liu, 1995; Saad and II, 2007) using decompositional or pedagogical methods (Thrun, 1995). Pedagogical methods decode a set of rules that imitate the input-output relationship of a network, whereas these rules do not necessarily correspond to the parameters of the network. One common type of rule is propositional in the IF-THEN format, where the preconditions are provided as a set of hypercubes with respect to the input: IF input ∈ [a_i, b_i]^m, THEN input belongs to some class. Since there is a connection between rule-based inference and network-based inference, we consider a neural network in terms of propositional rules. Furthermore, we know that the De Morgan law holds true for disjoint propositional rules. Mathematically, the De Morgan law is formulated as

A_1 ∨ A_2 ∨ · · · ∨ A_n = ¬((¬A_1) ∧ (¬A_2) ∧ · · · ∧ (¬A_n)),    (5)

where A_i is a rule and ¬A_i is its negation. The De Morgan law gives a duality in the sense of binary logic: the operations ∨ and ∧ are dual, which means that for any propositional rule system described by A_1 ∨ A_2 ∨ · · · ∨ A_n, there exists an equivalent dual propositional rule system ¬((¬A_1) ∧ (¬A_2) ∧ · · · ∧ (¬A_n)). Regarding each rule as an indicator function over a hypercube, as shown in Figure 2, we construct a deep network that realizes a logic union of propositional rules (the left-hand side of Eq. (5)) and a wide network that realizes the negation of the logic intersection of the negated rules (the right-hand side of Eq. (5)). As a result, the constructed deep and wide networks are equivalent by the De Morgan law.

Figure 2: The width and depth equivalence in light of the De Morgan law duality. In this construction, a deep network implements A_1 ∨ A_2 ∨ · · · ∨ A_n using a trapezoid function, and a wide network implements ¬((¬A_1) ∧ (¬A_2) ∧ · · · ∧ (¬A_n)) using the trap-like function. (·)_+ denotes the ReLU activation.

The above motivating example inspires us to consider the width-depth equivalence in a broader domain. First, a ReLU network is a piecewise linear function over polytopes. To generate rules, such a piecewise linear function should be divided into simplices instead of hypercubes. As shown in Figure 3, we only need two rules if we build rules over simplices, which is much more efficient than building rules over hypercubes. Second, the network can be a regression network rather than a classification network; thus, representing a linear function rather than an indicator function is demanded. Based on these two considerations, we generalize an indicator function over a hypercube to a linear function over a simplex S_i in a bounded domain.

Figure 3: Building rules over simplices is more efficient than over hypercubes for a network.
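The duality behind this construction is easy to verify on toy rules. The sketch below (our own illustration; the two disjoint boxes are arbitrary) encodes two hypercube rules as indicators and checks that the union form and its De Morgan dual agree at random inputs:

```python
import numpy as np

# Disjoint propositional rules as hypercube indicators on R^2:
# A_i(x) = 1 iff x lies in the box [lo_i, hi_i].
boxes = [(np.array([0.0, 0.0]), np.array([1.0, 1.0])),
         (np.array([2.0, 2.0]), np.array([3.0, 3.0]))]

def rule(x, lo, hi):
    return bool(np.all((lo <= x) & (x <= hi)))

def union(x):
    # A_1 v A_2 v ... v A_n  (the "deep" side of the duality)
    return any(rule(x, lo, hi) for lo, hi in boxes)

def de_morgan_dual(x):
    # not((not A_1) ^ (not A_2) ^ ... ^ (not A_n))  (the "wide" side)
    return not all(not rule(x, lo, hi) for lo, hi in boxes)

# The two forms agree everywhere, as the De Morgan law guarantees.
for x in np.random.default_rng(2).uniform(-1, 4, size=(1000, 2)):
    assert union(x) == de_morgan_dual(x)
```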

Quasi-Equivalence of Width and Depth of Networks
This section describes the first contribution of our paper. We formulate the transformations from an arbitrary ReLU network to a wide network and a deep network, respectively. We use a network-based building block to represent a linear function over a simplex. Integrating such building blocks can represent any piecewise linear function over polytopes, thereby elaborating a general equivalence of the width and depth of networks. In particular, a regression ReLU network is converted into both a wide and a deep ReLU network (Theorem 9), while a classification ReLU network is a special case of a regression ReLU network (Theorem 10). For regression networks, the transformation of a univariate network is rather different from that of a multivariate network. As a result, the equivalence between the wide and deep networks in the univariate case is exact, whereas the multivariate wide and deep networks are made approximately equivalent up to an arbitrarily small error. Moreover, in the multivariate case, the width of the wide network is not the same as the depth of the deep network. This is why we term such an equivalence a quasi-equivalence.

Regression Networks
The sketch of transforming a regression ReLU network is that we first construct either a wide modular network or a deep modular network to represent the corresponding function over each and every simplex, then we aggregate the results into deep or wide networks in series or parallel, respectively, to represent the original network well.
Our main result is formally summarized as the following quasi-equivalence theorem for the multivariate case.

Theorem 9 Given a ReLU network N : [−B, B]^D → R and any δ > 0, there exist a wide ReLU network H_1 and a deep ReLU network H_2 such that

m({x ∈ [−B, B]^D | H_1(x) ≠ N(x)}) < δ and m({x ∈ [−B, B]^D | H_2(x) ≠ N(x)}) < δ,

where m(·) is the standard measure in [−B, B]^D.
We defer the proof of Theorem 8 to Supplementary Information II, and split the proof of Theorem 9 into the two-dimensional case (more intuitive) in the Appendix and the general case in Supplementary Information II for better readability. The key idea in representing a linear function over a simplex is to construct high-dimensional fan-shaped functions supported in fan-shaped domains, and to use these constructs to eliminate non-zero functional values outside the simplex of interest. This is a new and local way to represent a piecewise linear function over polytopes. In contrast, there are two global ways to represent piecewise linear functions (Wang and Sun, 2005; He et al., 2018). In (Wang and Sun, 2005), it is shown that for every piecewise linear function f : R^n → R, there exists a finite set of linear functions g_1, · · · , g_m and subsets T_1, · · · , T_P ⊆ [m] such that f = Σ_{p=1}^{P} s_p max_{i∈T_p} {g_i}, where s_p ∈ {−1, +1}, p = 1, · · · , P. A related global representation was established in (He et al., 2018). Nevertheless, due to the unboundedness and continuity, a global representation of a piecewise linear function is handicapped over polytopes that make up a non-convex region. Let us use Figure 4 to illustrate our point, where the relations of g_1, g_2, g_3 are summarized in Table 3, and the function value over the purple area is zero.
• Representation in Wang and Sun (2005); He et al. (2018): f = max{g_1, g_2, g_3}. It can be seen that, due to the unboundedness and continuity, f = max{g_1, g_2, g_3} is inaccurate over the purple area, but our representation is accurate over the purple area because it is local. We highlight the construction of fan-shaped functions, which opens a new door for high-dimensional piecewise function representation. In particular, the employment of fan-shaped functions will enable a neural network to express a manifold more effectively. Since such a fan-shaped function is a basic building block in our construction of wide and deep equivalent networks, let us explain it in a two-dimensional case for easy visualization. An essential building block expressed by a network in Figure 5(a) is based on the following function:

F(x) = σ(h_1(x) − µσ(h_2(x))),    (7)

where h_1(x) and h_2(x) are linear functions and µ is a positive controlling factor. Eq. (7) is a ReLU network of depth 2 and width 2 according to our width-depth definition. As illustrated in Figure 5(b), the support of F(x) contains three boundaries and four polytopes (two of which only allow zero values of F). For convenience, given a linear function ℓ(x) = c_1 x_1 + c_2 x_2 + c_3, we define ℓ_− = {x ∈ R² | ℓ(x) < 0} and ℓ_+ = {x ∈ R² | ℓ(x) ≥ 0}. There are three properties of F(x). First, the common line shared by Ω_1 and Ω_2 is h_2(x) = 0. Second, the size of Ω_2 is adjustable by controlling µ. Note that h_1(x) − µh_2(x) = 0 can move very close to h_2(x) = 0 as µ → ∞, which makes Ω_2 negligible. In the limiting case, the support of F(x) converges to the fan-shaped domain Ω_1. Because h_1(x) − µh_2(x) = 0 is almost parallel to h_2(x) = 0 when µ is big enough, we approximate the area of Ω_2 as the product of the length of h_2(x) = 0 within [−B, B]² and the distance between the two lines, which yields |Ω_2| ≤ 2√2 B/µ. Third, the function F over the fan-shaped area Ω_1 is h_1. Remark 1.
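The three properties can be checked numerically. The closed form F(x) = (h_1(x) − µ(h_2(x))_+)_+ used below is our reconstruction of the building block (the paper's exact layout of Eq. (7) may differ); with it, F equals h_1 on the fan-shaped domain and the spurious sliver Ω_2 shrinks as µ grows:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def fan(x, h1, h2, mu):
    # F(x) = (h1(x) - mu * (h2(x))_+)_+ : a depth-2, width-2 ReLU module
    return relu(h1(x) - mu * relu(h2(x)))

h1 = lambda x: x[..., 0] + 1.0   # illustrative choice: h1(x) = x1 + 1
h2 = lambda x: x[..., 1]         # illustrative choice: h2(x) = x2

pts = np.random.default_rng(3).uniform(-1, 1, size=(5000, 2))
inside = (h1(pts) > 0) & (h2(pts) <= 0)   # fan-shaped domain Omega_1
F = fan(pts, h1, h2, mu=1e4)

# Property: F coincides with h1 exactly on Omega_1 ...
assert np.allclose(F[inside], h1(pts)[inside])
# ... and the transition region Omega_2 vanishes as mu grows.
assert np.mean((F > 0) & ~inside) < 0.01
```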
As a ReLU network of interest partitions the space into more and more polytopes, the number of needed simplices will go increasingly larger. Because the lower bound of M in Eq. (4) is far larger than D or (D + 1)D 2 , the width of H 1 (x) and the depth of H 2 (x) will dominate. Furthermore, the width of H 1 (x) is higher than its depth by an order of magnitude in terms of M , and the depth of H 2 (x) is higher than its width in a similar way. Therefore, H 1 (x) is a wide network, and H 2 (x) is a deep network.
Remark 2. Compared to universal approximation, our construction, with the use of fan-shaped functions, is valuable in the following aspects: 1) Our construction utilizes the character of ReLU networks: it divides the space into finitely many simplices instead of infinitely many tiny hypercubes. 2) Given a target network, the complexity of our construction does not change with the prescribed error rate. In contrast, the complexity of the construction schemes used in universal approximation analyses increases as the preset error decreases. Therefore, our construction is much more efficient. 3) Our construction offers a new and local way to represent a piecewise linear function over polytopes, which is more flexible in representing discontinuous piecewise linear functions than the global ways (Wang and Sun, 2005; He et al., 2018). Furthermore, inspired by the network structure used to construct the proposed fan-shaped function, we find that intra-layer links can enhance the representation capability of a shallow network, which is closely relevant to the published results on "depth separation". Because intra-layer links greatly increase the number of pieces represented by a network, a shallow network with intra-layer links can express a complicated piecewise linear function as well as a deep network can. For more details on this new finding, please see Supplementary Information IX.

Classification Networks
A regression network gives a continuous output, while a classification network produces categorical outputs (for example, {0, 1, · · · , 9} for digit recognition). A classification network is derived by thresholding the output of the last layer of a ReLU network. In this study, we investigate a classification network with binary labels without loss of generality. We can directly build the equivalence for classification networks in the same way as for regression networks.
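As a minimal illustration (random weights and a zero threshold of our choosing), thresholding the scalar output of a regression ReLU network yields a binary classification network:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

rng = np.random.default_rng(5)
W1, b1 = rng.standard_normal((8, 2)), rng.standard_normal(8)
W2, b2 = rng.standard_normal(8), rng.standard_normal()

def regression_net(x):
    # continuous scalar output
    return W2 @ relu(W1 @ x + b1) + b2

def classification_net(x, threshold=0.0):
    # thresholding the last layer turns the regressor into a binary classifier
    return int(regression_net(x) > threshold)

labels = {classification_net(x)
          for x in rng.uniform(-1, 1, size=(200, 2))}
assert labels <= {0, 1}   # outputs are categorical
```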
Theorem 10 Given a classification ReLU network C : [−B, B]^D → {0, 1} and any δ > 0, there exist a wide network and a deep network whose thresholded outputs C′ satisfy m({x ∈ [−B, B]^D | C′(x) ≠ C(x)}) < δ, where m(·) is the standard measure in [−B, B]^D.

Proof
The key is to regard the classification network as a special case of a regression network. Then, applying the construction techniques used in the proof of Theorem 9 yields, for any δ > 0, a wide network and a deep network whose thresholded outputs disagree with the original classifier only on a set of measure less than δ, which verifies Theorem 10.
A classification neural network can be interpreted as a disjoint rule-based system A 1 ∨ A 2 · · ·∨A n by splitting the representation of a neural network into many decision polytopes: IF (input ∈ certain polytope), THEN (input belongs to some class). Furthermore, each rule is a local function supported over a decision region.
Remark 3. The De Morgan equivalence in Figure 2 can be summarized as the case in which the rules are based on hypercubes, whereas Theorem 10 corresponds to the case in which the rules are based on simplices.
Quasi-Equivalence Extended to Networks of Quadratic Neurons

Preliminaries
Definition 11 Quadratic neurons (Fan et al., 2018a,b, 2020) integrate the n-variable input x with a quadratic function before the nonlinear activation:

q(x) = σ((w_r^T x + b_r)(w_g^T x + b_g) + w_b^T (x ⊙ x) + c),

where w_r, w_g, w_b are vectors of the same dimensionality as x, b_r, b_g, c are biases, and ⊙ is the Hadamard product. Networks using quadratic neurons are interesting in several ways: 1) Over the past years, the design of neural networks has focused on architectures, such as shortcut connections, the transformer structure, etc. Almost exclusively, the mainstream deep learning models are constructed with neurons of the same type, characterized by the inner product and nonlinear activation. Although artificial neural networks were invented via biomimicry, current artificial networks and biological neural systems are fundamentally different in terms of neuronal diversity and complexity. As we know, a biological neural system coordinates numerous types of neurons to support intellectual behaviors. To fill this gap, we believe that neuronal diversity should be taken into account in machine learning. 2) Due to the enhanced expressive power at the neuronal level, quadratic neurons have been used in real-world problems and achieved superior performance, such as in medical imaging (Fan et al., 2019), civil engineering (Ji et al., 2021), applied math (Bu and Karpatne, 2021), and so on. Given the utility of quadratic neurons, research on networks with quadratic neurons is attractive.
Hereafter, networks made of quadratic neurons are referred to as quadratic networks. We refer to neurons using the inner product as conventional neurons, and the corresponding networks are called conventional networks. Please note that quadratic neurons are distinct from conventional neurons with a quadratic activation: the decision boundary of the latter is still linear. The key characteristic of quadratic neurons is the employment of a quadratic function in place of the inner product, while the choice of activation functions remains flexible.
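A quadratic neuron can be sketched directly from the parameters listed in Definition 11 (w_r, w_g, w_b, b_r, b_g, c, with ⊙ the Hadamard product); the exact assembly below is our reconstruction, and the final check is our own sanity test, not a claim from the paper:

```python
import numpy as np

def quadratic_neuron(x, wr, br, wg, bg, wb, c,
                     act=lambda z: np.maximum(0.0, z)):
    """Quadratic neuron: the inner product of a conventional neuron is
    replaced by a quadratic form before the (flexible) activation."""
    return act((wr @ x + br) * (wg @ x + bg) + wb @ (x * x) + c)

# Sanity check: with wg = 0, bg = 1, wb = 0, c = 0, the quadratic form
# collapses to wr @ x + br, i.e., a conventional ReLU neuron.
x = np.array([1.0, -2.0])
wr = np.array([0.5, 0.3])
q = quadratic_neuron(x, wr, 0.1, np.zeros(2), 1.0, np.zeros(2), 0.0)
assert np.isclose(q, np.maximum(0.0, wr @ x + 0.1))
```

This reduction shows that quadratic neurons strictly generalize conventional ones.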
Proposition 2 Any univariate polynomial of degree N can be perfectly computed by a quadratic ReLU network with the depth of log 2 (N ) and the width of N .
Proof Any univariate polynomial of degree N can be factorized into quadratic (and possibly one linear) factors with real coefficients, ∏_j (r_j x² + s_j x + t_j). Quadratic neurons can be assembled to perfectly express each factor r_j x² + s_j x + t_j in the first layer of a network, followed by half as many quadratic neurons in the second layer to pairwise multiply the outputs of the first layer, and so on and so forth. Consequently, a quadratic ReLU network with a depth of log_2(N) and a width of N can express any univariate polynomial of degree N.
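The wide construction in this proof can be simulated numerically. In the sketch below (our own illustration), the first layer evaluates all quadratic factors in parallel and each subsequent layer multiplies pairs, so 2^k factors need only k product layers; note that a pairwise product a·b is itself a quadratic form, hence realizable by one quadratic neuron:

```python
import numpy as np

rng = np.random.default_rng(4)
N = 8                                          # number of quadratic factors
factors = rng.uniform(0.5, 2.0, size=(N, 3))   # rows (r_j, s_j, t_j)

# Expand the product to get the polynomial's coefficients (degree 2N).
poly = np.array([1.0])
for r, s, t in factors:
    poly = np.polymul(poly, [r, s, t])

def tree_eval(x):
    """Evaluate prod_j (r_j x^2 + s_j x + t_j) by pairwise multiplication:
    layer 1 computes the N factors in parallel (width N); each later layer
    halves the count, giving log2(N) product layers -- the wide network."""
    vals = [r * x * x + s * x + t for r, s, t in factors]
    while len(vals) > 1:
        vals = [a * b for a, b in zip(vals[0::2], vals[1::2])]
    return vals[0]

for x in rng.uniform(-1, 1, size=5):
    assert np.isclose(tree_eval(x), np.polyval(poly, x))
```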
Proposition 3 (Stone-Weierstrass Theorem (De Branges, 1959)) Let f(x) be a continuous real-valued function over [a, b]. Then for any ε > 0, there exists a polynomial P(x) satisfying sup_{x∈[a,b]} |f(x) − P(x)| < ε.

Proposition 4 (Kolmogorov-Arnold representation theorem (Kolmogorov, 1956)) For any continuous function f(x_1, · · · , x_D) with D ≥ 2, there exists a group of continuous functions φ_{t,s} and Φ_t, where t = 0, 1, · · · , 2D and s = 1, 2, · · · , D, such that

f(x_1, · · · , x_D) = Σ_{t=0}^{2D} Φ_t(Σ_{s=1}^{D} φ_{t,s}(x_s)).

Among the above propositions, Proposition 2 shows that a quadratic ReLU network can express a univariate polynomial through factorization, which corresponds to the factorized (left) side of Eq. (2); at the same time, this quadratic network is wide. Proposition 3 suggests that a polynomial can approximate any continuous function. Proposition 4 informs us that a multivariate function can be expressed as a combination of univariate functions, which serves as a bridge to generalize our result from the approximation of univariate functions to the multivariate case.

Motivating Example
In contrast to the factorization representation, a general univariate polynomial also admits a continued fraction representation, given as Eq. (111). On the right side of Eq. (111), the first part contains the terms with even powers of $x$, while the second part contains those with odd powers. In Supplementary Information V, we prove the correctness of Eq. (111) in detail.

Figure 6: The width and depth equivalence for networks of quadratic neurons. In this construction, a deep network implements the continued fraction of a polynomial, while a wide network reflects the factorization of the polynomial.

Since both the factorization representation and the continued fraction representation express the same polynomial, we have the identity in Eq. (16). With suitable shorthand, we can express Eq. (16) concisely as Eq. (17), which can be loosely analogized to the De Morgan law by considering that multiplication and composition operations replace the $\vee$ and $\wedge$ operations, respectively, on either side of Eq. (17); the objects of these operations, however, are global functions instead of local rules. Eq. (17) is inspiring in the sense that its left side has a parallel computational structure, while its right side consists of two nested structures. Clearly, the left and right sides of Eq. (17) suggest a wide network and a deep network, respectively.
In Figure 6, a deep quadratic network is constructed to represent the right side of Eq. (17), while a wide quadratic network is constructed to represent the left side. In light of Eq. (17), these two quadratic networks are functionally equivalent to each other. In the case of the deep network, if the polynomial is of degree $2N$, then the depth of the deep network is $N$. The inter-layer relationships in the two branches are given recursively, where $y^{(1)}_l$ and $y^{(2)}_l$ are the outputs of the $l$-th layer in either branch; specially, $y_0 = 0$. In this deep network, the activation function takes the form of a reciprocal relation between two inputs, $z(x, y) = \frac{ax}{1 + ax - y}$, which can be well approximated by a ReLU function after a proper normalization operation. In Supplementary Information V, we demonstrate that the reciprocal activation works effectively in the deep network after some adjustments. As for constructing the wide network, we employ Proposition 2 directly: supposing that the polynomial is of degree $2N$, the resultant quadratic network has a width of $2N$ and a depth of only $\lceil\log_2(2N)\rceil$, where the width dominates.
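That two syntactically different representations of one polynomial must agree numerically is easy to check. In the sketch below, the factored form mirrors the parallel (wide) construction, while Horner's rule stands in for the sequential (deep) construction; the paper's actual deep branch uses the continued-fraction recursion, and the polynomial chosen here is an arbitrary illustration.

```python
import numpy as np

# hypothetical p(x) = x^4 - 2x^3 + 0.5x^2 + 3x - 1
coeffs = [1.0, -2.0, 0.5, 3.0, -1.0]
x = 1.3

# "Wide" view: multiply the factors of p in parallel.
roots = np.roots(coeffs)
wide = coeffs[0] * np.prod(x - roots)

# "Deep" view: Horner nesting, one composition step per "layer"
# (a stand-in for the continued-fraction recursion).
deep = 0.0
for c in coeffs:
    deep = deep * x + c

assert abs(wide - deep) < 1e-9
```

Both views cost the same number of multiplications; what differs is whether they are arranged in parallel (width) or in series (depth).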
Despite its novelty, the above equivalence only applies to univariate polynomials. To make a broader impact, we further demonstrate that, in terms of representing a multivariate function, the width and the depth of a neural network are also equivalent to each other.

Width-Depth Quasi-Equivalence
This section delineates the second contribution of our paper. Particularly, we first show that a combination of univariate polynomials can approximate any continuous multivariate function. Then, we leverage the constructed wide and deep quadratic networks for univariate polynomials to represent a general continuous multivariate function. As a result, the resultant wide and deep networks offer the same functionality.
Lemma 12 For any continuous function $f : [0,1]^D \to \mathbb{R}$ and any $\delta > 0$, there exists a function of the form $g(x_1, \cdots, x_D) = \sum_{t=0}^{2D} P_t\big(\sum_{s=1}^{D} P_{t,s}(x_s)\big)$, where $P_{t,s}$ and $P_t$ are univariate polynomials, such that $\sup_{x \in [0,1]^D} |f(x) - g(x)| < \delta$.

Proof First, we apply the famous Kolmogorov-Arnold theorem (Proposition 4) to express $f$ as $f = \sum_{t=0}^{2D} \Phi_t\big(\sum_{s=1}^{D} \phi_{t,s}(x_s)\big)$, where $\phi_{t,s}(x_s)$ and $\Phi_t$ are continuous. According to Proposition 3, for every function $\phi_{t,s}(x_s)$ and any arbitrarily small quantity $\epsilon_{t,s} > 0$, there exists a polynomial $P_{t,s}(x_s)$ such that $\sup |\phi_{t,s}(x_s) - P_{t,s}(x_s)| < \epsilon_{t,s}$. Combining $\phi_{t,1}(x_1), \phi_{t,2}(x_2), \ldots, \phi_{t,D}(x_D)$ and applying the triangle inequality, the error of the inner sum is at most $\sum_{s=1}^{D} \epsilon_{t,s}$. Because $\Phi_t$ is a continuous function, we can invoke continuity: we choose sufficiently small $\epsilon_{t,s}$, $s = 1, \ldots, D$, such that for every $\Phi_t$ the induced output error is as small as desired. With respect to the continuous function $\Phi_t$, we can again find a polynomial $P_t$ approximating it. Next, applying the triangle inequality once more and, finally, summing over $\Phi_t$, $t = 0, \ldots, 2D$, we immediately obtain the claimed bound, which concludes the proof.
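The structure asserted by Lemma 12 can be made concrete for a simple target. The product $x_1 x_2$ is exactly a sum of outer univariate polynomials applied to inner sums of univariate terms; this is only a toy instance of the required form, not the full Kolmogorov-Arnold construction.

```python
# x1*x2 = (x1 + x2)^2 / 4 - (x1 - x2)^2 / 4:
# outer univariate maps (playing the role of the P_t) applied to
# inner sums of univariate terms (playing the role of the P_{t,s}).
def f_composed(x1, x2):
    outer_plus = lambda u: u * u / 4.0
    outer_minus = lambda u: -u * u / 4.0
    inner_a = x1 + x2
    inner_b = x1 + (-x2)
    return outer_plus(inner_a) + outer_minus(inner_b)

assert abs(f_composed(0.3, -1.7) - 0.3 * (-1.7)) < 1e-12
```

Here the representation is exact; for a general continuous $f$, the lemma only guarantees a $\delta$-approximation.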
Theorem 13 (Quasi-Equivalence by the Extension of the De Morgan Law) Given a continuous function $h : [0,1]^D \to \mathbb{R}$, for any $\delta > 0$, there exists a function expressed as $\sum_{t=0}^{2D} P_t\big(\sum_{s=1}^{D} P_{t,s}(x_s)\big)$, where $P_t$ and $P_{t,s}$ are polynomials of degrees $\deg(P_t)$ and $\deg(P_{t,s})$. Correspondingly, with $K_1$ and $K_2$ determined by these degrees, there exist a wide quadratic network $Q_1$ of width $\max\{K_1, K_2\}$ and depth $\lceil\log_2(K_1 K_2)\rceil$ and a deep quadratic network $Q_2$ of width $2(2D+1)$ and depth $K_1 + K_2$, satisfying the stated approximation bound.

Proof To prototype the wide network, we can use the wide quadratic sub-network scheme in Figure 6 to express $P_t$ and $P_{t,s}$, whose [width, depth] are $[\deg(P_t), \lceil\log_2(\deg(P_t))\rceil]$ and $[\deg(P_{t,s}), \lceil\log_2(\deg(P_{t,s}))\rceil]$, respectively. A straightforward combination of these wide sub-networks can express $\sum_{t=0}^{2D} P_t\big(\sum_{s=1}^{D} P_{t,s}(x_s)\big)$; thus, the width of the construction is the summation of the widths of all the sub-networks. For the deep network, we use the deep quadratic sub-network shown in Figure 6 to express $P_t$ and $P_{t,s}$, whose [width, depth] are $[2, \deg(P_t)]$ and $[2, \deg(P_{t,s})]$, respectively. Similarly, by integrating these deep sub-networks, we obtain a deep network that expresses $\sum_{t=0}^{2D} P_t\big(\sum_{s=1}^{D} P_{t,s}(x_s)\big)$. As a result, the width of the derived network is $2(2D+1)$, and the depth is the summation of the depths of the sub-networks.

In Supplementary Information IV, we use the quasi-equivalence relationship to construct wide and deep quadratic network variants for the same task, verify their validity on the MNIST dataset, and show empirical hints that a wide network might have enhanced robustness, since adversarial samples have much less room to perturb latent features.

Discussions
Gains of the Equivalence Notion. Given the importance of width and depth, a width-depth equivalence theory should have a major impact on deep learning. 1) Some commonly-used deep learning theories, such as the neural tangent kernel (NTK) and the neural network Gaussian process (NNGP), were developed assuming an infinite width. Naturally, in the era of deep learning we are more concerned with how depth affects the behavior of a neural network; therefore, depth-oriented theories would be highly desirable. Our work suggests that depth-oriented NTK and NNGP theories might be feasible since width and depth are equivalent. 2) Our work suggests a significant potential of wide networks in practical applications. Currently, wide networks are less popular than deep networks due to their sub-optimal performance. However, the reason why wide networks do not work as well might be related to the training strategies instead of the model capacities, since width and depth are equivalent in representation power. If wide networks can be trained to match the performance of deep networks, they offer an alternative path to developing neural networks.
Equivalent Networks. In a broader sense, our quasi-equivalence studies demonstrate the existence of mutually equivalent networks. We argue that network equivalence is useful in network design. Although deep networks manifest superb power, their applications can be constrained, for example, when the application is time-critical. In that case, we can convert the deep network to a wide counterpart that can be executed at high speed. A direction for network design optimization is to derive a compact network that maintains the high performance of an original large network through quantization (Wu et al., 2016), pruning (Li et al., 2016), distillation (Polino et al., 2018), low-rank approximation (Zhang et al., 2015), etc. We envision that the equivalence of a deep network and a wide network suggests a new means of network design. Ideally, a wide network can replace a well-trained deep network without compromising performance. Due to its parallel nature, the wide network can be trained on a computing cluster with many machines for fast training.
At the same time, the inference time of the equivalent wide network is shorter than its deep counterpart.
Width-Depth Correlation. Every continuous $n$-variable function $f$ on $[0,1]^n$ can be represented, in the $L_1$ sense, by partially separable multivariate functions (Light and Cheney, 2006): $\big\| f - \sum_{l=1}^{L} \prod_{i=1}^{n} \phi_{li}(x_i) \big\|_{L_1} < \epsilon$, where $\epsilon$ is an arbitrarily small positive number, $\phi_{li}$ is a continuous function, and $L$ is the number of products. In Supplementary Information V and VI, we justify the suitability of this partially separable representation by showing its boundedness and by comparing it with other representations. Further, we correlate the width and depth of a network with the structure of the function to be approximated. In a nutshell, each continuous function $\phi_{li}$ can be approximated by a polynomial of some degree, which can be appropriately represented by quadratic neurons. As a consequence, via a quadratic representation scheme, the width and depth of a network structure must reflect the complexity of $\sum_{l=1}^{L} \prod_{i=1}^{n} \phi_{li}(x_i)$. In other words, they are controlled by the nature of a specific task: as the task becomes complicated, the width and depth must increase accordingly, and the combination of width and depth is not unique. For more details, please see Supplementary Information VII.
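A minimal instance of the partially separable form is worth writing out: $\sin(x_1 + x_2)$ is exactly $L = 2$ products of univariate factors, via the angle-addition identity. The sketch below checks this numerically at an arbitrary point.

```python
import numpy as np

# sin(x1 + x2) = sin(x1)cos(x2) + cos(x1)sin(x2):
# two products (L = 2) of univariate factors phi_{l,i}(x_i).
x1, x2 = 0.4, 1.1
separable = np.sin(x1) * np.cos(x2) + np.cos(x1) * np.sin(x2)
assert abs(separable - np.sin(x1 + x2)) < 1e-12
```

For general continuous functions the representation holds only approximately in $L_1$, with $L$ growing as the accuracy requirement tightens.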
Effects of Width on Optimization and Generalization. In Supplementary Information VIII, we illustrate the importance of width for optimization in the context of over-parameterization, kernel ridge regression, and the NTK, and then report our finding that existing generalization bounds also shed light on the relationship between width and depth given a fixed complexity.

Conclusion
Inspired by the De Morgan law and through a systematic analysis, we have established the quasi-equivalence between the depth and width of ReLU neural networks from two perspectives. In the first perspective, we have formulated two transforms for mapping an arbitrary regression/classification ReLU network to a wide ReLU network and a deep ReLU network, respectively. In the second perspective, we have extended our quasi-equivalence results from ReLU networks of popular artificial neurons to those of quadratic neurons. This quasi-equivalence represents a step forward in developing a unified deep learning theory. More efforts are needed in the future to refine this quasi-equivalence relationship and find real-world applications.

Appendix A. Proof of Theorem 9 (2D)
Here, we show the correctness of Theorem 9 in the 2D case. Regarding the transformation of an arbitrary multivariate network, the situation is more complicated than in the case of univariate networks. Nevertheless, we are able to establish the δ-equivalence, which is a slightly relaxed result.
The sketch of the proof: a ReLU network is a piecewise linear function over polytopes, which can be decomposed into a summation of linear functions over simplices. Lemma 14 shows that a wider network module $N_1(x)$ and a deeper network module $N_2(x)$ can each represent an arbitrary linear function over a simplex. Next, in Theorem 9, to transform an arbitrary ReLU network into a wide and a deep network, we horizontally aggregate network modules $N_1(x)$ to obtain a wide network, and we use shortcuts to sequentially build a deep network from modules $N_2(x)$.

It is clear that a one-to-one affine mapping between an arbitrary simplex $S$ and the template simplex $\Delta$ exists, given by Eq. (29). Therefore, we only need to make the construction in the special case $S = \Delta$ to simplify our analysis; the coordinate transform in Eq. (29) conveniently maps the construction from $\Delta$ to $S$. Given a linear function $\ell(x) = c_1 x_1 + c_2 x_2 + c_3$, we write $\ell^- = \{x \in \mathbb{R}^2 \mid \ell(x) < 0\}$ and $\ell^+ = \{x \in \mathbb{R}^2 \mid \ell(x) \ge 0\}$. $\Delta$ is enclosed by the three lines given by $\ell_1(x) = x_1$, $\ell_2(x) = x_2$, and $\ell_3(x) = -x_1 - x_2 + 1$. We write the three vertices of $\Delta$ as $v_0 = (0,0)$, $v_1 = (1,0)$, $v_2 = (0,1)$. Then, $f : [-B, B]^2 \to \mathbb{R}$ supported on $\Delta$ is expressed as in Eq. (30).

Lemma 14 Suppose that $f$ is a linear function supported on a simplex, as in Eq. (30). For any $\delta > 0$, there exist a ReLU network $N_1$ of width $D(D+1)(2^D - 1) + 2$ and depth $D$, and also a ReLU network $N_2$ of width $(D+1)D^2$ and depth $D+1$, each representing $f$ up to an error region of measure less than $\delta$.

Proof (D = 2) Our goal is to approximate the given piecewise linear function $f$ over $\Delta$; therefore, we need to cancel $f$ outside its domain. We first index the polytopes separated by the three lines $\ell_1(x) = 0$, $\ell_2(x) = 0$, and $\ell_3(x) = 0$ as $A^{(\chi_1, \chi_2, \chi_3)}$. The discontinuity of $f$ in Eq. (30) is a major challenge in representing the function with a ReLU network. To tackle this issue, we start from a linear function $\tilde{f}(x) = a^\top x + b$, $\forall x \in \mathbb{R}^2$, which can be represented by two neurons as $\sigma \circ \tilde{f} - \sigma \circ (-\tilde{f})$. The key idea is then to eliminate $\tilde{f}$ over all polytopes outside $\Delta$.
In other words, $\tilde{f}$ over the three fan-shaped polytopes $A^{(\vee,-,+)}$, $A^{(-,+,\vee)}$, and $A^{(+,\vee,-)}$ should be cancelled.
Let us take the polytope $A^{(+,\vee,-)}$ as an example. Note that $A^{(+,\vee,-)}$ has two boundaries, $\ell_1(x) = 0$ and $\ell_3(x) = 0$, as illustrated in Figure 7(b). We choose a sufficiently large positive number $\mu$ to construct the three fan-shaped functions, where the positive number $\eta$ is chosen small enough that the lines $x_1 - \eta x_2 = 0$ and $x_1 - \eta = 0$ are very close to $x_1 = 0$, so that the measure of the residual error region is arbitrarily small.
Therefore, for any $\delta > 0$, as long as we choose $\eta$ and $\mu$ appropriately, the constructed network $N_1$ will achieve the desired accuracy.

Constructing $N_2(x)$: Allowing more layers in a network provides an alternative way to represent $f$. Let $F(x_1, x_2) = \sigma\circ(x_1 - \mu\,\sigma\circ(-x_2))$ and $F'(x_1, x_2) = \sigma\circ(x_1 - \nu x_2 - \mu\,\sigma\circ(-x_2))$, both of which are approximately enclosed by the boundaries $x_1 = 0$ and $x_2 = 0$. Therefore, the fan-shaped regions of $F(x_1, x_2)$ and $F'(x_1, x_2)$ almost overlap when $\nu$ is small. The negative sign for $x_2$ ensures that the fan-shaped region is $\Delta$. To obtain the third boundary $\ell_3(x) = 0$ for building the simplex $\Delta$, we stack one more layer with only one neuron that cuts the fan-shaped region of $F(x_1, x_2)$ with the boundary $-x_1 - x_2 + 1 = 0$, where $(\gamma_1^*, \gamma_2^*, \gamma_3^*)$ are the roots of the associated system of equations. Thus, $E_1(x_1, x_2)$ represents the function $-x_1 - x_2 + 1$ over $\Delta$ and zero elsewhere. The depth and width of $E_1(x_1, x_2)$ are 3 and 4, respectively. Due to the employment of the two fan-shaped functions and the effect of $\nu$, the area of the region with errors is estimated as $2\sqrt{2}\,B(\nu + 2/\mu)$.
To acquire $f$ over $\Delta$, we similarly need three linearly independent functions as bases. We modify $\ell_3$ slightly to get $\ell_3' = \ell_3 - \tau x_1$ and $\ell_3'' = \ell_3 - \tau' x_2$. Repeating the procedure described above, for $\ell_3'$ we construct the network $E_2(x_1, x_2)$ that equals $\ell_3 - \tau x_1$ over $\ell_1^+ \cap \ell_2^+ \cap (\ell_3')^+$, while for $\ell_3''$ we construct the network $E_3(x_1, x_2)$ that equals $\ell_3 - \tau' x_2$ over $\ell_1^+ \cap \ell_2^+ \cap (\ell_3'')^+$. We set the positive numbers $\tau$ and $\tau'$ small enough that the two triangular domains $\ell_1^+ \cap \ell_2^+ \cap (\ell_3')^+$ and $\ell_1^+ \cap \ell_2^+ \cap (\ell_3'')^+$ are almost identical to $\Delta$. In addition, we let $\tau$ and $\tau'$ satisfy a linear system whose solutions are $\rho_1^*, \rho_2^*, \rho_3^*$. As a consequence, the deep network (illustrated in Figure 8(c)) produces $f$ on $\Delta$. The depth and width of the network are 3 and 12, respectively. Similarly, the area of the region with errors is bounded above; therefore, for any $\delta > 0$, if we choose $\mu$, $\nu$, $\tau$, and $\tau'$ appropriately, then the constructed network $N_2$ will satisfy the error bound.

Proof (Theorem 9, D = 2) The network $h$ is piecewise linear and splits the space into polytopes. It is feasible to employ a number of simplices to represent the polytopes defined by $h$ (Ehrenborg, 2007). Given that $M$ is the minimum number of required simplices, we have $h(x) = \sum_{m=1}^{M} f^{(m)}(x)$, where $f^{(m)}$ is the linear function supported on the $m$-th simplex $S^{(m)}$. For the construction of a wide network, we use network modules to represent each $f^{(m)}(x)$ and then horizontally aggregate them into a wide network. In contrast, for the construction of a deep network, we sequentially express each $f^{(m)}(x)$ in terms of a network module without linking the modules to the input.

Representing $f$ with a wide ReLU network: Lemma 14 suggests that a wide network module $N_1$ can generically represent a function over a template simplex. To represent $f^{(m)}$ over $S^{(m)}$, we need to use Eq. (29) to transform the function from the barycentric coordinate system to the Euclidean coordinate system, with the three vertices of $S^{(m)}$ denoted $\{v_0^{(m)}, v_1^{(m)}, v_2^{(m)}\}$.

Figure 9: Illustration of deep networks in 1D and 2D cases.
By aggregating the network modules $N_1(x)$ horizontally, we obtain the wide network $H_1(x)$, which is of width $20M$, i.e., $O(M)$, and depth 2. It is clear that the width of the wide network $H_1(x)$ dominates, growing as the number of needed simplices becomes larger and larger.
Representing $f$ with a deep ReLU network: For a deep construction, the fundamental difficulty is how to sequentially express each $f^{(m)}$, i.e., the input of each block can only come from the previous block instead of the network input, as we did in the one-dimensional case (Figure 9(a)). Let us derive by induction how to sequentially represent each $f^{(m)}$. We still adopt the idea of modularized networks, but now each network module has two outputs. Lemma 14 suggests that a deep network module $N_2$ can generically represent a function over a template simplex.
Step 1. Assume that the two outputs of the first block are $N_2^{(1)}$ and $\Omega^{(1)}$. To represent $f^{(1)}$ over $S^{(1)}$, we similarly need to transform the function from the barycentric coordinate system to the Euclidean coordinate system, with the three vertices of $S^{(1)}$ denoted $\{v_0^{(1)}, v_1^{(1)}, v_2^{(1)}\}$; the resulting $N_2^{(1)}(x)$ is one output of the first block.
Next, we derive the other output $\Omega^{(1)}(x)$. Note that we are not allowed to use the input directly. Encouraged by the inversion idea in the univariate case, we invert the function domain into the input domain to get a function $\tilde{x}_{S^{(1)}}$ that is approximately supported only over $S^{(1)}$. To do so, recall that we have $E_1, E_2, E_3$ in constructing $N_2$; then there exist coefficients $(\xi_1^*, \xi_2^*, \xi_3^*)$ such that the required combination holds. As shown in Figure 9(b), using the residual connection, we compute $\Omega^{(1)}(x) = x - \tilde{x}_{S^{(1)}}$, which is zero over $S^{(1)}$ and $x$ over the other regions. $\Omega^{(1)}(x)$ will be fed into the next block to construct $f^{(2)}$.
Step 2. Suppose that the two output functions of the $m$-th block are $N_2^{(m)}(x)$ and $\Omega^{(m)}(x)$, where $\Omega^{(m)}(x)$ is fed into the $(m+1)$-th block as its input. Because $S^{(m+1)}$ is outside the domain $S^{(1)} \cup \cdots \cup S^{(m)}$, with the same technique as in Step 1, we construct $N_2^{(m+1)}(\Omega^{(m)}(x))$ from the vertices $\{v_0^{(m+1)}, v_1^{(m+1)}, v_2^{(m+1)}\}$ to express $f^{(m+1)}$ over $S^{(m+1)}$ well. $\Omega^{(m)}(x)$ is functionally equivalent to $x$ for two reasons. First, all simplices do not overlap: $\Omega^{(m)}(x)$ is $x$ outside the domain $S^{(1)} \cup \cdots \cup S^{(m)}$, so its zero value over $S^{(1)} \cup \cdots \cup S^{(m)}$ has no effect. Second, w.l.o.g., we assume that $(0,0)$ is outside of all simplices or lies on the boundary of a certain simplex. As such, $N_2^{(m+1)}(\Omega^{(m)}(x))$ will not erroneously produce a non-zero constant over $S^{(1)} \cup \cdots \cup S^{(m)}$. Furthermore, we also obtain a function $\tilde{x}_{S^{(m+1)}}$ inside the $(m+1)$-th block; applying the residual operation, we have $\Omega^{(m+1)}(x) = \Omega^{(m)}(x) - \tilde{x}_{S^{(m+1)}}$. Lastly, similar to the one-dimensional deep network, we use the shortcut connection to aggregate the terms $N_2^{(m)}(\Omega^{(m-1)}(x))$ and obtain the deep network $H_2(x)$, which is of depth $3M$, i.e., $O(M)$, and width 12. Please note that the summation is made by shortcuts. It is clear that the depth of $H_2(x)$ dominates over the width.

I. Width as Related to Neural Tangent Kernel (NTK)
NTK sheds light on the power of a network when its width goes to infinity. Previous results have suggested a link between neural networks and Gaussian processes as the network width grows infinitely, dating back to Neal (1996). Weakly-trained networks (Arora et al., 2019) refer to networks in which only the top layer is trained after the other layers are randomly initialized. Let $f(\theta, x) \in \mathbb{R}$ denote the output of the network on input $x$, where $\theta$ denotes the parameters of the network with an initialization over $\Theta$. In this context, training the top layer with the $\ell_2$ penalty reduces to kernel regression in the corresponding functional space. On the other hand, an infinitely wide network can be considered an over-parameterized network, with the motivation to explain why over-parameterized networks scale. There have been extensive studies on this topic (Du et al., 2018; Allen-Zhu et al., 2019; Cao and Gu, 2019). Notably, these studies indicate that training induces relatively small parameter changes when the network is sufficiently wide. Such a concentration behavior is justified by the NTK in terms of the Gaussian process. The NTK is defined as $k(x, x') = \langle \partial_\theta f(\theta, x), \partial_\theta f(\theta, x') \rangle$, the inner product of parameter gradients. Along this line, prior work compared the NTK of the ResNet and the deep chain-like network, finding that the NTK of the chain-like network converges to a non-informative kernel as the depth goes to infinity, while the counterpart of the ResNet does not; it was also shown that the width provably matters in networks using piecewise linear activation.
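The definition $k(x, x') = \langle \partial_\theta f(\theta, x), \partial_\theta f(\theta, x') \rangle$ can be evaluated empirically for a toy network. The sketch below (network size, initialization, and inputs are all hypothetical choices) computes parameter gradients by central finite differences and checks the basic kernel properties; it is the empirical NTK at one initialization, not the infinite-width limit.

```python
import numpy as np

rng = np.random.default_rng(0)
H = 16                                 # hidden width of a toy 1-input network
theta = rng.normal(size=3 * H)         # packs W1 (H), b1 (H), W2 (H)

def f(theta, x):
    """One-hidden-layer ReLU network with scalar input and output."""
    W1, b1, W2 = theta[:H], theta[H:2 * H], theta[2 * H:]
    return float(W2 @ np.maximum(W1 * x + b1, 0.0))

def param_grad(theta, x, eps=1e-5):
    """Central-difference gradient of f with respect to the parameters."""
    g = np.zeros_like(theta)
    for i in range(theta.size):
        tp, tm = theta.copy(), theta.copy()
        tp[i] += eps
        tm[i] -= eps
        g[i] = (f(tp, x) - f(tm, x)) / (2 * eps)
    return g

def ntk(x, xp):
    """Empirical NTK entry: inner product of parameter gradients."""
    return float(param_grad(theta, x) @ param_grad(theta, xp))

assert abs(ntk(0.3, -0.8) - ntk(-0.8, 0.3)) < 1e-6   # symmetry
assert ntk(0.5, 0.5) >= 0.0                          # nonnegative diagonal
```

In an autodiff framework the same kernel is obtained from exact Jacobians; finite differences are used here only to keep the sketch dependency-free.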

II. Transformation of an Arbitrary Network
Here, we provide the proof of Theorem 15 and an experiment that shows the feasibility of the procedures in the proof of Theorem 16. A similar derivation to the proof of Theorem 15 can also be found in Fan et al. (2018c). Let $f$ be a continuous piecewise linear function with breakpoints $x_0 < x_1 < \cdots < x_n$, where the slopes of neighboring pieces are different; otherwise, those pieces can be fused together.
We first construct a wide ReLU network $H_1(x)$ to represent $f$. This can be done straightforwardly as $H_1(x) = f(x_0) + \sum_{i} (w^{(i)} - w^{(i-1)})\,\sigma(x - x_i)$, where $w^{(i)}$ is the slope on $[x_i, x_{i+1})$ and specially $w^{(-1)} = 0$. This network is of width $n+1$. Now, we verify this claim: given $x \in [x_j, x_{j+1})$ for some $j \ge 0$, only the neurons with $i \le j$ are active, and their slope differences telescope to $w^{(j)}$, so $H_1$ agrees with $f$ piece by piece. Next, we construct an equivalent deep network dual to the above wide network. Note that Eq. (62) can be rewritten with sign factors, where $\mathrm{sgn} = -1$ for a negative slope change and $\mathrm{sgn} = 1$ for a positive one. If we write $R_i(x) = \sigma(|w^{(i)} - w^{(i-1)}|(x - x_i))$, $i = 0, \cdots, n-1$, then a recursive relation clearly holds among the $R_i$. Thanks to this recurrent relation, each $R_i$ can accurately represent a small monotonically increasing piece over $[x_i, x_{i+1}]$. We aggregate the outputs of those $n+1$ pieces in the output neuron to get the desired deep ReLU network $H_2(x)$, which computes the same function as $H_1(x)$. To construct $R_n(x)$, $n+1$ consecutive neurons are connected in series; therefore, such a one-neuron-wide network has $n+2$ layers. We plot the two types of neural networks $H_1(x)$ and $H_2(x)$ in Figure 10.
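The wide construction can be checked directly. The sketch below uses hypothetical breakpoints and slopes, the convention $w^{(-1)} = 0$, and an added constant $f(x_0)$ (an assumption of this sketch), and compares the ReLU sum against a piece-by-piece reference evaluation.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# hypothetical continuous piecewise linear target on [0, 3]
xs = [0.0, 1.0, 2.0]     # breakpoints x_0, x_1, x_2
ws = [1.0, -0.5, 2.0]    # slope w^(i) on [x_i, x_{i+1})
f0 = 0.3                 # f(x_0)

def H1(x):
    """Wide network: f(x_0) + sum_i (w^(i) - w^(i-1)) * relu(x - x_i)."""
    y, w_prev = f0, 0.0
    for xi, wi in zip(xs, ws):
        y += (wi - w_prev) * relu(x - xi)
        w_prev = wi
    return y

def f_true(x):
    """Reference: accumulate the slopes piece by piece."""
    y = f0
    for i, xi in enumerate(xs):
        hi = xs[i + 1] if i + 1 < len(xs) else float("inf")
        if x > xi:
            y += ws[i] * (min(x, hi) - xi)
    return y

for x in (0.5, 1.5, 2.5):
    assert abs(H1(x) - f_true(x)) < 1e-12
```

Each ReLU contributes the change of slope at its breakpoint, which is exactly the telescoping argument in the verification above.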
where m(·) is the standard measure in [−B, B] D .
The sketch of proof: A ReLU network is a piecewise linear function over polytopes, which can be decomposed into a summation of linear functions over a simplex. Lemma 17 shows that a wider network module N 1 (x) and a deeper network module N 2 (x) can represent an arbitrary linear function over a simplex. Next, in Theorem 16, to transform an arbitrary ReLU network into a wide and a deep network, we horizontally aggregate network modules N 1 (x) to have a wide network, and we use shortcuts to sequentially establish a deep network.
Constructing $N_1$: The discontinuity of $f$ in Eq. (69) is one of the major challenges of representing it with a ReLU network. To tackle this issue, we start from a linear function $\tilde{f}(x) = a^\top x + b$, $\forall x \in \mathbb{R}^D$, which can be represented by two neurons as $\sigma \circ \tilde{f} - \sigma \circ (-\tilde{f})$. The key idea is to eliminate $\tilde{f}$ over all $2^{D+1} - 2$ polytopes outside $S$ using the $D$-dimensional fan-shaped functions.
Here we let $x_{D+1} = 1$ for consistency; the negative signs ensure that the fan-shaped region covers the polytope $A^{(+,+,\vee,-,\cdots,-)}$, with $\eta_1 = 0$ and $\eta_k = \eta$, $k = 2, \ldots, D+1$, representing a small shift of $x_1 = 0$ so that the measure of the residual error region is small. We then find $\omega_1^*, \cdots, \omega_{D+1}^*$ by solving the associated linear system. The width and depth of the network are $D(D+1)(2^D - 1) + 2$ and $D$, respectively. In addition, because $2^D - 1$ polytopes are cancelled, the total area of the regions suffering from errors is controlled. Therefore, for any $\delta > 0$, as long as we choose appropriate $\mu$ and $\eta$, the constructed network $N_1(x)$ will meet the error bound.

Constructing $N_2$: Allowing more layers in a network provides an alternative way to represent $f$. The fan-shaped functions remain in use. The whole pipeline can be divided into two steps: (1) build a function over $S$; and (2) represent $f$ over $S$ by slightly moving one boundary of $S$ to create linearly independent bases.
Similarly, due to the employment of $D$ of the $D$-dimensional fan-shaped functions and the effect of $\nu_k$, the area of the region with errors is controlled. (2) To acquire an arbitrary linear function, we similarly need $D+1$ linearly independent functions as bases. Other than the one obtained in step (1), we further modify $\ell_{D+1}$ slightly, $D$ times, to get $\ell_{D+1}^l = \ell_{D+1} - \tau_l x_l$, $l = 1, \cdots, D$. Repeating the procedure described in step (1), for each $\ell_{D+1}^l$ we can construct a network $E_l(x)$ that equals $\ell_{D+1} - \tau_l x_l$ approximately over $\ell_1^+ \cap \ell_2^+ \cap \cdots \cap (\ell_{D+1}^l)^+$, where each $\tau_l$, $l = 1, \cdots, D$, is small enough to render these domains almost identical to $S$, and the $\tau_l$ satisfy a linear system whose roots are $\rho_1^*, \cdots, \rho_{D+1}^*$. As a result, the deep network built from these modules produces $f$ on $S$. The depth and width of $N_2(x)$ are $D+1$ and $D^2(D+1)$, respectively. Similarly, the area of the region with errors is bounded above; therefore, for any $\delta > 0$, if we choose $\mu$, $\nu_k$, and $\tau_l$ appropriately, then the constructed network $N_2(x)$ will satisfy the error bound.

Proof (Theorem 16, D ≥ 2) The network $h$ is piecewise linear and splits the space into polytopes. It is feasible to employ a number of simplices to fill the polytopes defined by $h$ (Ehrenborg, 2007). Given that $M$ is the minimum number of required simplices, we have $h(x) = \sum_{m=1}^{M} f^{(m)}(x)$, where $f^{(m)}$ is the linear function supported on the $m$-th simplex $S^{(m)}$. The core of the wide construction is to horizontally aggregate network modules representing $f^{(m)}(x)$ into a wide network. In contrast, the core of the deep construction is to sequentially express $f^{(m)}(x)$ in a network module without linking these modules to the input.
Representing $h$ with a wide ReLU network: Lemma 17 suggests that a wide network module $N_1$ can generically represent a function over a template simplex. To represent $f^{(m)}$ over $S^{(m)}$, we need to use Eq. (68) to transform the function from the barycentric coordinate system to the Euclidean coordinate system, with the $D+1$ vertices of $S^{(m)}$ denoted $\{v_0^{(m)}, \cdots, v_D^{(m)}\}$. We then aggregate the modules $N_1^{(m)}$ horizontally into a wide network.

Representing $h$ with a deep ReLU network: For a deep construction, the fundamental difficulty is how to sequentially express each $f^{(m)}$, i.e., the input of each block can only come from the previous block instead of the network input, as we did in the one-dimensional case (Figure 9(a)). Let us use mathematical induction to derive how to sequentially represent each $f^{(m)}$. We still adopt the idea of modularized networks, but now each network module has two outputs. Lemma 17 suggests that a deep network module $N_2$ can generically represent a function over a template simplex.
Step 1. Assume that the two outputs of the first block are $N_2^{(1)}$ and $\Omega^{(1)}$. To represent $f^{(1)}$ over $S^{(1)}$, we similarly need to transform the function from the barycentric coordinate system to the Euclidean coordinate system, with the $D+1$ vertices of $S^{(1)}$ denoted $\{v_0^{(1)}, \cdots, v_D^{(1)}\}$; the resulting $N_2^{(1)}(x)$ is one output of the first block. Next, we derive the other output $\Omega^{(1)}(x)$. Note that we are not allowed to use the input directly. Encouraged by the inversion idea in the univariate case, we invert the function domain into the input domain to get a function $\tilde{x}_{S^{(1)}}$ that is approximately supported only over $S^{(1)}$. To do so, recall that we have $E_1, E_2, \cdots, E_{D+1}$ in constructing $N_2$; then there exists a coefficient vector $(\xi_1^*, \xi_2^*, \cdots, \xi_{D+1}^*)$ such that the required combination holds (Eq. (99)). As shown in Figure 9(b), using the residual connection, we compute $\Omega^{(1)}(x) = x - \tilde{x}_{S^{(1)}}$, which is zero over $S^{(1)}$ and $x$ over the other regions. $\Omega^{(1)}(x)$ will be fed into the next block to construct $f^{(2)}$.
Step 2. Suppose that the two output functions of the $m$-th block are $N_2^{(m)}(x)$ and $\Omega^{(m)}(x)$, where $\Omega^{(m)}(x)$ is fed into the $(m+1)$-th block as its input. Because $S^{(m+1)}$ is outside the domain $S^{(1)} \cup \cdots \cup S^{(m)}$, with the same technique as in Step 1, we construct $N_2^{(m+1)}(\Omega^{(m)}(x))$ from the vertices $\{v_0^{(m+1)}, \cdots, v_D^{(m+1)}\}$ to express $f^{(m+1)}$ over $S^{(m+1)}$ well. $\Omega^{(m)}(x)$ is functionally equivalent to $x$ for two reasons. First, all simplices do not overlap: $\Omega^{(m)}(x)$ is $x$ outside the domain $S^{(1)} \cup \cdots \cup S^{(m)}$, so its zero value over $S^{(1)} \cup \cdots \cup S^{(m)}$ has no effect. Second, w.l.o.g., we assume that $0$ is outside of all simplices or lies on the boundary of a certain simplex. As such, $N_2^{(m+1)}(\Omega^{(m)}(x))$ will not erroneously produce a non-zero constant over $S^{(1)} \cup \cdots \cup S^{(m)}$. Furthermore, we also obtain a function $\tilde{x}_{S^{(m+1)}}$ inside the $(m+1)$-th block; applying the residual operation, we have $\Omega^{(m+1)}(x) = \Omega^{(m)}(x) - \tilde{x}_{S^{(m+1)}}$. Lastly, similar to the one-dimensional deep network, we use the shortcut connection to aggregate the terms $N_2^{(m)}(\Omega^{(m-1)}(x))$ and obtain the deep network $H_2(x)$, which is of depth $O((D+1)M)$ and width $(D+1)D^2$. Please note that the summation is made by shortcuts. It is clear that the depth of $H_2(x)$ dominates over the width.
Experiment: A 2-6-2-1 network was trained on the moon dataset. When the training process finished, the accuracy of the network was 0.94. Then, we transformed the original network into a wide network (µ = 80) and a deep network (µ = 5) using the procedures described in the proof of Theorem 16. The output patterns of the constructed wide network and deep network are shown in Figure 11. It is seen that the constructed networks can approximate the original network well, and the expected artifacts can be suppressed by further increasing µ or by using a post-processing network that identifies and removes streaks.

III. Width-Depth Equivalence by the De Morgan Law
We have the following formal statement for the quasi-equivalence in light of the De Morgan law. The proof of this theorem is constructive.
Proposition 5 (Proposition C.1 in (Lin and Jegelka, 2018)) Given a piecewise constant function $h : \mathbb{R}^d \to \mathbb{R}$, for any small enough $\epsilon > 0$, there exists a ResNet $H_2(x)$ approximating $h$ within $\epsilon$. The sketch of the proof: the detailed proof can be found in Lin and Jegelka (2018); we only show the one-dimensional case here to briefly introduce the idea.
The operations realizable by a residual network block include: 1. shifting: $G^+ = G + c$ for any $c \in \mathbb{R}$, which shifts a function by a constant.

Figure 12: To approximate any univariate function, (a) we construct an increasing trapezoidal function; (b) we adjust the height of each trapezoid function to its corresponding height.
As Figure 12 shows, we first construct an increasing trapezoidal function with a very high plateau over each trapezoid. Then, we adjust the height of each trapezoid function to its corresponding target height. The procedure of constructing an increasing function is shown in Figure 13 and is based on mathematical induction. Suppose $G_m$ satisfies 1. $G_m = 0$ when $x \in (-\infty, a_1]$.
2. G m is a trapezoid function over each [a i , a i+1 ].
Given $G_m$, we derive $G_{m+1}$ accordingly. The procedure of adjusting the height of each trapezoid function is shown in Figure 12(b). An important consideration is that we need to keep the function on previous subdivisions unchanged while adjusting the current trapezoid function; this is enabled by the fact that $\|G\|_\infty$ is large. Supposing that $\phi_k$ is the target value over the interval $[a_k, a_{k+1}]$, we adjust the values over the intervals sequentially. Finally, $R_0^*$ is the constructed function.
Theorem 18 (Quasi-equivalence in light of the De Morgan law) Given a disjoint rule system $\{A_i\}_{i=1}^{n}$, where each rule $A_i$ is characterized by an indicator function $g_i(x)$ over a hypercube $\Gamma_i = [a_{1i}, b_{1i}] \times \cdots \times [a_{Di}, b_{Di}] \subset [0,1]^D$:

fulfilling the De Morgan law $\neg(A_1 \vee A_2 \vee \cdots \vee A_n) = \neg A_1 \wedge \neg A_2 \wedge \cdots \wedge \neg A_n$,
we can construct a wide ReLU network $H_1(x)$ to represent the right-hand side of the De Morgan law and a deep ReLU network $H_2(x)$ to represent the left-hand side, such that for any $\epsilon > 0$, $m(\{x \mid H_1(x) \neq H_2(x)\}) < \epsilon$, where $m$ is the standard measure on $[0,1]^D$.

Proof The defined rule system is a piecewise constant function over hypercubes. The wide and deep networks are constructed as follows. Wide network $H_1(x)$: As illustrated in Figure 14, to represent the rule $A_i$, which is equivalent to $g_i(x)$, a trap-like function is constructed to represent $\neg A_i$, which is equivalent to $1 - g_i(x)$. In our construction, for each hypercube the measure of $\{x \in \Gamma_i \mid H_1(x) \neq g_i(x)\}$ is no more than $\mathrm{vol}(\Gamma_i)(1 - 2\delta)^n$, where $\mathrm{vol}(\Gamma_i)$ is the volume of the hypercube $\Gamma_i$. Therefore, the total measure of errors over all the hypercubes is less than $\sum_i \mathrm{vol}(\Gamma_i)(1 - 2\delta)^n = (1 - 2\delta)^n \sum_i \mathrm{vol}(\Gamma_i) \le (1 - 2\delta)^n$. As a result, letting $\delta < (1 - (\epsilon/2)^{1/n})/2$, we obtain the bound in Eq. (108). Deep network $H_2(x)$: $A_1 \vee A_2 \vee \cdots \vee A_n$ is a piecewise constant function over hypercubes; based on Proposition 5, we obtain the approximation in Eq. (109). Combining Eqs. (108) and (109) leads to $m(\{x \mid H_1(x) \neq H_2(x)\}) < \epsilon$, which concludes our proof.
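The Boolean identity underpinning Theorem 18 can be checked on samples for concrete hypercube rules. The two boxes below are hypothetical disjoint rules; the check is exact elementwise, before any network approximation enters.

```python
import numpy as np

# Monte Carlo check of the rule-system De Morgan law:
# not(A_1 or ... or A_n) == (not A_1) and ... and (not A_n).
rng = np.random.default_rng(1)
X = rng.random((10_000, 2))                  # points in [0, 1]^2

def in_box(X, lo, hi):
    """Indicator g_i of a hypercube rule [lo_1, hi_1] x [lo_2, hi_2]."""
    return np.all((X >= lo) & (X <= hi), axis=1)

A1 = in_box(X, (0.0, 0.0), (0.4, 0.4))       # hypothetical disjoint boxes
A2 = in_box(X, (0.6, 0.6), (1.0, 1.0))

assert not np.any(A1 & A2)                   # the rule system is disjoint
assert np.array_equal(~(A1 | A2), ~A1 & ~A2) # De Morgan, elementwise
```

The wide and deep ReLU constructions of the theorem then only need to reproduce these two sides up to a small-measure error set.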

IV. Quasi-equivalence of Quadratic Networks.
In this section, we first prove the correctness of the continued fraction representation of a polynomial. Then, we report the experimental results on the quasi-equivalence of quadratic networks.

Continued Fraction Representation
For a general univariate polynomial f(x) = Σ_{i=0}^{2N} a_i x^i, there is a continued fraction representation, given in Eq. (111), with coefficients a_i ≠ 0, b_0 = a_0, b_k = a_{2k}/a_{2k−2} for k ≥ 1, and c_0 = a_1, c_k = a_{2k+1}/a_{2k−1} for k ≥ 1. On the right side of (111), the first part contains the terms with even powers of x, while the second part covers the odd powers. Because the even and odd parts share essentially the same format, we only show the correctness of the even part. Specifically, we can derive the continued fraction as in Eq. (112). The example in (113) helps illustrate the correctness of (111). If a_i = 0 for some i, we can derive an approximate continued fraction representation by complementing Σ_{i=0}^{2N} a_i x^i with a term δ_i x^i, where δ_i is a small constant. For a given approximation requirement, we can always choose a sufficiently small, nonzero δ_i to accommodate the accuracy requirement since x is bounded.
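As an illustrative sketch (our own; the exact layout of Eq. (111) is one possibility among several), Euler's continued fraction formula gives a layout consistent with the coefficients b_0 = a_0, b_k = a_{2k}/a_{2k−2} above: the even part Σ_k a_{2k} x^{2k} = b_0(1 + b_1 x^2(1 + b_2 x^2(· · ·))) equals b_0/(1 − b_1 x^2/(1 + b_1 x^2 − b_2 x^2/(· · ·))). The function name is ours:

```python
def cf_even(a, x):
    """Evaluate the even part sum_k a[2k] * x^(2k) of a degree-2N polynomial
    via an Euler-type continued fraction built from b_0 = a[0], b_k = a[2k]/a[2k-2]."""
    N = (len(a) - 1) // 2
    b = [a[0]] + [a[2 * k] / a[2 * k - 2] for k in range(1, N + 1)]
    u = [bk * x * x for bk in b]          # u[k] = b_k * x^2 (u[0] unused)
    D = 1.0 + u[N]                        # innermost denominator
    for k in range(N - 1, 0, -1):         # peel the fraction from the inside out
        D = 1.0 + u[k] - u[k + 1] / D
    return b[0] / (1.0 - u[1] / D)
```

For instance, with a = [2, 0, 3, 0, 5] the even part 2 + 3x^2 + 5x^4 is recovered exactly, matching the telescoping products b_0 b_1 · · · b_k = a_{2k}.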

Experiments
This subsection reports the experimental results on quasi-equivalence of quadratic networks in two steps. In the first step, we show the feasibility of the reciprocal activation function in the network according to the continued fraction representation. In the second step, we compare the accuracy and robustness between wide and deep networks.
Experimental Design: We first preprocessed the MNIST dataset using image deskewing and dimensionality reduction techniques. Image deskewing (https://fsix.github.io/mnist/Deskewing.html) straightens digits that are written in a crooked manner. Mathematically, skewing is modeled as an affine transformation: Image' = A(Image) + b, in which the center of mass of the image is computed to estimate how much offset is needed, and the covariance matrix is estimated to approximate by how much the image is skewed. The center and covariance matrix are then employed for the inverse affine transformation, which is referred to as deskewing. Then, we used t-SNE (Van der Maaten and Hinton, 2008) to reduce the dimension of the MNIST images from 28 × 28 to 2, yielding a two-dimensional embedding space. Figure 15 presents the effect of deskewing on t-SNE. The embedding data can be naturally divided into the training and testing sets based on the partition in the original space. To illustrate the aforementioned quasi-equivalence between quadratic networks, we built three representative quadratic deep learning models: one using two wide sub-networks (by the factorization representation), one using four deep sub-networks with the reciprocal activation (by the continued fraction), and one using four deep sub-networks with the ReLU activation (representing the first-order approximation to the continued fraction representation). Each wide sub-network takes one element of the input, and two paired deep sub-networks take the same element of the input. Each wide sub-network has four layers with 8, 4, 2, 1 neurons sequentially, while each deep sub-network also has four layers with one neuron in each layer. These sub-networks are followed by a fully connected ReLU network of the same structure across models, which has three layers with 300, 200, and 10 neurons, respectively.
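The moment-based deskewing step can be sketched as follows (our own minimal implementation of the standard recipe from the linked page; the shear-only affine model and the function name are assumptions):

```python
import numpy as np

def deskew(img):
    """Shear the image horizontally so the ink's principal axis becomes vertical.

    The shear coefficient alpha = mu11 / mu02 is estimated from the intensity-
    weighted center of mass and covariance, as in the moment-based approach."""
    ys, xs = np.indices(img.shape)
    total = img.sum()
    cy = (ys * img).sum() / total                      # center of mass (row)
    cx = (xs * img).sum() / total                      # center of mass (col)
    mu11 = ((ys - cy) * (xs - cx) * img).sum() / total # mixed second moment
    mu02 = ((ys - cy) ** 2 * img).sum() / total        # vertical variance
    alpha = mu11 / mu02                                # skew estimate
    out = np.zeros_like(img, dtype=float)
    for y in range(img.shape[0]):
        for x in range(img.shape[1]):
            src = int(round(x + alpha * (y - cy)))     # inverse shear lookup
            if 0 <= src < img.shape[1]:
                out[y, x] = img[y, src]
    return out
```

Applied to a purely diagonal stroke, for example, the estimated shear maps the ink onto a single vertical column.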
We performed batch training with 1,000 instances per batch in each iteration. All the parameters were initialized with a truncated Gaussian distribution with mean 0 and variance 0.1. The learning rate was set to 0.002, and the whole network was optimized with Adam (Kingma and Ba, 2015). The number of epochs was 200.
Feasibility of the Reciprocal Activation: At first glance, the training of a deep model using the reciprocal activation function is subject to instability. For example, when the value of the input is large, b_N x^2/(1 + b_N x^2) gets close to 1, which can lead to a zero denominator in the next fraction and undermine the training convergence. However, we argue that when the data are appropriately normalized, the training process can be made stable. We did the following experiment to study how normalization can help. Specifically, the input was normalized as x' = ζ(x − x_min)/(x_max − x_min), where ζ is a scaling factor. We evaluated four scaling factors {1, 4, 8, 16} by repeating the training process 20 times for each factor. The criterion is whether the training process converges or not. In this experiment, as long as 'nan' does not appear, the training will converge because we used a sufficiently large number of epochs. Table 4 shows the success rate of training for different ζ scales. It can be seen that without scaling the training process always fails, while the success rate reaches 80% when ζ = 16. The point is that the training process can be stabilized with an appropriate scaling operation.

Accuracy and Robustness of the Three Models: We repeated the training process 10 times and computed the accuracy of the three models on the test dataset. As shown in Table 5, all three models achieved state-of-the-art results. The performance of the deep model using ReLU was only slightly lower than that of the other two models. Furthermore, we used the following four popular adversarial attack methods to evaluate the robustness of the deep learning models: (1) the fast gradient method (FGM); (2) the fast sign gradient method (FSGM (Goodfellow et al., 2014b)); (3) the iterative fast sign gradient method (I-FSGM (Kurakin et al., 2016)); and (4) DeepFool (Moosavi-Dezfooli et al., 2016).
Let θ denote the model parameters, x the input, y the target pertaining to x, and L(θ, x, y) the loss. FGM generates an adversary as

x_adv = x + ε · ∇_x L(θ, x, y)/||∇_x L(θ, x, y)||_2,

where ε is an amplitude factor; FGM searches for the attack along the direction of the gradient. FSGM computes an adversary based on

x_adv = x + ε · sign(∇_x L(θ, x, y)),

where ε is also an amplitude factor, and sign(·) outputs 1 for a positive argument and −1 otherwise. I-FSGM iteratively derives an adversary with the FSGM formula:

x^{(t+1)} = Clip_{x,α}(x^{(t)} + ε · sign(∇_x L(θ, x^{(t)}, y))),

where Clip_{x,α}(·) is a thresholding function such that the maximum value of x is no more than a preset limit α. In the experiment, we set this limit α to the maximum value of the input. DeepFool calculates an adversary by Eq. (117). Table 6 shows the performance of the three models on the adversarial samples generated from the test dataset using the aforementioned attacking methods. We bold-faced the best scores under attack. It can be seen that the wide model shows the highest accuracy under most attacks. The only exceptions are the cases of FSGM(ε = 3) and FSGM(ε = 5), where the wide model is the second best. Next, we probed the robustness of each model by examining the strength of the adversarial attack needed to make the network performance drop by a pre-specified percentage. Using the absolute performance drop as a reference is better than using the performance compromised by adversarial attacks, as it eliminates the bias induced by the original network performance. Suppose that the original performance (without attack) is O; with each attacking method, we reduced O to O' = O − 5% by gradually increasing the strength of the attack. The reason to select a 5% drop is that if the drop is too high the comparison is not sensitive, while if the drop is too low the attack is too weak. Specifically, we gradually increased the strength by a fixed step for each attacking method until the performance drop exceeded 5%. For FGM, FSGM, I-FSGM and DeepFool, the steps are 50, 1, 1 and 1, respectively. Table 7 shows the needed strength values.
The higher the needed strength, the more robust the model is. We bold-faced the best scores. Overall, the wide model shows the highest robustness. The only exception is the case of I-FSGM, where the wide model is the second best. These data suggest that the wide model seems more robust than the deep model. Further studies are needed to understand the underlying mechanisms and derive practical guidelines.
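As an illustration of the FSGM update described above, the following sketch attacks a hypothetical linear model with the quadratic loss (our own toy example, not the networks evaluated in this section):

```python
import numpy as np

def fsgm_attack(x, y, w, eps):
    """x_adv = x + eps * sign(grad_x L) for the loss L = 0.5 * (w.x - y)^2."""
    grad_x = (w @ x - y) * w          # gradient of the loss w.r.t. the input
    return x + eps * np.sign(grad_x)

def loss(x, y, w):
    return 0.5 * (w @ x - y) ** 2
```

By construction, the perturbation moves each coordinate in the direction that increases the loss, so for a small eps the attacked input incurs a larger loss than the clean one.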

V. Bounds for the Partially Separable Representation
The purpose of this section is to show that a partially separable function can be realized by a quadratic network whose structure is bounded. Let us first introduce the necessary preliminaries. Partially Separable Representation: Approximating a multivariate function f(x_1, x_2, · · · , x_n) by a set of functions of fewer variables is a basic problem in approximation theory (Light and Cheney, 2006). Although some functions f are directly separable in the form f(x_1, · · · , x_n) = Π_{i=1}^n φ_i(x_i), a more general formulation to express a multivariate function f is

f(x_1, · · · , x_n) = Σ_{l=1}^L Π_{i=1}^n φ_{li}(x_i).

If L is permitted to be sufficiently large, then such a model is rather universal. For bivariate functions, an inspiring theorem has been proved (Light and Cheney, 2006): let {u_n}_{n∈N} and {v_n}_{n∈N} be orthonormal bases of L^2(X) and L^2(Y), respectively; then {u_m v_n}_{(m,n)∈N^2} is an orthonormal basis of L^2(X × Y).
To approximate a general multivariate function, we relax the restrictive equality to approximation in the L_1 sense and assume that, for every continuous n-variable function f(x_1, · · · , x_n) on [0, 1]^n and any positive number ε, there exists a group of φ_{li} satisfying

∫_{[0,1]^n} |f(x_1, · · · , x_n) − Σ_{l=1}^L Π_{i=1}^n φ_{li}(x_i)| dx < ε.

The key to demonstrating the boundedness of the partially separable representation with a quadratic network is to use Taylor's expansion and estimate the remainder, leading to a realization of the partially separable representation. Based on such a realization, we obtain an upper bound on the width and depth needed for the partially separable representation.
Theorem 19 For any f ∈ F_{n,k} and x ∈ [0, 1]^n, there exists a quadratic network Q(x) with width no more than k·(n+k choose n) and depth no more than log_2(kn) that represents f in the partially separable representation within any prescribed approximation error. Proof Let f ∈ F_{n,k} and x ∈ [0, 1]^n; then, based on classical multivariate calculus (Widder, 1989), f admits a Taylor expansion with a bounded remainder.
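For concreteness, the width and depth bounds of Theorem 19 can be computed as follows (a small sketch; the function name is ours):

```python
from math import comb, log2, ceil

def theorem19_bounds(n, k):
    """Upper bounds on the quadratic network's width and depth for f in F_{n,k}:
    width <= k * C(n+k, n), depth <= ceil(log2(k*n))."""
    width = k * comb(n + k, n)   # C(n+k, n) counts the degree-<=k monomials in n variables
    depth = ceil(log2(k * n))
    return width, depth
```

For example, for n = 2 variables and smoothness order k = 3, the bounds are a width of 30 and a depth of 3, i.e., the required depth grows only logarithmically.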

VI. Properties of the Partially Separable Representation
Let us compare the properties of the partially separable representation with those of the piecewise polynomial representation and the Kolmogorov-Arnold representation in terms of universality, finiteness, decomposability and smoothness. As a result, the partially separable representation is well justified.
1. Universality: The representation should have sufficient ability to express functions. A piecewise polynomial representation is capable of representing any continuous function in a local way. The Kolmogorov-Arnold representation theorem states that every multivariate continuous function can be represented as a combination of continuous univariate functions, which solves a generalized version of Hilbert's thirteenth problem. Therefore, the Kolmogorov-Arnold representation also possesses universality. As far as the partially separable representation is concerned, its universality in the L_1 distance is analyzed in the above Taylor expansion analysis. In summary, these three representation classes are all powerful enough to express continuous functions.

2. Finiteness: The multivariate Taylor expansion can approximate a general continuous function f ∈ F_{n,k} with a finite number of terms. Although the proof in the manuscript devises a specific strategy for obtaining a partially separable representation, the partially separable representation actually allows a variety of constructions in addition to the Taylor expansion. There is no fundamental reason that the partially separable representation must have exponentially many terms in the majority of practical tasks once reasonable assumptions are incorporated.
3. Decomposability: Decomposability means decoding the functionality of a model in terms of its building units, i.e., cells, layers, blocks, and so on. A plethora of engineering examples, such as software development and optical system design, have shown that modularized analysis is effective. In particular, modularizing a neural network is advantageous for model interpretability. For example, Chen et al. (2016) developed InfoGAN to enhance the decomposability of features learned by a GAN (Goodfellow et al., 2014a): InfoGAN maximizes the mutual information between the latent code and the observations, encouraging the use of noise to encode semantic concepts. Since the partially separable representation is more decomposable than its counterparts, a network constructed in reference to the partially separable representation is more easily interpretable. For instance, such a network has simpler partial derivatives, which can simplify interpretability analysis (Lipton, 2018).

4. Smoothness: A useful representation should be as smooth as possible so that the approximation suppresses high-frequency oscillations and is robust to noise. In the late eighties, Girosi and Poggio claimed that the use of the Kolmogorov-Arnold representation is doomed because the inner and outer functions are highly non-smooth (Girosi and Poggio, 1989). In the piecewise polynomial representation, the situation is better since the representation is smooth at least over each interval. As far as the partially separable representation is concerned, because the expression is greatly relaxed (from exact computation to approximate representation), it is feasible to make each element smooth as L grows large and to extract meaningful structures underlying big data.

Table 8: Comparison of the three functional representations, where "✓" means "yes", "✗" means "no", and "–" means "mediocre".

Representation | Universality | Finiteness | Decomposability | Smoothness
In summary, we compare the three representation schemes in terms of universality, finiteness, decomposability, and smoothness in Table 8. Clearly, the partially separable representation is suitable for machine learning tasks.

VII. Width-Depth Correlation Through the Partially Separable Representation
Now, let us analyze the complexity of the quadratic network in light of the aforementioned partially separable representation scheme, as shown in Figure 16. Suppose that the polynomial P_{li} is of degree N_{li}; then the representation of all the P_{li} can be done with a network of width Σ_{l=1}^L Σ_{i=1}^n N_{li} and depth max_{l,i}{log_2(N_{li})}. Next, the multiplication demands an additional network of width Ln and depth log_2(n). Therefore, the overall quadratic network architecture will be of width max{Σ_{l=1}^L Σ_{i=1}^n N_{li}, Ln} and depth max_{l,i}{log_2(N_{li})} + log_2(n). The depth scales with a log function, which changes slowly when the dimensionality of the input is large. For simplicity, we approximate the depth as max_{l,i}{log_2(N_{li})} + log_2(n) = log_2(max_{l,i}{N_{li}}) + log_2(n) ≈ α log_2(Σ_{l=1}^L Σ_{i=1}^n N_{li}) + log_2(n), where α is a positive constant. Let N_Σ = Σ_{l=1}^L Σ_{i=1}^n N_{li}, which describes the overall complexity of the function to be expressed; then the formulas for the width and depth simplify as in Eq. (124). Figure 16: Use of a quadratic network to represent a partially separable representation. Suppose that the polynomial P_{li} is of degree N_{li}; the width and depth of the quadratic network to approximate P_{li} are N_{li} and log_2(N_{li}), respectively.
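The bookkeeping above can be sketched directly (our own helper; N[l][i] stands for the degree N_li, and we use ceilings of the logarithms):

```python
from math import ceil, log2

def quadratic_net_size(N):
    """Width/depth of the quadratic network realizing a partially separable
    representation, following the counts in this section. N[l][i] = deg(P_li)."""
    L, n = len(N), len(N[0])
    rep_width = sum(sum(row) for row in N)                 # sum over l, i of N_li
    width = max(rep_width, L * n)                          # polynomial stage vs. product stage
    depth = max(ceil(log2(d)) for row in N for d in row) + ceil(log2(n))
    return width, depth
```

For instance, with L = 2, n = 2 and degrees [[2, 4], [8, 2]], the network needs width 16 and depth 4, illustrating that the width, not the depth, tracks the total complexity N_Σ.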
One interesting point from (124) is that lower bounds for the depth and width to realize a partially separable representation are also suggested. In Figure 17, we plot the width and depth as N_Σ changes. There are two highlights in Figure 17. The first is that the width is generally larger than the depth, which differs from the superficial impression given by deep learning. The second is that, as N_Σ goes up, the width/depth ratio increases accordingly.
Remark: Through the complexity analysis, we realize that the width and depth of a network depend on the structure or complexity of the function to be approximated. In other words, they are controlled by the nature of a specific task. As the task becomes complicated, the width and depth must increase accordingly, and the combination of the width and depth is not unique.

VIII. Effects of Width on Optimization and Generalization
We first illustrate the importance of width for optimization in the context of over-parameterization, kernel ridge regression, and the NTK, and then report our finding that existing generalization bounds and VC-dimension results somehow suggest the equivalence of width and depth for a given network complexity.
1. Optimization: The optimization mechanism is the key to understanding the training process of neural networks (Ge et al., 2015). From the optimization viewpoint, the fact that randomly-initialized first-order methods can find well-generalizable minima on the training data is quite intriguing. Given the theme of this paper, we focus on the importance of width for the optimization of neural networks, in the hope of providing insights for practitioners. We divide the relevant literature into three categories. (1) Increasing width for over-parameterization: Brutzkus et al. showed that a wide two-layer network using the hinge loss can generalize well on linearly separable data with stochastic gradient descent (SGD) (Brutzkus et al., 2017). Li and Liang showed that when data are normalized and fulfill a separability condition, a two-layer over-parameterized network can learn these data in polynomial time (Li and Liang, 2018). Allen-Zhu et al. (2019) showed that if the width of the hidden layers is O(n^30 D^30 log^30(1/ε)), where n is the number of samples, D is the network depth, and ε is the expected error, then gradient descent converges with error ε. This bound was further reduced to O(n^4 2^D) for networks using non-linear smooth activation functions such as softplus (Du et al., 2018). These papers established that a pre-specified small training error can be achieved by gradient descent when the network is very wide. The secret therein is that the weight matrix stays very close to its initialization due to the NTK (Jacot et al., 2018).
(2) Kernel ridge regression: A neural network tends to give a Gaussian process when the width goes infinitely large. In this situation, training only the top layer of a network (weak training) reduces to kernel ridge regression. Classically, the weak training minimizes the quadratic loss

min_w ||Xw − y||^2 + λ||w||^2,

where X = [x_1; ...; x_n] is a collection of data x_i ∈ R^{1×d}, i = 1, ..., n. Taking the derivative with respect to w and equating it to zero gives

w = (X^T X + λI)^{−1} X^T y.

Given a new example x_*, the prediction is x_* w. We replace the data samples with the high-dimensional representations of the neural network, x → g_θ(x) ∈ R^{1×k}, and let Φ_g(X) = [g_θ(x_1); ...; g_θ(x_n)]. By a similar derivation, the prediction becomes g_θ(x_*)(Φ_g(X)^T Φ_g(X) + λI)^{−1} Φ_g(X)^T y. When the width is very large, the weak training is reasonable because, due to the Gaussian process, Φ_g(X) converges to E_{θ∼Θ}[Φ_g(X)] and Φ_g(X)Φ_g(X)^T converges to E_{θ∼Θ}[Φ_g(X)Φ_g(X)^T], which can be approximated at random initialization.
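The weak-training solution above is ordinary ridge regression on the feature map; a minimal numerical sketch (our own, with a hypothetical feature matrix standing in for Φ_g(X)):

```python
import numpy as np

def weak_train_predict(Phi, y, Phi_star, lam=1e-3):
    """Ridge solution w = (Phi^T Phi + lam*I)^(-1) Phi^T y, then predict Phi_star @ w."""
    k = Phi.shape[1]
    w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(k), Phi.T @ y)
    return Phi_star @ w
```

With an orthonormal feature matrix, the prediction recovers the corresponding combination of the targets up to the ridge shrinkage factor 1/(1 + λ).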
(3) Neural tangent kernel: In the above argument, we have justified training only the top layer by the large network width. How about training the entire network? Given the dataset {(x_i, y_i)}_{i=1}^n, we also consider training the network with the quadratic loss for regression: l(θ) = (1/2) Σ_{i=1}^n (f_θ(x_i) − y_i)^2. Consider gradient descent with an infinitesimally small learning rate; then

dθ(t)/dt = −∇_θ l(θ(t)).

Next, we can describe the dynamics of the model output given the input x_j as in Eq. (130). Let u(t) = (f_{θ(t)}(x_i))_{i=1,...,n} collect the outputs for all samples at time t, and let y = (y_i)_{i=1,...,n} be the targets; then (130) can be written compactly as

du(t)/dt = −H(t)(u(t) − y),

where the pq-entry of H(t) is ⟨∇_θ f_{θ(t)}(x_p), ∇_θ f_{θ(t)}(x_q)⟩. In the infinite-width limit, the Gaussian process view shows that H(t) becomes a constant H^*. Therefore,

u(t) = y + e^{−H^* t}(u(0) − y),

which characterizes the trajectory of training in the functional space instead of the parameter space. 2. Generalization: Analysis of generalization bounds is a powerful tool to explain the excellent performance of neural networks. Traditional wisdom suggests that increased model complexity causes over-fitting to the training data, which contradicts the fact that deep networks can easily fit random labels to the data and yet practically generalize well (Zhang et al., 2016). Recently, generalization theory has gained increasingly more traction. We have summarized the state-of-the-art generalization bounds in Table 10 and provided their complexities. In Table 10, we boldface the bounds of interest in which the width and depth dominate. As far as Neyshabur et al. (2015) is concerned, we will not focus on it due to its exponential dependence on the depth. Instead, we argue that the bold-faced bounds somehow suggest the width-depth equivalence under a given complexity. Now, we use B_1(L, p) = log(p)√(L^3), B_2(L, p) = log(Lp)√(L^3 p), and B_3(L, p) = √(Lp) to denote the bounds of interest. Figure 18 shows the contour plots of B_1, B_2 and B_3.
Along a contour, the width and depth change while the bound stays the same; that is, the depth and width of a neural network are mutually transformable without theoretically impacting the overall generalization ability. Therefore, increasing either the width or the depth can boost the hypothesis space of a neural network. In other words, when it comes to promoting the expressive power of a network, increasing the width is essentially equivalent to increasing the depth in the sense of the VC dimension, which also implies the equivalence of the width and depth of neural networks.

Table 10: Summary of the generalization bounds (columns: Reference and Bound). ||W_d||_2, ||W_d||_F, and ||W_d||_{2,1} denote the upper bounds of the spectral norm, Frobenius norm, and (2,1)-norm of the rank-r weight matrix in the d-th layer; generally, ||W_d||_F and ||W_d||_{2,1} are √r times larger than ||W_d||_2. γ means that the ramp loss function is 1/γ-Lipschitz, Γ is the lower bound for the product of the spectral norms of all the layers, and m is the size of the data.

IX. Rethinking the Depth Separation with Intra-layer Links
Depth separation highlights the representation ability of a deep network and has been intensively investigated over the past years. The idea is to show that a deep network can approximate a complicated function more efficiently than a shallow network can. One notable depth separation result is from Arora et al. (2016), showing that "for every pair of natural numbers k ≥ 1, w ≥ 2, there exists a family of hard functions representable by an R → R (k + 1)-layer feedforward ReLU DNN of width w such that if it is also representable by a (k' + 1)-layer feedforward ReLU DNN for any k' ≤ k, then this (k' + 1)-layer feedforward ReLU DNN has size at least (1/2)k'w^{k/k'} − 1". Suppose that the number of neurons, i.e., the size, of a ReLU network is s and that the piecewise linear function of this ReLU network has p pieces; the core of the depth separation theorem established in Arora et al. (2016) is summarized as the size-piece relationship s ≥ (1/2)k(2p)^{1/k} − 1. In the proposed network structure for the width-depth conversion, intra-layer links are employed so that a network can flexibly represent arbitrary piecewise linear functions over polytopes using fan-shaped functions. Here, we find that adding intra-layer links can greatly increase the maximum number of pieces representable by a shallow network, such that it can express as complicated a function as a deep network can. As shown in Table 11, letting n be the number of intra-linked neurons, the size-piece inequality s ≥ (n/(2(2^n − 1)))k(2p)^{1/k} − 1 trivially holds true. As such, it can no longer be used to demonstrate depth separation. Instead, we need to rethink depth separation and re-characterize the power of a network when intra-layer links are used. Our detailed proof is as follows. Table 11: Network structures and their associated size-piece relationships. The number of neurons, i.e., the size of a ReLU network, is s; k is the number of hidden layers; the piecewise linear function of the ReLU network has p pieces; n is the number of intra-linked neurons.
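The two size-piece lower bounds in Table 11 can be compared numerically; a small sketch (the function names are ours):

```python
def min_size_plain(p, k):
    """Arora et al. (2016): a depth-(k+1) ReLU DNN with p pieces has size >= this."""
    return 0.5 * k * (2 * p) ** (1 / k) - 1

def min_size_intra(p, k, n):
    """Our intra-linked bound from Table 11: n/(2(2^n - 1)) * k(2p)^(1/k) - n/(2^n - 1)."""
    return n / (2 * (2 ** n - 1)) * k * (2 * p) ** (1 / k) - n / (2 ** n - 1)
```

For example, with p = 8 pieces at depth 2 (k = 1), the plain bound requires size at least 7, while pairing neurons (n = 2) lowers the requirement to about 4.7; this shrinkage is why the inequality no longer separates depths.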
The result listed in the first row of Table 11 is obtained from the following two lemmas and proofs in (Arora et al., 2016).
Lemma 20 (Lemma D.5 in (Arora et al., 2016)) Let f : R → R be a function represented by an R → R ReLU DNN, as shown in Figure 19(a), with depth k + 1 and widths w_1, . . . , w_k of the k hidden layers. Then f is a PWL function with at most 2^{k−1} · (w_1 + 1) · w_2 · . . . · w_k pieces.
Lemma 21 (Lemma D.6 in (Arora et al., 2016)) Let f : R → R be a piecewise linear function with p pieces. If f is represented by a ReLU DNN, as shown in Figure 19(a), with depth k + 1, then it must have size at least (1/2)kp^{1/k} − 1. Conversely, any piecewise linear function f represented by a ReLU DNN of depth k + 1 and size at most s can have at most (2s/k)^k pieces. Now, we prove the result listed in the second row of Table 11 for a ReLU DNN with intra-links.
Lemma 22 (Our 1st New Result) Let f : R → R be a function represented by an R → R ReLU DNN, as shown in Figure 19(b), with depth k + 1 and widths w_1, . . . , w_k of the k hidden layers. Then f is a PWL function with at most 3^{k−1} · (3w_1/2 + 1) · w_2 · . . . · w_k pieces. Proof As shown in Figure 19(b), let us connect every two neurons with an intra-layer link. Suppose that one neuron is σ(wx + b), which can create 1 breakpoint at −b/w, while the other neuron is σ(w'x + b' + σ(wx + b)), creating at most 3 breakpoints at −b/w, −b'/w', and −(b + b')/(w + w'). Thus, the first layer can produce at most (w_1/2) · 3 = 3w_1/2 distinct breakpoints, i.e., 3w_1/2 + 1 pieces. Let us estimate the number of pieces via induction. Assume that for some k ≥ 1, any R → R ReLU DNN, as shown in Figure 19(b), with depth k + 1 and widths w_1, . . . , w_k of the k hidden layers produces at most A_k = 3^{k−1} · (3w_1/2 + 1) · w_2 · . . . · w_k pieces. We add one more layer of w_{k+1} neurons to this network such that its pre-activation is the output of an R → R ReLU DNN with depth k + 1 and widths w_1, . . . , w_k. Suppose that the pre-activation of the i-th neuron is f_i, i ∈ [w_{k+1}]. By the induction hypothesis, f_i is a piecewise linear function with at most A_k pieces. Without loss of generality, let the i-th and (i + 1)-th neurons be connected by an intra-layer link. Their outputs are σ(f_i) and σ(f_{i+1} + σ(f_i)), whose total number of pieces is determined by the three functions σ(f_i), σ(f_{i+1}), and σ(f_i + f_{i+1}), and is therefore at most 6A_k. Because we have w_{k+1} neurons in the last hidden layer, we can get at most 6A_k · (w_{k+1}/2) = 3^k · (3w_1/2 + 1) · w_2 · . . . · w_k · w_{k+1} pieces, which concludes the induction step.
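The three-breakpoint argument for one intra-linked pair can be checked numerically: away from the three candidate breakpoints, σ(w'x + b' + σ(wx + b)) must be linear. A small verification sketch with hypothetical weights (the midpoint of a linear segment equals the average of its endpoint values):

```python
def relu(z):
    return max(z, 0.0)

# Hypothetical intra-linked pair: the second neuron receives the first one's output.
w, b = 1.0, -0.5
wp, bp = -2.0, 0.4

def g(x):
    return relu(wp * x + bp + relu(w * x + b))

# Candidate breakpoints predicted by the lemma's argument.
cands = sorted([-b / w, -bp / wp, -(b + bp) / (w + wp)])

# On every interval between consecutive candidates (and on the outer rays),
# g must be linear: the midpoint value equals the average of the endpoints.
pts = [cands[0] - 1.0] + cands + [cands[-1] + 1.0]
linear_everywhere = all(
    abs(g((lo + hi) / 2) - (g(lo + 1e-9) + g(hi - 1e-9)) / 2) < 1e-6
    for lo, hi in zip(pts, pts[1:])
)
```

So all slope changes of g fall among the three predicted locations, consistent with the 3-breakpoints-per-pair count in the proof.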
Lemma 23 (Our 2nd New Result) Let f : R → R be a piecewise linear function with p pieces. If f is represented by a ReLU DNN, as shown in Figure 19(b), with depth k + 1, then it must have size at least (1/3)k(2p)^{1/k} − 2/3. Conversely, any piecewise linear function f represented by a ReLU DNN of depth k + 1 and size at most s can have at most (1/2)(3s/k)^k pieces.
Lemma 24 (The 1st New Result Extended to n Intra-Linked Neurons) Let f : R → R be a function represented by an R → R ReLU DNN, as shown in Figure 19(c), with depth k + 1, widths w_1, . . . , w_k of the k hidden layers, and n intra-linked neurons in each layer. Then f is a PWL function with at most 2^{k−1} · ((2^n − 1)w_1/n + 1) · ((2^n − 1)w_2/n) · . . . · ((2^n − 1)w_k/n) pieces.
Proof For the first layer, n intra-linked neurons can create at most 2^n − 1 breakpoints. Let us derive this fact via induction. Suppose that G_{n−1} is the piecewise linear function of n − 1 intra-linked neurons; then G_n = σ(w_n x + b_n + G_{n−1}). Where G_{n−1} > 0, G_n = σ(w_n x + b_n + G_{n−1}) generates at most 2^{n−1} − 1 breakpoints (the same as G_{n−1}). Where G_{n−1} ≤ 0, G_n = σ(w_n x + b_n) has one breakpoint. Without repetition, the number of breakpoints of G_n is the sum of these newly-generated breakpoints and the breakpoints of G_{n−1}, which is 2(2^{n−1} − 1) + 1 = 2^n − 1. Because the first layer has w_1 neurons, at most w_1/n groups of n intra-linked neurons are available. Thus, we can get at most (2^n − 1)w_1/n + 1 pieces. Computing the number of pieces for the subsequent hidden layers shares the same spirit as the first layer; combining the proof of Lemma 22, we conclude that an R → R ReLU DNN, as shown in Figure 19(c), with depth k + 1, widths w_1, . . . , w_k of the k hidden layers, and n intra-linked neurons in each layer has at most 2^{k−1} · ((2^n − 1)w_1/n + 1) · ((2^n − 1)w_2/n) · . . . · ((2^n − 1)w_k/n) pieces.
Lemma 25 (The 2nd New Result Extended to n Intra-Linked Neurons) Let f : R → R be a piecewise linear function with p pieces. If f is represented by a ReLU DNN, as shown in Figure 19(c), with depth k + 1 and n neurons intra-linked in each hidden layer, then it must have size at least (n/(2(2^n − 1)))k(2p)^{1/k} − n/(2^n − 1). Conversely, any piecewise linear function f represented by such a ReLU DNN of depth k + 1 and size at most s can have at most (1/2)(2(2^n − 1)s/(nk))^k pieces.
Proof Please refer to our proof for Lemma 23.