Serial and Parallel Convolutional Neural Network Designs for NFDM Signals

Two conceptual convolutional neural network (CNN) designs are proposed for directly decoding NFDM signals with hardware implementation in mind. A serial network is designed for a small network size, suitable for small-end-user applications, and a parallel network is designed for speed, suitable for places such as data centres. In the numerical demonstrations, the serial design occupies only 0.5 MB of memory while the parallel design occupies 128 MB of memory but allows parallel computing. Both network designs were trained with simulated data and were able to reach more than 99.9% accuracy.


Introduction
The nonlinear Fourier transform (NFT) has been proposed as an alternative technique for fibre-optic communication to break through the Shannon linear capacity limit imposed by the nonlinearity of optical fibres [1][2][3][4] . It has already been demonstrated experimentally that NFT-based systems can outperform conventional linear Fourier transform-based systems [5][6][7] . However, NFT-based communication methods have not yet reached the maturity level of conventional methods. One missing aspect is the hardware implementation of the NFT. Only very recently was a hardware implementation conceptually proposed on an FPGA platform 8 . However, due to the complexity of NFT algorithms, implementing and optimizing the hardware can take years of effort before any products can be built for practical use.
On the other end, it has been shown that a convolutional neural network (CNN) can be used to directly decode nonlinear frequency division multiplexing (NFDM) signals without actually performing the NFT 9 . Due to the popularity of machine learning, hardware implementation of deep neural networks has itself become an active research area with support from mega-corporations such as Google and Nvidia [10][11][12][13][14][15][16] . It is an opportunity to take advantage of the rapid development of deep-learning hardware and use it for NFT applications.
In the previous work 9 , the network design was not efficient, especially for signals with a large number of subcarriers, since every subcarrier required a separate network. Furthermore, the previous work only demonstrated networks for a fixed signal power, without considering fibre loss or the case of lumped amplification. In this paper, we present two network design ideas with hardware implementation in mind: a serial design aimed at a small network size and a parallel design aimed at performance. We also demonstrate the networks with variable signal power and in the presence of loss and lumped amplification.
The paper is divided into five sections apart from the Introduction. In Section 2, a quick introduction to the nonlinear Fourier transform and nonlinear frequency division multiplexing is given. Section 3 gives the basic parameters and a description of the generation of training and validation data. Sections 4 and 5 discuss the serial and parallel network designs respectively, followed by a conclusion in Section 6.

Modulation of continuous nonlinear spectrum
Transforming a temporal signal q(t) into the nonlinear Fourier domain amounts to solving the following differential equation

v_t = [ −jλ, q(t); −q*(t), jλ ] v

with the initial condition v(t, λ) → (1, 0)^T e^{−jλt} as t → −∞, where v_t is the derivative of v and λ is the nonlinear frequency. Two time-invariant coefficients a(λ) and b(λ) can be found using the following limits 3 :

a(λ) = lim_{t→+∞} v_1(t, λ) e^{jλt},   b(λ) = lim_{t→+∞} v_2(t, λ) e^{−jλt}.

The nonlinear spectrum of the signal q(t) is thus defined as

Q(λ) = b(λ)/a(λ) for λ ∈ R,   Q(λ_k) = b(λ_k)/a′(λ_k) for λ_k ∈ C+ with a(λ_k) = 0,

where a′(λ) = ∂a(λ)/∂λ. The nonlinear Fourier spectrum is continuous for λ ∈ R and discrete where a(λ) = 0 for λ ∈ C+. The spectrum in both the continuous and discrete regions can be used to carry information, but in this work, we focus on the continuous part only.
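As an illustration of these definitions, a(λ) and b(λ) for a sampled signal can be computed by piecewise-constant (transfer-matrix) integration of the system above. This is a minimal numerical sketch, not the implementation used in this paper; the function name and grid are our own choices.

```python
import numpy as np

def zs_scatter(q, t, lam):
    """Forward Zakharov-Shabat scattering for one real nonlinear frequency lam.

    Propagates v_t = [[-j*lam, q], [-conj(q), j*lam]] v across the samples of q,
    assuming q is constant over each time step, and returns a(lam), b(lam).
    """
    dt = t[1] - t[0]
    # boundary condition: v -> (1, 0)^T exp(-j*lam*t) as t -> -inf
    v = np.array([np.exp(-1j * lam * t[0]), 0.0 + 0.0j])
    for qi in q:
        k = np.sqrt(lam**2 + abs(qi) ** 2)
        P = np.array([[-1j * lam, qi], [-np.conj(qi), 1j * lam]])
        # exp(dt*P) = cos(k*dt) I + (sin(k*dt)/k) P, since P^2 = -k^2 I
        v = (np.cos(k * dt) * np.eye(2) + dt * np.sinc(k * dt / np.pi) * P) @ v
    T = t[0] + len(q) * dt            # right edge of the grid after the last step
    a = v[0] * np.exp(1j * lam * T)
    b = v[1] * np.exp(-1j * lam * T)
    return a, b

t = np.linspace(-10, 10, 2000)
# trivial check: q = 0 gives a = 1, b = 0
a0, b0 = zs_scatter(np.zeros_like(t), t, 0.7)
# for any real lam the step matrices are unitary, so |a|^2 + |b|^2 = 1
a1, b1 = zs_scatter(2.0 / np.cosh(t), t, 0.5)
```

The unitarity relation |a|² + |b|² = 1 on the real axis is also what connects Eq. (9) below to a(λ), since 1 + |Q|² = 1/|a|².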
Nonlinear frequency division multiplexing (NFDM) is a multiplexing technique borrowed from orthogonal frequency division multiplexing (OFDM) in conventional fibre-optic communication, with the linear spectrum replaced by the nonlinear spectrum 3 , and it has become popular in recent years 6,[17][18][19][20][21] . The modulation is usually applied to the continuous part of the nonlinear spectrum Q(λ ) directly, but modulating b(λ ) (b-modulation) has also been used as it results in pulses with well defined temporal windows 7,[22][23][24][25][26] .
The modulation can often be expressed as

s(λ) = Σ_n c_n w_n(λ),

where the sum runs over the n subcarriers, c_n are the complex data symbols and w_n(λ) is the carrier wave. Conventionally, a sinc function is used as w(λ ) 17 . Other choices, such as a raised cosine or a flat-top window with a cardinal-sine carrier waveform, have also been proposed 23 . The function s(λ ) is the modulated spectrum. It is often scaled to match the desired signal power. In the Q-modulation case, Q(λ ) = s(λ ), and an inverse NFT can be performed to obtain the temporal signal q(t) 17 . In the case of b-modulation, b(λ ) = s(λ ) and an additional step is needed to generate a(λ ) from b(λ ) before the inverse NFT step can be taken 22 . Inspired by the work in 23 , we use the following carrier wave functions for this work:
w_n(t) = sinc(4t/T_0) e^{−t²/τ²} e^{j8πβnt/T_0},    (8)

where T 0 denotes the width to the 4th zero of the sinc function [note: here sinc(t) = sin(πt)/(πt)], τ is a free parameter that adjusts the width of the Gaussian, which is also related to the sharpness of the flat-top waveform in the frequency domain, and β adjusts the separation between subcarriers. In the nonlinear frequency domain, the corresponding w n (λ ) can be obtained by replacing the angular frequency in w n (ω) with 2λ (or −2λ depending on the choice of the sign in the Fourier transform), where w n (ω) is the linear Fourier transform of w n (t).
From Eq. (8), one can see the nonlinear frequency separation between subcarriers is 4πβ /T 0 , whilst 4π/T 0 is approximately the bandwidth of each subcarrier. For the rest of the work, τ = 1 and β = 1 are used.
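To make the modulation concrete, the following sketch builds s(λ) with the conventional sinc carriers mentioned above. The subcarrier spacing value and the symbol values are illustrative assumptions; the spacing stands in for the separation 4πβ/T_0.

```python
import numpy as np

def modulated_spectrum(symbols, sep):
    """s(lam) = sum_n c_n w_n(lam) with conventional sinc carriers
    w_n(lam) = sinc((lam - lam_n)/sep), lam_n = n*sep.  'sep' plays the role
    of the subcarrier separation 4*pi*beta/T0 (an illustrative stand-in).
    """
    idx = np.arange(len(symbols)) - len(symbols) // 2   # centred subcarrier indices
    centres = idx * sep
    def s(lam):
        # np.sinc(x) = sin(pi*x)/(pi*x): 1 at x = 0, 0 at other integers
        return np.sum(symbols * np.sinc((lam - centres) / sep))
    return s, centres

# eight subcarriers carrying 16-QAM-like symbols (illustrative values)
c = np.array([1 + 1j, -1 + 3j, 3 - 1j, -3 - 3j, 1 - 3j, -1 - 1j, 3 + 3j, -3 + 1j])
s, centres = modulated_spectrum(c, sep=2.0)
```

With sinc carriers, evaluating s(λ) at a subcarrier centre returns exactly that subcarrier's symbol, which is the orthogonality property the multiplexing relies on.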

Generation of training data
The preparation of the training data is crucial to the performance of a neural network. Ideally, the training data should be generated experimentally such that the network can be tuned to the specific experimental setup. However, for demonstration purposes, we generate the training data through numerical simulation, with the simulation conditions tuned to match those of experiments closely. We assume a signal generator with a sampling rate of 96 GS/s and a total electronic 3-dB bandwidth of 28 GHz. A time window of 21.33 ns is used for each burst of the signal. The laser source has a linewidth of 100 Hz at the central wavelength of 1550 nm. Standard single-mode fibres are used with a propagation loss of 0.2 dB/km, group velocity dispersion of -21.7 ps 2 /km (β 3 = 18.6 ps 3 /km) and nonlinearity of 1.1 W −1 km −1 . A maximum of 10 optical fibre spans with 50 km per span is used in the simulation, with point amplification at the end of each span. For each burst of signal, a random number of spans between 1 and 10 is chosen. Lumped amplification is applied in the simulation and a path-average model is used to counteract the effects of fibre loss and lumped amplification 27 . Quadrature amplitude modulation (16-QAM) is used for encoding the subcarriers, with a direct mapping of the numbers 0 to 15 to the symbols. The numbers are then used as labels for the network training algorithm. We want the network to perform equally well at different input power levels in the range of -20 dBm to 0 dBm; hence we need a way to randomly generate the training data with power evenly distributed within this range, to avoid biasing the final network towards certain input powers. To achieve this, we developed and used the approximated energy calibration model described in the next section. The noise background in the training data is chosen to be -40 dBm per amplification stage.

An approximated model for energy calibration
The energy of a burst can be evaluated from its nonlinear spectrum Q(λ ) as

E = (1/π) ∫ ln(1 + A²|Q(λ)|²) dλ,    (9)

where A is a real scalar scaling factor through which different burst energies can be set. The energy E does not scale linearly with A. Depending on the choice of the carrier wave and the number of subcarriers, the same A results in different energy levels. This is particularly inconvenient for generating training data, since a uniform distribution of power on the log scale is needed (-20 dBm to 0 dBm in this work). Here we present a fitting model that gives an approximate mapping between A and the average burst energy. Firstly, we discretize the integral in Eq. (9) as

E ≈ (∆λ/π) Σ_i ln(1 + A²|Q_i|²).

Rearranging the equation, we have

πE/∆λ = Σ_i ln(1 + A²|Q_i|²).

Let C = ∏_i (1 + A²|Q_i|²), so that πE/∆λ = lnC, and take the natural logarithm on both sides of the equation above:

ln E + ln(π/∆λ) = ln lnC.    (13)

Next, we normalize the symbol constellation to its mean amplitude, c n /|c n |, and the carrier wave function to its peak amplitude, w n (λ )/max (w n (λ )). From now on, all Q(λ ) shall be generated using the normalized constellation and carrier waves. After that, we assume a constant effective amplitude |Q| such that

C ≈ (1 + A²|Q|²)^n,  i.e.  E ≈ (Λ/π) ln(1 + A²|Q|²),

where n is the number of samples in λ and Λ = n∆λ is the window size in the nonlinear spectrum. Now, we define a new parameter A ′ by inverting this approximation for a target energy E ′ :

A′ = sqrt(e^{πE′/Λ} − 1) / |Q|.    (17)

We choose a few points E ′ within the energy range [E min , E max ] and calculate A ′ using Eq. (17). Once we get A ′ , we replace A in Eq. (9) with A ′ and calculate the actual average pulse energy E avg using randomly generated Q(λ ); Figure 1 shows an example. Finally, a least-squares fit

[k; m] = pinv([ln E avg , J]) ln E′

is applied to ln E avg and ln E ′ , where E avg and E ′ are column vectors, J is an all-ones column vector and pinv is the matrix pseudo-inverse function. The calibrated A cal for a desired energy E is then calculated by passing E through the fit and inverting the approximation once more:

A cal = sqrt(e^{πE″/Λ} − 1) / |Q|,  with  ln E″ = k ln E + m.
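The calibration recipe above can be sketched numerically as follows. The helper names, the RMS choice for the effective constant |Q| and the direction of the log-log fit are our assumptions; the energy formula is the discretized Eq. (9).

```python
import numpy as np

def burst_energy(A, Q, dlam):
    """Discretized Eq. (9): E = (dlam/pi) * sum_i ln(1 + A^2 |Q_i|^2)."""
    return (dlam / np.pi) * np.sum(np.log1p(A**2 * np.abs(Q) ** 2))

def calibrate_A(targets, Qs, dlam, Qbar):
    """Map desired burst energies to scaling factors A.

    targets : trial energies E' (e.g. spread uniformly on a log scale)
    Qs      : randomly generated normalized spectra Q(lam)
    Qbar    : assumed constant effective amplitude |Q| (our choice: RMS of |Q_i|)
    """
    Lam = len(Qs[0]) * dlam                      # nonlinear-spectral window size
    # invert the constant-|Q| approximation E = (Lam/pi) ln(1 + A^2 |Q|^2)
    Aprime = np.sqrt(np.expm1(np.pi * np.asarray(targets) / Lam)) / Qbar
    # actual average energy produced by each A' over the random spectra
    Eavg = [np.mean([burst_energy(Ap, Q, dlam) for Q in Qs]) for Ap in Aprime]
    # least-squares fit of ln E' against ln E_avg, using pinv as in the text
    X = np.column_stack([np.log(Eavg), np.ones(len(Eavg))])
    k, m = np.linalg.pinv(X) @ np.log(targets)
    def A_cal(E):
        # run the desired energy through the fit, then invert the model again
        Eprime = np.exp(k * np.log(E) + m)
        return np.sqrt(np.expm1(np.pi * Eprime / Lam)) / Qbar
    return A_cal

# usage sketch: random normalized spectra and log-spaced target energies
rng = np.random.default_rng(0)
Qs = [rng.normal(size=64) + 1j * rng.normal(size=64) for _ in range(200)]
Qbar = np.sqrt(np.mean([np.mean(np.abs(Q) ** 2) for Q in Qs]))
A_cal = calibrate_A(np.geomspace(0.05, 2.0, 8), Qs, dlam=0.1, Qbar=Qbar)
```

The returned A_cal(E) then plays the role of the calibrated scaling factor for a requested average burst energy E within the fitted range.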

The serial CNN design
In the original work 9 , to decode each subcarrier, a transformation needs to be performed on the received signal to remove the propagation-induced temporal shift and phase change. Then, each subcarrier is decoded using its own corresponding network. This process is rather inefficient since the differences between the nonlinear spectra of the subcarriers are rather small, except for the ones on the edges of the nonlinear spectrum. There are overlaps between the networks from the previous work and, in principle, all of them can be combined into one. Furthermore, during our study, we realised that the evolution of the nonlinear spectrum of subcarrier k is mostly affected by its neighbouring subcarriers. Therefore, the input to the network can be reduced significantly by applying a gate function to the input pulse in the nonlinear frequency domain, which greatly reduces the parameter space of the network. Figure 2 illustrates the idea. An example pulse in the linear frequency domain is aligned with 128 subcarrier waves w n (λ ), where the nonlinear frequency is half of the linear frequency (2λ = ω). For any arbitrary subcarrier (between the two black vertical lines), one adjacent subcarrier on each side (between the two red lines) is used as the input to the network.
With these ideas in mind, we design a serial network scheme that works for all subcarriers. The network is trained on only one subcarrier (subcarrier n = 0). To decode any other subcarrier with this network, a segment of the pulse spectrum q̃(ω) around that subcarrier (between the two red lines) is taken out and shifted using Eq. (19),
q̃ k (ω, z) = q̃(ω + 2λ k , z) e^{jt k (ω+λ k )},    (19)

where λ k is the central nonlinear frequency of subcarrier k and t k = 4λ k z. This equation shifts the central frequency of subcarrier k to subcarrier 0 and removes the phase change due to propagation from the nonlinear spectrum 9 . Note that a negative sign is used in the Fourier transforms when obtaining q̃(ω) from q(t). The shift of Eq. (19) is built into the network as a transform layer. All the subcarrier segments are collected in a queue and fed into the network. Figure 3 illustrates the idea. For the first and last subcarriers, zero padding is applied. The output of the network is also a queue in the same order as the input. For each subcarrier input, the network outputs the decoded data corresponding to the mapping of the symbol c n ; in this work, a number from 0 to 15. The design of the CNN is shown in Fig. 4. In the figure, only three convolutional layers are shown. However, depending on the input signal, e.g. the carrier wave function and the width of the spectrum segment, the network depth can be adjusted to maximize the network accuracy.
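The extraction-and-shift step of Eq. (19) can be sketched on a sampled signal as below. The FFT grid handling, the window width and the 2λ = ω alignment of Fig. 2 are our assumptions for this illustration.

```python
import numpy as np

def shift_subcarrier(q_t, t, lam_k, z, width):
    """Eq. (19) sketch: take the spectral segment around subcarrier k, undo the
    propagation phase and centre it at subcarrier 0.  Assumes the 2*lam = w
    alignment of Fig. 2; 'width' is the number of samples kept on each side."""
    dt = t[1] - t[0]
    q_w = np.fft.fftshift(np.fft.fft(q_t))
    w = 2 * np.pi * np.fft.fftshift(np.fft.fftfreq(len(t), dt))
    centre = int(np.argmin(np.abs(w - 2 * lam_k)))   # subcarrier k near w = 2*lam_k
    t_k = 4 * lam_k * z
    lo, hi = centre - width, centre + width
    seg = np.zeros(2 * width, dtype=complex)
    valid = slice(max(lo, 0), min(hi, len(w)))       # zero-pad at the spectrum edges
    seg[valid.start - lo : valid.stop - lo] = q_w[valid]
    # baseband frequencies of the kept samples after the shift to subcarrier 0
    w_base = w[centre] - 2 * lam_k + np.arange(-width, width) * (w[1] - w[0])
    return seg * np.exp(1j * t_k * (w_base + lam_k))  # remove propagation phase

# usage: a toy pulse, decode-ready segment for the central subcarrier (lam_k = 0)
t = np.linspace(-5, 5, 256, endpoint=False)
q = np.exp(-t**2) * np.exp(1j * t)
seg = shift_subcarrier(q, t, lam_k=0.0, z=0.0, width=8)
```

In the serial design this operation would run once per subcarrier, with the resulting segments queued as the network input.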
The network used in this work contains 4 convolutional layers. The input to the network is a 32-row by 2-column array, with the columns corresponding to the real and imaginary parts of the signal. Eq. (19) is implemented in the network as a transform layer. All the convolutional layers have 64 kernels, with a size of 3 × 3 for the first layer and 3 × 1 for the rest of the layers. The network is trained using the Adam optimizer 28 with L2 regularization. 10,000 simulated pulses were generated, of which 99% were used for training and 1% for validation. The decay rates of the gradient and squared-gradient moving averages of the Adam optimizer are set to 0.9 and 0.99, respectively. The L2 regularization factor is set to 0.00002. Cross-entropy is used as the loss function in the training algorithm 29 . As shown in Fig. 5, the network converges rather quickly. The accuracy of the network, defined as the percentage of correctly predicted symbols in the total number of symbols, reaches 90% in less than a single epoch and exceeds 99% before epoch 5. The learning rate for the first 10 epochs is set to 0.002 and is then reduced to 0.0002 for the next 10 epochs. The accuracy approaches 99.9% after reducing the learning rate. The learning rate was further reduced to 0.00002 for the third 10-epoch period, but no further accuracy improvement was observed. From the loss values, we notice that the step-wise drops at epochs 10 and 20 correspond to the changes of learning rate. The loss and accuracy saturate after 21 epochs, while the training loss is slightly lower than the validation loss, which indicates the network may have reached its capability limit for further improvement with the available training data. Looking at the network's prediction accuracy for different pulse powers and individual subcarriers, as shown in Fig. 6, we notice that the accuracy remains relatively constant across the whole range of pulse powers. This is likely because the full power range is covered by the training data, so that the network is tuned to balance the loss values for different powers.
For individual subcarriers, the accuracy is lower on the two edges of the spectrum (subcarriers 2∼5 and 117∼128), whereas in the middle the accuracy is uniformly high. Interestingly, subcarrier 1 is on the very edge of the spectrum but still has high accuracy, which indicates a slight bias in the network toward the first subcarrier. The cause of this phenomenon needs further investigation. This network design optimizes the size of the parameter space. It allows hardware implementation of the network on relatively small chips (the example network occupies only 0.5 MB of memory). Although every subcarrier has to be decoded sequentially, a pipelined hardware design is suitable for this network to minimize the speed penalty.

The parallel CNN design
The serial design is aimed at a small network size, which allows the use of cheap hardware for the network implementation with a small speed penalty, making it suitable for small end users. But for big end users, such as data centres, high performance is the major concern instead of hardware cost. Hence, here we propose a parallel network design where all subcarriers can be decoded at once.
In the serial design, every subcarrier is taken out of the linear spectrum of the signal together with its neighbouring subcarriers before being sent into the network. To decode all subcarriers, the spectral data of each subcarrier are used multiple times in the subsequent decoding process, which can be saved if all the subcarriers are decoded simultaneously. However, the transformation of Eq. (19) is necessary to compensate for the phase change due to propagation. In this section, a new transformation is introduced that allows the design of a multi-output network for the simultaneous decoding of all subcarriers.
From Eq. (19) one can see that the propagation phase compensation is done through the exponential term. The shift of q̃ in ω is used to align all the subcarriers. If all the subcarriers are going to be decoded at once, the frequency shift is not necessary. Furthermore, 2λ k is the central frequency of subcarrier k. To decode all subcarriers at once, one needs to compensate for the phase change at all frequencies. Therefore, we rewrite the transformation in the following way. Firstly, we remove the frequency shift: q̃ = q̃(ω, z) e^{jt k (ω+λ k )} .
Next, we replace λ k with λ to compensate for the phase change at all frequencies:

q̃ c (ω, z) = q̃(ω, z) e^{j4λz(ω+λ)},

where 2λ = −ω (the negative sign depends on the sign convention in the linear Fourier transform) and q̃ c is q̃ after compensation. In principle, we could apply the continuous frequency compensation to the serial design as well; however, the exponential calculation is computationally expensive and is not necessary in the serial case. Now that the propagation phase is removed from the pulse, we arrive at the following network design for multi-subcarrier simultaneous decoding. Figure 7 shows the design concept of the network. In comparison to the serial design, multiple fully connected layers are added in parallel after the last activation layer (ReLU), followed by softmax layers. A custom training process is used for this new design: the cross-entropy function is applied to all softmax outputs and the averaged cross-entropy is used as the loss value for calculating the gradients for the next iteration. In the figure, only three convolutional layers are shown, but as the number of subcarriers increases, additional convolutional layers can be added if necessary, as well as more filters in each layer. The training of the parallel network requires more training data than the serial design. For the example shown here, a total of 200,000 sample pulses were generated, 90% of which were used for training and 10% for validation. The convergence of the network is similar to that of the serial design. The training and validation accuracy and loss can be found in Fig. 8. The accuracy reaches 95% within the first epoch and 99% before epoch 5. The initial learning rate for the first 10 epochs is 0.002. With this learning rate, we notice big accuracy fluctuations toward the later epochs. The fluctuations are reduced immediately after reducing the learning rate to 0.0002, and the accuracy reaches 99.9% within the following epoch. A small gap between training and validation is observed.
Further reducing the learning rate to 0.00002 results in a further reduction in training loss, but the validation loss remains the same, a slight indication of over-fitting. We speculate that a larger training data set could help overcome the over-fitting and further improve the accuracy and loss of the network.
The accuracies for different power ranges and individual subcarriers are shown in Fig. 9. The behaviour is rather similar to that of the serial network. The accuracies for different power ranges are very close, while the accuracy for the subcarriers on the edges of the spectrum is slightly lower than for the majority.
The trained network of the parallel design occupies 128 MB of memory while the serial design occupies merely 0.5 MB. It is possible to improve the performance of the network by adding more hidden layers or increasing the layer size. At the same time, it is also worth considering newer network design concepts such as ResNet 30 , InceptionNet 31 and SENet 32 to get better overall performance with a smaller network size. In the networks presented here, all the convolutional layers together contain 844N operations, including 468N multiplications and 376N additions, where N is the input size. The operations in the ReLU activation and max-pooling layers are negligible (about 2N operations). A fully connected layer has 32N operations, with 16N multiplications and 16N additions. In the serial design, N = 64, but the computation is repeated M times in sequence, giving a total of 878 × N × M operations. In the parallel design, N = 4096 and only a one-pass computation is required. There are M fully connected layers in the multi-output stage, resulting in a total of (846 + 32 × M) × N operations, which is about 2.8 times as many operations as the serial design, but the computation can be performed in parallel to save overall execution time.
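The operation counts quoted above can be checked with simple arithmetic. The sketch below assumes the 128-subcarrier example (M = 128), with N = 64 for the serial design and N = 4096 for the parallel design, as stated in the text.

```python
# per-sample operation counts quoted in the text
CONV_OPS = 844   # all convolutional layers: 468 multiplications + 376 additions
MISC_OPS = 2     # ReLU activations and max pooling, approximately
FC_OPS = 32      # one fully connected layer: 16 multiplications + 16 additions

M = 128          # number of subcarriers (the example used in this paper)
serial_ops = (CONV_OPS + MISC_OPS + FC_OPS) * 64 * M       # 878 * N * M, N = 64
parallel_ops = (CONV_OPS + MISC_OPS + M * FC_OPS) * 4096   # (846 + 32*M) * N, N = 4096
ratio = parallel_ops / serial_ops                          # about 2.8
```

The ratio works out to roughly 2.8, matching the figure quoted above, while the parallel design performs its larger count in a single pass.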

Conclusion
In this paper, we proposed two conceptual CNN designs for decoding NFDM signals: a serial design that sacrifices speed for a small network size and a parallel design that optimizes speed over size. The serial design occupies only 0.5 MB of memory, making it suitable for implementation on small computer chips for small end users, while the parallel design occupies 128 MB of memory, requiring more expensive hardware, for use in places such as data centres. Both network designs were demonstrated by training with simulated data and were able to reach more than 99.9% accuracy.
The simulated training data have zero initial phase. However, in experiments, an unknown initial phase may be present and slowly drift over time. This initial phase can be seen as a rotation of the constellation of the symbols c n . This information is contained inside each signal burst and thus can be identified. For future work, the network design shall take the initial phase into account, either through a separate network or by adding layers to the current designs, to make the networks more suitable for practical use.

Methods
The CNN models in this work are developed using Matlab's Deep Learning Toolbox (version R2020a), but the same results can be reproduced using other deep learning frameworks such as TensorFlow and PyTorch. In each network, the first convolutional layer has a kernel size of 3 × 3, a stride of 1, a padding size of 1 and no dilation. The rest of the convolutional layers are the same except that the kernel size is reduced to 3 × 1. The first max-pooling layer has a pool size of 2 × 2, a stride of 2 and no padding. For the rest of the max-pooling layers, the pool size is reduced to 2 × 1. Each fully connected layer has an output size of 16, corresponding to the 16-QAM used in this work. The Adam optimizer was used for training the networks. The decay rates of the gradient and squared-gradient moving averages are set to 0.9 and 0.99, respectively. L2 regularization is applied with a factor of 0.00002. The initial learning rate is set to 0.002 and is then reduced manually, each time by a factor of 10, during the training process. More details can be found in the corresponding sections where the network training results are discussed.
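The layer geometry above fixes the feature-map sizes through the network. The following sketch checks them for the serial network's 32 × 2 input, assuming the padding of the 3 × 1 layers is applied along the time dimension only (our assumption; the text only states a padding size of 1).

```python
def conv_out(h, w, kh, kw, ph, pw, stride=1):
    """Output size of a convolution with per-dimension padding (ph, pw)."""
    return (h + 2 * ph - kh) // stride + 1, (w + 2 * pw - kw) // stride + 1

def pool_out(h, w, kh, kw, stride=2):
    """Output size of max pooling with no padding."""
    return (h - kh) // stride + 1, (w - kw) // stride + 1

# serial-network input: 32 rows (samples) by 2 columns (real and imaginary)
h, w = 32, 2
h, w = conv_out(h, w, 3, 3, 1, 1)     # first conv, 3x3, pad 1 -> 32 x 2
h, w = pool_out(h, w, 2, 2)           # first pool, 2x2, stride 2 -> 16 x 1
for _ in range(3):                    # remaining conv (3x1) / pool (2x1) pairs
    h, w = conv_out(h, w, 3, 1, 1, 0) # pad along the sample dimension only
    h, w = pool_out(h, w, 2, 1)       # halves the sample dimension
features = h * w * 64                 # 2 x 1 map with 64 channels -> 128 features
```

The final 128 features feed the fully connected layer with its output size of 16.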
The key information for the training data generation is given in Section 3, in which the pulse propagation over a standard single-mode optical fibre is simulated by solving the following general nonlinear Schrödinger equation using a split-step Fourier method 33 :

∂q/∂z = −(α/2)q − j(β 2 /2) ∂²q/∂t² + (β 3 /6) ∂³q/∂t³ + jγ|q|²q.

The fibre loss coefficient α, dispersion β 's and nonlinearity γ are given in Section 3. In the split-step method, 2048 sampling points are used in the time window with a sampling rate of 96 GS/s. The propagation step size is set to 10 meters per step.
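A compact sketch of a symmetric split-step Fourier solver for this equation, in normalized units, is given below. The function name, step layout and sign conventions are our assumptions rather than the paper's code.

```python
import numpy as np

def ssfm(q0, dt, dz, n_steps, alpha, beta2, beta3, gamma):
    """Symmetric split-step Fourier solver for
    dq/dz = -(alpha/2) q - j(beta2/2) q_tt + (beta3/6) q_ttt + j gamma |q|^2 q.
    All quantities must be in mutually consistent units."""
    w = 2 * np.pi * np.fft.fftfreq(len(q0), dt)
    # linear operator in the frequency domain: loss + 2nd/3rd order dispersion
    L = -alpha / 2 + 1j * beta2 / 2 * w**2 - 1j * beta3 / 6 * w**3
    half = np.exp(L * dz / 2)
    q = q0.astype(complex)
    for _ in range(n_steps):
        q = np.fft.ifft(half * np.fft.fft(q))             # half linear step
        q = q * np.exp(1j * gamma * np.abs(q) ** 2 * dz)  # full nonlinear step
        q = np.fft.ifft(half * np.fft.fft(q))             # half linear step
    return q

# usage: a fundamental soliton (beta2 = -1, gamma = 1) keeps its shape
t = np.linspace(-20, 20, 512, endpoint=False)
q0 = 1 / np.cosh(t)
q1 = ssfm(q0, t[1] - t[0], 0.01, 100, alpha=0.0, beta2=-1.0, beta3=0.0, gamma=1.0)
```

With loss disabled, both sub-steps are phase-only, so the scheme conserves the burst energy exactly, which is a convenient sanity check on any implementation.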