Machine learning compensation of fiber nonlinear noise

This paper extends previous work on the application of Machine Learning (ML) to distortion compensation in optical communication systems containing fibers with nonlinear coefficients (γ) that exceed current industry standards, e.g. γ > 1.4 W⁻¹km⁻¹. To quantify the improvement afforded by ML methods under different transmission conditions, the procedures are applied to a model of a typical single-frequency optical communication system with a 3200 km fiber length, double polarization, and a 16-QAM modulation format. The performance of both transmitters and receivers that incorporate Neural Networks (NNs) is then examined over a wide range of γ values. In all cases considered, the system Q-factor is improved, although the degree of enhancement depends on the signal-to-noise ratio. The ML structures investigated include Siamese neural networks (SNN) implemented at the receiver as well as two-stage architectures that employ NNs at the transmitter together with a classifier at the receiver.
Classifiers ranging from simple decision trees to boosting, random forests, extra trees, and multi-layer perceptrons (MLP) were further examined and found to provide significant enhancement for γ > 4 W⁻¹km⁻¹. The optimal performance for highly nonlinear systems was achieved by two-stage systems with random forest or extra trees classifiers. Finally, empirical equations were derived for each ML technique that relate the Q-factor enhancement to the value of γ and the number of triplet terms input into the neural network. These results could potentially be employed to relax manufacturing constraints and accordingly reduce system costs.


Introduction
High-capacity optical communication systems based on single-mode optical fiber are often limited by nonlinear distortion in the fiber transmission medium. This distortion is a phase and polarization rotation noise determined by the field intensity propagating along the optical fiber. (This work is part of a PhD thesis submitted to the University of Waterloo; the pre-print is available at https://uwspace.uwaterloo.ca/bitstream/handle/10012/17780/Melek_Marina.pdf.) Prior work has demonstrated that the mean phase variation resulting from cross-phase modulation (XPM) can be compensated by a judicious choice of the averaging window in carrier phase estimation (CPE) from the observed XPM amplitude (Lin et al. 2012). Hence CPE can significantly improve system performance when the nonlinearity accumulated during transmission is sufficiently small. However, for strong nonlinear signal distortion, the benefit of CPE diminishes because simply compensating for the average phase rotation does not address the rapidly varying nonlinear phase shifts associated with self-phase modulation (SPM) (van den Borne et al. 2007; Lin et al. 2012). Several compensation techniques have accordingly been applied at both the receiver and the transmitter (Fisher et al. 1983; Tkach 2010). The most common procedure, digital backpropagation (DBP), employs the coupled nonlinear Schrödinger equation (CNLSE) to model the propagation of the two orthogonal polarizations through the fiber (Ip and Kahn 2008; Liga et al. 2014). Typically, the split-step Fourier method (SSFM) (Agrawal 2007) is applied to the Manakov equation as an approximation to the CNLSE, namely

∂E_σ/∂z = −(α/2) E_σ − i (β₂/2) ∂²E_σ/∂t² + i (8/9) γ (|E_x|² + |E_y|²) E_σ, σ = x, y, (1)

where E_σ(t, z), σ = x, y are the two optical field components and α, γ, and β₂ represent the fiber attenuation, nonlinear, and chromatic dispersion coefficients, respectively. By reversing the sign of the chromatic dispersion and nonlinear terms in Eq. (1), the undistorted input pulse train at the transmitter can then be estimated from the received signal.
While a DBP analysis enables the nonlinear noise imparted to the signal during propagation to be largely compensated, its computational complexity is O(N_step N_FFT log₂(N_FFT)), which depends on the total number of steps along the fiber length and the number of symbols fed to each fast Fourier transform (FFT) block. In contrast, the complexity of a comparably small structured NN is O(N_inputs), where N_inputs is the number of inputs to the NN (Melek and Yevick 2020a), demonstrating that AI techniques are considerably more computationally efficient than DBP.
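The complexity comparison above can be made concrete with a minimal sketch; the step, FFT-size, and input counts below are hypothetical values chosen only to illustrate the order-of-magnitude gap, not figures from the paper.

```python
import math

def dbp_multiplications(n_steps: int, n_fft: int) -> int:
    """Per-block cost of split-step DBP: O(N_step * N_FFT * log2(N_FFT))."""
    return int(n_steps * n_fft * math.log2(n_fft))

def nn_multiplications(n_inputs: int) -> int:
    """Cost of a small structured NN: O(N_inputs)."""
    return n_inputs

# Hypothetical example: 40 SSFM steps with 1024-point FFTs versus a NN fed
# the real and imaginary parts of 2445 triplet terms.
print(dbp_multiplications(40, 1024))  # 409600
print(nn_multiplications(2 * 2445))   # 4890
```

Even with these rough counts, the linear scaling of the NN dominates the N log N per-step cost of backpropagating the full link.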
Machine Learning (ML) methods such as k-nearest neighbors (Catanese et al. 2019; Gao et al. 2013; Kamalov et al. 2018; Mata et al. 2018; Wang et al. 2016b) as well as support vector machines (SVM) (Wang et al. 2016a) and dynamic deep neural networks (DDNN) (Sidelnikov et al. 2018) have been employed to compute optimum decision boundaries for the received constellations in nonlinear optical communication systems. However, in the presence of nonlinear noise, many of these procedures required averaging the error over multiple transmission blocks before applying the AI algorithm, which limits their applicability to long-haul propagation; the DDNN-based method of (Sidelnikov et al. 2018), for example, required such averaging to avoid the effect of the error floor. Previous studies by the authors (Melek and Yevick 2020a, 2020b) integrated ML with a perturbation-based compensation procedure by employing the most significant self-phase modulation (SPM) and cross-phase modulation (XPM) noise terms as input data to the NN. In the perturbation-based technique, the second-order nonlinearity acting on a symbol at t = 0 can be approximated by a series of terms of the form (Tao et al. 2011)

ΔX₀ = P₀^(3/2) Σ_{m,n} C_{mn} (X_n X*_{m+n} X_m + Y_n Y*_{m+n} X_m), (2)

where P₀ denotes the pulse peak power at the launch point, X and Y represent the symbol sequences in the x- and y-polarization channels, respectively, and the nonlinear perturbation coefficients, C_mn, depend on the link parameters as well as the duration and shape of the signal pulse. As shown in Eq. (2), the nonlinear noise triplet terms are of the form X_n X*_{m+n} X_m, Y_n Y*_{m+n} X_m, Y_n Y*_{m+n} Y_m, and X_n X*_{m+n} Y_m. These respectively represent intra-channel self-phase modulation (ISPM) when m = n = 0, intra-channel cross-phase modulation (IXPM) when m or n = 0, and intra-channel four-wave mixing (IFWM) when m and n ≠ 0.
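The triplet sum in Eq. (2) can be sketched as follows; the coefficient dictionary `C`, the window size, and the symbol values are placeholders, since the actual C_mn depend on the link parameters and pulse shape.

```python
import numpy as np

def x_pol_perturbation(X, Y, C, center, window):
    """Sum the perturbation triplets acting on the x-polarized symbol at `center`.

    X, Y are complex symbol sequences and C maps (m, n) to the coefficient C_mn
    (placeholder values here). Implements the X_n X*_{m+n} X_m and
    Y_n Y*_{m+n} X_m terms of Eq. (2), without the P0^(3/2) power factor."""
    dX = 0j
    for m in range(-window, window + 1):
        for n in range(-window, window + 1):
            if (m, n) not in C:
                continue
            dX += C[(m, n)] * (
                X[center + n] * np.conj(X[center + m + n]) * X[center + m]
                + Y[center + n] * np.conj(Y[center + m + n]) * X[center + m]
            )
    return dX

X = np.array([1 + 0j, 2 + 0j, 1 + 0j])
Y = np.zeros(3, dtype=complex)
C = {(0, 0): 0.5}  # keep only the ISPM (m = n = 0) coefficient, value hypothetical
print(x_pol_perturbation(X, Y, C, center=1, window=1))  # (4+0j)
```

With only the (0, 0) coefficient retained, the sum reduces to the ISPM term C₀₀ |X₀|² X₀, which is the dominant contribution the threshold selection described later exploits.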
However, the resulting procedure requires a somewhat larger number of complex multiplications than the NN procedure proposed in (Melek and Yevick 2020a), as indicated by the black and grey bars in Fig. 1, respectively.
In previous work, we investigated the degree to which nonlinear noise prediction and compensation can be enhanced by inputting the dominant nonlinear noise triplet terms in Eq. (2), defined as those terms for which 20 log(|C_mn|/|C_00|) exceeds a given threshold, into both the standard and Siamese NN (SNN) at the receiver side (Melek and Yevick 2021, 2020a), as well as into NNs at the transmitter side (Melek and Yevick 2020b). This approach was also applied to two-stage AI techniques in which the fiber nonlinearity was compensated at both a transmitter stage, through a NN, and at a second receiver stage, where additionally various classifier strategies were examined (Melek and Yevick 2020b). An important feature of ML architectures is generalization, namely the ability of a ML system to generate accurate predictions under transmission conditions different from those for which it was optimally designed. Accordingly, while our previous studies investigated the ability of ML to improve the signal-to-noise ratio in systems containing fibers with typical nonlinearity coefficients, this paper extends these studies to fibers with higher nonlinear noise levels. This demonstrates that the ML procedures can generalize to systems with uncharacteristic properties and can therefore be employed to find the most appropriate compensation technique for a wide range of communication system designs. The results of this analysis could accordingly decrease the dependence of the system performance on the doping profile, which significantly affects the nonlinear coefficient in single-mode fibers (Oguama et al. 2005), and could thereby reduce manufacturing cost and enable additional optimization of the fiber properties. The predictions are further expected to be indicative of wavelength division multiplexing (WDM) system behavior, as multichannel systems are considerably more sensitive to the presence of nonlinear noise.

Fig. 1 The number of complex multiplications required by the perturbation-based compensation method (black bars) versus the proposed neural network approach in (Melek and Yevick 2020a) (grey bars)

Transmission system model
The single-frequency system models in this paper employ two simulated blocks of 2^17 amplitude-modulated symbols generated according to the parameters in Table 1 and transmitted through the link specified in Table 2. As indicated in Fig. 2, for nonlinear pre-compensation, a negative shift corresponding to the noise predicted by a NN is applied to the data symbols at the transmitter side. The encoded data is otherwise modulated and propagated through the fiber without additional compensation. After propagation, the signals are coherently detected and demodulated. The demodulated data either first passes through an AI stage or is inserted without compensation into the decoder. The AI stage applies either NN or SNN regression prediction and compensation of the added propagation noise or a classifier that assigns the received symbol to one of 16 classes. Finally, the symbols are decoded, and the Q-factor is evaluated by comparing the received and transmitted bits according to the standard formula

Q = 20 log₁₀(√2 erfc⁻¹(2·BER)), (3)

in which erfc⁻¹ is the inverse of the complementary error function and BER is the bit error rate.
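The Q-factor evaluation can be sketched in a few lines, assuming the standard Gaussian-noise relation between Q and the bit error rate stated above:

```python
import numpy as np
from scipy.special import erfcinv

def q_factor_db(ber: float) -> float:
    """Q-factor in dB from the bit error rate, Q = 20*log10(sqrt(2)*erfcinv(2*BER))."""
    return 20.0 * np.log10(np.sqrt(2.0) * erfcinv(2.0 * ber))

print(round(q_factor_db(1e-3), 1))  # ~9.8 dB, the familiar value for BER = 1e-3
```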

Methodology and numerical results
This section first considers the one-stage techniques in which a ML block, either a NN or a SNN, is used for nonlinearity mitigation at the transmitter or the receiver side, as illustrated in Fig. 2. The performance of two-stage ML techniques is then investigated. In both cases a wide range of nonlinear coefficients is considered, and empirical equations are derived to qualitatively describe the compensator performance.

Neural networks (NN)
According to our parametric study in previous work (Melek and Yevick 2020a, 2020b), the optimum NN design at the receiver side of the system, as shown in Fig. 3, comprises one hidden layer with 2 neurons employing "ReLU" activation functions. The target function associated with this structure combines the target functions, y = f(∑(w_i x_i + b_i)), of the individual neurons. The output of the NN target function is nearly identical to the result for the nonlinear distortion generated by Eq. (2). However, the nonlinearity introduced by the activation functions of the NN hidden layer, together with the data-driven stochastic optimization of the weights and biases, yields improved predictions of the added nonlinear noise compared to the perturbation-based nonlinear compensation technique of Eq. (2) (Zhang et al. 2019). Increasing the number of neurons or hidden layers increases the likelihood of overfitting, adds computational complexity, and does not noticeably improve the NN performance. The NN inputs are the symbol of interest and the co-polarized symbols in shared time slots. The triplet terms in Eq. (2) that are input into the NN are chosen such that their coefficients C_mn (Tao et al. 2011) possess values of 20 log(|C_mn|/|C_00|) that exceed a threshold value of −22 dB, which yields optimal performance with minimum computational overhead. The NN is trained on ~ 80,000 received symbols with a given polarization at 2 dB above the optimum launch power (Melek and Yevick 2020a, 2020b; da Silva et al. 2019; Zhang et al. 2019) and tested on the remaining 50,000 symbols with the identical polarization. The outputs are the real and imaginary parts of the additional nonlinear symbol distortion for the time slot of interest. To obtain the minimum mean square error (MSE) between the observed and predicted outputs, the adaptive moment estimation algorithm, "Adam", a first-order gradient-based optimization method for stochastic objective functions, is applied (Kingma and Ba 2017).
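A minimal scikit-learn sketch of this receiver NN: one 2-neuron ReLU hidden layer trained with Adam on the MSE. The features and targets below are synthetic stand-ins; the paper's actual inputs are the symbol of interest, the co-polarized symbols, and the dominant triplet terms, and its targets are the real and imaginary parts of the nonlinear distortion.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X_feat = rng.normal(size=(1000, 10))               # synthetic stand-in features
y = np.column_stack([X_feat[:, 0] * X_feat[:, 1],  # synthetic stand-in for the
                     X_feat[:, 2] ** 2])           # Re/Im distortion targets

# One hidden layer of 2 ReLU neurons, Adam optimizer, squared-error loss.
nn = MLPRegressor(hidden_layer_sizes=(2,), activation='relu', solver='adam',
                  max_iter=500, random_state=0)
nn.fit(X_feat[:800], y[:800])
pred = nn.predict(X_feat[800:])
print(pred.shape)  # (200, 2): real and imaginary parts per symbol
```

With only two hidden neurons the network is deliberately tiny, mirroring the paper's observation that larger networks overfit without improving performance.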
During the execution stage, the trained receiver NN post-compensates the received data, for both polarizations and at all launch powers, by subtracting from each received symbol, Rx_symbol, the NN output perturbation, (NN)_output, multiplied by a power scaling factor that adjusts the predicted noise to the transmitted signal power, which yields the predicted transmitted symbol, Tx_symbol. If the symbols are instead pre-compensated at the transmitter side, the same trained NN is employed to apply the corresponding negative shift before launch.

To determine the NN performance for different fiber nonlinearities, the nonlinear noise generated during transmission is quantified by the parameter γ. Since the optimum power level is γ-dependent, the Q-factor enhancement values reported below were evaluated at the optimum system power for each γ. As shown in Fig. 4, the NN performance decreases with increasing γ for γ > 2 W⁻¹km⁻¹, independently of the number of triplets input to the NN and hence of the system complexity. Therefore, a curve fit to the average of the results for different thresholds at the same γ value can be employed to describe the NN performance. This curve is termed the "characteristic curve" in the remainder of this paper. Below, the R-squared values associated with the curve fits are employed to determine the most appropriate function for each technique. These equations constitute an algebraic approximation to the corresponding compensator performance. When the fiber nonlinearity is compensated by neural networks at the receiver and transmitter sides of the channel, the associated R-squared values are found to be 96.7% and 92%, respectively, when the Q-factor enhancements are fit to an exponential of the product of a negative constant and γ.

Fig. 4 The performance of a NN (a) at the receiver side and (b) at the transmitter side versus the nonlinearity coefficient. The dotted lines are the optimal algebraic curves while the solid lines are the characteristic equation representation
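The compensation rule can be sketched as below; the subtraction and the role of the scaling factor `rho` follow the description in the text, but the exact sign and scaling conventions of the paper's equations are assumptions here.

```python
import numpy as np

def post_compensate(rx_symbols, nn_output, rho):
    """Receiver-side correction: subtract the scaled NN-predicted perturbation
    from the received symbols. rho rescales the prediction from the training
    power to the actual launch power (convention assumed from the text)."""
    return rx_symbols - rho * nn_output

def pre_compensate(data_symbols, nn_output, rho):
    """Transmitter-side version: apply the negative predicted shift before launch."""
    return data_symbols - rho * nn_output

rx = np.array([1.0 + 1.0j])
noise_pred = np.array([0.1 + 0.1j])
print(post_compensate(rx, noise_pred, rho=1.0))  # [0.9+0.9j]
```

Because the same trained network supplies `nn_output` in both cases, only the point in the chain where the shift is applied distinguishes pre- from post-compensation.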
In particular, since the curves of Fig. 4 are similar in shape, they can be parameterized in terms of γ and the magnitude of the threshold, 20 log(|C_mn|/|C_00|), according to Eq. (6) when a NN is placed at the receiver side and Eq. (7) when a NN is placed at the transmitter side. Equations (6) and (7) further indicate that the NN performance is effectively independent of whether the NN is located at the receiver or the transmitter. The curves generated from the algebraic expressions nearly coincide with the data and are nearly identical in the two cases, as evident from Fig. 4a, where the values obtained from the above formulas (solid lines) and the optimal fit (dotted lines) are compared with the data for 25 and 20 dB threshold values, respectively.

Siamese neural networks (SNN)
In (Melek and Yevick 2021), the two SNN designs shown in Fig. 5, which incorporate the nonlinear perturbation terms, were employed at the receiver side of the system to mitigate the fiber nonlinearity. The proposed SNN can be employed at either the transmitter or the receiver but, for simplicity, only post-compensation of the nonlinear distortion is examined below. This is consistent with the results of the subsection entitled Neural networks (NN), which suggest that NN compensation at the receiver yields almost the same performance as a transmitter-side compensator. In the two-branch case, the symbols of interest together with the corresponding co-polarization symbols are input into the first branch while the dominant triplet terms in Eq. (2) are processed by the second branch. An optimum SNN architecture was employed in which each of the first and second NNs possesses one hidden layer of 2 neurons with a "ReLU" activation function while the output NN layer contains 2 neurons with "linear" activation functions, according to Fig. 3. The inputs to the second branch consist of the real and imaginary parts of 2445 triplet terms. In the three-branch SNN, the symbols of interest are inserted into the first SNN branch while two different groups of triplets are input into the second and third branches. In the model of (Melek and Yevick 2021), each branch consisted of a two-neuron hidden layer with a "ReLU" activation function followed by a second two-neuron output layer with a "linear" activation function. The maximum Q-factor improvement was obtained when the real and imaginary parts of 748 triplets are input into the second branch while the remaining 1697 triplets are inserted into the third branch. The networks are trained on 80,000 single-polarization data symbols launched at 2 dB above the optimum launch power associated with each value of γ to amplify the effect of the nonlinear distortions. To suppress overfitting, the remaining 50,000 symbols were employed for validation. The trained network was subsequently also applied to the orthogonally polarized symbols and to symbols at other signal powers. In the latter case, the output of the trained SNN was simply multiplied by a scaling factor, as illustrated in Eq. (4).
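A toy numpy forward pass for the two-branch SNN described above (2-neuron ReLU branches merged into a 2-neuron linear output layer). The weights and input sizes here are random placeholders, not trained values:

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def snn_forward(sym_feats, trip_feats, p):
    """Two-branch SNN sketch: branch 1 sees the symbols of interest and
    co-polarized symbols, branch 2 the dominant triplet terms; the merged hidden
    outputs feed a linear layer predicting the Re/Im parts of the noise."""
    h1 = relu(sym_feats @ p["W1"] + p["b1"])   # branch 1: 2 ReLU neurons
    h2 = relu(trip_feats @ p["W2"] + p["b2"])  # branch 2: 2 ReLU neurons
    merged = np.concatenate([h1, h2], axis=-1)
    return merged @ p["Wout"] + p["bout"]      # linear output layer, 2 neurons

rng = np.random.default_rng(1)
p = {"W1": rng.normal(size=(6, 2)), "b1": np.zeros(2),
     "W2": rng.normal(size=(8, 2)), "b2": np.zeros(2),
     "Wout": rng.normal(size=(4, 2)), "bout": np.zeros(2)}
out = snn_forward(rng.normal(size=(5, 6)), rng.normal(size=(5, 8)), p)
print(out.shape)  # (5, 2)
```

The three-branch variant simply adds a third triplet branch before the merge, so the output layer then combines six hidden activations instead of four.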
To investigate the performance of the SNN for different signal-to-noise ratios, the proposed SNN designs were also applied to systems with different values of γ, as shown in Fig. 6. While the Q-factor enhancement decreases monotonically with γ, as in systems incorporating a standard NN at either the transmitter or the receiver, it is reduced compared to the standard NN implementation for all γ values, consistent with the results of (Melek and Yevick 2021) at γ = 1.4 W⁻¹km⁻¹. For either SNN architecture, the Q-factor improvement again follows an exponential fit in γ whose prefactor is clearly smaller than those of Eqs. (6) and (7), while the exponents are nearly identical. The R-squared values associated with these fits to the averaged curves are 93% and 96%, respectively. Note as well that the SNN performance depends on the threshold that sets the number of triplets input to the network. Therefore, as the nonlinear coefficient increases, the SNN performance is inferior to that of the standard NN while requiring the same or greater computational resources.

Two-stage AI techniques
The two-stage AI technique for enhancing the Q-factor employs the trained NN design at the transmitter side while a classifier is instead located at the receiver side, as indicated in Fig. 2. Similarly to (Melek and Yevick 2020b), the received data is categorized by 16 labels corresponding to the 16-QAM constellation points, while the classifiers are trained to predict the appropriate label for each received symbol by employing 80,000 single-polarization symbols at the optimum power. The trained classifiers then predict the remaining 50,000 symbols with identical polarization as well as 130,000 symbols with orthogonal polarization. In (Melek and Yevick 2020b), the Q-factor of the optical channel at γ = 1.4 W⁻¹km⁻¹ was found to be enhanced by this two-stage ML topology. In this paper, this result is extended to channels with higher nonlinear coefficients in order to ascertain the robustness and generalizability of the technique.

Decision tree
In (Melek and Yevick 2020b), a system with γ = 1.4 W⁻¹km⁻¹ achieved a smaller Q-factor enhancement when a decision tree was employed as a classifier at the receiver than an optimized system with a NN placed at the transmitter. In this reference, which employed the scikit-learn package, the trees did not have a specified maximum depth (the length of the longest path from the root to a leaf) since the evolution of the tree was governed by the input data. Figure 7, however, demonstrates that the performance of the decision tree is only slightly greater than that of the standard NN implementation. Accordingly, decision trees are relatively ineffective classifiers for nonlinearity mitigation. The continuous curve shown in Fig. 7 indicates that the Q-factor enhancement is described by a negative power of γ with a 90% R-squared value.
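A sketch of this receiver-side classification step on a synthetic 16-QAM constellation; the noise level and sample counts are illustrative rather than the paper's simulated link, and the tree is grown without a depth limit as in the reference.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
levels = np.array([-3.0, -1.0, 1.0, 3.0])
const = np.array([x + 1j * y for x in levels for y in levels])  # 16-QAM points

labels = rng.integers(0, 16, size=5000)            # transmitted class labels
rx = (const[labels] + rng.normal(scale=0.4, size=5000)
      + 1j * rng.normal(scale=0.4, size=5000))     # illustrative channel noise
feats = np.column_stack([rx.real, rx.imag])

# No max_depth, so the tree's growth is governed entirely by the input data.
clf = DecisionTreeClassifier(random_state=0).fit(feats[:4000], labels[:4000])
acc = clf.score(feats[4000:], labels[4000:])
print(acc > 0.8)  # True at this noise level
```

For purely Gaussian noise the learned boundaries approximate the rectangular decision grid, which is why the tree alone adds little over a standard demapper once nonlinear distortion dominates.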

AdaBoosting and GBoosting
To improve upon the decision tree results, the classifier performance can be enhanced through boosting, as demonstrated for Adaptive Boosting (AdaBoosting) in (Melek and Yevick 2020b). The AdaBoosting algorithm continuously updates the probability distribution of the input variables of an initially weak, inadequately performing classifier by multiplying the weight of each input variable by a weight-updating parameter, resulting in the next classifier iteration. The parameter value decreases when the input is correctly identified by the previous classifier. As shown in Fig. 8a, a two-stage AI technique with AdaBoosting applied to a decision tree classifier with a maximum depth of 8 at the receiver yields improved performance relative to an uncompensated system for γ < 9 W⁻¹km⁻¹. However, the probability of misclassified data increases with γ. This negatively affects the performance since the accuracy of the weight-updating parameter depends on the ratio of incorrect to correct classifications and therefore degrades rapidly as the noise increases (Shrestha and Solomatine 2006).
Strong GBoosting classifiers, which minimize the classification error by combining several weak classifiers, have proved effective in compensating high nonlinear noise levels (Son et al. 2015). As shown previously (Melek and Yevick 2020b), in the context of this paper GBoosting is optimally applied to decision trees with a depth of 3. Figure 8b demonstrates that the system performance enhancement decreases with γ, although for small γ the performance of GBoosting is less than that of AdaBoosting. For the γ values considered, the Q-factor enhancement associated with AdaBoosting can be approximated by an empirical fit with R² = 0.973 that is almost independent of the threshold level, while the corresponding GBoosting Q-factor enhancement in Fig. 8b is approximated with R² = 0.95. Thus AdaBoosting is most advantageous at small fiber nonlinearities when computational resources are limited, as the number of triplets can be considerably reduced compared to GBoosting. Two-stage GBoosting was further found to be preferable to the standard NN or SNN for all γ values in the figures.

Random forest and extra trees
The random forest ensemble method is based on the voting average over each class of a group of decision tree classifiers running in parallel, where each tree of the forest independently samples random vectors containing identically distributed random numbers. This contrasts with the extra trees method, in which each decision tree in the forest is constructed from the original training sample, which consists of a set of features (dimensions). In the extra trees procedure, each tree is then given a random sample of features at each test node, from which the decision tree chooses the best feature to classify the data according to specific mathematical criteria (Geurts et al. 2006).
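Both ensembles are available in scikit-learn; the sketch below contrasts them on toy data (the dataset and hyperparameters are illustrative, not the paper's configuration).

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

rng = np.random.default_rng(4)
X = rng.normal(size=(3000, 4))
y = (X[:, 0] + X[:, 1] ** 2 - X[:, 2] > 0).astype(int)  # toy target

# Random forest: bootstrap resampling plus random feature subsets per split.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X[:2500], y[:2500])
# Extra trees: each tree sees the full sample but split thresholds are randomized.
et = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X[:2500], y[:2500])
print(rf.score(X[2500:], y[2500:]), et.score(X[2500:], y[2500:]))
```

The extra randomization of split thresholds typically trades a little bias for lower variance, which is one plausible reason these ensembles remain robust at the large γ values examined below.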
Although the decision tree technique yields only a limited improvement in the Q-factor relative to the results of the previous section, Fig. 9 demonstrates that the tree ensembles associated with the random forest and extra trees techniques enable significant further improvements. Moreover, Fig. 9 shows that the random forest and extra trees techniques compensate nonlinear noise more effectively than the competing algorithms, especially for large fiber nonlinearities, while, as noted in (Melek and Yevick 2020b), exhibiting the shortest training times. As in the previous procedures, to model the dependence of the Q-factor enhancement on γ for the random forest and extra trees methods, an average is performed over all the results for different thresholds at the same γ value. Unlike the previous methods, however, the slope of the curves for the random forest and extra trees procedures depends on the threshold value, as captured by the corresponding characteristic equation.

Multi-layer perceptron (MLP) classifier
Employing a MLP classifier containing a single 4-neuron hidden layer with a "ReLU" activation function at the receiver, trained as indicated at the beginning of this section, yields the curves in Fig. 10. When the size of the classifier is optimized, its Q-factor improvement is effectively identical to that of a standard NN placed at the transmitter. This is also evident from the parametrization

ΔQ = 21.6 exp(−0.019 γ), (13)

which is almost the same as that associated with a transmitter-side NN, and is identical to the result in (Melek and Yevick 2020b), which predicted a 0.03 dB enhancement for γ = 1.4 W⁻¹km⁻¹.

Fig. 10 The performance for a transmitter-side NN and a receiver-side MLP as a function of the nonlinearity coefficient. The dotted lines are the best fit to the data while the solid lines are generated with Eq. (13)

Figure 11 displays the average of the results for the Q-factor associated with the AI configurations analyzed in this paper. The plots show that all techniques perform nearly identically for small γ but differ increasingly for larger values of γ. This figure further establishes that the most appropriate AI technique for the system under consideration is a two-stage structure with either a random forest or extra trees classifier at the receiver.

Fig. 11 The system Q-factors for the perturbation-based nonlinear compensation (PB NLC), chromatic dispersion compensation (CDC), a NN deployed at the receiver and transmitter sides, a SNN deployed at the receiver side, and two-stage AI with boosting, random forest, decision tree, and multi-layer perceptron classifiers

Further, comparing the system Q-factors for the different AI methods with those of the analytical perturbation-based nonlinear compensation (PB NLC) procedure, shown in Eq.
(2), for a fixed number (2445) of triplets and a −22 dB threshold value, indicates that the accuracy of the PB NLC method falls rapidly as the nonlinear coefficient increases beyond γ > 2 W⁻¹km⁻¹, such that the Q-factor improvement even becomes negative for γ > 5.8 W⁻¹km⁻¹. In effect, if the number of triplets in the PB NLC calculation is insufficient to model the nonlinear noise, the transmitted symbols are incorrectly interpreted. Increasing the number of triplets, however, increases the complexity of the model and the required computational resources. In contrast, the AI techniques, which provide a significant system Q-factor improvement for values of γ as large as 17 W⁻¹km⁻¹, can adapt to large system nonlinearities, presumably as a result of the inherent nonlinearity of the neuron activation functions. Finally, Table 3 summarizes the comparison between all of the proposed ML techniques in terms of complexity and performance at different γ values.
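The characteristic-curve fitting used throughout this section can be reproduced with scipy; the ΔQ samples below are synthesized from the MLP parametrization ΔQ = 21.6 exp(−0.019 γ) reported above, plus small noise, so they are illustrative rather than measured data.

```python
import numpy as np
from scipy.optimize import curve_fit

def characteristic(gamma, a, b):
    """Exponential characteristic curve, dQ = a * exp(-b * gamma)."""
    return a * np.exp(-b * gamma)

rng = np.random.default_rng(5)
gamma = np.linspace(1.0, 17.0, 9)  # nonlinear coefficient, W^-1 km^-1
dq = 21.6 * np.exp(-0.019 * gamma) + rng.normal(scale=0.05, size=gamma.size)

# Least-squares fit recovers the generating parameters from the noisy samples.
(a, b), _ = curve_fit(characteristic, gamma, dq, p0=(20.0, 0.02))
print(round(a, 1), round(b, 3))  # close to 21.6 and 0.019
```

The same fit-and-average workflow, applied per compensator, yields the R-squared values quoted for each technique.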

Conclusion
This paper benchmarks the performance of different ML-based nonlinear compensation techniques in single-frequency, 3200 km, double-polarization, 16-QAM optical communication systems containing fibers with atypically large nonlinearities, namely γ > 1.3–1.4 W⁻¹km⁻¹ (Shakya et al. 2016). The enhanced compensation afforded by the NN could enable deployments with relaxed fiber fabrication tolerances and therefore lower cost. Additionally, the ML compensation method could be matched to the nonlinear characteristics of the optical fiber transmission medium based at least partly on the qualitative parametric curve fits as well as the quantitative results for the improvement in the Q-factor of Fig. 11 and Table 3. These results demonstrate that a NN can be employed either at the receiver or the transmitter side over a wide range of nonlinear noise levels with identical Q-factor enhancements. Employing instead a SNN at the receiver side leads to slightly reduced performance. On the other hand, two-stage AI with classifiers such as extra trees and random forests at the receiver can significantly compensate the studied levels of nonlinear noise, while decision trees do not afford any noticeable advantage over the standard NN procedure. Additionally, AdaBoosting improves the performance for small nonlinear coefficients, even when a reduced number of triplets is input into the transmitter NN, but its effectiveness decreases rapidly with increasing nonlinearity. For each topology, an empirical algebraic equation was generated for the system Q-factor enhancement in terms of the triplet selection threshold and γ. The enhancement associated with each technique at any nonlinear noise level can therefore be rapidly estimated, which should be useful in communication system design.
Finally, while the AI designs examined in this paper will be applied to WDM systems with differing γ values and numbers of channels in a subsequent paper, the results are expected to be largely identical to those for the single-frequency channel case.