Mitigation of nonlinear phase noise in single-channel coherent 16-QAM systems employing logistic regression

We propose and analyze a classifier based on logistic regression (LR) to mitigate the impact of nonlinear phase noise (NPN) caused by Kerr-induced self-phase modulation in digital coherent systems with single-channel unrepeated links. Simulation results reveal that the proposed approach reduces the bit error ratio (BER) of a 100-km 16 quadrature amplitude modulation (16-QAM) system operating at 56 Gbps. Specifically, the BER is reduced from 6.88 × 10⁻⁴ when using maximum-likelihood detection to 4.27 × 10⁻⁴ after applying LR-based classification, representing an increase of 0.36 dB in the effective Q-factor. This performance enhancement is achieved with only 624 operations per symbol, which can be easily parallelized into 16 lines of 39 operations.


Introduction
Nonlinear phase noise (NPN) caused by Kerr-induced self-phase modulation (SPM) is the dominant nonlinear distortion mechanism in single-channel unrepeated links with coherent reception (Gordon and Mollenauer 1990; Ho and Kahn 2004). This is the case, for instance, of some coherent long-reach passive optical networks (LR-PONs) and high-capacity inter-datacenter links, where linear impairments, i.e. chromatic dispersion (CD), polarization mode dispersion, and linear phase and frequency noises, are efficiently compensated by digital signal processing (DSP) schemes (Lavery et al. 2012, 2015). The mitigation of SPM, nevertheless, still poses a significant challenge. SPM leads to a phase rotation in a differential length segment dL that depends on the power at position z and time t according to dφ_NL(z, t) = −γ P(z, t) dL, with γ being the nonlinear coefficient accounting for the influences of the nonlinear refractive index and the modal effective area (Agrawal 2000). In the absence of CD, the intersymbol interference can be neglected and, consequently, the phase rotation depends uniquely on the actual symbol power. In contrast, when CD is present, the instantaneous power depends not only on the actual symbol but also on the adjacent interfering symbols. This causes a word-dependent rotation that seems stochastic and hinders NPN compensation (Chughtai 2009). Since this nonlinear phase rotation can severely affect the system bit error ratio (BER), several machine learning techniques have been proposed to mitigate its impact. In Zibar et al. (2012) and Aldaya et al. (2020), constellations affected by NPN are clustered using expectation maximization and histogram-based clustering, respectively. These clustering algorithms are interesting because they do not require any training sequences that reduce the data throughput.
However, in highly static channels affected by impairments with a long coherence time, supervised algorithms can be employed with a small impact on the overall data transmission capacity. Several supervised approaches have been proposed to compensate the NPN. For instance, a k-nearest neighbors (KNN) algorithm has been used to mitigate the effect of several impairments, including SPM. Another supervised approach, the support vector machine (SVM), is employed in Wang et al. (2015), Li et al. (2013), and Giacoumidis et al. (2016). In spite of their satisfactory results, both KNN and SVM suffer from a high computational cost that makes their adoption in real-time implementations difficult. Several classification techniques, however, remain unexplored. One example is logistic regression (LR), which is extensively used in areas such as geomorphology (Ayalew and Yamagishi 2005) and economics (Laitinen and Laitinen 2000) but, to the best of our knowledge, has not been applied to nonlinear compensation.
In this work, we use multi-class LR to combat the NPN. The proposed solution presents a complexity as low as 624 floating point operations per symbol. In addition, its implementation is easily parallelizable, which further increases its potential for future real-time applications. The rest of the article is organized as follows: Sect. 2 describes the fundamentals of LR, including feature mapping to higher dimensional spaces and the extension to multi-class problems. In Sect. 3, the simulation setup is detailed, whereas in Sect. 4, performance and complexity results are presented. Finally, Sect. 5 concludes the paper.
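In the dispersion-free case, the accumulated SPM rotation over a span of length L reduces to φ_NL = −γPL, so each constellation ring rotates in proportion to its power. The sketch below illustrates this amplitude-dependent rotation on a square 16-QAM constellation; the parameter values (γ, span length, average power) are purely illustrative and are not the parameters of the simulated system.

```python
import numpy as np

def spm_rotate(symbols, gamma=1.3e-3, span=100e3, p_avg=6e-3):
    """Rotate each symbol by phi = -gamma * P * L (dispersion-free SPM).

    gamma: nonlinear coefficient in 1/(W*m); span in m; p_avg in W.
    Per-symbol power is scaled so that the mean power equals p_avg.
    """
    power = p_avg * np.abs(symbols) ** 2 / np.mean(np.abs(symbols) ** 2)
    return symbols * np.exp(-1j * gamma * power * span)

# Square 16-QAM constellation
levels = np.array([-3.0, -1.0, 1.0, 3.0])
const = np.array([i + 1j * q for i in levels for q in levels])

rotated = spm_rotate(const)
rotation = np.angle(rotated / const)   # amplitude-dependent phase shift
```

Inner-ring symbols accumulate a smaller rotation than outer-ring symbols, which is why, in the absence of ISI, the clusters of each amplitude ring appear rotated by different angles while keeping their shape.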

LR for classification of constellations affected by NPN
Since this is, to the best of our knowledge, the first time LR is applied to the compensation of NPN, in this section we provide a detailed explanation of this approach (So et al. 1995; Friedman et al. 2001). The fundamentals of binary LR for the one- and multi-dimensional cases are first explained. Then, we discuss the case where multiple classes are used, illustrating each LR classification process with examples. The application of LR to the NPN compensation problem is addressed at the end of this section.

One-dimensional binary logistic regression
Let us assume that we want to classify a set of symbols X using LR. We first split these symbols into a training set X_tr = {x^(i)}, composed of m symbols with associated labels Y = {y^(i)}, and a test set X_test = {x^(j)}. The classification is then performed in two steps: (1) Training stage. During this stage, basic binary LR requires X_tr to be fitted by a sigmoid function (also known as logistic function), which is given by

h_θ(x) = 1 / (1 + e^(−θx)),    (1)

where θ is the model parameter to be optimized and h_θ(⋅) is the so-called hypothesis function. To find the best fitting parameter θ, a proper cost function must be adopted.
Since the quadratic error is not convex in the case of binary LR, a different cost function needs to be considered. Typically, the cross-entropy function is employed (Buja et al. 2005):

J(θ) = −(1/m) Σ_i [y^(i) log(h_θ(x^(i))) + (1 − y^(i)) log(1 − h_θ(x^(i)))].    (2)

In addition to its compact form, one advantage of adopting cross-entropy as the cost function is the simplified parameter optimization, because it leads to a simple expression for updating θ. In this way, the value of the model parameter after the k-th iteration, θ[k], can be obtained from its previous value, θ[k − 1], considering a learning rate α according to

θ[k] = θ[k − 1] − (α/m) Σ_i (h_θ(x^(i)) − y^(i)) x^(i).    (3)

(2) Test stage. Once the LR model has been properly trained, a new symbol x^(j) can be classified by assigning the class y = "1" if h_θ(x^(j)) > 0.5 or the class y = "0" if h_θ(x^(j)) ≤ 0.5, as can be seen in Fig. 1a.
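The training and test stages above can be sketched with a small gradient-descent loop. This is a minimal numpy sketch in which a bias term is added alongside the single feature; the data, learning rate, and iteration count are illustrative choices, not values from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_binary_lr(x, y, alpha=0.5, iters=2000):
    """Fit h_theta(x) = sigmoid(theta0 + theta1 * x) by gradient descent
    on the cross-entropy cost (bias feature x0 = 1 prepended)."""
    m = len(x)
    X = np.column_stack([np.ones(m), x])
    theta = np.zeros(2)
    for _ in range(iters):
        grad = X.T @ (sigmoid(X @ theta) - y) / m   # cross-entropy gradient
        theta -= alpha * grad
    return theta

# Two 1-D clusters: class 0 around -1, class 1 around +1
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-1, 0.3, 50), rng.normal(1, 0.3, 50)])
y = np.concatenate([np.zeros(50), np.ones(50)])
theta = train_binary_lr(x, y)

# Test stage: threshold the hypothesis at 0.5
pred = (sigmoid(theta[0] + theta[1] * x) > 0.5).astype(int)
```

For well-separated clusters the learned decision boundary sits near x = 0, so almost all symbols are classified correctly.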

Two-dimensional binary logistic regression
The LR classifier can be easily generalized to a higher-dimensional input data space. In this case, θ becomes a vector composed of the weights of each feature (input variable). Figure 1b illustrates an example of the classification of a set of two-dimensional data employing LR. The resultant decision boundary is a straight line that corresponds to the line where the 2D hypothesis function equals 0.5. Although for certain problems straight boundaries are suitable, more complex problems cannot be accurately addressed employing such a simple model. In this case, the system is said to suffer from biasing. This issue can be overcome by mapping the original data to a higher dimensional space where classification can be performed more accurately. Among the different approaches to perform feature mapping, polynomial mapping can be applied. For instance, if the input features are given by {x_1, x_2}, then we can map them to a higher dimensional space according to Table 1. By performing this mapping, the model considers a higher number of features that can fit much more complex systems. However, increasing the feature space can be detrimental because the model can fit disturbances of the training set. As a result, the ability to predict the class of new incoming symbols may be compromised. This effect, denominated overfitting, can be mitigated by introducing a penalization term in the cost function through a regularization factor λ. The regularized cost function, J_λ(θ), can then be written in terms of the unregularized cost function J(θ) as

J_λ(θ) = J(θ) + (λ/(2m)) Σ_{l≥1} θ_l²,    (4)

where θ_l is the l-th component of the parameter vector. The updating expressions now acquire a different form depending on the gradient component to be updated.
In this way, the 0-th component is updated by

θ_0[k] = θ_0[k − 1] − (α/m) Σ_i (h_θ(x^(i)) − y^(i)) x_0^(i),    (5)

whereas the rest of the components are iteratively optimized according to

θ_l[k] = θ_l[k − 1](1 − αλ/m) − (α/m) Σ_i (h_θ(x^(i)) − y^(i)) x_l^(i).    (6)

Figure 1c shows the decision boundaries for an example of polynomial feature mapping for three different values of λ corresponding to: an unregularized case, λ = 0, where overfitting can be easily observed; an intermediate value of λ = 0.02 that results in an accurate classification; and an excessively high value of λ = 0.04, leading to biasing issues.
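A minimal sketch of the polynomial feature mapping and one regularized gradient step follows. The feature ordering and parameter values are illustrative; the bias component θ_0 is excluded from the penalty, matching the separate update of the 0-th component described above.

```python
import numpy as np

def poly_map(x1, x2, N=2):
    """Map 2-D inputs to all monomials x1^i * x2^j with i + j <= N.
    The constant term (i = j = 0) acts as the bias feature x0 = 1."""
    feats = [x1 ** i * x2 ** j
             for i in range(N + 1) for j in range(N + 1 - i)]
    return np.column_stack(feats)

def reg_lr_step(theta, X, y, alpha=0.3, lam=0.02):
    """One gradient-descent step on the regularized cross-entropy cost;
    theta[0] (bias) is left out of the regularization penalty."""
    m = len(y)
    err = 1.0 / (1.0 + np.exp(-(X @ theta))) - y
    grad = X.T @ err / m
    grad[1:] += lam / m * theta[1:]     # penalize all but the bias
    return theta - alpha * grad

x1 = np.array([0.0, 1.0, 2.0])
x2 = np.array([1.0, 0.5, -1.0])
X = poly_map(x1, x2)                    # shape (3, 6) for N = 2
theta = np.zeros(X.shape[1])
theta = reg_lr_step(theta, X, np.array([0.0, 1.0, 1.0]))
```

For N = 2 the mapping produces six features (1, x2, x2², x1, x1·x2, x1²); iterating `reg_lr_step` until convergence yields the curved boundaries of Fig. 1c.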

Multiclass logistic regression
LR can also be extended to multiple-class problems, which can be addressed in different ways. One alternative is to substitute the sigmoid function with a softmax function that considers multiple classes. Alternatively, the one-vs-all approach can be adopted, where a binary LR is performed for each class, considering the data of the desired class as one class and all the points belonging to the other classes as another class (Rifkin and Klautau 2004). Once the parameters for each class are found, the receiver classifies a newly received symbol by computing the values of the different hypothesis functions and assigning the label of the class with the highest hypothesis value. As shown in Fig. 1d, when employing feature mapping, this process leads to decision regions with curved boundaries that match nonlinearly distorted constellations better than straight lines.
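The one-vs-all procedure can be sketched as follows. This is a simplified numpy version using linear features on synthetic clusters; in the actual receiver each row of the feature matrix would hold the mapped polynomial features of a received symbol.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_one_vs_all(X, labels, n_classes, alpha=0.5, iters=1000, lam=0.02):
    """Train one regularized binary LR per class; row c of Theta holds
    the parameter vector of the class-c-vs-rest classifier."""
    m, n = X.shape
    Theta = np.zeros((n_classes, n))
    for c in range(n_classes):
        y = (labels == c).astype(float)      # this class vs. the rest
        theta = np.zeros(n)
        for _ in range(iters):
            grad = X.T @ (sigmoid(X @ theta) - y) / m
            grad[1:] += lam / m * theta[1:]  # bias not regularized
            theta -= alpha * grad
        Theta[c] = theta
    return Theta

def classify(X, Theta):
    """Assign each symbol to the class with the largest hypothesis value."""
    return np.argmax(sigmoid(X @ Theta.T), axis=1)

# Four well-separated Gaussian clusters as a toy stand-in for clusters
rng = np.random.default_rng(1)
centers = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]])
pts = np.vstack([c + rng.normal(0, 0.15, (40, 2)) for c in centers])
labels = np.repeat(np.arange(4), 40)
X = np.column_stack([np.ones(len(pts)), pts])    # bias + (I, Q) features
Theta = train_one_vs_all(X, labels, 4)
pred = classify(X, Theta)
```

With polynomial feature mapping in place of the plain (I, Q) features, the per-class hypotheses yield the curved decision regions of Fig. 1d.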

LR for classification of QAM constellations
LR can then be employed to classify QAM constellations in single-channel systems where SPM is the dominant nonlinear distortion mechanism. In order to gain some insight into the structure of the clusters, Fig. 2a and b show examples of constellations affected by SPM and receiver noise in the absence and in the presence of CD-induced inter-symbol interference (ISI), respectively. As can be seen, when no ISI is considered, the center of each cluster is rotated according to the amplitude of the symbol, but the cluster keeps its symmetrical two-dimensional Gaussian distribution. When ISI is present, on the other hand, in addition to an amplitude-dependent rotation, the clusters acquire a more complex distribution. This distribution depends on the dispersion of the link and on the relation between dispersion, received power, and receiver noise, making an analytical expression difficult to derive. As in many other cases, machine learning algorithms can assist in solving problems where obtaining an exact solution is not possible or is too complex. Thus, for classifying the distorted constellations, we will test polynomial feature mappings of different orders alongside regularization parameters to avoid overfitting.

Fig. 2 Constellations affected by SPM and receiver additive noise (a) in the absence of ISI and (b) considering CD-induced ISI

System model
The simulation setup employed to evaluate the proposed classification technique in a system employing an unrepeated single-channel link and 16-QAM modulation format is shown in Fig. 3. We used a Matlab/VPI Transmission Maker co-simulation, where the QAM electrical modulation and demodulation, as well as the required digital signal processing, were carried out in Matlab, and the conversion to the optical domain, fiber transmission, and signal detection were simulated using VPI Transmission Maker. On the transmitter side, a pseudo-random bit sequence (PRBS) was converted from serial to parallel in blocks of 4 bits and mapped into a 16-QAM constellation with Gray coding. The resulting QAM symbols were decomposed into their real (in-phase) and imaginary (quadrature) parts, oversampled, and filtered to emulate digital-to-analog conversion (DAC). These signals drove a dual-parallel Mach-Zehnder modulator (DP-MZM) that modulated the light coming from a continuous wave (CW) laser diode (LD_1) with an operating wavelength of 1550 nm. The modulated signal was amplified and its power was controlled by a variable optical attenuator (VOA) before being launched into an 80-km standard single mode fiber (SSMF). Afterwards, the propagated signal was attenuated by 18 dB, which may represent either a 1:64 splitting ratio or a power penalty due to other impairments. The signal was then transmitted over a second SSMF span of 20 km. At the receiver, a 90° hybrid network was used to combine the incoming signal with a second CW laser (LD_2), also operating at 1550 nm.

Fig. 3 Simulated coherent LR-PON system. S/P serial-to-parallel conversion, DAC digital-to-analog converter, LD laser diode, DP-MZM dual-parallel Mach-Zehnder modulator, EDFA erbium-doped fiber amplifier, VOA variable optical attenuator, SSMF standard single mode fiber, Att attenuator, LPF low-pass filter, ADC analog-to-digital converter, DSP digital signal processing
The outputs of the hybrid network were detected by two pairs of balanced photodetectors (PDs) that generated two differential signals that were digitally processed. The first stage of the DSP was a frequency-domain chromatic dispersion compensation. It was followed by the time synchronization stage and by a downsampling process to obtain a single sample per symbol. The rotation of the constellation caused by the combined phase noise from LD_1 and LD_2 was compensated using a blind phase search algorithm. In our simulations we considered a bit rate of 56 Gbps and a total of 81,500 simulated QAM symbols. These symbols were split into a training set of up to 25,000 symbols and a fixed test set of 56,500 symbols. Due to the non-Gaussian distribution of the constellation points, the BER was calculated through error counting and the effective Q-factor was obtained from it according to Q_eff = √2 erfc⁻¹(2·BER) (Shafik et al. 2006). The most important simulation parameters are summarized in Table 2.

Figure 4a shows the BER after detecting the signal using the widely adopted maximum likelihood (ML) method and LR with different configuration parameters. In particular, we considered combinations of second- and third-order polynomial mappings and regularization factors of 0 (unregularized), 0.02, and 0.04. As can be observed, at low power levels, where the additive noise is dominant, the BER curves corresponding to the LR configurations and maximum likelihood converge. For high launch optical power, on the other hand, LR outperforms ML, showing that logistic regression is indeed compensating the effect of SPM where it is most harmful. Comparing the different LR configurations, they present similar performance at both low and high launch optical power levels.
Nevertheless, for intermediate values, a second-order polynomial mapping with a regularization factor of λ = 0.02 results in an optimum BER of 4.27 × 10⁻⁴ at 6 mW, representing a 0.36-dB improvement in the effective Q-factor with respect to the commonly adopted ML approach. In addition, considering a forward error correction (FEC) limit of 2 × 10⁻³, the use of LR leads to a power margin enhancement of ~ 2 mW, that is, ~ 1 dB. In order to understand how LR classification reduces the BER in systems affected by SPM, in Fig. 4b–d we show the constellations for low (2 mW), intermediate (6 mW), and high (13 mW) power levels, alongside the decision regions obtained when using conventional ML detection, whereas in Fig. 4e–g, the decision regions obtained employing LR are shown. As can be seen, for 2 mW, when employing LR, the decision regions are limited by almost straight boundaries, which are very similar to those obtained using ML and, therefore, the obtained BER is similar for both approaches. For intermediate and high launch optical power levels, the boundaries found by LR are curved, being better adapted to the distorted constellation than the straight ones obtained with ML. The previous analysis considered the averaged BER, that is, the overall classification error taking all the constellation points into account simultaneously. Looking at the constellations in Fig. 4b–g, however, it can be seen that at moderate and high power levels, points with larger amplitude are more prone to errors than those with lower amplitude. This effect is quantified in Fig. 5, where the BER for each symbol is graphically presented for both maximum likelihood (a–c) and LR (d–f) at three different power levels, i.e. 2, 6, and 10 mW. These values can be interpreted as the complement of accuracy matrices, as this metric better exemplifies the effect of nonlinear compensation (Friedman et al. 2001).
Indeed, for the highest power level, it can be seen that the great majority of the errors occur in only four constellation points. This effect is common to maximum likelihood and LR, but it can be clearly observed that LR is more efficient in reducing the BER of the most affected constellation points, thus resulting in an improved overall BER.
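The BER-to-Q-factor conversion used in this section can be reproduced with a short stdlib-only script. It assumes the Gaussian-noise relation BER = ½ erfc(Q/√2), i.e. Q_eff = √2 erfc⁻¹(2·BER); since Python's math module has no inverse erfc, a simple bisection is used.

```python
import math

def erfc_inv(y, lo=0.0, hi=10.0, tol=1e-12):
    """Inverse complementary error function by bisection (0 < y < 1);
    erfc is monotonically decreasing on [lo, hi]."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if math.erfc(mid) > y:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def q_eff_db(ber):
    """Effective Q-factor in dB from a counted BER."""
    q = math.sqrt(2.0) * erfc_inv(2.0 * ber)
    return 20.0 * math.log10(q)

# Q-factor gain of LR over ML at the optimum launch power
gain_db = q_eff_db(4.27e-4) - q_eff_db(6.88e-4)
```

Evaluating `gain_db` for the two reported BER values reproduces the ~0.36-dB effective Q-factor improvement quoted above.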

Comparison with other supervised classification techniques
In order to show the potential of the proposed LR-based nonlinear classifier, in Fig. 6 we show the BER performance of different classification methods. In particular, we compare LR with KNN, SVM, and artificial neural networks (ANNs), which operate in supervised mode, and with K-means, which is an unsupervised classification method. Each algorithm was configured for optimum performance. Thus, for LR, N = 2 and λ = 0.02 were used. In the case of KNN, we adopted K = 13 neighbors, as this was the best configuration according to Paula et al. (2020). For SVM, the kernel parameter and the penalty parameter C were set to 0.1 and 20, respectively. Regarding the ANN, we employed a single hidden layer composed of s_2 = 16 neurons. It is also worth noting that for K-means we employed K-means++ as the initialization method, as this reduces the number of iterations and the probability of erroneous classification. As can be observed, at low power levels, all the classification methods converge to the BER values of ML but, as the power increases, the different machine learning algorithms outperform ML. Nevertheless, although KNN, SVM, ANN, K-means and LR can clearly mitigate NPN effects, the latter shows the best performance, as can be quantitatively observed in Table 3.
The closest performance to that of LR is achieved with SVM; however, LR presents a lower computational complexity.

Training set size and complexity analysis
Although the considered optical channel may be quite static and, therefore, the training process should rarely need to be carried out, it is important to analyze the amount of data required to accurately train the LR-based receiver, as this measures how prone the system is to eventual re-configurations. Figure 7 shows the BER of the test set as a function of the training set size for a second-order polynomial mapping and a regularization factor of λ = 0.02. The performance of the LR-based classifier depends on the training set, which comes from a stochastic process. Therefore, for training set sizes ranging from 500 to 25,000, we performed the training 100 times by randomly selecting subsets of the training set while keeping the test set fixed. In Fig. 7, dots represent the outcome of each training process and the solid line corresponds to the ensemble average for each training set size. As expected, for small training sets, not only the BER variance but also its mean are relatively high, resulting in a poor classification performance. As the training set size increases, the mean BER converges to ~ 5 × 10⁻⁴ and the standard deviation stabilizes at ~ 7 × 10⁻⁵. From this figure, we can conclude that a training set of 10,000 symbols can be used with a negligible performance penalty. Furthermore, the size of the training set can even be reduced to 5,000 at the cost of an average increase in the BER of 27.21%. Another important point to consider is the computational complexity of the algorithm, since an excessively high complexity would prevent its adoption in real-time systems. As mentioned, the system is relatively static and, therefore, we limit our study to the test stage, that is, after the training process has been carried out. If the logistic function is implemented using a look-up table, then the only operations that must be performed are those required to calculate the activation θ⋅x, which depends on the degree of the polynomial feature mapping.
In addition, it is worth noting that since the in-phase and quadrature components are independently considered, all the involved operations are real. We can split the total complexity into three terms: (i) the number of operations required to compute the values of the new features, (ii) the number of operations corresponding to multiplying each element of the feature vector by the corresponding weight, and (iii) the sum of all weighted terms. The number of operations required to calculate the values of the features is given by

C = (1/3)N³ + N² + (2/3)N, ∀N ≥ 1.    (7)

The number of product operations corresponds to the number of features, whereas the number of addition operations is the number of features minus one. Since we adopted a one-vs-all approach, these calculations have to be performed for each of the M classes, so the total number of operations per symbol is M times the per-class count. Thus, particularizing to our parameters, N = 2 and M = 16, we obtain that the required number of operations per symbol is 624. It is worth noting that, since the hypothesis function of each class can be computed independently, parallelization is straightforward. Therefore, the latency of the classification can be as low as 39 floating-point operations.

Fig. 7 BER of the test set as a function of the training set size. For each set size we repeated the training 100 times
Compared to KNN and SVM applied to the current scenario, the proposed approach presents a significantly lower complexity. In the case of KNN, the complexity depends on the particular implementation because, even though all implementations require computing the distances to all the training symbols, different approaches to find the K nearest neighbors can be adopted. If, for instance, we consider sweeping K times over the whole distance vector to find the K nearest neighbors, the total complexity is (3 + K) ⋅ N_tr operations, where N_tr is the length of the training set. Since for the system under test the best configuration for KNN is achieved for K = 13 and a training set length of 1000, the total number of operations is 16,000. Regarding the SVM complexity, the number of required operations is given by M ⋅ (3 ⋅ N_SV + N_SV − 1), where N_SV is the number of support vectors, which ultimately depends on the distribution of the symbols to classify. Therefore, the number of support vectors varies with the launch optical power. In particular, for the optimum power level, i.e. 6 mW, and a training set size of 10,000 symbols, 521 support vectors were required, resulting in a total operation count of 33,328. Consequently, for the optimum configurations of the different approaches, LR presents a complexity 25 and 53 times lower than those of KNN and SVM, respectively.
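The operation counts above can be checked with a few lines, plugging the reported configurations into the per-method complexity expressions:

```python
# Operation counts for the optimum configurations reported in the text.
M = 16          # number of classes (16-QAM)
lr_ops = 624    # LR test stage: 16 parallel branches of 39 operations

K, n_tr = 13, 1000            # KNN: neighbors and training-set length
knn_ops = (3 + K) * n_tr      # distance computation plus K sweeps

n_sv = 521                              # SVM support vectors at 6 mW
svm_ops = M * (3 * n_sv + n_sv - 1)     # per-class kernel evaluations

print(knn_ops, svm_ops)                         # 16000 33328
print(knn_ops // lr_ops, svm_ops // lr_ops)     # 25 53
```

The two ratios reproduce the factors of 25 and 53 quoted for KNN and SVM, respectively.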

Conclusions
In this work, we have presented an LR-based classifier to reduce the BER in unrepeated digital coherent systems employing 16-QAM affected by the combination of nonlinear phase noise and additive noise. Numerical simulations performed in a Matlab/VPI Transmission Maker co-simulation show that the adoption of LR with a second-order polynomial mapping and a 0.02 regularization factor improves the system performance with respect to the traditional ML detection method, reducing the BER from 6.88 × 10⁻⁴ to 4.27 × 10⁻⁴ and offering a power margin of ~ 1 dB. Simulations also reveal that a training set of 10,000 symbols is sufficient to achieve optimum performance. In addition, LR-based classification presents a low complexity (16 parallel executions of 39 floating-point operations), which makes this approach promising for future implementations.