An empirical investigation of deviations from the Beer-Lambert law in near-infrared spectroscopy A case study of lactate in aqueous solutions, serum, blood and tissue

— The linear relationship between optical absorbance and the concentration of analytes -as postulated by the Beer-Lambert law- is one of the fundamental assumptions that much of the optical spectroscopy literature is explicitly or implicitly based upon. The common use of linear regression models such as principal component regression and partial least squares exemplifies how the linearity assumption is upheld in practical applications. However, the literature also establishes that deviations from the Beer-Lambert law can be expected when a) the light source is far from monochromatic, b) the concentrations of analytes are very high and c) the medium is highly scattering. The lack of a quantitative understanding of when such nonlinearities can become predominant, along with the mainstream use of nonlinear machine learning models in different fields, have given rise to the use of methods such as random forests, support vector regression, and neural networks in spectroscopic applications. This raises the question that, given the small number of samples and the high number of variables in many spectroscopic datasets, are nonlinear effects significant enough to justify the additional model complexity? In the present study, we empirically investigate this question in relation to lactate, an important biomarker. Particularly, to analyze the effects of scattering matrices, three datasets were generated by varying the concentration of lactate in phosphate buffer solution, human serum, and sheep blood. Additionally, the fourth dataset pertained to invivo, transcutaneous spectra obtained from healthy volunteers in an exercise study. Linear and nonlinear models were fitted to each dataset and measures of model performance were compared to attest the assumption of linearity. To isolate the effects of high concentrations, the phosphate buffer solution dataset was augmented with six samples with very high concentrations of lactate between (100-600 mmol/L). Subsequently, three partly overlapping datasets were extracted with lactate concentrations varying between 0-11 mmol/L, 0-20 mmol/L and 0-600 mmol/L. Similarly, the performance of linear and nonlinear models were compared in each dataset. This analysis did not provide any evidence of substantial nonlinearities due high concentrations. However, the results suggest that nonlinearities in scattering media may be substantial, justifying the use of complex, nonlinear models.


I. INTRODUCTION
Near Infared (NIR), Mid-Infrared (mid-IR) visible and Ultraviolet (UV) optical spectroscopy provide a low-cost and noninvasive alternative to electrode-based approaches in the characterization of chemical compounds and the quantification of analytes. Such applications necessitate training predictive models on datasets with sufficient variations in the concentration of absorbing species. However, the provision of such datasets is highly time-and resource-demanding, as a result, the number of samples are often small. Moreover, the identification of discriminative optical patterns requires scanning broad ranges of the optical spectrum. Therefore, the acquired optical spectra contain absorbance values at hundreds or thousands of wavelengths. The limited sample sizes, n, along with the large number of wavelengths (variables), p, constitute the main features of most optical spectroscopy datasets; known as large p small n problems [1]. While this poses a challenge for predictive modelling, two conditions make the problem tractable: where A denotes abosorbance, and are the intensity of beam before and after passing through the absorbing layer, is the number of absorbing species in the matrix, is the molar decadic extinction coefficient for the i th absorbing species, is the concentration, and is the path length of light.
These considerations justify the choice of Principal Component Regression (PCR) and Partial Least Squares (PLS) in spectroscopic studies [2,3]. Both of these methods are linear. The former achieves dimensionality reduction by finding the axes of maximal variance in the space of independent variables, × , while the latter does so by finding the axes of maximal *This research is funded by the Engineering and Physical Sciences Research Council (EPSRC).

An empirical investigation of deviations from the Beer-Lambert law in near-infrared spectroscopy
A case study of lactate in aqueous solutions, serum, blood and tissue covariance between the independent variables and the dependent variable, ×1 . In spite of major differences in the interpretation of latent variables in PCR and PLS, they often deliver similar predictive performances [4]. Minor improvements in predictive performance might be expected from PLS, particularly when noise constitutes much of the variance in the independent variables space [5,6,7].
While PLS and PCR remain workhorses of quantitative analytics in spectroscopic studies, it is also well-understood that deviations from the linearity assumption can take place when the light source is not monochromatic, the concentration of the analytes are high, and the medium is highly scattering. [8] showed errors that arise from the Beer-Lambert law can exceed an order of magnitude compared to the exact solution of the Maxwell equations. [9] derived analytical expressions for the critical concentration and the extinction coefficient beyond which deviations from the Beer-Lambert law become significant. Firstly, the expectation of nonlinearities that are challenging to quantify apriori, and secondly, the prevalence and success of nonlinear Machine Learning (ML) models in different fields have paved the way for their application in spectroscopic studies. For instance, [10] used Artificial Neural Networks (ANN) for classification of drug strength from NIR spectra. [11] compared the application of discriminant PLS and Random Forest (RF) on classification of adulterated oil and spice samples from Fourier Transform Infrared (FT-IR) and NIR spectra. They reported that RF delivers a superior performance. [12] compared the performance of methods such as PLS, Support Vector Regression (SVR), ensemble decision trees, and ANNs on the estimation of the concentration of glucose in aqueous solutions. This comparison was conducted on an NIR dataset comprising of 47 concentrations of glucose. SVR, ANN, and a variant of ensemble decision trees obtained better performances. [13] compared the performance of PLS, SVR, and ANN on NIR spectra obtained from different petrochemical matrices. It was shown that ANN and SVR deliver comparable performances and both offer more accurate predictions than PLS; the authors concluded that SVR can provide a robust alternative to ANN. Similar investigations of linear and nonlinear regression models have been reported in the literature for the estimation of soil carbon content, sugar content of orange, active substance content of tablets, moisture, fat and protein content of meat, and finally protein content of wheat in [14,15,16]. The PLS model is often found to deliver poorer performance compared to nonlinear models.
The present study focuses on lactate. The association of lactate with one of the most fundamental processes in the body, namely cellular respiration, makes it an important biomarker akin to glucose. Therefore, not surprisingly, clinical literature underlines the diagnostic and prognostic value of lactate in relation to numerous life-threatening conditions and diseases, such as sepsis, diabetes, cancer, pulmonary and kidney diseases [17,18,19,20]. Lactate has also been referred to as an important indicator of the risk of morbidity and mortality in critically ill patients [21]. Currently, the gold standard in the measurement of lactate requires blood sampling. This limits the ability of intensivists to frequently monitor patients' lactate levels; in spite of the calls for its routine measurements in patients with sepsis [22]. These considerations have given rise to the pursuit of non-invasive and continuous alternatives to intermittent blood sampling for lactate measurement.
[23] used the mid-IR region of the optical spectrum to estimate the concentration of lactate in plasma and reported a coefficient of determination, 2 , of 0.94 in the test set and a Root Mean Square Error of Prediction (RMSEP) of 0.15 mmol/L. [24] showed the potential of NIR spectra in the estimation of lactate concentration in blood, reporting a coefficient of determination of 0.96 with cross-validation, 2 .
[25] applied a number of variable selection methods to the mid-IR spectra of lactate and showed that highly accurate estimates, 2 = 0.996, can be achieved with models that only use a small subset of wavelengths. [26] conducted a comprehensive comparison of the different regions of the optical spectrum, namely ultraviolet/visible, NIR, and mid-IR, for the measurement of lactate and highlighted the merits of mid-IR for in-vitro applications and NIR for transcutaneous applications.
In this study we adopt an empirical approach to investigate potential deviations from the Beer-Lambert law that arise from high concentrations of lactate and scattering matrices. To this end, we compare the performance of linear models, namely PCR, PLS, and linear SVR, with nonlinear models, specifically, SVR with quadratic, cubic, quartic, and Radial Basis Function (RBS) kernels. To isolate the effects of high concentrations, this comparison is performed on three partially overlapping datasets comprising of NIR spectra of lactate in PBS with concentrations in the range of 0-10 mmol/L, 0-20 mmol/L and 0-600 mmol/L. To investigate the effects of scattering matrices, the comparison is extended to incrementally more scattering matrices; from PBS to serum, whole blood, and invivo transcutaneous spectra.

II. RESULTS
The performance of seven linear and nonlinear models are compared to investigate the extent of nonlinearities caused by high concentrations and scattering matrices. The hypothesis is that if these factors introduce significant nonlinearities, nonlinear models are expected to perform better compared to linear models. The performance of models is evaluated using cross-validation, with validation sets of size three in each iteration. These validation sets are randomly selected with uniform distribution and without replacement. This is referred to as the main cross-validation loop. For SVR models, hyperparameter optimization (to find the values of C, , and the kernel scale) is performed within each fold with another five-fold cross-validation routine and Bayesian optimizer. This (hyperparameter optimization) cross-validation is nested inside the main cross-validation loop to eliminate data leakage and ensure that the predictions in each fold are representative of the external predictive performance. After the completion of the main cross-validation loop, RMSECV and 2 are calculated. Due to the stochastic nature of the process, different results (RMSECV and 2 ), may be obtained in different runs. Therefore, the main cross-validation loop is repeated 10 times; 2 values are visualized with boxplots and the lowest RMSECV among all ten runs is reported separately in Table 1.
In the interpretation of the results, it is worth noting that the PLS model is fitted directly on the spectra while the PCR and all SVR models are fitted to the PCs that present a loss of 0.01% in variance in the PBS, serum, and blood datasets, and a loss of 20% in the invivo dataset. Therefore, a major difference between the performance of PLS and PCR models can imply that a) the variance not captured by the PCs are important for predictions b) the ratio between the number of PCs and the number of observations is large, leading to poorer regression. Hence, the assessment of linear and nonlinear effects should be primarily based on the comparison of how well the PCR and the SVR with linear kernel perform relative to the nonlinear models. A fairer comparison may take the hyperparameter tuning requirements of the SVR models into account, restricting pairwise comparisons to SVR with linear kernel to SVR with nonlinear kernels.

A. Nonlinearities due to high concentration
To investigate the effect of high concentrations of lactate, first the PBS dataset is analyzed where scattering due to compounds other than lactate is minimal. To this end, three partially overlapping datasets with concentrations of lactate ranging between 0-11, 0-20, and 0-600 mmol/L are used to compare the performance of linear and nonlinear regression methods. This analysis expands on our previous preliminary work [27] in that in the present study the main (performance evaluation) cross-validation is run ten times and the results are visualized using boxplots. This helps capture the stochastic nature of the results as well as the convergence properties of the optimization in SVR models; a wider spread implies poorer convergence and vice versa.  In all datasets -specifically in the datasets with medium and high concentrations of lactate-the linear models obtain the best performances. Therefore, this analysis does not provide any evidence of the presence of significant nonlinearities due to high concentrations of lactate. Figure 3, summarizes the performance of models in increasingly more scattering matrices, namely PBS, human serum, sheep blood and in transcutaneous spectra acquired from the participants of an exercise study. While the results do not show a strong upward trend with increasing model nonlinearity, both in the serum dataset and the blood dataset interesting patterns emerge. Unlike PBS, in serum and blood, whether the best observed 2 in ten runs (top end of the top whisker) or the median 2 (the red bar) are considered, the trend is either almost flat or upward. In serum, the SVR model with cubic polynomial kernel shows a better performance than the linear SVR and PCR models when the top whisker is considered. This effect is much more pronounced in the blood dataset. The SVR model with RBF kernel obtains the best 2 , in terms of both the median and the best observed case. This is despite the fact that in small datasets, the complexity and the requirement of hyperparameter tuning for the SVR models puts them at a disadvantage compared to simpler PCR and PLS. This suggests that the nonlinearities must be substantial to compensate for and exceed the expected marginal loss of accuracy due to additional model complexity.

B. Nonlinearities due to scattering matrices
Surprisingly, the same pattern does not emerge in the invivo dataset. One possible explanation is that proxies of lactate concentration might have been detected rather than lactate itself. In the exercise study it was expected that other absorbing species, such as oxyhaemoglobin and deoxyhaemoglobin, can show highly correlated variations with lactate. The observations that RMSECV is lower in this dataset relative to serum and blood, supports this possibility. Especially because in this dataset, a portable spectrophotometer with a much lower resolution and a shorter wavelength range was used, the number of observations was smaller, and baseline differences in optical properties of the four participants are expected to be far greater. Table 1. Shows the best RMSECV obtained in ten different runs of the cross-validations routine.

III. DISCUSSION AND CONCLUSION
This study focused on the optical behavior of lactate; an important and fundamental biomarker that, if measured routinely and accurately, can shed light on numerous diseases and health problems. Previous studies have underlined the potential of optical estimation of lactate as a noninvasive, inexpensive and continuous alternative to blood sampling. However, accurate and reliable optical measurements are still not within reach. A better understanding of the interactions of light and this biomarker is necessary to recognize the merits, limitations, and practical issues of optical sensing. Since previous attempts in the optical estimation of lactate have used linear models, we investigated potential deviations from the linearity assumption postulated by the Beer-Lambert law. To this end, a series of experiments were conducted to analyze potential nonlinear effects that can arise due to high concentrations of lactate and scattering matrices. The assumption was that if nonlinear effects become substantial, nonlinear models will deliver better accuracies than linear models.
Seven linear and nonlinear models were compared in datasets with concentrations of lactate ranging between 0-11 mmol/L, 0-20 mmol/L, 0-600 mmol/L. For this investigation a minimally scattering matrix (PBS) was used to ensure potential deviations are mainly due to high concentrations. This analysis did not provide any evidence of significant nonlinearities. A similar comparison in incrementally more scattering matrices, namely human serum, sheep blood, and transcutaneous spectra, showed some merits to the use of nonlinear models. Both in serum and blood, nonlinear models obtained better performances than PCR and SVR with linear kernel.
In summary, a) the results confirm the potential of optical measurements, both invitro and invivo, albeit the latter is likely to have been indirect. Therefore, more studies, with more participants, and in different scenarios is necessary to assess the feasibility of indirect optical sensing of lactate. b) It was shown that concentrations of lacatate, even far beyond the biological range, do not present substantially nonlinear absorbance. c) Nonlinear models showed merits in direct measurement of lactate when the medium is scattering, i.e. serum and blood.
Finally, the authors would like to emphasize that the RMSECV presented in Table 1 is not representative of the best accuracies that can be achieved. Previous studies have shown that major improvements may be observed when nonlinear baseline corrections are used and redundant wavelengths are excluded. In the present study, since the primary objective was the analysis of nonlinear absorbance, the aforementioned topics were not covered.

A. The datasets
All datasets were produced after obtaining approval by the Senate Research Ethics Committee (SREC) at City, University of London (SREC 17-18 05 6ii 27.06.2018) and all methods were carried out in accordance with the relevant guidelines and regulations.

PBS
The dataset consists of 57 NIR spectra of different concentrations of lactate in a Phosphate Buffer Solution (PBS). The procedure for the preparation of the solutions and acquisition of the spectra is detailed in [27]. The dataset contains 31 samples with concentrations of lactate between 0-10 mmol/L (increments of 0.25 mmol/L), 21 samples between 10.5-20 mmol/L (increments of 0.5 mmol/L) and finally, six samples with extremely high concentrations of 100-600 mmol/L (increments of 100 mmol/L). Spectra were acquired using the Lambda 1050 dual-beam UV/Vis/NIR spectrophotometer (Perkin Elmer Corp, Massachusetts, USA), with a spectral resolution of 1 nm. The light source used in the spectrometer was a halogen tungsten lamp. Indium gallium arsenide (InGaAs) and the lead sulfide detectors (PbS) were used to detect the transmitted NIR light in the range between 800 -2600 nm. Baseline correction was performed on the spectrophotometer prior to the acquisition of a spectra, at 100% Transmission / 0% absorbance to remove background noise. Once, background correction was performed, 300 µl of each sample was transferred to in a macro quartz cuvette (Hellma GmbH & Co.KG, Jena, Germany) with a path length of 1 mm to acquire the three NIR spectra of each sample in the desired wavelength range. The three spectra were then averaged, and the resulting spectrum of each sample was considered for further analysis.

Serum
The dataset consists of 36 NIR spectra of lactate in human blood serum. Mixed pool human serum collected from healthy volunteers was purchased from TCS Biosciences Ltd., (Buckingham, UK). The base lactate of the purchased serum was 7.7 mmol/L. Thirty-five serum samples of 29 mL were then serially diluted with 1 mL of stock solutions containing varying concentration of lactate in PBS. The concentration of the serum samples was measured before the acquisition of spectra using the ABL 825 Flex (Radiometer UK Limited, Crawley, West Sussex, UK). The concentration of lactate within the samples ranged between 7.7-15 mmol/L with an average increment of 0.20 mmol/L.
Once the samples were prepared and the concentration measured, the spectra of each sample was acquired again in transmission mode using the Lambda 1050 dual-beam UV/Vis/NIR spectrophotometer (Perkin Elmer Corp, Massachusetts, USA). The acquisition procedure of serum spectra was similar to that of the PBS spectra, as detailed above.

Sheep blood
The dataset consists of 36 spectra of lactate in sheep blood acquired using the Lambda 1050 dual-beam UV/Vis/NIR spectrophotometer (Perkin Elmer Corp, Massachusetts, USA) equipped with 100 mm InGaAs Integrating Sphere. All the spectra were acquired in reflectance mode using a 300 µl sample in the range between 900 and 2500 nm. The procedure for the preparation of the solutions and acquisition of the spectra is detailed in [28]. The concentration of lactate is within the range of 4.8-13.8 mmol/L and the average increment is around 0.25 mmol/L. The spectral resolution and acquisition setup used to acquire blood spectra were similar to that of PBS.

Invivo
The dataset consists of 27 reflectance spectra obtained from four healthy participants during maximal effort cycling on a spinning bike (WattBike Pro, Wattbike Ltd, Nottingham, UK). The participants were 22 to 31 years old and gave informed written consents before commencing the experiment. Subsequently they were asked to cycle for as long as they can at a fixed peddle speed and magnetic resistance, the air resistance was increased after every minute until volitional exhaustion of the volunteer or upon reaching 90 % of predicted maximal heart rate (derived using the equation max heartrate = 220age in years). A 45 second rest was allowed after every minute of cycling. During the rest period, optical spectra were acquired from the right index of the volunteer and a drop of blood was drawn from the left index finger using a sterile lancet. The capillary blood was collected on a finger stick and the lactate level in blood was measured using the portable Lactate Pro 2 analyzer. The concentration of lactate measured was within the range of 1.1-11.7 mmol/L and the average increment was 0.41 mmol/L.
Invivo spectra in the 900 -1700 nm range were acquired using the NIRQUEST 512-1.9 NIR Spectrometer (Ocean Optics Inc., Florida, USA). A reflectance optical fiber probe (600 um fibers) was used to transmit and detect NIR light reflected from the finger of health volunteers. A small slit of 25µm was to improve optical resolution and stop the detector from saturating.

B. Preprocessing of spectra
In the PBS and serum datasets, wavelengths between 1900-1980 nm and 2450-2600 nm show high levels of noise and are hence removed. This noise is caused by the oversaturation of the lead sulfide detector in the transmittance mode due to water absorption peaks. The PBS spectra were processed using Multiplicative Scattering Correction (MSC) and a Savitzky-Golay (SG) filter with the window length of 135, second order polynomial and second order derivative. The serum spectra, were processed with MSC and SG filter with window length of 151, third order polynomial and second order derivative. The blood spectra were acquired in reflectance mode and subsequently transformed to absorbance. Therefore, no noise was observed in the aforementioned regions, however, for consistency, they were removed. The transcutaneous spectra were processed with SG filter, polynomial order of three, window length of seven and derivative order of three.

C. Dimensionality reduction
Principal Component Analysis (PCA) is applied to all datasets to reduce the dimensionality of the spectra prior to model fitting.
For the in-vitro sets the number of components is selected such that 99.99% of the variance is explained by the PCs. This led to the selection of 12, 14, 13 PCs in the PBS datasets with low, medium, and high ranges of concentrations respectively, 16 PCs in the serum dataset, and 22 PCs in the blood dataset. For the tissue dataset, given the noisy nature of the data, this criterion led to a large number of PCs and, consequently, overfitting. The explained variance of 80% was found to produce good results across all models and was therefore selected. This led to 8 PCs.
For the PLS model the number of Latent Variables (LVs) was selected as the point where the Predicted Residual Error Sum of Squares (PRESS) plateaus. This criterion led to 11, 9 and 8 LVs for the PBS datasets with low, medium, and high ranges of concentrations respectively, 10 LVs for the serum dataset, 12 LVs for the blood dataset, and 6 LVs for the invivo dataset.

D. Linear and nonlinear models
The linear models used in this study are of PCR, PLS, and SVR, In SVR the objective is to find the flattest line while the prediction error shows minimal deviation beyond . Therefore, given the training set ( , ), ∈ {1, 2, . . . , }, * is defined as * = min where ‖. ‖ is the Euclidean norm, and , * are slack variables that absorb excess errors when a solution is not possible that guarantees errors restricted to boundaries. With an -insensetive loss, is defined as, where C is the capacity control parameter that determines the tradeoff between higher loss values and higher norm ‖ ‖ 2 (less flat plane).
The incorporation of nonlinearities can be achieved by using nonlinear transformations, ( ), to map the explanatory variables into new hyperdimensinoal spaces. For instance, a quadratic polynomial augmentation of a two-dimensional feature space ( 1 , 2 ) ⊂ ℝ 2 may include all polynomial terms of degree two ( 1 , 2 , 1 . 2 , 1 2 , 2 2 ) ⊂ ℝ 5 . In high-dimensional data the explicit use of such nonlinear transformations, ( ), can be intractable. However the use of the "kernel trick" provides a computational effective way to achieve this. Specifically, solving the optimization above necessaitates the calculation of ( ) . ( ′), these computationally demanding transformations can be avoided by finding the equivalent kernel, ( , ′)= ( ). ( ′). Therefore, the four main parameters that need to be selected are the scaling of features, the kernel function, ( , ′), the loss function, , and the capacity control parameter, [29,30].