2.1. Overview of methodological approach
The experiment consisted of four steps, summarised in Figure 1. Sandy beach sediment was treated and spiked with varying concentrations of LDPE, PET and ABS MPs. The reflectance of the spiked sediment was recorded by vis-NIR spectroscopy, and predictive polynomial regression and machine learning models were developed and validated.
2.2. Sample preparation
Collection and pre-treatment of beach sediment
Sandy beach sediment was collected from Damai Beach, Sarawak (1°45’05.5”N 110°18’50.0”E). The top 5 cm layer of beach sediment was collected with a sterile metal spoon and transferred into a sterile 1 L glass beaker. The mouth of the beaker was securely covered with aluminum foil to prevent contamination from the environment during transport to the laboratory. The removal of MPs and the preparation of the beach sediment sample were adapted from He et al. (2018). The beach sediment was passed through a metal sieve with a mesh size of 1 mm to remove shells, leaves and other large organic matter. For density separation, 400 g of sieved beach sediment was transferred into a new 1 L glass beaker and 400 mL of saturated NaCl solution (8.56 molar; HiMedia, Germany) was added. The mixture was stirred for 10 minutes using a large metal spoon and left overnight, after which the suspension was carefully decanted. Density separation and decanting were repeated twice to ensure that all impurities were removed from the beach sediment. To remove excess NaCl after the density separation, the sediment was poured into a 63 µm metal sieve and rinsed with 1 L of Milli-Q water. The sediment was then transferred into a glass beaker and oven-dried at 40 °C for 6 hours to obtain the treated beach sediment sample.
Reflectance measurements of artificially contaminated beach sediment samples
A 20 g portion of the purified beach sediment was transferred onto a watch glass and spiked with virgin LDPE, ABS or PET micro pellets at sequential increments of 0.1% w/w. The microplastic pellets were obtained from Fraunhofer-Institute Karlsruhe, Germany, and were less than 5 mm in size (Jang 2020). An ASD HandHeld 2 VNIR Spectroradiometer (Malvern Panalytical, Worcestershire, United Kingdom) was used to record the reflectance in the vis-NIR wavelength range of 325 nm to 1075 nm. For each concentration, the reflectance was recorded with the contact probe at five different locations, working clockwise from the outer edge of the sample to its center. A total of 46 different concentrations were prepared and recorded per plastic type (ranging from 0.1% to 15% w/w), with five replicates recorded per concentration, resulting in a total of 241 readings per plastic (see Fig. 1).
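The spiking arithmetic can be sketched as follows. The text does not state whether the w/w percentage is expressed relative to the sediment mass alone or to the total mass of the mixture, so the hypothetical helper `mp_mass_for_target` supports both conventions as an explicit assumption.

```python
def mp_mass_for_target(sediment_g, target_pct, basis="total"):
    """Grams of microplastic needed to reach target_pct (% w/w).

    basis="total":    percentage relative to sediment + microplastic mass
    basis="sediment": percentage relative to sediment mass only
    Both conventions are assumptions; the text does not specify one.
    """
    frac = target_pct / 100.0
    if not 0.0 <= frac < 1.0:
        raise ValueError("target_pct must be in [0, 100)")
    if basis == "total":
        # frac = m_mp / (m_mp + m_sed)  =>  m_mp = frac * m_sed / (1 - frac)
        return frac * sediment_g / (1.0 - frac)
    if basis == "sediment":
        # frac = m_mp / m_sed
        return frac * sediment_g
    raise ValueError("basis must be 'total' or 'sediment'")


# Cumulative pellet mass for the first three 0.1% w/w steps on 20 g of sediment
for pct in (0.1, 0.2, 0.3):
    print(pct, round(mp_mass_for_target(20.0, pct), 4))
```

The two conventions are nearly identical at low concentrations but diverge toward the upper end of the range: at 15% w/w on 20 g of sediment they give about 3.53 g and 3.00 g of pellets respectively.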
2.3. Overview of reflectance processing approaches
The resulting reflectance measurements were analysed using both OriginPro 2021 (OriginLab, Northampton, MA, USA) and Scikit-Learn to build predictive models for the three microplastic types in beach sediment samples.
In OriginPro 2021, pre-processing was applied to the reflectance data of each microplastic, after which PCA was performed to identify the most significant wavelengths. Polynomial regression models were constructed for LDPE, PET and ABS using the wavelength of highest significance for each plastic. Model accuracy was assessed via the R-squared value (R2), root mean squared error (RMSE) and standard deviation (SD).
For machine learning, the Scikit-Learn software library (Pedregosa et al. 2011) was used to identify and select the most significant features (i.e. wavelengths) for each microplastic, using the feature importance function and the Random Forest Regressor algorithm available in the library. Feature importance indicates the individual contribution of each feature to a given model (Saarela & Jauhiainen 2021). From the regression algorithm selection pipeline, the Random Forest (RF) Regressor was used for LDPE, whereas the K-nearest neighbor (KNN) Regressor was used for PET and ABS in developing the regression models.
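A minimal sketch of this feature-ranking step is given here. It assumes that "feature importance" refers to the impurity-based `feature_importances_` attribute of Scikit-Learn's `RandomForestRegressor`. The spectra are synthetic, and the 930 nm band, noise level and sample count are arbitrary illustrative choices rather than values from this study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the measured spectra (NOT the study's data):
# a Gaussian band centred at 930 nm whose height grows with concentration.
rng = np.random.default_rng(2)
wavelengths = np.arange(325, 1076, 5)             # nm, coarse grid for speed
conc = rng.uniform(0.1, 15.0, size=120)           # % w/w
band = np.exp(-0.5 * ((wavelengths - 930) / 20.0) ** 2)
noise = rng.normal(0.0, 0.001, (conc.size, wavelengths.size))
X = 0.40 + 0.01 * np.outer(conc, band) + noise    # rows: samples, cols: bands

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X, conc)

# Impurity-based importances: one score per wavelength, summing to 1
importances = forest.feature_importances_
top_idx = np.argsort(importances)[::-1][:15]      # fifteen highest-scoring bands
top_wavelengths = wavelengths[top_idx]

for wl, score in zip(top_wavelengths, importances[top_idx]):
    print(f"{wl} nm: {score:.3f}")
```

Because neighbouring wavelengths are strongly correlated, the importance tends to be shared among adjacent bands, so the ranking is best read as pointing to a spectral region rather than a single exact wavelength.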
The regression models, R2, RMSE and SD obtained via spectral processing in OriginPro 2021 and via machine learning were then compared.
2.4. Development of predictive models
Polynomial Regression Models
Spectral files were imported into ViewSpec Pro and the five replicate readings for each concentration were averaged into a single spectrum. The summary was exported as an Excel file, imported into OriginPro 2021 (OriginLab, Northampton, MA, USA), and visually inspected using a scatter plot before further processing. A second-order polynomial Savitzky-Golay filter was applied to the spectral data of LDPE, ABS and PET to remove background noise without changing the overall shape of the spectrum (Dai et al. 2015).
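The averaging and smoothing steps were carried out in ViewSpec Pro and OriginPro; an equivalent operation can be sketched in Python with `scipy.signal.savgol_filter`. The window length of 11 points is an assumption, since only the polynomial order is stated above, and the spectrum itself is synthetic.

```python
import numpy as np
from scipy.signal import savgol_filter

# Synthetic stand-in for five replicate readings of one sample (NOT real data)
rng = np.random.default_rng(0)
wavelengths = np.arange(325, 1076)                       # nm, 1 nm steps
true_curve = 0.3 + 0.2 * np.sin(wavelengths / 150.0)     # smooth "true" spectrum
replicates = true_curve + rng.normal(0.0, 0.01, (5, wavelengths.size))

# Step 1: average the five replicates into one spectrum
mean_spectrum = replicates.mean(axis=0)

# Step 2: second-order Savitzky-Golay smoothing
# (window_length=11 is an assumed value, not taken from the study)
smoothed = savgol_filter(mean_spectrum, window_length=11, polyorder=2)

print("mean abs error before:", np.abs(mean_spectrum - true_curve).mean())
print("mean abs error after: ", np.abs(smoothed - true_curve).mean())
```

A wider window removes more noise but can flatten narrow absorption features, so the window length is worth checking visually against the unsmoothed spectrum.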
Principal Component Analysis (PCA) models were created for all three plastics. The score and loading plots were used to classify the plastics and to find the most significant wavelengths, i.e. those showing the greatest statistical difference among the samples. The selected wavelengths for PET, LDPE and ABS are listed in Table 2. Using the significant wavelengths as predictors, three polynomial regression models were constructed, one for each plastic.
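The PCA was performed in OriginPro; the idea of reading significant wavelengths from the loadings can be sketched with Scikit-Learn's `PCA`. Ranking wavelengths by the absolute loading on the first principal component is one plausible reading of the procedure and is an assumption, as are the synthetic spectra and the 930 nm band.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic spectra (NOT the study's data): one band at 930 nm scales with
# concentration, everything else is a flat baseline plus noise.
rng = np.random.default_rng(1)
wavelengths = np.arange(325, 1076)                     # nm
conc = np.linspace(0.1, 15.0, 46)                      # % w/w, illustrative grid
band = np.exp(-0.5 * ((wavelengths - 930) / 20.0) ** 2)
noise = rng.normal(0.0, 0.001, (conc.size, wavelengths.size))
spectra = 0.40 + 0.01 * np.outer(conc, band) + noise

pca = PCA(n_components=2)
scores = pca.fit_transform(spectra)     # sample coordinates (score plot)
loadings_pc1 = pca.components_[0]       # wavelength weights (loading plot)

# Wavelengths with the largest absolute PC1 loading vary most across samples
top_idx = np.argsort(np.abs(loadings_pc1))[::-1][:3]
top_wavelengths = wavelengths[top_idx]

print("explained variance ratio:", pca.explained_variance_ratio_)
print("top wavelengths (nm):", top_wavelengths)
```

The sign of a loading is arbitrary in PCA, which is why the ranking uses absolute values.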
The significant wavelengths of the plastics were analyzed using the Rank model plugin (OriginLab, Northampton, MA, USA) and regression plots were developed. Each regression model was plotted using the reflectance values at the most significant wavelength against MP concentration. R2, SD and RMSE were generated in the software (Table 2). Additionally, the residual plots of each polynomial regression model were inspected to check whether the errors were normally distributed.
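A polynomial fit of this kind, with its R2 and RMSE, can be sketched with NumPy. The polynomial degree of 2, the use of reflectance as predictor and concentration as response, and the synthetic data are all assumptions made for illustration; `fit_polynomial` is a hypothetical helper, not a function of OriginPro.

```python
import numpy as np

def fit_polynomial(reflectance, concentration, degree=2):
    """Fit concentration = p(reflectance) and return coefficients and fit metrics."""
    coeffs = np.polyfit(reflectance, concentration, deg=degree)
    predicted = np.polyval(coeffs, reflectance)
    residuals = concentration - predicted
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((concentration - concentration.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    rmse = np.sqrt(np.mean(residuals ** 2))
    return coeffs, r2, rmse, residuals

# Synthetic reflectance at a single wavelength (NOT the study's data)
rng = np.random.default_rng(4)
conc = np.linspace(0.1, 15.0, 46)                              # % w/w
refl = 0.40 + 0.004 * conc + 0.0001 * conc ** 2 + rng.normal(0.0, 0.001, conc.size)

coeffs, r2, rmse, residuals = fit_polynomial(refl, conc, degree=2)
print("coefficients (highest power first):", coeffs)
print(f"R2 = {r2:.4f}, RMSE = {rmse:.4f} % w/w")
```

The returned residuals can be plotted against the fitted values or as a histogram to judge by eye whether they look approximately normal. Note that the RMSE here divides by the number of points, whereas some packages divide by the residual degrees of freedom, so values may differ slightly between tools.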
Machine learning model
First, the feature importance function and the Random Forest Regressor algorithm from the Scikit-Learn library were used to select fifteen features (wavelengths) from the vis-NIR readings of the LDPE, PET and ABS data. The selected features and their importance scores are provided in Fig. 4a–4c. The reflectance data at the highest-scoring wavelength were split into 70% training and 30% testing data. Next, a pipeline of regression algorithms with default hyperparameter settings from the Scikit-Learn library was created; the algorithms included are listed in Table S2. The training data for each microplastic were passed through the pipeline, and the regression model with the lowest cross-validated mean squared error (MSE) was returned. This selection step returned the RF Regressor for the LDPE data and the KNN Regressor for the PET and ABS data. The training data for each MP type were then used to train a baseline model of the selected algorithm with default hyperparameter settings. For the RF Regressor (LDPE), the n_estimators, max_depth and min_samples_split hyperparameters were chosen for tuning; for the KNN Regressor (PET and ABS), the leaf_size, n_neighbors and p hyperparameters were chosen. The best hyperparameter combinations were determined using the GridSearchCV function in Scikit-Learn, and the tuned models were trained on the training dataset. The resulting models were evaluated on the testing data, and predicted versus actual values were plotted (Fig. 3a, 3c, 3e). The performance of the baseline and tuned models was compared using the mean absolute error (MAE), MSE, RMSE and R2. Learning curves were plotted to check that the models were not overfitted (Fig. S3a–S3c).
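The selection and tuning workflow can be sketched as follows. The three candidate algorithms, the grid values, the five-fold cross-validation and the synthetic single-wavelength data are assumptions for illustration; the actual candidates are those listed in Table S2, and the grids used in the study are not given here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic reflectance at one wavelength (NOT the study's data)
rng = np.random.default_rng(3)
y = np.repeat(np.linspace(0.1, 15.0, 46), 5)                 # % w/w
X = (0.40 + 0.004 * y + rng.normal(0.0, 0.002, y.size)).reshape(-1, 1)

# 70% training / 30% testing split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0)

# Model selection: lowest cross-validated MSE among default-setting candidates
candidates = {
    "LinearRegression": LinearRegression(),
    "KNeighborsRegressor": KNeighborsRegressor(),
    "RandomForestRegressor": RandomForestRegressor(random_state=0),
}
cv_mse = {}
for name, model in candidates.items():
    neg_mse = cross_val_score(model, X_train, y_train, cv=5,
                              scoring="neg_mean_squared_error")
    cv_mse[name] = -neg_mse.mean()
best_name = min(cv_mse, key=cv_mse.get)

# Hyperparameter tuning, shown for KNN (grid values are assumed)
param_grid = {"n_neighbors": [3, 5, 7, 9], "leaf_size": [10, 30, 50], "p": [1, 2]}
search = GridSearchCV(KNeighborsRegressor(), param_grid, cv=5,
                      scoring="neg_mean_squared_error")
search.fit(X_train, y_train)

def report(model):
    """Compute MAE, MSE, RMSE and R2 on the held-out testing data."""
    pred = model.predict(X_test)
    mse = mean_squared_error(y_test, pred)
    return {"MAE": mean_absolute_error(y_test, pred), "MSE": mse,
            "RMSE": float(np.sqrt(mse)), "R2": r2_score(y_test, pred)}

baseline_metrics = report(KNeighborsRegressor().fit(X_train, y_train))
tuned_metrics = report(search.best_estimator_)
print("selected by CV:", best_name)
print("best KNN params:", search.best_params_)
print("baseline:", baseline_metrics)
print("tuned:   ", tuned_metrics)
```

Keeping the testing data out of both the model selection and the grid search, as done here, means the final metrics are computed on samples that played no part in choosing the model or its hyperparameters.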