A Protolith Reconstruction Model (PRM) for Metabasalt: Quantitative Protolith and Mass Transfer Estimation Based on Machine-learning Approach

Mass transfer in rocks provides a direct record of fluid–rock interaction within 16 the Earth, including metamorphism, metasomatism, and hydrothermal alteration. 17 However, mass transfer analyses are usually limited to local reaction zones where the 18 protoliths are evident in outcrops (1–100 m in scale), from which regional mass transfer 19 can be only loosely constrained due to uncertainty in protolith compositions. In this study, 20 we developed protolith reconstruction models (PRMs) for metabasalt based on a machine 21 learning approach. We constructed PRMs through learning multi-element correlations 22 among basalt compositional datasets, including mid-ocean ridge, ocean island, and island 23 arc basalts. The PRMs were designed to estimate trace-element compositions from inputs 24 of 2–9 selected trace elements, and basalt trace-element compositions (e.g., Rb, Ba, U, K, 25 Pb, Sr, and rare earth elements) were estimated from only four inputs with a 26 reproducibility of ~0.1 log 10 units (i.e., ±25%). Using Th, Nb, Zr, and Ti, which are 27 typically immobile during metamorphism, as input elements, the PRM was verified by 28 applying it to seafloor altered basalt with known protoliths. When suitable immobile 29 elements are incorporated, a PRM can yield unbiased and accurate mass transfer analysis 30 of any metabasalt with unknown protolith.


Background
Chemical alteration of rocks, or mass transfer, provides direct evidence for fluidrock interactions within the Earth and represents various geochemical processes such as seafloor alteration, subduction zone metamorphism, geothermal fluid activity, and fault zone processes. Mass transfer analyses for subduction-related metamorphism reveal element transport via dehydration reactions in subducting slabs [1][2][3] and element cycling in subduction zones 4,5 that are chemically linked to arc basalt 6,7 . The distribution of mass transfer in a regional metamorphic belt can reveal the spatial distributions of fluid flow in the crust and mantle [8][9][10] . Mass transfer analyses in mineral-filled veins and fault zones are related to the fluid flux 11 , duration of fluid infiltration 2,12,13 , and/or degree of fault heating 14,15 . Hydrothermal alteration of Archean seafloor basalt is known to be linked to the chemistry of seawater 16,17 . Therefore, analyses of mass transfer in chemically altered rocks are essential to better understand fluid-related processes within the Earth and the evolution of surface environments.
Mass transfer analyses are generally achieved by comparing the compositions of protolith with those of metamorphosed/altered rocks. Mass transfer at the outcrop scale (<100 m) can be estimated by comparing compositions of altered rocks with those of adjacent unaltered host rocks [18][19][20] . For larger scales (i.e., comparisons of rocks from various metamorphic belts), mass transfer can be qualitatively evaluated by comparing chemical differences between metamorphosed rocks (i.e., metabasalt) and their likely protoliths 4,5 (i.e., mid-ocean ridge basalt or MORB). Such mass transfer analyses, which compare protolith and metamorphosed/altered rocks, have contributed to understanding material recycling in subduction zones 4,5 .
In many cases, information about the exact protoliths of metamorphosed/altered rocks is unavailable, except for cases where the protoliths are evident in outcrops. As the compositional variations of likely protoliths (i.e., basalt and sediment) are generally large [21][22][23][24] , it is difficult to quantitatively evaluate the amount of mass transfer for each sample. Recent analyses of regional metamorphic belts have also revealed that protoliths of metamorphic rocks differ in their depositional ages and/or tectonic settings among different units or grades of rock 5,25,2627,28 , suggesting that it is unrealistic to assume a uniform protolith composition in regional metamorphic belts or alteration zones. Therefore, to better understand quantitative mass transfer, estimation of protolith compositions from individual samples is required.
The mobility of these elements during metamorphism has been confirmed by various mass transfer analyses of mineral veins and alteration zones 1,3,36 . Compilations of mass transfer in various metamorphic conditions and environments have suggested that the mobility of HFSEs decreases roughly in the order of rare earth elements (REEs) > U > Nb > Ti > Th ~ Zr for high-pressure subduction zone environments 35 . These elements are widely considered as immobile elements and are therefore used for discriminating the tectonic settings of metabasalt. The general success of discrimination diagrams 37,38 suggests that these immobile elements retain protolith information, meaning that it should be possible to use these elements to reconstruct protolith compositions from metamorphosed/altered rocks.
Advances in data science have provided tools for extracting information from large numbers of multidimensional data. In particular, machine learning is an effective way of recognizing complex patterns in images and extracting information from multidimensional table data. The recent increase in data held in geochemical compositional databases (e.g., PetDB and GeoRock) has allowed machine learning to become established in its application to research problems in geochemistry. For example, machine learning has been successfully applied to discriminate tectonic settings of basalt from chemical componsitions 24 , and classify metamorphic protoliths from major elements 39 . However, previous applications of machine-learning to geochemical data have been limited mainly to classification problems and have not dealt with regression problems (i.e., predicting quantitative chemistry). For quantitative mass transfer/protolith analyses, a new quantitative and predictive machine-learning scheme for geochemical data is needed.
In this study, we develop protolith reconstruction models (PRMs) that estimate protolith compositions of metabasalt using machine learning. First, using a basalt wholerock compositional dataset, we develop empirical models that learn multi-elemental correlations among the dataset and then estimate trace-element compositions of basalt based on the contents of a few (two to nine) elements. The numbers and combinations of input elements are optimized to precisely predict the output contents. Results show that basalt trace-element contents (i.e., Rb, Ba, U, K, Pb, Sr, and REEs) can be estimated from only four element contents (i.e., Th, Nb, Zr, and Ti). Finally, we apply the selected fourelement PRM to seafloor altered basalt and metabasalt, and demonstrate the validity of the model and examples of mass transfer analyses for metamorphic rocks.

Model description
PRMs were developed through machine learning of a compositional dataset of basalt. The empirical models were designed to estimate contents of a particular trace element as an output from between two and nine HFSE contents as inputs ( PRMs were constructed with the machine learning algorithm of the gradient boosting decision tree (GBDT), with separate training/test data being used to evaluate the model (Fig. 1d). The GBDT is one of several decision tree algorithms that are capable of fitting complex datasets (i.e., non-linear structural data) and which perform with high accuracy 40 . The models were evaluated by the root mean squared error (RMSE) in log space between the estimated output and the measured data: where Ei is the RMSE for element i, N is the number of samples, yi,j estimated is the estimated content of element i in sample j, and yi,j test is the measured content of element i in sample Elements used as output elements were excluded from input elements. For each model (combinations of particular input elements and an output element), basalt compositional data were chosen to ensure that there were no missing values for input and output elements in the utilized dataset (typically 3000-5000 samples).

Model dependence on input elements
We firstly selected the numbers and combinations of input elements to estimate basalt composition. Machine-learning models were constructed for each combination of input and output elements (e.g., input: Th, Nb, and Zr; output: Rb). Therefore, the number of possible combinations of the input elements is 2 9 − 1 = 511. As each machine-learning model was developed for each output element independently, 5872 machine learning models were developed in total.
Examples of estimated compositions for a specific basalt sample are shown for different sets of input elements in Figure 2a, b, and c. The reproducibility of the estimation is dependent largely on input elements. For example, in the case of the input elements being Yb and Lu, the reproducibility (i.e., the difference between the actual and estimated compositions) for each element is large ( Fig. 2a; i.e., >1 in log10 units); in contrast, when the input elements are Th and Ti, or Nd, Ti, Yb and Lu, the reproducibility for each element is greatly improved and is <0.2 in log10 units ( Fig. 2b and c). Consequently, this dependence of reproducibility on input elements indicates that the numbers and combinations of elements affect the estimation of composition.
Effects of input elements were evaluated by taking averages of RMSE scores.  Fig. 1). The dependence of RMSE on input elements indicates that input elements with closer compatibility to that of the output element contain more identifying information on protolith composition. For example, the RMSE of Ce gradually improves when the input elements have closer compatibility with Ce 41 . As a whole, to improve the overall estimation, it is necessary to choose input elements that have a wide range of incompatibility when combined.
The constructed models for estimating basalt composition can be used to reconstruct the protolith composition of metamorphosed or altered basaltic rocks.
Assuming that the contents of immobile elements in metabasalt are identical to those of its protolith, these contents can be assigned as input elemental contents of PRMs (Fig.   1d). Then, the amounts of transfer of the other elements (mobile elements during metamorphism or alteration) can be obtained by comparing their observed and predicted contents. It is noted that elements that are immobile during alteration or metamorphism may vary from case to case 35 ; as such, users can choose PRMs with other input combinations by selecting the appropriate four to five immobile elements for the geochemical system of interest.

PRM reproducibility using the example of models incorporating Th, Nb, Zr, and Ti
In the following application to metabasalt, the combination of the four elements of Th, Nb, Zr, and Ti was chosen as the input of the PRM, as these elements are the most immobile elements from both natural observations and experiments [32][33][34][35] and have a wide variety of compatibility 41 . The PRM was constructed by using ~3000 basalt samples (i.e., data containing all of the input elements and an output element) and can estimate protolith compositions with an RMSE of ~0.1 (i.e., ±25%; Fig. 2d).
We applied the PRM using Th, Nb, Zr, and Ti as input elements to the test data of the compositional dataset for basalt. The estimated contents show largely linear relationships with the raw (measured) contents in log-log space, with a slope of 1.0 ( In particular, dispersions of Rb and K are apparent for low contents of elements. These results also affect the distribution of reproducibility of each element (Fig. S2). The reproducibility of Rb, U, K, Pb, and Sr differs according to tectonic setting, with the other elements showing no or only slight dependence on tectonic setting. The distributions of reproducibility for MORB have a wider range than those for OIB and IAB for Rb, and U, whereas those for IAB are slightly wider than those for MORB and/or OIB for Ce, Sr, and Nd.
One explanation for the dependence of element content on tectonic setting is the analytical detection limit. In particular, the raw data for K have identical values for samples with low contents (≤10 3 ppm), and the reproducibility of such data are large, probably because of the detection limit of K in X-ray fluorescence (XRF) analyses and/or the resolution of the original dataset (i.e., ~0.1 wt.%). An alternative explanation is seafloor alteration, for which Rb, Ba, U, K, Pb, and Sr are mobile 30,31,42 . Some samples of MORB and IAB might have already undergone mass transfer by hydrothermal alteration because parts of these were collected from the ocean seafloor, with the sample data being correspondingly affected. It is likely that some of the "fresh" basalts in the training data have been affected by the detection limit and/or seafloor alteration, contributing to enlargement of the reproducibility of models; the estimation of Rb, Ba, U, K, and Pb can be potentially changed by removing such alteration-affected data.
Examples of PRM estimation for each tectonic setting are presented in Figure 4. Super, which are characterized by enrichment in Rb, U, K, and Li.
The PRM was used to reconstruct protolith compositions from altered basalt.
The PRM-based primitive-mantle-normalized protolith compositions have smooth patterns, and elements with higher compatibility have higher contents 41 (Fig. 5a and c).
These PRM-based compositions are within the range of protolith compositions estimated from fresh glass. Protolith composition can be accurately reconstructed from seafloor basalt.
Furthermore, mass mobility (i.e., the ratio of element content in the altered sample to that in the protolith) was calculated from the estimated protolith composition and altered sample composition ( Fig. 5b and d). Compared with previous estimates of mobility 31  with empirically determined likely protolith composition, the fluid is inferred to have been strongly undersaturated in light REEs (LREEs) and LILEs 5 . We applied the PRM to sample Z139-6, which is characterized by depletion in Rb, Ba, La, Ce, Sr, and Nd.
The reconstructed primitive-mantle-normalized protolith composition shows that elements with higher compatibility have higher contents (Fig. 5e). Compared with its protolith, the eclogite is depleted in LREEs (La, Ce, and Nd) and LILEs (Rb, Ba, and Sr), with LREEs and Sr decreased by about 95%, and Rb and Ba decreased by 60% and 50%, respectively (Fig. 5f). In addition, U, and heavy REEs (HREEs) do not show mass transfer.
This chemical pattern of protolith composition and element mobility is consistent with the empirically estimated protolith composition and mass transfer 5 . These results suggest that the PRM can accurately reconstruct protolith compositions from metamorphic rock geochemistry.

Implications of PRM-based estimates of mass transfer
The mass transfer estimated using a PRM is an integral value between fresh basalt and an altered sample. In the case where an analyzed sample has undergone regional metamorphism, this value includes the mass transfer that occurred during seafloor alteration, prograde metamorphism, and/or retrograde metamorphism. By utilizing multi-elemental mass transfer data as well as petrological indexes such as reaction extent, these complex mass transfers can be assigned to each process; a comparison of PRM-based mass transfer with the degree of alteration or retrogression can reveal element transport at a particular stage of alteration or retrogression.
A PRM can reconstruct protolith compositions from samples in which mass transfer has occurred and for which the protoliths are unknown. For example, in previous studies, quantitative analyses of mass transfer during metamorphism and metasomatism have usually been limited to a scale of <10 m, where protolith homogeneity can be assumed 1,3 . Provided that the distribution of data is retained within training data (i.e., mafic rocks with either a MORB, OIB, or IAB origin), the mass transfer can be estimated by a PRM for individual samples independently, and thus their spatial variation provides important information for constraining the regional-scale (i.e., >1 km) mass transfer, even if the protolith compositions are heterogeneous. The PRMs utilized in this study allow analysis of various types of sample that have undergone mass transfer (e.g., seafloor altered basalt or contact metamorphic rock) with incorporation of appropriate immobile elements. Immobile elements used for PRM inputs can be selected from 511 combinations of 9 elements according to petrological and geochemical observations. In cases where protoliths are unknown, conventional mass transfer analyses have relied on the experience and intuition of the trained geochemist, including empirical fitting or assuming a suitably representative basalt as the protolith, such as MORB or OIB.
"Anomalies" on normalized multi-elemental variation diagrams (i.e., spidergrams) are considered to show mobile elements. In contrast, the data-driven approach of the present study is applicable to investigating heterogeneities of protolith compositions and provides a less biased and more accurate estimation of metamorphic mass transfer for independent samples. Such a data-driven method is suitable for quantitative mass transfer analysis, especially in cases where protoliths are unknown and/or when there is a need to analyze mass transfer from a compiled dataset with samples from various tectonic origins.

Conclusion
In this study, we developed protolith reconstruction models (PRMs) for metabasalt, using machine-learning with large numbers of basalt compositional data. The best PRMs can estimate trace-element contents of basalt with an error of around ±0.1 in log10 units or ±25% incorporating only four or five input element contents. Using immobile elements as input elements, a four-element PRM was used to estimate protolith compositions of metabasalt. Application to seafloor altered basalt and eclogite verified the accuracy of protolith reconstruction within reasonable uncertainty of the estimation (0.1 in log10 units or 25%). This machine-learning-based method enabled an analysis of mass transfer of metabasalt with unknown protolith and can be applied to regional metamorphic belts or alteration zones where the protolith is heterogeneous.

Method
PRMs were constructed using the machine learning algorithm of the gradient boosting decision tree, specifically, the LightGBM algorithm. To improve empirical model reproducibility, hyperparameters of LightGBM were automatically tuned through Bayesian optimization by using a partial training dataset. Partial training datasets for hyperparameter tuning were prepared by K-fold cross validation, which enabled us to use all training data for constructing the PRMs. Details of the machine-learning calibrations for PRMs are described below.

Gradient boosting decision tree (LightGBM)
Gradient boosting decision tree (GBDT) is a supervised machine-learning method from which prediction models can be constructed from multidimensional data and used to solve classification and regression problems 44 . In the field of geochemistry, this machine-learning method has been applied to extract information, discriminate classes, and predict values; for example, to discriminate and extract characteristics from a volcanic rock dataset of eight different tectonic settings 24 , classify metamorphic protolith(s) from the major element contents of a rock 39 , and complement geochemical mapping for improvement of accuracy and interpretation 45 .
Both random forests and GBDT have been proposed as explainable models with high accuracy. GBDT is an ensemble method that combines multiple decision trees to build a powerful model. In the GBDT method, decision trees are built one after another in such a way that the next decision tree corrects the errors of the previous one 40 . The development of GDBT has allowed various algorithms such as Xgboost 46 and Catboost 47 to be proposed, of which LightGBM is an algorithm with fast calculation time and high accuracy 48 . For this reason, LightGBM was used as the machine-learning algorithm and for constructing models to predict element contents.

Tuning hyperparameters
LightGBM is a decision-tree-based nonparametric model. A nonparametric model has higher degrees of freedom compared with a linear model because of the fewer assumptions needed regarding the training data. However, the flexibility of a decision tree model makes it easier to overfit the training data. To solve this overfitting problem, each model has hyperparameters to restrict the degrees of freedom. Given that appropriate values can be assigned depending on the structure and number of dimensions of datasets, the hyperparameters need to be selected.
To choose appropriate hyperparameters, we used Bayesian optimization to tune them automatically for the dataset. Bayesian optimization is a method that uses the framework of Bayesian probability to select the next parameter to be explored based on the history of previously calculated parameters 49 . In this study, Optuna was used as the optimization software 50 , with a part of the data being used as the validation for hyperparameter tuning by Bayesian optimization. The number of hyperparameter searches was set to 50. The tuned hyperparameters were as follows: num_leaves: the maximum number of leaves in one tree; max_depth: limit the depth for the tree model. This can be used to deal with overfitting; and min_data_in_leaf: the minimum number of data in one leaf.
These three parameters are specified in the official LightGBM documentation as the first to be tuned. The other parameters are set with default values.

K-fold cross-validation
Data with no missing values in the input and output elements were extracted from the basalt compositional dataset and divided into training or test data. One-fifth of the data were used as test data to evaluate the accuracy of model, and the remaining data were used as training data to construct machine-learning models. K-fold cross-validation is a way of evaluating the effects of tuning hyperparameters and to prevent a reduction in the number of available data (Fig. S3). The training data are randomly split into K distinct subsets. K − 1 subsets are assigned for training the model, and the other subset is used for evaluating the hyperparameters (i.e., validation data). By changing the subsets used for training and validation, the model is evaluated K times (i.e., K folds) 40 . The average RMSE obtained from all folds is used for hyperparameter tuning by Bayesian optimization. In this study, we constructed a 4-fold cross validation. The reproducibility of the model was evaluated by using the test data (which are independent from the training and validation data used for hyperparameter tuning).

Preprocessing of each set of compositional data and Bayesian optimization
To improve the estimation error, input variables are transformed to ratios and products, along with dimensional compression, with a search for the best data representation (i.e., feature engineering). Feature engineering is a common technique for constructing machine-learning models 40 . In this study, data were transformed as ratios and products of contents between two arbitrary elements, and scores of Principal Component Analysis (PCA) and Independent Component Analysis (ICA) were calculated from log-transformed datasets for the training data of each fold. The validation and test data of each fold were also transformed using the same procedures, and their PCA and ICA scores were calculated by projecting the validation/test data onto the PCA/ICA eigenvectors of the training data. All of the measured content data, products, ratios, and PCA/ICA scores were used as preprocessed data for training and estimation of the machine learning models.
Preprocessed training data were used to construct machine-learning models, and we applied the models to preprocessed validation data to evaluate the reproducibility by RMSE (Fig. S3). Based on the averages of the obtained RMSE, Bayesian optimization software (Optuna) performed to tune the model's hyperparameters. We repeated model construction and evaluation 50 times to find the appropriate hyperparameters for each set of compositional data. Average RMSE scores for all of the models using a particular number of input elements.
(f) Average of all of the models containing a particular element as an input.     Scatter plots of predicted contents versus raw (measured) contents with the nal PRM using Th, Nb, Zr, and Ti as input elements. The PRM was applied to test data of the basalt dataset, which covers three different tectonic settings (mid-ocean ridge basalt, ocean island basalt, and island arc basalt).