Qmin: A machine learning-based application for mineral chemistry data processing and analysis

Mineral chemistry analysis is a valuable tool in several phases of mineralogy and mineral prospecting studies. This type of analysis can point out relevant information, such as the concentration of the chemical element of interest in the analyzed phase and, thus, the predisposition of an area for a given commodity. Because of this, a considerable amount of data has been generated, especially with the use of electron probe micro-analyzers (EPMA), either in research for academic purposes or in a typical prospecting campaign in the mineral industry. We have identified an efficiency gap when manually processing and analyzing mineral chemistry data, and thus we envisage that this research niche could benefit from the versatility brought by machine learning algorithms. In this paper, we present Qmin, an application that increases the efficiency of the mineral chemistry data processing and analysis stages through automated routines. Our code benefits from a hierarchical structure of classifiers and regressors trained with a Random Forest algorithm on a filtered training database extracted from the GEOROC (Geochemistry of Rocks of the Oceans and Continents) repository, maintained by the Max Planck Institute for Chemistry. To test the robustness of our application, we applied a blind test with more than 11,000 mineral chemistry analyses compiled for diamond prospecting within the scope of the Diamante Brasil Project of the Geological Survey of Brazil. The blind test yielded a balanced classifier accuracy of ca. 99% for the minerals known by Qmin. Therefore, we highlight the potential of machine learning techniques in assisting the processing and analysis of mineral chemistry data.


Introduction
Mineral chemistry analysis constitutes a significant part of studies involving different branches of geosciences (e.g., mineralogy, petrology, economic geology). Nowadays, the amount of chemical data produced is enormous, especially by electron probe micro-analyzers (EPMA). Processing, analyzing, and interpreting high-dimensional data, such as those from EPMA, present a tremendous challenge (e.g., Cracknell et al., 2014; Radford et al., 2018). Many datasets, especially in geosciences, represent complex and non-linear physical systems (see Bergen et al., 2019). Therefore, manipulation and interpretation of high-dimensional geoscientific data through basic graphics and spreadsheets are exhaustive and time-consuming tasks that, in turn, may be prone to unsystematic human biases (e.g., the misclassification of a mineral).
Machine learning algorithms (MLA) have emerged as powerful tools to deal with massive datasets and recurring tasks in recent years. Several works have used MLA to solve different geoscientific problems, such as geological mapping (e.g., Costa et al., 2019; Cracknell et al., 2014; Kuhn et al., 2018, 2020; Radford et al., 2018), data-driven mineral prospectivity mapping (e.g., Brandmeier et al., 2020; Carranza and Laborte, 2015, 2016; Prado et al., 2020; Rodriguez-Galiano et al., 2015; Zhang et al., 2021), and anomaly detection, among many others (see Dramsch, 2020, and references therein). Specifically, in mineralogy, MLA have been used for mineral identification and classification from rock thin section images (e.g., Borges and Aguiar, 2019; Rubo et al., 2019a) or from drill cores (e.g., Koch et al., 2019), and for the calculation of mineral formulas, e.g., for amphiboles (Li et al., 2020). However, to our knowledge, there is no application that deals, in a holistic way, with mineral classification and formula calculation for some of the most common minerals (i.e., the rock-forming minerals) using EPMA data.
In this scope, we provide Qmin - Mineral Chemistry Virtual Assistant - a machine learning-based application focused on processing mineral chemistry data from EPMA analyses. With our application, we aim to automate and statistically evaluate mineral classification and mineral formula calculation within a single integrated open-access code and, thus, simplify and speed up many post-analytical steps of studies that depend on mineral chemistry results.

The application
MLAs can be separated into two main groups: i) algorithms with a defined target (supervised learning), and ii) algorithms that cluster groups of data with similar features in a high-dimensional domain without a pre-defined target (unsupervised learning). One of the most employed MLAs in geoscience prediction problems is the Random Forest (RF; Breiman, 2001). The RF combines several independent decision trees through bootstrap aggregation to build classification or regression models.
In this sense, Qmin is a web application, built on top of Python 3 with Flask (Grinberg, 2018) and the scikit-learn library (Pedregosa et al., 2011), that gathers several nested Random Forest models (Breiman, 2001) trained to recognize new entries of EPMA analyses, to classify the mineral group, and to identify the most probable mineral, according to the comparison with a reference dataset.
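For illustration, a minimal sketch of how such a nested group-then-mineral hierarchy could be assembled with scikit-learn is shown below; the DataFrame, column names, and hyperparameters are hypothetical and do not reproduce Qmin's internal code.

```python
# Minimal sketch of a two-level (group -> mineral) Random Forest hierarchy,
# assuming a pandas DataFrame `df` with oxide wt.% columns plus the labels
# "GROUP" and "MINERAL". All names here are illustrative, not Qmin's own API.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

FEATURES = ["SiO2", "Al2O3", "CaO", "Na2O", "MgO", "K2O", "FeOt", "TiO2"]

def train_hierarchy(df: pd.DataFrame):
    # Level 1: one classifier that predicts the mineral group.
    group_clf = RandomForestClassifier(n_estimators=100, random_state=0)
    group_clf.fit(df[FEATURES], df["GROUP"])

    # Level 2: one classifier per group that predicts the mineral species.
    mineral_clfs = {}
    for group, subset in df.groupby("GROUP"):
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        clf.fit(subset[FEATURES], subset["MINERAL"])
        mineral_clfs[group] = clf
    return group_clf, mineral_clfs

def predict(sample: pd.DataFrame, group_clf, mineral_clfs):
    # First assign the group, then use that group's classifier for the species.
    group = group_clf.predict(sample[FEATURES])[0]
    mineral = mineral_clfs[group].predict(sample[FEATURES])[0]
    return group, mineral
```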
One of RF's most significant advantages is its high performance combined with hyperparameters that are easy to tune. Also, several geoscientific works have shown that the RF outperformed other MLAs, such as Support Vector Machines, Artificial Neural Networks, and Logistic Regression, among others (e.g., Costa et al., 2019; Kuhn et al., 2018; McKay and Harris, 2016; Rodriguez-Galiano et al., 2015). These characteristics make RF widely and effectively used (e.g., Carranza and Laborte, 2015; Costa et al., 2019; Ford, 2019; Hariharan et al., 2017; Harris et al., 2015). Qmin can evaluate the quality of the analysis by measuring the statistical entropy (Shannon, 1948) of each new data entry. Exploratory data analysis can be done directly in the application, with biplot and triplot graphs.
Within Qmin, we developed a tool to determine the empirical mineral formula for each analysis. In the current version of Qmin, this is applicable to some mineral groups, such as Pyroxene, Feldspar, Mica, Garnet, Olivine, and Spinel. In this tool, the mineral formula is calculated explicitly by the charge balance method (Deer et al., 2013). We also developed another tool that calculates the mineral formula of amphiboles by a probabilistic approach, with a multivariate regression based on the Random Forest algorithm. However, we emphasize that the latter is experimental.
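As an illustration of the charge balance (oxygen normalization) approach, the sketch below computes cations per formula unit for an anhydrous mineral on a fixed oxygen basis (4 oxygens, as for olivine); the oxide list, molar masses, and function name are indicative only and are not taken from Qmin's source code.

```python
# Simplified sketch of an empirical formula calculation by oxygen normalization
# (the classical charge-balance approach for anhydrous minerals), here on a
# 4-oxygen basis as for olivine. Values and oxide list are illustrative.

# (molar mass of oxide, cations per oxide, oxygens per oxide)
OXIDES = {
    "SiO2": (60.08, 1, 2),
    "FeO":  (71.84, 1, 1),
    "MgO":  (40.30, 1, 1),
    "MnO":  (70.94, 1, 1),
    "CaO":  (56.08, 1, 1),
}

def formula_cations(wt_percent: dict, oxygens_per_formula: float = 4.0) -> dict:
    """Return cations per formula unit from oxide wt.% on a fixed oxygen basis."""
    # Moles of oxygen contributed by each oxide.
    oxygen_moles = {
        ox: wt_percent.get(ox, 0.0) / mm * n_o
        for ox, (mm, _, n_o) in OXIDES.items()
    }
    scale = oxygens_per_formula / sum(oxygen_moles.values())
    # Cations per formula unit, scaled to the chosen oxygen basis.
    return {
        ox: wt_percent.get(ox, 0.0) / mm * n_cat * scale
        for ox, (mm, n_cat, _) in OXIDES.items()
    }

# Example: a forsteritic olivine should yield roughly Si ~1 and (Mg + Fe) ~2.
print(formula_cations({"SiO2": 40.8, "MgO": 49.5, "FeO": 9.0}))
```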

Data source
The original data used to train the Qmin algorithm are derived from the GEOROC (Geochemistry of Rocks of the Oceans and Continents) repository, supported by the Max Planck Institute for Chemistry (Sarbas, 2021; Schramm et al., 2006). The GEOROC initiative collects and organizes several standardized spreadsheets with geoscientific data in tables and other supplementary materials. Among the data available in the GEOROC repository are rock geochemistry, fluid inclusion data, petrographic descriptions, and mineral chemistry data. The GEOROC repository has more than one million mineral chemistry analyses from 17 different mineral groups and almost 400 different fields gathering metadata, references, and chemical analyses in different representations.

Workflow
Dealing with a collection of the nature of the GEOROC dataset is a challenge. We aimed at selecting the most consistent analyses of each mineral to train the algorithm to classify new data with as little bias as possible. To achieve this, we applied a series of actions that yielded a uniform and workable database that can train a more effective model, much less disturbed by features that could impact the results. Figure 1 summarizes the steps to create the mineral chemistry data classification model, from pre-processing to the evaluation and deployment of the system. The pre-processing gathers data from an external source (i.e., GEOROC) and proceeds with data wrangling to clean and adapt the database for the development of the application. Next, we briefly describe these actions.

Pre-processing
As the GEOROC analyses are compiled from publications with different objectives, not all original data entries were necessary for our application. We first selected 20 chemical components from the original GEOROC data that can represent all the variations among the different minerals present in the dataset (SiO2, S, Al2O3, ZrO2, F, CoO, CaO, Na2O, MgO, ZnO, Cl, K2O, FeOt, TiO2, CuO, MnO, NiO, P2O5, Cr2O3, and As).
We then performed a logical data cleaning procedure to eliminate non-descriptive chemical data (see Table 1 for a summary). Where needed, variables were leveled (i.e., concentrations of elements in ppm - parts per million - were converted to wt.% - weight percentage). For variables populated with low instance numbers, we replaced missing data using multiple regression imputation methods (Martín-Fernández et al., 2003; Schroeder et al., 2008). Finally, we applied several filters to select adequate analyses based on the range of values for the sum of the weight percentages (e.g., filtering for the 99-101 wt.% range for anhydrous minerals).
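A minimal sketch of this leveling and filtering step is shown below, assuming the compiled analyses sit in a pandas DataFrame; the column names and ppm columns are illustrative, while the 99-101 wt.% window mirrors the anhydrous-mineral filter described above.

```python
# Sketch of the leveling and filtering steps described above, assuming a pandas
# DataFrame `df` of compiled analyses with the oxide columns already present.
import pandas as pd

OXIDE_COLS = ["SiO2", "Al2O3", "CaO", "Na2O", "MgO", "K2O", "FeOt", "TiO2"]

def level_and_filter(df: pd.DataFrame, ppm_cols=("NiO_ppm",)) -> pd.DataFrame:
    df = df.copy()
    # Level units: convert any ppm columns to wt.% (1 wt.% = 10,000 ppm).
    for col in ppm_cols:
        if col in df:
            df[col.replace("_ppm", "")] = df[col] / 10_000.0

    # Keep only analyses whose oxide totals fall inside the accepted window.
    totals = df.reindex(columns=OXIDE_COLS, fill_value=0.0).fillna(0.0).sum(axis=1)
    return df[(totals >= 99.0) & (totals <= 101.0)]
```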

Mineral name reclassification
To achieve mineral nomenclature consistency and, with that, improve our algorithm's output, we removed 17 and reclassified 50 mineral name entries of the original GEOROC database (see Table 1). The main criterion adopted for this step was that the mineral name must have been approved by the IMA (International Mineralogical Association). However, when coherent, exceptions to this rule were allowed. The first exception is for minerals whose solid solution names are well established and can be readily distinguished with an EPMA analysis (e.g., the plagioclase series). The second exception is when some end-member entry of a solid solution is missing in the original GEOROC database (e.g., apatites, eastonite, polylithionite, and trilithionite) or has a high complexity (e.g., the hornblendes). In these cases, we incorporated the respective mineral as a sensu lato entry (e.g., apatite, biotite, lepidolite, and hornblende). This approach is mineralogically acceptable and a simple and straightforward way to mitigate classification inconsistencies when training the algorithm. Besides these, we also applied some complementary criteria to reclassify the GEOROC mineral nomenclature. When possible, we retrieved an IMA-approved name from the original GEOROC entry (e.g., breunnerite, an informal variety of magnesite, was renamed to magnesite, the IMA-approved name for the mineral phase). When the mineral has polymorphs (e.g., alabandite), we added all the natural polymorphs to our algorithm-related class (e.g., alabandite/browneite/rambergite). Please refer to Table 1 of the Supplementary Material for a complete list of all the alterations applied to the original GEOROC mineral nomenclature.
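A simple way to encode such a harmonization is a rename map applied to the original labels, as sketched below with a few of the reclassifications mentioned in the text; the mapping shown is only a fragment for illustration, not the full list of Supplementary Table 1.

```python
# Sketch of the nomenclature harmonization step: a rename map applied to the
# original GEOROC mineral labels. Only a few examples from the text are shown.
RENAME = {
    "breunnerite": "magnesite",                        # informal variety -> IMA name
    "alabandite": "alabandite/browneite/rambergite",   # group natural polymorphs
    "eastonite": "biotite",                            # missing end member -> sensu lato
    "polylithionite": "lepidolite",
    "trilithionite": "lepidolite",
}

def reclassify(name: str) -> str:
    # Return the harmonized label, or the original name if no rule applies.
    return RENAME.get(name.strip().lower(), name)
```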

Outlier removal
After the previous filtering and reclassification steps, some non-compliant analyses in the GEOROC dataset remained. We interpreted these samples as residual outliers and, to avoid confusion in the machine learning training stage (Smiti, 2020), we removed them from the prepared database.
The outliers fall into three categories: i) point or global outliers, i.e., bad analyses (the sum of elements diverges significantly from 100%); ii) contextual outliers, i.e., atypical samples enriched in certain chemical elements; and iii) classification errors inherited from the original GEOROC database.
We applied the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) unsupervised algorithm (Ester et al., 1996) to solve the outlier problem. The DBSCAN seeks, in the vector space, areas of high density (i.e., inliers) and low density (i.e., outliers). The main idea is that, for each sample in an inlier cluster, the neighborhood region of a specific user-defined size must contain at least a minimum number of samples. That is, the density in the neighborhood must exceed a user-defined threshold (Ester et al., 1996; Misra et al., 2020). The DBSCAN's hyperparameters are minPts, the minimum number of neighbors, and eps (ε), the maximum distance within which those neighbors must be found. Basically, the algorithm runs this analysis for each point in the database, and those that do not meet the conditions are marked as outliers (Figure 2).
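The sketch below shows how this density-based outlier flagging could be run per mineral group with scikit-learn's DBSCAN implementation; the eps and min_samples values are placeholders that, in practice, would come from the k-distance plots (Figure 2).

```python
# Sketch of density-based outlier flagging with scikit-learn's DBSCAN, run
# separately on the analyses of a single mineral group. The eps and min_samples
# values are illustrative, not the ones used in Qmin.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

def flag_outliers(X: np.ndarray, eps: float = 0.5, min_samples: int = 5) -> np.ndarray:
    """Return a boolean mask that is True for samples labelled as noise (-1)."""
    X_scaled = StandardScaler().fit_transform(X)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X_scaled)
    return labels == -1

# Usage: X is the (n_samples, n_features) oxide matrix of one mineral group.
# outlier_mask = flag_outliers(X); X_clean = X[~outlier_mask]
```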
The DBSCAN algorithm was run for each of the 17 mineral groups. The algorithm detected 3,457 outliers from 523,846 samples (i.e., around 0.67%). Although the algorithm successfully detected both global and contextual outliers, it was not efficient in removing misclassification outliers. For these cases, we applied a visual inspection and manually removed the clear cases within the code (e.g., minerals classified as anorthite but with more than 2 wt.% of Na2O).

Data balancing
Imbalanced data is a common phenomenon in the development of machine learning models, not only in geoscientific problems but in almost all artificial intelligence problems, such as medical diagnosis (e.g., Vijayvargiya et al., 2021). Imbalanced data considerably reduces a model's capacity to perform predictions, especially for the minority class, where the recognition rate decreases considerably (Japkowicz and Stephen, 2002). Therefore, resampling the data is a mandatory pre-processing step for a successful, high-performance machine learning model.
In this context, and by analyzing the number of instances of each mineral in the filtered mineral group data, we empirically defined 50 instances as an adequate and sufficient threshold for each mineral class. In this way, if a specific mineral class had more than 50 instances, we randomly undersampled the instances to balance the data for the model training. On the other hand, if a class had fewer than 50 instances, we oversampled the instances with the synthetic minority oversampling technique (SMOTE; Chawla et al., 2002). This technique creates synthetic samples at a random point between a valid instance and a random neighbor chosen from a determined number of nearest neighbors in the feature space (Figure 3).
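A possible implementation of this 50-instance balancing, assuming the imbalanced-learn package is available, is sketched below; the resampling parameters are illustrative.

```python
# Sketch of the 50-instances-per-class balancing described above. Classes above
# the threshold are randomly undersampled; classes below it are oversampled
# with SMOTE. Parameters are illustrative, not Qmin's exact settings.
from collections import Counter
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE

def balance(X, y, target: int = 50):
    counts = Counter(y)

    # Step 1: undersample every class that exceeds the target.
    under = {c: target for c, n in counts.items() if n > target}
    if under:
        X, y = RandomUnderSampler(sampling_strategy=under, random_state=0).fit_resample(X, y)

    # Step 2: oversample every class below the target with synthetic samples.
    counts = Counter(y)
    over = {c: target for c, n in counts.items() if n < target}
    if over:
        # k_neighbors must be smaller than the smallest class being oversampled
        # (a class with a single analysis would need duplication instead).
        k = max(1, min(n for c, n in counts.items() if c in over) - 1)
        X, y = SMOTE(sampling_strategy=over, k_neighbors=k, random_state=0).fit_resample(X, y)
    return X, y
```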

Model implementation
The creation of the model for mineral chemistry classification is divided into two steps (see Figure 1). The first step classifies the minerals into 17 different groups (see Table 1). The second step classifies the mineral species within each group. Currently, Qmin can correctly classify 103 different minerals. The classifiers were trained with Random Forest models, a non-parametric technique that grows decision trees to select the best output for the classification task (Breiman, 2001).

Model tuning
To tune the models, we applied a grid search to choose the adequate hyperparameters of each model.
During the Random Forest training, we searched for the number of trees between 10 and 150, for the decision function between Gini and entropy, and for the maximum-features criterion between the square root and the log2 of the number of features.
To assess the best model, we used ten-fold cross-validation. Once the best parameters were found in the grid search, we selected them and proceeded to find the parameters for the next model.
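A sketch of this tuning step with scikit-learn's GridSearchCV is shown below; the grid values follow the search ranges described above, while the grid granularity and the scoring metric are assumptions.

```python
# Sketch of the grid search described above: number of trees between 10 and
# 150, Gini vs. entropy split criterion, and sqrt vs. log2 for max_features,
# scored with ten-fold cross-validation. Grid granularity is illustrative.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [10, 50, 100, 150],
    "criterion": ["gini", "entropy"],
    "max_features": ["sqrt", "log2"],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=10,                        # ten-fold cross-validation
    scoring="balanced_accuracy",  # assumed metric, chosen for imbalanced classes
)
# search.fit(X_train, y_train); best_model = search.best_estimator_
```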

Final model validation
We performed two types of quantitative validation in the model implementation: cross-validation and "train-test split" validation.
The cross-validation approach involves taking random samples of data from a population without prior data splitting. This technique is usually used to evaluate the quality of the whole dataset without the need to split it into "training" and "testing" sets. The essence of cross-validation lies in computing sample statistics on a first set of sample data and then applying them to a second set (Schumacker and Tomek, 2013). This type of validation involves randomly splitting a sample into two halves and then computing the statistics. For the Qmin implementation, we ran the cross-validation at the preliminary and final model evaluations, during the pre-processing and model implementation stages (see Figure 1).
To build and evaluate the several classifier models, we applied an intermediate "train-test split" validation. This validation is helpful to verify the accuracy of predictions on a randomly taken subsample that the model has not seen previously. For that, we divided the data into two subsamples: one for training, with 70% of the samples, and the other, with 30% of the samples, for testing in cross-validation assessments. The final model was implemented based on the training set. A first accuracy was assessed based on the predictions made by the model on the test dataset. Then, the reference values were confronted with the predicted values. Finally, the accuracy and other parameters were calculated (Figure 4).
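The sketch below illustrates such a 70/30 hold-out evaluation with scikit-learn; the stratification and the reported metrics are illustrative choices, not necessarily those used in Qmin.

```python
# Sketch of the 70/30 "train-test split" validation: fit on the training set,
# then confront reference and predicted labels on the held-out 30%.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score, classification_report

def holdout_validation(X, y):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.30, stratify=y, random_state=0
    )
    model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Compare reference and predicted labels on the unseen test subsample.
    print("Balanced accuracy:", balanced_accuracy_score(y_test, y_pred))
    print(classification_report(y_test, y_pred))
    return model
```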

Data processing protocol: new data input and output
The new data to be processed and analyzed must be in the form of a flat file (spreadsheet), with the EPMA analyses organized by rows and the chemical elements (features) organized by columns. We designed a web interface to facilitate the input of the data, accepting both CSV and Excel spreadsheet formats.
The program receives the data from EPMA analyses and automatically detects the columns and whether they are expressed in element or oxide weight percentage. If needed, a conversion is done by weight distribution. All these steps simplify the data input by the user, reducing the manipulation of the dataset before it enters the virtual assistant.
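For reference, the element-to-oxide conversion by weight amounts to multiplying the element concentration by the gravimetric factor (oxide molar mass over element molar mass), as sketched below; only a few factors are listed and the function name is hypothetical.

```python
# Sketch of the element-to-oxide conversion by weight: an element wt.% value is
# multiplied by the ratio of the oxide molar mass to the element molar mass.
# Only a few standard gravimetric factors are shown, for illustration.
ELEMENT_TO_OXIDE = {
    "Si": ("SiO2", 60.08 / 28.086),          # ~2.139
    "Al": ("Al2O3", 101.96 / (2 * 26.982)),  # ~1.889
    "Fe": ("FeO", 71.84 / 55.845),           # ~1.286
    "Mg": ("MgO", 40.30 / 24.305),           # ~1.658
    "Ca": ("CaO", 56.08 / 40.078),           # ~1.399
}

def to_oxide(element: str, wt_percent: float):
    """Convert an element concentration (wt.%) to the equivalent oxide wt.%."""
    oxide, factor = ELEMENT_TO_OXIDE[element]
    return oxide, wt_percent * factor
```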
All data must be numerical (integer, double, or float), and missing values (e.g., non-detected chemical elements) must be previously replaced by zero or other numerical values. If any feature considered in the prediction processing is not present, the application assumes that the referred feature value is equal to zero for all analyses and automatically applies an imputation.
After the data input, the Qmin web application automatically returns to the user a new downloadable data file, encoded with a unique hash value, with the following newly added columns: group classification, quality control of the group prediction, mineral classification, quality control of the mineral prediction, second most probable mineral classification, mineral formula (if the calculation is available), and several other columns related to the mineral formula calculation.

Quality control
To measure the quality of the predicted value, we use the Shannon entropy function (Shannon, 1948), which measures the uncertainty of the prediction in discrete values based on the probability of each guess the model has made. The following equation shows how to calculate the Shannon entropy (H), where pᵢ is the probability of each guess the model has made:

H = −Σᵢ pᵢ log(pᵢ)    (1)
To facilitate interpretation by the Qmin final user, we classified the values of the quality index (i.e., E = 1 − H) into three categories: High Quality (E > 0.7), Medium Quality (0.5 < E < 0.7), and Low Quality (E < 0.5).
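A sketch of how such a quality flag can be derived from the classifier's class probabilities is given below; normalizing H by the log of the number of classes, so that E stays between 0 and 1, is an assumption made here so the 0.5/0.7 cut-offs apply.

```python
# Sketch of the quality flag computed from the classifier's class probabilities:
# Shannon entropy of the prediction, then E = 1 - H bucketed into three classes.
# Normalizing H by log(n_classes) is an assumption about the scaling.
import numpy as np

def quality_flag(probabilities: np.ndarray) -> str:
    n = len(probabilities)
    if n < 2:
        return "High Quality"  # degenerate case: only one possible class
    p = probabilities[probabilities > 0]
    h = -np.sum(p * np.log(p)) / np.log(n)  # normalized Shannon entropy
    e = 1.0 - h
    if e > 0.7:
        return "High Quality"
    if e > 0.5:
        return "Medium Quality"
    return "Low Quality"

# Usage: quality_flag(model.predict_proba(sample)[0])
```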

Blind Test
To test the robustness of Qmin, we performed a blind test using the data from the Diamante Brasil Project (Cunha et al., 2017). This project was conducted by the Geological Survey of Brazil and contains more than 22,000 EPMA analyses of different minerals. These data were analyzed and manually classified by specialized geologists before our blind test. Figure 5 summarizes the results using the exploratory graphics that Qmin presents to the user. The application achieved a 99.44% balanced accuracy in the classification across groups and minerals (see Table 2 for a summary).
Some cases of misclassification are, to a certain extent, due to the quality of the input data (e.g., the sum of all analyses diverges considerably from 100%). This divergence is particularly prominent for complex hydrated minerals, such as the amphiboles.

Discussion And Conclusions
We presented a new open-source and free-of-charge application that helps to simplify and speed up EPMA mineral chemistry data processing, analysis, and interpretation by using machine learning techniques. The Qmin application can provide high-accuracy predictions among several different mineral groups and species, with a mapped uncertainty for each classification. The performances found in this work for the training and test data are above the average reported in several publications in the field (Borges and Aguiar, 2019; Gavish et al., 2018; Koch et al., 2019; Li et al., 2020).
Qmin has great potential for use in the mineral prospecting industry, where efficiency is a determining factor. The tool helps simplify the workflow of a considerable part of the post-analytical stage of mineral prospecting campaigns or academic research that uses large volumes of EPMA data. In the case study presented, whose data cover several minerals used in prospecting for diamonds on a national scale, conventional processing and analysis that could take days were reduced to a few minutes of work on a computer with standard processing capacity. The application is in constant development, and new and improved features are to be implemented in future iterations of the code, especially regarding new mineral entries and an improved amphibole formula calculation.
Finally, we would like to highlight some points of best practice to achieve good performance when using Qmin:
• The mineral classification is based on mineral compositions available in the database fed to the models during the training stage. This setting implies that the models cannot recognize any mineral different from those implemented during the training. Thus, the algorithm will classify any new mineral as one already known to it, provided there is similarity within the created multivariate data space.
• The quality of the predictions and mineral formulas is directly associated with the quality of the analysis in terms of the analytical balance. Analyses whose sums of chemical elements are far from the total value (i.e., 100%) tend to render inadequate predictions.
• The database used for training, although robust, has inconsistencies such as those found in any repository that compiles extensive historical series, for example, variation in the precision (i.e., detection limit) of the analytical methods. Inconsistencies have also been identified from discrepancies in the nomenclature pattern or the incongruent allocation of minerals within groups (e.g., allocating tellurides along with sulfides or some silicates along with carbonates). These problems affect the models' ability to make good predictions and, if removed, a reduction of uncertainties can be achieved.

Figure 1
Qmin development and utilization flowchart with the three main stages: Pre-processing, Model Implementation, and Production. Each of these stages is related to chained processes that end with a quality evaluation and with the construction of the basis for the subsequent stages. The Application stage ends with the output results of external entry data.

Figure 2
Some examples of the results obtained from running the DBSCAN algorithm: distance plots of samples ordered according to the k parameter (in both cases k is set to 5 nearest neighbors) and the optimum eps.

Supplementary Files
This is a list of supplementary files associated with this preprint.