Unsupervised drill core pseudo-log generation on raw and filtered data: a case study in the Rio Salitre greenstone belt, São Francisco Craton, Brazil



Abstract
We use in situ portable X-Ray Fluorescence data acquired on sawn drill core samples of rocks from the Sabiá prospect, at the Rio Salitre greenstone belt, São Francisco Craton, Brazil, to automatically generate pseudo-logs by running unsupervised learning models that group distinct lithotypes. We tested the K-means and Model-Based Clustering (MBC) algorithms and compared their performance on raw and filtered data against a manual macroscopic log description. From the initial 47 available elements, 20 variables were selected for modeling, following the criterion of presenting at least 95% of uncensored values. Additionally, we performed a Shapiro-Wilk test that confirmed a non-parametric distribution, with P-values below the 5% significance level. We also checked whether the dataset's distribution was statistically equivalent to that of the duplicates with the assistance of a Kruskal-Wallis test, which would confirm the representativity of the measurements at the same 5% significance level. After this step, the pseudo-log models were created on reduced-dimension data, compressed by a centered Principal Component Analysis with data rescaled by its range. To reduce the high-frequency noise in the selected features, we employed an exponentially weighted moving average filter with a window of five samples. By analysis of the Average Silhouette Width in sample space, the optimum number of clusters for K-means was fixed at two, and the first models were generated for raw and filtered data. From the MBC perspective, the sample space is interpreted as a finite mixture of groups with distinct Gaussian probability distributions. The number of clusters is defined by analysis of the Bayesian Information Criterion (BIC): several models are tested, and the one at the first local maximum defines the number of groups and the type of probabilistic model in the simulation.
For the data used in this work, the optimum number of groups for MBC is four, and the probabilistic model type determined by the BIC is elliptical with equal volume, shape, and orientation. Thus, […] representativity for the two drill cores' samples. All K-means and MBC models were able to detect changes in lithotypes not described in the manual log. On the other hand, one lithotype described by the experts was not detected by this methodology in any attempt, and a detailed investigation with thin-section descriptions was needed to determine the cause of this response. Finally, compared with the manual log description, the models built on filtered data perform notably better than those generated on raw data, and the filtered MBC model performed best of all. Hence, this multivariate approach, allied with filtering of the data by a moving-average transformation, can be a tool of great help during several stages of mineral exploration, either in the creation of pseudo-log models prior to the description of the drill core samples or in the data validation stage, when it is necessary to standardize descriptions made by different professionals.

The "data-rich paradigm" is already a reality in mineral exploration. This scenario can be found in several segments of the mineral industry, such as airborne geophysics, exploratory geochemical surveys, mineral resources and reserves evaluation, and studies of physical […]

[…] metavolcano-sedimentary associations (e.g., Rio Salitre, Brumado, Boquira, and Riacho de Santana; Barbosa, 2012) […] Salitre: metapelites, metagraywacke, and meta-arkose, interlayered with mafic-ultramafic and felsic metavolcanic rocks; and ii) Sobradinho: composed of banded iron formations, paragneiss, phyllites, mica schists, metabasic and meta-ultrabasic rocks, calc-silicate rock, and quartzites.
These units were deformed and metamorphosed at low greenschist facies metamorphic grade. The RSGB has potential for base metals (Cu-Pb-Zn) associated with VHMS and orogenic gold-type deposits (Barbosa, 2012). Several occurrences of Cu-Pb-Zn have been described in this sequence, highlighting the Sabiá prospect (reserves of 10 million tons of massive sulfide; Ribeiro et al., 1993), which presents pyrite- and pyrrhotite-rich levels in calc-silicate rocks with tremolitization associated with the Sobradinho unit (Angelim, 1997).

The samples used in this work were collected from two complementary drill cores next to the Sabiá prospect that intercept the entire stratigraphy of the Baixo Vale do Rio Salitre unit. On average, the samples are 15 cm long, and one sample was taken about every 2 meters.

The described rocks are mostly fine-grained and show similar color and texture. These features make manual logging, an essential task in mineral exploration, complicated and costly. Despite the similarities between the drill cores' lithotypes, we described at least four rock types: calc-silicate rocks, carbonaceous phyllites with and without mineralization, metapelites, and metamafic rocks.

The instrument used in this work was a Thermo Scientific Niton XL3t GOLDD+ XRF analyzer, with a 2 W, 50 kV Au anode tube and a geometrically optimized large-area drift detector. The instrument offers three methods of analysis; the method chosen in this work is named "TestAll Geo," indicated when the concentration of the elements of interest is unknown. "TestAll Geo" is a hybrid mode able to detect several major, minor, and trace elements: Ag, Al, As, Au, Ba, Bi, Ca, Cd, Ce, Cl, Co, Cr, Cs, Cu, Fe, Hf, Hg, La, Mg, Mn, Mo, Nb, Nd, Ni, P, Pb, Pd, Pr, Rb, Re, S, Sb, Sc, Se, Si, Sn, Sr, Ta, Te, Th, Ti, U, V, W, Y, Zn, Zr. The instrument was coupled to a stationary test stand during the measurements, where the samples were placed. Each measurement took 120 s, with 60 s for each beam.

The measurements were performed on quartered and sawed drill core samples, using the "point and shoot" or in situ assay mode. The adopted QA/QC procedures followed the suggestions of […]

The bulletins of the X-Ray Fluorescence data were gathered in a single spreadsheet. All procedures of data preparation were performed in the R environment. The data management step used the concepts of "tidy data" handling, with the "dplyr" package (Wickham, 2014). The graphical analysis and all statistical diagrams shown in this work were produced with the "ggplot2" package (Wickham, 2016).

From an initial number of 47 elements, we selected 20 for exploratory data analysis according to the proportion of uncensored data: only variables with at least 95% of valid results were kept. Then, for the multivariate analysis, all missing values were replaced by half of the lower limit of detection, as suggested by Farnham et al. (2002) and Kwak and Kim (2017). No outliers were removed from the data, as outliers can represent interesting samples in mineralization.
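The variable-selection and imputation steps described above can be sketched as follows. This is a minimal pure-Python illustration, not the authors' workflow (which used R with dplyr); the element names, values, and detection limits are hypothetical.

```python
# Sketch of the variable selection (>= 95% uncensored values) and the
# imputation of censored results by half the lower limit of detection.
# Element names, values, and detection limits below are hypothetical.

def select_and_impute(table, lod, min_valid=0.95):
    """Keep variables with at least min_valid uncensored values and
    replace censored results (None) by half the detection limit."""
    kept = {}
    for element, values in table.items():
        valid = sum(v is not None for v in values) / len(values)
        if valid >= min_valid:  # drop mostly-censored variables
            kept[element] = [v if v is not None else lod[element] / 2
                             for v in values]
    return kept

# Hypothetical pXRF results in ppm (None = below detection limit)
table = {
    "Zr": [120.0] * 19 + [None],   # 95% valid -> kept, None imputed
    "Ag": [None] * 5 + [0.5] * 15, # 75% valid -> dropped
}
lod = {"Zr": 2.0, "Ag": 0.2}
out = select_and_impute(table, lod)
```

With these inputs, "Ag" is excluded from modeling, while the censored "Zr" value is replaced by half its detection limit (1.0 ppm).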
Each variable was scaled to values between 0 and 1 with respect to its range, using Min-Max feature scaling. This approach matches the ranges to given bounds but preserves the original data distribution. Normalization is a mandatory step of data preparation in multivariate analysis because of the regularization of variances, an essential parameter for several statistical procedures and verifications (Grunsky, 2010). Min-Max feature scaling is given as:

x'_i = (x_i - min(x)) / (max(x) - min(x))

where x = (x_1, …, x_n) and x'_i is the i-th normalized value.

The nature of geological data in drill core samples allows the interpretation that neighboring samples may have a particular correlation. Therefore, the EWMA filtering process is suitable for reducing the high-frequency noise, which can either result from some specificity of a sample (nugget effect) or be a product of accuracy issues of the pXRF analysis. EWMA filtering is described as follows:

x'_i = ( Σ_{j=0}^{n-1} w_j · x_{i-j} ) / ( Σ_{j=0}^{n-1} w_j )

where x'_i is the filtered value for the i-th sample, n is the number of neighboring samples considered in the filtering process, and w_j is the weighting coefficient, which decays exponentially from the current sample as a function of distance:

w_j = (1 - α)^j, with α = 2 / (n + 1)

EWMA filtering has some advantages, such as fast performance with low computational cost and strong smoothing properties, with the weights of the neighboring samples automatically assigned as a function of the number of neighbors. EWMA requires the definition of some hyperparameters, adjusted by the interpreter. The more neighbors considered during the filtering process, the smoother the resulting curve. In most cases, when many neighbors are considered, an offset (lag) is also added to the curve's shape. Furthermore, information may be lost, as the first n − 1 samples are dropped in the filtering process.
For the data used in this work, the optimum number of neighbors was defined as five, as no visible offset was added to the data and the algorithm still performed considerable smoothing of the analyzed curve with minimal data loss (Figure 4). For comparison purposes, the filtered and raw datasets were treated separately in most of the further analyses.
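The two transformations above can be sketched in a few lines. This is a pure-Python illustration (the original processing was done in R); the smoothing factor α = 2/(n + 1) is the usual EWMA convention and is assumed here.

```python
# Sketch: Min-Max feature scaling and a trailing exponentially weighted
# moving average (EWMA) filter, as described in the text. Pure Python,
# for illustration; assumes a non-constant series for the scaling.

def min_max_scale(xs):
    """Rescale values to [0, 1] by their range."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def ewma_filter(xs, n=5):
    """Weighted average over the n most recent samples; weights decay
    exponentially with distance. The first n-1 samples are dropped."""
    alpha = 2.0 / (n + 1)
    weights = [(1 - alpha) ** j for j in range(n)]  # w_0 = current sample
    wsum = sum(weights)
    return [sum(w * xs[i - j] for j, w in enumerate(weights)) / wsum
            for i in range(n - 1, len(xs))]

scaled = min_max_scale([2.0, 4.0, 6.0])   # -> [0.0, 0.5, 1.0]
smooth = ewma_filter([1.0] * 10, n=5)     # a constant series stays constant
```

Note that, as stated in the text, the filtered series is n − 1 samples shorter than the input: with a window of five, a 10-sample series yields 6 filtered values.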

All these groups and associations between elements represent the compositional variation of the main constituent minerals of the analyzed lithotypes. In a traditional assessment, the interpreter tries to link these associations to previously known lithotypes. However, due to the high number of analyzed variables, interpretation can become challenging even for the most experienced professionals.

Dimensionality Reduction
The most common dimensionality reduction method is Principal Component Analysis (PCA; Grunsky and Arne, 2020). PCA relies on a combination of linear transformations called a "basis change" that aims to maximize the data variance on several orthogonal axes, ordered from first to last by the proportion of the dataset's variance they explain. Thus, the first components are often the most interesting for multivariate analysis, since they typically account for a large proportion of the total variation, whereas the last components are usually discarded, since they may reflect noise rather than systematic patterns (Forkman et al., 2019). PCA is particularly useful before running a cluster analysis: since many clustering methods rely on the "distance concept" (Frey and Dueck, 2007), the PCA's space optimization helps these processes.

For the dataset used in this work, PCA run on the filtered data indicates that the filtering process effectively assists noise reduction, as the first components explain a considerably larger amount of the data variance compared to PCA run on raw data (Figure 6).

[…]

The BIC is based on Bayes factors, which are the posterior odds for one model against another, assuming neither is favored a priori (Fraley and Raftery, 1998). The BIC is independent of how the different models are built, varying constraints on the clusters' shape, volume, and orientation. After comparing the possible models across different numbers of clusters, from 1 up to a maximum, the first local maximum of the BIC is considered to indicate the best suitable model.
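As a minimal illustration of the idea behind PCA, the first principal component of a small dataset can be extracted by power iteration on its covariance matrix. This is a pure-Python sketch for intuition only; the study itself used a centered PCA on range-scaled data in R.

```python
# Sketch: first principal component via power iteration on the covariance
# matrix. Pure Python, for illustration; not the authors' R workflow.

def first_principal_component(rows, iters=200):
    n, d = len(rows), len(rows[0])
    means = [sum(r[k] for r in rows) / n for k in range(d)]
    centered = [[r[k] - means[k] for k in range(d)] for r in rows]
    # Sample covariance matrix (d x d)
    cov = [[sum(row[a] * row[b] for row in centered) / (n - 1)
            for b in range(d)] for a in range(d)]
    v = [1.0] * d  # starting vector for power iteration
    for _ in range(iters):
        v = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in v) ** 0.5
        v = [x / norm for x in v]  # converges to the dominant eigenvector
    return v

# Points spread along the line y = x: PC1 should point along (1, 1)/sqrt(2)
data = [[0.0, 0.1], [1.0, 0.9], [2.0, 2.1], [3.0, 2.9], [4.0, 4.1]]
pc1 = first_principal_component(data)
```

For data elongated along y = x, the recovered axis has (nearly) equal loadings on both variables, which is exactly the "direction of maximum variance" that the first component captures.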

Four clustering models were generated, on raw and on filtered data, varying according to the clustering method: K-means or Model-Based Clustering (Figure 9).

As the K-means clustering depends on the predefined parameter given by the maximum Average Silhouette Width (Figures 9a and 9b), the generated models indicate only two groups by this method. Generally, these two groups split the samples with a Ni-Cu-Cr association (negative values of PC1) from those with a Ba-Zr association (positive values of PC1).

The MBC shows some differences in the number of groups between raw and filtered data. Only three clusters were defined on raw data, with a superposition of groups in the space generated by the first two dimensions (Figure 9c), while for the filtered data, four distinct groups with no superposition in PC1-PC2 space were defined (Figure 9d).

Moreover, in the pseudo-log generation (depth-constrained clustering analysis), the effect of filtering becomes more evident, as the high-frequency noise tends to decrease in both the K-means and MBC pseudo-logs (Figure 10). Of all the pseudo-logs generated, the MBC run on filtered data is closest to the manually described log. Nevertheless, there are still notable differences, such as an apparently greater level of detail in the APA2001 drill core (separating a class for the more sulfide-rich range). Besides, no clustering model was able to distinguish between the rocks described as calc-silicate and metamafic.
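The K-means model selection described above, choosing the number of clusters by the maximum Average Silhouette Width, can be sketched as follows. This is a pure-Python illustration with a deterministic (evenly spaced) initialization for reproducibility; the study itself used R, and real pXRF data would of course replace the toy points.

```python
# Sketch: Lloyd's K-means plus the Average Silhouette Width criterion
# used in the text to fix the number of clusters. Pure Python.

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def kmeans(pts, k, iters=50):
    cents = [pts[i * len(pts) // k] for i in range(k)]  # spread-out seeds
    labels = [0] * len(pts)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist(p, cents[c])) for p in pts]
        for c in range(k):
            members = [p for p, l in zip(pts, labels) if l == c]
            if members:
                cents[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels

def avg_silhouette(pts, labels):
    total = 0.0
    for i, p in enumerate(pts):
        own = [q for j, q in enumerate(pts)
               if labels[j] == labels[i] and j != i]
        if not own:  # singleton cluster: silhouette defined as 0
            continue
        a = sum(dist(p, q) for q in own) / len(own)          # cohesion
        b = min(sum(dist(p, q) for j, q in enumerate(pts) if labels[j] == c)
                / labels.count(c)
                for c in set(labels) if c != labels[i])      # separation
        total += (b - a) / max(a, b)
    return total / len(pts)

pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
       (10.0, 10.0), (10.1, 10.0), (10.0, 10.1)]
scores = {k: avg_silhouette(pts, kmeans(pts, k)) for k in (2, 3, 4)}
best_k = max(scores, key=scores.get)  # two separated groups -> k = 2
```

On two well-separated toy groups the criterion correctly prefers k = 2, mirroring how the Average Silhouette Width fixed the number of K-means clusters at two for the pXRF dataset.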

We performed two types of cluster analysis on portable X-Ray Fluorescence data of drill core samples from a VHMS occurrence hosted in an Archean/Paleoproterozoic greenstone belt.

The application of a filter to noisy data, such as that found in some pXRF assays, helps separate groups of lithotypes. As the clustering models rely on geochemical contrast, the filtering process, allied to Principal Component Analysis, helped increase the signal-to-noise ratio and maximize the contrast between the clusters. In general, both models developed using the filtered data showed clustering performance closer to that described by professionals.

Both the silhouette and elbow methods suggest only two clusters for the K-means, which vaguely resembles the separation of the metapelites and calc-silicate rock, but this method could not detect any other lithotypes. On the other hand, the Model-Based Clustering analysis suggested four clusters, resembling the manual log description more closely. The model could achieve this resemblance even though some key elements were not available for the analysis, such as C, Mg, Ca, and K (some of them not analyzed by the pXRF or not passing the several data tests run before the clustering model).

In some cases, the pseudo-log models consistently proposed changes in rock type not detected in the manual log. The method was able to detect small variations in composition amid large rock packages that went unnoticed in the manual description. However, one lithotype described as "metamafic rock" at the bottom of the APA3001 drill core could not be detected by any model. A possible explanation is the lack of contrast between the analyzed elements due to the similar mineralogy, confirmed by thin-section petrography.
The samples from core APA2001 are harder to fit with a model because of the thin layers repeating rhythmically, as seen in the manual log description. Core APA3001 is more homogeneous than the former, and even on raw data the models had a certain degree of convergence.

One issue to be considered is the spatial resolution of the measurements. This, allied to the geochemical contrast, can contribute to the pseudo-log's resemblance to the manual description; for a low sampling resolution such as the one used in this work, only general discrimination is expected.

After all, the results suggest that the MBC method performed better than K-means, mainly on the filtered data. The methods described in this work, especially Model-Based Clustering combined with EWMA and PCA, can be applied as an important tool in the mineral exploration industry, and their application can help solve problems during the validation of geological models based on drill hole data.

[…]

An essential step in exploratory data analysis is the determination of the distribution type. If the data is parametric, the mean and standard deviation are reasonable estimates for the center and spread of the data, and several inference tools can be used to analyze and infer population parameters. Otherwise, the data must be treated differently, with methods that do not rely on these parameters. There are nearly 40 tests available for normality verification; the Shapiro-Wilk test (Shapiro and Wilk, 1965) was initially defined for small samples (n < 50) and was later improved by Royston (1982), who expanded the test to a greater range of sample sizes (3 ≤ n ≤ 5000). In this work, this test was performed for each selected element in the database, considering only the first pXRF spot data.
For a significance level of 5%, analyzing the P-value parameter, all the selected elements presented P-values << 0.01 (Table B.1), rejecting the hypothesis of normality.

The window size of the portable X-Ray Fluorescence can be a weak point for the analysis, as only a small portion of the rock sample is measured at a time in "point and shoot" mode (Lemière, 2018). This could lead to values that are not representative of the whole sample, even if the rock is fine-grained. For this reason, we ran a double analysis for each sample, on different spots, to check its statistical representativity. We tested the statistical equivalence of the distributions in the two-spot approach, for each selected element, with the assistance of the Kruskal-Wallis (KW) test. The KW test assesses whether the distributions of the measurements and duplicates are statistically equivalent (Figure B.2). If it is acceptable that they have an equivalent distribution, we can accept the hypothesis that the measurements represent that portion of the samples.

The KW test (Kruskal and Wallis, 1952) is an extension of the Wilcoxon-Mann-Whitney test for non-parametric data and verifies whether two or more independent samples originate from the same distribution of ranked values. The KW test was performed for each of the selected elements to check whether the distribution of the first pXRF spot data can be equivalent to that of the second pXRF spot data. The KW test statistic is defined as:

H = (N − 1) · [ Σ_i n_i (r̄_i − r̄)² ] / [ Σ_i Σ_j (r_ij − r̄)² ]

where N is the total number of analyses across all samples, n_i is the number of observations in the i-th group, r_ij is the rank of observation j from group i, r̄ is the average rank of all the r_ij, and r̄_i is the average rank of the i-th group.

The verification of the P-value assesses the result of the KW test for a given α, defined as 5% (Table B.1).
Therefore, if the P-value is higher than 0.05 for a given element, the distribution of the first pXRF spot is considered equivalent to the distribution of the second pXRF spot. All 20 selected elements have P-values higher than 5% and are thus suitable for the modeling. The density distributions of both groups for each element are shown in Figure B.2.
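The H statistic defined above can be computed directly from mid-ranks. The sketch below is pure Python for two groups (first vs. second pXRF spot), with toy values; it implements only the statistic, not the chi-square P-value used for the 5% decision, which the study obtained in R.

```python
# Sketch of the Kruskal-Wallis H statistic defined above, using
# mid-ranks for tied values. Pure Python, toy data, illustration only.

def kruskal_wallis_h(groups):
    pooled = [(v, g) for g, vals in enumerate(groups) for v in vals]
    pooled.sort(key=lambda t: t[0])
    # Mid-ranks: tied values share the average of their rank positions
    rank_of = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        mid = (i + 1 + j) / 2.0  # average of 1-based ranks i+1 .. j
        for t in range(i, j):
            rank_of[t] = mid
        i = j
    n_total = len(pooled)
    grand_mean = (n_total + 1) / 2.0  # average rank over all observations
    group_ranks = [[] for _ in groups]
    for (v, g), r in zip(pooled, rank_of):
        group_ranks[g].append(r)
    num = sum(len(rs) * (sum(rs) / len(rs) - grand_mean) ** 2
              for rs in group_ranks)
    den = sum((r - grand_mean) ** 2 for r in rank_of)
    return (n_total - 1) * num / den

# Clearly separated groups give a large H; identical groups give H = 0
h_sep = kruskal_wallis_h([[1, 2, 3], [10, 11, 12]])
h_same = kruskal_wallis_h([[1, 2, 3], [1, 2, 3]])
```

An H near zero, as for the identical groups above, corresponds to equivalent rank distributions, which is the outcome the duplicate-spot check is looking for.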