K-means Based Soil Classification System Applicable to a Brazilian Mineral Province

This article presents a geotechnical soil classification system proposed for application to soils of a tropical mineral province located in Minas Gerais state, Brazil. The system was constructed using data mining techniques, i.e., principal component analysis and k-means cluster analysis, which were applied to a dataset composed of 101 geotechnical characterization laboratory test results of soils from the Quadrilátero Ferrífero province. The main objective was to establish a regional soil classification system that allows the main geotechnical parameters of soils to be interpreted from the classification, given the limited explanatory capacity of the Unified Soil Classification System (USCS) for this task. It was possible to establish a soil classification chart capable of explaining 81.68% of the variability of the analyzed parameters, with soil classes A, B and C established for the studied soils.


INTRODUCTION
Soil classification systems, including the Unified Soil Classification System (USCS) (Das and Sobhan 2013), are highly relevant to geotechnical engineering practice. They are widespread tools for soil typification, used to predict and evaluate different soil behaviors, such as the suitability of a soil for earthworks and civil construction.
According to Fookes (1994), traditional geotechnical classification systems generally consider characteristics that are relevant to temperate soils, which lack the weathering present in tropical soils.
The USCS is widely applied to different types of soils. For soil classification, it takes into account granulometric and plasticity characteristics. According to Fookes (1994), these variables are capable of satisfactorily typifying soils from temperate climates, but they are not equally sufficient to typify tropical soils, in which saprolitic or lateritic residual natures are frequently observed. These soils often present residual characteristics from the parent rock as well as grain cementation. The geotechnical behaviors induced by these characteristics cannot be explained by plasticity and granulometry characteristics alone.
In this context, Pinto (2006) emphasizes the relevance of regional soil characterization systems, such as the MCT (Miniature, Compacted, Tropical) system proposed by Nogami and Villibor (1981), which is applicable to the classification of tropical natural soils. In addition, there are informal soil classification systems of local use, such as the "red porous clay" designation used in the city of São Paulo, Brazil. The great advantage of local classification systems is their assertiveness, since the study area is restricted and, consequently, the variability of soil types grouped into classes is reduced.
Considering this discussion, this article presents a local soil classification system and examines how well conventional soil classification systems cover the soils studied. The classification system was developed for soils from the mineral province of Quadrilátero Ferrífero, Minas Gerais state, Brazil. Laboratory tests were carried out on 101 natural and compacted soil samples from Quadrilátero Ferrífero and a dataset was constructed. Its variables are effective friction angle (φ'), cohesion (c'), plasticity index (PI), particle specific gravity (Gs) and fine content.
Principal component analysis (PCA) and k-means analysis were carried out on the dataset in order to construct the proposed classification system. PCA was applied in order to understand the interdependence of the variables, reduce the dimensionality of the data and represent the data graphically in two dimensions. The application of the k-means clustering algorithm showed the occurrence of three well-defined groups of soils with similar behaviors. A soil classification chart capable of typifying three distinct geotechnical behaviors was then proposed. Carvalho and Ribeiro (2020) proposed a similar approach for the classification of partially saturated soils through cone penetration tests, suggesting that multivariate statistics is a useful tool for that purpose.
Data mining techniques (including the techniques used in this research) have been successfully used to recognize patterns and propose models in geotechnical and geological studies, as shown by the published research of Debnath and Dey (2017), Santos et al. (2018), Santos et al. (2019), Hou et al. (2020) and Do et al. (2021).
The present article is organized as follows. Topic 1 presents the introduction of the article. Topic 2 presents the dataset and the methodology used for the development of the proposed classification system. Topic 3 presents the background of the techniques used to propose the soil classification system. Topic 4 shows the results and discussion and Topic 5 shows the conclusions of the research.

Data organization
The dataset used to develop the proposed soil classification system is composed of 101 samples for which fine content, cohesion, solid density, plasticity index and friction angle were measured. Laboratory tests were carried out on these samples of natural and compacted soils from different areas of the Quadrilátero Ferrífero province. The tests carried out were granulometric characterization, CD and CIU triaxial compression tests, determination of solid density. The samples used to compose the dataset vary widely in terms of granulometric, strength, density and plasticity characteristics, encompassing residual soils of phyllites, quartzites and itabirites, alluvial and colluvial soils, as well as compacted embankments built with the same materials.
According to Das and Sobhan (2013), minerals with iron in their composition, mainly oxides, tend to have higher values of Gs than silicate minerals such as kaolinite and quartz. Stefanou and Papazafeiriou (2013) also discuss the significant influence of fine content and Gs on the strength of five types of soils, for which a positive correlation between these two variables and the penetration strength was established.
Considering a preliminary geotechnical evaluation of a given soil, it is known that the generalization that soils with higher granular contents (sands and gravels) tend to be stronger than soils with higher plasticity and fine content (clay and silt) is not recommended, as stated by Das and Sobhan (2013). This information should therefore not be taken as a premise in design methods, as is done in some empirical methods of highway design. The dataset was thus built on five variables, i.e., cohesion (c') and friction angle (φ'), which represent the strength characteristics of the analyzed materials, and plasticity index (PI), fine content and particle specific gravity (Gs), which represent the physical characteristics of the analyzed materials. Table 1 presents the first ten samples of the dataset. In order to build the proposed soil classification system, the data were standardized, as shown in Table 2. The methodology for obtaining the proposed soil classification system was divided into four steps, see Fig.

The i-th principal component is given by Eq. 1:

$$Y_i = \mathbf{e}_i^{T}\mathbf{Z} \qquad \text{(Eq. 1)}$$

where $\mathbf{e}_i^{T}$ is the transposed i-th eigenvector of the correlation matrix of the data and $\mathbf{Z}$ is the vector of standardized original variables.
The variability explained by the i-th principal component is given by the i-th eigenvalue ($\lambda_i$) associated with the i-th eigenvector.
The number of principal components to be retained is defined through the Kaiser criterion (Kaiser 1960) and the scree plot (Cattell 1966). Kaiser (1960) suggests retaining the components corresponding to eigenvalues greater than 1 when the correlation matrix is used. The scree plot criterion consists of retaining the number of components defined by the inflection point of the graph.
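The PCA procedure described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `pca_correlation` and the synthetic five-variable data are assumptions introduced only for illustration. The sketch standardizes the variables, extracts the eigenvalues and eigenvectors of the correlation matrix, computes the scores as in Eq. 1 and counts the components retained by the Kaiser criterion.

```python
import numpy as np

def pca_correlation(X):
    """PCA on the correlation matrix of X (rows = samples, columns = variables)."""
    X = np.asarray(X, dtype=float)
    # Standardize each variable: zero mean, unit sample standard deviation.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Correlation matrix of the original variables.
    R = np.corrcoef(X, rowvar=False)
    # eigh handles symmetric matrices and returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]          # sort from largest to smallest
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Score of component i for each sample: e_i^T z (Eq. 1).
    scores = Z @ eigvecs
    # Fraction of the total variability explained by each component.
    explained = eigvals / eigvals.sum()
    # Kaiser criterion: keep components with eigenvalue greater than 1.
    n_kaiser = int(np.sum(eigvals > 1.0))
    return eigvals, eigvecs, scores, explained, n_kaiser

# Synthetic stand-in for a five-variable dataset (NOT the soil data of this study).
rng = np.random.default_rng(0)
common = rng.normal(size=(60, 1))
X_demo = np.hstack([common + 0.3 * rng.normal(size=(60, 3)),
                    rng.normal(size=(60, 2))])

eigvals, eigvecs, scores, explained, n_kaiser = pca_correlation(X_demo)
print("eigenvalues:", np.round(eigvals, 3))
print("cumulative explained variability:", np.round(np.cumsum(explained), 3))
print("components retained by the Kaiser criterion:", n_kaiser)
```

Because the correlation matrix is used, the eigenvalues sum to the number of variables, so an eigenvalue greater than 1 indicates a component that explains more variability than a single standardized variable.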

K-means method
Cluster analysis is a statistical technique used to group elements by statistical similarity, by means of a distance function that determines the statistical distance between the samples.
K-means is a non-hierarchical method of cluster analysis, which consists of grouping n data points into a predefined number k of clusters. The distance between the individuals of the same cluster must be minimal, while the distance between clusters must be as large as possible (Sark and Gaben 2014). In general, the method can be described in the following four steps:
1. selection of k random points, which represent the center of each cluster;
2. allocation of each point of the input set to the nearest of the k centers;
3. recalculation of the center of each cluster;
4. iterative repetition of steps 2 and 3, until the distances obtained in each iteration converge to a constant value.
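The most commonly used distance function is the Euclidean distance, given by Eq. 2:

$$d_{ij} = \sqrt{(\mathbf{x}_i - \mathbf{x}_j)^{T}(\mathbf{x}_i - \mathbf{x}_j)} \qquad \text{(Eq. 2)}$$

where $d_{ij}$ is the Euclidean distance between the i-th and j-th individuals, $\mathbf{x}_i$ is the variable vector of the i-th individual and $\mathbf{x}_j$ is the variable vector of the j-th individual.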
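The four steps of the method can be sketched in a few lines of NumPy. This is a minimal illustration with the Euclidean distance, not the software used in this research: the function names `assign` and `kmeans`, the stopping rule based on unchanged centers, and the two-dimensional synthetic data are assumptions made for the example.

```python
import numpy as np

def assign(X, centers):
    """Return, for each row of X, the index of the nearest center (Euclidean)."""
    dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return dist.argmin(axis=1)

def kmeans(X, k, n_iter=100, seed=0):
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 1: select k random data points as the initial cluster centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: allocate each point to its nearest center.
        labels = assign(X, centers)
        # Step 3: recompute each center as the mean of its members
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Step 4: repeat until the centers no longer move.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Final allocation, consistent with the returned centers.
    return assign(X, centers), centers

# Synthetic two-dimensional points (NOT the soil data of this study).
rng = np.random.default_rng(1)
X_demo = np.vstack([rng.normal(loc=c, scale=0.3, size=(20, 2))
                    for c in ([0, 0], [4, 0], [2, 4])])
labels, centers = kmeans(X_demo, k=3)
print("cluster sizes:", np.bincount(labels, minlength=3))
```

Since the result depends on the random initial centers, the algorithm is usually run from several starting points and the solution with the smallest within-cluster distances is kept.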

RESULTS AND DISCUSSION
Multivariate outliers were removed by analyzing the Mahalanobis distance of the data points to the dataset centroid, in order to obtain a realistic dataset through the exclusion of unreliable data. A total of 22 outliers were identified, of which 12 showed unreasonable behavior for natural and compacted soils. Values of φ' above 40° were discarded for predominantly fine soils and c' values above 35 kPa were discarded for sandy soils, seeking adherence to values reported in the literature, such as Das and Sobhan (2013) and Maiolino (1985). The outlier removal resulted in a dataset with fine contents between 26% and 94.37%, cohesion values between 0 and 35 kPa, friction angles between 14.60° and 39.40°, PI values between 0 and 37% and Gs values between 2.50 and 4.46. With the inconsistent data withdrawn from the dataset, 88 samples were considered for the subsequent analysis, which was carried out in order to propose a soil classification system. Table 3 presents the descriptive statistics of the dataset variables.
Bartlett's test of sphericity (Bartlett 1951) was performed, and it was possible to conclude that the data present sufficient correlation for the application of multivariate statistical techniques, with a p-value of $9.25 \times 10^{-53}$.
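The two checks described above, outlier screening by Mahalanobis distance and Bartlett's test of sphericity, can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually used: the 97.5% chi-square quantile adopted as cutoff is a common convention that the text does not specify, the function names are hypothetical, and the data are synthetic.

```python
import numpy as np
from scipy import stats

def mahalanobis_sq(X):
    """Squared Mahalanobis distance of each row of X to the dataset centroid."""
    X = np.asarray(X, dtype=float)
    diff = X - X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    return np.einsum("ij,jk,ik->i", diff, cov_inv, diff)

def bartlett_sphericity(X):
    """Bartlett's test of the hypothesis that the correlation matrix is the identity."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)
    _, logdet = np.linalg.slogdet(R)            # ln|R|, always <= 0
    chi2_stat = -(n - 1 - (2 * p + 5) / 6.0) * logdet
    dof = p * (p - 1) // 2
    p_value = stats.chi2.sf(chi2_stat, dof)
    return chi2_stat, dof, p_value

# Synthetic correlated data plus one gross outlier (NOT the soil data of this study).
rng = np.random.default_rng(2)
common = rng.normal(size=(100, 1))
X_demo = np.vstack([common + 0.3 * rng.normal(size=(100, 5)),
                    [5.0, -5.0, 5.0, -5.0, 5.0]])

d2 = mahalanobis_sq(X_demo)
# Assumed cutoff: 97.5% quantile of the chi-square distribution with p degrees of freedom.
cutoff = stats.chi2.ppf(0.975, df=X_demo.shape[1])
is_outlier = d2 > cutoff
X_clean = X_demo[~is_outlier]

chi2_stat, dof, p_value = bartlett_sphericity(X_clean)
print("samples flagged as outliers:", int(is_outlier.sum()))
print("Bartlett statistic, degrees of freedom, p-value:", chi2_stat, dof, p_value)
```

A small p-value rejects the hypothesis of uncorrelated variables, which is the condition required for PCA to be meaningful. Flagged samples still call for engineering judgment before removal, as was done here by comparing φ' and c' with values from the literature.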

Fig. 2. Scatter plot.
PCA via the correlation matrix was carried out in order to reduce the dimensionality of the data. Table 4 shows the loadings (coefficients of the linear combinations) of the five generated principal components. The Kaiser criterion (Kaiser 1960) and the scree plot analysis (Cattell 1966) were used to determine the number of principal components that should be kept in the analysis, see Fig. 3. They suggested the retention of only one principal component.
Although there is an indication that only one component should be retained, the proposed soil classification system also considers the second principal component, since it has a high explanatory character for the geotechnical behavior of the data. Principal components 1 and 2 are given by Eq. 6 and Eq. 7, which take standardized variables as input parameters.

Eq. 4
The first principal component has a positive correlation with fine content, cohesion, Gs and PI, and a negative correlation with the friction angle. It is also important to note that the correlations have similar absolute values, except for fine content, whose correlation is about half of the others. The second principal component has a very high correlation with fine content, which is the main factor controlling its behavior. The other variables are of minor importance for the definition of this principal component. Fig. 4 summarizes this information in a biplot.
Usage of these two components encompasses the importance of all variables, being capable of explaining 81.68% of original data variability.

Fig. 4. Biplot graph.
The data were clustered using the k-means method, see Fig. 5. Both the Euclidean distance and the Mahalanobis distance were tested for calculating the dissimilarity between samples. The former presented the best performance and fit and was used to propose the new soil classification system. The number of clusters adopted was three, as recommended by the majority rule (Charrad et al. 2015), based on 30 indices used to determine the optimal number of clusters, see Fig. 6. The cluster analysis was validated by visual inspection of the graph in Fig. 5. The groups are well defined, with a very small overlapping region.

Fig. 5. Result of the cluster analysis.

Fig. 6. Application of the majority rule to determine the optimal number of clusters for the data.
The characteristics of the samples that compose each cluster were analyzed in order to label each cluster, defining three material classifying regions in the Cartesian space of principal components 1 and 2. Linear contours, α and β, given by Eq. 6 and Eq. 7, were proposed to define the boundaries between the soil classes. The proposed contours consist of linear limits that pass through the intersection points of the clusters. The soil classes were labeled A, B and C.
The proposed classification graph is shown in Fig. 7.
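The use of such a chart can be illustrated with a short sketch that assigns a class to a point in the space of principal components 1 and 2. The line coefficients below are hypothetical placeholders and are NOT the α and β contours of this study. The sketch also assumes that α separates class A from class B, that β separates class B from class C, and that the classes follow the order A, B, C for increasing values of the first principal component.

```python
# Each contour is written as a*PC1 + b*PC2 + c = 0, oriented so that the
# expression grows with PC1. The coefficients are hypothetical placeholders.
ALPHA = (1.0, 0.5, 1.0)    # assumed boundary between classes A and B
BETA = (1.0, -0.5, -1.0)   # assumed boundary between classes B and C

def side(line, pc1, pc2):
    """Signed value of the line expression at the point (pc1, pc2)."""
    a, b, c = line
    return a * pc1 + b * pc2 + c

def classify(pc1, pc2, alpha=ALPHA, beta=BETA):
    """Return 'A', 'B' or 'C' for a point in the PC1-PC2 plane."""
    if side(alpha, pc1, pc2) < 0:
        return "A"             # below the alpha contour
    if side(beta, pc1, pc2) > 0:
        return "C"             # beyond the beta contour
    return "B"                 # region between the two contours

for point in [(-3.0, 0.0), (0.0, 0.0), (3.0, 0.0)]:
    print(point, "->", classify(*point))
```

In practice, a new sample would first be standardized with the mean and standard deviation of the original dataset and projected onto the two principal components before being placed on the chart.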
The proposed soil classes of the classification system are described below.
• Class A: soils of this class are the ones with the lowest values of PI and Gs. They present values of Gs lower than 2.82, which indicates the predominance of minerals with lower unit weight, such as quartz (Gs = 2.65) and kaolinite (Gs = 2.60). According to Das and Sobhan (2013), this type of soil presents low activity values and, therefore, low plasticity. Moreover, the predominance of low unit weight minerals in class A indicates low iron content and, consequently, low cementation. This information is corroborated by Stefanou and Papazafeiriou (2013) and Cruz et al. (2013), who discuss the positive influence of cementation on soil cohesion. In addition, this type of soil presents fine contents ranging from 26% to 85%, with 75% of the samples having a fine content below 67.25%. The fine content of these soil samples and the presence of minerals with Gs lower than 2.82 indicate the predominance of sandy characteristics in these materials. These characteristics lead to the highest values of φ' observed (Das and Sobhan 2013). It is, therefore, possible to infer that class A soils correspond to materials with a sandy to inactive fine matrix, with a concentration of low unit weight minerals and low cementation.
• Class B: soils with transitional behavior between classes A and C. For these materials, intermediate values of Gs and PI are indicative of a higher percentage of active clay minerals and of higher density when compared to the values observed for class A. More expressive cohesion is noticeable for these materials, which is indicative of cementation between the grains. The occurrence of moderate friction angles, however, corroborates the observation of granular fractions of not less than 16% in 75% of the observations. Class B soils, therefore, can be understood as transitional materials between classes A and C, with intermediate behaviors.
• Class C: materials encompassed in class C present the opposite behavior to class A materials for the five variables analyzed. It can be seen, therefore, that class C tends to encompass predominantly fine materials with clayey and active characteristics, for which the design concerns often inherent to CH type materials, distributed between classes A, B and C, are applicable. The higher cohesion values observed for class C materials can be associated with pre-consolidation and cementation phenomena (Cruz et al. 2013), the latter derived from the presence of high-density minerals, such as hematite and goethite, as indicated by Gs values greater than 3.05 in 75% of the analyzed observations. The high values of c' observed for class C, however, are contrasted by materials of low geotechnical competence, for which these phenomena may not have taken place.
The samples of the dataset were classified according to the Unified Soil Classification System (USCS) and compared with the proposed soil classification system, see Fig. 8.

Fig. 8. USCS versus proposed soil classification system of the dataset samples.
Soils classified by the USCS as silty sands (SM) were allocated in the region corresponding to class A of the proposed classification system. Soils classified by the USCS as high and low plasticity clays (CH and CL), high plasticity silts (MH) and clayey sands (SC) presented a significant dispersion in the proposed classification system. Each of these USCS classes was allocated in at least two classes of the proposed chart, see Fig. 8. CH and CL soils were allocated in the region corresponding to class C of the proposed classification system. Therefore, these observations indicate that the USCS is not efficient in differentiating the geotechnical behavior of the analyzed soils, since, for example, high plasticity clays, high plasticity silts, low plasticity clays and clayey sands have similar behaviors in the region between the α and β lines.
For the dataset, boxplots of classes A, B and C of the proposed classification system and boxplots of the SM, SC, MH, CH and CL classes of the USCS were drawn for the variables φ', c', Gs, PI and fine content. Fig. 9 to Fig. 13 present the boxplots. Classes A, B and C of the proposed classification system presented distinct geotechnical behaviors, which were well demarcated by the different ranges of φ', c', PI, Gs and fine content for each class. The USCS classes, in contrast, present randomly scattered values of φ', c', PI, Gs and fine content, see Fig. 9 to Fig. 13. It is possible to note that clays, silts and sands belong to different USCS classes but present similar characteristics from the strength and behavior point of view, indicating the non-applicability of the USCS to the soils of the mineral province of Quadrilátero Ferrífero. The boxplots presented in Fig. 9 to Fig. 13 demonstrate the capability of the proposed system to discriminate the soils according to their strength behaviors.

CONCLUSIONS
Multivariate statistical techniques are powerful tools for the proposition of accurate methods and for the validation of techniques consolidated in geotechnical engineering practice. To propose the soil classification system applicable to the Quadrilátero Ferrífero province, principal component analysis was used to reduce the dimensionality of the data and to analyze the interdependence between the variables. K-means cluster analysis was also used, which allowed the definition of the soil classes (generated clusters) and of the boundaries α and β of the proposed classification system, which delimit different behaviors from the strength point of view.
The analyses conducted in this article show that the USCS is not efficient for the classification of soils in the Quadrilátero Ferrífero province, as reported by Fortes et al. (2002). The proposed classification system, in contrast, is very efficient in classifying soils from Quadrilátero Ferrífero according to their geotechnical behaviors and is capable of representing the variability of the soils from this mineral province.
The adoption of the proposed classification system for soil classification could allow the adoption of specific design guidelines for classes A, B and C, which would contribute to geotechnical engineering practices. This study is regional and not very restrictive, and it should be carefully used in engineering projects, due to the limited number of samples considered.
Further studies are recommended to correlate other geotechnical behaviors with classes A, B and C and to allow the practical improvement of geotechnical engineering through the application of the system.