Unsupervised Clustering of Continuous Ambient Noise Data to Get Higher Signal Quality in Seismic Surveys

Seismic interferometry has been shown to extract body wave arrivals from ambient noise seismic data. However, surface waves dominate ambient noise data, so cross-correlating and stacking all available data may not succeed in extracting body wave arrivals. A better strategy is to nd portions of the data in which body wave energy dominates and to process only those portions. One challenge is that passive seismic recordings comprise huge volumes of data, so identifying portions with strong body-wave energy could be dicult or time-consuming. We use spatio-temporal features, calculated with data recorded by all receivers together, to perform unsupervised clustering. Using data recorded by a dense seismic array in Sweetwater, TX we were able to identify ve clusters, representing a subsets of the complete dataset that contain similar features, and extract a 7 km/s body wave arrival from one cluster. This arrival did not emerge when we performed the same cross-correlation and stacking regimen on the entire dataset. We calculated spatio-temporal features that characterize data recorded by nodal seismic arrays. Unsupervised clustering successfully identied noise windows with strong body wave energy.


Introduction
Ambient noise seismic interferometry (ANSI) extracts waves travelling between two stations by crosscorrelating the time series data recorded by each station and thus estimates an Empirical Green's Function (EGF), the impulse response of the Earth's subsurface between the stations. One station acts as a "virtual source" and the other as a "receiver" (Dantas et  However, for virtual source gathers to be an accurate estimate of the EGF there must be 1) a dense and homogeneous distribution of uncorrelated noise sources and 2) an equipartitioned noise wave eld (Wapenaar et al., 2005). Unfortunately, these conditions are rarely met in real applications. To compensate for a non-uniform distribution of sources, correlations computed for short time windows (e.g., several minutes) are stacked over hours or days to produce a robust estimate of an EGF. To meet the second condition, sources are required to be located on the axis that connects the two stations. For the case of re ected waves, sources are required to be located in stationary phase zones (Draganov et al., 2013) or in a region in which a re ection between two receivers can be retrieved. In this study, crosscorrelation results for a single time window is referred to as a single "cross-correlation panel". Final EGF estimates result from stacking a set of cross-correlation panels.
Stacking correlations over a long period of time was proposed previously as a way to attain the broadest possible distribution of sources (Schimmel et al., 2011;Snieder, 2004) and increase the signal-to-noise ratio of arrivals by suppressing random noise. However, it has been challenging to establish how long correlations need to be stacked, or for passive seismic surveys to acquire data, to ensure that EGFs converge. The convergence rate of noise correlations has been studied by Sabra et al. (2005), who established theoretical relations between variance, recording length, frequency bandwidth and recorded energy. EGF convergence also depends on the choice of window length, processing parameters, and quality of data (Chamarczuk et  slowness to discriminate between the cross-correlation panels. There are few pre-correlation automated selective stacking methods (Draganov et al., 2013). Pre-correlation estimates can save signi cant computation time in computing cross-correlations and also would be less biased by pre-processing routines used in ANSI. In this study we show that it is possible to identify noise windows in which body wave energy dominates with the aid of machine learning methods, speci cally unsupervised clustering. . However, most of the above-mentioned studies can be classi ed as "supervised learning" approaches because they require a training dataset and information about the data beforehand. Supervised techniques may therefore work well in the lab, when applied to previously-acquired data, but applications in the eld, in real time, would be challenging or impossible. Ambient noise studies are often carried out on datasets for which little information is available beforehand, so "unsupervised clustering" or "unsupervised machine learning", if feasible, would be better suited to exploring ambient noise datasets. Unsupervised clustering has been used previously to discriminate between local and teleseismic arrivals (Mousavi et al., 2019), to automate microseismic event picking (Chen, 2020), and to detect seismic events (Li, Peng, et al., 2018a). However, few studies have applied and assessed the capabilities of unsupervised learning methods to identify signals recorded by a passive seismic array. Johnson et al. (2020) and Chamarczuk et al. (2020) report on applications of unsupervised learning methods that succeed in identifying classes of signals in seismic data. Identifying noise windows with dominant body wave energy in ANSI studies remains an important and often timeconsuming problem. This study assesses the prospects for unsupervised learning methods to save the time and effort required to separate these panels (Draganov et al., 2013;Vidal et al., 2014). We engineer features and tailor unsupervised learning to identify clusters in which body waves arrivals dominate cross-correlation panels, and ultimately selectively stack panels within clusters to reveal body wave arrivals with higher signal-to-noise ratios than is found by stacking all available panels. We also show that machine learning algorithms can help extract information required to label data for subsequent supervised learning studies.

Data
The data used in this study were recorded by a mixed array comprised of twenty Nanometrics Trillium Compact Posthole seismometers, ve Nanometrics Trillium 120 Posthole seismometers, and 2639 . The Zland nodes were divided into three subarrays, namely, the active source array consisting of 2322 nodes, the backbone array consisting of 292 nodes that spanned the study area for passive seismic data analysis and the outlier array that consisted of 25 nodes spaced 5 km around the study area. The Nanometrics Trillium stations recorded data with a sampling rate of 200 Hz; Fair eld nodes recorded at 500 Hz. The Sweetwater array collected data from 7th March to 30th April.
We use a 132-station subset to satisfy array processing assumptions, namely that departures from homogeneous structure are negligible. These 132 stations are con ned to the 9x9 km area shown in Fig. 1a. For unsupervised clustering we use data from 29-30 March 2014, a total of 48 hours. The array recorded a variety of sources, including airplanes, oil pump jacks, vehicles, Vibroseis trucks, and wind turbines (Fig. 1b), among others. Possible sources of distant urban noise sources are Snyder airport (38 km west of the Sweetwater array), Anson airport (60 km to the east), and Abilene (74 km to the southeast:  Multiple Vibroseis shots can be observed in Figure 2(d), which is a spectrogram of broadband station 6X542. The number and variety of anthropogenic noise sources, coupled with a lack of seismic activity, makes the identi cation of seismic arrivals at long offsets challenging. In addition to the noise sources we were able to identify, we also observed a constant signal at ~3 Hz in the spectrogram of uncertain origin. Other studies have identi ed wind turbines as sources of similar signals (Neuffer et al., 2021).
There are certainly turbines in the region, many of which were installed in 2014 or shortly beforehand, but efforts to determine the direction and distance to these sources, described below, point to locations of no known wind turbines, in addition to locations of dense groups of turbines.

Work ow
We apply unsupervised machine learning methods to continuous seismic data to nd clusters with high signal-to-noise body wave arrivals. Speci cally, we compute the seismic features beamforming, interferometric location (Chamarczuk et al., 2020) and frequency domain amplitudes in an effort to capture characteristics of ambient noise signals that can be interpreted in terms of direction, velocity, location and frequency content of seismic sources. Our preprocessing routine involves removing linear trends from each noise window and downsampling to 200 Hz using Lanczos resampling. To calculate features and perform subsequent clustering, we rst divide 48 hours of continuous seismic data into 20 s windows to compute cross-correlation panels. Window length is an important parameter in seismic interferometry studies; shorter noise windows have been shown to extract body wave arrivals more effectively from ambient noise seismic data (Chamarczuk et al., 2021b;Cheraghi et al., 2017;Draganov et al., 2007). Following this step, our work ow consists of three steps (1) feature calculation (2) unsupervised clustering using K-means, and (3) ambient noise processing and stacking of each cluster.

Feature Generation
We compute features for various durations of time windows (e.g., 20 s, 40 s, 60 s, etc.). Beamforming is a "delay and sum" technique for tuning an array of sensors to receive signals from a particular azimuth, while rejecting signals from other azimuths. Data from array sensors are combined in such a way that waves arriving from user-speci ed directions sum constructively, while waves arriving from other directions experience destructive interference. The "beamforming" feature is therefore sensitive to azimuth and apparent velocity of sources located around the array. We use Obspy's beamforming function (Beyreuther et al., 2010), which is a correlation approach (Ruigrok et al., 2016) where crosscorrelations of waveforms from different receivers are used as inputs for beamforming. Based on the node spacing of 800 m, the minimum resolvable wavelength would be 800 m if we consider a minimum frequency of 0.5 Hz (Löer et al., 2018). We calculate beamforming features between 0.5 to 10 and limit slownesses to -1.0 s/km to 1.0 s/km to avoid spatial aliasing. We calculate a total of 400 beamforming features and subsequently smooth and downsample to 40 features. The number is further reduced to 19 by imposing a 95% variance threshold on the feature correlation matrix. An example of unscaled beamforming features can be seen in Fig. 3.
Interferometric location is a spatiotemporal tool that scans within the con nes of an array to nd source locations . Interferometric location has been applied successfully to regional earthquake monitoring (Poiata et al., 2016), to studies of earthquake sequences (Baker et al., 2021), and to identifying persistent noise sources for subsurface monitoring . The inputs are crosscorrelations and a constant velocity, which are used to calculate the intensity of sources at each location. The feature for location (x, y), S(x, y) is given by where τ m and τ n are time differences between location (x, y) and station m, and location (x, y) and station n, respectively.
To calculate the interferometric location feature we rst divide the area covered by the array into 20x20 grids. After applying a bandpass lter to the data with corner frequencies of 1 Hz and 80 Hz and spectral whitening, we calculate all inter-receiver cross-correlations using CuPy. We then use the cross-correlations to calculate a total of 400 interferometric location features. Subsequently, we smooth and downsample the features to 100 features. An example of unscaled interferometric location features is shown in Fig.  3a.
Frequency domain amplitudes helps us capture variations in frequency content across the array. To calculate this feature, we calculate the amplitude spectrum for each station using a Fast Fourier Transform. We then grid the array area with 4x4 cells and sum the amplitudes of signals of receivers that fall within the same cell. Then we sum the signals within each cell between 1 -20 Hz, 20 -40 Hz, 40 -60 Hz, 60 -80 Hz, 80 -100 Hz, to produce a total of 80 features. We choose 4x4 grids because of the density of stations in the Sweetwater array. The frequency bands were chosen, after examining sample sources, to allow different types of sources to be localized in (i.e., separated into) different intervals. An example of frequency domain amplitudes can be seen in Fig. 4.

Unsupervised clustering using K-means
Clustering is an exploratory data analysis technique often used to identify subgroups in a dataset that share common characteristics. Unsupervised clustering is a way of identifying subgroups without any prior labels in the dataset. K-means is an unsupervised distance-based algorithm that iteratively tries to partition the data into distinct, non-overlapping subgroups (Jin & Han, 2010).
One of the hyperparameters that is required to be initialized for K-means is the number of clusters. To determine the optimal value for number of clusters we use silhouette analysis, which measures the similarity of points within clusters and their dissimilarity from points in other clusters. Supplementary S1 includes more details concerning silhouette analysis.
To evaluate the utility of each feature type we evaluate the K-means clustering performance for each type by performing silhouette analysis on a range of cluster numbers. We nd that only frequency domain amplitudes have good clustering performance. From silhouette analysis, we nd that ve clusters have the maximum average silhouette value, so the optimal number of clusters is ve (Fig. 5a). To understand the distribution of noise windows in each cluster, we plot the distribution of silhouette coe cient values (Fig. 5b). From Fig. 5b, we can see that Group Two has the highest silhouette coe cient values, while all groups other than Group Four have silhouette coe cient values greater than 0.35.

Ambient noise seismic processing and selective stacking
We selectively stack data within each cluster after applying ambient seismic noise processing. We detrend, demean, and downsample the data to 200 Hz for each of the 134 stations. We then bandpasslter the data with corner frequencies of 1 and 80 Hz, split the results into 20-s noise windows, and apply spectral whitening between 1 Hz and 80 Hz to balance the contribution between different frequencies. We cross-correlate the whitened traces for a maximum lag of 10 s, because limiting lag times saves computational time. Finally, the cross-correlation panels are stacked according to their cluster relationship.

K-means clustering results
We retrieve a 7 km/s arrival in the positive lag of Virtual Source Gather (VSG) 6X542 -Group 4 obtained by stacking correlations within Group Four (Fig. 6b), which is observed to ~9 km offsets. Additionally, in the positive lag, we retrieve arrivals with apparent velocities of 4 km/s and 1 km/s. In VSG 6X542 -Group 0 (Fig. 6a), obtained by stacking correlations within Group 0, we observe arrivals with velocities 1 km/s in both positive and negative lags, although they are more dominant in the negative lag. We interpret the 7 km/s arrival to be a body wave due to its lack of any dispersive character; we interpret the 1 km/s arrival to be a surface wave due to its clearly visible dispersive nature (Fig. 6b). It is important to note that the 7 km/s arrival (Fig. 6b) only becomes visible when a bandpass lter with corners at 1 and 5 Hz is applied. Its frequency content and a high velocity suggest a mid-crustal arrival, likely originating from the direction of Oklahoma city. Moreover, selectively stacking correlations in Group 3 ( Supplementary Fig. S3c) retrieves no coherent arrivals. Among other clusters, we observe 1 km/s arrivals in the negative lag of Group 1, however, we observe a lot of scattered energy in the negative lag as well (Supplementary Fig.  S3a). VSG 6X542 (Fig. 6c), obtained by stacking all data, retrieves only surface wave arrivals in the negative lag, thereby indicating that the dominant energy over the period of two days is contained in surface waves.
Upon selectively stacking within groups clustered by K-means, we nd that all groups except for Group 4 have high silhouette values. High silhouette implies good clustering performance, i.e., that noise sources in each cluster have high similarity in location and frequency content. To gain further insight into the clustering performance, we analyze the noise window distribution in each group (Supplementary Fig. S1). Both Group 1 (Supplementary Fig. S1b) and Group 2 ( Supplementary Fig. S1c) show that noise windows within this group fall mainly within night times, when noise levels are generally low. Group 1 (Supplementary Fig. S1c) noise windows fall largely within the days and daylight hours when most Vibroseis shots were shot within and around the array. Group 3 ( Supplementary Fig. S1d) majorly includes day-time activity on 30th March. Group 4 has an average silhouette value of 0.2, which suggests high variance within the cluster. High variance implies that sources in this cluster were strong and unique, i.e., not repeating in time or location. From interferometric location features we infer that a number of Vibroseis shots were shot within the array on 29th March. Regular moves of the Vibroseis sources to new shot locations, as is routine during surveys, may explain the high variance observed.

Identi cation of sources
Since the 7 km/s body wave arrival is only seen in the positive lag and at negative offsets, the noise source is very likely located east of the array. Also, the arrival is visible in the Virtual Source Gather only when a 1-5 Hz bandpass lter is applied, which makes it unlikely that Vibroseis shots are the sources. However, the observed surface waves appear to come from all directions in the frequency band 0.5 -10 Hz (Fig. 7). Consequently, we conclude that higher energy surface wave sources lie to the west of the array, because the surface wave amplitudes are higher in negative lags in VSG 6X542 -Group 0 (Fig. 6a).
Also, in beamforming plots at frequencies of 0.5 Hz -10 Hz (Fig. 7) we see high amplitudes at 290 0 azimuth with slowness values between 0.8 s/km and 0.9 s/km. This direction points toward the airport in Lubbock, TX and we note that airplane landings, at least, are e cient generators of surface waves.

Interferometric location and Beamforming feature clustering
Although beamforming and interferometric features do not help retrieve body wave arrivals in the case described here, and they suffer from poor clustering performance, useful information is nevertheless conveyed by these features. From the interferometric location features shown in Fig. 2d we can identify Vibroseis sources with varying strength, and note that interferometric location features change signi cantly as time windows move from low strength to high strength (Fig. 8). This suggests that at least two different Vibroseis trucks were used in the survey, although it should also be noted that the locations of sources are only approximate . Given the interferometric location feature's sensitivity to energy sources within the array, it can be used to identify strong, persistent noise sources. Examples of such noise sources are mining, drilling, or construction activities. These, and other, noise sources can be used in ambient seismic monitoring or tomography studies.
The beamforming feature, despite its poor clustering performance in our study, can help identify distant sources of energy. Attenuation and scattering tend to lower the frequency content of seismic signals recorded by distant seismic stations, so we compute beamforming features with the more sensitive broadband instruments, rather than nodes. The beamforming result in Supplementary Fig. S2 shows energy with slowness 0.1 s/km arriving from an azimuth of 290 0 degrees, which could be generated at the Lubbock airport, and energy with slowness 0.2 s/km coming from an azimuth of 30 0 degrees in the frequency band 0.01-0.1 Hz (Supplementary Fig. S2). In the band 0.1-3 Hz (Supplementary Fig. S2 In addition to these features, we can also use rms energy, kurtosis and skewnesss as features to extract more information from the array. For real-time applications, computational demands of some of these features may render them impractical or infeasible.

Conclusions
We investigate the utility of a K-means algorithm to cluster data recorded by the Sweetwater seismic array deployed in north-central Texas in Spring 2014 and we extract body wave arrivals from one of the clusters. Extracting body wave arrivals is often a challenge in ambient noise seismic interferometry because surface waves typically dominate seismic recordings. We cluster the data into ve groups based upon variations in the frequency domain amplitude feature, which measures energy distribution in userspeci ed frequency bands at locations within the array. We retrieve body waves with velocities of 7 km/s and 4 km/s by processing the data within just one high-amplitude cluster; neither of these waves was retrieved by processing all recorded data. We conclude that selectively stacking subsets of ambient noise records produces better results, in the sense that more waves can be identi ed and measured, than can be achieved by stacking all available data. Further, it appears that the K-means algorithm may be able to e ciently identify subsets whose stacking leads to better results.
Convergence in ambient noise seismic interferometry has been approached previously as a "brute-force", time-sequential problem. However, because surface wave energy associated with anthropogenic activities is so abundant, brute stacking may not succeed in extracting body wave arrivals. In such cases, identifying and re ning measures that identify time windows of the data in which body wave arrivals dominate could be an effective approach. Machine learning algorithms offer increasingly accessible and simple-to-implement strategies for identifying data subsets by clustering portions of the data that share common characteristics with the help of spatiotemporal features computed for the whole array. In the case described here, one such shared feature led to constructive stacking that revealed previously unseen body waves. Perhaps most importantly, we retrieved body wave arrivals with no prior information about the subsurface.
Unsupervised clustering is easy to automate, to perform periodic processing of data recorded by nodal arrays. Additionally, unsupervised clustering can e ciently determine optimal ambient seismic noise processing parameters. Consequently, these techniques can also be used for monitoring applications in which recurring sources, if available and if identi ed, can be used to investigate changes in subsurface structure.   Vibroseis sweeps recorded by broadband station 6X542 (rectangle).