Experimental design
In this study, Apis cerana was used as the experimental species. The research was carried out in a mulberry field at the Sericultural and Apicultural Research Institute, Yunnan Academy of Agricultural Sciences, in Honghe Hani and Yi Autonomous Prefecture, Yunnan Province, China (Fig. 1). Four bee colony hives, each with approximately 9000 bees, were placed in an open area of the mulberry field (103.39°E, 23.52°N). Acetone and ethyl acetate were purchased from Sinopharm Chemical Reagent Co., Ltd. (Shanghai, China). Sound acquisition, transmission and storage were all performed with the same iPhone-8 model with 4G mobile network service.
Data Acquisition
The iPhone-8 was positioned above the beam cabinet and separated from the bees by a steel net, which prevented interference from the bees. Beehive sounds were recorded with the built-in recording function of the iPhone-8 at its default settings. The sound data were recorded in mono in MPEG-4 file format, with a sample rate of 22 kHz and 16-bit resolution.
To collect accurate audio data from the hive, before the experiment we trained honeybees to forage at a feeding station 30 meters away from the beehive (Fig. 1). During the experiment, a feeder containing 500 grams of sugar water (50% sugar by weight) was placed at the feeding station. For the chemical treatments, acetone or ethyl acetate was added to the syrup at 0.1% by weight. The syrup thus fell into three categories: pure syrup (PS), syrup with acetone (SA) and syrup with ethyl acetate (SE). Before the experiment, we randomly fed the colonies PS, SA or SE and collected beehive sound for data mining and for building ML models. Each sound recording lasted approximately 30 minutes. We then collected another set of sound data in the subsequent experiments and used it to test the predictions of the ML models.
In the first step of the experiment, PS was placed in the feeder at 8:00 a.m. Figure 2(a) presents the process of recording the sound for each type of syrup. In each collection, we did not start recording until the bees had been foraging at the feeder for 10 minutes. Audio data were then collected for at least 30 minutes without interruption at each hive. Afterwards, we removed the feeder for 20 minutes to prevent the bees from foraging at or visiting the feeding station. In the second step, we placed SE in the feeder and repeated the procedure above. In the next step, we placed SA in the feeder. Finally, PS was used again to collect sound.
When the first phase was completed, we paused the experiment for at least three days to allow the chemical compounds to dissipate from the beehives. We then started the next phase at the same time of day. In the second phase, we swapped the order of SA and SE and otherwise repeated the procedure of the first phase. The specific course of the experiments is shown in Fig. 2(b).
Data Mining
In modern society, vast amounts of data are generated and stored every day, which has driven the development of data mining approaches. Data mining usually refers to the process of discovering hidden information in a mass of data (Li et al., 2017). With the development of computer technology, researchers have applied intelligent methods to extract information through data mining (Barati et al., 2011). At present, data mining is widely used in many fields, such as medicine, ecology and genomics (Chen et al., 2011; Vizcaino et al., 2014; Han et al., 2020), and classification algorithms are a common tool for it (Yasodha & Prakash, 2012; Anitha & Kaarthick, 2021). In this study, data mining proceeded in three stages: (1) preprocessing of the data; (2) filtering and ranking of feature importance; and (3) validation of the classification model.
After capture, all audio files were transferred to a computer, and each original file was converted from MPEG-4 to waveform (wav) format with Python. Each recording was then trimmed to a 30-min audio file and divided into 10-s samples without overlap. Next, the R programming language (R Core Team, 2013) was used to extract common signal characteristics from all 10-s samples, including low-level signal features, 13 MFCCs (Nolasco & Benetos, 2018) and 12 chroma vectors (CVs) (Müller et al., 2005).
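The paper performs this segmentation in Python and R; as an illustration only, the non-overlapping 10-s split can be sketched in Python with NumPy (the function name `segment_audio` and the dummy signal are our own, not from the study):

```python
import numpy as np

def segment_audio(signal, sr, seg_seconds=10):
    """Split a mono signal into non-overlapping fixed-length segments.

    Any trailing partial segment shorter than seg_seconds is discarded.
    """
    seg_len = int(sr * seg_seconds)
    n_segments = len(signal) // seg_len
    return [signal[i * seg_len:(i + 1) * seg_len] for i in range(n_segments)]

# Example: a 35-second dummy recording at the study's 22 kHz sample rate
# yields three complete 10-s samples (the trailing 5 s are dropped).
sr = 22000
dummy = np.zeros(35 * sr, dtype=np.float32)
segments = segment_audio(dummy, sr)
print(len(segments))  # → 3
```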
Among these features, MFCCs are one of the most widely used representations of sound. Using a parameter set based on the mel-frequency cepstrum, Davis & Mermelstein (1980) demonstrated the superior performance of MFCCs in recognizing short-term audio spectra. Among hand-crafted features for voiceprint analysis, MFCCs stand out, and they have been applied in research ranging from speech recognition to bridge health monitoring (Lin et al., 2014; Mei et al., 2019). As described by Logan (2000), the input sound signal passes through the following steps: pre-emphasis, framing, windowing, the fast Fourier transform (FFT), mel-frequency warping, computation of the filter-bank energies and, finally, the discrete cosine transform (DCT).
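The steps listed by Logan (2000) can be sketched compactly in Python with NumPy and SciPy. This is a minimal illustration of the standard MFCC pipeline, not the study's R implementation; the frame lengths, filter count and pre-emphasis coefficient below are conventional defaults assumed by us:

```python
import numpy as np
from scipy.fft import dct  # type-II DCT for the final cepstral step

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr, n_mfcc=13, frame_len=0.025, frame_step=0.01,
         n_fft=512, n_filters=26):
    # 1. Pre-emphasis: boost the high-frequency content.
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # 2. Framing into short overlapping windows.
    flen, fstep = int(frame_len * sr), int(frame_step * sr)
    n_frames = 1 + (len(emphasized) - flen) // fstep
    idx = np.arange(flen)[None, :] + fstep * np.arange(n_frames)[:, None]
    frames = emphasized[idx]
    # 3. Windowing (Hamming) to reduce spectral leakage.
    frames = frames * np.hamming(flen)
    # 4. FFT and power spectrum.
    power = (np.abs(np.fft.rfft(frames, n_fft)) ** 2) / n_fft
    # 5. Mel-frequency warping: triangular filter bank on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    log_energies = np.log(power @ fbank.T + 1e-10)
    # 6. DCT decorrelates the log filter-bank energies; keep n_mfcc coefficients.
    return dct(log_energies, type=2, axis=1, norm='ortho')[:, :n_mfcc]

sr = 22000
tone = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)  # 1 s of a 440 Hz tone
coeffs = mfcc(tone, sr)
print(coeffs.shape)  # → (98, 13): 98 frames, 13 MFCCs per frame
```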
Before building the ML models, random forest (RF) was used to estimate feature performance and the importance of the different features in modeling. Of the 10-s samples, 80% were randomly chosen as the training group, and the remaining samples were used as the test group. The features of the audio samples were labeled PS, SA or SE. We built a binary classification model with RF from the characteristics of the audio samples in the training group. With the trained RF model, we evaluated the importance of every predictor variable by the mean decrease in accuracy (MDA) and the mean decrease in Gini (MDG) (Calle & Urrea, 2011). For subsequent calculations, we chose MDA as the indicator of feature importance. After unimportant features were discarded, the remaining features were used to train and evaluate the subsequent ML models.
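As an illustrative sketch (not the study's actual pipeline), an MDA-style importance screen can be reproduced with scikit-learn: permutation importance corresponds to the mean decrease in accuracy, while the impurity-based importance stored on the fitted model corresponds to the mean decrease in Gini. The synthetic data below merely stand in for the 25 audio features (13 MFCCs + 12 CVs):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for labeled audio-feature vectors (25 features).
X, y = make_classification(n_samples=600, n_features=25, n_informative=8,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# MDA analog: permute each feature and measure the drop in test accuracy.
mda = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
# MDG analog: impurity-based importance from the fitted forest.
mdg = rf.feature_importances_

# Keep only features whose permutation actually hurts accuracy.
keep = np.where(mda.importances_mean > 0)[0]
print(len(keep), "of", X.shape[1], "features retained")
```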
Building Models
The k-nearest neighbor (KNN) algorithm is a basic classification and regression method (Cover & Hart, 1967; Tan et al., 2006). In December 2006, KNN was identified as one of the top 10 data mining algorithms at the IEEE International Conference on Data Mining (ICDM) (Wu et al., 2008). The input of KNN is a test sample together with the training dataset, and the output is the category of the test sample. At test time, the distance between the test sample and all training samples is calculated, and the class is predicted by a majority vote among the K nearest training samples. The three elements of KNN are the distance metric, the choice of k and the classification rule. Recently, improved algorithms based on KNN have been used in a variety of studies, such as clustering of large-scale data and human activity recognition (Chen et al., 2019; Tan et al., 2021). In this paper, KNN was used as a supervised learning model to classify the data.
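A minimal KNN classifier of this kind can be sketched with scikit-learn; the synthetic features, the choice of k = 5 and the Euclidean metric below are illustrative assumptions, not the study's reported settings:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for labeled audio-feature vectors.
X, y = make_classification(n_samples=400, n_features=25, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

# Predict each test sample by majority vote among its 5 nearest
# (Euclidean-distance) neighbors in the training set.
knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean').fit(X_tr, y_tr)
acc = knn.score(X_te, y_te)
print(round(acc, 3))
```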
RF is a combination of tree predictors in which each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest (Breiman, 2001). Each decision tree is itself a classifier, so an input sample receives as many classification results as there are trees in the RF. The RF aggregates all of these results and outputs the category with the most votes. Because of this property, RF has been widely used in various research fields, such as computational toxicology (Mistry et al., 2016) and visual image classification (Xu Y. et al., 2018). Building on RF, Xia et al. (2018) proposed a method for detecting acoustic events using contextual information and bottleneck features. In this study, the trained RF model categorized each audio sample in the test group into the category it most likely belongs to.
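The voting mechanism can be made visible in scikit-learn by querying the individual trees of a fitted forest. One caveat: scikit-learn's `predict` averages per-tree class probabilities rather than counting hard votes, but for a fully grown forest on its own training data the two views agree; the data here are synthetic placeholders:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=2)
rf = RandomForestClassifier(n_estimators=50, random_state=2).fit(X, y)

sample = X[:1]
# Each of the 50 trees casts one vote for the sample's class...
votes = np.array([tree.predict(sample)[0] for tree in rf.estimators_])
majority = int(np.bincount(votes.astype(int)).argmax())
# ...and the forest outputs the class with the most support.
print(majority, int(rf.predict(sample)[0]))
```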
SVM is a classification technique based on the optimal margin in ML. It was proposed as a training algorithm that maximizes the margin between the training patterns and the decision boundary (Boser et al., 1992; Cortes & Vapnik, 1995). Owing to its strong performance, SVM is widely applied in many fields, such as intelligent monitoring, human-computer interaction and virtual reality (Yang & Gao, 2020). Anwar et al. (2019) showed that an SVM with a cubic kernel and MFCC features achieved approximately 96.7% accuracy for amateur drone detection. The trained SVM model can be regarded as a hyperplane that separates the samples into two classes while maximizing the margin to the nearest samples (Rai et al., 2016). In this research, three SVM models with three different kernel functions were used to classify beehive sounds.
The formula for the linear kernel:
$$K\left(x_{i},x_{j}\right)=x_{i}^{T}x_{j} \tag{1}$$
The formula for the polynomial kernel:
$$K\left(x_{i},x_{j}\right)=\left(x_{i}^{T}x_{j}\right)^{d} \tag{2}$$
where d is the power parameter.
The formula for the RBF kernel:
$$K\left(x_{i},x_{j}\right)=e^{-\gamma {\Vert x_{i}-x_{j}\Vert }^{2}} \tag{3}$$
where ${\Vert x_{i}-x_{j}\Vert }^{2}$ is the squared Euclidean distance between the two feature vectors, and γ is the width parameter of the RBF.
To improve the classification accuracy of the SVM, we optimized the parameters C and γ of the kernel functions. We evaluated candidate values for C (0.1, 1, 10, 100, 1000) and γ (0.01, 0.1, 1) across the different kernel models, then selected the pair (C, γ) with the highest accuracy for use in the SVM models.
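This parameter search can be sketched with scikit-learn's grid search over exactly the candidate values listed above; the RBF kernel and the synthetic data are illustrative choices on our part (the study searched all three kernels):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the audio-feature vectors.
X, y = make_classification(n_samples=300, n_features=25, random_state=3)

# Candidate grids for C and gamma, matching the values in the text.
param_grid = {'C': [0.1, 1, 10, 100, 1000], 'gamma': [0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5)
search.fit(X, y)

# The (C, gamma) pair with the highest cross-validated accuracy.
print(search.best_params_, round(search.best_score_, 3))
```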
In this study, SVM, RF and KNN were each used to establish a classification model, and the accuracies of the three models were compared. Finally, we chose the best model and used the sound data collected earlier to test the accuracy of identification under the different treatments.
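The model comparison can be sketched as a cross-validated accuracy comparison; the hyperparameters and synthetic data below are assumptions for illustration, not the tuned values from the study:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the labeled audio features.
X, y = make_classification(n_samples=400, n_features=25, random_state=4)

models = {
    'KNN': KNeighborsClassifier(n_neighbors=5),
    'RF': RandomForestClassifier(n_estimators=100, random_state=4),
    'SVM': SVC(kernel='rbf', C=10, gamma=0.1),
}

# 5-fold cross-validated accuracy for each candidate model.
scores = {name: cross_val_score(m, X, y, cv=5).mean()
          for name, m in models.items()}
best = max(scores, key=scores.get)
print(best, {k: round(v, 3) for k, v in scores.items()})
```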