Sound recording and data preparation
Species selection and sound sources. Bat echolocation calls are primarily composed of constant frequency (CF) and frequency modulated (FM) components, whereas social calls are composed of CF, FM, and noise-burst (NB) components. FM calls have short pulse durations and wide bandwidths, so they overlap with social calls less in time but more in frequency. In contrast, CF calls have long pulse durations and narrow bandwidths, so they overlap with social calls more in time but less in frequency. Given these varied overlapping patterns, we selected both CF bats (Rhinolophus ferrumequinum, Hipposideros armiger, and Rhinolophus pusillus) and FM bats (Vespertilio sinensis, Myotis macrodactylus, and Ia io) to test the separation capabilities of the proposed network; including six different species also allowed us to test the generalizability of the method.
Source sound files for V. sinensis, M. macrodactylus, R. ferrumequinum, R. pusillus, and H. armiger were collected from previous studies in our lab (Appendix S1). Sound files for Ia io were selected from unpublished data as follows. Bats captured in the field were housed in a husbandry room with abundant food and fresh water. During each sound recording experiment, 4–5 bats were transferred to a temporary cage. Sound recordings were collected using an Avisoft UltraSoundGate 116H (Avisoft Bioacoustics, Berlin, Germany) with a condenser ultrasound microphone (CM16/CMPA, Avisoft Bioacoustics). The sampling frequency was set to 375 kHz at 16 bits. Recording sessions were conducted over five nights, each beginning at 18:00 and ending at 06:00 the following morning, to acquire a sufficient number of recordings. Appendix S1 shows sample numbers and locations for the bats, as well as the total duration of the sound files selected for the study.
Sound analysis. The total duration of the recorded sound files (i.e., original recording files) used for each bat species is shown in Appendix S1. We employed Avisoft-SASLab Pro (Version 5.2.12, Avisoft Bioacoustics, Berlin, Germany) to manually identify non-overlapping and overlapping syllables in echolocation and communication calls. These syllables and calls were described and classified following the nomenclature developed by Kanwal, Matsumura [29] and Ma, Kobayasi [30]. The recorded non-overlapping calls were used to prepare training files for each call type, and the recorded overlapping calls were used for separation.
Data preparation. Supervised machine learning algorithms use training samples to “learn” how to complete a task. The training phase in this study involved preparing clear, non-overlapping echolocation and communication calls selected from the original recordings. From these, the BLSTM network learned the features of both call types.
Training samples consisted of randomly selected non-overlapping syllables from the echolocation and communication calls of each bat species (in the original recordings), with signal-to-noise ratios (SNRs) above −20 dB. The echolocation training files contained 1,300–6,240 pulses and the communication training files contained 780–1,800 syllables (Appendix S1). Although the quantity of selected syllables varied among species, the data were sufficient for model training. Efforts were made to include roughly equivalent quantities of each syllable type. Time intervals between syllables in the training files were consistent with those of the original recordings. For each bat species, the training files for echolocation and communication calls were of equal length (Appendix S1).
Model training and call separation
Model structure and training stage. We developed a network with four BLSTM layers followed by one feedforward layer (Fig. 1). Each BLSTM layer comprised one forward and one backward basic LSTM layer, each wrapped with a dropout function (tensorflow.nn.rnn_cell.DropoutWrapper). Each BLSTM layer contained 300 hidden cells, and the feedforward layer mapped to the embedding dimension (i.e., a 3D matrix with depth N = 40 in this experiment). Stochastic gradient descent with a momentum of 0.9 and a fixed learning rate of 10⁻³ was used for training. The tanh activation function and the Adam optimizer were adopted to support adaptive learning rates and faster convergence. The structure and hyper-parameters of the model were designed based on the work of Hershey, Chen [20].
Fig. 1 The BLSTM model architecture and workflow graph.
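For concreteness, the following is a minimal sketch of this architecture in modern TensorFlow/Keras; it is not the original implementation (which used tensorflow.nn.rnn_cell), and the frequency-bin count (257, implied by the 512-point STFT described below), the dropout rate, and the fixed 100-frame input length are our assumptions.

```python
import tensorflow as tf

N_FREQ = 257    # frequency bins per frame (assumed: 512-point STFT -> 512//2 + 1)
EMBED_DIM = 40  # embedding dimension N from the text
SEG_LEN = 100   # frames per input segment, as in the text

def build_blstm_embedder(seg_len=SEG_LEN, n_freq=N_FREQ,
                         embed_dim=EMBED_DIM, hidden=300,
                         n_layers=4, dropout=0.5):
    """Map a (frames, freq) log-magnitude spectrogram segment to one
    embed_dim-dimensional embedding vector per time-frequency bin."""
    inputs = tf.keras.Input(shape=(seg_len, n_freq))
    x = inputs
    for _ in range(n_layers):  # four BLSTM layers, 300 cells each
        x = tf.keras.layers.Bidirectional(
            tf.keras.layers.LSTM(hidden, return_sequences=True,
                                 dropout=dropout))(x)
    # feedforward layer with tanh: one embedding per T-F bin
    x = tf.keras.layers.Dense(n_freq * embed_dim, activation="tanh")(x)
    embeddings = tf.keras.layers.Reshape((seg_len, n_freq, embed_dim))(x)
    # unit-normalize each embedding, as in deep clustering [20]
    embeddings = tf.keras.layers.Lambda(
        lambda v: tf.math.l2_normalize(v, axis=-1))(embeddings)
    return tf.keras.Model(inputs, embeddings)

model = build_blstm_embedder()
```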
The model was trained using the files for one bat species in each trial. Echolocation and communication call training files were loaded using the librosa (version 0.6.2) Python package. Frames from the two sound files were read and added together to create sound mixtures. Sound features used for training (log spectral magnitudes) were extracted from this mixture. The extraction process was completed using a short-time Fourier transform (STFT) with a Hamming window (length of 512 and shift of 256).
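As a sketch of this step, the mixing and feature extraction described above might look as follows with librosa; the file names are illustrative, and the small epsilon added before taking the logarithm is our assumption.

```python
import librosa
import numpy as np

# Load one echolocation and one communication training file
# (file names are illustrative).
echo, sr = librosa.load("echolocation_train.wav", sr=None)
comm, _ = librosa.load("communication_train.wav", sr=sr)

# Trim to a common length and add the frames to create the mixture.
n = min(len(echo), len(comm))
mixture = echo[:n] + comm[:n]

# STFT with a Hamming window, length 512 and shift 256, as in the text.
stft = librosa.stft(mixture, n_fft=512, hop_length=256, window="hamming")
log_mag = np.log(np.abs(stft) + 1e-8)  # log spectral magnitudes (features)
phase = np.angle(stft)                 # phase, kept for reconstruction
```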
The mixture from each bat species was then segmented into 100-frame samples, all of which were divided into a training set and a validation set using a ratio of 2:1 (see Appendix S1 for detailed sample quantities). The training set, validation set, and indicator labels were combined and input to the model. The validation set was used to optimize tuning parameters and evaluate call separation performance. Indicator labels were set to 0 or 1, representing the two types of calls in the mixture. Ideal binary masks were used to train the network and gradients were calculated using shuffled mini-batches (batch size of 128) from larger segments.
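The segmentation and label construction can be sketched as follows. The dominance rule used to set each indicator label (comparing the two source magnitudes per bin) is the standard ideal-binary-mask construction and is an assumption here, as is dropping the final partial segment.

```python
import numpy as np

def ideal_binary_mask(echo_mag, comm_mag):
    """Label each T-F bin by its dominant source: 1 for echolocation,
    0 for communication (the indicator labels from the text)."""
    return (echo_mag >= comm_mag).astype(np.float32)

def segment_frames(features, seg_len=100):
    """Cut a (freq, time) feature matrix into (time, freq) segments of
    seg_len frames each, dropping the final partial segment."""
    n_seg = features.shape[1] // seg_len
    segs = [features[:, i * seg_len:(i + 1) * seg_len].T
            for i in range(n_seg)]
    return np.stack(segs)  # shape: (n_segments, seg_len, n_freq)

def split_2_to_1(segments):
    """2:1 training/validation split over segments, as in the text."""
    cut = len(segments) * 2 // 3
    return segments[:cut], segments[cut:]
```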
The output of this model was a set of embeddings encoding learned features of both echolocation and communication calls. In this framework, the deep network assigned an embedding vector to each time-frequency bin in the spectrogram. The network was trained to minimize the distance between embeddings of bins dominated by the same call type while maximizing the distance between embeddings of bins dominated by different call types. The output was compared with the validation set and indicator labels to calculate the loss, which was backpropagated from the output to the input through each layer. Model weights and parameters were then updated based on the calculated loss, and training was completed after a sufficient number of epochs.
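This objective is the deep clustering loss of Hershey, Chen [20], ||VVᵀ − YYᵀ||²_F, where V holds the embeddings and Y the indicator labels. A sketch using the standard low-rank trace identity (which avoids forming the full TF × TF affinity matrices) is given below; batching details are our assumption.

```python
import tensorflow as tf

def deep_clustering_loss(embeddings, labels):
    """Deep clustering objective ||V V^T - Y Y^T||_F^2 [20].

    embeddings: (batch, T*F, D) unit-length embedding vectors V
    labels:     (batch, T*F, C) one-hot indicator labels Y
                (C = 2 call types here)
    """
    v, y = embeddings, labels
    vtv = tf.matmul(v, v, transpose_a=True)  # (batch, D, D)
    vty = tf.matmul(v, y, transpose_a=True)  # (batch, D, C)
    yty = tf.matmul(y, y, transpose_a=True)  # (batch, C, C)
    # ||VV^T||^2 - 2 ||V^T Y||^2 + ||YY^T||^2, via tr(AB) identities
    return (tf.reduce_sum(vtv ** 2)
            - 2.0 * tf.reduce_sum(vty ** 2)
            + tf.reduce_sum(yty ** 2))
```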
Separation stage. In this stage, overlapping echolocation and communication calls were randomly selected from the original recordings to create test sound files for separation. The log spectral magnitudes of the overlapping calls were extracted, combined into samples, and input to the trained model. The phases extracted from the sound files were also saved for use in sound reconstruction. The trained model then output embeddings for each segment (100 frames), in a process similar to the training stage. Embeddings were clustered using the k-means method from Scikit-learn (Version 0.20.0) to produce time-frequency masks. The number of clusters corresponded to the number of call types in the mixture (two: echolocation and communication). These masks were then used to determine which parts of each segment of the overlapping calls would be preserved or neglected, based on their correspondence to each call type. For example, if the maximum magnitudes in a bin were more likely to belong to echolocation calls, the related mask values were set to 1 and the others to 0, allowing the echolocation calls to be separated correctly. Finally, output calls were reconstructed using the inverse fast Fourier transform (IFFT) function numpy.fft.ifft in NumPy (Version 1.15.1). The IFFT transformed the magnitudes into waveforms using the phase information saved at the beginning of the separation stage. The model produced two waveform files, each containing one call type. Additional detail concerning the sound separation algorithm can be found in the work of Hershey, Chen [20].
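A minimal sketch of the clustering and masking step follows. For brevity it reconstructs waveforms with librosa.istft (overlap-add) rather than the frame-wise numpy.fft.ifft routine described in the text, and the helper name and argument shapes are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
import librosa

def separate_segment(embeddings, mix_mag, mix_phase, n_sources=2):
    """Cluster per-bin embeddings into n_sources call types and rebuild
    each source from the mixture magnitude and the saved phase.

    embeddings: (n_frames, n_freq, D) from the trained model
    mix_mag, mix_phase: (n_freq, n_frames) mixture magnitude and phase
    """
    t, f, d = embeddings.shape
    flat = embeddings.reshape(-1, d)

    # k-means with as many clusters as call types in the mixture
    labels = KMeans(n_clusters=n_sources).fit_predict(flat)
    labels = labels.reshape(t, f).T  # back to (n_freq, n_frames)

    sources = []
    for k in range(n_sources):
        mask = (labels == k).astype(float)      # 1 = preserve, 0 = neglect
        spec = mix_mag * mask * np.exp(1j * mix_phase)
        # overlap-add inverse STFT in place of frame-wise numpy.fft.ifft
        sources.append(librosa.istft(spec, hop_length=256,
                                     window="hamming"))
    return sources
```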
Model evaluation
The quality of the reconstructed echolocation and communication calls was assessed by comparing their temporal-spectral parameters to those of non-overlapping calls selected from the original recording files (excluding training data). Avisoft-SASLab Pro was used for automatic measurement of duration, bandwidth, peak frequency, minimum frequency, maximum frequency, starting frequency, and ending frequency. A t-SNE (t-distributed stochastic neighbour embedding, implemented in R 3.6.1) analysis was adopted for dimensionality reduction: two dimensions were extracted from these seven parameters for original and separated syllables and compared using one-way ANOVA (aov in R 3.6.1) or two-sided Wilcoxon signed-rank tests (wilcox.test in R 3.6.1), depending on whether the data fit a normal distribution. The significance level was set to 0.05 for all tests. We adopted the root mean square error (RMSE) to quantify differences between reconstructed and original calls without obscuring individual variation. Clustering analysis was conducted using the reconstructed echolocation calls of the six bat species to assess whether the separated calls could be further used in species classification.
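For illustration, the RMSE computation and the two-dimensional reduction could be sketched in Python as follows, using scikit-learn's TSNE in place of the R implementation used in the study; function names and the input layout (one row of seven parameters per syllable) are illustrative.

```python
import numpy as np
from sklearn.manifold import TSNE

def rmse(original, reconstructed):
    """Root mean square error between matched parameter vectors
    (e.g., the seven acoustic parameters of paired syllables)."""
    original = np.asarray(original, dtype=float)
    reconstructed = np.asarray(reconstructed, dtype=float)
    return np.sqrt(np.mean((original - reconstructed) ** 2))

def embed_2d(params):
    """Reduce the seven parameters to two t-SNE dimensions, analogous
    to the R-based analysis described in the text."""
    return TSNE(n_components=2).fit_transform(np.asarray(params))
```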