Recurrent Convolutional Neural Networks for large scale Bird species classification

We present a deep learning approach towards the large-scale prediction and analysis of bird acoustics from 100 different bird species. We use spectrograms constructed from bird audio recordings in the Cornell Bird Challenge (CBC) dataset, which includes recordings with background noise and with multiple, potentially overlapping bird vocalizations per recording. Our experiments show that a hybrid modeling approach, which uses a Convolutional Neural Network (CNN) to learn the representation of a slice of the spectrogram and a Recurrent Neural Network (RNN) to combine the representations across time-points, leads to the most accurate model on this dataset. We show results on a spectrum of models ranging from stand-alone CNNs to hybrid models of various types obtained by combining CNNs with CNNs or RNNs of the following types: Long Short-Term Memory (LSTM) networks, Gated Recurrent Units (GRU) and Legendre Memory Units (LMU). The best performing model achieves an average accuracy of 67% over the 100 different bird species, with the highest accuracy of 90% for the Red Crossbill. We further analyze the learned representations visually and find them to be intuitive: related bird species are clustered close together. We present a novel way to empirically interpret the representations learned by the LMU-based hybrid model, which shows how memory channel patterns change over time with the spectrograms.


Introduction
Recent reports of shrinking bird populations worldwide 1, 2 have emphasized the importance of monitoring wild bird populations and protecting biodiversity. In response to this need, automated audio recorders enable systematic recordings of environmental sounds and have recently opened new opportunities for ecological research and conservation practices. As many bird species are highly vocal, bioacoustics has become one of the ideal ways to study them. Passive acoustic monitoring (PAM) of biological sounds can provide long-term and standardized data on the composition and dynamics of animal communities. Many bird species produce clear and consistent sounds, which makes acoustic surveys a reliable method to estimate the abundance, density, and occupancy of species 3, 4. Further, visual monitoring is difficult for many small and elusive birds, for cryptic species 5, and for species found in ecosystems that are difficult for ecologists to reach 6. Acoustic monitoring of birds is also helpful for other conservation activities, such as measuring forest restoration 7 and studying the impact of wildfires 8.
With the increasing volume of available audio recordings and the development of machine learning algorithms, autonomous classification of animal sounds has recently attracted wide interest. Before deep learning gained widespread popularity, prior work focused on extracting features from raw audio recordings and then applying classification models such as Hidden Markov Models 9, 10, Random Forests 11, and Support Vector Machines 12. While these methods demonstrated the successful use of machine learning approaches, their major limitation is that most of the features need to be manually identified by a domain expert in order to make patterns visible to the learning algorithms. In comparison, deep learning algorithms learn high-level features from the data in an incremental manner, which reduces the need for domain expertise and laborious manual feature extraction. The multiple layers of a neural network organize the data into a hierarchy of concepts, and the network is trained by correcting its own prediction errors rather than by human intervention.
The use of deep learning for sound detection has spanned multiple domains, ranging from music classification to animal classification and detection (for example, marine and avian species). Most of the related call detection and species classification works in bioacoustics use Convolutional Neural Networks (CNN) to classify spectrograms or mel-spectrograms extracted from raw audio clips. These works achieved great success: the deep learning models reached high classification accuracy when detecting the presence or absence of calls from a particular species, or when classifying calls from multiple species. While transforming the raw audio into a spectrogram and then treating the problem as an image classification task works well, it does not take into consideration the underlying temporal dependence of the species calls. It is worth noting that, unlike images of real objects, the x- and y-axes of spectrograms have specific meanings (time and frequency, respectively, see Figure 1), and the time component embedded in the acoustic data contains important information for the corresponding classification tasks. Besides, some commonly used data augmentation techniques for image classification, such as rotation and flipping, may not make intuitive sense when applied to spectrograms generated from acoustic data.
Our work makes the following contributions: (1) we propose a hybrid deep learning model that combines the benefits of convolutional and recurrent neural network models, capturing both the spatial and the temporal dependence of the bioacoustics data; (2) our models achieve better performance than the ImageNet-based models that have been popular in prior work; (3) our models have 7 times fewer parameters than stand-alone convolutional neural networks such as VGG16; (4) we present a novel empirical way to interpret the memory channels of the temporal component of our model; (5) we present a way for ecologists to visualize the learned representations of different bird species.
Figure 1. The raw audio signal is transformed using the Fourier transform into a mel-spectrogram image. The frequency on the y-axis is in the mel scale.

Dataset
We use the bird call classification 'Cornell Bird Challenge' (CBC) dataset 13, along with its extension, which consists of a total of 264 bird species with around 9 to 1778 audio samples per species. For the challenge, CBC obtained the data from xeno-canto.org. The raw audio samples vary in length from 5 seconds to 2 minutes. Since some classes have very few samples, we take 100 classes of birds by picking the classes with the highest numbers of samples, ensuring that each class has at least 100 samples and that the classes are close to balanced. Because the audio samples vary in length, we use a fixed-length input: the first 7 sec of each audio clip. Clips shorter than 7 sec are ignored, resulting in a total of 15,032 samples across the 100 classes. We settled on the heuristic of taking the first 7 sec based on the data curation criterion used by xeno-canto.org, which requests bird audio contributors to trim the non-focal sounds and ensure that the specific bird species (focal sound) is heard within the first few seconds of the audio. For the purpose of training machine learning models, we split the dataset into 80% training, 10% validation, and 10% test examples. The raw audio clips are transformed to a mel-spectrogram based representation (see Figure 1 and Methods) using the librosa 14 package.
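The class selection and split described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the shuffling seed, and the tie-breaking among equally sized classes are assumptions, and the "close to balanced" criterion is not implemented.

```python
import random
from collections import Counter


def select_classes(labels, n_classes=100, min_samples=100):
    """Pick the n_classes most frequent labels that have at least min_samples clips."""
    counts = Counter(labels)
    eligible = [c for c, n in counts.most_common() if n >= min_samples]
    return eligible[:n_classes]


def split_indices(n, train=0.8, val=0.1, seed=0):
    """Shuffle indices 0..n-1 and cut them into train/validation/test parts.

    The seed is a hypothetical choice; the remainder after the train and
    validation cuts goes to the test part.
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(train * n)
    n_val = int(val * n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


if __name__ == "__main__":
    # Toy labels: class "a" is large enough, class "b" is too small.
    toy = ["a"] * 120 + ["b"] * 30
    print(select_classes(toy, n_classes=2))
    tr, va, te = split_indices(15032)
    print(len(tr), len(va), len(te))
```

A stratified split (one split per class) would keep the class proportions identical across the three parts; the plain shuffle above is only the simplest variant.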

Comparing Models
We train several variants of hybrid models and compare their average test accuracy, obtained with 5-fold cross-validation, to that of baseline models. Specifically, we compare (i) the ImageNet models VGG16 15 and ResNets 16 trained on a single spectrogram of the entire audio clip, which we term 'stand-alone' models, and (ii) hybrid models that slide a window over the raw audio and take the spectrogram of each slide as input, using a convolutional neural network (CNN) for representation and either a CNN or a recurrent neural network (RNN) for temporal correlation (see Methods). In Table 1 we show the test accuracy for stand-alone models as well as hybrid models. For the definitions of CNN and TCNN see section Methods. The ImageNet-based (stand-alone) models lag behind the hybrid models in test accuracy, which shows that explicitly using the temporal component in the models helps bird sound classification. We draw the following conclusions from the results in Table 1. For most of the models, one or two layers result in the best performance across all RNNs. Overall, the temporal block with Gated Recurrent Unit (GRU) units achieves the best accuracy, while using GRU and Legendre Memory Units (LMU) together gives a similar accuracy to the best model but with fewer trainable parameters. We discuss the number of trainable parameters of each model later in this section.
In Table 1 we compared the test accuracy of the models, which gives us information about the prediction only, i.e., the maximum value of the softmax outputs. Now, we compare the softmax distributions of the models in Figure 2. The vectors obtained from the softmax outputs of each model are projected along the two dimensions with the maximum variance by performing Principal Component Analysis (PCA). We observe that the hybrid models with CNNs for both the representation and temporal components are clustered together with the stand-alone models, and are different from the hybrid models that use RNNs for the temporal component. The hybrid models with RNNs that have gating mechanisms, like Long Short-Term Memory networks (LSTM) and GRU, are very close to each other in the PCA plot. The hybrid models with LMU are clustered together and are away from LSTM and GRU. For reference, we also show the two corner cases of (i) 'true', which is the actual one-hot label of the test samples, and (ii) 'random', which assigns equal probability to all the classes.
Figure 3 (caption, partially recovered). For the models compared (from Table 1), the percentage of classes for which each model has prediction accuracy in the given shaded brackets.
For the different models, we also show the model complexity in terms of total trainable parameters in Table 2. We conclude that on the CBC dataset, the stand-alone ImageNet-based models with more trainable parameters do not deliver higher test accuracy. The hybrid models offer the dual advantage of lower model complexity and higher test classification accuracy. Next, in Figure 3, we compare the class-wise prediction accuracy of the best stand-alone model (VGG16) and the best GRU, LMU model from Table 1. We see that the GRU, LMU model has more classes in the higher prediction accuracy bands than VGG16.
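The class-wise comparison can be reproduced in a few lines once true and predicted labels are available. The sketch below is illustrative only: the band edges are hypothetical (the brackets used in Figure 3 are not stated in the text), and the helper names are not from the original work.

```python
from collections import defaultdict


def classwise_accuracy(y_true, y_pred):
    """Return {class: fraction of that class's samples predicted correctly}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return {c: correct[c] / total[c] for c in total}


def band_percentages(acc_by_class, edges=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Percentage of classes whose accuracy falls in each [edges[i], edges[i+1]) band.

    The last band is closed on the right so that an accuracy of exactly 1.0
    is counted. The default edges are an assumption for illustration.
    """
    counts = [0] * (len(edges) - 1)
    for acc in acc_by_class.values():
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= acc < edges[i + 1] or (is_last and acc == edges[-1]):
                counts[i] += 1
                break
    n = len(acc_by_class)
    return [100.0 * c / n for c in counts]


if __name__ == "__main__":
    y_true = [0, 0, 1, 1, 2, 2]
    y_pred = [0, 0, 1, 0, 1, 1]
    acc = classwise_accuracy(y_true, y_pred)
    print(acc)
    print(band_percentages(acc))
```

Running the same two functions on the predictions of two models gives the per-band percentages that a plot like Figure 3 compares side by side.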

Visualizing the learned representations
We now analyze the representations learned by the trained models for different bird species. For each audio sample, we obtain the representation by taking the output of the penultimate layer of the model, and in Figure 4 we show the t-SNE embeddings in two dimensions for 1522 test samples over 30 bird species. The 30 bird species with the largest numbers of samples are picked from the total of 100 species. The embeddings for two different models, CNN3+(LMU, GRU), with a hidden size of 512 are shown in the left and right plots, respectively. For both models, we see that bird species like Red Crossbill, Northern Raven and House Sparrow that have distinct calls appear in tight-knit clusters (for bird codes see Supplementary Table 3, and for further related information we refer to 17). Some species are projected close together by both methods due to their similar calls, whereas Carolina Wren and Bewick's Wren are farther apart.
Using the embedding plot, we can further identify the clustered species and the species that are close to each other, which could provide insights to bird ecologists. The complete embedding plots with all the species are provided in the Supplementary materials.

Analyzing Memory
Deep learning models like the ones we have seen in the previous section deliver good performance, but understanding their mechanism, i.e., interpreting what the models have learned, is still difficult. The gating mechanisms employed in LSTM and GRU are difficult to interpret with respect to how they act upon different input signals such as sounds. On the other hand, an RNN like the LMU is based on an entirely different machinery: it employs a state-space model and updates the memory channels using the dynamical equation (3), with the matrices A, B in (3) constructed using Legendre polynomials. A more intuitive interpretation of the LMU memory mechanism is that the memory equation (3) projects the entire input signal history onto a fixed number of orthogonal Legendre polynomials 18 in an online fashion. The projection is made at each time-step, and the dynamical equation in (3) is used to avoid recomputing the projections from scratch (see Methods). We demonstrate this projection behavior of the LMU in Figure 5. We see in Figure 5a that the trained LMU model starts to populate the memory channels upon the first arrival of the pulse in the spectrogram. At later time points, the memory channel values are transformed to register the signal history. In Figure 5b, we demonstrate this behavior by simulating a pulse input and projecting the signal history at any time t onto 64 orthogonal Legendre polynomials, but without using the dynamical equation (3). Before the t=2 sec time-point, the projections are zero as there is no signal history. We then see the patterns of memory channels (similar to Figure 5a) as the pulse arrives. A similar behavior is shown for the bird spectrogram with two pulses in Figure 5c and a simulated version of two pulses in Figure 5d. We see that the arrival of the second pulse changes the evolution pattern of the memory channels. The LMU memory channel values over time are compared for three different bird species samples in Figure 6.
We see that, irrespective of the bird species, the memory starts populating when significant energy in the spectrogram is first detected. Some misalignment exists between the beginning of the spectrogram pulses and the corresponding response in the memory channels due to the granularity of the chosen stride parameters (W_s, H_s) (see Methods for more details). We draw the following two conclusions. (i) For a pulse-like behavior, where the spectrogram has energy concentrated in a short time duration, the memory channels fade in a smooth fashion, as we see in Figure 6(b), while for spectrograms with energy spread out in time, we see more frequent changes in the memory channels, with circular patterns, in Figure 6(a). (ii) Compared to the double pulse example in Figure 5(c), where the spectrogram has energy in a narrow frequency range of 6-7 kHz, the cases where the energy is scattered over a wider range, 4-9 kHz in Figure 6(b) and 8-10 kHz in Figure 6(c), show a different response in the memory channels.
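The qualitative behavior described in this section (memory channels stay at zero until a pulse arrives, then evolve to register the signal history) can be illustrated with a small pure-Python simulation. This is a sketch under stated assumptions, not the trained model: it uses the continuous-time LMU matrices from the original LMU work 22, a simple forward-Euler discretization, only 4 memory channels, and hypothetical values for the window length theta, the time step, and the pulse timing.

```python
def lmu_matrices(d):
    """Continuous-time LMU state matrices (A, B) for memory order d."""
    A = [[(2 * i + 1) * (-1.0 if i < j else (-1.0) ** (i - j + 1))
          for j in range(d)] for i in range(d)]
    B = [(2 * i + 1) * (-1.0) ** i for i in range(d)]
    return A, B


def simulate_memory(u, d=4, theta=1.0, dt=0.001):
    """Integrate theta * dm/dt = A m + B u with forward Euler.

    Returns the list of memory vectors, one per input sample.
    theta and dt are illustrative choices, not values from the paper.
    """
    A, B = lmu_matrices(d)
    m = [0.0] * d
    history = []
    for u_t in u:
        dm = [sum(A[i][j] * m[j] for j in range(d)) + B[i] * u_t
              for i in range(d)]
        m = [m[i] + (dt / theta) * dm[i] for i in range(d)]
        history.append(m)
    return history


if __name__ == "__main__":
    # Silence for 2 s, a 0.1 s pulse, then silence again (dt = 1 ms).
    u = [0.0] * 2000 + [1.0] * 100 + [0.0] * 1900
    hist = simulate_memory(u)
    print("before pulse:", hist[1999])
    print("after pulse: ", hist[2100])
```

Plotting each channel of `hist` against time gives curves analogous in spirit to the simulated panels of Figure 5, although the paper's simulation uses 64 polynomials and direct projection rather than this recurrence.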

Spectrograms
The frequency transformation of a time-domain signal using mel-spectrograms has been shown in 20, 21 to work better than the short-time Fourier transform (STFT) and mel-frequency cepstral coefficients (MFCCs) 19. We compute mel-spectrograms using librosa 14 for the 7 sec clipped audio signals. The audio is re-sampled at 32 kHz and a total of 128 mel filter banks are used. The Fast Fourier Transform (FFT) length is taken to be 2048, and the hop length for computing the spectrogram is taken as 512.
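With the parameters above, the spectrogram computation might look like the following sketch. The librosa calls (`librosa.load`, `librosa.feature.melspectrogram`, `librosa.power_to_db`) exist with these keyword arguments, but the conversion to decibels and the function names are assumptions; the text does not state how the spectrogram values are scaled. The librosa import is kept inside the function so that the parameter arithmetic can be checked without the package installed.

```python
SR = 32_000      # re-sampling rate in Hz (from the text)
N_FFT = 2048     # FFT length (from the text)
HOP = 512        # hop length in samples (from the text)
N_MELS = 128     # number of mel filter banks (from the text)
CLIP_SEC = 7.0   # clip duration in seconds (from the text)


def expected_frames(n_samples, hop=HOP):
    """Number of spectrogram frames when the signal is centre-padded,
    which is librosa's default behaviour (center=True)."""
    return 1 + n_samples // hop


def mel_spectrogram_db(path):
    """Load the first 7 s of an audio file and return a (128, frames) array.

    The dB scaling relative to the maximum is an assumption, not a detail
    given in the paper.
    """
    import numpy as np
    import librosa

    y, sr = librosa.load(path, sr=SR, duration=CLIP_SEC)
    S = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=N_FFT, hop_length=HOP, n_mels=N_MELS
    )
    return librosa.power_to_db(S, ref=np.max)


if __name__ == "__main__":
    n = int(SR * CLIP_SEC)
    print("samples per clip:", n)
    print("frames per clip:", expected_frames(n))
    print("frequency resolution (Hz per FFT bin):", SR / N_FFT)
```

For a full 7 sec clip this yields a 128-row spectrogram; clips shorter than 7 sec would produce fewer frames, which is consistent with the dataset section discarding them.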

Models
Stand-alone: The ImageNet models, for example VGG16 and ResNet, are used as classifiers with spectrograms as the 2-dimensional input. The number of neurons in the final layer is set to the number of classes in the dataset. For CBC, since we are taking 100 classes, the output layer has 100 neurons. Hybrid: The hybrid models use a sliding window mechanism for the input. The raw audio clip is traversed via a sliding window of length W_s and hop length H_s. Each hop of the window results in a clipped audio of length W_s, which is transformed to the frequency domain using mel-spectrograms. The values of (W_s, H_s) used in this work are (500, 250) msec. For a 7-second audio clip, a total of 26 slides are made with these values of W_s, H_s. After the input, the hybrid models have three parts: (i) Representation, (ii) Temporal correlation, and (iii) Classification. The representation block uses a CNN to generate representative features from the input slides. After concatenating the representative feature vectors from multiple slides, the resulting 2-dimensional array is used as the input to the Temporal correlation block. The schematic for the hybrid models is shown in Figure 7. The output from the temporal correlation block is fed to the final classification block to produce the softmax outputs.
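The sliding-window step can be sketched as below. One detail is an assumption: counting every full window, including the one that ends exactly at 7 sec, would give 27 slides, so the sketch uses an exclusive end (start positions strictly below clip length minus W_s) to match the 26 slides stated above. The function names are illustrative.

```python
def window_starts_ms(clip_ms=7000, win_ms=500, hop_ms=250):
    """Start times (in ms) of the sliding windows.

    The end of the range is exclusive, which reproduces the 26 slides
    reported for a 7 s clip; this boundary convention is an assumption.
    """
    return list(range(0, clip_ms - win_ms, hop_ms))


def slice_audio(samples, sr, win_ms=500, hop_ms=250):
    """Cut a sequence of audio samples into overlapping windows of win_ms."""
    clip_ms = 1000 * len(samples) // sr
    win = sr * win_ms // 1000
    slices = []
    for start_ms in window_starts_ms(clip_ms, win_ms, hop_ms):
        start = sr * start_ms // 1000
        slices.append(samples[start:start + win])
    return slices


if __name__ == "__main__":
    sr = 32_000
    clip = [0.0] * (7 * sr)          # placeholder for a 7 s waveform
    slides = slice_audio(clip, sr)
    print(len(slides), "slides of", len(slides[0]), "samples each")
```

Each slide would then be converted to its own mel-spectrogram and passed through the representation CNN, giving one feature vector per slide for the temporal block.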

Every convolution filter layer is followed by a batch normalization layer and a ReLU operation. The MaxPool layers downsample by a factor of 2.
Temporal Models: The temporal block uses either a CNN (as shown in Figure 7A) or an RNN (as shown in Figure 7B). The hybrid models whose temporal block uses an RNN have three variations in this work, namely LSTM, GRU, and LMU. The LSTM uses a hidden state h and also maintains a cell state c. The recursive update equations for the LSTM are shown in (1). The GRU has a more compact gating mechanism than the LSTM and has two gates. The update equations for the GRU are stated in (2). The LMU uses a memory concept and updates the memory using projections onto Legendre polynomials. The update equations (as shown in (3)) are less expensive in terms of trainable parameters due to the fixed values of the Ā, B̄ matrices. We refer the reader to the original work 22 for more details.
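For readers without access to the numbered equations, the standard textbook forms of the three recurrent updates are sketched below. These follow the usual conventions in the literature (and, for the LMU, the original work 22); the symbols for the weights and encoders are assumptions and may differ from the notation of equations (1)-(3).

```latex
% LSTM: gates i_t, f_t, o_t, candidate g_t, cell state c_t, hidden state h_t
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i), \quad
f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f), \quad
o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o), \\
g_t &= \tanh(W_g x_t + U_g h_{t-1} + b_g), \quad
c_t = f_t \odot c_{t-1} + i_t \odot g_t, \quad
h_t = o_t \odot \tanh(c_t).
\end{aligned}

% GRU: update gate z_t and reset gate r_t (one common sign convention)
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z), \quad
r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r), \\
\tilde{h}_t &= \tanh\bigl(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h\bigr), \quad
h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t.
\end{aligned}

% LMU: scalar memory input u_t, memory m_t, hidden state h_t
\begin{aligned}
u_t &= e_x^{\top} x_t + e_h^{\top} h_{t-1} + e_m^{\top} m_{t-1}, \\
m_t &= \bar{A}\, m_{t-1} + \bar{B}\, u_t, \\
h_t &= f\bigl(W_x x_t + W_h h_{t-1} + W_m m_t\bigr).
\end{aligned}
```

In the LMU, the matrices \bar{A} and \bar{B} are discretized versions of fixed matrices built from Legendre polynomials, so only the encoders e and the weights W are trained, which is where the parameter saving comes from.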
Finally, the output of the temporal block is used as the input to the Classification block, which implements a fully-connected multi-layer perceptron (MLP). The classification block has one layer of 512 neurons with ReLU non-linearity followed by dropout (with probability 0.5), and an output layer with as many neurons as there are classes in the dataset. When the temporal block is an RNN, the outputs at all time-steps are summed before being fed to the classification block.

Analyzing Memory
The other interpretation of the LMU mechanism, apart from the state-space representation, is that it projects the memory onto a fixed set of orthogonal basis functions. Hence, the LMU works by the repeated projection of the entire history of hidden states h_t and the input x_t, t ≥ 0, onto a fixed number of Legendre polynomials. The Legendre polynomials are a class of orthogonal polynomials.
For a signal f(t), its projection along the mth degree Legendre polynomial is defined by (8). In Figure 5(b),(c) we use (8) to show the variation of the projection coefficients over time, with a maximum degree of 64. Directly evaluating the projections at each time-step t using (8) is not computationally feasible, especially when the time-horizon is large. However, due to the recurrence properties of the Legendre polynomials (5), (6), a dynamical equation like (3) can be constructed to update the projection coefficients recursively.
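The orthogonality property and the direct projection can be illustrated numerically. The sketch below uses the standard Legendre polynomials on [-1, 1], Bonnet's recurrence, and a midpoint-rule integral; the normalization (2m+1)/2 is the usual one for this interval. The paper's own definitions may instead use shifted polynomials over a sliding time window, so the interval, the normalization, and the function names here are assumptions.

```python
def legendre(n, x):
    """Evaluate the degree-n Legendre polynomial P_n(x) with Bonnet's recurrence:
    (k + 1) P_{k+1}(x) = (2k + 1) x P_k(x) - k P_{k-1}(x)."""
    if n == 0:
        return 1.0
    p_prev, p = 1.0, x
    for k in range(1, n):
        p_prev, p = p, ((2 * k + 1) * x * p - k * p_prev) / (k + 1)
    return p


def inner(f, g, n_pts=2000):
    """Midpoint-rule approximation of the integral of f(x) g(x) over [-1, 1]."""
    h = 2.0 / n_pts
    total = 0.0
    for k in range(n_pts):
        x = -1.0 + (k + 0.5) * h
        total += f(x) * g(x)
    return total * h


def projection_coeffs(f, max_degree):
    """Coefficients c_m = (2m + 1)/2 * <f, P_m> for m = 0..max_degree."""
    return [
        (2 * m + 1) / 2.0 * inner(f, lambda x, m=m: legendre(m, x))
        for m in range(max_degree + 1)
    ]


if __name__ == "__main__":
    p2 = lambda x: legendre(2, x)
    p3 = lambda x: legendre(3, x)
    print("<P2, P3> =", inner(p2, p3))   # close to 0 (orthogonality)
    print("<P3, P3> =", inner(p3, p3))   # close to 2/7
    # A rectangular pulse on [0.2, 0.4] projected onto degrees 0..8.
    pulse = lambda x: 1.0 if 0.2 <= x <= 0.4 else 0.0
    print(projection_coeffs(pulse, 8))
```

Recomputing `projection_coeffs` from scratch at every time-step, as done here, costs one integral per coefficient per step, which is the cost that the recursive memory update avoids.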

Related Work
During the past decade, deep convolutional neural network (CNN) architectures have demonstrated great potential in classification problems as well as other tasks, such as object detection and image segmentation. Some well-known CNN architectures include VGG16 15 , ResNet 16 , and DenseNet 23 , among others. These models can successfully extract complex features from the images and differentiate a high number of potentially similar classes, and have recently gathered popularity in the field of bioacoustics as well. For example, there are some works using CNN, either based on the well-known architectures or customized architectures, to detect and classify the presence of whale acoustics 24,25 , or classify calls from different bird species 26,27 .
Since CNN models usually include millions of parameters, training such a model typically requires a sufficiently large amount of data in order to achieve good performance. However, obtaining a manually labeled dataset in bioacoustics is a time-consuming and expensive endeavor, and it may also be very challenging to collect enough labeled data in practice, especially if a species rarely calls or if a species is rare. Given this scenario, some bioacoustics research works used other
The existing literature on recurrent and convolutional neural networks has extensively explored the classification task on sequence and time-series datasets. While not explicitly modeling the temporal dependencies, fully convolutional networks and ResNet architectures are shown to perform well for time-series classification in 33. Vanilla recurrent neural nets were designed to capture temporal dependencies in sequence data 34, 35. However, they suffer from vanishing/exploding gradients 36. As a remedy, more sophisticated recurrent units that implement a gating mechanism, such as the long short-term memory (LSTM) unit 37 and the gated recurrent unit (GRU) 38, were proposed in the literature. For the audio classification task, a gated Residual Networks model that integrates ResNet with a gate mechanism was shown to be promising 39. To efficiently handle temporal dependencies, the Legendre Memory Unit (LMU) was proposed as a novel memory cell for recurrent neural networks with theoretical guarantees for learning long-range dependencies 22, 40. It dynamically maintains information across long windows of time using relatively few resources by orthogonalizing its continuous-time history.
Hybrid models leverage the strengths of both convolutional and recurrent neural networks for learning from temporal or sequence data. They use convolutional layers to extract local patterns at each time-point and then couple the learned representations over multiple time-points using a recurrent component. Compared to models that use another CNN layer to aggregate the representations across time-steps, the use of a recurrent structure allows them to better capture long-term dependencies in the input. Various choices of recurrent components have been tried, such as LSTMs and GRUs. A one-dimensional CNN coupled with a GRU was proposed in 41; the authors of 42 use an LSTM coupled to a CNN for audio classification; and the authors of 43 develop a recurrent structure based on GRUs, with temporal skip connections that extend the temporal span of the information flow, for modeling multi-dimensional time-series. A variety of CNN and RNN models are explored in 44, where the superior performance of deep nets compared to some traditional machine learning models is demonstrated for the automatic detection of endangered mammal species based on spectrograms. Hybrid models have shown improvements in accuracy over the baseline CNN-only models on various sound detection tasks in the recent literature 45, 46. Further, for the task of music tagging, Choi et al. 47 show that their convolutional recurrent neural network (CRNN), which also involves a GRU, does better in terms of training time and number of parameters than the purely CNN-based prior architectures.

Conclusion
We have presented a comprehensive study of deep learning models on a large bird acoustics dataset, the Cornell Bird Challenge (CBC). Deep learning models offer high prediction capability and at the same time lead to a more automated pipeline. Although the ImageNet models are successful at image classification and are also applied to sound classification through spectrograms, they lack a temporal component. We found that for the sound dataset (CBC), hybrid models with an explicit temporal layer help. Compared to the ImageNet models, the hybrid models offered the two-fold advantage of reduced model size and higher test accuracy. We also found that larger models do not always result in better test accuracy. In the context of RNNs, in most cases one or two layers were sufficient and resulted in more accurate models. In addition to RNNs based on gating mechanisms, like Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU), we also present a novel hybrid model utilizing Legendre Memory Units (LMU). The LMU works on a different mechanism of orthogonalizing memory and offers the further advantages of long-range dependence and fewer model parameters. We have presented an empirical analysis of how LMU memory channels behave over time for different spectrogram inputs.
We have also analyzed, through the embedding plot, how the models represent sound samples of different bird species. We found that the samples of birds with distinct calls (for example, Red Crossbill, Northern Raven, etc.) form tightly packed clusters that are distant from each other. Some bird species with assorted calls are spread across the representations of other species.
The hybrid models with a built-in temporal layer have the additional requirement of a longer time sequence. For shorter time-series, learning dependencies across time components was found to be difficult with RNNs. We have also found that adding attention mechanisms to the hybrid models with RNNs does not help on the CBC dataset. Part of the reason could be that the location of the bird call in the input audio is very uncertain, even in the clipped version. In future work, we would extend the current models to detect multiple species of bird calls, and also apply the same analysis to different sound datasets, for example, marine animal detection.