In this section, we first compare the performance of our model on text-to-audio retrieval (Text->Audio) and audio-to-text retrieval (Audio->Text) on both the AudioCaps and Clotho datasets. We then conduct ablation experiments for each module in our model.
4.1 Datasets
AudioCaps
AudioCaps is an audio captioning dataset providing natural language descriptions for audio clips. The dataset consists of 46K pairs of audio clips and text descriptions, with the audio mainly sourced from AudioSet; each clip is approximately 10 s long. The training set contains 49,274 audio clips, each paired with one text description; the validation set contains 494 audio clips and the test set 957, and each clip in these two sets is paired with five different text descriptions.
Clotho
Clotho is an audio captioning dataset consisting of 4,981 audio samples, each 15–30 s long. The dataset is divided into a training set, a validation set and a test set. In Clotho v2, the training set contains 3,839 audio clips, and the validation and test sets contain 1,045 clips each. Each audio clip is paired with five different text descriptions, each 8 to 20 words long.
4.2 Implementation details
We use the retrieval metrics R@K (higher is better) and the median (MedR) and mean (MeanR) rank (lower is better) to evaluate the performance of our model on the retrieval tasks. R@K denotes the percentage of queries for which a correct result is retrieved in the top-K results, MedR denotes the median rank of the first correct result, and MeanR denotes the mean rank of the first correct result.
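As a concrete illustration, the following minimal NumPy sketch (not our exact evaluation code) computes these metrics from a query–candidate similarity matrix, assuming a one-to-one ground-truth pairing; for the five-caption validation and test splits, the rank of the best-ranked correct caption is typically used instead:

```python
import numpy as np

def retrieval_metrics(sim, ks=(1, 5, 10, 50)):
    """Compute R@K, MedR and MeanR from an (N, N) similarity matrix,
    where sim[i, j] scores query i against candidate j and the
    ground-truth match of query i is assumed to be candidate i."""
    order = np.argsort(-sim, axis=1)  # candidates sorted by descending similarity
    # 1-indexed rank at which the ground-truth candidate appears for each query.
    ranks = np.argmax(order == np.arange(len(sim))[:, None], axis=1) + 1
    metrics = {f"R@{k}": 100.0 * np.mean(ranks <= k) for k in ks}
    metrics["MedR"] = float(np.median(ranks))   # median rank of first correct result
    metrics["MeanR"] = float(np.mean(ranks))    # mean rank of first correct result
    return metrics
```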
During our experiments, on AudioCaps the batch size is set to 32, num_workers to 6, the learning rate to 0.2 and the number of epochs to 50; on Clotho, the batch size is set to 24, num_workers to 8, the learning rate to 0.2 and the number of epochs to 50.
4.3 Results
We evaluate our audio-text retrieval model on AudioCaps and Clotho. We extract audio features using the pre-trained ResNet38 audio model from PANNs and text features via the pre-trained BERT model from HuggingFace, aligning the feature vectors of both modalities to 1024 dimensions via a fully connected layer. We pass the audio and text feature vectors through the collaborative attention module, align features between the two modalities using cross-modal contrastive learning, and learn more effective single-modality features through intra-modal contrastive learning. We fine-tune the pre-trained audio and text encoders on the training set and select the model with the best combined performance across all retrieval metrics on the validation set for evaluation on the test set. We perform two retrieval tasks: retrieving audio by text and retrieving text by audio. We compare our model with current state-of-the-art audio-text retrieval models; the results are shown in Table 1 and Table 2.
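For concreteness, the following sketch illustrates the alignment step under simplifying assumptions: a single linear projection per modality, L2-normalized embeddings, and a symmetric InfoNCE objective with temperature tau (the names, the temperature value and the loss form are illustrative rather than our exact implementation; the input dimensions follow the standard ResNet38 and BERT-base output sizes):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Map encoder outputs into the shared 1024-dimensional space."""
    def __init__(self, in_dim, out_dim=1024):
        super().__init__()
        self.fc = nn.Linear(in_dim, out_dim)

    def forward(self, x):
        return F.normalize(self.fc(x), dim=-1)  # L2-normalized embeddings

def cross_modal_contrastive_loss(audio_emb, text_emb, tau=0.07):
    """Symmetric InfoNCE over a batch of paired audio/text embeddings:
    matched pairs are positives, all other batch pairs are negatives."""
    logits = audio_emb @ text_emb.t() / tau               # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

# ResNet38 in PANNs yields 2048-d clip embeddings; BERT-base yields 768-d.
audio_proj, text_proj = ProjectionHead(2048), ProjectionHead(768)
```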
Table 1
Models for audio-text retrieval on AudioCaps
Text->Audio

| Model | R1↑ | R5↑ | R10↑ | R50↑ | MedR↓ | MeanR↓ |
|---|---|---|---|---|---|---|
| CNN14 + NetRVLAD[5]* | 29.3 ± 0.3 | 65.2 ± 0.5 | 79.3 ± 1.0 | / | 3.0 ± 0.0 | / |
| CE[13] | 23.6 ± 0.6 | 56.2 ± 0.5 | 71.4 ± 0.5 | 92.3 ± 1.5 | 4.0 ± 0.0 | 18.3 ± 3.0 |
| MOEE[13] | 23.0 ± 0.7 | 55.7 ± 0.3 | 71.0 ± 1.2 | 93.0 ± 0.3 | 4.0 ± 0.0 | 16.3 ± 0.5 |
| Ours | 33.4 ± 0.4 | 68.8 ± 0.1 | 81.9 ± 0.3 | 96.8 ± 0.2 | 3.0 ± 0.0 | 10.0 ± 0.3 |

Audio->Text

| Model | R1↑ | R5↑ | R10↑ | R50↑ | MedR↓ | MeanR↓ |
|---|---|---|---|---|---|---|
| CNN14 + NetRVLAD[5]* | 33.3 ± 0.5 | 67.6 ± 0.5 | 80.6 ± 0.8 | / | 3.0 ± 0.0 | / |
| CE[13] | 27.6 ± 1.0 | 60.5 ± 0.7 | 74.7 ± 0.8 | 94.2 ± 0.4 | 4.0 ± 0.0 | 14.7 ± 1.4 |
| MOEE[13] | 26.6 ± 0.7 | 59.3 ± 1.4 | 73.5 ± 1.1 | 94.0 ± 0.5 | 4.0 ± 0.0 | 15.6 ± 0.8 |
| Ours | 42.3 ± 0.6 | 74.0 ± 0.7 | 85.3 ± 0.3 | 98.0 ± 0.2 | 2.0 ± 0.0 | 7.2 ± 0.3 |
Table 2
Models for audio-text retrieval on Clotho
Text->Audio

| Model | R1↑ | R5↑ | R10↑ | R50↑ | MedR↓ | MeanR↓ |
|---|---|---|---|---|---|---|
| CNN14 + NetRVLAD[5]* | 13.1 ± 0.2 | 33.1 ± 0.6 | 45.1 ± 0.2 | / | 13.0 ± 0.0 | / |
| CE[13] | 6.7 ± 0.4 | 21.6 ± 0.6 | 33.2 ± 0.3 | 69.8 ± 0.3 | 22.3 ± 0.6 | 58.3 ± 1.1 |
| MOEE[13] | 6.0 ± 0.1 | 20.8 ± 0.7 | 32.3 ± 0.3 | 68.5 ± 0.5 | 23.0 ± 0.0 | 60.2 ± 0.8 |
| Ours | 12.7 ± 0.3 | 34.5 ± 0.7 | 47.1 ± 0.2 | 77.5 ± 0.4 | 12.0 ± 0.0 | 51.6 ± 1.3 |

Audio->Text

| Model | R1↑ | R5↑ | R10↑ | R50↑ | MedR↓ | MeanR↓ |
|---|---|---|---|---|---|---|
| CNN14 + NetRVLAD[5]* | 13.0 ± 0.2 | 32.9 ± 0.7 | 45.4 ± 0.8 | / | 13.0 ± 0.0 | / |
| CE[13] | 7.0 ± 0.3 | 22.7 ± 0.6 | 34.6 ± 0.5 | 67.9 ± 2.3 | 21.3 ± 0.6 | 72.6 ± 3.4 |
| MOEE[13] | 7.2 ± 0.5 | 22.1 ± 0.7 | 33.2 ± 1.1 | 67.4 ± 0.3 | 22.7 ± 0.6 | 71.8 ± 2.3 |
| Ours | 14.3 ± 1.1 | 35.1 ± 1.0 | 48.1 ± 1.6 | 79.8 ± 0.6 | 11.3 ± 0.9 | 42.3 ± 2.2 |
Note: * indicates that the relevant source code is not available and the results are taken from the original paper; / indicates that the metric is not reported in the original paper.
Our work achieves superior retrieval results on audio-text retrieval relative to previous work. On AudioCaps, relative to CE and MOEE, our model improves R@1 by 10%, R@5 by 13%, R@10 by 10% and R@50 by 13%, and improves MedR by 1 and MeanR by 6, on the text-to-audio task. On the audio-to-text task, R@1 improves by 12%, R@5 by 15%, R@10 by 11% and R@50 by 14%, while MedR improves by 2 and MeanR by 7. Compared with the method of Ref. [5], our model achieves a combined improvement of over 3% on the text-to-audio task and nearly 7% on the audio-to-text task. We believe the main reason the improvement is markedly larger for audio-to-text retrieval than for text-to-audio retrieval is that contrastive learning within the audio modality learns richer audio features: the "query" becomes more specific during retrieval, so the "answer" can be matched more precisely.
On Clotho, compared with CE and MOEE, our model improves R@1 by 6%, R@5 by 13%, R@10 by 14% and R@50 by 8%, and improves MedR and MeanR by nearly 10, on the text-to-audio task. On the audio-to-text task, R@1 increases by 7%, R@5 by 13%, R@10 by 14% and R@50 by 12%, while MedR and MeanR improve by close to 11. Compared with the work of [5], our model's R@1 is 0.4% lower on the text-to-audio task, while the other metrics improve by about 2%; on the audio-to-text task, R@1 increases by 1% and the other metrics by nearly 3%. Compared with AudioCaps, the Clotho dataset is harder to process, so the improvement of our model on Clotho is relatively smaller, but it is still significant compared with existing models. We then evaluate the results of freezing versus fine-tuning the pre-trained encoders on AudioCaps and Clotho, as shown in Table 3 and Table 4.
Table 3
Experimental results of freezing and fine-tuning for our model on AudioCaps
Text->Audio

| AudioCaps | R1↑ | R5↑ | R10↑ | R50↑ | MedR↓ | MeanR↓ |
|---|---|---|---|---|---|---|
| Freeze | 19.7 ± 0.3 | 51.8 ± 0.3 | 68.2 ± 0.1 | 92.7 ± 0.3 | 5.0 ± 0.0 | 16.6 ± 0.2 |
| Fine-tune | 33.4 ± 0.4 | 68.8 ± 0.1 | 81.9 ± 0.3 | 96.8 ± 0.2 | 3.0 ± 0.0 | 10.0 ± 0.3 |

Audio->Text

| AudioCaps | R1↑ | R5↑ | R10↑ | R50↑ | MedR↓ | MeanR↓ |
|---|---|---|---|---|---|---|
| Freeze | 23.6 ± 0.2 | 56.8 ± 0.9 | 72.0 ± 1.3 | 95.1 ± 0.3 | 4.0 ± 0.0 | 12.5 ± 0.2 |
| Fine-tune | 42.3 ± 0.6 | 74.0 ± 0.7 | 85.3 ± 0.3 | 98.0 ± 0.2 | 2.0 ± 0.0 | 7.2 ± 0.3 |
Table 4
Experimental results of freezing and fine-tuning for our model on Clotho
Text->Audio

| Clotho | R1↑ | R5↑ | R10↑ | R50↑ | MedR↓ | MeanR↓ |
|---|---|---|---|---|---|---|
| Freeze | 8.4 ± 0.1 | 25.7 ± 0.6 | 38.1 ± 0.6 | 72.0 ± 0.1 | 18.3 ± 0.5 | 56.2 ± 0.5 |
| Fine-tune | 12.7 ± 0.3 | 34.5 ± 0.7 | 47.1 ± 0.2 | 77.5 ± 0.4 | 12.0 ± 0.0 | 51.6 ± 1.3 |

Audio->Text

| Clotho | R1↑ | R5↑ | R10↑ | R50↑ | MedR↓ | MeanR↓ |
|---|---|---|---|---|---|---|
| Freeze | 10.2 ± 0.8 | 27.8 ± 1.4 | 39.7 ± 1.3 | 71.8 ± 0.9 | 17.0 ± 0.8 | 64.5 ± 3.3 |
| Fine-tune | 14.3 ± 1.1 | 35.1 ± 1.0 | 48.1 ± 1.6 | 79.8 ± 0.6 | 11.3 ± 0.9 | 42.3 ± 2.2 |
When the pre-trained audio and text encoders are fine-tuned on the training sets of AudioCaps and Clotho, the retrieval accuracy of our model improves significantly. Using pre-trained models and fine-tuning them on downstream tasks can markedly improve task performance, and fine-tuning is widely used in computer vision and natural language processing.
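The difference between the two settings amounts to whether gradients reach the encoder weights; a minimal PyTorch sketch (the encoder modules here are hypothetical stand-ins for the pre-trained ResNet38 and BERT encoders):

```python
import torch.nn as nn

def set_encoder_trainable(encoder: nn.Module, trainable: bool) -> None:
    """Freeze (False) or fine-tune (True) every parameter of an encoder."""
    for p in encoder.parameters():
        p.requires_grad = trainable

# Hypothetical stand-ins for the pre-trained audio and text encoders.
audio_encoder, text_encoder = nn.Linear(2048, 1024), nn.Linear(768, 1024)

# "Freeze": only the newly added layers (projections, co-attention) are updated.
set_encoder_trainable(audio_encoder, False)
set_encoder_trainable(text_encoder, False)
# "Fine-tune": pass True instead, so the encoders are updated end-to-end.
```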
4.4 Ablation experiments
In the ablation experiments, we follow the implementation details in Section 4.2. Since fine-tuning would substantially increase training time, none of the pre-trained encoders in this section's experiments are fine-tuned on the training set. We sequentially evaluate the effects of audio augmentation, the collaborative attention mechanism and intra-modal contrastive learning on audio-text retrieval in comparison experiments. Our baseline is our model with all three components (audio augmentation, the collaborative attention mechanism and intra-modal contrastive learning) removed.
4.4.1 Effect of audio augmentation
In deep learning, data augmentation has long been an important tool for improving task performance, and image augmentation strategies are ubiquitous in computer vision. In audio-text retrieval, our introduction of audio augmentation not only expands the dataset but also provides the augmented views needed for contrastive learning within the audio modality in our subsequent modules. We add an audio augmentation module to the baseline model. The augmentation methods are adding Gaussian noise, pitch shift and time shift; we also combine the three methods pairwise, and finally combine all three. We add each of these augmentation settings to the baseline model and evaluate their impact on AudioCaps, as shown in Table 5; a sketch of the three basic augmentations is given below.
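The following is one plausible waveform-level implementation of the three augmentations (the SNR, shift fraction and semitone values are illustrative, not the exact settings used in our experiments):

```python
import numpy as np
import librosa

def add_gaussian_noise(wav, snr_db=20.0):
    """Add white Gaussian noise at a given signal-to-noise ratio (dB)."""
    noise = np.random.randn(len(wav))
    scale = np.sqrt(np.mean(wav ** 2) / (10 ** (snr_db / 10) * np.mean(noise ** 2)))
    return wav + scale * noise

def time_shift(wav, max_frac=0.1):
    """Circularly shift the waveform by up to max_frac of its length."""
    max_shift = int(len(wav) * max_frac)
    return np.roll(wav, np.random.randint(-max_shift, max_shift + 1))

def pitch_shift(wav, sr, n_steps=2.0):
    """Shift the pitch by n_steps semitones (resampling makes this slow)."""
    return librosa.effects.pitch_shift(y=wav, sr=sr, n_steps=n_steps)
```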
Table 5
Different audio augmentation methods for audio-text retrieval on AudioCaps
Text->Audio

| Augmentation | R1↑ | R5↑ | R10↑ | MedR↓ | MeanR↓ |
|---|---|---|---|---|---|
| No augmentation | 18.9 ± 0.5 | 50.4 ± 0.3 | 66.2 ± 0.2 | 5.0 ± 0.0 | 19.2 ± 0.5 |
| Gaussian noise | 19.2 ± 0.2 | 51.3 ± 0.3 | 67.4 ± 0.6 | 5.0 ± 0.0 | 17.7 ± 0.6 |
| Time shift | 19.1 ± 0.1 | 51.1 ± 0.2 | 66.8 ± 0.2 | 5.0 ± 0.0 | 19.1 ± 0.2 |
| Pitch shift | 19.2 ± 0.3 | 51.2 ± 0.5 | 67.0 ± 0.2 | 5.0 ± 0.0 | 17.6 ± 0.1 |
| Gaussian noise + Pitch shift | 19.2 ± 0.2 | 51.5 ± 0.2 | 68.0 ± 0.1 | 5.0 ± 0.0 | 17.2 ± 0.2 |
| Gaussian noise + Time shift | 19.7 ± 0.4 | 51.5 ± 0.4 | 68.0 ± 0.2 | 5.0 ± 0.0 | 17.3 ± 0.2 |
| Time shift + Pitch shift | 19.1 ± 0.1 | 51.7 ± 0.2 | 67.6 ± 0.1 | 5.0 ± 0.0 | 17.4 ± 0.1 |
| Mix (all three) | 18.6 ± 0.2 | 51.4 ± 0.1 | 68.3 ± 0.2 | 5.0 ± 0.0 | 16.7 ± 0.3 |

Audio->Text

| Augmentation | R1↑ | R5↑ | R10↑ | MedR↓ | MeanR↓ |
|---|---|---|---|---|---|
| No augmentation | 20.3 ± 0.7 | 53.0 ± 0.8 | 69.6 ± 0.8 | 5.0 ± 0.0 | 16.9 ± 0.6 |
| Gaussian noise | 21.3 ± 0.8 | 54.2 ± 0.5 | 70.7 ± 1.0 | 5.0 ± 0.0 | 14.3 ± 0.4 |
| Time shift | 21.5 ± 0.1 | 54.0 ± 0.8 | 70.1 ± 0.7 | 5.0 ± 0.0 | 15.7 ± 0.1 |
| Pitch shift | 20.9 ± 0.5 | 54.1 ± 0.6 | 70.1 ± 0.3 | 5.0 ± 0.0 | 15.4 ± 0.2 |
| Gaussian noise + Pitch shift | 20.9 ± 1.6 | 53.1 ± 1.4 | 70.4 ± 0.8 | 5.0 ± 0.0 | 14.9 ± 0.4 |
| Gaussian noise + Time shift | 21.6 ± 0.6 | 54.7 ± 0.8 | 71.4 ± 0.2 | 5.0 ± 0.0 | 14.2 ± 0.2 |
| Time shift + Pitch shift | 20.8 ± 0.3 | 53.0 ± 0.6 | 69.5 ± 0.8 | 5.0 ± 0.0 | 15.1 ± 0.4 |
| Mix (all three) | 20.5 ± 0.8 | 53.3 ± 0.4 | 69.8 ± 0.9 | 5.0 ± 0.0 | 14.7 ± 0.4 |
We observe that whether a single augmentation method or a combination of methods is used, the performance improvement on the retrieval task is roughly the same, raising the retrieval metrics by 0.5–2%; in relative terms, the combination of Gaussian noise and time shift works best. Mixing all three augmentation methods instead reduces R@1; we believe that overly complex alterations to the original audio are detrimental to its feature learning. It is worth noting that during training, pitch shifting substantially increases the time overhead without yielding better results. We think that, compared with the other two methods, pitch shifting alters the original audio more strongly, since it changes the frequency content of the signal itself, which also increases the computation time; researchers applying audio augmentation to related tasks may therefore wish to discard this method.
4.4.2 Effect of the collaborative attention mechanism
The biggest challenge in cross-modal retrieval tasks is bridging the heterogeneity gap, and existing approaches have worked to reduce the disparity between modalities. Drawing on the attention mechanism, we introduce a collaborative attention mechanism into audio-text retrieval, in which information from the audio modality guides feature learning in the text modality and information from the text modality guides feature extraction in the audio modality. We hope that this exchange of information between modalities can appropriately reduce the discrepancy between them.
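A minimal sketch of this collaborative attention built on PyTorch's nn.MultiheadAttention (residual connections and layer normalization, if used, are omitted; the default heads and dropout follow the settings reported below):

```python
import torch.nn as nn

class CoAttention(nn.Module):
    """Bidirectional cross-attention between audio and text features."""
    def __init__(self, dim=1024, heads=4, dropout=0.2):
        super().__init__()
        self.text_guides_audio = nn.MultiheadAttention(dim, heads, dropout=dropout,
                                                       batch_first=True)
        self.audio_guides_text = nn.MultiheadAttention(dim, heads, dropout=dropout,
                                                       batch_first=True)

    def forward(self, audio, text):
        # Text information guides audio feature extraction ...
        audio_out, _ = self.text_guides_audio(query=audio, key=text, value=text)
        # ... and audio information guides text feature learning.
        text_out, _ = self.audio_guides_text(query=text, key=audio, value=audio)
        return audio_out, text_out
```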
We add the collaborative attention mechanism to the baseline model. During the experiments, we use multi-head attention with the number of heads set to 2, 4 and 8 and the attention dropout set to 0.2. We evaluate the effect of the collaborative attention mechanism on AudioCaps and compare the effect of different numbers of heads. The collaborative attention mechanism improves the audio-text retrieval task by approximately 0.5%, with the best results achieved with 4 heads, as shown in Table 6.
Table 6
Audio-text retrieval on AudioCaps for different numbers of heads in the collaborative attention mechanism
Text->Audio

| AudioCaps | R1↑ | R5↑ | R10↑ | MedR↓ | MeanR↓ |
|---|---|---|---|---|---|
| No Co-attention | 18.9 ± 0.5 | 50.4 ± 0.3 | 66.2 ± 0.2 | 5.0 ± 0.0 | 19.2 ± 0.5 |
| Co-attention heads = 2 | 19.3 ± 0.1 | 50.8 ± 0.6 | 66.7 ± 0.3 | 5.0 ± 0.0 | 18.1 ± 0.2 |
| Co-attention heads = 4 | 19.5 ± 0.2 | 51.1 ± 0.4 | 67.1 ± 0.6 | 5.0 ± 0.0 | 18.0 ± 0.5 |
| Co-attention heads = 8 | 19.7 ± 0.3 | 51.2 ± 0.2 | 66.4 ± 0.3 | 5.0 ± 0.0 | 18.4 ± 0.2 |

Audio->Text

| AudioCaps | R1↑ | R5↑ | R10↑ | MedR↓ | MeanR↓ |
|---|---|---|---|---|---|
| No Co-attention | 20.3 ± 0.7 | 53.0 ± 0.8 | 69.6 ± 0.8 | 5.0 ± 0.0 | 16.9 ± 0.6 |
| Co-attention heads = 2 | 21.6 ± 1.0 | 53.3 ± 0.1 | 69.8 ± 0.5 | 5.0 ± 0.0 | 14.7 ± 0.5 |
| Co-attention heads = 4 | 20.8 ± 0.9 | 54.2 ± 1.0 | 70.0 ± 0.2 | 4.6 ± 0.4 | 14.8 ± 0.6 |
| Co-attention heads = 8 | 21.1 ± 0.7 | 54.4 ± 0.6 | 70.1 ± 0.3 | 5.0 ± 0.0 | 15.2 ± 0.4 |
4.4.3 Effect of intra-modal contrastive learning
Referring to the experimental results in Table 5, we choose the audio augmentation setting with the highest overall performance improvement, combining Gaussian noise and time shift to augment the original audio. We first evaluate the effect of the intra-modal contrastive (IMC) learning module on AudioCaps, as shown in Table 7.
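A minimal sketch of the intra-modal objective, assuming an InfoNCE-style loss in which the augmented view of each clip is the positive and the other clips in the batch serve as negatives (the temperature value is illustrative):

```python
import torch
import torch.nn.functional as F

def intra_modal_contrastive_loss(emb, emb_aug, tau=0.07):
    """InfoNCE within one modality. emb and emb_aug are (B, D)
    L2-normalized embeddings of the original and augmented audio,
    row-aligned so that row i of each tensor comes from the same clip."""
    logits = emb @ emb_aug.t() / tau
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)
```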
Table 7
Effect of intra-modal contrastive learning for audio-text retrieval on AudioCaps
Text->Audio

| AudioCaps | R1↑ | R5↑ | R10↑ | MedR↓ | MeanR↓ |
|---|---|---|---|---|---|
| No IMC | 18.9 ± 0.5 | 50.4 ± 0.3 | 66.2 ± 0.2 | 5.0 ± 0.0 | 19.2 ± 0.5 |
| IMC | 19.1 ± 0.7 | 51.3 ± 0.3 | 67.9 ± 0.5 | 5.0 ± 0.0 | 17.5 ± 0.1 |

Audio->Text

| AudioCaps | R1↑ | R5↑ | R10↑ | MedR↓ | MeanR↓ |
|---|---|---|---|---|---|
| No IMC | 20.3 ± 0.7 | 53.0 ± 0.8 | 69.6 ± 0.8 | 5.0 ± 0.0 | 16.9 ± 0.6 |
| IMC | 23.0 ± 0.6 | 57.1 ± 1.1 | 71.1 ± 0.2 | 4.0 ± 0.0 | 14.2 ± 0.4 |
Contrastive learning within the audio modality brings a relatively large improvement to audio-text retrieval, particularly on the audio-to-text task, where the improvement reaches up to 4%. Given the complexity of the Clotho dataset, in addition to the contrastive module within the audio modality, we also include a contrastive module within the text modality, in which we sample two captions at a time from the given set of five as a positive pair for contrastive learning. We conduct experiments on Clotho to evaluate its effectiveness, as shown in Table 8.
Table 8
Effect of the intra-modal contrastive learning for audio-text retrieval on Clotho
Text->Audio

| Clotho | R1↑ | R5↑ | R10↑ | MedR↓ | MeanR↓ |
|---|---|---|---|---|---|
| No IMC | 8.2 ± 0.2 | 25.2 ± 0.2 | 36.5 ± 0.1 | 20.0 ± 0.0 | 61.3 ± 0.5 |
| Audio IMC | 8.1 ± 0.1 | 25.1 ± 0.2 | 36.4 ± 0.1 | 19.7 ± 0.5 | 58.6 ± 1.4 |
| Audio + Text IMC | 7.9 ± 0.2 | 25.2 ± 0.1 | 37.0 ± 0.1 | 19.0 ± 0.0 | 57.8 ± 0.8 |

Audio->Text

| Clotho | R1↑ | R5↑ | R10↑ | MedR↓ | MeanR↓ |
|---|---|---|---|---|---|
| No IMC | 9.8 ± 0.3 | 27.8 ± 0.4 | 38.7 ± 0.6 | 18.7 ± 0.9 | 68.8 ± 0.5 |
| Audio IMC | 9.7 ± 0.2 | 27.1 ± 0.8 | 38.9 ± 0.5 | 18.0 ± 0.0 | 67.7 ± 0.8 |
| Audio + Text IMC | 9.8 ± 0.1 | 28.0 ± 1.0 | 40.0 ± 0.6 | 17.3 ± 0.5 | 67.1 ± 1.6 |
We find that contrastive learning within the audio modality brings little improvement to the retrieval tasks on the Clotho dataset apart from MedR and MeanR. We studied the Clotho data carefully, as shown in Table 9. In the Clotho training set, each audio clip corresponds to five text descriptions that differ in both length and content. Coupled with the small number of samples in Clotho, this makes it difficult to learn effective audio and text features; even though we expand the number of audio clips through augmentation, the caption matched to a clip is not the same each time, which makes retrieval considerably harder. We therefore attempt to obtain a better feature representation using contrastive learning within the text modality, where the five different captions of a clip learn from each other. On the text-to-audio task R@10 increases by 0.5%, and on the audio-to-text task R@10 improves by 1.3%. We will investigate how to handle such difficult datasets in future work.
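For the text-side contrastive module, the caption pairing step can be sketched as follows (the helper and data layout are hypothetical illustrations, not our exact data pipeline):

```python
import random

def sample_caption_pair(captions):
    """Draw two distinct captions of the same audio clip to serve as an
    intra-modal positive pair for text-side contrastive learning."""
    return random.sample(captions, 2)

# Hypothetical usage with one Clotho item holding five captions:
anchor, positive = sample_caption_pair(["cap1", "cap2", "cap3", "cap4", "cap5"])
```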