Owing to their low storage cost and high retrieval efficiency, deep hashing methods are widely used in cross-modal retrieval. In practice, images are usually accompanied by textual descriptions rather than labels, so unsupervised methods have attracted wide attention.
However, because of the modality gap and semantic differences, existing unsupervised methods cannot adequately bridge the differences between modalities, which leads to suboptimal retrieval results. In this paper, we propose CLIP-based Cycle Alignment Hashing for unsupervised vision-text retrieval (CCAH), which exploits the semantic link between the original features of each modality and the reconstructed features. First, we design a modal cyclic interaction method that aligns semantics within each modality: the features of one modality are used to reconstruct the features of the other, so that intra-modal and inter-modal semantic similarity are both taken fully into account.
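As a rough illustration of this cyclic reconstruction, the sketch below reconstructs text features from image features and cycles back again, then aligns the reconstructions with the originals. The module names, feature dimensions, and the cosine-based alignment loss are illustrative assumptions, not CCAH's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossReconstructor(nn.Module):
    """Reconstructs features of one modality from the other (illustrative)."""
    def __init__(self, in_dim, out_dim, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x):
        return self.net(x)

# Hypothetical dimensions: 512-d image features and 512-d text features.
img2txt = CrossReconstructor(512, 512)
txt2img = CrossReconstructor(512, 512)

img_feat = torch.randn(32, 512)   # batch of original image features
txt_feat = torch.randn(32, 512)   # batch of original text features

# Cycle: image -> reconstructed text -> reconstructed image.
txt_hat = img2txt(img_feat)
img_cyc = txt2img(txt_hat)

# Align reconstructed features with the originals (cosine loss as one option).
recon_loss = (1 - F.cosine_similarity(txt_hat, txt_feat).mean()) \
           + (1 - F.cosine_similarity(img_cyc, img_feat).mean())
recon_loss.backward()
```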
Second, we introduce graph attention networks (GAT) into the cross-modal retrieval task, considering the influence of neighbouring text nodes and adding an attention mechanism to capture the global features of the text modality.
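A minimal single-head graph attention layer over text neighbour features is sketched below. This is a generic GAT-style layer, not CCAH's exact architecture, and the toy neighbour graph is an assumption for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATLayer(nn.Module):
    """Single-head graph attention layer (generic sketch, not CCAH's exact module)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.a = nn.Linear(2 * out_dim, 1, bias=False)

    def forward(self, x, adj):
        # x: (N, in_dim) text node features; adj: (N, N) 0/1 neighbour mask with self-loops
        h = self.W(x)                                       # (N, out_dim)
        n = h.size(0)
        h_i = h.unsqueeze(1).expand(n, n, -1)               # source node copies
        h_j = h.unsqueeze(0).expand(n, n, -1)               # neighbour node copies
        e = F.leaky_relu(self.a(torch.cat([h_i, h_j], dim=-1))).squeeze(-1)
        e = e.masked_fill(adj == 0, float("-inf"))          # attend only to neighbours
        alpha = torch.softmax(e, dim=-1)                    # attention weights per node
        return alpha @ h                                    # aggregated text features

# Toy usage: 4 text nodes with 512-d features and a hypothetical neighbour graph.
x = torch.randn(4, 512)
adj = torch.eye(4)
adj[0, 1] = adj[1, 0] = adj[1, 2] = adj[2, 1] = adj[2, 3] = adj[3, 2] = 1.0
out = GATLayer(512, 128)(x, adj)                            # (4, 128)
```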
Third, we extract fine-grained image features with the CLIP visual encoder. Finally, hash codes are learned through hash functions.
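The sketch below shows one way these two steps could be wired together: image features from the CLIP visual encoder are mapped to hash codes by a tanh-activated hash layer and binarized with sign. The use of OpenAI's `clip` package, the pooled image embedding, the 64-bit code length, and the image path are assumptions for illustration, not CCAH's exact pipeline.

```python
import torch
import torch.nn as nn
import clip                      # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)   # CLIP visual encoder

class HashHead(nn.Module):
    """Maps features to K-bit hash codes via a tanh relaxation (illustrative)."""
    def __init__(self, in_dim=512, bits=64):
        super().__init__()
        self.fc = nn.Linear(in_dim, bits)

    def forward(self, feat):
        h = torch.tanh(self.fc(feat))       # continuous codes in (-1, 1) for training
        return h, torch.sign(h)             # binary codes {-1, +1} for retrieval

hash_head = HashHead(in_dim=512, bits=64).to(device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # placeholder path
with torch.no_grad():
    img_feat = model.encode_image(image).float()   # CLIP image features
cont_code, bin_code = hash_head(img_feat)
```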
Experiments on three widely used datasets demonstrate that the proposed CCAH achieves satisfactory overall retrieval accuracy. Our code is available at https://github.com/CQYIO/CCAH.git.