Hashing methods have attracted considerable attention in cross-modal retrieval research owing to their high retrieval efficiency and low storage cost. However, these methods construct the similarity matrix while paying little attention to the similarity information within each modality. In this paper, we propose a new cross-modal hashing method called Deep Hashing Similarity Learning for Cross-modal Retrieval (DHSL). DHSL introduces relation networks into the hashing framework to perform pairwise matching between images and texts, effectively bridging the heterogeneity gap between the two modalities while also attending to the similarity information within the image and text modalities, thereby producing a hash similarity matrix that captures both inter-modal similarity and intra-modal discrimination. Considering that converting high-dimensional features into hash codes discards a great deal of semantic information, we design a feature selector for feature enhancement: it selects discriminative features from the original features and concatenates and fuses them with the low-dimensional features to supplement the lost semantic information. In addition, we introduce a weighted cosine triplet loss and a quantization loss to constrain the hash codes in Hamming space and learn high-quality hash codes. We conduct extensive experiments on the MIRFlickr25K and NUS-WIDE datasets. The results show that DHSL outperforms several state-of-the-art cross-modal hashing methods.
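To make the loss design concrete, the following is a minimal sketch of how a weighted cosine triplet loss combined with a quantization loss could look; the per-triplet weights, margin value, and loss balance factor here are illustrative assumptions, as the abstract does not specify the exact formulation used in DHSL.

```python
import torch
import torch.nn.functional as F

def weighted_cosine_triplet_loss(anchor, positive, negative, weights, margin=0.5):
    """Illustrative triplet loss in cosine-similarity space.

    anchor/positive/negative: (batch, d) continuous (relaxed) hash representations.
    weights: (batch,) per-triplet weights -- a hypothetical weighting scheme,
    not necessarily the one used in the paper.
    """
    sim_pos = F.cosine_similarity(anchor, positive)   # pull matched image-text pairs together
    sim_neg = F.cosine_similarity(anchor, negative)   # push mismatched pairs apart
    # Hinge on the similarity gap, scaled by each triplet's weight.
    per_triplet = F.relu(margin + sim_neg - sim_pos)
    return (weights * per_triplet).mean()

def quantization_loss(continuous_codes):
    """Penalize the gap between relaxed codes and their binarized (sign) codes."""
    binary_codes = torch.sign(continuous_codes)
    return F.mse_loss(continuous_codes, binary_codes)

# Example: 16-bit codes for a batch of 8 image-text triplets.
a, p, n = (torch.randn(8, 16) for _ in range(3))
w = torch.ones(8)
loss = weighted_cosine_triplet_loss(a, p, n, w) + 0.1 * quantization_loss(a)
```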