Online hateful messages are detrimental to individuals and society and have recently become a major social issue. Among them, a new type of hateful message, the "hateful meme", has emerged and poses difficulties for traditional deep learning-based detection. Because hateful memes combine text and images to express users' intent, they cannot be accurately identified by analyzing the embedded text or the image alone. To detect hateful memes effectively, an algorithm must possess strong vision-and-language fusion capability. In this study, we move closer to this goal by constructing, for each meme, a triplet of visual features, object tags, and textual features, generated by the object detection model VinVL and optical character recognition (OCR), and feeding it into the Transformer-based Vision-Language Pre-Trained Model (PTM) OSCAR+ to perform cross-modal learning of memes. After fine-tuning and attaching a random forest (RF) classifier, our model (OSCAR+RF) achieved an AUROC of 0.768 on the hateful meme detection task of a public dataset, surpassing the published baselines. In conclusion, this study demonstrates that Vision-Language PTMs augmented with anchor points can improve deep learning-based detection of hateful memes by enforcing stronger alignment between textual and visual information.
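To make the final classification stage concrete, the sketch below trains a random forest on cross-modal meme embeddings and reports AUROC, the metric used above. It assumes the embeddings have already been extracted with the fine-tuned OSCAR+ encoder (e.g. its pooled output per meme); the file names and the number of trees are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of the RF classification stage on top of OSCAR+ embeddings.
# Assumes each meme has already been encoded by the fine-tuned OSCAR+ model
# into a fixed-length vector; file names below are hypothetical placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Precomputed cross-modal embeddings and binary labels (1 = hateful meme).
X_train = np.load("train_embeddings.npy")  # shape: (n_train, hidden_dim)
y_train = np.load("train_labels.npy")
X_test = np.load("test_embeddings.npy")
y_test = np.load("test_labels.npy")

# Random forest classifier on the frozen embeddings (500 trees is an
# assumed setting, not the paper's reported hyperparameter).
rf = RandomForestClassifier(n_estimators=500, random_state=0)
rf.fit(X_train, y_train)

# Score the test set with the positive-class probability and report AUROC.
scores = rf.predict_proba(X_test)[:, 1]
print("AUROC:", roc_auc_score(y_test, scores))
```

The split into a frozen cross-modal encoder and a separate tree-based classifier mirrors the OSCAR+RF design described in the abstract: the Transformer handles vision-language fusion, while the RF provides the final hateful/non-hateful decision.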