Online hateful messages, more commonly known as “hate speech”, have recently become a major social issue. Many studies have shown them to be detrimental to both individuals and society. Many online platforms employ legions of moderators to manually identify and remove these messages, yet such practices are time-consuming, expensive, and often take a toll on the reviewers' mental health. As a solution, computational methods are applied to automatically identify and remove hateful messages. However, as online discussions are now often dominated by memes, a format that leverages both text and image to express users’ intents, many text-only moderation methods have become obsolete. To detect a hateful meme effectively, an algorithm must possess strong vision and language fusion capability. In this work, we move closer to this goal by combining a Visual-Language Pre-Trained Model, an object detection model, and a random forest classifier to achieve a 0.77 AUROC score on the hateful meme detection task, an improvement of 0.15 over the best baseline method.
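
For readers unfamiliar with the metric, AUROC can be read as the probability that a randomly chosen hateful meme receives a higher score than a randomly chosen benign one, with ties counted as one half. The following is a minimal sketch of that definition only, not of the pipeline described above; the function name `auroc` and the toy labels and scores are hypothetical and chosen purely for illustration.

```python
def auroc(labels, scores):
    """Return the AUROC as the fraction of (positive, negative) pairs
    in which the positive example has the higher score (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = hateful, 0 = benign; the scores are invented.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
toy_auroc = auroc(labels, scores)  # 8 of the 9 pairs are ranked correctly
```

Under this reading, a score of 0.5 corresponds to random ranking and 1.0 to perfect separation, so a gain of 0.15 means a noticeably larger share of hateful/benign pairs is ordered correctly.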