Scene text recognition is an important component of scene understanding systems, and the problem has therefore received increasing attention in recent years. This interest can be attributed to its numerous potential applications and to the challenges it still poses. However, the scientific community has long concentrated on the training phase of CNN (Convolutional Neural Network) models while neglecting preliminary processes such as dataset annotation. The objective of this work is to develop a semi-automatic system, based on reinforcement learning and a genetic algorithm, for annotating Arabic text in natural scene images. In this paper, we use reinforcement learning to enhance CNN-based annotation, and we validate our system on our TSVDR (Tunisia Street View Dataset for Arabic Word Recognition) dataset, the EvArEST dataset, and synthetic data. Our system is shown to significantly reduce the number of required training samples and to cut annotation time to less than one quarter.