Sign language is the primary mode of communication within the hearing-impaired community and between that community and the outside world. This paper proposes a vision-based sign language recognition model that identifies the signed word from a video of a hand gesture. The model consists of three modules: pre-processing, a Convolutional Neural Network (CNN), and a Recurrent Neural Network (RNN). The pre-processing module extracts the frames from the video, segments the region of interest in each frame, and converts it to a grayscale image. The CNN extracts spatial features from each frame, so that each video is represented by a sequence of spatial features. The RNN then recognizes the gesture from the spatio-temporal relations in that sequence of features. The model is evaluated on two datasets. It achieves an accuracy of 100% on the Argentinian Sign Language dataset (LSA 64) and 97.70% on the Indian Sign Language (ISL) dataset for Emergency Situations.
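The data flow through the three modules can be illustrated with a minimal, dependency-free Python sketch. It is not the paper's implementation: every function name, the bounding-box format `(top, left, height, width)`, and the grayscale weights (the common ITU-R BT.601 luma coefficients) are assumptions made for illustration. In particular, `spatial_features` and `recurrent_summary` are toy stand-ins that only mark where a trained CNN and RNN would sit.

```python
import math
from typing import List, Tuple

Pixel = Tuple[int, int, int]      # (R, G, B), each in 0..255
Frame = List[List[Pixel]]         # rows of pixels
GrayFrame = List[List[float]]     # rows of intensities in 0..255
Box = Tuple[int, int, int, int]   # hypothetical ROI format: (top, left, height, width)


def crop(frame: Frame, box: Box) -> Frame:
    """Keep only the region of interest (assumed to be given, not detected)."""
    top, left, height, width = box
    return [row[left:left + width] for row in frame[top:top + height]]


def to_gray(frame: Frame) -> GrayFrame:
    """Convert RGB pixels to grayscale using BT.601 luma weights (an assumption)."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for r, g, b in row] for row in frame]


def preprocess(video: List[Frame], box: Box) -> List[GrayFrame]:
    """Pre-processing module: crop each extracted frame, then convert it to grayscale."""
    return [to_gray(crop(frame, box)) for frame in video]


def spatial_features(gray: GrayFrame) -> List[float]:
    """Toy stand-in for the CNN: mean intensity of the left and right halves, scaled to [0, 1].

    Assumes a non-empty frame that is at least two pixels wide.
    """
    half = len(gray[0]) // 2
    left = [value for row in gray for value in row[:half]]
    right = [value for row in gray for value in row[half:]]
    return [sum(left) / (255.0 * len(left)), sum(right) / (255.0 * len(right))]


def recurrent_summary(features: List[List[float]], decay: float = 0.5) -> List[float]:
    """Toy stand-in for the RNN: h_t = tanh(decay * h_{t-1} + x_t), applied elementwise."""
    state = [0.0] * len(features[0])
    for x in features:
        state = [math.tanh(decay * h + v) for h, v in zip(state, x)]
    return state


def encode_video(video: List[Frame], box: Box) -> List[float]:
    """Run the full flow: frames -> grayscale ROI -> per-frame features -> sequence summary."""
    frames = preprocess(video, box)
    features = [spatial_features(frame) for frame in frames]
    return recurrent_summary(features)
```

In a real system the final state would feed a classifier over the gesture vocabulary; here `encode_video` simply returns the summary vector so that the shape of the pipeline stays visible.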