Medical image segmentation is important for extracting objects of interest from complex anatomical structures to enable further analysis. In lingual ultrasound, extracting the tongue contour is essential for understanding speech behaviour, which allows lingual ultrasound to serve as a biofeedback tool. Segmenting the tongue from ultrasound images normally requires training a deep-learning model on a large dataset, and because such data are difficult to collect, generalizing the model across a wide variety of images is challenging. In this research, we propose a strategy and a generalizable model that work effectively with a well-curated small dataset. This article presents a hybrid architecture that combines a UNet, a Vision Transformer (ViT), and a contrastive loss to build a foundation model cumulatively. The process starts by constructing a reference representation in the embedding space, validated by human experts, against which every new training input is checked. UNet and ViT encoders extract the feature representations of the input, and the contrastive loss compares each new feature embedding with the reference in the embedding space. A UNet-based decoder then reconstructs the image at its original size. Before the final results are released, a quality-control step assesses the segmented contour; if it is rejected, the algorithm asks a human expert to annotate the image manually. The results show improved accuracy over traditional techniques, and the model generalizes well because the embedding space contains only high-quality, tongue-relevant features.
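
The dual-encoder and contrastive-comparison steps described above can be illustrated with a minimal sketch. The PyTorch snippet below shows one plausible reading of the design: a small UNet-style convolutional encoder and a tiny ViT each map an ultrasound frame to an embedding, the two embeddings are fused, and a pairwise contrastive loss pulls valid inputs toward the expert-validated reference embedding while pushing rejected ones away. All layer sizes, the additive fusion, the margin, and the helper names (`ConvEncoder`, `ViTEncoder`, `contrastive_loss`) are illustrative assumptions rather than the authors' exact configuration; the UNet decoder and the quality-control stage are omitted for brevity.

```python
# A minimal sketch (not the authors' released code) of the embedding-comparison
# stage: two encoders produce a joint embedding that is contrasted against an
# expert-validated reference embedding. Sizes and fusion are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvEncoder(nn.Module):
    """UNet-style contracting path reduced to a single embedding vector."""
    def __init__(self, in_ch=1, dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(128, dim)

    def forward(self, x):
        return self.proj(self.features(x).flatten(1))


class ViTEncoder(nn.Module):
    """Tiny ViT: patch embedding + transformer encoder + mean pooling."""
    def __init__(self, img_size=128, patch=16, dim=128, depth=4, heads=4):
        super().__init__()
        self.patch_embed = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
        n_patches = (img_size // patch) ** 2
        self.pos = nn.Parameter(torch.zeros(1, n_patches, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, x):
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2) + self.pos
        return self.encoder(tokens).mean(dim=1)  # mean-pool patch tokens


def contrastive_loss(embedding, reference, label, margin=1.0):
    """Pairwise contrastive loss: pull valid inputs (label=1) toward the
    expert reference embedding, push invalid ones at least `margin` away."""
    d = F.pairwise_distance(embedding, reference)
    return torch.mean(label * d.pow(2) + (1 - label) * F.relu(margin - d).pow(2))


if __name__ == "__main__":
    unet_enc, vit_enc = ConvEncoder(), ViTEncoder()
    x = torch.randn(8, 1, 128, 128)        # a batch of ultrasound frames
    z = unet_enc(x) + vit_enc(x)           # assumed additive fusion of embeddings
    reference = torch.randn(1, 128)        # stand-in for the expert reference
    labels = torch.ones(8)                 # 1 = valid tongue frame, 0 = rejected
    print(contrastive_loss(z, reference.expand(8, -1), labels))
```

Under these assumptions, a new training input whose fused embedding falls close to the reference is accepted into the curated dataset, while a distant one is flagged for expert review, mirroring the human-in-the-loop quality control described in the abstract.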