In food computing, Named Entity Recognition (NER) is useful well beyond tagging words in recipes: it supports intelligent recipe recommendation, health analysis, and personalization. Existing food NER models, however, are held back by inconsistent recipe input formats, limited annotations, and dataset quality problems. This article addresses ingredient NER and introduces two models: SINERA, an efficient and robust model, and SINERAS, a semi-supervised variant that uses a Gaussian Mixture Model (GMM) to learn from untagged ingredient list entries.
To mitigate data quality and availability issues in food computing, we introduce the SINERA dataset, a diverse and comprehensive collection of ingredient lines. We also identify a pervasive problem, spurious correlations between entity positions and predictions, and address it with a set of data augmentation rules tailored to food NER. Extensive evaluations on the SINERA dataset and a revised TASTEset dataset show that our models outperform several state-of-the-art baselines and rival BERT while using fewer parameters and less training time.
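To make the position-correlation issue concrete, the following minimal sketch shows one plausible kind of augmentation: reordering the entity spans of a tagged ingredient line so that a tag is no longer tied to a fixed position. It is an illustration only and not the augmentation rules proposed in the article; the function names `group_spans` and `shuffle_spans`, the example line, and the tag names are assumptions made for this example.

```python
import random
from typing import List, Tuple

# A token is a (word, tag) pair. The tag names used below are illustrative
# assumptions, not the label set defined by the article.
Token = Tuple[str, str]


def group_spans(tokens: List[Token]) -> List[List[Token]]:
    """Group consecutive tokens that share a tag into one span."""
    spans: List[List[Token]] = []
    for word, tag in tokens:
        if spans and spans[-1][0][1] == tag:
            spans[-1].append((word, tag))
        else:
            spans.append([(word, tag)])
    return spans


def shuffle_spans(tokens: List[Token], rng: random.Random) -> List[Token]:
    """Return a copy of the line with its spans in a random order.

    Tokens inside a span keep their relative order, so multi-word
    entities such as "all-purpose flour" stay intact.
    """
    spans = group_spans(tokens)
    rng.shuffle(spans)
    return [token for span in spans for token in span]


# Hypothetical tagged ingredient line: "2 cups all-purpose flour sifted"
line: List[Token] = [
    ("2", "QUANTITY"),
    ("cups", "UNIT"),
    ("all-purpose", "FOOD"),
    ("flour", "FOOD"),
    ("sifted", "PROCESS"),
]

augmented = shuffle_spans(line, random.Random(0))
```

Training on both `line` and `augmented` exposes a model to the same entities at different positions, which is the general idea behind discouraging position-based shortcuts. A practical rule set would likely constrain which reorderings are allowed so that the augmented lines remain plausible ingredient entries.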