Background on zero-shot learning versus supervised learning
ZSL and supervised learning differ in how the programmer specifies which images belong to each class. In ZSL, each class is described with natural-language text after the model weights have already been learnt and fixed. In supervised learning, by contrast, each class is described with example images, which are used to adjust the model weights during training. As the results presented in this paper show (see Results section), extensive phrase engineering can measurably improve ZSL performance.
Phrase engineering for ZSL requires only a small, labelled validation set. Supervised learning additionally requires a large, labelled training dataset, which takes extensive manual effort to assemble and annotate. Furthermore, supervised learning usually requires a machine learning developer to write code that organises the data and trains a model. Model hyperparameters need to be tuned by repeatedly training models until the best values are found, which requires expensive GPU resources in addition to expertise.
Procedure and dataset
In this paper, we compared the ZSL performance of CLIP against supervised learning (using ABIDLA2) on the dataset introduced in the ABIDLA2 paper [12]. The dataset consisted of eight beverage categories: seven alcoholic beverage categories and the “others” category.
Upon closer inspection of the original ABIDLA2 dataset (which we will call ABD-2022), we found that a substantial proportion of the images in the “others” category of the test set contained alcoholic beverages, such as gin or vodka, whose categories were not included in ABIDLA2. Hence, to more clearly delineate images without alcohol-related content, we manually relabelled the “others” category in the test dataset using two different annotators and kept only the images that both annotators agreed belonged in the non-alcohol-related “others” category. In this modified dataset (called ABD-2023), we removed 1,177 alcohol-related images from the “others” category and replaced them with 1,177 images retrieved from Google using the following search terms: “sports cars”, “architecture”, “seascape”, “villas”. These added images were manually checked to ensure that they belonged in the non-alcohol-related “others” category. The images in the remaining test dataset categories remained unchanged from ABD-2022, as did all the training and validation examples.
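In implementation terms, this agreement filter amounts to keeping only the images that received unanimous “others” labels. The following sketch illustrates the idea; the `annotations` structure and file names are hypothetical placeholders, not our actual annotation tooling.

```python
# Illustrative sketch of the double-annotation filter: an image stays in
# the non-alcohol-related "others" category only if both annotators
# independently labelled it "others". The `annotations` mapping and the
# file names are hypothetical placeholders.
annotations = {
    "img_0001.jpg": ("others", "others"),            # kept (agreement)
    "img_0002.jpg": ("others", "alcohol-related"),   # removed (disagreement)
}

kept = [img for img, (a1, a2) in annotations.items()
        if a1 == a2 == "others"]
```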
Table 1 shows the number of images in the training, validation, and testing datasets for the ABD-2023 dataset that we used for the comparison between ABIDLA2 and ZSL. To maintain a uniform test set distribution, each class contained exactly 1,762 test images.
** Insert Table 1 **
Zero-shot learning model
We used the pre-trained CLIP model [13] to implement ZSL on the test dataset. Figure 1 shows how we used the image encoder and text encoder of the CLIP model to perform zero-shot classification. First, we represent each class using a single phrase or a group of phrases. For example, the beer bottle class can be represented by a single phrase (such as “beer bottle”) or by a group of phrases that describe contexts in which the beverage is portrayed (such as “photo of a person drinking a bottle of beer” and “photo of a bottle of beer on a table”). Then we feed each phrase into the text encoder of the CLIP model to generate a vector representation for each phrase: a condensed sequence of numbers that captures the semantic content of the phrase. Next, an input image is fed into the image encoder of the CLIP model to generate a vector representation of the image, which is directly comparable with the vector representations of the phrases. The dot product of the image vector with each phrase vector then gives a similarity measure between the image and each phrase. We select the class associated with the phrase with the highest similarity measure as the predicted class.
** Insert Figure 1 **
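The following is a minimal sketch of this procedure using the open-source CLIP package (github.com/openai/CLIP). The backbone choice, file path, and phrase list are illustrative and are not necessarily the exact configuration used in our experiments.

```python
# Minimal sketch of CLIP zero-shot classification with the open-source
# "clip" package. Backbone, file path, and phrases are illustrative.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# One representative phrase per class (illustrative subset).
phrases = [
    "photo of a person drinking a bottle of beer",
    "photo of a glass of wine on a table",
    "photo of a cocktail",
]

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(phrases).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)   # one vector for the image
    text_features = model.encode_text(text)      # one vector per phrase

# Normalise so the dot product is a cosine similarity.
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)

similarity = image_features @ text_features.T    # image-phrase similarities
predicted = similarity.argmax(dim=-1).item()     # best-matching phrase index
print(phrases[predicted])
```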
Phrase engineering for zero-shot learning
Recent artificial intelligence (A.I.) models such as ChatGPT [14] and Stable Diffusion [15] that have attracted a widespread user base have given rise to a technique called “prompt engineering”. Prompt engineering is the deliberate act of wording input prompts in a specific way so that the A.I. model produces more desirable results. For example, users have found that including terms such as “4k resolution” and “award-winning photography” in their input prompts led the model to generate higher-quality images. Similarly, the ZSL performance of models is sensitive to the specific phrases used to represent each class, and we refer to the act of carefully selecting such phrases for ZSL as “phrase engineering”. For example, using the term “beer bottle” instead of the phrase “photo of a person drinking a bottle of beer” may lead to worse results when identifying images of beer bottles in a social context, since ZSL models (like CLIP) are usually pre-trained on descriptive captions of images rather than one- or two-word phrases (in this case, contextless class names). Hence it is important to find appropriate descriptive phrases that represent each class. We therefore used our labelled validation set of 12,519 images to find a locally optimal set of descriptive phrases for each class: we repeatedly evaluated model performance with different candidate phrases until the set of descriptive phrases that yielded the best performance for each class was found. Note that only the validation dataset was used for phrase engineering; the test dataset was completely hidden from the ZSL model until the final evaluation.
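Schematically, this search amounts to the loop below. Here `evaluate_uar`, `val_images`, and `val_labels` are hypothetical placeholders standing in for a full zero-shot evaluation pass over the validation set (as in the earlier sketch), and the candidate phrase sets shown are illustrative.

```python
# Schematic of the phrase-engineering search: evaluate each candidate
# set of class phrases on the labelled validation set and keep the
# best-performing one. `evaluate_uar`, `val_images`, and `val_labels`
# are hypothetical placeholders.
candidate_phrase_sets = [
    {"beer/cider bottle": ["beer bottle"]},                    # name-based
    {"beer/cider bottle": ["photo of a person drinking a bottle of beer",
                           "photo of a bottle of beer on a table"]},
]

best_uar, best_phrases = -1.0, None
for class_phrases in candidate_phrase_sets:
    uar = evaluate_uar(class_phrases, val_images, val_labels)
    if uar > best_uar:
        best_uar, best_phrases = uar, class_phrases
# `best_phrases` is then frozen and applied, unchanged, to the test set.
```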
To investigate the sensitivity of ZSL to the phrases used to represent each beverage category, we tested two different approaches. The first approach uses the beverage names and their containers as class labels, exactly as they are referred to in ABIDLA2 [12], such as “Beer/Cider Cup”, “Wine”, and “Whiskey/Cognac/Brandy”. We call these the name-based phrases.
In the second approach, multiple descriptive phrases were used to represent each beverage category. For example, the “beer/cider bottle” class was represented by the following descriptive phrases: “photo of a person drinking a bottle of beer” and “photo of a bottle of beer on a table”. If either of these two phrases matches the image, the image is predicted to be in the “beer/cider bottle” class. Using multiple phrases to describe the same class should give better results, since alcoholic beverages can appear in different settings: sometimes a person is actively drinking from a beer bottle, and other times a beer bottle is simply sitting on a table. A phrase that better matches the setting will likely be more strongly associated with the image, making the image less likely to match an unrelated phrase instead. However, it is neither necessary (nor practically possible) to enumerate every setting an alcoholic beverage can appear in, since in general the type of beverage (e.g., beer versus wine) should still be the predominant factor in determining where the phrase vector is positioned in the vector space. For similar reasons, we did not find it necessary to enumerate all types of alcoholic beverages within a category (i.e., cider in addition to beer; cognac and brandy in addition to whiskey). Owing to the visual similarities among the beverage types within a category (such as beer and cider), such additional phrases did not increase performance. For example, the phrase “photo of a person drinking a bottle of beer” matches images of people drinking cider sufficiently well.
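One natural implementation of “either phrase matches” is to score each class by the maximum image-phrase similarity over its phrases and predict the highest-scoring class. The sketch below illustrates this reading; random unit vectors stand in for the CLIP embeddings (which in practice come from the encoders, as in the earlier sketch), and the class-to-phrase grouping is an illustrative subset.

```python
import numpy as np

# Sketch: score each class by the maximum image-phrase similarity over
# its phrases, then predict the highest-scoring class. Random unit
# vectors stand in for CLIP embeddings so the snippet runs on its own.
rng = np.random.default_rng(0)

def unit(v):
    return v / np.linalg.norm(v)

class_to_phrase_vecs = {
    "beer/cider bottle": [unit(rng.normal(size=512)) for _ in range(2)],
    "wine": [unit(rng.normal(size=512)) for _ in range(2)],
    "others": [unit(rng.normal(size=512)) for _ in range(4)],
}
image_vec = unit(rng.normal(size=512))

scores = {cls: max(float(image_vec @ v) for v in vecs)
          for cls, vecs in class_to_phrase_vecs.items()}
predicted = max(scores, key=scores.get)
print(predicted)
```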
While performing phrase engineering, we found it particularly challenging to create descriptive phrases that capture the entirety of the “others” class, since the “others” class effectively represents any image that contains no alcoholic beverages. For example, if we use just the phrase “others” to represent the “others” class and are given an image of someone drinking from a coke bottle, the image may be associated with the phrase “A person drinking from a beer bottle”, because most of the content of the image matches the bottle-drinking part of the phrase, whereas the more generic “others” phrase may be mapped somewhere further away in the vector space. One way of thinking about this is that the “others” phrase is a single point in a huge vector space, so it is hard to ensure that all non-alcoholic images lie closer to this single point than to the set of points representing all the other classes. For this reason, we opted to create a very extensive list of phrases for the “others” class when using descriptive phrases. Table 2 shows the set of name-based phrases and descriptive phrases used to represent each class.
** Insert Table 2 **
Data analysis
Using our test dataset, we created three separate tasks for evaluating the performance of ZSL against ABIDLA2. Task 1 is to classify any given image into one of eight specific categories: Beer/Cider Cup, Beer/Cider Bottle, Beer/Cider Can, Wine, Champagne, Cocktails, Whiskey/Cognac/Brandy, and Others. Task 2 is to classify any given image into one of four broader categories: Beer (Beer/Cider Cup, Beer/Cider Bottle, and Beer/Cider Can classes merged); Wine (Wine and Champagne classes merged); Spirits (Cocktails and Whiskey/Cognac/Brandy classes merged); and Others. Task 3 is a binary classification problem with two classes: Alcoholic Beverages and Others. We compared the performance of ABIDLA2, ZSL with name-based phrases, and ZSL with descriptive phrases across the three tasks. In addition, we computed a confusion matrix for each of the three models (ABIDLA2, ZSL with name-based phrases, and ZSL with descriptive phrases) against the annotators' labels.
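Since Tasks 2 and 3 are defined purely by merging Task 1 categories, predictions and labels can be mapped down with a simple lookup; the sketch below mirrors the merging described above.

```python
# Map the eight Task 1 classes onto the four Task 2 categories and the
# binary Task 3 categories, following the merging described above.
TASK2_MAP = {
    "Beer/Cider Cup": "Beer",
    "Beer/Cider Bottle": "Beer",
    "Beer/Cider Can": "Beer",
    "Wine": "Wine",
    "Champagne": "Wine",
    "Cocktails": "Spirits",
    "Whiskey/Cognac/Brandy": "Spirits",
    "Others": "Others",
}

def to_task2(task1_label: str) -> str:
    return TASK2_MAP[task1_label]

def to_task3(task1_label: str) -> str:
    return "Others" if task1_label == "Others" else "Alcoholic Beverages"
```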
We report results for two metrics: unweighted average recall (UAR) and per-class recall. We report UAR instead of accuracy because, for Tasks 2 and 3, the class distributions are skewed as a result of merging; accuracy would therefore be dominated by how well the model predicts the majority class (Beer for Task 2 and Alcoholic Beverages for Task 3).
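Concretely, UAR is the unweighted (macro-averaged) mean of the per-class recalls. The following minimal sketch computes both metrics; scikit-learn is our illustrative choice of tooling here, and the labels are toy data.

```python
# UAR is the unweighted mean of per-class recalls, i.e. macro-averaged
# recall. scikit-learn is used for illustration; the labels are toy data.
from sklearn.metrics import recall_score

y_true = ["Beer", "Beer", "Wine", "Spirits", "Others"]
y_pred = ["Beer", "Wine", "Wine", "Spirits", "Others"]

classes = ["Beer", "Wine", "Spirits", "Others"]
per_class = recall_score(y_true, y_pred, average=None, labels=classes)
uar = recall_score(y_true, y_pred, average="macro")
print(dict(zip(classes, per_class)), uar)
```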