AGAR: a microbial colony dataset for deep learning detection

The Annotated Germs for Automated Recognition (AGAR) dataset is an image database of microbial colonies cultured on agar plates. It contains 18 000 photos of five different microorganisms as single or mixed cultures, taken under diverse lighting conditions with two different cameras. All the images are classified as "countable", "uncountable", or "empty", with the "countable" class labeled by microbiologists with colony location and species identification (336 442 colonies in total). This study describes the dataset itself and the process of its development. In the second part, the performance of selected deep neural network architectures for object detection, namely Faster R-CNN and Cascade R-CNN, was evaluated on the AGAR dataset. The results confirmed the great potential of deep learning methods to automate the process of microbe localization and classification based on Petri dish photos. Moreover, AGAR is the first publicly available dataset of this kind and size and will facilitate the future development of machine learning models. The data used in these studies can be found at https://agar.neurosys.com/.


Introduction
In recent years, data-driven artificial intelligence (AI) methods have dominated automated pattern search. In particular, deep learning (DL) approaches successfully reduce the need for feature engineering by leveraging large amounts of data to optimize a model's parameters and find the most important data features 1 . DL is increasingly used across many scientific disciplines. Deep neural networks trained on huge datasets are irreplaceable in modern imaging diagnostics in medicine, enabling fast and precise clinical decisions and medical procedures [2][3][4][5][6] . Another example is the application of DL in cancer research and diagnostics 7-10 , where it improves automation and opens new avenues for AI-assisted precision oncology and medicine in general. Overall, the implementation of AI provides a new quality of medical tools and diagnostics, enabling faster and more efficient healthcare procedures.
Another area where AI could be implemented is microbiology, with regard to standard procedural requirements in the pharmaceutical, cosmetic, or food industries. The manufacturing processes are subject to strict policies and regulations listed, for instance, in the Pharmacopoeia [11][12][13] or European Medicines Agency (EMA) guidelines 14 , including extensive regulations governing microbiological purity in industrial areas 15,16 . These requirements obligate manufacturers to perform constant microbiological monitoring, which means thousands of samples analysed by experienced microbiologists. In most production plants, microbiological procedures are entirely manual, from sample collection and culturing to the final plate evaluation with microbial colony counting and identification. Large companies are starting to implement laboratory automation systems to speed up the operating process 17,18 . Taking into account the current industry demand, we aim to develop an effective automatic analysis of microbiological samples based on DL algorithms, performing efficient and precise microbial sample analysis.
The existing solutions for microbial colony analysis provide only some level of automation. Commonly, the process is limited to generating the photo of a microbial culture grown on a Petri dish, which is further investigated and manually curated by microbiologists. However, the full automation of microbial sample analysis might be accomplished by the implementation of deep neural networks providing robust and accurate algorithms for microbial colony localization, classification, and counting. DL-based solutions can generalize to different acquisition setups and new microbe species more easily than traditional computer vision algorithms, which strongly depend on hand-crafted feature selection 19 . As DL requires well-balanced and diverse data to optimally fit the model's parameters, there is a need to create huge publicly available datasets for neural network training and, consequently, achieving their best performance.
The main objective of our work was to develop a DL-based methodology to identify and count microbial colonies based on the photo of a standard agar plate culture. At first, we prepared the Annotated Germs for Automated Recognition (AGAR) dataset consisting of over 330k labeled microbial colonies (Staphylococcus aureus, Bacillus subtilis, Pseudomonas aeruginosa, Escherichia coli, and Candida albicans) localized on 18k annotated Petri dish photos. Secondly, we evaluated the selected deep neural networks, such as Faster R-CNN 20 and Cascade R-CNN 21 architectures for object detection, with several different backbones (ResNet 22 , ResNeXt 23 , and HRNet 24 ) to set benchmarks for the task. The whole process of collecting and annotating the data as well as the deep learning analysis is presented in Fig. 1.

AGAR dataset preparation
Designing a dataset requires several decisions to be made. In the case of data imaging for microbial colony counting and identification, one needs to select microorganism species to study, provide proper growth conditions (applied medium, incubation temperature and time, culture dilution factors, etc.), and choose an appropriate photo acquisition setup (light source location, external light handling, or camera type). Such decisions for the AGAR dataset were based on the goal of making a diverse dataset for automatic colony counting on agar plates which could be easily reproduced in any microbiological laboratory.
The selection of standard microorganisms for the AGAR dataset preparation was based on Pharmacopoeia 11-13 regulations and ATCC guidelines 25 . Five representatives from different microbial groups were chosen: four bacteria (S. aureus subsp. aureus ATCC 6538, B. subtilis subsp. spizizenii ATCC 6633, P. aeruginosa ATCC 9027, and E. coli ATCC 8739) and C. albicans ATCC 10231 as a yeast strain. A series of 10-fold dilutions of refreshed cultures was made and 100 µl aliquots were inoculated onto Trypticase Soy Agar (TSA) plates in five technical replicates. The microbial cultures were then incubated at 37 °C for 18-24 hours. For data acquisition, single and mixed cultures (S. aureus & P. aeruginosa/E. coli/C. albicans; P. aeruginosa & E. coli/C. albicans; E. coli & C. albicans) were used. The images of colonies grown on agar plates were taken using one of two cameras. Thus, the whole photo dataset was divided into two major subsets: higher-resolution (4000 × 6000 px) and lower-resolution (2048 × 2048 px). Moreover, three different subgroups were distinguished in the former subset based on the illumination conditions in which the photos were taken: bright, dark, and vague. For visualization of plates captured in the various setups see Fig. 3, whereas details on microbial colony growth conditions and acquisition setups are provided in Supplementary Sections 1.1 and 1.2.
A web application was developed for microbiologists to upload and annotate agar plate culture photos. Each sample recorded in the database was marked as countable, uncountable, or empty (Fig. 2(d)). It was important to establish a countable range for manual counting, which is generally between 30 and 300 colonies for a standard 100 mm Petri dish 26,27 . Every countable sample was manually labeled by microbiologists with a bounding box and species label per colony. Samples were classified as uncountable if more than 300 colonies were present, or if no discrete colonies could be seen due to their agglomeration or blurred shape.
In total, 18 000 photos taken under diverse lighting conditions were collected in the AGAR database. Altogether, 336 442 colonies of 5 standard microbes were annotated. A summary of the AGAR dataset statistics is presented in Fig. 2(b). The AGAR dataset achieves a good balance in the number of instances for different microbes (see Fig. 2(a)), which is important for building a robust deep learning model. The interquartile range for all 12 271 countable samples is between 4 and 38 colonies, and around 84.4% of these samples have fewer than 50 colonies. The histograms of the number of annotated colonies per countable image are presented in Fig. 2(c). Additional information on the provided annotations and the bounding box size distribution among microbial classes can be found in Supplementary Sections 1.2 and 1.3.

Deep Learning Analysis
To verify the suitability of the AGAR dataset for building deep learning models for image-based microorganism recognition, and to set a benchmark for the task, we evaluated the performance of two neural network architectures for object detection: Faster R-CNN 20 and Cascade R-CNN 21 , with four different backbones: ResNet-50 22 , ResNet-101 22 , ResNeXt-101 23 , and HRNet 24 . The implementation from the MMDetection 28 toolbox was used. The backbones' weights were initialized using models pre-trained on ImageNet 29 , available in the torchvision package of the PyTorch library. More technical details are provided in Supplementary Section 4.1.
The quantitative results for each subset of the AGAR dataset analyzed separately are reported, as well as the results for models trained on the whole dataset. Each subset was split randomly into approximately 75% samples, which constituted the training set, and 25% for the validation and testing set. Due to the relatively high resolution of all images, the samples were divided into patches of 512×512 px to avoid scaling down, which would make the recognition of small colonies more challenging. If necessary, zero paddings were applied to get a proper aspect ratio. As the goal of this study was to establish a baseline for the microbe detection task, the neural network architecture was not modified, thus the patches were resized to match the default input layer size. More details on the data split and patch preparation can be found in Supplementary Section 2.1.
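The patching step described above can be sketched as follows; this is a minimal NumPy illustration under our own function name, not the exact implementation used in the study (a single-channel image is assumed for brevity, and the bottom/right edges are zero-padded so every tile is full size):

```python
import numpy as np

def split_into_patches(image, patch=512):
    """Split an image into non-overlapping patch x patch tiles,
    zero-padding the bottom/right edges so every tile is full size."""
    h, w = image.shape[:2]
    pad_h = (-h) % patch  # amount needed to reach a multiple of `patch`
    pad_w = (-w) % patch
    padding = [(0, pad_h), (0, pad_w)] + [(0, 0)] * (image.ndim - 2)
    padded = np.pad(image, padding)  # zero padding
    tiles = []
    for y in range(0, padded.shape[0], patch):
        for x in range(0, padded.shape[1], patch):
            tiles.append(padded[y:y + patch, x:x + patch])
    return tiles

# e.g. a 2048 x 2048 lower-resolution image yields a 4 x 4 grid of 512 px patches
patches = split_into_patches(np.zeros((2048, 2048)))
```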
Several approaches have been undertaken for data augmentation, including the addition of Gaussian blur, salt and pepper noise, changing the image color space to LAB and HSV, preparing histogram equalization, rotating images in the [-45, 45] degree range, or cropping around the annotated bounding boxes, but always keeping a visible microbe object. However, the best results were obtained through splitting the image into patches and normalizing it using means and standard deviation per channel values, the same as for the Common Object in Context (COCO) dataset 30 .
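The per-channel normalization can be sketched as below. Note the mean and standard deviation values are an assumption on our part (the figures commonly used for COCO-pretrained models in MMDetection configs); the text only states that COCO statistics were applied:

```python
import numpy as np

# Per-channel mean/std in RGB order, as commonly used in MMDetection configs
# (exact values are an assumption; the paper only says COCO statistics were used)
MEAN = np.array([123.675, 116.28, 103.53])
STD = np.array([58.395, 57.12, 57.375])

def normalize_patch(patch_rgb):
    """Normalize an H x W x 3 uint8 RGB patch channel-wise."""
    return (patch_rgb.astype(np.float32) - MEAN) / STD

patch = np.full((512, 512, 3), 128, dtype=np.uint8)
normalized = normalize_patch(patch)
```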
The evaluation metric for colony detection was based on the mean Average Precision (mAP), as defined for the COCO benchmark. The analysis for each of the major subsets of the AGAR dataset (higher- and lower-resolution) was performed independently. The vague subset was excluded from these studies; it was later used as a different-domain representation in the studies on the models' ability to generalize to new acquisition setups, see Supplementary Sections 4.7 and 4.8 for details. The results for the baseline model (Faster R-CNN with ResNet-50), along with a more complex, multi-stage detector (Cascade R-CNN with HRNet), are presented in Fig. 4. The visualisation of predictions made by Cascade R-CNN with the HRNet backbone is shown in Fig. 3. Detailed descriptions of results achieved by the other tested detectors are provided in Supplementary Sections 4.2 and 4.3.

Figure 3. Examples of Cascade R-CNN with HRNet predictions for different types of microorganisms grown on TSA (Petri plates) captured with different image acquisition setups. Entire plates preserving the photo scale are presented in the first column. The other columns show 20×20 mm fragments of the plate. Well-suited bounding boxes were found for organisms forming smaller, dense colonies, like S. aureus and C. albicans. In the case of P. aeruginosa, growing in large translucent colonies, some were omitted, especially in the vague subgroup, where the edges of colonies could not be clearly defined. The vague subgroup does not include B. subtilis and C. albicans at all.
Our main outcomes in terms of mAP are presented in Fig. 4(a). In general, Cascade R-CNN performs slightly better, with mAP scores of 52.0% for the higher- and 59.4% for the lower-resolution subset, compared to Faster R-CNN, which achieved 49.3% and 56.0%, respectively. A comprehensive study showed that increasing backbone complexity gives only slightly better results (see Supplementary Section 4.2), which suggests that, in this class of architectures, the models' capacity is reached and extending backbone networks further might not lead to much better performance.
The average precision calculated per microbe is higher for microbes forming smaller colonies, namely S. aureus and C. albicans. The rest of the studied microorganisms tend to aggregate or overlap, especially in low dilution samples (more colonies on the plate surface). Moreover, larger colonies tend to have blurred edges with lower contrast against the agar substrate. All this leads to less accurate microbe detection in terms of mAP. The mAP score measures the quality of detection using IoU, which quantifies the overlap between true and predicted bounding boxes. For colony counting, the recognition of microbes is crucial, but the exactness of their location is of secondary importance. Therefore, the sMAPE and MAE metrics were used for tuning the algorithms applied in the post-processing stage to merge predictions for individual patches into the whole test image. The values of two parameters, namely the classification probability threshold and the NMS threshold for the modified soft non-maximum suppression algorithm 31 , were established on the training dataset and then used to eliminate excess bounding boxes (e.g. those appearing twice along the edges of neighbouring overlapping patches) in the test data.
The results of microbe counting are presented in Fig. 4(b). Overall, Cascade R-CNN with HRNet predictions are more accurate, with sMAPE equal to 4.86% for the higher- and 3.81% for the lower-resolution subset. It is worth noting that the most lightweight of the considered models (Faster R-CNN with ResNet-50) is also very effective, with 5.32% and 4.68% sMAPE on the same subsets. As these metrics do not include the error coming from microorganism misclassification, cMAE, defined as the sum of MAE values calculated for every microbe separately, is also provided (in brackets in the altogether row). For cMAE, only properly recognized microbes count as correct. The small difference between MAE and cMAE (e.g. 1.57 vs 1.76 for Cascade R-CNN with HRNet on the lower-resolution subset) proves the great ability of the models to distinguish between species. The selected models' predictions for microbial counting on the test AGAR subsets are presented in Fig. 4(c). The straight black lines represent perfect results, while the blue ones indicate a 10% error to highlight the acceptable error range. The detectors tend to underestimate the exact number of colonies for more populated samples. This is largely due to the nature of the sMAPE metric, which weights errors inversely to the number of instances. Moreover, the AGAR dataset is dominated by plates with fewer than 50 colonies (see Fig. 2(c)), which impacts the values of the probability and NMS thresholds for post-processing. An extension for more highly populated samples using a double thresholding approach is described in Supplementary Section 4.6.
Although empty and uncountable plates come with no colony annotations, qualitative studies were carried out to evaluate the performance of the trained models. Experiments on uncountable samples showed that many plates labeled by microbiologists as having more than 300 colonies were, in fact, within the countable range, as recognized by the detectors. As discussed above, the models are more accurate for samples with less than 50 colonies. However, they give very good estimates for plates with hundreds or even thousands of instances, correctly identifying single colonies in highly populated samples, with the maximum number of predicted colonies on one plate equal to 2782. It is worth noting that it takes seconds for the deep learning system, while it could take up to an hour in the case of manual counting.
The analysis of samples marked by microbiologists as empty demonstrated the great capability of deep learning methods for microbiological quality control. The detectors were able to recognize colonies that were difficult to see and had been missed by human specialists. Overlooked microbes were usually small and located either at the edge of a Petri dish or on a plate factory marking. This is a strong indication that human-AI interaction can lead to a higher quality of microbiological analyses.
Further discussion on empty and uncountable samples can be found in Supplementary Section 4.5. The complementary analysis of the impact of the models' initialization parameters is provided in Supplementary Section 4.4. The detectors' ability to generalize to different AGAR subsets with the transfer learning technique is described in Supplementary Section 4.7. The models' performance on every subset when trained on the whole dataset is provided in Supplementary Section 4.8.

Discussion and Conclusions
There is a growing interest in applying computer vision algorithms to microbiological analyses, especially in the industrial and pharmaceutical sectors, but many existing approaches are based on traditional image processing techniques 32-36 . Although deep learning methods have recently become more commonly used 17, 37-41 , the proposed procedures are usually not end-to-end solutions and cannot be used independently to process an image of an entire Petri plate. The lack of effective DL-based approaches for microbiological purposes possibly results from the poor availability of the huge datasets needed for deep neural network training. To fill this gap, the AGAR dataset is introduced, with 18 000 photos of agar plate (Petri dish) cultures and 336 442 annotated colonies of the five microorganisms (staphylococci, Gram-positive and Gram-negative bacilli, and yeast) most commonly used according to the Pharmacopoeia guidelines.
An earlier study by Ferrari et al. 17,40,41 presented the creation of MicrobIA, a smaller dataset composed of about 29 000 labeled segments (fragments of the Petri dish with single agglomerates) of bacterial colonies grown on blood and chromogenic agar plates. The segments were extracted with the WASPlab system from photos of whole Petri plates and manually assigned to one of seven classes (segments containing from 1 to 6 colonies, and outliers). The MicrobIA dataset was then used to train a convolutional neural network to classify images into one of the six colony counts or the outlier category (bubbles, dust, or dirt on the agar). The authors reported a per-colony error of 28% measured on the individual segments 17 . Another study using the MicrobIA dataset 42 explored the possibility of image generation for agar plate segmentation and investigated the domain shift problem. It showed that synthetic data can be used to train a deep neural network for agar plate image segmentation.
In contrast, our study demonstrates the possibility of exploiting deep learning-based detectors to build end-to-end solutions for multiclass microbial colony recognition and counting. The provided benchmarks of the chosen DL models prove that the AGAR dataset can be used to build robust models, which adjust well to real data acquired in various setups resulting in different illumination and resolution of a Petri plate photo. An exhaustive analysis of eight different deep neural networks for object detection was performed on the two AGAR subsets referred to as higher- and lower-resolution. The best performing model in our study (Cascade R-CNN with the HRNet backbone) achieved low counting errors of 4.92% and 3.81% on the respective subsets. This is a significant improvement over previous reports describing microbial colony counting with computer vision algorithms 17 . In the case of detection, mAP scores within the range from 49.3% to 59.4% were achieved for the different detectors, which is an excellent result compared to other reports (44.6% for Cascade R-CNN and 36.7% for Faster R-CNN [16]) obtained with the same architectures on the COCO dataset.
In summary, the selected R-CNN models perform very well in detecting microbial colonies. This is likely because the detected instances have similar shapes and all species of microbes are well represented in the training data. Moreover, the results obtained with the base Faster R-CNN and the more complex Cascade R-CNN do not differ much. For that reason, exploring different state-of-the-art AI approaches, such as Transformers 43 , seems to be the right future direction for the implementation of deep learning in microbiology.
There is a visibly increasing demand for artificial intelligence in numerous human activities, including social interactions, industry, agriculture, and medicine. The AGAR dataset, which compiles a huge variety of Petri dish culture photos, is a great data source for further exploration. The use of generative models on AGAR might, for example, improve the performance of colony detection on a wider variety of images from other domains, as well as with a larger number of microbial species. Additionally, the large variety in the distribution of colony sizes shows the potential of the AGAR dataset for detailed studies of colony growth dynamics, which can, for example, help us better understand antibiotic resistance of bacteria.

The AGAR dataset in detail
AGAR is a color image dataset created for developing methods of microbial colony detection and counting. It contains 18 000 annotated photos of Petri dishes, with 336 442 annotated colonies in total. The images were taken under diverse lighting conditions with two cameras. More information about microbial colony inoculation and the data acquisition setups can be found in Sections 1.1 and 1.2. The image annotation procedure is described in detail in Section 1.3.

Data acquisition
The first, simpler image acquisition system consists of a Nikon D3500 camera with a 60 mm lens and a 24 Mpx sensor, and a plexiglass stand for Petri dishes with an LED light source mounted below. Data collected with this setup is labeled as higher-resolution (4 000 × 6 000 px). In order to provide some diversity, small changes in the setup were introduced during the data acquisition process. Thus, three subgroups of higher-resolution photos are distinguished: bright, dark, and vague (see Fig. 1). The former two were made with the whole setup closed in a box to eliminate the influence of ambient light; the difference between them lies in the color of the plexiglass used for the stand: white for the bright subgroup, black for the dark one. The last subgroup, vague, was exposed to available light, which resulted in low contrast images that are hard to annotate even for a professional microbiologist. A different setup was used to collect the part of the dataset referred to as lower-resolution (2048 × 2048 px). This system includes a holder designed to mount a Petri dish on (visible in the photos), a monochrome camera (IDS UI-5370CP-M-GL with 4.19 Mpx resolution), and a set of adjustable light sources below and above the holder (their parameters were changed during the data acquisition process). RGB images are obtained as a composition of three monochromatic photos taken with different optical filters, corresponding to red, green, and blue. Example photos of agar plate cultures taken under different conditions are presented in Fig. 1.
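The composition of an RGB image from the three filtered monochrome captures amounts to stacking them as color channels; a minimal sketch (the function name is ours, and any white balancing or filter calibration in the actual pipeline is omitted):

```python
import numpy as np

def compose_rgb(red, green, blue):
    """Compose an RGB image from three monochrome captures taken
    with red, green, and blue optical filters, by stacking them
    as the last (channel) axis."""
    assert red.shape == green.shape == blue.shape
    return np.stack([red, green, blue], axis=-1)

# Three monochrome 2048 x 2048 frames give one 2048 x 2048 x 3 RGB image
r = g = b = np.zeros((2048, 2048), dtype=np.uint8)
rgb = compose_rgb(r, g, b)
```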

Image annotations
The design of a robust annotation pipeline was critical given the goal of labeling 18 000 Petri dish photos, which led to over 330 000 labelled colonies. The image acquisition setup was equipped with a digital counter controlled by a Raspberry Pi to display a sample identification (ID) number. This allowed us to verify that the appropriate photos were uploaded for the corresponding entry in the database. Please note that images are automatically cropped (based on their histograms) during the first stage of preprocessing to keep only the part with a Petri dish (and are stored in this form in the AGAR dataset); however, in some cases a part of the counter can still be visible.
A web application was developed for microbiologists to upload and annotate agar plate culture photos. Each sample recorded in the database is marked as countable, uncountable, or empty. There are some indications 5, 6 of the optimal number of colonies for manual counting. Based on them, samples with more than 300 colonies are treated as uncountable. Please note, however, that some samples may also be marked as uncountable if there were problems with clear identification of colonies (e.g. due to low contrast in the lower-resolution subset). Similarly, in some samples with between 50 and 300 colonies, colonies agglomerated and it was difficult to identify their boundaries (see Fig. 2). In the case of samples identified as countable, microbiologists annotated every colony with its location (by drawing a bounding box) and class. There are 7 possible classes: 5 microorganisms (S. aureus, B. subtilis, P. aeruginosa, E. coli, C. albicans), defects (crack marks on an agar surface), and contamination (microbial contamination such as environmental microorganisms and sometimes fungi).
The annotations are stored in JSON files provided for every sample, with the same basename as the corresponding image file. They all share the same data structure, presented in Listing 1. Each file contains a series of fields, including: the background category name (bright, dark, vague, or lower-resolution), the list of names of microbe species that were inoculated on a Petri dish (note that sometimes they have not grown), the total number of colonies (excluding defects and contamination), the list of bounding boxes, and the sample ID. The provided bounding boxes give information about the pixel coordinates of the top-left corner, the width and height, the microbe class, and the annotation ID. Please note that in the case of uncountable samples the colonies_number key is set to -1.
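For illustration, a hypothetical annotation file and its use might look as follows. Only the colonies_number key is named explicitly in the text, so the remaining key names are illustrative assumptions rather than the actual Listing 1 schema:

```python
import json

# A hypothetical annotation file following the fields described above;
# "colonies_number" is the only key named in the text, the other key
# names are illustrative assumptions, not the real Listing 1 schema.
sample_json = """
{
  "background": "dark",
  "inoculated_species": ["S.aureus"],
  "colonies_number": 1,
  "labels": [{"class": "S.aureus", "x": 120, "y": 80,
              "width": 34, "height": 30, "id": 1}],
  "sample_id": 42
}
"""

ann = json.loads(sample_json)
# A value of -1 marks an uncountable sample
countable = ann["colonies_number"] >= 0
```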

For convenience, annotations are also provided in the well-known COCO format 7 , with an additional key describing the background category.

Figure 2. Examples of images of agar plate cultures with different numbers of colonies.

Statistical data analysis
The summary of AGAR dataset statistics showing the distribution of images over the different background categories is presented in a pie chart in Fig. 3 (left). Each color represents a different background category, while the different shades indicate sample classes (empty, countable, or uncountable). The samples from the dark subgroup constitute about half of the whole dataset, and they include the photos with the best quality in terms of contrast between grown microorganisms and the agar surface. At the very beginning of the data collection process, the full range of dilutions was used to collect a variety of samples, as in the case of the higher-resolution subgroup. Later, however, low dilution levels were used less often to reduce the number of uncountable plates. That is why the lower-resolution group, acquired later, includes far fewer of them than the other subgroups.
In general, the AGAR dataset achieves a good balance in the number of instances of the different microbe species, which is important for building a robust deep learning model. However, individual subgroups may be partially unbalanced, as presented in Fig. 3 (right). The vague subgroup does not include B. subtilis and C. albicans at all, but it has more than 80% of photos with two different microbe species, in contrast to 15% for higher-resolution (excluding vague) and 40% for lower-resolution. It is also not as numerous as the other subgroups, because images from the vague subset are difficult to annotate. B. subtilis is also underrepresented in the lower-resolution subset, though there is still enough data to train a detector to recognize this species. Fig. 4 shows the size variability of annotations per category for the whole dataset. Two basic box size distributions can be distinguished: 0-128 px for C. albicans and S. aureus, and 16-512 px for P. aeruginosa, B. subtilis, and E. coli. In total (excluding defects and contamination), there are 154 630 bounding boxes with a square root of area below 128 px, 180 173 within 128-512 px, and 1 639 above 512 px. This wide range of sizes makes the detection task more challenging because models have to be flexible enough to handle the variety of instance dimensions.
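The size binning used in the statistics above is based on the square root of the bounding box area; a minimal sketch (the function name and bin labels are ours):

```python
import math

def size_bin(width, height):
    """Assign a bounding box to one of the size bins used above,
    based on the square root of its area in pixels."""
    side = math.sqrt(width * height)
    if side < 128:
        return "0-128 px"
    if side <= 512:
        return "128-512 px"
    return ">512 px"

# A 30 x 30 px box (sqrt of area = 30) falls into the smallest bin
bin_small = size_bin(30, 30)
```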
Another interesting aspect is the distribution of growing colonies on the plate, as presented in Fig. 5 (only for the higher-resolution background category, excluding vague samples). C. albicans and S. aureus form smaller colonies, which are more spatially separated on a Petri dish. The rest of the considered species mainly cover a middle ring, which can be related to the method of microbe inoculation on the agar medium.
As mentioned before, the countable class includes samples with fewer than 300 colonies. The histograms shown in Fig. 6 present the distribution of countable samples in terms of the number of annotated colonies. Histograms of the number of annotated colonies per image divided by microbial species are also included. In general, the interquartile range for the whole dataset is between 4 and 38 colonies per image. In the case of B. subtilis, a higher number of instances can be observed outside this range. Also, the distribution of the vague part of the dataset behaves slightly differently, with a somewhat higher median.

Data processing
AGAR images are very large compared to those in regular datasets like COCO 7 . The original image resolution in the dataset ranges from about 2048 × 2048 to about 4000 × 6000 pixels, while most images in common datasets (e.g. COCO) are no more than 1000 × 1000 pixels. The annotations prepared by microbiologists were made on resized images, so, during pre-processing, the pixel positions of bounding boxes have to be recalculated. Also during pre-processing, the visible parts of the electronic counter (which helped us distinguish the samples from each other during the photo collection process) were cut out from the images. Such large images cannot be fed into a neural network in their raw state, so pre-processing that splits them into smaller parts must be applied first. Our pre-processing methods are described in Subsection 2.1. The post-processing methods that allowed us to count colonies from entire photos are presented in Subsection 2.2.

Data pre-processing
In order to ensure that the training and test data distributions approximately match, 3/4 of the original images were randomly selected as the training set and the rest were kept as the validation set, for each background category separately. Please note that the same image subset is used for the validation and test sets; however, patches for the two sets are generated in a different manner, as explained below. The list of image IDs of all original images used for the training and validation sets is publicly provided, but the exact prepared patches are not.
To control the balance between empty patches and ones containing microbes, first only training patches with at least one colony are generated, and then an appropriate number of empty ones (a few percent of the number of generated patches with colonies) is added. The algorithm to divide images consists of a few subsequent steps. Firstly, a binary mask of the areas of all bounding boxes present in a given image (white pixels make up the colony instances, black the background) is generated. In the following steps, one box is chosen randomly, a patch of size 512×512 containing the whole box is cut out (the top-left corner selection is also random), and all other boxes (if any) on the patch are also marked as chosen. The above steps are repeated until every box in the image is marked. This procedure is illustrated as the red path in Fig. 7. On the other hand, extracting test data patches is done without exploiting annotations. A sliding window with an overlap equal to 1/8 of the width of the patch is used to get adjacent image parts (blue path in Fig. 7). The overlap is necessary to get the final prediction for the whole image and to avoid missing any colonies that happen to lie in a division area. Removing redundant predictions is done in the post-processing of the output given by the neural network.
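The sliding-window extraction of test patches can be sketched along one image axis as follows; this is a simplified sketch under the assumption that a final window is aligned to the image edge so that no pixels are missed (the exact border handling in the study may differ):

```python
def sliding_window_origins(length, patch=512):
    """Top-left coordinates of test patches along one image axis,
    using an overlap of 1/8 of the patch width (i.e. a 448 px stride
    for 512 px patches). A last, edge-aligned window is appended when
    needed so that no pixels are left uncovered (an assumption)."""
    stride = patch - patch // 8
    origins = list(range(0, max(length - patch, 0) + 1, stride))
    if origins[-1] + patch < length:
        origins.append(length - patch)
    return origins

# For a 2048 px wide image the windows start at 0, 448, 896, 1344, 1536
cols = sliding_window_origins(2048)
```

The full 2-D grid of patches is then the Cartesian product of the origins along the two axes.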

Data post-processing
The post-processing consists of gathering together the predicted bounding boxes from every patch belonging to one image and recalculating their exact positions (rescaling and applying an appropriate offset that takes into account the given patch position relative to the whole image). After that, excessive boxes are filtered out using: (1) the probability of the detected microbe class and (2) a variation of the Soft Non-Maximum Suppression (Soft-NMS) algorithm 8 . Our modified version of Soft-NMS is based not on the highest bounding box confidence level, but on the largest prediction area. The thresholds for both filters are tuned on training data (using a simple grid search for each model separately) so as to minimize the microbe counting metrics defined in Section 3.
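A minimal sketch of the described area-driven variant, assuming the Gaussian score decay of the original Soft-NMS paper; the function names and the exact decay form are illustrative, not the authors' implementation:

```python
import math

def iou(a, b):
    """Intersection over Union for two (x1, y1, x2, y2) boxes."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

def area_soft_nms(boxes, scores, sigma=0.5, score_thr=0.05):
    """Area-driven Soft-NMS sketch: the pivot box at each step is the one
    with the LARGEST AREA (rather than the highest confidence); boxes
    overlapping the pivot have their scores decayed with a Gaussian
    penalty and are dropped once they fall below score_thr."""
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    cand = list(zip(boxes, scores))
    keep = []
    while cand:
        p = max(range(len(cand)), key=lambda i: area(cand[i][0]))
        box, score = cand.pop(p)
        keep.append((box, score))
        cand = [(b, s * math.exp(-iou(box, b) ** 2 / sigma)) for b, s in cand]
        cand = [(b, s) for b, s in cand if s >= score_thr]
    return keep
```

With this rule, a small box heavily overlapped by a larger one is suppressed regardless of its confidence, which matches the intuition that the larger prediction better covers a merged colony blob.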

Models evaluation metrics
Two stages of the task are considered: detection, i.e. localization and classification (Section 3.1), and then counting the colonies of microorganisms (Section 3.2). As an evaluation metric for colony detection, we rely on the mean Average Precision (mAP) established for the COCO competition 7 . The Average Precision (AP) is the precision integrated over the whole recall range of the precision-recall (PR) curve at a given Intersection over Union (IoU) threshold. IoU describes the level of overlap between the ground truth and the predicted bounding box. The mean value of AP over different IoU thresholds and/or classes gives mAP. To measure the effectiveness of colony counting, we provide three separate metrics, namely the Mean Absolute Error (MAE), its cumulative extension called cMAE, and the Symmetric Mean Absolute Percentage Error (sMAPE), evaluated for different models on every data subset.

Detection metrics
Two neural network architectures for two-stage object detection are evaluated to recognize and localize microbial colonies. Their performance is mainly measured using AP@[0.50:0.95] (mAP over different IoU thresholds, from 0.5 to 0.95, step 0.05).
IoU measures how well a predicted bounding box fits the real location of an object. It is defined as

$$\mathrm{IoU} = \frac{|B_{\mathrm{true}} \cap B_{\mathrm{pred}}|}{|B_{\mathrm{true}} \cup B_{\mathrm{pred}}|},$$

where the intersection and union refer to the areas of the true and predicted bounding boxes. To decide whether a bounding box proposed by a model is valid or not, one should choose an appropriate threshold value. For the selected threshold we can distinguish:

• True Positive (TP) detection if IoU is greater than or equal to the threshold value,
• False Positive (FP) detection if IoU is less than the threshold value,
• False Negative (FN) detection if a model fails to detect the proper object,
• True Negative (TN) detection when the model correctly did not predict any object.
With the above definitions, precision and recall are given by the expressions

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}.$$

Precision shows how accurate the predictions are, while recall measures how good a model is at finding objects. The precision-recall curves used in this study show the trade-off between the two: usually an increase of one metric causes a decrease of the other. High recall but low precision reflects a large number of predicted bounding boxes, most of them classified incorrectly. High precision but low recall is the opposite situation: a model returns very few results, but most of its predicted labels are correct.
Finally, the Average Precision for a given microbe class is calculated by integrating the area under the precision-recall curve (using e.g. the trapezoidal rule). It is usually done for different IoU thresholds. The mAP score is the mean over all thresholds (typically AP@[0.50:0.95] is considered 7 ) and/or all microbe classes.
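These detection metrics can be transcribed in a few lines of Python; this is an illustrative sketch, not the COCO evaluation code:

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP), Recall = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(recalls, precisions):
    """Area under a precision-recall curve via the trapezoidal rule.
    `recalls` must be sorted in increasing order, one point per
    confidence cut-off."""
    ap = 0.0
    for i in range(1, len(recalls)):
        ap += (recalls[i] - recalls[i - 1]) * (precisions[i] + precisions[i - 1]) / 2
    return ap
```

Averaging `average_precision` over the ten IoU thresholds 0.50, 0.55, ..., 0.95 (and over classes) yields the mAP reported later.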

Counting metrics
Counting by detection is supposed to be the most natural approach to get the number of objects on an image. Even if localizing colonies itself is not necessary for this task, it is very useful to investigate the achieved results (especially to understand the cases in which model makes mistakes). To evaluate counting task the MAE, cMAE, and sMAPE were used.
MAE measures the average error between paired observations: the ground truth count of all annotations and the number of predicted bounding boxes, for the whole image of a Petri dish. As this metric does not include the error coming from the misclassification of microorganisms, cMAE, being the sum of MAE calculated for every microbe class separately, is also provided (given in brackets in the altogether row); for cMAE only properly recognized microbes are taken as a correct count. sMAPE is similar to the general MAE, but it weights particular sample errors inversely with the number of colonies on a plate. The intuition behind this metric is as follows: in a real-world scenario, humans tend to perceive counts on a logarithmic scale. That means that a mistake of 1 for an empty dish might seem intolerable, but the same mistake for a ground truth count of 50 might seem acceptable 9,10 .
For this reason, the thresholds for the microbe class probability and NMS are selected for each model to minimize first the sMAPE metric, which is defined as

$$\mathrm{sMAPE} = \frac{100\%}{N} \sum_{i=1}^{N} \frac{|x_i - \hat{x}_i|}{x_i + \hat{x}_i},$$

where N is the number of all countable samples, $x_i$ is the total count of instances present in the i-th image, and $\hat{x}_i$ is the predicted number of microbe instances. In the above formula, if $x_i = \hat{x}_i = 0$, then the i-th term in the summation is 0 (the calculated error is zero, because the number of predicted colonies is correct). Please note that the denominator amplifies counting errors for samples with a small number of colonies, ensuring that they are equally important as those with more microbe instances. Moreover, the applied symmetrization favors neither counts underestimated (small recall) nor overestimated (low precision) by a model. If a few of the lowest sMAPE values (for different thresholds) differ only slightly (by less than 0.1%), we compare the achieved MAE values, calculated using the formula

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} |x_i - \hat{x}_i|,$$

and choose the pair of thresholds that gives the lowest MAE. Additionally, the sum of MAE calculated for every microbe class separately, called the cumulative MAE, is given as

$$\mathrm{cMAE} = \sum_{m=1}^{K} \mathrm{MAE}_m,$$

where K = 5 is the number of all microbe species, and $\mathrm{MAE}_m$ indicates the MAE calculated for the m-th microbial class solely.
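A direct transcription of the three counting metrics, assuming the sMAPE denominator is the plain sum $x_i + \hat{x}_i$ as described; the helper names are ours:

```python
def mae(true_counts, pred_counts):
    """Mean Absolute Error over per-image colony counts."""
    return sum(abs(t - p) for t, p in zip(true_counts, pred_counts)) / len(true_counts)

def smape(true_counts, pred_counts):
    """Symmetric MAPE in percent; a term is taken as zero when both the
    ground truth and the predicted count are zero."""
    total = 0.0
    for t, p in zip(true_counts, pred_counts):
        if t == p == 0:
            continue
        total += abs(t - p) / (t + p)
    return 100.0 * total / len(true_counts)

def cmae(per_class_true, per_class_pred):
    """Cumulative MAE: the sum of MAE computed per microbe class."""
    return sum(mae(t, p) for t, p in zip(per_class_true, per_class_pred))
```

Note how `smape` weights a miss of 2 colonies on a nearly empty plate far more heavily than the same miss on a plate with 50 colonies.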

Deep learning models evaluation and comparison
In this section, a detailed analysis is presented to verify the suitability of the AGAR dataset for building deep learning models for image-based microorganism recognition. To set a benchmark for the task, the performance of two neural network architectures for object detection, Faster R-CNN 11 and Cascade R-CNN 12 , with four different backbones, ResNet-50 13 , ResNet-101 13 , ResNeXt-101 14 , and HRNet 15 , is evaluated. The implementation from the MMDetection 16 toolbox was used. Backbones' weights were initialized using models pre-trained on ImageNet 17 , available in the torchvision package of the PyTorch library. The quantitative results for each subset of AGAR analyzed separately, as well as the results for models trained on the whole dataset, are reported. Each subset was split randomly into approximately 75% of samples constituting the training set and 25% for the validation/testing set. Due to the relatively high resolution of all images, they were divided into patches of 512x512 px to avoid scaling them down, which could make it difficult to recognize small colonies. If necessary, zero padding was applied to get a proper aspect ratio. As the goal of this study is to establish a baseline for the microbe detection task, the neural network architectures were not adjusted, so the patches were resized to match the default input layer size.
Several approaches to data augmentation were tested. During training, Gaussian blur and salt-and-pepper noise were added, the image color space was changed to LAB and HSV, histogram equalization was applied, and images were rotated in the [−45, 45] degree range and cropped around the annotated bounding boxes, so that there was always a visible microbe object. However, the best results were obtained simply by splitting the images into patches and normalizing them using the per-channel means and standard deviations of the COCO dataset 7 .

Training and validation details
As mentioned earlier, patches for validation and testing are prepared from the same photos (1/4 of the given subset), albeit in a different manner: patches for testing are cut out from images evenly, with some padding applied (see the blue path in Fig. 7). To filter out unwanted bounding boxes during the post-processing stage, two parameters need to be adjusted: the detected colony classification probability and the NMS threshold. Both of them are fitted to each subset separately. However, to ensure methodological correctness during tests, they are optimized on a subset of the training data, and the values which minimize sMAPE are chosen.
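The tuning step can be sketched as a plain grid search; `eval_smape` is a stand-in for running the detector and post-processing on the training patches with a given threshold pair:

```python
import itertools

def tune_thresholds(eval_smape, prob_grid, nms_grid):
    """Minimal grid search over the two post-processing thresholds.
    eval_smape(prob_thr, nms_thr) is assumed to evaluate the detector on
    the training data with the given thresholds and return the sMAPE."""
    best = None
    for p, n in itertools.product(prob_grid, nms_grid):
        s = eval_smape(p, n)
        if best is None or s < best[0]:
            best = (s, p, n)
    return best  # (smape, prob_thr, nms_thr)
```

Ties in sMAPE below 0.1% would then be broken by comparing MAE, as described in Section 3.2.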
The AGAR dataset is relatively large, and the top deep learning models used in this study have complex structures (e.g. the Cascade R-CNN model with HRNet has 956 layers in total). Therefore, the training process lasts from about 2 days for the simplest (and smallest) model, Faster R-CNN with a ResNet-50 backbone, to about 5 days for the biggest one, Cascade R-CNN with the HRNetV2W48 backbone. Calculations were performed on NVIDIA Tesla V100 GPUs.
During training, the Stochastic Gradient Descent (SGD) method was used for optimizing the network weights. Models were trained over 20 epochs with a batch size of 3-10 patches (depending on the model complexity, i.e. GPU memory consumption) and an adaptive learning rate initialized at 0.0001. After about 20 epochs the loss values saturate, and longer training reduces neither the training nor the validation error.
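For illustration, an MMDetection-style (v2.x) configuration fragment consistent with this description might look as follows; the step schedule, momentum, and weight decay are assumptions, since only SGD, 20 epochs, the batch size range, and the 0.0001 initial learning rate are stated:

```python
# Illustrative MMDetection-style training configuration. Only the SGD
# optimizer, 20 epochs, and the 0.0001 initial learning rate come from
# the text; momentum, weight decay, and the decay steps are assumed.
optimizer = dict(type='SGD', lr=0.0001, momentum=0.9, weight_decay=0.0001)
optimizer_config = dict(grad_clip=None)
# step decay as a stand-in for the "adaptive learning rate"
lr_config = dict(policy='step', step=[14, 18])
runner = dict(type='EpochBasedRunner', max_epochs=20)
data = dict(samples_per_gpu=3)  # 3-10 patches per batch, model-dependent
```

In MMDetection such fragments are merged into the per-model config file rather than executed directly.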

Precision of detection
Generally speaking, the precision of detection relates to the overlap between predicted and ground truth bounding boxes, which is measured by the IoU. A prediction is valid if the IoU is above a given threshold (and the microbe is properly classified), as described in Section 3.1. Precision and recall are then calculated (as defined by Eq. 2) for every colony in a subset, giving a precision-recall curve. Fig. 8 presents the PR curves at different IoU threshold levels, ranging from 0.5 to 0.95, for two selected models, on the lower- and higher-resolution data subsets separately. The black curve is the average over all thresholds. These curves show two characteristic features. Firstly, precision competes with recall, in the sense that requiring high detection precision results in low recall, and vice versa. Moreover, the higher the IoU threshold (i.e. the stricter the overlap requirement), the lower the precision values. In the ideal case, precision equals one for all values of recall. Secondly, for high recall values, precision is greater for the lower-resolution background category.
The averaged PR curves (over IoU thresholds in the range [0.5 : 0.95]) for different models are presented in Fig. 9. Altogether, eight models are considered, being all possible combinations of two detector architectures (Faster R-CNN and Cascade R-CNN) with four different backbones (ResNet-50, ResNet-101, ResNeXt-101, and HRNetV2W48). The smallest model (Faster R-CNN with ResNet-50) is chosen as a baseline, and the relative performance of all other models is also shown in Fig. 9. The best results are obtained with Cascade R-CNN with ResNeXt-101 for the higher-resolution background category, and with HRNet for the lower-resolution subset. However, the small differences between the models' performance suggest that the capability of this class of architectures has been reached, and further complicating the models (e.g. by adding more layers to the backbones) may not lead to much better performance.
The average precision values at different IoU thresholds are presented in Table 1 for the higher-resolution subset and in Table 2 for the lower-resolution one. The last row is the mean AP (mAP) over the whole range of thresholds, i.e.

$$\mathrm{mAP} = \frac{1}{10} \sum_{t \in \{0.50, 0.55, \ldots, 0.95\}} \mathrm{AP}@t.$$

The results presented in Figures 8-9 and Tables 1-2 are averaged over all microbe classes. Per-microbe precision is shown in Tab. 3 for the baseline model compared to Cascade R-CNN with the HRNet backbone, on the higher-resolution subset. In general, the detector performs better for microbes that form smaller colonies, i.e. S. aureus and C. albicans. The lower precision values for bigger colonies are mainly due to their tendency to aggregate and overlap, especially in densely populated samples. Such overlapping makes it harder to detect individual colonies. Moreover, larger colonies tend to have blurred edges (with lower contrast relative to the agar substrate), which affects the IoU.

Microbe counting results
When it comes to counting, the recognition of microbes is crucial, but the exactness of their localization (i.e. the overlap of true and predicted bounding boxes) is of secondary importance. In the post-processing stage, the sMAPE and MAE metrics are used to tune the probability and NMS thresholds on the AGAR training set, as described in Section 3.2. The thresholds are used when the predictions for individual patches are merged into the prediction for the whole test image (to get rid of excess bounding boxes, e.g. ones appearing twice across the edges of neighbouring patches). The performance of Faster R-CNN and Cascade R-CNN models with ResNet-50 and HRNet backbones for microbial counting is presented in Fig. 10 (left) for both major background categories of the AGAR dataset, and in Fig. 10 (right) for all five bacteria species separately. The straight black line represents the identity function (meaning perfect prediction), while the cyan lines indicate 10% error to highlight the acceptable range. Starting from around 50 colonies, the detectors clearly underestimate the exact number. This is largely due to the nature of sMAPE, which gives more weight to errors coming from less populated samples. Moreover, the AGAR dataset is dominated by plates with fewer than 50 colonies; therefore, the probability and NMS thresholds for post-processing are tuned to favour these samples. Microbe counting metrics of various models for the higher- and lower-resolution subsets are presented in Tables 4-5. While the best results are obtained with Cascade R-CNN with HRNet, it is worth noting that the most lightweight model (Faster R-CNN with ResNet-50) is not much worse. In general, the errors for the lower-resolution subset are smaller, which is likely caused by the presence of more difficult samples in the higher-resolution subset, while the lower-resolution subset is more homogeneous and therefore simpler.
Moreover, the counting performance is better for the smallest microbial colonies (C. albicans and S. aureus), which is related to the highest mAP for these samples (as explained in Sec. 4.2).

Anchor tuning
Hyperparameters are the variables which determine the network structure itself (e.g. the number of hidden layers) or its training process (e.g. the learning rate or the number of epochs). The models tested in this study are already provided with default (pre-optimized) learning hyperparameter values 16 and network architectures, specific to each model and backbone used. However, it may be interesting to verify whether tuning some parameters to our specific dataset yields better results in terms of lower counting error metrics. One of the crucial parameters that impact object detection performance is the anchor box setup. Anchors are initial guessed bounding boxes assumed at the start of the detection procedure. The proper selection of anchors is a broad topic, currently investigated by many researchers. For example, automatic anchor selection techniques, i.e. ones which learn meta-anchors during network training, have been proposed recently 18,19 .
Here, a basic approach is applied by simply varying the anchors' parameters and training the Cascade R-CNN with HRNet model on the higher-resolution subset for each setup separately. Table 6 presents the results (counting errors) for various modifications of the anchors' parameters: (1) the aspect ratio (i.e. x vs y size), (2) the list of scales, and (3) the stride. While the anchor setup has an impact on the detection performance, it is hard to significantly improve on the default settings. This suggests that the chosen pre-tuned models are well adjusted to work on the AGAR dataset. The only significant impact comes from modifying the anchors' strides, causing a noteworthy decrease of MAE with slightly higher sMAPE. This means that the tuning of the stride improves the results for highly populated samples. Moreover, narrowing the anchors' ratio range from [0.5, 2.0] to [0.75, 1.5] improves sMAPE from 4.86% to 4.68%. This is expected, as a typical bounding box for a microbe colony has a square-like shape.
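For reference, Faster R-CNN-style anchor generation for a single feature-map location can be sketched as below; this is a generic illustration of how the ratio and scale lists translate into boxes, not the MMDetection generator:

```python
def make_anchors(base_size, ratios, scales):
    """Generate centered anchors (x1, y1, x2, y2) for one feature-map
    location: each anchor keeps the area (base_size * scale)^2 constant
    while its aspect ratio (h / w) varies over `ratios`."""
    anchors = []
    for s in scales:
        for r in ratios:
            area = (base_size * s) ** 2
            w = (area / r) ** 0.5
            h = w * r
            anchors.append((-w / 2, -h / 2, w / 2, h / 2))
    return anchors
```

Narrowing `ratios` toward 1.0, as in the experiment above, concentrates all anchors on the square-like shapes typical of colony bounding boxes.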

Empty and uncountable plates
The performance of trained Cascade R-CNN with HRNet backbone was evaluated additionally on the empty and uncountable samples of the AGAR dataset, for both major background categories: lower-and higher-resolution (without the vague subset). In this case only qualitative results are provided, because ground truth annotations are available only for countable plates.
Testing empty samples allows us to investigate the false positive predictions (the model finds colonies on plates without grown microbes). The most common FPs are: discoloration and substrate loss, plate edge stains, formed blisters, and water droplets. Also note that in some situations correct colonies are found, but they are hardly visible and so were not marked by the annotators. On the other hand, tests performed on uncountable samples enable us to check how the detector (trained on annotated data with at most 300 colonies) performs on very crowded plates. Here, the obvious problem for the detector is the so-called microbial lawn, meaning individual colonies fused together and spread evenly over the agar surface (Fig. 12(a)). Another similar challenge is recognizing (and consequently counting) individual colonies in a bacterial growth streak, as presented in Fig. 12(b). Very small colonies (a few pixels in size) are difficult to detect; however, due to their large separation, the model shows great potential to roughly estimate the number of colonies even for very crowded plates: at best, it detects 2782 colonies (Fig. 12(c)). The last example (Fig. 12(d)) presents a sample with two different species grown on one agar plate. The model is able to detect even hardly visible colonies, although it also recognizes some discolorations as microbes.
The model performs quite well at detecting colonies in both edge cases; however, it handles empty plates slightly better. False positive predictions occur for only about 6-7% of all empty dishes. Note, however, that in some cases there are actually formed colonies missed by the annotators (as in the examples in Fig. 11). In the case of the uncountable subset, the number of colonies is very often underestimated. The model predicts fewer than 300 colonies for 41% and 85% of samples for the higher- and lower-resolution subsets, respectively; however, only 7% and 4% are estimated to have fewer than 100 colonies. Note that it was harder to annotate the lower-resolution images (because of the low contrast), so some samples labeled as uncountable may actually be within the defined countable range. Moreover, as discussed before, the sMAPE metric used for threshold optimization causes the model to be more accurate for samples with a smaller number of colonies on a plate. Detailed distributions of counted colonies can be found in Fig. 13.

Double thresholding approach
Looking at the obtained results for microbe counting (see Fig. 10), one can conclude that the trained models underestimate the number of colonies for more populated samples. This is due to the distribution of our data, presented in Fig. 6, with more than 84% of samples being photos of plates with fewer than 50 grown colonies. To remedy this, a double thresholding approach was investigated. Two sets of the probability and NMS thresholds are established based on training data: general (tuned to all samples) and auxiliary (tuned to samples with more than 50 colonies). The mixed approach uses the general thresholds by default; however, if the model predicts more than 50 colonies, the auxiliary ones are applied for a given sample. The results for different threshold settings are presented in Table 7. As one can expect, the auxiliary thresholds taken individually increase sMAPE for test data, but they also decrease MAE. The mixed approach gives a good balance and works slightly better than single general thresholding for all samples in the AGAR dataset. Its performance for microbe counting is shown in Fig. 14.

Figure 13. Distribution of the number of predicted colonies for higher- and lower-resolution photos in both empty and uncountable edge cases.

Figure 14. Performance of microbial colony counting on two image subgroups using Cascade R-CNN with HRNet for the double thresholding approach.
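The mixed rule can be sketched as follows; `detect` is a hypothetical helper returning a colony count for a given (probability, NMS) threshold pair:

```python
def count_with_double_thresholds(detect, image, general_thr, auxiliary_thr,
                                 switch_at=50):
    """Double-thresholding sketch: count with the general thresholds
    first; if the prediction exceeds `switch_at` colonies, redo the count
    with the auxiliary thresholds tuned on highly populated plates.
    detect(image, thr) is an assumed helper returning a colony count."""
    n = detect(image, general_thr)
    if n > switch_at:
        n = detect(image, auxiliary_thr)
    return n
```

The switch point of 50 colonies mirrors the population boundary used to fit the auxiliary thresholds.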

Transfer learning
Transfer learning is a technique used to improve learning of a new task by transferring knowledge from related problems 20,21 . It can be used to adjust a model trained on the AGAR dataset to a new type of samples (different microbes, lighting conditions, etc.), unseen by the detector, if the new dataset is too small to perform efficient training solely on it. Here, the ability of a pre-trained model to learn new features from photos taken in a different data acquisition setup is checked.
A transfer learning experiment is proposed by using the vague (the smallest and most difficult) subset to mimic unseen data, and assigning the (much larger) bright+dark subsets as the primary dataset used to initially train the model. In the first approach, the Cascade R-CNN with HRNet model pre-trained on the primary data is taken, and then additional training of the whole network using solely the vague samples is performed. This leads to a new model which (after just a few epochs) perfectly recognizes microbes on the vague subset; however, it loses its performance on the primary dataset.
To overcome this, i.e. to make the model learn new features while keeping its ability to recognize the primary ones, one can update only the last few layers of the neural network, keeping the rest frozen 22 . The results of this technique are presented in Table 8. With an increasing number of unfrozen layers, the counting metrics (MAE and sMAPE) significantly improve for the vague subset, but get slightly worse for the primary data. Unfreezing more than 128 layers does not further improve the performance on the vague photos, but makes the model less capable of recognizing the samples from the bright and dark subsets. Table 8. Transfer learning for the model trained on the bright+dark subsets to detect microbes on the vague one, by updating only part of the last layers, i.e. with all other weights frozen. With an increasing number of unfrozen layers, the counting metrics for the vague subset improve, while those for the bright+dark one deteriorate.
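A sketch of the freezing scheme in PyTorch; here a "layer" is taken to mean any module that directly holds parameters, which is our assumption, since the paper does not define its layer granularity:

```python
import torch.nn as nn

def unfreeze_last(model, n_layers):
    """Freeze all parameters, then re-enable gradients only for the last
    n_layers parameter-carrying modules, so that training updates just
    the tail of the network while earlier features stay fixed."""
    for p in model.parameters():
        p.requires_grad = False
    # modules that directly own parameters, in registration order
    layers = [m for m in model.modules()
              if any(True for _ in m.parameters(recurse=False))]
    for m in layers[-n_layers:]:
        for p in m.parameters(recurse=False):
            p.requires_grad = True
```

The optimizer is then built only from parameters with `requires_grad=True`, so frozen weights receive no updates during the adaptation epochs.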

Training on the whole dataset at once
Above, the results of the transfer-learned version of Cascade R-CNN with HRNet on the AGAR dataset were reported. The conducted studies showed that using such transfer learning methods to overcome deficiencies introduced to our dataset (e.g. by removing the vague part) might not be sufficient to prepare a model capable of general predictions. To improve the model's ability to generalize, the model was trained on the entire dataset (the combined training subsets of lower- and higher-resolution images). Metrics evaluated for this case (see Table 9) show that both the Cascade R-CNN and Faster R-CNN models have a great ability to capture different datasets at once. This was verified by testing such a generalized model on the test data for each subset separately, as presented in Table 9. Let us compare the best result for the transfer technique presented in Subsection 4.7 (Table 8), i.e. sMAPE of 11.38% (bright+dark) and 5.34% (vague), with the results from Table 9 for Cascade R-CNN (HRNet), giving about 4.98% for the bright+dark and 1.96% for the vague subset. The models trained on all combined subsets at once performed better than those obtained using transfer learning as presented in Subsection 4.7. However, the transfer learning technique makes it possible to quickly adapt a model to a new subset without having access to the primary data the model was trained on. Table 9. Results for the Faster and Cascade R-CNN models with ResNet-50 and HRNet backbones trained on the whole dataset (higher- and lower-resolution) and tested on each subset separately. The evaluated metrics show that both models have a great ability to capture various datasets simultaneously.