Computational metadata generation methods for biological specimen image collections

Metadata is a key data source for researchers seeking to apply machine learning (ML) to the vast collections of digitized biological specimens that can be found online. Unfortunately, the associated metadata is often sparse and, at times, erroneous. This paper extends previous research conducted with the Illinois Natural History Survey (INHS) collection (7244 specimen images) that uses computational approaches to analyze image quality and then automatically generates 22 metadata properties representing the image quality and morphological features of the specimens. In the research reported here, we demonstrate the extension of our initial work to the University of Wisconsin Zoological Museum (UWZM) collection (4155 specimen images). Further, we enhance our computational methods in four ways: (1) augmenting the training set, (2) applying contrast enhancement, (3) upscaling small objects, and (4) refining our processing logic. Together these new methods improved our overall error rate from 4.6% to 1.1%. These enhancements also allowed us to compute an additional set of 17 image-based metadata properties, which provide supplemental features and information that may also be used to analyze and classify the fish specimens. Examples of these new features include convex area, eccentricity, perimeter, and skew. The newly refined process further outperforms humans in terms of time and labor cost, as well as accuracy, providing a novel solution for leveraging digitized specimens with ML. This research demonstrates the ability of computational methods to enhance the digital library services associated with the tens of thousands of digitized specimens stored in open-access repositories worldwide by generating accurate and valuable metadata for those repositories.


Introduction
Advances in computing, imaging, and cyberinfrastructure, along with the growth of digital libraries and repositories, have allowed many natural history institutions to digitize their image specimen collections [1]. The National Science Foundation's Advancing Digitization of Biodiversity Collections (ADBC) program is one exemplary program supporting the digitization and curation of hundreds of thousands of biological specimens [2]. Digital collections provide researchers, educators, students, and the general public with the capacity to study biological specimens on a scale that was previously unattainable. In addition, the availability of digitized specimen images allows for the application of machine learning (ML), which should lead to new scientific discoveries.
Although there is increased interest in applying ML to digitized specimen images, researchers have found that the potential scientific advances are, unfortunately, hindered by poor image quality [3]. Poor quality images (e.g. with low contrast, inadequate lighting, or out-of-focus or cluttered visual arrangements) are unsuitable for automated image analysis by ML algorithms and lead to inferior computational results. Image quality problems associated with digitized specimens are further compounded by poor quality metadata or even the lack of pertinent metadata. Many natural history collections use the Darwin Core (DwC) metadata standard, which includes a core set of 19 descriptive properties [4]. Metadata for digital images is frequently created manually by technical staff or students and is subject to human error. Additionally, although richer metadata extensions exist and curators may provide more extensive morphological metadata, it is too costly to acquire manually.
Despite existing image quality and metadata limitations, the extensive availability of digitized specimen collections still offers new opportunities for scientific study. These challenges have motivated members of Drexel University's Metadata Research Center, together with Tulane University's Biodiversity Research Institute, to explore computational approaches for analyzing fish image quality and extracting specimen metadata. A key impetus has been the engagement of both teams in the NSF Biology Guided Neural Networks (BGNN) project, which is developing a novel class of artificial neural networks that aims to exploit machine readable, predictive knowledge associated with specimen images, phylogenies, and anatomical ontologies. Initial research successfully demonstrated computational approaches for creating image quality metadata [5] and, further, showed that by combining ML and image informatics, researchers can automatically determine image quality and metadata, such as fish quantity, location and orientation, and image scaling based on ruler identification and measurement [6].
The research reported in this paper extends the methods reported in Pepper et al. [6], with the aim to increase the accuracy and scope of the generated metadata. Another key goal is to demonstrate that our approach is extensible to other specimen image collections, beyond the Illinois Natural History Survey (INHS) collection analyzed in our first study. Previously, object detection was performed with five detection classes (fish, fish eyes, rulers, and the twos and threes found on rulers) on 7244 INHS images. We have augmented this dataset to include 4155 images from the University of Wisconsin Zoological Museum (UWZM) collection [7].
Additionally, we have trimmed the INHS dataset to 7013 images, after moving some images to the training set and excluding certain others, resulting in a combined test set of 11,168 images. Figure 1 presents typical images used in our study from each collection.
The enhancement of our computational methods has produced improved automatic metadata generation results. These enhancements include augmentation of the training set, application of contrast enhancement, upscaling of small objects, and refinement of our processing logic. Together these new methods improved our overall error rate from 4.6% to 1.1%. Procedures for computing additional image-based metadata properties have also been implemented. These new properties provide supplemental features and information that may also be used to analyze and classify the fish specimens. Examples of these new features include convex_area, eccentricity, perimeter, and skew.
The rest of the paper is organized as follows. Section 2 describes relevant previous work in metadata for image specimen collections, metadata generation, and fish image analysis. Section 3 outlines the goals and objectives of our research. Section 4 describes our computational methods in detail. Section 5 presents the results of our computational experiments on two fish image collections, with Section 6 providing a discussion of our results. Section 7 concludes with comments on possible future extensions of the current work.

Metadata standards and approaches for natural history digital image specimen collections
Metadata used in digital image specimen collections supports resource description and access. The Darwin Core (DwC) [8] and the Audubon Core [9] are two of the most popular metadata standards applied to digital specimen images. Curators often use content value standards, such as taxonomies, geographic codes, and other ontologies, when working with a descriptive metadata standard. Although these standards are digitally accessible, metadata creation is still primarily a manual, labor-intensive task and prone to human error. Moreover, image quality metadata is generally absent. These limitations have become increasingly prevalent as researchers seek to automatically leverage metadata and digitized specimen images for scientific research, which is an aim of the BGNN initiative. The biodiversity community has acknowledged this challenge and advocated for data fitness standards [10]. This point is also emphasized by Wieczorek et al. [11] in their report on the variety of DwC metadata extensions needed to meet growing community concerns and requirements, including data quality and fitness. The point is addressed in detail by Leipzig et al. [5], drawing from Tulane University's manual curation of 22 metadata properties that characterize digitized specimen image quality, and further motivates the research reported in this paper.

Automatic metadata generation
Advances in automatic metadata generation, covering both descriptive and technical metadata, are relevant to the research presented in this paper. Automatic metadata generation of descriptive bibliographic data has been a research focus for close to twenty years [12-15]. Researchers have applied support vector machine (SVM) approaches [16] and associated networks to address sparse and incomplete metadata [17], and various successes have been integrated into day-to-day workflows. Heidorn et al. [18] demonstrated the use of optical character recognition (OCR) to extract specimen information from the original typed and often hand-annotated labels that are digitized along with herbarium collection holdings. The extracted information was encoded in the DwC metadata associated with the specimen's digitized rendering. There has also been some success with extracting descriptive cartographic information from maps [19]. While there have been advances in automatic metadata generation, the application of these methods to specimen images is still limited. In an effort to address this limitation, image informatics offers an opportunity to advance metadata generation approaches by validating existing metadata and by creating additional metadata that was previously not recorded by curatorial staff.

Fish image analysis
Image analysis has been utilized to examine and process images of fish for well over two decades [20,21]. It is an important application of technology for marine science, for the seafood industry, in the study of aquatic species, habitats, and ecosystems, in the development of automated fish sorting and grading systems, as well as for fisheries management. Many of these computational analyses focus on the recognition and classification of the fish present in an image. The computational methods employed for fish image analysis have followed the general trends in the AI field. Hu et al. [22] presented a method of classifying species of fish based on color and texture features and a multi-class support vector machine (MSVM) [23]. Li and Hong [24] computed eleven shape and color features from fish images and derived a linear model that could discriminate between four different fishes. Rodrigues et al. [25] explored several combinations of feature extractions, input classifiers, and clustering algorithms to produce a method that could distinguish between 10 different types of fish with 92% accuracy. Hernández-Serna and Jiménez-Segura [26] performed image preprocessing and extracted geometric features that were fed into an artificial neural network (ANN) to predict the species of fish and other biological specimens with an accuracy around 90%. Salman et al. [27] employed a deep Convolutional Neural Network (CNN) [28] together with classification based on K-Nearest Neighbors and Support Vector Machines trained on the features extracted by the CNN. They achieved 90% accuracy when identifying 15 different fish species in challenging underwater digital images. Utilizing texture, anchor points, and statistical measurements, Alsmadi et al. [29] implemented fish classification through a meta-heuristic algorithm known as the Memetic Algorithm. They were able to classify 24 fish families with 90% accuracy. Iqbal et al. [30] used a modified AlexNet [31] model to classify six different fish species with 90% accuracy. Yu et al. [32] segmented fish images and measured fish morphological features using Mask R-CNN. Petrellis [33] employed image processing and deep learning to calculate a small set of geometric features from in-the-wild images of fish. Hao et al. [34] provide an excellent review of fish measurement efforts that utilize machine vision.
While our efforts are motivated by and build upon this previous work, our approach stands apart in that it generates a wider range of geometric and image features than previous methods in order to enrich the metadata for fish image collections. Most previous work examined and analyzed images of fish in the wild. Working with the structured images of museum collections allows us to compute a wider variety of metadata with higher accuracy.

Goals and objectives
Digitized specimens accessible in open-access repositories provide a rich, extensive data source for ML and scientific discovery. These resources, however, remain largely untapped due to image quality issues and metadata limitations. Image informatics tools and techniques offer an opportunity to address these limitations and advance the state of metadata associated with digital image specimen collections. Moreover, accelerating metadata generation may facilitate further scientific study in morphology and related areas. The overall goal of our work is to develop advanced metadata generation approaches, building on the work previously reported in [6]. The specific aims of the research presented in this paper are to:

• demonstrate that our approach is extensible to specimen image collections beyond the INHS collection analyzed in our first study;
• increase the accuracy of the generated metadata through a set of error reduction techniques; and
• expand the scope of the generated metadata with additional image-based properties.

Methods
Our initial process for metadata generation can be divided into three steps: object detection with Facebook's Detectron2 ML library [35], image processing at the pixel level, and calculations on the results of the previous steps to determine higher level metadata properties. We have extended the computational process with optimizations in metadata generation by replacing self-implemented code with more library calls, as well as further modularization that supports GPU parallelization. Along with additional geometric computations and error reduction techniques, our current metadata generation process has been expanded to:

• Apply contrast enhancement and equalization on the training and test sets.
• Train a model and perform object detection with Facebook's Detectron2 ML library (referred to as detectron).
• Select the fish of highest confidence in case of multiple fish detections.
• Upscale and rerun if a fish is detected without an eye.
• Adjust the fish mask with pixel-level image processing.
• Perform specific geometric and statistical computations with skimage and scipy.
• Utilize the results of the previous steps to determine higher level metadata properties.

Refined fish collection criteria
Our automated metadata generation methods were developed for a specific subset of images from both the INHS and UWZM Fish Collections. Our algorithms are based on assumptions about the content and structure of the specimen images. Criteria were specified that define the properties of acceptable images for analysis. The images used in our study were evaluated to ensure that they meet these requirements.
The criteria used to select the study images are:

• Must contain a fish (no eels, seashells, butterflies, seahorses, snakes, etc.).
• Must contain only one of each class (except eyes).
• Specimen body must lie in-plane in a side view.
• Ruler must be consistent and one of two types.
• Fish must not be obscured by another object.
• Whole body of fish must be present (no heads, tails, or standalone features).
• Fish body must not be folded or have extreme curvature.
Applying these criteria, 216 images from the previous subset of 7244 INHS images were removed from the testing set. Additionally, 15 images were moved from the test set to the training set, resulting in a testing subset of 7013 images. While the original UWZM collection contained 4602 images, after removing 79 images for the training set, 368 images were filtered out based on the criteria, yielding a testing subset of 4155 images.

Object detection
A prerequisite task to performing metadata property generation is finding the specimens (and other relevant objects) within the collection images. Object detection has been a broadly active field of study in recent years and has resulted in a number of well-tested, purpose-built architectures. We elected to use Facebook AI Research's (FAIR) detectron tool [35] (specifically its implementation of the Mask R-CNN architecture) for object detection in our work, given its many flexible and robust capabilities. Most importantly, following a review of the literature and available tools, we determined that there were no other machine learning packages that returned pixel-by-pixel masks over detected objects in a comparable fashion. detectron is built on pytorch [36] and provides a relatively straightforward method for training on COCO [37] format datasets. It is able to handle any number of object classes and can classify an arbitrary number of objects within a given image. We chose detectron for its relative ease of use compared to lower level libraries, and for its implementation of powerful architectures developed by FAIR. We use it to identify five object classes: fish, fish eyes, rulers, and the numbers two and three on rulers, as shown in Fig. 2. Objects with a 30% confidence score or higher are maintained for analysis.
Table 1 lists the number of instances for each class used in our aggregate training dataset. Tables 2 and 3 show the training set contributions from the INHS and UWZM datasets, respectively. 500 rulers used in our previous study were removed from the INHS dataset due to an oversight: these rulers were originally part of the J.F. Bell Museum of Natural History (JFBM) dataset [38] and hence were not relevant to our current objective. All of the training data was labeled by hand using makesense.ai [39] on images from the INHS [40] and UWZM [7] Fish Collections. Using detectron's default training scheme, the model was trained for 15,000 epochs. Testing has shown that this default number of epochs provides optimal object detection results. All instance types were included in a single object detection model; in other words, the model is akin to a one-vs-all detector, with all five classes being detected by the same model.
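For illustration, a minimal detectron2 inference setup along these lines might look as follows. The backbone configuration file, weight path, and image path are assumptions made for this sketch; only the five-class setup and the 30% score threshold come from the description above.

from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor
import cv2

# Configure a Mask R-CNN model for the five classes used in this work
cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))  # assumed backbone
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 5            # fish, eye, ruler, "2", "3"
cfg.MODEL.WEIGHTS = "model_final.pth"          # hypothetical path to trained weights
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.3    # keep detections at >= 30% confidence

predictor = DefaultPredictor(cfg)
outputs = predictor(cv2.imread("specimen.jpg"))  # hypothetical image path
instances = outputs["instances"].to("cpu")
# instances.pred_classes, .pred_boxes, .pred_masks, and .scores hold the detections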

Error reduction techniques
Four enhancements were implemented and applied to the combined datasets in order to reduce the error rates that we experienced in our initial study. These enhancements include augmenting the training set, applying contrast enhancement, selecting the fish of highest confidence, and image (up)scaling.

Augmented training set
Initially, we had 64 examples of each class from the UWZM collection in the training set. One issue that we encountered was the lack of catfish (genus Noturus) in the training set, which led to a high count of undetected eyes in the testing set. Visually, it is difficult even for humans to determine the location of catfish eyes, given that they are either very close to the color of the skin or do not look like normal fish eyes. Thus, 15 catfish images from each image dataset were added to the training set. Figure 3 presents examples of images in which the catfish eyes are difficult to detect.

Fig. 3 A catfish where the eye is not easily detected (left) and a catfish with an eye that does not look like a normal eye (right)

Contrast enhancement
It is apparent that there are differences in lighting, saturation, and contrast within and across specimen image collections. This causes the detection model to miss the ruler, numbers, fish, or eye in images that are either too washed out or too dark, errors which have been seen in other object detection applications [41]. After investigating various image processing techniques to equalize the color, we found that current research in our area utilizes Contrast Limited Adaptive Histogram Equalization (CLAHE) [42,43]. We applied this technique to all our images using the Python image processing library OpenCV [44]. CLAHE is frequently used in applications like underwater photography, traffic control, astronomy, and medical imaging [45,46] to improve image quality.
The drawback of standard Histogram Equalization is that the equalization of the contrast is performed on a global level, which is not ideal given possible varying contrast ranges within an image. Adaptive Histogram Equalization addresses this issue by computing several histograms, each corresponding to a distinct section of the image, and using them to redistribute the lightness values of the image. This, however, also has issues, in that it may oversharpen contrast values that are already high, as well as yield noise in relatively homogeneous regions of an image. CLAHE, though, does not sharpen values higher than a given contrast threshold, thereby eliminating the issues of oversharpening and noise [42].
CLAHE should not be applied directly to RGB (red, green, blue) images. Applying CLAHE in color spaces like RGB and CMYK (cyan, magenta, yellow, key) will yield a different color distribution for each color channel. Instead of applying CLAHE separately to the R, G, and B channels of a color image, a better approach applies the algorithm only to the luminance channel, which also prevents unwanted hue and saturation changes. This, however, requires the source image to be converted to a different color space, e.g. HSV (hue, saturation, value) or CIELab (lightness, red/green, blue/yellow), first. Contrast enhancement in 3-D color spaces that make use of luminance does not produce noisier images, unlike processing in more common color spaces, thus ensuring color uniformity [47,48]. We have utilized the CIELab space, since visual testing showed that the fish, rulers, and eyes were more pronounced than when processed in HSV space. detectron also yielded slightly better eye detection rates with contrast enhancement in CIELab space than in HSV space. Figure 4 shows an image before and after contrast enhancement.
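A minimal sketch of this luminance-only CLAHE step with OpenCV follows; the clip limit and tile grid size are illustrative assumptions, as the text does not specify the parameters used.

import cv2

def enhance_contrast(bgr_image, clip_limit=2.0, tile_grid=(8, 8)):
    # Convert to CIELab so that CLAHE touches only the lightness channel
    lab = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=tile_grid)
    l_eq = clahe.apply(l)  # equalize the L channel; hue/saturation are untouched
    return cv2.cvtColor(cv2.merge((l_eq, a, b)), cv2.COLOR_LAB2BGR)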

Picking fish of highest confidence
One of the issues with the previous metadata results was the detection of multiple fish in a single image, which was categorized as an error in our prior work. This "error" could occur with overlapping detection boxes over the same fish or through the erroneous detection of another random object in the image. We inspected the cases where multiple fish were detected. In all cases, the fish bounding box with the highest confidence value was the one that best covered the fish specimen present in the image. Additionally, there were never instances in which the fish bounding box of highest confidence was not actually a fish [6]. Since our study image collection only contains images with a single specimen, when detectron returns more than one detected fish, we select the fish of highest confidence value, thus eliminating the previous multiple fish error. Figure 5 shows an example in which a fish was detected multiple times; the bounding box with the highest confidence score provides the expected result.
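Applied to detectron's output, this selection rule reduces to a few lines. The fish class index below is an assumption that depends on the label order used in training.

def select_best_fish(instances, fish_class_id=0):
    # instances: a detectron2 Instances object; fish_class_id is assumed
    fish = instances[instances.pred_classes == fish_class_id]
    if len(fish) == 0:
        return None                          # no fish detected at all
    return fish[int(fish.scores.argmax())]   # keep only the highest-confidence fish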

Image scaling
The images in which an eye was not detected made up the majority of the erroneous cases. This led to a decision to rerun the model on images where a fish was detected, but the eye was not. It was hypothesized that the eyes were too small to be detected in these cases. To address these errors, the fish bounding box was cropped into a separate image, which was then upscaled by a factor of 4×, and the model was rerun on the scaled image. This helped to detect more eyes once they were scaled to a larger size.
If the eye is not detected even in the scaled image, it is counted as a missed eye. In the event an eye is detected on the scaled fish, however, the eye coordinate within the unscaled image needs to be returned. This requires taking the top left corner of the fish bounding box and adding the scaled-down eye coordinate, as described in Eq. 1:

$$x_{\text{eye\_original}} = x_{\text{fish\_top\_left}} + \frac{x_{\text{eye\_scaled}}}{4}, \qquad y_{\text{eye\_original}} = y_{\text{fish\_top\_left}} + \frac{y_{\text{eye\_scaled}}}{4} \tag{1}$$

The simplest method for pixel interpolation during upscaling is the Nearest Neighbor algorithm, in which the output pixel value is set to the nearest pixel's value [49]. Linear Interpolation estimates the appropriate pixel intensity values by finding the distance-weighted average of the four nearest pixels around the output pixel [50]. Bicubic Interpolation determines the pixel value from the weighted average of the 16 closest neighboring pixels utilizing a third-degree interpolant function [51].
Nearest Neighbor Interpolation was initially attempted, with little effect on finding missing eyes. Linear Interpolation yielded significantly better results, while Bicubic Interpolation yielded the best results. Further research uncovered more complex methods like Lanczos4 Interpolation [52] and deep learning models like EDSR (Enhanced Deep Super-Resolution Networks) [53]. Testing demonstrated that Bicubic Interpolation yields slightly better accuracy than Lanczos4 and EDSR, although Lanczos4 and EDSR provided slightly better masking. Figure 6 presents the different scaling procedures on an image in which the fish was detected, but the eye initially was not.
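A sketch of the crop-upscale-rerun step, assuming a detectron2-style predictor; the function name and signature are hypothetical, and only the 4× factor and bicubic choice come from the text above.

import cv2

def upscale_and_redetect(image, fish_box, predictor, factor=4):
    # Crop the detected fish, upscale it, and rerun the detector to find the eye
    x0, y0, x1, y1 = [int(v) for v in fish_box]
    crop = image[y0:y1, x0:x1]
    # Bicubic interpolation gave the best eye detection rates in our testing
    scaled = cv2.resize(crop, None, fx=factor, fy=factor,
                        interpolation=cv2.INTER_CUBIC)
    outputs = predictor(scaled)
    # Map a detected eye back to original coordinates via Eq. 1:
    # original = fish_top_left + scaled_coordinate / factor
    return outputs, (x0, y0)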

Pixel analysis
The masks and bounding boxes produced by detectron are generally quite good, although they almost never completely or tightly enclose the detected objects. The mask may include additional background as part of the fish, or the bounding box may clip away part(s) of the specimen. To address these shortcomings, we utilize pixel analysis methods commonly found in image informatics to produce more accurate object masks and bounding boxes [6].

Threshold adjustment
The first calculation in the pixel analysis process determines the cutoff intensity between what constitutes the foreground (i.e. the fish) and the background of the image. Initially, the calculation is based on the bounding box and mask generated by detectron. Specimen images are read in as grayscale, and pixels in the image are treated as unsigned integers between 0 and 255. Otsu thresholding [54], a technique that maximizes the variance between the foreground and background intensities, is used to compute an initial cutoff value between foreground and background. While the Otsu value occasionally generates an accurate mask as is, usually the contrast between foreground and background is low and much of the lighter parts of the fish (such as its tail fin) are marked as background.
To overcome this improper segmentation, the threshold value should be adjusted either up or down, depending on whether the background is lighter or darker than the fish. For our current dataset, the background is always lighter (i.e. closer to 255), so the threshold value needs to be scaled up to include more of the foreground image. For optimal results, the scaling should be dependent on the contrast between the background and foreground, which can be affected by the level of pigmentation of the fish.
We found that an improved threshold value can be computed as the halfway point between the Otsu threshold value and the mean of the background intensities. This adjusted threshold value usually produces an acceptable balance between capturing most of the fish's fins and not masking parts of the background [6].
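A minimal sketch of this adjustment, assuming a grayscale numpy image and a boolean detectron mask (background taken as the pixels outside that mask):

from skimage.filters import threshold_otsu

def adjusted_threshold(gray, detectron_mask):
    # Cutoff halfway between the Otsu value and the mean background intensity
    otsu = threshold_otsu(gray)
    background_mean = gray[~detectron_mask].mean()  # background = non-fish pixels
    return (float(otsu) + float(background_mean)) / 2.0

# The background is lighter here, so pixels below the cutoff are foreground:
# foreground = gray < adjusted_threshold(gray, mask)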

Consolidating the foreground
While thresholding has the potential to generate better masks than a neural network (when provided an initial approximate bounding box), it also introduces considerable noise. Single pixels or small groups of errant pixels can be marked as foreground depending on the consistency of the background, and interior pixels of the fish (especially around the fins) can be marked as background. To be useful for generating an accurate bounding box and for subsequent computational analysis, the mask must consist of one single "blob" over the fish, i.e. containing no holes, and no other pixels disconnected from this blob can be marked as foreground.
To accomplish this, we apply an iterative process of flood filling from the foreground pixels in the image until a blob is generated that is large enough to constitute the fish. This introduces another metaparameter: in all observed cases, a blob covering more than 10% of the current bounding box has successfully masked the specimen. Once the fish's blob is found, internal noise then needs to be removed. This is done by flood filling from each of the corners of the bounding box where the specimen is not present (all four corners in the overwhelming majority of cases), then taking the inverse of the result. The fish mask is excluded from these corner flood fills, so this process removes all noise from both the background and foreground of the image, leaving only a single mask over the fish itself [6].
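An equivalent consolidation can be sketched with connected-component labeling and hole filling from scipy, in place of the explicit flood fills described above; the 10% metaparameter is the one from the text, and the rest is an assumption of this sketch.

import numpy as np
from scipy import ndimage

def consolidate_foreground(binary, min_fraction=0.10):
    # binary: thresholded foreground within the current bounding box
    labels, count = ndimage.label(binary)          # connected foreground components
    if count == 0:
        return None
    sizes = ndimage.sum(binary, labels, range(1, count + 1))
    largest = int(np.argmax(sizes)) + 1
    if sizes[largest - 1] <= min_fraction * binary.size:
        return None                                # too small: fall back to detectron mask
    blob = labels == largest
    return ndimage.binary_fill_holes(blob)         # remove interior noise around the fins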

Adjusting the bounding box
With a more accurate mask generated, it is then necessary to check whether the bounding box needs to be expanded or shrunk along any of its edges. Expansion is done first, by checking whether any edge intersects any of the foreground mask pixels. If one does, that edge is expanded out by 1 pixel. If any edges are expanded, the whole process of masking and expansion is repeated to account for any changes in average intensities. Once no edges contain foreground pixels, the bounding box is then shrunk: each edge is contracted by one pixel until it contains one or more foreground pixels. Once the shrinkage step is accomplished, the final mask and bounding box have been generated [6].
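The shrinkage step alone reduces to four edge-contraction loops. This simplified sketch omits the expansion/re-masking iteration described above and assumes an integer (x0, y0, x1, y1) box and a boolean mask.

def shrink_bbox_to_mask(mask, bbox):
    # Contract each edge inward until it touches a foreground pixel
    x0, y0, x1, y1 = bbox
    while x0 < x1 and not mask[y0:y1 + 1, x0].any():
        x0 += 1
    while x1 > x0 and not mask[y0:y1 + 1, x1].any():
        x1 -= 1
    while y0 < y1 and not mask[y0, x0:x1 + 1].any():
        y0 += 1
    while y1 > y0 and not mask[y1, x0:x1 + 1].any():
        y1 -= 1
    return x0, y0, x1, y1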

Fallback
The pixel analysis process occasionally fails, e.g. when flood filling does not produce a large enough blob or the bounding box adjustment does not terminate. This can occur if certain flood fill operations behave unexpectedly, or if the image is too washed out or otherwise atypical for the thresholding process to work correctly. In the event this happens, the original mask and bounding box generated by detectron are used for metadata generation [6].

Fig. 6 Lanczos4 (left), Bicubic (middle), and Linear (right) pixel interpolations

Metadata generation
Table 4 lists the metadata properties that were generated in our previous work. These include properties that are produced directly by detectron and the methods described above: has_fish, fish_count, has_ruler, ruler_bbox, background.mean, background.std, foreground.mean, foreground.std, bbox, score, and has_eye. The methods used to compute derived metadata properties are described below. The new properties that we are now able to generate are listed in Table 5.

Contrast
The contrast between the intensities of the foreground and background pixels is computed as background.mean - foreground.mean [6].

Centroid and eye_center
Centroids are provided for the masks and bounding boxes that are generated by detectron, and since we do not recalculate the mask of fish eyes, we can use that value directly for eye_center. Since we recalculate the mask of the fish, its centroid must be recalculated as well. This can be done through Eq. 2:

$$\left(x_{\text{centroid}},\; y_{\text{centroid}}\right) = \left(\frac{M_{10}}{M_{00}},\; \frac{M_{01}}{M_{00}}\right) \tag{2}$$

where $M_{00}$ is the pixel area of the fish's blob, $M_{10}$ is the sum of all the x values of blob pixels, and $M_{01}$ is the sum of all the y values of blob pixels [6].
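In numpy terms, Eq. 2 amounts to the following (the helper name is ours):

import numpy as np

def mask_centroid(mask):
    # mask: boolean array, True over the fish blob
    ys, xs = np.nonzero(mask)
    m00 = xs.size                            # pixel area of the blob (M00)
    return xs.sum() / m00, ys.sum() / m00    # (M10 / M00, M01 / M00)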

Side
Determining which side of the fish is visible is predicated on finding its eye. If an eye is found, the sign of the x component of the vector from the centroid of the fish to the centroid of the eye specifies which side is visible: negative for left and positive for right. This assumes the fish was photographed vertically (i.e. dorsal fin on top), which is essentially always the case for all image collections our group has worked on [6].
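As a sketch, with hypothetical (x, y) centroid tuples:

def visible_side(fish_centroid, eye_centroid):
    # Sign of the x component of the fish-to-eye vector: negative means left
    return "left" if eye_centroid[0] - fish_centroid[0] < 0 else "right"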

Primary_axis and clock_value
The primary_axis of a fish can be calculated by taking the covariance matrix of its blob's x and y pixel coordinates and computing the matrix's principal eigenvector. The eigenvector can be directly assigned to the property. If an eye is present, we ensure that primary_axis points in the direction of the eye relative to the fish's centroid.
Our team encoded this information as a "clock value" between 1 and 12 when manually recording it. To convert primary_axis to clock_value, the signs of x and y on the principal axis are used to determine which Cartesian quadrant the fish angles into relative to its centroid. Depending on the quadrant, we take the dot product of the principal axis with either [−1, 0], [0, −1], [1, 0] or [0, 1], which correspond to 9, 6, 3 and "0" o'clock respectively. The radian value recovered from this dot product (via its arccosine) is then converted to a displacement in clock value space and added to the comparative clock value used in the dot product. This gives the fish's clock value from 0 to 11.9. Before recording clock_value in the output, the value is rounded to the nearest integer, with a final result of 0 replaced with 12 [6].
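A compact equivalent uses atan2 in place of the quadrant-by-quadrant dot products; the sketch below is our substitution, not the exact procedure described above.

import numpy as np

def primary_axis_and_clock(mask, eye_center=None):
    ys, xs = np.nonzero(mask)
    coords = np.stack([xs, ys]).astype(float)
    eigvals, eigvecs = np.linalg.eigh(np.cov(coords))
    axis = eigvecs[:, np.argmax(eigvals)]        # principal eigenvector
    if eye_center is not None:
        centroid = coords.mean(axis=1)
        if np.dot(axis, np.asarray(eye_center) - centroid) < 0:
            axis = -axis                         # point the axis toward the eye
    # Angle measured clockwise from 12 o'clock; image y grows downward,
    # hence the sign flip on the y component
    angle = np.arctan2(axis[0], -axis[1])
    clock = (angle % (2 * np.pi)) / (2 * np.pi) * 12
    clock = int(round(float(clock))) % 12
    return axis, clock or 12                     # report 0 as 12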

Scale and length
The image scale of INHS images, measured in pixels/inch, can be calculated by measuring the distance in pixels between the digits 2 and 3 (a 1 inch separation) found on the ruler by detectron. Converting this to pixels/cm gives the scale metadata property as reported in the output. The UWZM images, in contrast, use a metric ruler in centimeters, and as such, the distance between the digits "2" and "3" directly measures pixels/cm. For the fish length property, it is necessary to determine the furthest points from the centroid of the fish in each direction along its major axis. Since fish are normally measured in a straight line from their snout down the middle of their trunk, every pixel of the fish blob is projected onto the fish's major axis (as a line through its centroid). The projection is done through numpy [55] by performing Principal Component Analysis (PCA). The first step of this process includes finding all mask pixels, computing the covariance matrix, computing the eigenvalues and eigenvectors of the matrix, and then computing the angle of rotation from the X axis. The second part includes applying the negative rotation to the mask pixel coordinates, which aligns the fish's major axis with the X axis, and then computing the difference between the highest and lowest x values. Dividing this distance by scale gives the fish length in centimeters [6]. A similar process is done for the fish width, where the difference between the highest and lowest y values is computed from the transformed pixels.
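A numpy sketch of this PCA-based measurement, assuming a boolean mask and a scale value in pixels/cm (the helper name is ours):

import numpy as np

def fish_length_and_width_cm(mask, scale_px_per_cm):
    ys, xs = np.nonzero(mask)
    coords = np.stack([xs, ys]).astype(float)
    coords -= coords.mean(axis=1, keepdims=True)
    eigvals, eigvecs = np.linalg.eigh(np.cov(coords))
    major = eigvecs[:, np.argmax(eigvals)]
    angle = np.arctan2(major[1], major[0])          # rotation of major axis from X
    c, s = np.cos(-angle), np.sin(-angle)
    rotated = np.array([[c, -s], [s, c]]) @ coords  # align major axis with X
    length_px = rotated[0].max() - rotated[0].min()
    width_px = rotated[1].max() - rotated[1].min()  # same idea gives the width
    return length_px / scale_px_per_cm, width_px / scale_px_per_cm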

Contiguous distances
Two additional computed properties are cont_length and cont_width. These are computed using the transformed mask pixels from above, with slight modifications. Through numpy, we examine the counts of the x and y values of the pixels and identify the x and y values with the highest counts. This process identifies the x and y values with the longest contiguous strips parallel to the major and minor axes. The length and width calculations are then computed as the difference between the max and min of the x and y values, respectively, within these bins.
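Assuming one-pixel-wide bins, the computation can be sketched as follows on the rotated coordinates from the length calculation above:

import numpy as np

def contiguous_extents(rotated):
    # rotated: 2 x N array of mask pixel coords aligned to the major axis
    xs = np.round(rotated[0]).astype(int)
    ys = np.round(rotated[1]).astype(int)
    # The row (y value) holding the most pixels marks the longest strip
    # parallel to the major axis; its x extent gives cont_length
    y_mode = np.bincount(ys - ys.min()).argmax() + ys.min()
    cont_length = xs[ys == y_mode].max() - xs[ys == y_mode].min()
    # Likewise, the densest column gives cont_width along the minor axis
    x_mode = np.bincount(xs - xs.min()).argmax() + xs.min()
    cont_width = ys[xs == x_mode].max() - ys[xs == x_mode].min()
    return cont_length, cont_width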

Region properties
One of the goals of the updated metadata generation process is to provide additional geometric properties based on the morphology of the fish. Features like perimeter, area, and eccentricity were immediately deemed most useful to the BGNN project use case, whereas further research is needed to determine other meaningful geometric properties. The Python image processing library skimage [56] contains the measure package, which computes various geometric properties of the image. We made use of one of the provided functions, regionprops, which provides the aforementioned properties as well as feret_diameter_max, major_axis_length, minor_axis_length, solidity, and convex_area. Other properties, like euler_number and perimeter_crofton, are offered by this function but were deemed unnecessary for our work.
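The call itself is short; a sketch assuming a single-blob boolean mask (the wrapper name is ours, the property names are those listed above):

from skimage.measure import label, regionprops

def region_properties(mask):
    # regionprops expects an integer-labeled image; a single blob is expected here
    props = regionprops(label(mask.astype(int)))[0]
    return {name: props[name] for name in (
        "perimeter", "area", "eccentricity", "feret_diameter_max",
        "major_axis_length", "minor_axis_length", "solidity", "convex_area")}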

Statistical properties of the mask coordinates
The statistical distribution properties of the mask pixel coordinates can be calculated with the Python scientific computing library scipy [57]. The stddev, skew, and kurtosis were calculated on the x and y coordinate distributions and recorded in the metadata. These values can be used as distinguishing features of a fish's shape.
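For example (the helper name and return layout are ours):

import numpy as np
from scipy import stats

def coordinate_statistics(mask):
    # Distribution statistics of the mask pixel coordinates along each axis
    ys, xs = np.nonzero(mask)
    return {
        "stddev": (np.std(xs), np.std(ys)),
        "skew": (stats.skew(xs), stats.skew(ys)),
        "kurtosis": (stats.kurtosis(xs), stats.kurtosis(ys)),
    }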

Mask encoding
Another feature which may be useful for studying the morphology of fish is the outline of the specimen. A concise and efficient method for capturing the outlining boundary of an object is Freeman Chain Encoding [58]. In general, a chain code is a lossless compression algorithm for monochrome images that separately encodes the boundary of each connected component ("blob") in an image. For each such region, a point on the boundary is selected and its coordinates are noted. The encoder then moves along the boundary of the region and, at each step, stores a symbol representing the direction of the movement. This procedure continues until the encoder returns to the starting position, at which point the blob has been completely encircled. By storing the encoding and the start coordinate, it is easy to recreate the mask by reverse encoding the sequence, then flood filling the mask. The encoding of the outline should serve as a signature that supports morphological comparisons between fish specimens, as well as providing a compressed representation of the specimen's mask.
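A minimal encoder, assuming an ordered list of 8-connected (x, y) boundary pixels that closes back on its start (as produced, for example, by OpenCV's findContours with CHAIN_APPROX_NONE); this sketch is ours, not the exact implementation.

# 8-connected Freeman directions, indexed 0-7 counterclockwise from east
DIRECTIONS = [(1, 0), (1, -1), (0, -1), (-1, -1),
              (-1, 0), (-1, 1), (0, 1), (1, 1)]

def freeman_chain(boundary):
    # Store the start coordinate with the code so the mask can be decoded later
    start = boundary[0]
    code = [DIRECTIONS.index((x1 - x0, y1 - y0))
            for (x0, y0), (x1, y1) in zip(boundary, boundary[1:])]
    return start, code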

Results
The computational enhancements described in Sect. 4.3 reduced the overall error rate from 4.6% to 1.1%. Here, an error is defined as the inability to detect a fish, a fish eye, a ruler, or the numbers '2' and '3' (which are used to compute image scale) within a specimen image. Whereas the INHS metadata generation process previously took 3.5 hours to run on 7244 images, our GPU optimizations have helped reduce this to 2 hours on 7013 images. The metadata generation process on the UWZM set of 4155 images took 2.5 hours. The difference in computation time is reasonable, since the UWZM images have resolution higher by an order of magnitude compared to the INHS images.
To demonstrate the effectiveness of our error reduction techniques, we ran the original INHS-only model on the refined INHS dataset and compared the results of various enhancement combinations in Table 7. In this table, the bottom row contains the number of errors and the error rates (note n = 7013) produced by our analysis before applying the enhancements. The middle row provides these figures after applying the fish selection rule and specimen upscaling. The top row results then also include contrast enhancement. The columns in the table show which errors occurred. It can be seen that error rates improved (lessened) as more enhancements were applied. Employing all enhancements reduces the total error rate from 4.6% to 2.5%.
Additionally, we have computed results for the various enhancement combinations included in the newly trained INHS + UWZM model, which is applied to the combined INHS and UWZM testing set, as well as to the individual INHS and UWZM testing sets. The results from these studies are presented in Tables 8 through 10. In these tables, 'A' denotes that the training set has been augmented with additional entries. This one enhancement alone gives an overall error rate of 3.0% (note n = 11,168). As the enhancements are introduced, the error rates go down, with the lowest rate, in the top row, being 1.1%. Tables 9 and 10 break down the results by the INHS and UWZM datasets, with the general trend of more enhancements providing better results being evident.

Fish detection
As prescribed by our collection criteria, images in the INHS and UWZM testing datasets contain exactly one fish. In the case of no enhancements except training set augmentation, 11,125 out of 11,168 images had exactly one fish detected, a 99.6% correct rate. In 42 of the images, multiple fish were found. The one fish that was not detected was an extremely small fish from the INHS collection; this type of specimen is currently missing from the training set. In the multiple-fish cases, 1 of the 42 images contained tags that overlapped the fish and were themselves labeled as a second fish. In the remaining 41, detectron erroneously labeled the fish as two or more separate fish objects, or labeled a subsection of the fish multiple times. Fish that were "over-detected" were generally quite large and/or dissimilar from the fish found in the training set. Applying the rule that selects the fish object of highest confidence produces no images with multiple fish. Examining the 42 multiple-fish cases showed that this approach always produced the correct result, with 11,167 out of 11,168 images having exactly one fish detected, a 99.9% correct detection rate. After applying contrast enhancement, a second small fish (in the UWZM collection) was not detected, with 11,166 out of 11,168 images having exactly one fish detected, a 99.9% correct rate. Figure 7 shows the two images in which a fish was not detected. These results are summarized in Table 8, Columns 3 and 5.

Ruler detection
When no enhancements are employed, detectron is able to detect rulers in all of the test images. Applying contrast enhancement produces one image where the ruler is not found, which still yields a 99.9% correct rate. When contrast enhancement was not employed and the ruler was found, there were 28 cases where the numbers "2" and/or "3" on the ruler were not detected; therefore, a scale calculation could not be performed, producing a 99.7% success rate for the scale computation. Applying contrast enhancement improves the calculation, with 17 images for which the numbers were not detected and the scale could not be computed. As seen in Tables 9 and 10, Columns 6 and 7, these errors all come from the INHS dataset. This is understandable, since the images in the UWZM dataset are extremely consistent, whereas the overall INHS image quality can vary significantly.
Images where one of these objects was not detected generally had some form of coloration issue: they were either washed out, very dark, or yellow in hue. Some of the rulers for which "2" and/or "3" were not detected were particularly scratched and damaged. Many of the rulers where both numerals were missed were particularly small within the image, which again may be solvable by expanding the training dataset to more collections than just INHS and UWZM and/or adding more 2's and 3's to the training data.

Eye detection
In the case of no enhancements (besides training set augmentation), detectron was unable to find a fish eye in 273 of the images, producing a 2.4% error rate. Applying upscaling to the fish images resulted in a significant improvement, with 206 images having no detected eyes, a 1.8% error rate. Applying contrast enhancement provided yet another major improvement, with 112 missing eyes, giving an error rate of 1.0%. Contrast enhancement was meant to help all categories of errors, but ended up helping missing eyes the most. Upon investigation, the undetected eyes were generally extremely dark, small, or looked nothing like those found in the training set. These cases usually included catfish and extremely small fish, along with various fish whose eyes are effectively unrecognizable.

Mask generation and encoding
Fish bounding boxes were calculated for all images in which a fish was found. Manual investigation demonstrated that, due to the thresholding process in grayscale space, nearly translucent and light-hued fins and tails were excluded in many cases. Masks and bounding boxes contain the head and trunk of the fish in nearly all cases, but further refinement of our algorithms will be needed to ensure that light fins and tails are masked consistently and accurately. The masks were then encoded and stored in the metadata along with the starting coordinate used in the encoding process. Figure 8 presents the mask (white pixels), as well as the encoded outline of the fish (blue pixels), from Fig. 4.

Scale and length
Image scale and fish lengths were calculated for 11,148 of the images. For the remaining 20 images, either the fish, the "2", and/or the "3" on the ruler were not detected. For validation, image scale (pixels/cm) and fish length were also measured by hand using ImageJ [59]. Scale calculations using the "2" and "3" method are nearly identical to those calculated by hand between the tick marks on the ruler. When the tail of the fish is accurately masked and the specimen is fairly straight, the length calculation is highly accurate as well. Thus, the primary means of lowering the error of the length calculation is to improve tail masking accuracy.

Fig. 8 The mask and outline of a fish, which are generated through pixel analysis and Freeman Encoding respectively

Region and statistical properties
Region and statistical properties were computed from the masks for the three specimen images in Fig. 9. The property values are listed in Table 6 and demonstrate that these properties provide distinguishing features based on the shape of the fish specimens.


Discussion

Our methods have performed with high accuracy with minimal additions of unseen data to the training set. There has also been great success with applying various error reduction techniques, which include image scaling, an augmented training set, selection of the fish with the highest confidence score, and contrast enhancement. By augmenting the training set and performing contrast enhancement, the error rates on each class generally decreased. By performing image scaling on the cropped fish where an eye was initially missing, the number of missing eyes dropped significantly. By performing fish selection, given the nature of the images where multiple fish were detected, the best fish mask was always selected, eliminating the multiple fish error. While we have seen some individual error rates increase after applying contrast enhancement, the overall effect is a lower error rate for the aggregate of all errors. Surprisingly, applying image scaling to the rulers had no effect on improving the number of missing "two"s and "three"s, indicating a need for more training data. Additional training epochs will not improve these errors, since we found that training beyond 15,000 epochs yielded worse results. This is known as the exploding gradient problem, a common problem in deep learning that has been evident since the advent of gradient-based parameter learning [60]. Our current results are more than acceptable and demonstrate an augmented proof of concept that offers a path forward for using object detection technology, enhanced by image informatics techniques, to improve and enrich the metadata needed for advanced specimen image analysis. Overall, our work should advance scientific discovery that is based on analysis of biological specimen image collections.
Our investigation has thus far focused on fish as the specimens of study. Fish are vertebrate animals (phylum Chordata) with over 34,000 known unique species [61], and many more are likely undiscovered. Species names are merely labels, and the discovery of species variation depends on both genotype and phenotype information. The ability to computationally analyze thousands of images of a single fish species, from different habitats and time periods, can lead to new discoveries that are impossible to pursue with manual methods. Digital library researchers have been concerned with computationally extracting image features using content-based image retrieval methods. The work by Torres [62], while over 15 years old, demonstrates the challenges and opportunities of automatically generating useful metadata. Efforts to integrate such automatic metadata generation methods into digital library workflows and architectures still seem limited. This is likely due to the diversity of image shapes and sizes and the inconsistent configurations of specimens, labels, rulers, etc. within them. Object detection as explored in our research, working with an established architecture, is applicable to the larger world of biodiversity, well beyond fish, to include other fauna and flora, art and artefacts, and other digitized objects made accessible for scientific and scholarly research. Following object detection, one can apply pixel analysis and informatics methods to compute many more higher order properties from the initial segmentations.

Conclusion
In this paper we extended a previously described automatic metadata generation approach. Using machine learning and image informatics algorithms, along with a number of image processing methods, our approach is able to locate, mask, and analyze specimens (currently limited to fish) in collection images with a high degree of accuracy. Additional geometric measurements on the specimens are now computed, while the overall error rates have also been improved, as has the runtime through GPU parallelization. Testing this approach on 7013 images from the INHS dataset and 4155 images from the UWZM dataset, we see major success, with only 1.1% of the 11,168 images yielding at least one error. Through further refinement and generalization beyond the INHS and UWZM images, as well as to more species than just fish, we aim to create a tool that can be distributed to specimen image collection curators to correct the metadata sparsity that motivated this work.

Future work
The most pressing next step is to refine the pixel analysis thresholding process in order to improve the accuracy of the specimen masks. The current process performs thresholding on a single color channel (intensity). Some of the lightest tails appear yellow in hue and are easily distinguishable to the human eye, but when compressed to a single intensity value they are almost identical to the grayish background. Given what we have learned with contrast equalization, using CIELab space could be ideal for mask adjustment. Another possible approach to solving this problem is to threshold and mask on subsets of the bounding box, to ensure that very dark trunk pixels do not affect the thresholding of lighter regions.
Our longer term goal is to create a generalized process that works on broad classes of specimen images. For the BGNN project we are beginning with fish images, but we are designing the metadata generation system so that it can eventually operate on other species, if appropriately trained. The first step, which has been accomplished, was to achieve greater generality by augmenting the training set from INHS to UWZM. Another requirement will be to generalize the ruler reading process beyond the reading of digits, which will likely involve an automated method of reading ruler ticks instead. Lastly, the model training setup should be modified from using the default parameters to one that is further optimized. As noted in our discussion, training beyond 15,000 epochs yielded exploding gradients and thus poorer results. A suggested improvement is to experiment with the learning rate, with more complex solutions involving the use of a learning rate scheduler or optimization algorithms like RMSProp and ADAM.
Overall, the research reported in this paper will improve our BGNN workflow, and at the same time demonstrates an innovative approach that should greatly enhance digital library services for the tens of thousands of digitized specimens in a spectrum of image collections.

Fig. 1 A typical INHS image (left) and a typical UWZM image (right)

Fig. 4 A fish image before contrast enhancement (left) and after enhancement (right)

Fig. 5 An image in which a fish was detected multiple times; the bounding box with the highest confidence score provides the expected result

Fig. 7 The INHS image where the fish is never detected (left) and the UWZM image where the fish is not detected after contrast enhancement (right)

Table 1 Training dataset: class and number of instances

Table 4 Original metadata properties (* indicates derived properties)

Table 5 Additional metadata properties