A new methodology for constructing no-reference focus quality assessment metrics

This paper proposes a new methodology that converts a full-reference focus quality assessment metric into a no-reference one. The methodology consists of three hypotheses describing the relationship in focus quality between the original image and its variants. Using the proposed methodology, two no-reference metrics were constructed: the first uses the Brenner gradient, and the second uses a full-reference metric proposed in this work. Evaluation was conducted on a public dataset and on our own proposed dataset. Compared with other no-reference metrics, our second metric exhibited the best performance on both datasets, with a calculation time comparable to that of the fastest metrics considered.


Introduction
Autofocus (AF) is essential for achieving automated microscopy image analysis without visual inspection. However, focusing under a microscope can be challenging given the small depth of field of the lens. Over the decades, various AF methods have been proposed, including gradient-based approaches [1], patching and interpolation [2], wavelet-based approaches [3], depth from focus/defocus algorithms [4], learning-based approaches [5], focus measure curve estimation [6], among many others, and many comparison studies have been conducted on them [7][8][9][10][11]. Focus measurement is also the foundation of focus-based applications. For example, in shape from focus, a focus metric is applied locally on a small window surrounding each pixel. Studies by Malik and Choi [12] and Pertuz et al. [13] compared the performance of various focus metrics for shape from focus under various imaging conditions and different window sizes. Depth from focus/defocus is another application of focus metrics [14][15][16], which estimates the focus level of each pixel and associates it with the scene depth.
Most focus assessment metrics need either a score threshold to distinguish in-focus from out-of-focus cases, or reference to prior/subsequent images, through full-reference (FR) or reduced-reference (RR) assessment, in order to determine the maximum score. This has several drawbacks. Firstly, the "ideal" threshold for a given batch of images (from which the in-focus images are selected) may differ if the imaging conditions, e.g. illumination, vary from batch to batch, making it difficult or even impossible to choose a globally optimal threshold. Secondly, local maxima in some focus metrics may result in false identification of the optimal focus. To prevent this, exhaustive search techniques need to be performed in order to determine the global maximum.
Thirdly, given an image batch not containing any focused image, the focus metrics will still produce a score for each image, in which case the global maximum is misleading as it does not guarantee focusing. On the other hand, for some metrics, e.g., Wavelet_2 in Ref. [17], even the global maximum itself may not correspond to the in-focus position. It is therefore desirable to have focus metrics that work on isolated images without inter-referencing and are free from dataset-specific thresholds. Such metrics are known as no-reference (NR) focus quality assessment (FQA) metrics.
NR FQA belongs to a more general category called NR image quality assessment (IQA). NR IQA is of great interest in applications of image acquisition, transmission, compression and recording, where images inevitably suffer from some type of degradation (e.g. white noise or Gaussian blur) that needs to be recognized and removed. Typical applications of NR IQA include noise estimation [18], blur/sharpness assessment [19,20], distortion evaluation [21], restoration [22], etc. As in-focus and out-of-focus images have different blur levels, NR IQA metrics for defocus identification, i.e., NR FQA, have been proposed. Most NR FQA metrics are either sophisticatedly hand-crafted using spatial and spectral characteristics of the images [23][24][25][26], or purely learning-based through AI approaches [27,28].
In this paper, we aim to develop a new methodology for constructing NR FQA metrics. The methodology consists of a set of hypotheses and can easily convert an FR metric into an NR one.
However, to ensure the methodology works properly, the FR metric should possess certain characteristics. One suitable FR metric is the Brenner gradient (BG) [1]. According to Refs. [17,29], BG exhibits a sharp, distinct peak in the focus range and decays rapidly away from focus. This characteristic makes BG highly sensitive to the change in an image's sharpness level as the image goes from in-focus to out-of-focus, while insensitive to sharpness change from one location to another within the out-of-focus range. To demonstrate the effectiveness of the methodology, we propose two NR focus metrics based on BG and on another FR metric designed in this study, respectively. Their performance was evaluated and compared with state-of-the-art NR focus metrics on our own dataset as well as on a public dataset, FocusPath [30]. In summary, this paper makes the following contributions to focus assessment:
• A new methodology for constructing NR FQA metrics is proposed
• A new FR metric, called sum of gradient (SoG), is proposed
• Two NR FQA metrics, NRBG and NRSoG, constructed based on the methodology, are proposed
• A new dataset, SS316_ShotPeen 1, is proposed
The remainder of this paper is as follows: Section 2 reviews previous work related to NR FQA. Section 3 describes the methodology and its hypotheses as well as the proposed NR metrics.
Comparisons between our metrics and other state-of-the-art methods on our own dataset and FocusPath are presented in Section 4. Section 5 concludes the paper.

Previous work
There have been many studies on NR FQA in the past decades. For example, Marais and Steyn [23] proposed a metric that differentiates between in-focus and out-of-focus blur using a variation of the spectral subtraction method. Wu et al. [24] proposed an NR method for defocus blur measurement in which the Sobel operator is used for edge detection and the Radon transform is applied to locate line features. Bahrami and Kot [31] defined the maximum local variation (MLV) of each pixel and used the standard deviation of the weighted MLV distribution as a metric to measure sharpness.
In another study, by Hassen et al. [26], a sharpness metric was developed based on the local phase coherence (LPC) near distinctive image features, evaluated in the complex wavelet transform domain. Recently, Hosseini and Plataniotis [32] proposed MaxPol convolution kernels, which closely approximate a visual sensitivity model, and built their NR image sharpness metric on top of it. In a further study, Hosseini et al. [30] proposed a novel NR sharpness metric using MaxPol kernels and based on the human visual system response. Their test results showed that the proposed metric significantly outperformed other metrics on both synthetic and natural blur databases. In Ref. [33], Hosseini et al. further tailored their method of Ref. [30] to digital pathology images, resulting in an NR FQA metric called FQPath, which showed the best overall performance among the eleven state-of-the-art methods considered in their study.
With the advances in AI technologies, in particular deep learning, research on learning-based focus metrics has been conducted. For example, Senaras et al. [27] proposed a novel deep learning framework, DeepFocus, to identify blurry regions in digital slides. In another study, Yang et al. [28] trained a deep learning model using synthetically defocused images generated from natural in-focus microscope images. The model can work in a no-reference way by predicting the level of focus for isolated images. Wang et al. [34] proposed FocusLiteNN, a highly efficient CNN-based model with only 148 parameters for FQA, trained specifically for assessing pathological images. For AI approaches, although no user-specified parameters are needed, model training can be time-consuming, and the performance of a model depends on the quality and quantity of the dataset, which may not always be sufficient. Moreover, the transferability of an AI model can also be a problem, especially when the images encountered in real applications differ significantly from those in the training set.

Methodology and hypotheses
The key idea of the proposed methodology, which converts an FR metric into an NR one, is the comparison of sharpness between the original image (f) and its three variants, namely, the blurred image (g), the downsampled image (h), and the blurred downsampled image (j), using the scores given by the FR metric. For a suitable FR metric, denoted as M, which gives a higher score for a more focused image, three hypotheses are assumed: a) blurring decreases sharpness, i.e., g is less sharp than f, and j is less sharp than h; b) downsampling increases sharpness, i.e., h is sharper than f, and j is sharper than g; c) the relative sharpness increment caused by downsampling is smaller for a sharper image. Hypothesis c) can be interpreted as follows: for an image that is already very sharp, further downsampling will only increase its sharpness slightly, so its relative sharpness increment is small; on the other hand, an image that is not very sharp will gain more sharpness after downsampling, so its relative sharpness increment is large. A suitable FR metric is one whose score satisfies the abovementioned hypotheses. We further prescribe that the FR metric is a convex function of image sharpness, i.e., the metric score acceleratingly increases as the image becomes sharper.
For the given M, the construction of the corresponding NR metric can be described as follows.
First, we calculate the ratios of M scores between f and h and between g and j, respectively, i.e.,

r_h = M(f)/M(h),  (1)
r_j = M(g)/M(j).  (2)

We further prescribe that

r_h < r_j.  (3)

Finally, it is required that

r_j < 1.  (4)

If f is focused, from (3&4) we can derive that r_h < 1; thus Hypothesis b) is satisfied.
The conditions given by (3&4) can be qualitatively explained with Fig. 1, which shows M as a convex function of image sharpness. In microscope imaging, an FR metric, e.g. BG, is a function of the z-stack position. Here the general term "sharpness" is used in order to compare the focus level between the original image and its variants, as there is no corresponding z-stack location associated with them. The scores of f and h are marked by green lines. The sharpness of g and j is smaller than that of f and h, respectively, according to Hypothesis a). For ease of explanation, we further assume that the effect of blurring is infinitesimal; therefore g and j are to the left of f and h with a small shift Δ along the sharpness axis, and their M scores are marked with red lines. Expanding M(g) and M(j) at f and h, respectively, up to first order in a Taylor series, and applying the condition of (3), we have

k1 M(h) < k2 M(f),  (5)

in which k1 and k2 are the derivatives of M with respect to sharpness at f and h, or, more explicitly, the slopes of the two blue lines in Fig. 1. From (4&5) we can get

k1 < k2.  (6)

Hence, if (3&4) are satisfied, k1 < k2, which means image f is in the region of the sharpness axis where the score of M is acceleratingly increasing. For an FR metric like BG, which exhibits a rapid increase only within a small neighborhood of the focus location, the fact that f lies in this acceleratingly increasing region ensures that f is at, or very near, the focus location.
The construction is illustrated in Fig. 2. Conditions (3&4) are softened by two sigmoid functions, each giving a value close to 1 or 0 when the condition is satisfied or unsatisfied, respectively. We use the ratio of r_j to r_h and the reciprocal of r_j to implement conditions (3&4), which showed better performance than simply taking the difference. Two weight factors, w1 and w2, are applied to amplify the input strength. If both conditions are satisfied, the sum of the sigmoid outputs will be close to 2, and the output of the ReLU (with bias -1) will be close to 1; otherwise, the output of the ReLU will be close to 0. The ReLU output is multiplied by r_h to produce the final NR score.
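The conversion pipeline can be sketched as follows. This is a minimal illustration, not the authors' implementation: the 2x downsampling factor and the small Gaussian blur are assumptions, conditions (3&4) are taken in the forms r_h < r_j and r_j < 1, and `fr_metric` is a placeholder for any suitable FR metric M.

```python
import numpy as np

def gaussian_blur(img, sigma=1.0, radius=2):
    """Separable Gaussian blur implemented with numpy only."""
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x ** 2 / (2 * sigma ** 2))
    k /= k.sum()
    out = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, img)
    out = np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, out)
    return out

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def nr_score(img, fr_metric, w1=10.0, w2=10.0):
    """NR score built from an FR metric `fr_metric`.

    f: original, g: blurred, h: downsampled, j: blurred downsampled.
    """
    f = img.astype(np.float64)
    g = gaussian_blur(f)
    h = f[::2, ::2]                      # assumed 2x downsampling
    j = gaussian_blur(h)

    r_h = fr_metric(f) / fr_metric(h)    # ratio for the sharp pair
    r_j = fr_metric(g) / fr_metric(j)    # ratio for the blurred pair

    # Softened conditions: (3) r_h < r_j and (4) r_j < 1.
    c1 = sigmoid(w1 * (r_j / r_h - 1.0))
    c2 = sigmoid(w2 * (1.0 / r_j - 1.0))

    gate = max(c1 + c2 - 1.0, 0.0)       # ReLU with bias -1
    return gate * r_h
```

Any FR score function that satisfies the hypotheses can be plugged in as `fr_metric`; the gate suppresses the score when either softened condition is clearly violated.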

NR metrics
Two FR metrics are adopted in this paper to construct their corresponding NR metrics. One is BG, whose score is calculated as the sum of squared differences between pixels two rows (or columns) apart; the other is the proposed sum of gradient (SoG). The metrics were first validated on a small dataset consisting of three groups of images: Group 1 with large pixel value variation, Group 2 with small pixel value variation, and Group 3 consisting of black images. For Groups 1&2, both scores increase as the z-stack moves closer to the focus location; however, from bottom to peak, the relative change of SoG is more than two times larger than that of BG, meaning that the former is more sensitive to focus change. For Group 3, as expected, both metrics have small values and low variation in their scores. The reason for choosing a Gaussian kernel is that it is commonly used for blurring [13,35,36]. Other kernels, e.g. the uniform kernel, have also been tried, and the difference is not significant. The weights w1 and w2 are both equal to 10 in our design, so as to enhance the sigmoid input by one order of magnitude. For Groups 1&2, NRSoG1 gives non-zero scores over a wider range of the z-stack than NRBG1. This is because r_h of NRSoG1 is lower and changes more slowly than that of NRBG1, as can be seen, e.g., in Group 1. Since the conditions in (3&4) are softened by sigmoid functions, even if they are slightly violated, the scores of both metrics may still be larger than zero. This gives the metrics more flexibility to assess images with few features or low pixel intensity variation, such as those in Group 2. However, if r_h or r_j is significantly larger than r_j or 1, respectively, the metric scores will be zero. This can be observed clearly in Group 3. Furthermore, for both metrics, the non-zero score is higher for more focused images, which demonstrates the effectiveness of the proposed methodology.
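For reference, the height-direction Brenner gradient can be sketched as below. This is the standard textbook form (sum of squared differences between pixels two rows apart) and may differ in detail from the exact variant used here; the SoG formula is not reproduced.

```python
import numpy as np

def brenner_gradient_height(img):
    """Brenner gradient along the height (row) direction:
    sum of squared differences between pixels two rows apart."""
    f = img.astype(np.float64)
    diff = f[2:, :] - f[:-2, :]   # difference over a 2-pixel offset
    return float(np.sum(diff ** 2))
```

A width-direction version is obtained by differencing columns instead of rows; a sharper image yields larger pixel differences and hence a larger score.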
It is worth mentioning that the sharpness measured by the metrics in this paper is the sharpness perceived by the human visual system, which is also what the term "sharpness" in Fig. 1 means. For an in-focus image of a blurry or featureless scene, for example a purely black surface, the appearance may resemble those in Group 3, and the metrics will score zero on such an image.
Finally, the definitions of NRSoG and NRBG, which consider both the height and width convolutions in the corresponding FR metrics, are given as follows:

NRBG = max(NRBG1, NRBG2),  NRSoG = max(NRSoG1, NRSoG2),

where subscripts 1 and 2 denote the height and width directions, respectively. The max operation is adopted to ensure that, for a sharp image with pixel intensity varying in only one direction, either height or width, the NR metrics will still give a non-zero score.

Experiments & discussion
We evaluated our proposed metrics on two blur image datasets, both in png format. One is the public FocusPath dataset [30]; the other is our proposed SS316_ShotPeen, an example from which is the set of 21 slices of specimen 10 at position 1. A detailed description of SS316_ShotPeen can be found on its homepage. The evaluation results are summarized in Table 1, with the top three best scores highlighted in bold.
It can be seen that all the scores of NRSoG, except the SRCC on FocusPath and the calculation time, are among the best for both datasets. Most scores of FocusLiteNN and HVS-MaxPol-2 (both were trained or fine-tuned on FocusPath) are also among the best for FocusPath. But for SS316_ShotPeen, some scores, e.g. the PLCC, KRCC and SRCC, of FocusLiteNN and HVS-MaxPol-2 are significantly lower than those of NRSoG. For MLV and ARISMC, each has some scores among the top three for one dataset but performs relatively poorly on the other. Therefore, NRSoG is of higher generality. The fact that some metrics show good performance on one dataset but moderate, if not poor, performance on the other indicates that the two datasets contain images of different focus quality characteristics, and those metrics captured the characteristics of one but not both datasets. Therefore, the proposed dataset, SS316_ShotPeen, widens the scope for FQA metric validation. On the other hand, one of the most important virtues of an NR metric is the ability to give an absolute score on image sharpness.
The score should be universally meaningful, irrespective of the image contents. Therefore, a dataset with a ground truth focus quality score for each image is very useful for testing the no-referenceness. This is another motivation for proposing SS316_ShotPeen. As seen in Table 1, the RMSE against the ground truth for NRSoG is the lowest on both datasets, suggesting that NRSoG has the best overall no-referenceness.
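The agreement scores of the kind reported in Table 1 can be computed as in the following sketch, using scipy's standard correlation routines; the function name and the score arrays in the usage note are illustrative, not taken from the paper.

```python
import numpy as np
from scipy import stats

def fqa_scores(predicted, ground_truth):
    """PLCC, SRCC, KRCC and RMSE between metric scores and ground truth."""
    predicted = np.asarray(predicted, dtype=np.float64)
    ground_truth = np.asarray(ground_truth, dtype=np.float64)
    plcc, _ = stats.pearsonr(predicted, ground_truth)    # linear correlation
    srcc, _ = stats.spearmanr(predicted, ground_truth)   # rank correlation
    krcc, _ = stats.kendalltau(predicted, ground_truth)  # ordinal association
    rmse = float(np.sqrt(np.mean((predicted - ground_truth) ** 2)))
    return plcc, srcc, krcc, rmse
```

For example, `fqa_scores([0.1, 0.5, 0.9], [0.0, 0.4, 1.0])` returns the four scores for three hypothetical images; the RMSE is the only one of the four that is sensitive to the absolute scale of the scores, which is why it probes no-referenceness.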
NRBG and NRSoG are just two possible NR metrics that can be constructed using the proposed methodology. By incorporating different FR metrics into the methodology in Fig. 2, new NR metrics can be generated. Therefore, developing an NR metric can be reduced from sophisticatedly handcrafting power spectrum/frequency analyzers to looking for suitable FR metrics, which can be much easier (for example, neither NRBG nor NRSoG includes any power spectrum/frequency analysis). The methodology is based on the three hypotheses (we choose not to use the term "axioms" to avoid overstatement); hence this work is a first attempt at building an axiomatic system for NR FQA.
The proposed methodology goes beyond existing NR metrics in several aspects. First, it does not require analysis of the frequency components, making it simpler in concept and faster in processing than sophisticatedly designed NR metrics. Second, it is based on hypotheses that can be quantitatively tested and refined. NR metrics with power spectrum/frequency analysis exploit the fact that images with high-frequency content are sharp. However, since there is no definitive boundary between high and low frequencies, and the notion of high/low frequency is scene-dependent, it is difficult to set up quantitative rules on the frequency components. Therefore, all frequency-based metrics have to rely on empirical assumptions to estimate whether a frequency component is high or low in a given image.
Third, compared to AI-based metrics, the methodology has an explainable architecture and better generality.

Conclusion
A new methodology for constructing NR focus quality metrics is proposed. The methodology consists of three hypotheses describing the relationship in focus quality between the original image and its blurred, downsampled, and blurred downsampled versions. With those hypotheses, two NR metrics, NRBG and NRSoG, were constructed, using BG and our proposed SoG, respectively, as kernels.
Validation was conducted on a small dataset consisting of three groups of images, i.e., those with large pixel value variation, those with small pixel value variation, and black images. The behaviours of both BG and SoG were analysed. The results showed that SoG has a wider range of non-zero values and a higher peak than BG, and is hence more sensitive to focus quality change. NRBG and NRSoG, constructed from BG and SoG, respectively, showed the desired behaviour, yielding high values for more focused images and zero for the black images.
The performance of NRBG and NRSoG was evaluated and compared with that of other NR metrics on two datasets, FocusPath and our proposed SS316_ShotPeen. The results showed that NRSoG has the best performance on both datasets, exhibiting much higher generality than the other metrics, and its calculation time is comparable to that of the fastest metrics considered in the present study. NRBG, which is simpler than NRSoG, is one of the fastest metrics, and it performs well on both datasets. It has also been shown that some metrics performed well on one dataset but not both, indicating that SS316_ShotPeen has characteristics different from FocusPath. Therefore, the proposed dataset widens the scope for FQA metric validation.
Compared to state-of-the-art NR FQA metrics, the proposed methodology has several advantages: simplicity in concept, a hypothesis-based approach that can be quantitatively tested, an explainable architecture, and good generality. Most importantly, it is an alternative approach which can provide new insights and deepen the understanding of the NR FQA problem.
In the future, several studies will be conducted. The first will be a full analysis and performance evaluation of SoG. The main focus of the current study is to present the methodology and demonstrate its effectiveness, and SoG serves as a kernel for building the best NR metric, NRSoG.
However, SoG deserves a thorough study of its own. From the validation study in this paper, it has been observed that SoG shows better performance than the Brenner gradient. Therefore, it can be an effective FR FQA metric for various focus/defocus-based applications, and a potential replacement for the Brenner gradient. The second study will be on the underlying mechanism of the proposed methodology. Since both the methodology and the frequency-based approaches address the same phenomenon, and frequency analysis is a more fundamental approach, the former should be explainable by the latter. The third study will be on the validation of the proposed hypotheses using various metrics and datasets. Finally, we will refine the methodology and look for new kernels for constructing no-reference metrics.