Using a rank-based design in estimating prevalence of breast cancer

It is highly important for governments and health organizations to monitor the prevalence of breast cancer as a leading source of cancer-related death among women. However, the accurate diagnosis of this disease is expensive, especially in developing countries. This article concerns a cost-efficient method for estimating prevalence of breast cancer, when diagnosis is based on a comprehensive biopsy procedure. Multistage ranked set sampling (MSRSS) is utilized to develop a proportion estimator. This design employs some visually assessed cytological covariates, which are pertinent to determination of breast cancer, so as to provide the experimenter with a more informative sample. Theoretical properties of the proposed estimator are explored. Evidence from numerical studies is reported. The developed procedure can be substantially more efficient than its competitor in simple random sampling (SRS). In some situations, the proportion estimation in MSRSS needs around 76% fewer observations than that in SRS, given a precision level. Thus, using MSRSS may lead to a considerable reduction in cost with respect to SRS. In many medical studies, e.g. diagnosing breast cancer based on a full biopsy procedure, exact quantification is difficult (costly and/or time-consuming), but the potential sample units can be ranked fairly accurately without actual measurements. In this setup, multistage ranked set sampling is an appropriate design for developing cost-efficient statistical methods.


Introduction
Breast cancer is the most commonly occurring female cancer, and a leading cause of cancer-related deaths among women worldwide. Although it is often thought to be a disease of the developed world, it is prevalent in developing countries as well. The breast cancer survival rate in developed countries is higher than that in middle/low-income countries, where the lack of early detection programs with good coverage is pronounced. In developing countries, the situation is worsened by the lack of adequate diagnosis and treatment facilities. Many studies have been conducted on different aspects of breast cancer (Shapiro, 2018).
Although the causes of breast cancer are not fully understood, researchers concur that certain factors raise a person's risk of developing this disease. These factors include age, family and personal health history, genetics, and hormonal factors, among others. Adopting proper prevention measures, at different levels, may reduce the incidence rate of the disease in the long term.
Unfortunately, there is no definitive method of preventing breast cancer. Early detection, however, is a crucial step for the successful management of the disease. It allows for different treatment options, thereby increasing the chance of survival and improving quality of life. In addition, it may cut the treatment costs, which are high both for the involved person and for society as a whole.
Some standard tests which can be utilized to diagnose breast cancer include breast exam, mammogram, breast ultrasound, and biopsy. The last one is the only definitive way to diagnose breast cancer. In this method, tissues or cells are removed from the body in order to be tested in a lab. Then, a pathologist verifies whether the sample contains cancer cells. A comprehensive biopsy procedure is costly, and processing the results takes time. Fine needle aspiration (FNA) biopsy is the least expensive technique of tissue sampling that provides reliable results. It belongs to a group of less invasive methods for sampling a breast lesion to determine if it is cancerous or benign. The FNA biopsy has been successfully applied in early detection of breast cancer. Owing to this diagnostic tool, many unjustified major surgical (open) biopsies are avoided.
Suppose we are interested in estimating the prevalence of breast cancer in a given population, using a full biopsy procedure. Taking cost considerations into account, it is desirable to employ a sampling design that enables us to draw inference about the population proportion based on a possibly small sample. Toward this end, rank-based sampling methods can be efficiently adopted. They are applicable in settings where exact quantification is difficult (costly and/or time-consuming), while informal ranking of the potential sample units can be done fairly accurately and easily. The rankings are performed by judgment or through use of an easily available covariate, and they need not be totally free of errors. For example, in the context of breast cancer, the FNA biopsy is a quick and relatively cheap test which yields some visually assessed cytological characteristics (covariates). These covariates can be used to rank the patients according to the probability of having cancerous tumors in a rank-based sampling method. The informal ranking process helps the experimenter to focus on the actual measurement of more representative units in the population, thereby enhancing precision of the estimation.
Ranked set sampling (RSS) is a rank-based design, due to McIntyre (1952, 2005), which is a cost-efficient alternative to simple random sampling (SRS). It has been recently applied in a variety of disciplines, including agriculture (Mahdizadeh and Zamanzade, 2018), auditing (Gemayel et al., 2012), and medicine (Zamanzade and Mahdizadeh, 2017), among others. Bouza-Herrera and Al-Omari (2019) collect some new developments in this area. In this article, we deal with estimating the prevalence of breast cancer using a generalization of RSS, i.e. multistage ranked set sampling (MSRSS).
In Section 2, proportion estimator under MSRSS is presented, and its properties are studied. Section 3 contains results of a simulation study performed to investigate finite sample performance of the proposed estimator. Some illustrations using real data are also included. Final conclusions and directions for future research are provided in Section 4. Proofs are collected in two appendixes.

Methods
According to the RSS scheme, a sample of size $N = nm$, using set size $m$, is obtained as follows:
1. First, $m^2$ units are identified from the population.
2. Next, the $m^2$ units are randomly divided into $m$ sets of size $m$.
3. The elements of the $i$th ($i = 1, \ldots, m$) set are ordered, and the unit with judgment rank $i$ is identified.
4. The $m$ units identified in step 3 are actually measured.
5. Finally, steps 1-4 are repeated for $n$ cycles.
The ranking mechanism in step 3 does not involve actual quantification of the attribute of interest. It can be based on expert opinion or on covariate information. The final sample may be denoted as $\{X_{[i]j} : i = 1, \ldots, m;\ j = 1, \ldots, n\}$, where $X_{[i]j}$ is the $i$th judgment order statistic in the $j$th cycle. This sample is more informative than a sample of size $N$ collected by SRS, and thus it often improves statistical inference. The better the ranking quality, the greater the improvement. Perfect ranking is the situation in which ranking errors are absent. The ranking scheme is said to be consistent if the same ranking mechanism is applied to all sets of size $m$.
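The five steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `ranked_set_sample`, the representation of the population as a list of `(x, y)` pairs, and drawing units with replacement are all choices made for the example. Ranking uses only the covariate `y`; the attribute `x` is read only for the units that are finally selected.

```python
import random


def ranked_set_sample(population, m, n, rng):
    """Draw a ranked set sample of size n * m (illustrative sketch).

    population: list of (x, y) pairs, where x is the costly attribute and
                y is a cheap covariate used only for ranking (assumed layout).
    Returns a list of n cycles; each cycle is a list of m measured x values,
    and position i holds the unit with judgment rank i + 1 from the i-th set.
    """
    sample = []
    for _ in range(n):                                   # step 5: n cycles
        # step 1: identify m^2 units (drawn with replacement here)
        units = [rng.choice(population) for _ in range(m * m)]
        # step 2: randomly divide them into m sets of size m
        rng.shuffle(units)
        sets = [units[k * m:(k + 1) * m] for k in range(m)]
        cycle = []
        for i, one_set in enumerate(sets):
            # step 3: order by the covariate only; ties are broken at random
            ranked = sorted(one_set, key=lambda u: (u[1], rng.random()))
            # step 4: measure x for the unit with judgment rank i + 1
            cycle.append(ranked[i][0])
        sample.append(cycle)
    return sample
```

Only `n * m` values of `x` are read, although `n * m * m` units are inspected through their covariate, which is where the cost saving of the design comes from.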
In order to clarify RSS, we describe drawing a ranked set sample using $m = 3$ and $n = 1$. First, 9 sample units are identified from the population, and randomly divided into 3 sets of size 3. The three sets are denoted by
$$\{U^j_1, U^j_2, U^j_3\}, \quad j = 1, 2, 3,$$
where $U^j_i$ ($i, j = 1, 2, 3$) is the $i$th unit in the $j$th set. In each set, the units are ordered with respect to the variable of interest. This step results in
$$\{U^j_{[1]}, U^j_{[2]}, U^j_{[3]}\}, \quad j = 1, 2, 3,$$
where $U^j_{[i]}$ ($i, j = 1, 2, 3$) shows the unit with judgment rank $i$ in the $j$th set. Finally, the ranked set sample is obtained by quantifying the variable of interest for elements of the set $\{U^1_{[1]}, U^2_{[2]}, U^3_{[3]}\}$.
Many RSS-based procedures have been developed for discrete and continuous data; see Wolfe (2012) for a recent review. In the following, we focus on binary data. Liudahl (2004), and Chen et al. (2005) addressed point estimation for the population proportion $p$, while Terpstra and Miller (2006), and Terpstra and Wang (2008) dealt with the interval estimation problem. Let $X_{[i]j}$ be either 0 or 1, representing a failure or a success, respectively. Then, the proportion estimator in RSS is given by
$$\hat{p}_{RSS} = \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} X_{[i]j}.$$
Suppose the proportion estimator based on a simple random sample of size $N$ is denoted by $\hat{p}_{SRS}$. The next result states important properties of $\hat{p}_{RSS}$, which hold regardless of the ranking quality. It can be found in the literature, but the proof has not been detailed, as far as we know.
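Proposition 1: Let $\{X_{[i]j} : i = 1, \ldots, m;\ j = 1, \ldots, n\}$ be a ranked set sample, drawn based on a consistent ranking scheme, from a population with proportion $p$. If $\hat{p}_{RSS}$ is defined as above and $p_{[i]} = P(X_{[i]1} = 1)$, then
a) $E(\hat{p}_{RSS}) = p$ and $\mathrm{Var}(\hat{p}_{RSS}) = \frac{1}{m^2 n} \sum_{i=1}^{m} p_{[i]} (1 - p_{[i]})$.
b) $\mathrm{Var}(\hat{p}_{RSS}) \le \mathrm{Var}(\hat{p}_{SRS})$.
c) As $n$ tends to infinity, $\sqrt{n}\,(\hat{p}_{RSS} - p) \xrightarrow{d} N\left(0, \frac{1}{m^2} \sum_{i=1}^{m} p_{[i]} (1 - p_{[i]})\right)$.
The original RSS scheme has been tailored to propose more efficient designs in specific situations.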
The MSRSS scheme, due to Al-Saleh and Al-Omari (2002), is an interesting generalization that allows attaining higher efficiency, given a fixed set size. An $r$th stage ranked set sample of size $N = mn$, using set size $m$, is obtained as follows:
1. First, $m^{r+1}$ units are identified from the population.
2. Next, the $m^{r+1}$ units are randomly divided into $m^{r-1}$ sets of size $m^2$.
3. Steps 2 and 3 of the RSS algorithm are applied to each set in step 2 to obtain a (judgment) ranked set of size $m$. This yields $m^{r-1}$ (judgment) ranked sets of size $m$.
4. Step 3 is applied to the $m^{r-1}$ ranked sets to obtain $m^{r-2}$ second stage (judgment) ranked sets of size $m$.
5. Step 3 is repeated until ending in an $r$th stage (judgment) ranked set of size $m$.
6. The $m$ units identified in step 5 are actually measured.
7. Finally, steps 1-6 are repeated for $n$ cycles.
The resulting multistage ranked set sample is denoted by $\{X^{(r)}_{[i]j} : i = 1, \ldots, m;\ j = 1, \ldots, n\}$, where $X^{(r)}_{[i]j}$ is the $i$th judgment order statistic in the $j$th cycle. Clearly, MSRSS with $r = 1$ is simply the basic RSS.
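One cycle of the multistage procedure can be written recursively: a stage-$s$ ranked set is built by forming $m$ ranked sets of stage $s-1$, ordering each of them, and keeping the unit with judgment rank $i$ from the $i$th one. The sketch below is an illustration under assumptions, not the authors' code: the name `msrss_cycle`, the `(x, y)` layout of the population, ranking by the covariate `y`, and sampling with replacement are all choices made for the example.

```python
import random


def msrss_cycle(population, m, r, rng):
    """Return the m measured x values of one r-th stage ranked set (sketch).

    population: list of (x, y) pairs; y is the ranking covariate (assumed).
    A stage-0 set is m randomly drawn units, so a stage-r set consumes
    m ** (r + 1) units in total.
    """
    def unit_with_rank(units, i):
        # order by covariate only; ties are broken at random
        ranked = sorted(units, key=lambda u: (u[1], rng.random()))
        return ranked[i]

    def ranked_set(stage):
        if stage == 0:
            return [rng.choice(population) for _ in range(m)]
        # the i-th member comes from ordering its own stage-(s - 1) set
        return [unit_with_rank(ranked_set(stage - 1), i) for i in range(m)]

    # only now is the costly attribute x read, for m units
    return [x for x, _y in ranked_set(r)]
```

With `r = 1` the function reduces to one cycle of basic RSS, and with `r = 2`, `m = 3` it inspects 27 units and measures 3 of them.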
We now illustrate drawing a multistage ranked set sample using $r = 2$, $m = 3$, and $n = 1$. First, 27 sample units are identified from the population, and randomly divided into 3 sets of size 9. The three sets are denoted by
$$\{V^k_{ij} : i, j = 1, 2, 3\}, \quad k = 1, 2, 3,$$
where $V^k_{ij}$ is the unit in the $i$th row and $j$th column of the $k$th set. In each set, the units of each row are ordered with respect to the variable of interest. This step yields
$$\{V^k_{i[j]} : i, j = 1, 2, 3\}, \quad k = 1, 2, 3,$$
where $V^k_{i[j]}$ shows the unit with judgment rank $j$ in the $i$th row of the $k$th set. Next, the units in the sets
$$S_k = \{V^k_{1[1]}, V^k_{2[2]}, V^k_{3[3]}\}, \quad k = 1, 2, 3,$$
are ordered. Finally, the 2nd stage ranked set sample is obtained by quantifying the variable of interest for elements of the set $\{V^1_{[11]}, V^2_{[22]}, V^3_{[33]}\}$, where $V^i_{[ii]}$ is the unit with judgment rank $i$ in $S_i$.
MSRSS has been applied in estimating the population mean by Al-Saleh and Al-Omari (2002). Frey and Feeman (2018), and Zamanzade (2017b, 2019a) are examples of recent works based on this design. To the best of our knowledge, proportion estimation in MSRSS has not been investigated in the literature, although estimating a proportion is a frequently used procedure in medical studies.
We propose the proportion estimator in MSRSS as
$$\hat{p}^{(r)}_{MSRSS} = \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} X^{(r)}_{[i]j}.$$
Properties of this estimator are derived in analogy with the sample mean in MSRSS. The main results are summarized in the next proposition.
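Proposition 2: Let $\{X^{(r)}_{[i]j} : i = 1, \ldots, m;\ j = 1, \ldots, n\}$ be a multistage ranked set sample, drawn based on a consistent ranking scheme, from a population with proportion $p$. If $\hat{p}^{(r)}_{MSRSS}$ is defined as above and $p^{(r)}_{[i]} = P(X^{(r)}_{[i]1} = 1)$, then
a) $E(\hat{p}^{(r)}_{MSRSS}) = p$ and $\mathrm{Var}(\hat{p}^{(r)}_{MSRSS}) = \frac{1}{m^2 n} \sum_{i=1}^{m} p^{(r)}_{[i]} (1 - p^{(r)}_{[i]})$.
b) $\mathrm{Var}(\hat{p}^{(r)}_{MSRSS}) \le \mathrm{Var}(\hat{p}_{SRS})$.
c) $\mathrm{Var}(\hat{p}^{(r)}_{MSRSS})$ is decreasing in $r$, if the perfect ranking is assumed.
d) As $n$ tends to infinity, $\sqrt{n}\,(\hat{p}^{(r)}_{MSRSS} - p) \xrightarrow{d} N\left(0, \frac{1}{m^2} \sum_{i=1}^{m} p^{(r)}_{[i]} (1 - p^{(r)}_{[i]})\right)$, where $\xrightarrow{d}$ denotes convergence in distribution.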
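For binary data under perfect ranking, the quantities behind the variance comparison can be computed exactly rather than simulated. The sketch below rests on two assumptions that are stated here rather than taken from the text: (i) the $i$th unit of a stage-$s$ ranked set is the $i$th order statistic of a stage-$(s-1)$ ranked set, so it equals 1 exactly when at least $m - i + 1$ of the $m$ independent units in that set are successes, and (ii) the variance for $n = 1$ is the average of the Bernoulli variances divided by $m$, as in the appendix. The function names are hypothetical.

```python
def order_statistic_success_probs(probs):
    """probs: success probabilities of m independent 0/1 variables.

    Returns q, where q[i] is the probability that the (i + 1)-th smallest
    of the m values equals 1.
    """
    m = len(probs)
    # dist[k] = P(exactly k successes), built up one variable at a time
    dist = [1.0] + [0.0] * m
    for p in probs:
        for k in range(m, 0, -1):
            dist[k] = dist[k] * (1.0 - p) + dist[k - 1] * p
        dist[0] *= 1.0 - p
    # the (i + 1)-th smallest value is 1 iff at least m - i successes occur
    return [sum(dist[m - i:]) for i in range(m)]


def stage_success_probs(p, m, r):
    """Success probabilities of the m units in an r-th stage ranked set."""
    probs = [p] * m                       # stage 0: plain random units
    for _ in range(r):
        probs = order_statistic_success_probs(probs)
    return probs


def relative_efficiency(p, m, r):
    """Var(SRS estimator) / Var(MSRSS estimator) for n = 1, perfect ranking."""
    probs = stage_success_probs(p, m, r)
    var_msrss = sum(q * (1.0 - q) for q in probs) / m ** 2
    var_srs = p * (1.0 - p) / m
    return var_srs / var_msrss
```

As a check by hand, for $m = 3$, $r = 1$ and $p = 0.5$ the three success probabilities are 0.125, 0.5 and 0.875; they average to $p$, and the variance ratio is $(0.25/3) / (0.46875/9) = 1.6$.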

Simulation study
To investigate finite-sample properties of the proposed estimator, we conducted a Monte Carlo simulation study. The estimators are compared through the relative efficiency (RE), defined as
$$RE = \frac{\mathrm{Var}(\hat{p}_{SRS})}{\mathrm{Var}(\hat{p}^{(r)}_{MSRSS})}.$$
Comparing the variance of $\hat{p}^{(r)}_{MSRSS}$ in Proposition 2 (a) with that of $\hat{p}_{SRS}$, it is easily seen that the above RE does not depend on the number of cycles. In our comparisons, we thus assumed that $n = 1$.
The numerator of the RE has a simple formula, i.e., $\mathrm{Var}(\hat{p}_{SRS}) = p(1 - p)/N$. In order to determine the RE, the variance of $\hat{p}^{(r)}_{MSRSS}$ was estimated by simulation. In generating multistage ranked set samples, we need a model that allows for the possibility of judgment ranking errors. In the RSS literature, such models are known as imperfect ranking models; see Frey (2007), for example. In the following, we describe an extension of the imperfect ranking model developed by Dell and Clutter (1972).
Suppose the variable of interest $X$, with mean $\mu_X$ and standard deviation $\sigma_X$, is ranked by means of a covariate $Y$. In this model, the two variables are related as
$$Y = \lambda \left( \frac{X - \mu_X}{\sigma_X} \right) + \sqrt{1 - \lambda^2}\, Z,$$
where $Z$ is a standard normal random variable independent of $X$. Here, the ranking quality is determined by the parameter $\lambda$, which is the correlation coefficient between $X$ and $Y$. The random ranking and the perfect ranking correspond to $\lambda = 0$ and $\lambda = 1$, respectively. In our problem, $X$ is a Bernoulli random variable with success probability $p$.
The estimated RE values are presented in Figures 1-3. The RE attains its highest value at $p = 0.5$, while it declines symmetrically toward $p = 0$ and $p = 1$. For any combination of $m$ and $p$, increasing $r$ gives rise to improvement in the RE, and this is more evident as $p$ deviates from zero/unity. Finally, the efficiency gain is naturally increasing in $\lambda$, given that the other factors are fixed. This is the case with most of the statistical methods developed for RSS-based schemes. It is interesting to note that for a small sample of size 5, $\hat{p}^{(r)}_{MSRSS}$ could be four times more efficient than $\hat{p}_{SRS}$ if $p$ is close to 0.5, and the perfect ranking is assumed (see the top panel in Figure 3).
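A ranking covariate with a prescribed correlation to a Bernoulli attribute can be generated as sketched below. The sketch assumes the linear form $Y = \lambda (X - \mu_X)/\sigma_X + \sqrt{1 - \lambda^2}\, Z$ for the ranking model, with $\mu_X = p$ and $\sigma_X = \sqrt{p(1 - p)}$ for a Bernoulli variable; the function names `ranking_covariate` and `draw_unit` are hypothetical.

```python
import math
import random


def ranking_covariate(x, p, lam, rng):
    """Covariate value for a 0/1 attribute x under the assumed linear model."""
    z = rng.gauss(0.0, 1.0)                       # standard normal noise
    standardized = (x - p) / math.sqrt(p * (1.0 - p))
    return lam * standardized + math.sqrt(1.0 - lam ** 2) * z


def draw_unit(p, lam, rng):
    """Draw one (x, y) pair: x ~ Bernoulli(p), y correlated with x by lam."""
    x = 1 if rng.random() < p else 0
    return x, ranking_covariate(x, p, lam, rng)
```

With `lam = 1` the covariate is a deterministic increasing function of `x`, so ordering by `y` reproduces perfect ranking, whereas with `lam = 0` the covariate is pure noise and the ranking is random.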
As shown in Proposition 2 (b), the proportion estimator in MSRSS is more efficient than its SRS version based on the same number of measurements. Therefore, the proposed estimator has good potential to be used whenever cost is of high importance, which is the case in many medical studies. For example, if a predetermined error bound in estimating the true proportion is desired, then it can be achieved with a smaller sample in MSRSS as compared with SRS, thereby reducing the involved cost. Formally, the percentage of sample size reduction (PSSR) can be measured via
$$\mathrm{PSSR} = 100 \left( 1 - \frac{1}{RE} \right)\%.$$
Values of the PSSR are reported in Table 1. The general trends observed in Figures 1-3 are reflected here. For fixed $m$ and $r$, the PSSR reaches its maximum at $p = 0.5$, while it falls symmetrically toward $p = 0.1$ and $p = 0.9$. If the other factors are fixed, then the PSSR is increasing in $r$. Interestingly, the proportion estimation in MSRSS needs 76.90% fewer observations than that in SRS when $m = 5$, $r = 4$, and $p = 0.5$. This could be an appealing feature for medical researchers whose work requires expensive measurements.
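The link between the RE and the sample size reduction can be made explicit with a short calculation. It is a sketch that assumes the PSSR is defined as $100(1 - 1/RE)\%$ and that both designs are held to the same variance; $N_{SRS}$ and $N_{MSRSS}$ are symbols introduced here for the two sample sizes.

```latex
% Equal precision: both estimators are required to have the same variance.
\frac{p(1-p)}{N_{SRS}} = \frac{p(1-p)}{RE \cdot N_{MSRSS}}
\quad\Longrightarrow\quad
N_{MSRSS} = \frac{N_{SRS}}{RE},
\qquad
\mathrm{PSSR} = 100\left(1 - \frac{N_{MSRSS}}{N_{SRS}}\right)\%
             = 100\left(1 - \frac{1}{RE}\right)\%.

% Reading the reported figure backwards:
\mathrm{PSSR} = 76.90\%
\quad\Longleftrightarrow\quad
RE = \frac{1}{1 - 0.7690} \approx 4.33 .
```

In words, a reduction of about 77% in the number of measurements corresponds to an estimator that is a little more than four times as efficient as its SRS counterpart.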

Illustration using real data
In this sub-section, the proposed procedure is illustrated using Wisconsin Breast Cancer Data (WBCD), which was originally compiled by Street et al. (1993). It is one of the first data sets where feature extraction was conducted in an attempt to apply a machine learning algorithm for improving malignancy prediction of a medical condition. Employing image recognition and machine learning techniques on this data set, Street et al. (1993) achieved a major advance in the accuracy of malignant tumor prediction.
The WBCD includes 699 observations on ten variables, and is accessible via the "mlbench" package developed for the R statistical software. The dichotomous variable of interest ($X$) indicates whether a tumor is malignant (success) or benign (failure). Here, malignancy was diagnosed through a comprehensive biopsy procedure. Additionally, there are nine visually assessed cytological covariates which are pertinent to determination of breast cancer: clump thickness ($Y_1$), uniformity of cell size ($Y_2$), uniformity of cell shape ($Y_3$), marginal adhesion ($Y_4$), single epithelial cell size ($Y_5$), bare nuclei ($Y_6$), bland chromatin ($Y_7$), normal nucleoli ($Y_8$), and mitoses ($Y_9$). These covariates are easily obtained from the FNA biopsy, and their values range from 1 (normal) through 10 (most abnormal).
In the following, the WBCD is considered as a hypothetical population. Then, drawing a multistage ranked set sample is exemplified, and the efficiencies of SRS and MSRSS in estimating the population proportion are compared.

An example of sampling in MSRSS
Here, we describe MSRSS using $r = 2$, $m = 3$, and $n = 1$. It is assumed that the judgment ranking is based on the covariate $Y_2$. First, 27 sample units are drawn with replacement from the population, and randomly divided into 3 sets of size 9. The three sets are written in the same notation introduced in Section 2 for the illustration of MSRSS, with the value of the covariate for each unit appearing in parentheses. In each set, the units of each row are ordered using the covariate's information. If ties occur during this process, they are broken at random.
Proceeding in this way, we arrive at

Now, the units in the sets $S_1$, $S_2$, and $S_3$ are ordered, and the unit with judgment rank $i$ ($i = 1, 2, 3$) in $S_i$ is selected for actual measurement. This is to say that the 2nd stage ranked set sample is obtained by quantifying $X$ for $V^1_{[11]}$, $V^2_{[22]}$, and $V^3_{[33]}$. The resulting sample is given by $X^{(2)}_{[1]1} = 0$, $X^{(2)}_{[2]1} = 0$, $X^{(2)}_{[3]1} = 1$, where 0 (1) shows that a tumor is benign (malignant). According to the MSRSS procedure in Section 2, $X^{(r)}_{[i]j}$ is the $i$th judgment order statistic in the $j$th cycle.

Efficiency comparison
In this sub-section, we investigate the performance of the SRS and MSRSS designs in estimating the proportion of malignant breast tumors. In this population, 241 out of 699 patients have malignant breast tumors, so the true population proportion is $p = 241/699 \approx 0.34$. The efficiency comparison is based on the RE defined in Section 3.1. This quantity is again determined through Monte Carlo simulation with 100,000 replications. In generating multistage ranked set samples, sampling from the WBCD is performed with replacement to guarantee that the quantified units are independent of each other. Also, the judgment ranking can be based on any of the nine covariates described earlier. In particular, we utilized $Y_2$, $Y_5$, and $Y_9$. The Spearman correlation coefficients for the pairs $(X, Y_2)$, $(X, Y_5)$, and $(X, Y_9)$ are 0.86, 0.76, and 0.53, respectively. Therefore, these choices of the covariate allow us to evaluate the effect of the ranking quality on the performance of $\hat{p}^{(r)}_{MSRSS}$. Figure 4 shows the estimated RE for $m \in \{3, 4, 5\}$ and $r \in \{1, 2, 3, 4\}$, under the above three judgment ranking scenarios. We used $n = 1$ because the RE is independent of $n$, as mentioned before.
It is seen that values of the RE exceed unity in all of the situations considered. Accordingly, the proportion estimator in MSRSS is more efficient than its competitor in SRS based on the same number of measurements. Moreover, for a fixed m, the RE is increasing in r. As expected, using Y 2 leads to larger values of the RE. It is worth noting that among the nine covariates, Y 2 has the highest correlation coefficient with the variable of interest X.
It is interesting to examine the PSSR in estimating the proportion from the WBCD. Table 2 reports estimated values of the PSSR for m ∈ {3, 4, 5} and r ∈ {1, 2, 3, 4}. The estimation is based on 100,000 samples, and Y 2 is used in the judgment ranking process. Given a set size, the PSSR improves as the stage number becomes larger. This property was also observed for each p in Table 1. Finally, estimated values of the PSSR fall in the range 25%-49%, which seems fairly good.

Discussion
Our developed procedure is quite general and can be applied to estimate a proportion in other problems.
For example, suppose we want to study prevalence of obesity in a population, based on body fat. Dual energy X-ray absorptiometry is one of the body fat testing methods that has been validated and thus, is considered as the "gold standard", but it is too costly to implement. The abdomen circumference is an obesity measure which is obtained readily. Thus, it can be employed as a covariate for using our proportion estimator in studying obesity. We believe that similar situations are abundant in medicine.
A deficiency associated with MSRSS design is that drawing a sample of size m, in a single cycle, requires identifying m r+1 units from the population. The number of units may be big if m and r are

Conclusion
This article puts forward an efficient method for estimating prevalence of breast cancer. It builds on MSRSS that incorporates auxiliary information in order to guide the experimenter toward drawing a more representative sample, as compared with SRS. Theoretical properties of the proposed estimator are investigated, and some numerical studies are performed.
In some medical studies, it is of interest to make inference about a population's characteristic, when measurement is expensive and/or time-consuming. Utilizing cost-efficient statistical methods for data analysis is critical at this juncture. We hope that our proposal will be a powerful tool in the arsenal of medical researchers.

b) From the first part, one can write where in the third equality, identity (1) has been used.
c) It can be seen that
$$\hat{p}_{RSS} = \frac{1}{n} \sum_{j=1}^{n} \left( \frac{1}{m} \sum_{i=1}^{m} X_{[i]j} \right).$$
To put it another way, $\hat{p}_{RSS}$ is the sample mean of $n$ independent and identically distributed random variables with mean $p$ and variance $\sum_{i=1}^{m} p_{[i]} (1 - p_{[i]}) / m^2$. The asymptotic normality is then concluded from the central limit theorem.
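The variance comparison behind part b) can be worked out in a few lines. The derivation below is a sketch and not necessarily the authors' own chain of equalities; it writes $p_{[i]} = P(X_{[i]1} = 1)$ and assumes that identity (1) takes the usual form $\frac{1}{m} \sum_{i=1}^{m} p_{[i]} = p$ for binary data.

```latex
\begin{aligned}
\mathrm{Var}(\hat{p}_{RSS})
  &= \frac{1}{m^2 n} \sum_{i=1}^{m} p_{[i]} \bigl(1 - p_{[i]}\bigr) \\
  &= \frac{1}{mn} \left[ \frac{1}{m} \sum_{i=1}^{m} p_{[i]}
       - \frac{1}{m} \sum_{i=1}^{m} p_{[i]}^{2} \right] \\
  &= \frac{1}{mn} \left[ p - \frac{1}{m} \sum_{i=1}^{m} p_{[i]}^{2} \right] \\
  &= \frac{1}{mn} \left[ p(1 - p)
       - \frac{1}{m} \sum_{i=1}^{m} \bigl(p_{[i]} - p\bigr)^{2} \right] \\
  &= \mathrm{Var}(\hat{p}_{SRS})
       - \frac{1}{m^2 n} \sum_{i=1}^{m} \bigl(p_{[i]} - p\bigr)^{2}
  \;\le\; \mathrm{Var}(\hat{p}_{SRS}).
\end{aligned}
```

The fourth line uses $\frac{1}{m} \sum_{i} (p_{[i]} - p)^2 = \frac{1}{m} \sum_{i} p_{[i]}^2 - p^2$, and the last line uses $\mathrm{Var}(\hat{p}_{SRS}) = p(1 - p)/(mn)$. Equality holds only when every $p_{[i]}$ equals $p$, that is, when the ranking carries no information.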
Proof of Proposition 2. a) An identity similar to (1) holds in MSRSS (see Proposition 3 in Mahdizadeh and Zamanzade (2017a)). It establishes that
$$F(x) = \frac{1}{m} \sum_{i=1}^{m} F^{(r)}_{[i]}(x), \qquad (2)$$
where $F$ is the population distribution function and $F^{(r)}_{[i]}(x)$ ($i = 1, \ldots, m$) is the common distribution function of $X^{(r)}_{[i]1}, \ldots, X^{(r)}_{[i]n}$. An application of identity (2) shows the unbiasedness of $\hat{p}^{(r)}_{MSRSS}$. It can be shown that the covariance terms in (3) are positive. To do so, without loss of generality, it is assumed that $i < j$. Then, one can write Putting (3) and (4) where the first equality results from the fact that $X^{(r-1)}_{(i)}$ and $X^{(r)}_{[i]1}$ are identically distributed according to the MSRSS procedure.
d) The proof of asymptotic normality of $\hat{p}^{(r)}_{MSRSS}$ parallels that of $\hat{p}_{RSS}$, and it is omitted.