Severity-oriented Multi-objective Crowdsourced Test Reports Prioritization

Crowdsourced testing has been widely used to improve software quality since it can quickly obtain a vast number of test reports revealing various bugs. One of the most critical tasks in crowdsourced testing is determining how to inspect these valuable reports efficiently and quickly. In recent years, many automated techniques, like clustering and prioritizing, have arisen to assist in examining crowdsourced test reports. Due to the high repetition rate, most existing crowdsourced test report prioritization techniques concentrate on finding different bugs earlier while ignoring bug severity. The bug's severity reflects how detrimental the bug is to the quality of the software. In practice, developers should fix highly severe bugs as early as possible. In this paper, we propose a multi-objective prioritization technique for crowdsourced test reports that considers the diversity and severity of bugs. First, test reports are sorted in order of severity, and then the clustering results of test reports are used to adjust the first sorting sequence to ensure diversity. To acquire the clustering results, we first obtain the text feature vector and image feature vector of the test report separately, then fuse these two feature vectors. Finally, we use the density clustering algorithm to cluster the fused feature vectors. We conducted experiments on an industrial crowdsourced test report dataset from six mobile application projects to validate our method. The results show that our method can detect serious bugs earlier and different bugs faster than existing methods.


Introduction
Recently, crowdsourced testing has become very popular in the software engineering research field [1]. Through an open call, it recruits a large population of global online workers to quickly complete a variety of software testing tasks, like functional testing, usability testing, and user experience testing [2]. In contrast to conventional software testing, crowdsourced testing can instantly obtain a large number of test results to enable rapid iterative software development. To meet the demand for this key advantage, various crowdsourcing testing platforms (such as Utest, Testbirds, and Baidu Crowd Test) have emerged in recent years [3]. These platforms serve as a communication bridge between test task requesters and workers, laying the groundwork for the boom in crowdsourced testing.
In crowdsourced testing, a large workforce concurrently performs test tasks posted on the crowdsourcing platform and submits test results in the form of test reports. Test task requesters usually obtain numerous test reports that have a high rate of duplication and poor quality [4]. How to inspect these test reports efficiently becomes an unavoidable challenge for test task requesters; therefore, they seek methods to identify and prioritize useful information from a large number of test reports to speed up the software development process.
To alleviate this challenge, researchers have proposed several test report prioritization methods that fall into two categories based on the number of optimization objectives, i.e., single-objective test report prioritization [5][6][7][8][9][10] and multi-objective test report prioritization [11]. Single-objective prioritization uses a specified optimization objective to determine the best sequence for inspecting test reports. Most of the existing single-objective prioritization methods ultimately aim to improve the bug detection rate of the test report set. Feng et al. [5] were the first to address the problem of crowdsourced test report prioritization, aiming to find all bugs early. However, prioritizing test reports based on an individual optimization goal is often one-sided. To meet several optimization goals simultaneously, multi-objective prioritization has arisen. Tong and Zhang [11] proposed a prioritization technique that considers not only the detection rate of bugs but also their severity. However, the ranking strategy of this method is not strictly based on severity but on maximum textual information entropy.
In practice, test reports include several properties, such as reporter, bug type, bug severity, bug description, and bug screenshots, which provide necessary information to the developer for later bug fixing. Typically, different properties carry distinct types of information. Bug descriptions usually provide a thorough bug discovery process, whereas bug screenshots generally display the abnormal software activity views, which provide extra information to facilitate understanding of the bug. Bug severity often represents the degree of its impact on the normal operation of the software. Many researchers use bug descriptions and screenshots to identify similar reports [8][9][10][12][13][14]. Inspired by these factors, we prioritized the test reports based on bug descriptions, screenshots, and severity information.
To find many bugs with high severity as early as possible, we propose a new multi-objective prioritization method named DivSev for crowdsourced test reports. First, the test reports are prioritized according to their severity from largest to smallest to ensure the early identification of serious bugs. Then, the initial prioritization results are adjusted based on the clustering result of test reports to ensure bug diversity. To obtain better clustering results for test reports, we first extract the features of the test reports that best represent them and then perform density clustering based on these features. We use natural language processing (NLP) tools to extract bug description features and the spatial pyramid matching (SPM) [15] technique to extract bug screenshot features; we then obtain fused features by simply stitching bug description features and bug screenshot features horizontally to characterize the test reports. Finally, the fused features are clustered by density-based spatial clustering of applications with noise (DBSCAN) [16].
To validate our technique, we conducted experiments on an industrial mobile crowdsourced test report dataset comprising more than 1,600 test reports from six projects. The experimental results show that our technique is closest to the theoretical IDEAL and has the most significant improvement over RANDOM in terms of the average percentage of faults detected (APFD) [17] and APFDsv [18] metrics compared to the existing prioritization methods.
The primary contributions of this paper are as follows:
• We propose an innovative method to prioritize crowdsourced test reports. The test reports are first prioritized according to their severity, and then the prioritization results are adjusted using test report clustering results.
• We conducted comprehensive experiments on an industrial crowdsourced test report dataset containing over 1,600 test reports. Compared with other methods, we found that our method performs well in the detection rate of both bugs and high-severity bugs.
• We investigate the sensitivity of a key parameter to aid users in applying our technique. Based on real results, we outline the ideal parameter settings.
The remainder of this paper is organized as follows: Section 2 introduces the technical background and research motivation. In Section 3, we present the details of our technique framework. In Section 4, we evaluate our technical framework and introduce the experimental setup. Section 5 discusses the experiment results and analyzes the answers to the research questions. Section 6 addresses threats to the validity of our technical framework and validation. In Section 7, we review prior related research on this topic. In Section 8, we draw conclusions and outline some directions for future work.

Background and Motivation
This section provides background information on crowdsourced testing and the inspiration for this research.
Crowdsourced testing uses a sizeable global workforce on the internet to complete various software testing tasks. The process of crowdsourced testing in the industry is shown in Fig. 1. Crowdsourced testing involves three stakeholders: task requestors, crowdsourced workers, and the crowdsourcing platform [1]. First, task requestors, usually companies and organizations, release testing tasks that list all testing requirements on the crowdsourcing platform. Crowdsourced workers then choose and accept testing tasks based on their experience or interest and test the given software on their own devices according to the test requirements. After that, crowdsourced workers submit test reports that reveal the bugs found in the software to the crowdsourcing platform. Finally, task requestors will inspect all received test reports and pay the workers based on the number of bugs revealed in the test reports they submitted.
Crowdsourced testing tasks often have financial compensation that attracts many crowdsourced workers, making many test reports available to task requestors [3]. However, finding valuable information from an overwhelming number of crowdsourced test reports is time-consuming because of the high duplicate rate. Duplicate test reports are reports with the same or similar descriptions that reveal the same bugs [4]. Crowdsourced test reports have a significant proportion of duplicates for three reasons:
1. Since many crowdsourced workers complete testing tasks independently, multiple crowdsourced workers will inevitably find the same bugs.
2. Since crowdsourced workers are required to finish testing tasks quickly, the highly stressful work can easily lead to the careless submission of duplicate test reports.
3. Since crowdsourced testing offers monetary rewards, some crowdsourced workers may copy others' test results.
In this context, many researchers focus on detecting duplicates to reduce the number of test reports inspected and speed up the inspection [12-14, 19, 20]. In contrast, others prioritize test reports to ensure that each test report examined differs from the ones inspected earlier [5][6][7][8][9][10]. Most researchers only consider the high repetition rate, which is one-sided for test report inspection. However, Tong and Zhang [11] consider bug severity in addition to bug repetition. The severity of a bug means the degree to which the bug damages the quality of the software. In practice, developers should fix high-severity bugs as early as possible. However, they did not use the bug severity information of test reports directly, but used textual information entropy instead. This fact motivated us to propose a multi-objective prioritization technique that takes into account bug diversity and bug severity.
Crowdsourced testing is more prevalent in mobile testing because it is cheaper than traditional mobile testing, meets the need for many different test environments, and provides real-world usage scenarios for developers to evaluate software performance better. Mobile crowdsourced test reports usually contain brief bug descriptions and rich bug screenshots. To illustrate, we present four mobile crowdsourced test reports from three different application areas in Table 1. This table shows the translated bug description text and screenshots. We can observe that these reports contain fewer than 50 words and several screenshots. Crowdsourced workers prefer to use short text and a few screenshots to describe bugs because typing is less convenient on mobile devices, while taking screenshots is easier.
Consequently, many researchers have used two information modalities (i.e., text and images) to optimize the processing of mobile crowdsourced test reports [8][9][10][11][12][13][14]. How to effectively fuse textual and image information to better identify duplicate reports is a critical problem that researchers need to solve. Feng et al. [8] and Liu et al. [10] fused textual and image information at the distance matrix level, specifically by using a balanced formula to combine the test report's text distance matrix and image distance matrix. Tong and Zhang [11] first cluster the test reports with screenshots and then assign the test reports containing only text information to the existing clusters. Inspired by Li et al. [21], we fuse text and images simply by stitching text and image features together horizontally.

Approach
The design of our method is described in this section. We display the framework of our method in Fig. 2. The framework is divided into five main stages: 1) textual feature extraction, 2) image feature extraction, 3) feature fusion, 4) density clustering, and 5) multi-objective prioritization.

Preliminary
Crowdsourced test reports usually consist of many fields, such as reporter, bug type, bug severity, bug description, bug screenshot, etc., to provide detailed information for later bug fixing. In this paper, we only use bug description, bug screenshot, and bug severity to analyze test reports. Thus, the test report set can be denoted as $TRS = \{tr_i(T_i, I_i, S_i) \mid i = 0 \ldots n\}$. In a test report $tr_i$, $T_i$ represents the text description of the bug performance and its discovery process, $I_i$ represents the screenshots that illustrate the abnormal software behavior, and $S_i$ represents the severity of the bug.
For specific test reports that include several screenshots, we use $I_i = \{I_{i1}, I_{i2}, \ldots, I_{im}\}$ to denote the screenshot set, where $I_{im}$ represents the $m$th screenshot in test report $tr_i$. In our study, text and image information are dealt with separately.

Textual Feature Extraction
Since bug descriptions are written in natural language, we use the NLP technique to extract text features of the bug descriptions. More specifically, we model the bug description into keyword feature vectors. This feature vector modeling procedure consists of segmentation, part-of-speech filtering, stop word removal, and keyword weighting.
To extract the crucial information more accurately, we first use an existing, mature NLP tool, Jieba, to segment each bug description and tokenize each word with its part of speech (POS). Jieba is a lightweight Python-based system for word segmentation whose popularity in the open-source community is very high. After this, the raw word stream still contains noise, such as misspellings and rare terms. This phenomenon can be mitigated by extracting only verbs and nouns from text information [22], [23]. Therefore, we retain the verbs and nouns of each bug description after filtering out all other terms. The nonsense nouns and verbs are filtered again according to the stop word list to obtain the final keywords for each bug description. The corresponding keywords of a test report $tr_i$ can be denoted as $KW_i = \{Kword_{i1}, Kword_{i2}, Kword_{i3}, \ldots, Kword_{im}\}$, where $Kword_{im}$ denotes the $m$th keyword of test report $tr_i$. Then, we can build the corpus with these test report keywords. Suppose we have $j$ keywords in total; the corpus can be denoted as $KV = \{Kword_1, Kword_2, Kword_3, \ldots, Kword_j\}$. Finally, term frequency-inverse document frequency (TF-IDF) is used to assign weights to the keywords of each test report to characterize its textual content in the form of numerical feature vectors. The text feature vector of a test report $tr_i$ can be expressed in the following form:

$$TFV_i = (TF\text{-}IDF_{i1}, TF\text{-}IDF_{i2}, \ldots, TF\text{-}IDF_{ij}) \quad (1)$$

where $TFV_i$ denotes the text feature vector of test report $tr_i$, and $TF\text{-}IDF_{ij}$ denotes the weight of the $j$th keyword of test report $tr_i$.
For the $j$th keyword of test report $tr_i$, the TF-IDF weight can be computed based on Equation 2:

$$TF\text{-}IDF_{ij} = v_{ij} \times \log\frac{N}{df_{ij}} \quad (2)$$

In Equation 2, $v_{ij}$ represents the frequency of the $j$th keyword in test report $tr_i$, $N$ represents the total number of test reports, and $df_{ij}$ represents the number of test reports containing this keyword.
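To make this step concrete, the following minimal sketch (not the authors' implementation) performs segmentation and POS filtering with Jieba and TF-IDF weighting with scikit-learn; the report texts and stop-word list are placeholders, and scikit-learn's smoothed IDF differs slightly from Equation 2.

```python
# Minimal sketch of the textual feature extraction step; assumes jieba and
# scikit-learn are installed. Placeholder data, not the authors' implementation.
import jieba.posseg as pseg
from sklearn.feature_extraction.text import TfidfVectorizer

STOP_WORDS = {"进行", "出现"}  # placeholder stop-word list

def extract_keywords(description):
    """Segment a bug description and keep only nouns (n*) and verbs (v*)."""
    return [tok.word for tok in pseg.cut(description)
            if tok.flag.startswith(("n", "v")) and tok.word not in STOP_WORDS]

# Toy bug descriptions standing in for real reports.
reports = ["点击登录按钮后应用闪退", "游戏结束后重新开始游戏无法正常计分"]
keyword_docs = [" ".join(extract_keywords(r)) for r in reports]

# TF-IDF over the keyword corpus; row i is the text feature vector TFV_i.
# Note: TfidfVectorizer uses a smoothed IDF, slightly different from Equation 2.
vectorizer = TfidfVectorizer()
text_features = vectorizer.fit_transform(keyword_docs).toarray()
print(text_features.shape)  # (number of reports, size of the keyword corpus)
```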

Image Feature Extraction
We convert bug screenshots into feature vectors to extract information from the images to analyze the test reports. In crowdsourced testing for mobile applications, the target applications are usually tested on multiple devices, which results in the bug screenshots of crowdsourced mobile test reports often having variable resolutions and color contrast. Therefore, a typical image feature extraction technique, such as bag of features (BOF) [24], which considers only the distribution of global features based on the original RGB values and ignores their location information, is not well suited for our task of obtaining image feature vectors. We extract image features using the spatial pyramid matching (SPM) [15] technique to address these challenges.
The SPM technique first selects a certain number of screenshots from the test report screenshot set to construct the data for the training model. It then derives scale-invariant feature transform (SIFT) features for each selected screenshot. SIFT features are insensitive to transformations such as uniform scaling, orientation, and illumination changes and primarily reflect the GUI structure information. Next, it uses the K-means algorithm to cluster these SIFT features into K classes. After obtaining the clustering results, it divides each image in the test report screenshot set into multiple sub-regions in a pyramid format. Then, the histograms within each sub-region of each pyramid level are computed by extracting the sub-region images' SIFT features and assigning these features to the modeled clustering results. Finally, all histograms are concatenated to generate a vector representation of the image.
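As a rough illustration (not the authors' implementation), the sketch below builds a 200-word SIFT codebook and pools per-cell visual-word histograms over a three-level pyramid, giving 200 × (1 + 4 + 16) = 4200 dimensions, which is consistent with the image feature size mentioned in the next subsection; it assumes OpenCV and scikit-learn and omits the per-level weighting of the original SPM formulation.

```python
# Simplified SPM sketch; assumes opencv-python (with SIFT) and scikit-learn.
import cv2
import numpy as np
from sklearn.cluster import KMeans

def sift_descriptors(gray_img):
    """Return SIFT keypoint locations and descriptors for a grayscale image."""
    sift = cv2.SIFT_create()
    kps, desc = sift.detectAndCompute(gray_img, None)
    if desc is None:
        return np.empty((0, 2)), np.empty((0, 128))
    return np.array([kp.pt for kp in kps]), desc

def build_codebook(sample_images, k=200):
    """Cluster SIFT descriptors of sampled screenshots into k visual words."""
    all_desc = np.vstack([sift_descriptors(im)[1] for im in sample_images])
    return KMeans(n_clusters=k, n_init=10).fit(all_desc)

def spm_feature(gray_img, codebook, levels=3, k=200):
    """Concatenate per-cell visual-word histograms over a spatial pyramid."""
    h, w = gray_img.shape[:2]
    pts, desc = sift_descriptors(gray_img)
    words = codebook.predict(desc) if len(desc) else np.empty(0, dtype=int)
    hists = []
    for level in range(levels):               # 1x1, 2x2, 4x4 grids
        cells = 2 ** level
        for cy in range(cells):
            for cx in range(cells):
                if len(pts):
                    in_cell = ((pts[:, 0] // (w / cells) == cx) &
                               (pts[:, 1] // (h / cells) == cy))
                    hist = np.bincount(words[in_cell], minlength=k)
                else:
                    hist = np.zeros(k)
                hists.append(hist)
    vec = np.concatenate(hists).astype(float)  # 200 * (1 + 4 + 16) = 4200 dims
    return vec / (vec.sum() + 1e-9)            # normalized histogram vector
```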

Feature Fusion
We fuse text with images by stitching the text and image feature vectors horizontally to unify the representation of test reports. The feature fusion formula is as follows:

$$F_i = [TFV_i, IF_i] \quad (3)$$

where $F_i$ denotes the fused feature vector of test report $tr_i$, $TFV_i$ denotes the text feature vector of test report $tr_i$, and $IF_i$ denotes the image feature vector of test report $tr_i$. If a test report $tr_i$ contains more than one screenshot, we add and average the feature vectors of the screenshots to combine them into a single image feature vector. The image feature fusion formula is as follows:

$$IF_i = \frac{1}{n} \sum_{j=1}^{n} IF_{ij} \quad (4)$$

where $IF_{ij}$ denotes the image feature vector of the $j$th screenshot of test report $tr_i$, and $n$ denotes the number of screenshots of test report $tr_i$.
If a test report has no image, the image feature vector of the test report will be a 4200-dimensional vector of all zeros because the image features we extract using SPM have 4200 dimensions.
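A minimal sketch of the fusion step under the assumptions above (4200-dimensional SPM image vectors, an all-zero vector for reports without screenshots); the helper names are illustrative rather than the authors' code.

```python
import numpy as np

IMAGE_DIM = 4200  # SPM size: DictSize 200 x (1 + 4 + 16) pyramid cells

def image_feature(screenshot_vectors):
    """Average the per-screenshot vectors IF_ij into a single IF_i (Equation 4);
    a report without screenshots gets an all-zero image vector."""
    if len(screenshot_vectors) == 0:
        return np.zeros(IMAGE_DIM)
    return np.mean(screenshot_vectors, axis=0)

def fuse(text_vector, screenshot_vectors):
    """Horizontally stitch text and image vectors: F_i = [TFV_i, IF_i] (Equation 3)."""
    return np.concatenate([text_vector, image_feature(screenshot_vectors)])
```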

Density Clustering
We can group the test reports using the previously generated fused feature vectors to detect duplicates. Considering that the number of received test reports is frequently unpredictable in real-world crowdsourced testing, we employ DBSCAN [16], which does not require determining the number of clusters in advance, to aggregate test reports. The core principle of DBSCAN is to build clusters depending on the data item density. First, it needs two parameters: the radius of a neighborhood (ϵ) and the minimum number of points required to form a dense region (minPoints). Then, it begins by selecting a random point from the dataset. If there are more than minPoints points at a distance of ϵ from that point (including the initial point), they are all considered a cluster. The cluster is then expanded by checking each new point to determine if it has more than minPoints points within ϵ distance and, if so, expanding the cluster recursively. Eventually, it runs out of points to add to the cluster. It then selects a new random point and repeats the procedure until no point can be selected.
DBSCAN is suitable for any distance function. Therefore, the distance function can be viewed as an additional parameter. We utilize cosine similarity to compute the distance between each pair of test reports because previous research [25] has proved its efficiency on high-dimensional data, which is precisely our situation. The distance between each pair of test reports is calculated as follows:

$$D_{ij} = 1 - \frac{F_i \cdot F_j}{\|F_i\| \, \|F_j\|} \quad (5)$$

where $D_{ij}$ denotes the distance between test report $tr_i$ and test report $tr_j$, $F_i$ denotes the fused feature vector of test report $tr_i$, and $F_j$ denotes the fused feature vector of test report $tr_j$.
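The clustering step maps directly onto scikit-learn's DBSCAN with the cosine metric (equivalent to the distance in Equation 5); the sketch below uses the hyperparameter values reported in the hyperparameter settings subsection and is an assumed setup, not the authors' code.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_reports(fused_features, eps=0.5, min_points=1):
    """Cluster fused feature vectors F_i with DBSCAN using cosine distance.

    With min_samples=1 every report is assigned to a cluster, so a report
    revealing a unique bug simply forms a singleton cluster."""
    model = DBSCAN(eps=eps, min_samples=min_points, metric="cosine")
    return model.fit_predict(np.asarray(fused_features))  # cluster id per report
```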

Multi-objective Prioritization
Using the test report clustering results and bug severity, we can prioritize test reports based on multiple targets for task requesters to check. We perform multi-objective prioritization of crowdsourced test reports based on severity and diversity. We first rank the test reports from highest to lowest severity and then adjust the initial ranking results based on the clustering results to ensure diversity. The adjustment procedure is as follows: first, traverse the test reports from front to back in queue order. Suppose there are test reports from the same group as $tr_i$ before the position of test report $tr_i$. In that case, all test reports after test report $tr_i$ are pushed forward one position, test report $tr_i$ is adjusted to the end of the queue, and the first test report with an adjusted position is recorded. When the traversal reaches the first repositioned test report, it is terminated because that report and its subsequent test reports have already been reordered and do not need to be reordered again. This allows the first |C| (the number of clusters of test reports) test reports in the sorted sequence to belong to different categories, so that the top part of the test report sequence retains both diversity and high severity, increasing the likelihood of quickly identifying different and high-severity bugs in the test reports.

Fig. 3 shows how the clustering results can be used to adjust the severity-based ranking results. The severity-based test report ranking results are R3, R11, R2, R7, and R5. The test reports are clustered into 3 groups: R11 and R2 as a group, R3 and R5 as a group, and R7 as a group. First, iterate to R3, which is not moved since R3 is in the first position in the sequence. Then iterate to R11; since there is no test report from the same group as R11 before R11, R11 is also not moved. Then iterate to R2; as test report R11 before R2 is from the same group as R2, all the test reports after R2 are moved forward one position, and R2 is moved to the last position and is recorded. Then iterate to R7; since there is no test report from the same group as R7 before R7, R7 is also not moved. Finally, iterate to R5; as test report R3 before R5 is from the same group as R5, all the test reports after R5 are moved forward one position, and R5 is moved to the last position. The final result of the multi-objective test report prioritization is R3, R11, R7, R2, and R5.
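The adjustment procedure can be summarized by the short sketch below, which reproduces the Fig. 3 walk-through; the function and variable names (and the illustrative severity values) are ours, not the paper's.

```python
def prioritize(reports, severity, cluster_of):
    """Severity-first ordering adjusted for diversity, following the procedure above."""
    order = sorted(reports, key=lambda r: severity[r], reverse=True)  # initial ranking
    first_moved = None
    i = 0
    while i < len(order):
        current = order[i]
        if current == first_moved:        # reached the first repositioned report: stop
            break
        if any(cluster_of[p] == cluster_of[current] for p in order[:i]):
            order.pop(i)                  # later reports slide forward one position
            order.append(current)         # current report goes to the end of the queue
            if first_moved is None:
                first_moved = current     # remember the first repositioned report
            # do not advance i: examine the report that slid into this slot
        else:
            i += 1
    return order

# Worked example from Fig. 3 (severity values and cluster labels are illustrative).
sev = {"R3": 5, "R11": 4, "R2": 3, "R7": 2, "R5": 1}
grp = {"R11": "A", "R2": "A", "R3": "B", "R5": "B", "R7": "C"}
print(prioritize(["R3", "R11", "R2", "R7", "R5"], sev, grp))
# -> ['R3', 'R11', 'R7', 'R2', 'R5']
```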

Experiment
We conduct extensive experiments using actual industrial data to validate our method. All methods are executed on a personal computer with a 2.3 GHz dual-core Intel processor and 8 GB of RAM. In this section, we first propose the research questions of this experiment. Then, we introduce the experimental dataset, baselines, and evaluation metrics in detail. Finally, we present the hyperparameter settings of our approach.

Research Questions
In our experiment, we raise the following three research questions: [RQ1:] Can our approach substantially improve test report inspection to discover more diverse high-severity bugs faster?
[RQ2:] Can our approach substantially improve test report inspection to discover more diverse bugs faster?
[RQ3:] How does the experimental parameter affect the effectiveness of our approach?
Identifying more test reports that reveal different bugs with high severity from a large number of crowdsourced test reports early is critical to the subsequent bug-fixing process. To confirm the effectiveness of our technique in actual test report inspection tasks, we designed RQ1 to determine whether it can improve the bug detection rate while considering the severity level. Since most existing crowdsourced test report prioritization methods only consider diversity, for the sake of fairness of comparison, we propose RQ2 to verify whether our approach can significantly improve the bug detection rate. Because several key parameters influence our technique's performance, we analyze our method's performance under different parameter settings through RQ3. It is worth noting that we only focus on one parameter, i.e., the radius of a neighborhood (ϵ).

Data Collection
We adopted a publicly available test report dataset to validate the effectiveness of our method. This dataset consists of test reports from the mobile application testing sub-competition of the national software testing competition held by the Mooctest crowdsourcing testing platform in 2016. The competition simulates real mobile crowdsourced testing. The competition organizers provide mobile applications developed by actual companies as well as crowdsourced testing task requirements. Participants need to test the mobile applications and submit test reports within 4 hours according to the requirements of the testing task. More than 10 professional testers and developers manually evaluated and labeled each test report received.
The dataset contains test reports for six mobile applications covering the fields of education, travel, games, and health. This dataset has 1644 test reports; its statistical information is shown in Table 2. In Table 2, |TR| represents the number of test reports, and |B| represents the number of bugs discovered by the test reports. The last five columns reflect the number of test reports for each severity level. For instance, |Sev1| indicates the number of test reports with a severity of 1. The range of severity in this dataset is 1-5. The higher the severity value, the more harmful the bug is to the software.

Baselines Setting
Since RQ1 and RQ2 aim to evaluate different aspects of the validity of prioritization results, we choose two state-of-the-art prioritization methods (BDDiv, TSE) for comparison. BDDiv [8] was proposed by Feng et al. in 2016, while TSE [10] was proposed by Liu et al. in 2022. Both methods use text and image information from the test reports and fuse the two types of information through the distance matrix. One difference between these two methods is whether clustering techniques are used. In addition, we simulate the ideal inspection order to explore the potential of our method. Also, we simulate the random inspection order to investigate the improvement of our method. Thus, we have the five methods listed below.
• BDDiv [8]: A prioritization method based on a diversity strategy that fuses text and image information at the distance matrix level.
• TSE [10]: A prioritization method based on a diversity strategy.
• DivSev: A prioritization method based on diversity strategy and severity strategy.
• IDEAL: The theoretically ideal inspection order that satisfies the optimal solution under the given conditions.
• RANDOM: Random inspection order for simulating the situation without ancillary technology.

Evaluation Metrics
For RQ1, we employ APFDsv, a variant of APFD incorporating fault severity, to measure the performance of our approach. APFDsv is a metric used in test case prioritization [18]. We migrated this evaluation metric to crowdsourced test report prioritization for the first time. APFDsv combines the index of the first test report that reveals each fault and its severity. The formula of APFDsv is defined in Equation 6:

$$APFD_{sv} = \frac{\sum_{i=1}^{m} s_i \times \left(n - Tf_i + \frac{1}{2}\right)}{n \times \sum_{i=1}^{m} s_i} \quad (6)$$

where $m$ is the number of faults revealed by all test reports, $s_i$ is the severity measure of fault $i$, $n$ is the total number of test reports, and $Tf_i$ is the index of the first test report in the prioritization order that reveals fault $i$. A higher APFDsv value in our experiment denotes a more effective test report inspection process. In other words, a technique with a higher APFDsv score can find more severe faults sooner.
We utilize two additional metrics to compare the APFDsv values of the different test report prioritization methods.
Firstly, we use %∆, the percent difference between the APFDsv values of two methods, which can be calculated as follows:

$$\%\Delta = \frac{x_1 - x_2}{x_2} \times 100\% \quad (7)$$

where $x_1$ and $x_2$ are the APFDsv values of two test report prioritization methods. Note that we focus on comparing the differences between IDEAL and our method to study the potential of our method, and between our method and RANDOM to study the improvement of our method. The difference between IDEAL and technique X can be calculated as $Gap = (IDEAL - X)/X$. Also, the difference between technique X and RANDOM can be calculated as $Improvement = (X - RANDOM)/RANDOM$. Secondly, we employ the Mann-Whitney U statistical test [26] to assess whether the difference between the APFDsv values of different methods is statistically significant. The difference is considered statistically significant if the p-value of this test is less than 0.05. The advantage of the Mann-Whitney U test is that the sample populations are not required to be normally distributed.
For RQ2, we use APFD to measure our approach's rate of detecting faults. The calculation of APFD is more straightforward than that of APFDsv, as it only uses the index of the first test report revealing each fault. The APFD is calculated using the following equation:

$$APFD = 1 - \frac{\sum_{i=1}^{m} Tf_i}{n \times m} + \frac{1}{2n} \quad (8)$$

The APFD values are between 0 and 1, and a higher APFD score indicates a higher fault detection rate. In addition, we also use %∆ and the Mann-Whitney U statistical test to compare the APFD values of the different prioritization approaches.
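For reference, a small sketch of computing both metrics from an inspection order; the APFD form is the standard one, while the APFDsv function follows the severity-weighted reconstruction of Equation 6 above and may differ in detail from the formulation in [18].

```python
def first_reveal_positions(order, report_faults):
    """Map each fault id to the 1-based index of the first report revealing it."""
    positions = {}
    for idx, report in enumerate(order, start=1):
        for fault in report_faults[report]:
            positions.setdefault(fault, idx)
    return positions

def apfd(order, report_faults):
    """Standard APFD (Equation 8)."""
    pos = first_reveal_positions(order, report_faults)
    n, m = len(order), len(pos)
    return 1 - sum(pos.values()) / (n * m) + 1 / (2 * n)

def apfd_sv(order, report_faults, severity):
    """Severity-weighted APFD following the reconstruction of Equation 6;
    severity[f] is the severity measure s_i of fault f."""
    pos = first_reveal_positions(order, report_faults)
    n = len(order)
    weighted = sum(severity[f] * (n - tf + 0.5) for f, tf in pos.items())
    return weighted / (n * sum(severity[f] for f in pos))
```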

Hyperparameters Setting
In our experiments, the hyperparameters to be adjusted are present in both the SPM and DBSCAN techniques. Since the SPM technique is not the focus of this paper, we do not adjust its hyperparameters but directly adopt the values of the same hyperparameters used in the comparison method. When applying the SPM approach, we first resize all screenshots to 480 × 480 pixels and then set DictSize = 200, L = 3, and HistBin = 100. DictSize is the size of the descriptor dictionary, L is the number of levels of the pyramid, and HistBin is the number of images used to create the histogram bins.
Clustering test reports is one of the crucial processes in our method. Considering that the clustering results largely determine our method's effectiveness, we need to adjust the hyperparameters of the DBSCAN technique to obtain better performance. DBSCAN has only two hyperparameters, ϵ and minPoints, where ϵ represents the radius of the neighborhood and minPoints represents the minimum number of points required to form a dense region. In reality, a unique bug may be revealed by only one of the crowdsourced test reports, so we directly set minPoints = 1. For ϵ, we obtained the final value of 0.5 by tuning. To maintain the consistency of the results, we did not change these hyperparameters in any of the experiments, except for RQ3, which is designed to investigate parameter sensitivity.

Result Discussion
In this section, the experimental results are analyzed to answer three research questions.

Answering Research Question 1
[RQ1:] Can our approach substantially improve test report inspection to discover more diverse high-severity bugs faster?
Since the comparison approaches involve random selection operations in the prioritization process, each experiment was repeated 30 times to reduce the bias caused by randomness. We present the experimental results in Fig. 4 and Table 3. Fig. 4 shows the boxplots of the APFDsv results for the six projects. In Table 3, we present the mean APFDsv values of all approaches and display the improvement over RANDOM and the gap to IDEAL. In addition, we conduct the Mann-Whitney U statistical test between our approach (DivSev) and the other three approaches (RANDOM, BDDiv, TSE) and also present the results in Table 3.
Based on the boxplots of APFDsv values in Fig. 4 and the third column of Table 3, we can see that the methods that use clustering techniques (TSE, DivSev) outperform the RANDOM method on all projects to different degrees. However, the BDDiv method did not exceed the RANDOM method in one project (TravelDiary). From Table 3, for all projects, we can observe that the improvement of our method (DivSev) over RANDOM ranges from 9.05% to 19.89%, while TSE improves only 2.77% to 5.8%. In particular, we find that our method consistently outperforms the other methods (RANDOM, BDDiv, TSE). The fourth column of Table 3 demonstrates the gap between the different techniques and the theoretical IDEAL. We observe that the gap between our method and IDEAL ranges from 3.13% to 10.95%, while the gap between TSE and IDEAL varies from 12.8% to 25.1%.
We additionally conduct the Mann-Whitney U test for APFDsv between our proposed DivSev and the three baseline approaches (RANDOM, BDDiv, TSE). The results show that, for all projects, the p-values between our proposed DivSev and each of the three baselines are all below 0.05. This signifies that our approach's severe bug detection performance is significantly better than that of the existing approaches, which further indicates the advantages of our approach over the three commonly used methods. Also, in all projects, the boxplots of our method are shorter than those of the other methods. This observation suggests that the performance of our approach is more stable than that of the other methods, as the box length represents data variability.
Summary: Compared to the RANDOM strategy, all these prioritization strategies can increase the inspection efficiency of test reports for most projects. Our method distinctly improves the performance of severe bug detection. Compared to other approaches, our approach shows a smaller deviation from the theoretical IDEAL result. However, prioritizing test reports can still be improved with further effort.

Answering Research Question 2
[RQ2:] Can our approach substantially improve test report inspection to discover more diverse bugs faster?
To answer RQ2, we used APFD to calculate the bug detection rate and ran each experiment 30 times. We first give a general view of the performance of all methods, measured in terms of APFD values. Fig. 5 shows boxplots of the APFD results for each method run 30 times on the six projects. We find that our method outperforms the other methods on all projects except HuJiang. We conducted a more thorough examination of this issue and discovered what we consider to be the cause of the difference in results for the HuJiang project. HuJiang has the smallest ratio of reports containing screenshots to the total number of reports. Our technique predominantly favors image information, while the comparison methods (BDDiv, TSE) prefer textual content. Therefore, the effectiveness of our technique may be worse when the percentage of reports containing screenshots is small. From Fig. 5, we also find that our method has a shorter boxplot length than the other methods, which means that our method is more stable than the other methods. In addition, the running times of our method and the BDDiv method differ by an order of magnitude. The running time of each method's sorting process can be viewed in Table 5.
From the third column of Table 4, we can see that both our method and the TSE method have a higher average APFD value than the RANDOM method on all six projects. Still, the BDDiv method exceeds the RANDOM method on only five projects (excluding TravelDiary). More specifically, our approach improved by 1.93% to 6.85% over the RANDOM method, whereas the TSE method improved by 0.99% to 4.67% over the RANDOM method.
Summary: Generally, our approach significantly boosts the bug detection rate compared to existing approaches. However, we have found that some categories of projects are less suitable for our technique, namely, projects that contain a smaller percentage of test reports with screenshots. Compared to the RANDOM strategy, these prioritization techniques can improve inspection effectiveness for most projects. Our method deviates from the theoretical IDEAL result less than other strategies. There is, nonetheless, the opportunity for further research to enhance prioritization strategies for test report inspection.

Answering Research Question 3
[RQ3:] How does the experimental parameter affect the effectiveness of our approach?
In this subsection, we further discuss the effect that parameter settings have on performance. This assists users of our method in making proper settings for different application situations. We investigate the parameter sensitivity of our method based on one important parameter: the radius of a neighborhood ϵ, which influences the main step of the clustering process. The parameter ϵ must be given in the first step of the density clustering. In this study, we analyze the APFDsv and APFD scores as ϵ varies from 0.1 to 1.0 with increments of 0.1. Fig. 6 shows the sensitivity of the APFDsv and APFD results to the parameter ϵ, given minPoints = 1. We present the mean value and standard deviation of APFDsv in the same setting in Table 6. Similarly, Table 7 shows the APFD results with the changes of parameter ϵ.
Table 6 and Fig. 6 show that the APFDsv fluctuates under different values of ϵ. When the value of ϵ reaches 0.5, four projects, i.e., HuaWei, Hujiang, MyListening, and TravelDiary, obtain the highest APFDsv score. The other two projects, i.e., 2048 and Wonderland, reach the highest APFDsv score when ϵ = 0.4 and ϵ = 0.6, respectively, and the difference in APFDsv compared with ϵ = 0.5 is less than 0.02. Also, the standard deviation of APFDsv for each of these projects in Table 6 stays in a range of small numbers, i.e., from 0.014 to 0.044, which indicates that the APFDsv result is relatively stable with the change of ϵ.
From Table 7 and Fig. 6, we observe that when the value of ϵ reaches 0.5, three projects, namely HuaWei, MyListening, and TravelDiary, receive the highest APFD score. For the other three projects, 2048 reaches the highest APFD score when ϵ = 0.7, Hujiang reaches the highest APFD score when ϵ = 0.4, and Wonderland reaches the highest APFD score when ϵ = 0.3. In addition, the standard deviation of the APFD scores remains within a modest range, i.e., 0.006 to 0.033, indicating that the APFD result is relatively stable with respect to the variation of ϵ.
Summary: While the parameter ϵ influences the performance of our method to different degrees, our method is generally insensitive to its variations. Moreover, we suggest that the default value of ϵ be set to 0.5.


Threats to Validity

This study's experimental subjects come from a nationwide software testing competition rather than actual crowdsourced software testing. Only students play the role of crowdsourced workers, which is inconsistent with the diversity of crowdsourced workers required by crowdsourcing technology. If the crowdsourced workers came via the internet from open calls, our results might have been different. However, according to the research of Salman et al. [27], similar outcomes are expected when allocating a novel task to students and experts. Hence, the experimental data gathered from students may not be the main threat to our validation procedure.
Subject Program. Due to the limitation of the data source, we could only validate the method on crowdsourced test reports of six Android applications. Our approach does not necessarily guarantee good results for the crowdsourced test reports of desktop applications or browsers. However, these six mobile applications cover many areas, such as education, travel, games, and health. Therefore, we believe that the crowdsourced test reports generated for these applications can demonstrate the effectiveness and applicability of our approach.
Natural Language. All experimental data are written in Chinese; therefore, we cannot guarantee that similar results will be seen in other languages. However, this issue can be alleviated because we undertake simple word segmentation, not intricate semantic analysis. The process of extracting text features with natural language techniques is not the core of our method. Many sophisticated NLP tools are available today for processing different languages, demonstrating that our method can be transferred to other languages.

Related Work
This section discusses the work related to crowdsourced test report analysis, i.e., crowdsourced test report prioritization, duplicate crowdsourced test report detection, crowdsourced test report classification, and crowdsourced test report refactoring.

Crowdsourced Test Report Prioritization
Crowdsourced test report prioritization methods can be divided into two categories based on the type of information they use: text-based methods and text- and image-based methods.
Text-based methods: Feng et al. [5] first introduced the problem of crowdsourced test report prioritization, taking inspiration from test case prioritization to rank crowdsourced test reports using a diversity strategy and a risk strategy. The diversity strategy allows developers to inspect a wide variety of test reports sooner. The risk strategy enables them to examine test reports that are more likely to reveal true bugs as early as possible. Later, Yang and Chen [6] used a combination of diversity and classification strategies to sort the test reports considering the effect of duplicates. Zhu et al. [7] migrated mature test case prioritization methods (total greedy algorithm, additional greedy algorithm, genetic algorithm, and ART) to crowdsourced test report prioritization based on diversity strategies and evaluated the effectiveness of these methods. They found that all these methods performed well, with an average APFD of more than 0.8.
Text- and image-based methods: Since mobile crowdsourced test reports usually contain shorter bug description text and rich bug screenshots, Feng et al. [8] combined text and image information to sort test reports to help developers find as many bugs as possible as quickly as possible. They used a balanced distance matrix to fuse text and image information and gave higher weights to text information. Later, Tong and Zhang [11] proposed a prioritization technique called CTRP to quickly obtain test reports with comprehensive bug descriptions, using text information entropy as the sampling strategy. However, they were more interested in the image information. Yu et al. [9] took a closer look at the relationship between text descriptions and screenshots and proposed a crowdsourced test report prioritization method called DeepPrior using deep learning technology.
Most existing prioritization methods intend to ensure the diversity of test reports. The reason for this phenomenon is the high duplication rate of crowdsourced test reports. However, they do not use information about the bug's severity, an essential indicator for arranging the subsequent bug inspection and repair process. Based on this, we propose a multi-objective test report prioritization method using both diversity and severity strategies.

Duplicate Crowdsourced Test Report Detection
Based on the above classification criteria for test report prioritization, duplicate test report detection methods can be divided into one more category (i.e., video-based methods).
Text-based methods: Jiang et al. [19] identified a new problem of fuzzy clustering test reports, the solution of which requires overcoming three barriers (invalid barrier, uneven barrier, and multiple bug barrier). Chen et al. [20] proposed a report clustering model, RCSE, to combine duplicate reports.
Text- and image-based methods: Yang et al. [13] fused the text and image information of the test reports to characterize them better and obtain the clustering results.
Liu et al. [10] also proposed a clustering algorithm to reduce the test report inspection time. Wang et al. [12] offered the duplicate test report detection method SETU, which extracted four features to characterize test reports (image structure, image color, TF-IDF, and word embedding) and designed a hierarchical algorithm to produce a list of duplicate test reports. Cao et al. [14] proposed a crowdsourced test report selection tool, STIFA, which performs clustering and selection.
Video-based methods: Developers may find it challenging to identify duplicate video-based test reports manually. Cooper et al. [28] introduced a model, TANGO, to solve this problem, which incorporates tailored computer vision techniques, optical character recognition, and text retrieval.
In addition to the research mentioned above on methodological proposals, Huang et al. [29] conducted the first experimental evaluation of 10 duplicate crowdsourced test report detection methods to explore which is the golden method. The results show that ML-REP, a machine learning-based approach, and DL-BiMPM, a deep learning-based method, are the best; however, the latter is more sensitive to the amount of training data and takes longer to train and predict.

Crowdsourced Test Report Classification
We can classify test report classification methods into two categories: binary classification and multi-class classification.
Binary classification: Wang et al. [30] proposed a clustering-based classification method to automatically identify reports that reveal true bugs from many crowdsourced test reports. Because classifying crowdsourced test reports using supervised machine learning techniques usually requires manually labeling a large amount of training data, they [31] proposed the use of active learning to reduce the cost of manual labeling and achieve good classification results. They [32] later also built an effective cross-domain classification model in which deep learning techniques are used to discover intermediate representations shared across domains. Guo et al. [33] proposed a knowledge transfer classification (KTC) approach to predict the severity of test reports. Yu et al. [34] proposed ReCoDe to detect the consistency of crowdsourced test reports. Chen et al. [35] suggested a framework, TERQAF, using logistic regression to model the quality of test reports.
Multi-class classification: Li et al. [21] empirically investigated the effectiveness of six classification algorithms, naive Bayes (NB), k-nearest neighbors (kNN), support vector machine (SVM), decision tree (DT), random forest (RF), and convolutional neural network (CNN), in classifying bug types for test reports. They found that SVM is superior to the other methods in classifying crowdsourced test reports. Zhao et al. [36] proposed a unified test report assignment framework that matches traditional and crowdsourced test reports with the right bug fixers.
The existing test report classification methods involve many kinds of tasks, such as true and false bug classification, bug severity classification, test report consistency classification, and test report quality classification. However, most researchers divide test reports into only two categories, and further studies are needed on multi-class classification tasks.

Crowdsourced Test Report Refactoring
Crowdsourced test report refactoring can be divided into generation and enhancement to help developers quickly understand the bugs revealed in the test reports.
Generation: Because it takes much time to speculate about the actual operation process of crowdsourced workers and potential software bugs by browsing many screenshots, researchers hope to convert the screenshots into text information to speed up the understanding of bugs. Liu et al. [37] used natural language processing techniques to analyze text descriptions of high-quality test reports containing similar images to generate descriptive keywords for the images. Yu [38] proposed a deep learning model called CroReG to generate text for images. Yu et al. [39] later proposed BIU, a novel method that uses image understanding techniques to assist developers in automatically inferring bugs and generating bug descriptions from bug screenshots.
Enhancements: To reduce the number of test reports to be inspected and improve the quality of the inspected test reports to speed up inspection, researchers mostly extract useful information from repeated test reports to supplement the main report. Chen et al. [40] proposed a new test report augmentation framework (TRAF) that enhanced the content of three fields of the test report, namely, environment, input, and description. They used visualization technology to highlight the added content to make the contrast obvious. Hao et al. [41] and Li et al. [42] proposed the report enhancement approach CTRAS.

Conclusion
This paper presents a novel prioritization technique (DivSev) to assist developers in inspecting the vast number of test reports generated by crowdsourced testing. Our preliminary investigation revealed that the vast majority of existing test report prioritization algorithms rely solely on bug descriptions and screenshots to find different defects earlier, ignoring bug severity. This encouraged us to design a test report prioritization method that helps developers inspect more high-severity bugs as early as possible. To validate our method, we conducted extensive experiments on a public dataset containing crowdsourced test reports for six mobile applications. The experimental results indicate that: 1) DivSev distinctly improves the performance of severe bug detection compared to other methods; 2) DivSev significantly boosts the bug detection rate compared to existing approaches; and 3) DivSev is relatively stable with respect to its key parameter. In future work, we will focus on extracting features for test reports using multimodal pre-training models to optimize the performance of our technique.

Declarations
Funding. This work was supported by the Natural Science Foundation of Fujian Province (No. 2022J011238).
Competing interests. The authors have no relevant financial or non-financial interests to disclose.
Ethics approval. Not applicable.

Table 1
Example mobile crowdsourced test reports.

Table 2
Statistical information of dataset.

Table 3
Mann-Whitney U Tests of APFDsv scores.

Table 5
Runtime for prioritization.

Table 6
The comparison of APFDsv under different settings of ϵ.

Table 7
The comparison of APFD under different settings of ϵ.