The main purpose of this research is to adapt the traditional code coverage criteria to address specific OO features while preserving the high degree of automation, simplicity, and low execution cost of the original criteria. In this section, the proposed object coverage criteria are empirically evaluated using a set of OO classes as benchmarks. These classes have been selected from different open source Java programs. We use our prototype tool, OCov4J, to measure the object coverage criteria. The results are then compared with those of the traditional coverage criteria (obtained by the JaCoCo tool) to determine which criteria better reveal OO-related failures.
The major goal of this empirical evaluation is to investigate the correlation between the object coverage criteria and the ability to reveal specific OO-related faults. The secondary goal is to examine how our poly-object coverage criteria address polymorphism and dynamic binding issues. For these purposes, several classes under test (CUTs) from different open source Java programs/libraries have been selected. We have then generated a test suite for each CUT such that the resulting test suites yield a high level of coverage for the traditional code coverage criteria. Next, a set of OO faults has been seeded into the source code of every CUT or its related classes in the inheritance hierarchy to create specific OO faulty versions (for this purpose, we have used auto-generated mutants along with some manually seeded faults). Finally, we have calculated the object coverage criteria for each test suite as well as the number of faulty versions it detects, in order to examine the correlation between these two categories of values. In the following subsections, we review the details of the benchmarks, the evaluation process, and finally, the results of the evaluation.
4.1. Benchmark projects
A total of 270 classes (excluding private, intermediary, and internal classes) from six different open source projects have been included in the empirical evaluation. Apart from JTetris, which is a small educational project, five widely used open source projects with active communities have been selected.
Table 1 shows the project name, the package used during evaluation, the number of classes, the depth of the inheritance tree (DIT), and the lines of code (LOC) (excluding comment and blank lines) for each selected project. The MetricsReloaded[4] tool has been used to calculate LOC and DIT of each class. A summary description of benchmark projects is provided below:
- JTetris[5]: This project is a simple implementation of the Tetris computer game in Java. This implementation is based on the formal specification provided in (Smith 2012). It uses inheritance, polymorphism and dynamic binding to model the elements and the process of the game. The classes under the package models are used to specify different elements and their movement in the game.
- Apache Validator[6]: This is a widely used Java library for data validation that provides various sets of validation rules for different data types. It is a sub-project of the popular Apache Commons project. We use classes under package validator.routines which model different validators for various data types.
- Apache BCEL[7]: The Apache "Byte Code Engineering Library" is an open source project for analyzing and manipulating compiled Java files, i.e., bytecode files. This project is very popular and makes extensive use of OO features. Related works such as (Alexander et al. 2012; Offutt et al. 2006) have used this project in their evaluations. The package classfile, used in our evaluation, contains the classes that model the different internal parts of a bytecode file.
- Apache Collections[8]: This library enriches the data structure framework of Java Development Kit (JDK) by providing some new data structure classes or helper classes for existing data structures. We use the map package which provides different types of map data structures.
- Google Graphs[9]: Google Graphs is a famous library for working with graph data structures in Java. It is a sub-project of the famous Google Guava project which includes common libraries for Java, maintained by Google. We use package graph from this project including different types of graphs and various network data structures.
- Ta4J[10]: This is an open source project for technical and financial analysis in Java. It has received a lot of attention from Java developers. The classes under the package indicators, which contain different technical indicators, are used in the evaluation. Each indicator predicts future price movements by analyzing historical price data.
Table 1
A summary of benchmark projects
Project | Package | Classes | DITs of all classes | Average DIT | LOCs of all classes
--- | --- | --- | --- | --- | ---
JTetris | org.jtetris.models | 11 | 34 | 3.090 | 521
Apache Validator | org.apache.commons.validator.routines | 27 | 57 | 2.111 | 9,456
Apache BCEL | org.apache.bcel.classfile | 82 | 154 | 1.878 | 14,046
Apache Collections | org.apache.commons.collections4.map | 36 | 110 | 3.055 | 12,797
Google Graphs | com.google.common.graph | 38 | 80 | 2.105 | 7,812
Ta4J | org.ta4j.core.indicators | 76 | 236 | 3.015 | 5,234
Sum | | 270 | - | - | 49,866
Average | | 45 | 112 | 2.5 | 8,311
To examine the "object coverage criteria", 40 classes from the mentioned projects have been selected; we refer to these CUTs as "target classes". These classes lie in different inheritance hierarchies. Using EvoSuite, a test suite has been generated for each target class to measure its ability to detect OO-related failures. In addition, 24 classes, most of which differ from the target classes, have been used to evaluate the "poly-object coverage criteria"; we refer to these 24 classes as "base classes". Target, base, and other classes in the inheritance hierarchy are used to create faulty versions of each project; the details are presented in Subsection 4.3. Each target class has at least one parent. Moreover, we use some target classes with several ancestor classes so that different types of OO problems can be modeled in a faulty version. Also, each base class has at least one child so that it can be used in a polymorphic manner. Finally, we use base classes with different numbers of descendant classes to cover more polymorphic faults. By selecting different, real projects and choosing target and base classes in various inheritance hierarchies, we attempt to reduce the threats to the external validity of our empirical evaluation.
4.2. Test Suites
For each target class in each inheritance hierarchy, a test suite has been created using the EvoSuite tool. For half of the selected classes, i.e., 20 classes, the generated test suite achieves 100% traditional statement coverage. For 11 other classes, the coverage is above 70%; for 7 classes, it is between 50% and 70%; and for only two classes, it is slightly below 50%. Thus, most of the target classes, i.e., more than 77% of them, achieve high code coverage levels (between 70% and 100%). Attempts have been made to ensure that the tests have a high level of code coverage so that they are more likely to detect failures in our faulty versions of the projects.
We have used EvoSuite 1.1.0 with its default configuration to generate test suites, with a maximum search budget of 1 minute per test suite. After generating a test suite for each target class, the object coverage criteria for that class are measured by attaching the OCov4J agent to the JUnit test execution process.
4.3. Faulty Versions
Fault seeding, in which artificial faults are inserted into programs, is commonly used to compare testing approaches (Papadakis et al. 2019). To measure how well a test suite reveals OO-related failures in a target class, small changes have been made to the target class code along with its parent and ancestor classes. In this way, fault-seeded versions of the projects, called "faulty versions", have been built. To inject different types of OO defects into the faulty versions, we have used the classification of OO faults in (Offutt et al. 2001; Alexander et al. 2002) as well as the approaches for generating OO mutations in (Kim et al. 2001; Offutt et al. 2006; Ma et al. 2006). According to these approaches, OO faults and problems (related to concepts such as inheritance, polymorphism, and dynamic binding) can be categorized as follows:
- Inappropriate and inconsistent use of inherited state variables: Many inheritance-related errors occur when the child class misuses the inherited state space of the parent or ancestor classes (a sketch of such a fault follows this list). In the fault model introduced in (Offutt et al. 2001), these errors are classified as "state definition anomaly" and "state definition inconsistency". The IHD and IHI operators have been used in mutation approaches such as that of (Offutt et al. 2006) to generate such errors; these operators hide an inherited state variable by deleting or adding a variable whose name is the same as the name of a variable in the parent or ancestor classes.
- Incompatible invocations of inherited methods: Some other inheritance-related errors are caused by incorrect use of the parent state space when defining overridden methods. There are also errors due to either direct or indirect invocation of these methods. In the fault model of (Offutt et al. 2001), these errors are respectively classified as "state definition inconsistency due to state variable hiding" and "indirect inconsistent state definition". To create such errors, mutation approaches such as that of (Offutt et al. 2006) use the IOD and IOR operators, which cause failures by deleting and renaming overridden methods in the child class, respectively.
- Incorrect and inconsistent invocation of parent/ancestor constructors: The way a class is initialized can cause some common, potential OO failures. Executing the constructor of a class results in executing the constructors of the parent and ancestor classes until execution reaches the root class. Also, each class may explicitly invoke a specific version of its parent class constructor, so an inappropriate invocation in these sequences of constructor invocations may lead to data anomalies. In addition, calling other methods of a class in its constructor can cause potential failures because these methods may be overridden by subclasses. In the fault model of (Offutt et al. 2001), these errors are classified as "anomalous construction behavior" and "incomplete construction". The IPC and JDC operators have been introduced to model such errors in the OO mutation approach of (Offutt et al. 2006); the former eliminates the call to the parent constructor, and the latter forces the default constructor to be executed. These changes model problems that arise from the interaction between the class constructor and the inherited constructors.
- Polymorphism and using inconsistent types: Polymorphism and dynamic binding can lead to potential problems when overridden methods are executed in different contexts (i.e., on instances of different descendant classes). These problems may occur especially when the inheritance depth is greater than two. For example, the yo-yo problem is a well-known case of this failure type that is reported to be difficult to find (Alexander et al. 2010). In the fault model of (Offutt et al. 2001), these errors are classified as "inconsistent type use". Several operators have been introduced to generate such errors in the mutation approach of (Offutt et al. 2006). For example, the PNC operator changes an object instantiation from a class to one of its subclasses, and the PMD operator changes the declared type of a variable to the parent or an ancestor of the class.
- Misusing programming language keywords in accessing object state and inherited state: Although this error type is not included in the fault model of (Offutt et al. 2001), Ma and Offutt consider these errors common among developers and introduce mutation operators for them (Offutt et al. 2006). These operators include ISK, JTD and JSC which are responsible for deleting the keywords super (for problems related to the inherited code access), this (for problems related to the object states), and static (for problems in accessing the shared space between classes), respectively. Note that these types of faults may lead to errors that are semantically equivalent to the errors in the aforementioned categories. Therefore, it is necessary to check the generated faulty versions to avoid equivalent mutants.
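As a concrete illustration of the first category above (inappropriate use of inherited state variables), the following sketch shows a variable-hiding fault of the kind produced by the IHI operator. The classes are hypothetical and are not taken from the benchmark projects.

```java
// Hypothetical classes (not from the benchmarks) sketching an IHI-style
// "state definition anomaly": the child class hides an inherited state variable.
class Account {
    protected double balance = 0.0;
    public double getBalance() { return balance; }    // reads Account.balance
}

class SavingsAccount extends Account {
    // Seeded fault: re-declaring "balance" hides the inherited field.
    protected double balance = 0.0;

    public void deposit(double amount) {
        balance += amount;                             // updates only SavingsAccount.balance
    }
    // The inherited getBalance() still reads Account.balance, so it keeps returning 0.0
    // after any number of deposits -- an inconsistent use of the inherited state space.
}
```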
Although we could have implemented our fault seeding process using only one mutation tool that supports OO mutation operators (e.g., MuJava (Ma et al. 2006), which implements the mutation approach of (Offutt et al. 2006)), we have also used a manual approach alongside MuJava to produce faulty versions. This was done due to the lack of mature tools in this field. Among all the OO mutation tools, only MuJava is accessible and usable for real cases, such as the benchmarks presented in the previous subsection. However, MuJava, like other OO mutation approaches, generates mutants by applying only a tiny change to one statement of a single class file; this has some shortcomings for modeling specific OO faults, as listed below:
1. While making a small change in one statement of one class file usually yields a valid mutant in procedural mutation, this approach may lead to invalid mutants in the OO paradigm. For example, in MuJava, the JTD operator tries to create an OO mutant by removing the keyword this from the beginning of a variable name so that the value assigned to this variable remains local and does not bind to the class state variable (a common fault, as mentioned in our OO fault categories). Now, consider class Bars in the following example, where applying the JTD operator in line 6 would yield a mutant. This mutant cannot be compiled because the state variable barCount is declared with the final keyword (line 2), and the final initialization rule enforces such a variable to be initialized when the object is instantiated. Therefore, to create such a mutant, we have to apply two small changes at the same time: removing the final keyword in line 2 and removing the this keyword in line 6. Thus, in some situations, OO mutation requires multiple changes in different parts of the class.
01 class Bars {
02 final int barCount;
03 Bar[] bars;
04 public Bars(int barCount) {
05 bars = new Bar[barCount];
06 this.barCount = barCount;
07 }
08 }
2. The existing approaches generate OO mutants for a particular class by considering this class in isolation and applying their operators only to its code. However, many OO failures (especially polymorphic failures) originate from a fault in the parent or ancestor classes. Although such a fault may not affect the correctness of the parent class itself, it may lead to OO failures in the descendant classes (for example, refer to the counters example in Subsection 2.3). Thus, when generating mutants for a class, we may need to apply changes to its parent or ancestor classes.
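To make this point concrete, the following hypothetical sketch (not the counters example of Subsection 2.3) seeds a fault in a parent class that leaves the parent's own observable behavior intact but breaks a descendant:

```java
// Hypothetical example: the seeded fault sits in the parent class, yet only a
// descendant class exposes it at runtime.
class BaseValidator {
    // Seeded fault: ">=" weakened to ">". BaseValidator itself never calls this
    // helper, so a test suite exercising BaseValidator alone still passes.
    protected boolean inRange(int value, int min, int max) {
        return value > min && value <= max;
    }
    public boolean isValid(int value) { return value != Integer.MIN_VALUE; }
}

class RangeValidator extends BaseValidator {
    @Override
    public boolean isValid(int value) {
        // The inherited fault surfaces only here: isValid(0) now wrongly returns
        // false, although no line of RangeValidator has been changed.
        return inRange(value, 0, 10);
    }
}
```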
3. To model some OO faults, it is necessary to make small changes in the target class and its parent/ancestor classes simultaneously. For example, the IPC operator in MuJava generates mutants by removing the invocation of one of the parent class constructors from the constructor of the class. When we remove this invocation, the Java compiler implicitly calls the default (parameterless) constructor of the parent class instead; if the parent class does not have a default constructor, the change made to the child class causes a compile error. In this case, in addition to changing the child class file (removing the parent constructor call), we need to add a default constructor to the parent class at the same time. Consider the class ClassUnderTest below, for which we want to create a mutant by removing the parent constructor invocation in line 4. Suppose ParentClass does not have a default constructor; then, to make the mutant valid and compilable, we have to add a default constructor to ParentClass at the same time, as shown in line 3 of the ParentClass code.
01 class ClassUnderTest extends ParentClass {
02 public ClassUnderTest(Object param){
03 // remove the below line to disable the parent constructor
04 super(param);
05 //other constructor code of this class
06 }
07 }
01 class ParentClass {
02 // add below line as a default constructor
03 public ParentClass(){}
04 public ParentClass(Object param){
05 //other constructor code of this class
06 }
07 }
In light of the above points, we have used a hybrid approach for the fault seeding process. In addition to using the MuJava tool to generate OO mutants, some more faulty versions have been created manually by analyzing the target classes along with their parent/ancestor classes. In general, we followed these steps to generate faulty versions:
1. First, we generated all the OO mutants that can be produced using only the inheritance, polymorphism, and dynamic binding operators of MuJava.
2. Next, we manually generated faulty versions in the categories mentioned at the beginning of this subsection by changing two or more parts of a class at the same time, modifying the parent/ancestor classes, or changing several classes at the same time, including the target class and the classes in its inheritance hierarchy.
3. Finally, we checked all faulty versions generated (either automatically or manually) for a class to remove equivalent versions.
Below are two examples of automatic and manual mutants used in the evaluation process. Class Field of project Apache BCEL represents a mutant generated automatically by MuJava: line 3 has been removed from this class so that the parent constructor is not called. The DPOIndicator class (from the Ta4J project) shows a manual mutant generated by the approach described in this section. Two simultaneous changes have been applied to this class to make the mutant: first, the this keyword has been removed from the beginning of variable barCount in line 6 to change the scope of this variable; second, since variable barCount is declared with the final keyword in line 2, the final keyword has also been removed so that the mutant can be compiled.
01 public class Field extends FieldOrMethod {
02 public Field(final Field c) {
03 super(c); //this line removed to generate a mutant
04 }
05 // other code of this class
06 }
01 public class DPOIndicator extends CachedIndicator<Num> {
02 private final int barCount; //the “final” keyword is removed
03 // other field of class
04 public DPOIndicator(Indicator<Num> price, int barCount) {
05 super(price);
06 this.barCount = barCount; //the “this” keyword is removed
07 timeShift = barCount / 2 + 1;
08 this.price = price;
09 sma = new SMAIndicator(price, this.barCount);
10 }
11 // other code of this class
12 }
After performing the above steps and discarding equivalent faulty versions, 668 faulty versions have been generated for the target classes; on average, about 16.7 faulty versions have been produced per class. Most of these mutants (about 72%) have been generated automatically by the MuJava tool. Table 2 shows the details of the generated faulty versions for each target class. In addition to the project name and the target class name, this table contains the following columns:
- DIT: The inheritance depth of the target class in the inheritance hierarchy;
- All Faulty Versions: The total number of all faulty versions generated for the target class, either using MuJava or manually;
- Auto-generated Faulty Versions: The total number of faulty versions automatically generated by MuJava for the target class;
- Manual Faulty Versions: The total number of faulty versions that have been manually generated by the approach introduced in this section;
- Auto-generated Ratio: The ratio of auto-generated faulty versions to total faulty versions.
It should be noted that the number of faulty versions for each class depends on the sum of LOCs of the target class and its parent/ancestor classes, the number of overridden methods in this class, the inheritance depth of this class, and finally, the number of variables inherited from the parent/ancestor classes.
In general, the number of OO mutants is significantly less than the number of mutants generated by common, procedural mutation approaches. For example, in case of the famous triangle program which has 30 lines of code, about 950 mutants can be created using different types of traditional mutation operators like changing arithmetic operators or logical operators (Ma et al. 2006). However, in a study of 256 classes of the Apache BCEL framework, it has been shown that about 14.5 OO mutants can be generated for each class using the MuJava tool (Offutt et al. 2006).
Table 2
Target classes and related faulty versions
Project | Target Class | DIT | All Faulty Versions | Auto-generated Faulty Versions | Manual Faulty Versions | Auto-generated Ratio
--- | --- | --- | --- | --- | --- | ---
Google Graph | StandardValueGraph | 3 | 21 | 16 | 5 | 76%
 | StandardMutableValueGraph | 4 | 30 | 16 | 14 | 53%
 | StandardMutableNetwork | 3 | 36 | 30 | 6 | 83%
 | StandardMutableGraph | 4 | 22 | 15 | 7 | 68%
 | NetworkBuilder | 2 | 7 | 6 | 1 | 86%
 | ImmutableValueGraph | 4 | 28 | 15 | 13 | 54%
 | GraphBuilder | 2 | 8 | 5 | 3 | 63%
JTetris | Square | 3 | 18 | 18 | 0 | 100%
 | Screen | 4 | 10 | 10 | 0 | 100%
 | S | 4 | 22 | 19 | 3 | 86%
 | ReverseL | 4 | 21 | 18 | 3 | 86%
 | Rectangle | 3 | 20 | 19 | 1 | 95%
 | Block | 2 | 14 | 13 | 1 | 93%
Apache Validator | TimeValidator | 3 | 17 | 12 | 5 | 71%
 | PercentValidator | 4 | 13 | 10 | 3 | 77%
 | LongValidator | 3 | 17 | 12 | 5 | 71%
 | CurrencyValidator | 4 | 16 | 12 | 4 | 75%
 | CalendarValidator | 3 | 16 | 10 | 6 | 63%
Apache Collections | TransformedSortedMap | 4 | 24 | 16 | 8 | 67%
 | StaticBucketMap | 2 | 7 | 7 | 0 | 100%
 | ReferenceMap | 3 | 25 | 23 | 2 | 92%
 | PredicatedSortedMap | 4 | 17 | 8 | 9 | 47%
 | PassiveExpiringMap | 2 | 15 | 10 | 5 | 67%
 | LRUMap | 3 | 26 | 18 | 8 | 69%
 | LazySortedMap | 4 | 16 | 9 | 7 | 56%
 | FixedSizeMap | 3 | 16 | 10 | 6 | 63%
Apache BCEL | Method | 3 | 13 | 12 | 1 | 92%
 | Field | 3 | 12 | 11 | 1 | 92%
 | EnumElementValue | 2 | 9 | 6 | 3 | 67%
 | ConstantInvokeDynamic | 3 | 13 | 4 | 9 | 31%
 | ConstantFieldref | 3 | 15 | 5 | 10 | 33%
 | Code | 2 | 12 | 10 | 2 | 83%
 | ArrayElementValue | 2 | 6 | 5 | 1 | 83%
Ta4J | ClosePriceIndicator | 4 | 16 | 9 | 7 | 56%
 | MMAIndicator | 5 | 19 | 8 | 11 | 42%
 | ATRIndicator | 3 | 15 | 9 | 6 | 60%
 | SMAIndicator | 3 | 14 | 10 | 4 | 71%
 | DPOIndicator | 3 | 15 | 10 | 5 | 67%
 | RWIHighIndicator | 3 | 14 | 9 | 5 | 64%
 | MACDIndicator | 3 | 13 | 8 | 5 | 62%
Sum | | - | 668 | 473 | 195 | -
Average | | 3.15 | 16.7 | 11.8 | 4.9 | 72%
4.4. Running Tests and Checking Faulty Versions
Each faulty version generated by the process mentioned in the previous subsection is either a modified Java class file or a collection of several modified Java class files contained in a directory. For each faulty version, we replaced the modified Java files with the original Java classes in the source directory of the related project. Then, we compiled the modified project and ran the target class test suite against the compiled (modified) project. If there was any test failure, we labeled the faulty version as “detected”, which meant the test suite had been able to detect the seeded fault in the project.
We had to repeat these steps for all faulty versions to determine whether each one was detectable. Because these steps were repetitive and costly, we developed a helper tool, called MuRunner, to automate the process. This tool is integrated with the native Java compiler and the Maven build tool for compiling projects, and it supports different versions of JUnit for running tests. It is a command line tool that receives the root directory of the faulty versions as input. It then retrieves the list of all faulty versions in that directory and performs all the required steps: replacing the modified Java classes in the project, recompiling the modified project, running the JUnit test suite against the compiled project, and finally, collecting the results. These results show which faulty versions have been detected by the tests. MuRunner is publicly available through GitHub. In addition, all faulty versions produced during this evaluation are available alongside the tool as a sample project[11]. Using MuRunner, OCov4J, and the provided samples, anyone can reproduce the experiments of this research or perform additional ones.
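The loop below is a conceptual sketch of the check that MuRunner automates; it is not the tool's actual implementation, and the directory layout, source path, and Maven command are assumptions made for illustration only.

```java
// Conceptual sketch of the per-faulty-version check (not MuRunner's real code).
import java.io.IOException;
import java.nio.file.*;

public class FaultyVersionCheck {
    public static void main(String[] args) throws IOException, InterruptedException {
        Path faultyRoot = Paths.get(args[0]); // root directory holding the faulty versions
        Path project = Paths.get(args[1]);    // root of the benchmark project (Maven layout assumed)
        try (DirectoryStream<Path> versions = Files.newDirectoryStream(faultyRoot)) {
            for (Path version : versions) {
                // 1. Overwrite the original sources with the modified classes of this version.
                copyInto(version, project.resolve("src/main/java"));
                // 2. Recompile the project and run the target class test suite.
                int exit = new ProcessBuilder("mvn", "-q", "test")
                        .directory(project.toFile()).inheritIO().start().waitFor();
                // 3. Any test failure means the seeded fault has been detected.
                System.out.println(version.getFileName() + (exit != 0 ? ": detected" : ": not detected"));
                // (Restoring the original sources before the next iteration is omitted here.)
            }
        }
    }

    private static void copyInto(Path from, Path to) throws IOException {
        // Recursively copy the modified .java files over the originals.
        Files.walk(from).filter(Files::isRegularFile).forEach(p -> {
            try {
                Path target = to.resolve(from.relativize(p));
                Files.createDirectories(target.getParent());
                Files.copy(p, target, StandardCopyOption.REPLACE_EXISTING);
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        });
    }
}
```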
4.5. Results and Discussion
Table 3 shows the experimental results for the different faulty versions of the target classes. For each target class, in addition to the project name and the target class name, this table contains the following columns:
- OCov: The object statement coverage level of the used test suite for the target class;
- Cov: The traditional statement coverage level of the used test suite for the target class;
- Detected Faults (All): The number of faulty versions detected by the test suite vs. the total number of all faulty versions (either produced by MuJava or generated manually);
- Detected Faults (Auto): The number of auto-generated faulty versions detected by the test suite vs. the total number of auto-generated faulty versions (mutants generated by MuJava);
- Detection Ratio (All): The ratio of the number of detected faulty versions to the number of all faulty versions (either produced by MuJava or generated manually);
- Detection Ratio (Auto): The ratio of the number of detected faulty versions to the number of auto-generated faulty versions (mutants generated by MuJava).
Table 3
Experimental results for each target class
Project | Target Class | OCov | Cov | Detected Faults (All) | Detected Faults (Auto) | Detection Ratio (All) | Detection Ratio (Auto)
--- | --- | --- | --- | --- | --- | --- | ---
Google Graph | StandardValueGraph | 0.435 | 0.917 | 8 / 21 | 6 / 16 | 0.381 | 0.375
 | StandardMutableValueGraph | 0.400 | 0.662 | 6 / 30 | 4 / 16 | 0.200 | 0.250
 | StandardMutableNetwork | 0.406 | 0.746 | 7 / 36 | 5 / 30 | 0.194 | 0.167
 | StandardMutableGraph | 0.233 | 1.000 | 3 / 22 | 1 / 15 | 0.136 | 0.067
 | NetworkBuilder | 1.000 | 1.000 | 6 / 7 | 5 / 6 | 0.857 | 0.833
 | ImmutableValueGraph | 0.225 | 0.750 | 3 / 28 | 2 / 15 | 0.107 | 0.133
 | GraphBuilder | 0.958 | 0.967 | 6 / 8 | 5 / 5 | 0.750 | 1.000
JTetris | Square | 0.138 | 1.000 | 2 / 18 | 2 / 18 | 0.111 | 0.111
 | Screen | 0.563 | 1.000 | 5 / 10 | 5 / 10 | 0.500 | 0.500
 | S | 0.137 | 1.000 | 2 / 22 | 1 / 19 | 0.091 | 0.053
 | ReverseL | 0.135 | 1.000 | 1 / 21 | 1 / 18 | 0.048 | 0.056
 | Rectangle | 0.235 | 1.000 | 3 / 20 | 3 / 19 | 0.150 | 0.158
 | Block | 0.805 | 1.000 | 9 / 14 | 8 / 13 | 0.643 | 0.615
Apache Validator | TimeValidator | 0.522 | 1.000 | 5 / 17 | 4 / 12 | 0.294 | 0.333
 | PercentValidator | 0.625 | 1.000 | 5 / 13 | 4 / 10 | 0.385 | 0.400
 | LongValidator | 0.540 | 1.000 | 5 / 17 | 5 / 12 | 0.294 | 0.417
 | CurrencyValidator | 0.542 | 1.000 | 8 / 16 | 4 / 12 | 0.500 | 0.333
 | CalendarValidator | 0.645 | 1.000 | 2 / 16 | 1 / 10 | 0.125 | 0.100
Apache Collections | TransformedSortedMap | 0.507 | 1.000 | 11 / 24 | 8 / 16 | 0.458 | 0.500
 | StaticBucketMap | 0.815 | 0.817 | 5 / 7 | 5 / 7 | 0.714 | 0.714
 | ReferenceMap | 0.068 | 0.455 | 2 / 25 | 0 / 23 | 0.080 | 0.000
 | PredicatedSortedMap | 0.535 | 1.000 | 7 / 17 | 4 / 8 | 0.412 | 0.500
 | PassiveExpiringMap | 0.847 | 0.903 | 7 / 15 | 4 / 10 | 0.467 | 0.400
 | LRUMap | 0.438 | 0.642 | 8 / 26 | 6 / 18 | 0.308 | 0.333
 | LazySortedMap | 0.471 | 1.000 | 6 / 16 | 3 / 9 | 0.375 | 0.333
 | FixedSizeMap | 0.531 | 0.769 | 11 / 16 | 8 / 10 | 0.688 | 0.800
Apache BCEL | Method | 0.439 | 0.702 | 6 / 13 | 6 / 12 | 0.462 | 0.500
 | Field | 0.276 | 0.462 | 7 / 12 | 6 / 11 | 0.583 | 0.545
 | EnumElementValue | 0.640 | 0.647 | 4 / 9 | 3 / 6 | 0.444 | 0.500
 | ConstantInvokeDynamic | 0.444 | 1.000 | 5 / 13 | 1 / 4 | 0.385 | 0.250
 | ConstantFieldref | 0.500 | 1.000 | 5 / 15 | 1 / 5 | 0.333 | 0.200
 | Code | 0.606 | 0.612 | 5 / 12 | 4 / 10 | 0.417 | 0.400
 | ArrayElementValue | 0.686 | 0.741 | 2 / 6 | 2 / 5 | 0.333 | 0.400
Ta4J | ClosePriceIndicator | 0.148 | 1.000 | 0 / 16 | 0 / 9 | 0.000 | 0.000
 | MMAIndicator | 0.179 | 1.000 | 2 / 19 | 1 / 8 | 0.105 | 0.125
 | ATRIndicator | 0.136 | 0.500 | 2 / 15 | 0 / 9 | 0.133 | 0.000
 | SMAIndicator | 0.246 | 0.800 | 5 / 14 | 3 / 10 | 0.357 | 0.300
 | DPOIndicator | 0.234 | 0.889 | 4 / 15 | 3 / 10 | 0.267 | 0.300
 | RWIHighIndicator | 0.485 | 0.689 | 9 / 14 | 7 / 9 | 0.643 | 0.778
 | MACDIndicator | 0.206 | 0.750 | 2 / 13 | 0 / 8 | 0.154 | 0.000
To informally check whether there is a positive correlation between the object coverage level and the faulty version detection ratio, we use a scatter diagram. This diagram, based on the columns OCov and Detection Ratio (All) in Table 3, is depicted on the left side of Fig. 3. According to this scatterplot, there is a positive linear correlation between the object coverage level and the percentage of OO faulty versions detected by the test suite.
Although the scatterplot shows a positive correlation, to find the strength of this correlation, we should obtain the correlation coefficient value. To choose the appropriate method to calculate this coefficient, we examine some of the characteristics of the resulting experimental data.
As the first characteristic, the left side of the scatter diagram shows an approximately linear relationship between the two variables OCov and Detection Ratio (All). Next, using the Shapiro-Wilk test, which is one of the most powerful normality tests (Razali and Wah 2011), we check whether both variables follow a normal distribution. The null hypothesis \({H}_{0}\) of this test is that the variable is normally distributed. According to the test, the p-values for the variables OCov and Detection Ratio (All) are 0.056 and 0.096, respectively. Since the p-value for each variable is greater than the significance level α (0.05), we do not reject \({H}_{0}\) and treat both variables as normally distributed. Another characteristic of the data observed in the scatterplot is the absence of considerable outliers.
Based on these characteristics, we use the Pearson method to obtain the correlation coefficient between the object coverage level and the faulty version detection ratio. The Pearson correlation coefficient for these variables is 0.792, with 38 degrees of freedom (df); df is the number of samples minus 2. These values indicate a high positive correlation between the object coverage level and the failure detection ratio. This result is also statistically significant, because the critical value for a one-tailed test with significance level α = 0.05 and df = 38 is 0.257, and our Pearson correlation coefficient (0.792) is much greater than 0.257. If we use the results in Table 3 to draw a scatterplot for the traditional statement coverage level (column Cov in Table 3) and the faulty version detection ratio, as depicted on the right side of Fig. 3, no considerable correlation is observed. Moreover, the Pearson correlation coefficient between these two variables is -0.010, which is considered a negligible correlation (Razali and Wah 2011).
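For readers who wish to recompute these statistics, the snippet below sketches the coefficient calculation using Apache Commons Math (version 3.x assumed); the arrays are truncated placeholders that would hold one entry per target class, taken from the OCov and Detection Ratio (All) columns of Table 3.

```java
// Sketch of the correlation computation (Apache Commons Math 3.x assumed).
import org.apache.commons.math3.stat.correlation.PearsonsCorrelation;

public class CorrelationCheck {
    public static void main(String[] args) {
        // One entry per target class (only the first five rows of Table 3 are shown here).
        double[] oCov      = { 0.435, 0.400, 0.406, 0.233, 1.000 /* ... */ };
        double[] detection = { 0.381, 0.200, 0.194, 0.136, 0.857 /* ... */ };

        double r = new PearsonsCorrelation().correlation(oCov, detection);
        // With all 40 target classes, the degrees of freedom for the significance test are 40 - 2 = 38.
        System.out.printf("Pearson r = %.3f (df = %d)%n", r, oCov.length - 2);
    }
}
```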
As mentioned, we used a manual approach to generate more diverse mutants with better coverage of object-oriented faults. However, it may be argued that the manually generated mutants could bias the results (Fig. 3). To address this concern, first note that most mutants have been generated automatically by MuJava (as shown in Table 2, 72% of all mutants have been produced automatically). Second, the positive correlation remains significant if the evaluation is restricted to the auto-generated mutants only.
To examine this issue in more detail, the columns "Detected Faults (Auto)" and "Detection Ratio (Auto)" in Table 3 show the results of our experimental evaluation when only the 473 auto-generated mutants are taken into account. Also, Fig. 4 shows the scatterplot between the object coverage level in column OCov and the faulty version detection ratio (calculated over the auto-generated mutants only) in column Detection Ratio (Auto). The scatterplot reveals a positive correlation between these two variables.
To compare this with the overall correlation presented before, we again used the Pearson correlation coefficient to determine the strength of the correlation for the auto-generated mutants. The value of this coefficient is 0.779, which is slightly less than the correlation coefficient over all mutants, i.e., 0.792. In addition, this correlation coefficient is statistically significant because it is much greater than 0.257 (as mentioned before, the critical value for a one-tailed test with significance level α = 0.05 and df = 38). The right side of Fig. 4 shows the scatterplot between the traditional coverage level and the fault detection ratio over the auto-generated mutants; like the corresponding diagram that takes all mutants into account (right side of Fig. 3), it shows no significant correlation between the traditional coverage level and the detected OO-related faults.
4.5.1. Poly-object Coverage Criteria Evaluation
In addition to the main object coverage criteria, we proposed the "poly-object coverage criteria" to address problems related to polymorphism and dynamic binding. These criteria are essentially the same as the object coverage criteria, but they are defined for a class (as a base class) by considering its child and descendant classes. To evaluate the proposed poly-object coverage criteria, we considered 24 base classes from the benchmark projects introduced in Subsection 4.1. Each base class in our experiment has at least one child class and may also have other descendant classes. For each base class bc, we have selected the following versions from the set of faulty versions created for the descendant classes of bc (constructed as described in Subsection 4.3):
- Faulty versions of descendant classes of bc which have been generated by fault injection into the code of bc; and
- Faulty versions of descendant classes of bc that fall into the "Polymorphism and using inconsistent types" category (based on our fault categories specified in Subsection 4.3); this category addresses polymorphism and dynamic binding problems.
Table 4 shows the results of this evaluation for the base classes. Column NOD indicates the number of descendant classes of the base class (including its children). Column PolyCov shows the poly-object coverage level of the base class, considering its descendant classes listed in column Descendants. The next column, Detected Faults, reports the number of detected faulty versions vs. the number of all faulty versions. Finally, the last column gives the detection ratio of the faulty versions.
Using an approach similar to that of the previous subsection, we observe a high positive correlation between the columns PolyCov and Detection Ratio in Table 4. The scatter diagram shown on the left side of Fig. 5 clearly shows the correlation between the poly-object coverage level and the polymorphic-fault detection ratio. As before, due to the characteristics of the data in columns PolyCov and Detection Ratio, we used the Pearson correlation coefficient to determine the strength of the observed correlation. The value of this coefficient for the resulting data is 0.821. This correlation coefficient is statistically significant because the critical value for a one-tailed test with significance level α = 0.05 and df = 22 is 0.330, and our Pearson correlation coefficient (0.821) is greater than 0.330.
Table 4
Experimental results for the poly-object coverage level
Project | Base Class | NOD | Descendants | PolyCov | Detected Faults | Detection Ratio
--- | --- | --- | --- | --- | --- | ---
Google Graph | StandardValueGraph | 1 | StandardMutableValueGraph | 0.278 | 1 / 9 | 0.111
 | AbstractValueGraph | 2 | StandardValueGraph, StandardMutableValueGraph | 0.083 | 0 / 14 | 0.000
 | AbstractBaseGraph | 3 | StandardMutableValueGraph, StandardMutableGraph, StandardValueGraph | 0.161 | 2 / 16 | 0.125
 | StandardNetwork | 1 | StandardMutableNetwork | 0.326 | 2 / 7 | 0.286
JTetris | Grid | 6 | Screen, S, ReverseL, Rectangle, Square, Block | 0.244 | 3 / 8 | 0.375
 | Block | 4 | S, ReverseL, Rectangle, Square | 0.125 | 1 / 5 | 0.200
 | Polygon | 2 | S, ReverseL | 0.125 | 1 / 6 | 0.167
Apache Collections | TransformedMap | 1 | TransformedSortedMap | 0.484 | 3 / 8 | 0.375
 | PredicatedMap | 1 | PredicatedSortedMap | 0.548 | 2 / 5 | 0.400
 | LazyMap | 1 | LazySortedMap | 0.250 | 5 / 9 | 0.556
 | AbstractMapDecorator | 5 | PassiveExpiringMap, FixedSizeMap, LazySortedMap, PredicatedSortedMap, TransformedSortedMap | 0.374 | 10 / 23 | 0.435
 | AbstractLinkedMap | 1 | LRUMap | 0.292 | 4 / 14 | 0.286
 | AbstractReferenceMap | 1 | ReferenceMap | 0.054 | 2 / 20 | 0.100
Apache BCEL | Attribute | 1 | Code | 0.630 | 1 / 2 | 0.500
 | ElementValue | 2 | EnumElementValue, ArrayElementValue | 0.563 | 2 / 4 | 0.500
 | ConstantCP | 2 | ConstantFieldref, ConstantInvokeDynamic | 0.382 | 3 / 9 | 0.333
 | Constant | 2 | ConstantFieldref, ConstantInvokeDynamic | 0.333 | 2 / 5 | 0.400
 | FieldOrMethod | 2 | Field, Method | 0.484 | 1 / 2 | 0.500
Apache Validator | AbstractCalendarValidator | 2 | TimeValidator, CalendarValidator | 0.530 | 9 / 16 | 0.563
 | BigDecimalValidator | 2 | PercentValidator, CurrencyValidator | 0.500 | 2 / 5 | 0.400
 | AbstractNumberValidator | 3 | PercentValidator, CurrencyValidator, LongValidator | 0.479 | 6 / 11 | 0.545
Ta4J | RecursiveCachedIndicator | 1 | MMAIndicator | 0.167 | 0 | 0.000
 | CachedIndicator | 7 | ATRIndicator, RWIHighIndicator, MACDIndicator, MMAIndicator, ClosePriceIndicator, DPOIndicator, SMAIndicator | 0.111 | 2 | 0.133
 | AbstractIndicator | 7 | ATRIndicator, RWIHighIndicator, MACDIndicator, MMAIndicator, ClosePriceIndicator, DPOIndicator, SMAIndicator | 0.595 | 12 | 0.522
4.6. Limitations
The object code coverage criteria emphasize the execution of different parts of the code that represent the state and behavior of an object. These parts include the code of the main class along with the code of the inherited classes, while considering the actual type of the object under test at runtime. These criteria can only show that the different parts of the code related to the object are executed at least once. In other words, these criteria, like the traditional code coverage criteria, only enforce the execution of different parts of the code and do not necessarily show the presence or absence of failures in the program.
One of the main means for revealing failures (especially failures rooted in the logic of programs or classes) is the assertion part of test cases, which plays the role of the test oracle. Like the traditional code coverage criteria, the object coverage criteria are not able to effectively evaluate the test case assertions. This is why our experimental evaluation has been conducted on very simple faults and mutants, which are more likely to be detected by the assertions of auto-generated tests when the parts of the code that contain the faults are executed.
Therefore, our evaluation results cannot show that achieving a high object coverage level for a test suite necessarily leads to revealing real OO failures. This claim is beyond the scope of this research, and examining the correlation between high object coverage levels and the ability to detect real-world OO failures could be the subject of future work.
Since we have used OO mutations, i.e., MuJava class mutations, to generate mutants, and we have shown the correlation of the object coverage criteria with the detection ratio of these mutants, one could claim that the object coverage criteria are not necessarily required, when there exist mutation analysis techniques which support OO-related mutants. However, in addition to existing OO mutation approaches, like MuJava, we have used various techniques to generate faulty versions. These techniques, which are not practically used in existing OO mutation tools, make simultaneous changes in some part of the class under test, as well as its parent and ancestor classes. They also apply simultaneous changes in several family classes. Furthermore, like the traditional code coverage approaches, our "object coverage" approach could be used very simply, with high automation and with negligible execution cost. In contrast, there are not many automated tools for OO mutations, and also, like all mutation techniques, OO mutation methods have high execution cost (Segura et al. 2011).
4.7. Threats to Validity
One of the internal validity threats could be potential faults in the OCov4J tool implementation. We have tested this tool thoroughly. In addition, we have published the tool as an open source software for further evaluation and extension. As part of testing OCov4J, we have compared this tool with a mature traditional code coverage tool, JaCoCo. To do so, we have applied both tools to some classes without any parent or child. Based on our arguments in Subsection 3.1, applying the object coverage criteria and the traditional code coverage criteria to these classes must lead to the same results. Fortunately, OCov4J and JaCoCo tools have performed according to this expectation.
There are two points to note with respect to external validity threats. The first concerns the benchmarks chosen for our evaluation (refer to Subsection 4.1). We attempted to select as many different benchmark projects as possible and chose real-world, widely used projects in different domains such as validators, file decoders, data structures, and analysis software. Moreover, in selecting classes from each project, we have chosen classes in different inheritance hierarchies with different depths. The second threat to external validity could stem from our approach to generating faulty versions (refer to Subsection 4.3), which raises the possibility of bias and may make the generality of the results questionable. To address this concern, we followed highly cited approaches such as (Offutt et al. 2001; Ma et al. 2006; Offutt et al. 2006) to manually seed different types of OO faults. In addition to the manual faulty versions, we also used mutants automatically generated by MuJava with its different mutation operators for OO features such as inheritance and polymorphism. We also tried to avoid equivalent mutants by examining all auto-generated mutations. We have published our faulty versions as a benchmark alongside our open source tools, OCov4J and MuRunner, to facilitate the reproduction of the results and further experimental evaluation.
[4] https://github.com/BasLeijdekkers/MetricsReloaded
[5] https://github.com/sbu-test-lab/jtetris
[6] https://commons.apache.org/validator/
[7] https://commons.apache.org/bcel/
[8] http://commons.apache.org/collections/
[9] https://github.com/google/guava
[10] https://github.com/ta4j/ta4j
[11] https://github.com/sbu-test-lab/mu-runner