The main purpose of this research is to adapt the traditional code coverage criteria to address specific OO features while preserving the high degree of automation, simplicity, and low execution cost of the original criteria. In this section, the proposed object coverage criteria are empirically evaluated using a set of OO classes as benchmarks. These classes have been selected from different open source Java programs. We use our prototype tool, OCov4J, to measure the object coverage criteria. The results are then compared with those of the traditional coverage criteria (obtained by the JaCoCo tool) to determine which criteria better reveal OO-related failures.
The major goal of this empirical evaluation is to investigate the correlation between the object coverage criteria and the ability to reveal specific OO-related faults. The secondary goal is to examine how our poly-object coverage criteria address polymorphism and dynamic binding issues. For these purposes, several classes under test (CUTs) from different open source Java programs/libraries have been selected. We have then generated a test suite for each CUT such that the resulting test suites yield a high level of coverage for the traditional code coverage criteria. Next, a set of OO faults has been seeded into the source code of every CUT or its related classes in the inheritance hierarchy to create specific OO faulty versions (for this purpose, we have used auto-generated mutants along with some manually seeded faults). Finally, we have calculated the object coverage criteria for each test suite as well as the number of faulty versions it detects, in order to examine the correlation between these two categories of values. In the following subsections, we review the details of the benchmarks, the evaluation process, and finally, the results of the evaluation.
4.1. Benchmark projects
A total of 270 classes (excluding private, intermediary, and internal classes) from six different open source projects have been included in the empirical evaluation. Apart from JTetris, which is a small educational project, five widely used open source projects with active communities have been selected.
Table 1 shows the project name, the package used during evaluation, the number of classes, the depth of the inheritance tree (DIT), and the lines of code (LOC) (excluding comment and blank lines) for each selected project. The MetricsReloaded[4] tool has been used to calculate LOC and DIT of each class. A summary description of benchmark projects is provided below:
- JTetris[5]: This project is a simple implementation of the Tetris computer game in Java. This implementation is based on the formal specification provided in (Smith 2012). It uses inheritance, polymorphism and dynamic binding to model the elements and the process of the game. The classes under the package models are used to specify different elements and their movement in the game.
- Apache Validator[6]: This is a widely used Java library for data validation that provides various sets of validation rules for different data types. It is a sub-project of the popular Apache Commons project. We use classes under package validator.routines which model different validators for various data types.
- Apache BCEL[7]: The Apache "Byte Code Engineering Library" is an open source project for analyzing and manipulating compiled Java files, i.e., bytecode files. This project is very popular and makes extensive use of OO features. Related works such as (Alexander et al. 2012; Offutt et al. 2006) have used this project in their evaluations. The package classfile, used in our evaluation, contains the classes that model the different internal parts of a bytecode file.
- Apache Collections[8]: This library enriches the data structure framework of Java Development Kit (JDK) by providing some new data structure classes or helper classes for existing data structures. We use the map package which provides different types of map data structures.
- Google Graphs[9]: Google Graphs is a famous library for working with graph data structures in Java. It is a sub-project of the famous Google Guava project which includes common libraries for Java, maintained by Google. We use package graph from this project including different types of graphs and various network data structures.
- Ta4J[10]: This is an open source project for technical and financial analysis in Java. It has received a lot of attention from Java developers. The classes under the package indicators, which contain different technical indicators, are used in the evaluation. Each indicator predicts future price movements by analyzing historical price data.
Table 1
A summary of benchmark projects
Project | Package | Classes | DITs of all classes | Average DIT | LOCs of all classes
--- | --- | --- | --- | --- | ---
JTetris | org.jtetris.models | 11 | 34 | 3.090 | 521
Apache Validator | org.apache.commons.validator.routines | 27 | 57 | 2.111 | 9,456
Apache BCEL | org.apache.bcel.classfile | 82 | 154 | 1.878 | 14,046
Apache Collections | org.apache.commons.collections4.map | 36 | 110 | 3.055 | 12,797
Google Graphs | com.google.common.graph | 38 | 80 | 2.105 | 7,812
Ta4J | org.ta4j.core.indicators | 76 | 236 | 3.015 | 5,234
Sum | | 270 | - | - | 49,866
Average | | 45 | 112 | 2.5 | 8,311
To examine the "object coverage criteria", 40 classes from the mentioned projects have been selected; we refer to these CUTs as "target classes". These classes lie in different inheritance hierarchies. Using EvoSuite, a test suite has been generated for each target class to measure its ability to detect OO-related failures. In addition, 24 classes, most of which differ from the target classes, have been used to evaluate the "poly-object coverage criteria"; we refer to these 24 classes as "base classes". Target, base, and other classes in the inheritance hierarchy are used to create faulty versions of each project; the details are presented in Subsection 4.3. Each target class has at least one parent. Moreover, we use some target classes with several ancestor classes so that different types of OO problems can be modeled in a faulty version. Also, each base class has at least one child so that it can be used in a polymorphic manner. Finally, we use base classes with different numbers of descendant classes to cover more polymorphic faults. By selecting different, real projects and choosing target and base classes in various inheritance hierarchies, we attempt to reduce the threats to the external validity of our empirical evaluation.
4.2. Test Suites
For each target class in each inheritance hierarchy, a test suite has been created using the EvoSuite tool. For half of the selected classes, i.e., 20 classes, the generated test suite achieves 100% traditional statement coverage. For 11 other classes, the coverage is above 70%; for 7 classes, it is between 50% and 70%; and for only two classes, it is slightly below 50%. Thus, most of the target classes, i.e., more than 77% of them, achieve high code coverage levels (between 70% and 100%). Attempts have been made to ensure that the tests have a high level of code coverage so that they are more likely to detect failures in our faulty versions of the projects.
We have used EvoSuite 1.1.0 with its default configuration to generate test suites, with a maximum search budget of 1 minute per test suite. After generating a test suite for each target class, the object coverage criteria for that class are measured by attaching the OCov4J agent to the JUnit test execution process.
4.3. Faulty Versions
Fault seeding, in which artificial faults are inserted into programs, is commonly used to compare testing approaches (Papadakis et al. 2019). To measure how well a test suite reveals OO-related failures in a target class, small changes have been made to the target class code along with its parent and ancestor classes. In this way, fault-seeded versions of the projects, called "faulty versions", have been built. To inject different types of OO defects into the faulty versions, we have used the classification of OO faults in (Offutt et al. 2001; Alexander et al. 2002) as well as the approaches for generating OO mutations in (Kim et al. 2001; Offutt et al. 2006; Ma et al. 2006). According to these approaches, OO faults and problems (related to concepts such as inheritance, polymorphism, and dynamic binding) can be categorized as follows:
- Inappropriate and inconsistent use of inherited state variables: Many inheritance-related errors occur when the child class misuses the inherited state space of the parent or ancestor classes (a sketch of such a fault follows this list). In the fault model introduced in (Offutt et al. 2001), these errors are classified as "state definition anomaly" and "state definition inconsistency". The IHD and IHI operators have been used in mutation approaches such as that of (Offutt et al. 2006) to generate such errors; these operators hide an inherited state variable by deleting or adding a variable whose name is the same as the name of a variable in the parent or ancestor classes.
- Incompatible invocations of inherited methods: Some other inheritance-related errors are caused by incorrect use of the parent state space when defining overridden methods. There are also errors due to either direct or indirect invocation of these methods. In the fault model of (Offutt et al. 2001), these errors are respectively classified as "state definition inconsistency due to state variable hiding" and "indirect inconsistent state definition". To create such errors, mutation approaches such as that of (Offutt et al. 2006) use the IOD and IOR operators, which cause failures by deleting and renaming overridden methods in the child class, respectively.
- Incorrect and inconsistent invocation of parent/ancestor constructors: The way a class is initialized can cause some common, potential OO failures. Executing the constructor of a class results in executing the constructors of the parent and ancestor classes until execution reaches the root class. Also, each class may explicitly invoke a specific version of its parent class constructor, so an inappropriate invocation in these sequences of constructor invocations may lead to data anomalies. In addition, calling other methods of a class in its constructor can cause potential failures because these methods may be overridden by subclasses. In the fault model of (Offutt et al. 2001), these errors are classified as "anomalous construction behavior" and "incomplete construction". The IPC and JDC operators have been introduced to model such errors in the OO mutation approach of (Offutt et al. 2006); the former eliminates the call to the parent constructor, and the latter forces the default constructor to be executed. These changes model problems that arise from the interaction between the class constructor and the inherited constructors.
- Polymorphism and using inconsistent types: Polymorphism and dynamic binding can lead to potential problems when overridden methods are executed in different contexts (i.e., on instances of different descendant classes). These problems may occur especially when the inheritance depth is greater than two. For example, the yo-yo problem is a well-known case of this failure type that is reported to be difficult to find (Alexander et al. 2010). In the fault model of (Offutt et al. 2001), these errors are classified as "inconsistent type use". Several operators have been introduced to generate such errors in the mutation approach of (Offutt et al. 2006). For example, the PNC operator changes an object instantiation from a class to one of its subclasses, and the PMD operator changes the declared type of a variable to the parent or an ancestor of the class.
- Misusing programming language keywords in accessing object state and inherited state: Although this error type is not included in the fault model of (Offutt et al. 2001), Ma and Offutt consider these errors common among developers and introduce mutation operators for them (Offutt et al. 2006). These operators include ISK, JTD and JSC which are responsible for deleting the keywords super (for problems related to the inherited code access), this (for problems related to the object states), and static (for problems in accessing the shared space between classes), respectively. Note that these types of faults may lead to errors that are semantically equivalent to the errors in the aforementioned categories. Therefore, it is necessary to check the generated faulty versions to avoid equivalent mutants.
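As a concrete illustration of the first category above (inappropriate use of inherited state variables), the following sketch shows a variable-hiding fault of the kind produced by the IHI operator. The classes are hypothetical and are not taken from the benchmark projects.

```java
// Hypothetical classes (not from the benchmarks) sketching an IHI-style
// "state definition anomaly": the child class hides an inherited state variable.
class Account {
    protected double balance = 0.0;
    public double getBalance() { return balance; }    // reads Account.balance
}

class SavingsAccount extends Account {
    // Seeded fault: re-declaring "balance" hides the inherited field.
    protected double balance = 0.0;

    public void deposit(double amount) {
        balance += amount;                             // updates only SavingsAccount.balance
    }
    // The inherited getBalance() still reads Account.balance, so it keeps returning 0.0
    // after any number of deposits -- an inconsistent use of the inherited state space.
}
```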
Although we could have implemented our fault seeding process using only one mutation tool that supports OO mutation operators (e.g., MuJava (Ma et al. 2006), which implements the mutation approach of (Offutt et al. 2006)), we have also used a manual approach alongside MuJava to produce faulty versions. This was done due to the lack of mature tools in this field. Among all the OO mutation tools, only MuJava is accessible and usable for real cases, such as the benchmarks presented in the previous subsection. However, MuJava, like other OO mutation approaches, generates mutants by applying only a tiny change to one statement of a single class file; this has some shortcomings for modeling specific OO faults, as listed below:
1. While making a small change in one statement of one class file usually yields a valid mutant in procedural mutation, this approach may lead to invalid mutants in the OO paradigm. For example, in MuJava, the JTD operator tries to create an OO mutant by removing the keyword this from the beginning of a variable name so that the value assigned to this variable remains local and does not bind to the class state variable (a common fault, as mentioned in our OO fault categories). Now, consider class Bars in the following example, where applying the JTD operator in line 6 would yield a mutant. This mutant cannot be compiled because the state variable barCount is declared with the final keyword (line 2), and the final initialization rule enforces such a variable to be initialized when the object is instantiated. Therefore, to create such a mutant, we have to apply two small changes at the same time: removing the final keyword in line 2 and removing the this keyword in line 6. Thus, in some situations, OO mutation requires multiple changes in different parts of the class.
01 class Bars {
02 final int barCount;
03 Bar[] bars;
04 public Bars(int barCount) {
05 bars = new Bar[barCount];
06 this.barCount = barCount;
07 }
08 }
2. The existing approaches generate OO mutants for a particular class by considering this class in isolation and applying their operators only to its code. However, many OO failures (especially polymorphic failures) originate from a fault in the parent or ancestor classes. Although such a fault may not affect the correctness of the parent class itself, it may lead to OO failures in the descendant classes (for example, refer to the counters example in Subsection 2.3). Thus, when generating mutants for a class, we may need to apply changes to its parent or ancestor classes.
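To make this point concrete, the following hypothetical sketch (not the counters example of Subsection 2.3) seeds a fault in a parent class that leaves the parent's own observable behavior intact but breaks a descendant:

```java
// Hypothetical example: the seeded fault sits in the parent class, yet only a
// descendant class exposes it at runtime.
class BaseValidator {
    // Seeded fault: ">=" weakened to ">". BaseValidator itself never calls this
    // helper, so a test suite exercising BaseValidator alone still passes.
    protected boolean inRange(int value, int min, int max) {
        return value > min && value <= max;
    }
    public boolean isValid(int value) { return value != Integer.MIN_VALUE; }
}

class RangeValidator extends BaseValidator {
    @Override
    public boolean isValid(int value) {
        // The inherited fault surfaces only here: isValid(0) now wrongly returns
        // false, although no line of RangeValidator has been changed.
        return inRange(value, 0, 10);
    }
}
```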
3. To model some OO faults, it is necessary to make small changes in the target class and its parent/ancestor classes simultaneously. For example, the IPC operator in MuJava generates mutants by removing the invocation of one of the parent class constructors from the constructor of the class. When we remove this invocation, the Java compiler implicitly calls the default (parameterless) constructor of the parent class instead; if the parent class does not have a default constructor, the change made to the child class causes a compile error. In this case, in addition to changing the child class file (removing the parent constructor call), we need to add a default constructor to the parent class at the same time. Consider the class ClassUnderTest below, for which we want to create a mutant by removing the parent constructor invocation in line 4. Suppose ParentClass does not have a default constructor; then, to make the mutant valid and compilable, we have to add a default constructor to ParentClass at the same time, as shown in line 3 of the ParentClass code.
01 class ClassUnderTest extends ParentClass {
02 public ClassUnderTest(Object param){
03 // remove the below line to disable the parent constructor
04 super(param);
05 //other constructor code of this class
06 }
07 }
01 class ParentClass {
02 // add below line as a default constructor
03 public ParentClass(){}
04 public ParentClass(Object param){
05 //other constructor code of this class
06 }
07 }
In light of the above points, we have used a hybrid approach for the fault seeding process. In addition to using the MuJava tool to generate OO mutants, some more faulty versions have been created manually by analyzing the target classes along with their parent/ancestor classes. In general, we followed these steps to generate faulty versions:
1. First, we generated all the OO mutants that can be produced using only the inheritance, polymorphism, and dynamic binding operators of MuJava.
2. Next, we manually generated faulty versions in the categories mentioned at the beginning of this subsection by changing two or more parts of a class at the same time, modifying the parent/ancestor classes, or changing several classes at the same time, including the target class and the classes in its inheritance hierarchy.
3. Finally, we checked all faulty versions generated (either automatically or manually) for a class to remove equivalent versions.
Below are two examples of automatic and manual mutants used in the evaluation process. Class Field of project Apache BCEL represents a mutant generated automatically by MuJava: line 3 has been removed from this class so that the parent constructor is not called. The DPOIndicator class (from the Ta4J project) shows a manual mutant generated by the approach described in this section. Two simultaneous changes have been applied to this class to make the mutant: first, the this keyword has been removed from the beginning of variable barCount in line 6 to change the scope of this variable; second, since variable barCount is declared with the final keyword in line 2, the final keyword has also been removed so that the mutant can be compiled.
01 public class Field extends FieldOrMethod {
02 public Field(final Field c) {
03 super(c); //this line removed to generate a mutant
04 }
05 // other code of this class
06 }
01 public class DPOIndicator extends CachedIndicator<Num> {
02 private final int barCount; //the “final” keyword is removed
03 // other field of class
04 public DPOIndicator(Indicator<Num> price, int barCount) {
05 super(price);
06 this.barCount = barCount; //the “this” keyword is removed
07 timeShift = barCount / 2 + 1;
08 this.price = price;
09 sma = new SMAIndicator(price, this.barCount);
10 }
11 // other code of this class
12 }
After performing the above steps and discarding equivalent faulty versions, 668 faulty versions have been generated for the target classes; on average, about 16.7 faulty versions have been produced per class. Most of these mutants (about 72%) have been generated automatically by the MuJava tool. Table 2 shows the details of the generated faulty versions for each target class. In addition to the project name and the target class name, this table contains the following columns:
- DIT: The inheritance depth of the target class in the inheritance hierarchy;
- All Faulty Versions: The total number of all faulty versions generated for the target class, either using MuJava or manually;
- Auto-generated Faulty Versions: The total number of faulty versions automatically generated by MuJava for the target class;
- Manual Faulty Versions: The total number of faulty versions that have been manually generated by the approach introduced in this section;
- Auto-generated Ratio: The ratio of auto-generated faulty versions to total faulty versions.
It should be noted that the number of faulty versions for each class depends on the sum of LOCs of the target class and its parent/ancestor classes, the number of overridden methods in this class, the inheritance depth of this class, and finally, the number of variables inherited from the parent/ancestor classes.
In general, the number of OO mutants is significantly less than the number of mutants generated by common, procedural mutation approaches. For example, in case of the famous triangle program which has 30 lines of code, about 950 mutants can be created using different types of traditional mutation operators like changing arithmetic operators or logical operators (Ma et al. 2006). However, in a study of 256 classes of the Apache BCEL framework, it has been shown that about 14.5 OO mutants can be generated for each class using the MuJava tool (Offutt et al. 2006).
Table 2
Target classes and related faulty versions
Project | Target Class | DIT | All Faulty Versions | Auto-generated Faulty Versions | Manual Faulty Versions | Auto-generated Ratio
--- | --- | --- | --- | --- | --- | ---
Google Graph | StandardValueGraph | 3 | 21 | 16 | 5 | 76%
 | StandardMutableValueGraph | 4 | 30 | 16 | 14 | 53%
 | StandardMutableNetwork | 3 | 36 | 30 | 6 | 83%
 | StandardMutableGraph | 4 | 22 | 15 | 7 | 68%
 | NetworkBuilder | 2 | 7 | 6 | 1 | 86%
 | ImmutableValueGraph | 4 | 28 | 15 | 13 | 54%
 | GraphBuilder | 2 | 8 | 5 | 3 | 63%
JTetris | Square | 3 | 18 | 18 | 0 | 100%
 | Screen | 4 | 10 | 10 | 0 | 100%
 | S | 4 | 22 | 19 | 3 | 86%
 | ReverseL | 4 | 21 | 18 | 3 | 86%
 | Rectangle | 3 | 20 | 19 | 1 | 95%
 | Block | 2 | 14 | 13 | 1 | 93%
Apache Validator | TimeValidator | 3 | 17 | 12 | 5 | 71%
 | PercentValidator | 4 | 13 | 10 | 3 | 77%
 | LongValidator | 3 | 17 | 12 | 5 | 71%
 | CurrencyValidator | 4 | 16 | 12 | 4 | 75%
 | CalendarValidator | 3 | 16 | 10 | 6 | 63%
Apache Collections | TransformedSortedMap | 4 | 24 | 16 | 8 | 67%
 | StaticBucketMap | 2 | 7 | 7 | 0 | 100%
 | ReferenceMap | 3 | 25 | 23 | 2 | 92%
 | PredicatedSortedMap | 4 | 17 | 8 | 9 | 47%
 | PassiveExpiringMap | 2 | 15 | 10 | 5 | 67%
 | LRUMap | 3 | 26 | 18 | 8 | 69%
 | LazySortedMap | 4 | 16 | 9 | 7 | 56%
 | FixedSizeMap | 3 | 16 | 10 | 6 | 63%
Apache BCEL | Method | 3 | 13 | 12 | 1 | 92%
 | Field | 3 | 12 | 11 | 1 | 92%
 | EnumElementValue | 2 | 9 | 6 | 3 | 67%
 | ConstantInvokeDynamic | 3 | 13 | 4 | 9 | 31%
 | ConstantFieldref | 3 | 15 | 5 | 10 | 33%
 | Code | 2 | 12 | 10 | 2 | 83%
 | ArrayElementValue | 2 | 6 | 5 | 1 | 83%
Ta4J | ClosePriceIndicator | 4 | 16 | 9 | 7 | 56%
 | MMAIndicator | 5 | 19 | 8 | 11 | 42%
 | ATRIndicator | 3 | 15 | 9 | 6 | 60%
 | SMAIndicator | 3 | 14 | 10 | 4 | 71%
 | DPOIndicator | 3 | 15 | 10 | 5 | 67%
 | RWIHighIndicator | 3 | 14 | 9 | 5 | 64%
 | MACDIndicator | 3 | 13 | 8 | 5 | 62%
Sum | | - | 668 | 473 | 195 | -
Average | | 3.15 | 16.7 | 11.8 | 4.9 | 72%
4.4. Running Tests and Checking Faulty Versions
Each faulty version generated by the process mentioned in the previous subsection is either a modified Java class file or a collection of several modified Java class files contained in a directory. For each faulty version, we replaced the modified Java files with the original Java classes in the source directory of the related project. Then, we compiled the modified project and ran the target class test suite against the compiled (modified) project. If there was any test failure, we labeled the faulty version as “detected”, which meant the test suite had been able to detect the seeded fault in the project.
We had to repeat these steps for all faulty versions to determine whether each one was detectable. Because these steps were repetitive and costly, we developed a helper tool, called MuRunner, to automate the process. This tool is integrated with the native Java compiler and the Maven build tool for compiling projects, and it supports different versions of JUnit for running tests. It is a command line tool that receives the root directory of the faulty versions as input. It then retrieves the list of all faulty versions in that directory and performs all the required steps: replacing the modified Java classes in the project, recompiling the modified project, running the JUnit test suite against the compiled project, and finally, collecting the results. These results show which faulty versions have been detected by the tests. MuRunner is publicly available through GitHub. In addition, all faulty versions produced during this evaluation are available alongside the tool as a sample project[11]. Using MuRunner, OCov4J, and the provided samples, anyone can reproduce the experiments of this research or perform additional ones.
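The loop below is a conceptual sketch of the check that MuRunner automates; it is not the tool's actual implementation, and the directory layout, source path, and Maven command are assumptions made for illustration only.

```java
// Conceptual sketch of the per-faulty-version check (not MuRunner's real code).
import java.io.IOException;
import java.nio.file.*;

public class FaultyVersionCheck {
    public static void main(String[] args) throws IOException, InterruptedException {
        Path faultyRoot = Paths.get(args[0]); // root directory holding the faulty versions
        Path project = Paths.get(args[1]);    // root of the benchmark project (Maven layout assumed)
        try (DirectoryStream<Path> versions = Files.newDirectoryStream(faultyRoot)) {
            for (Path version : versions) {
                // 1. Overwrite the original sources with the modified classes of this version.
                copyInto(version, project.resolve("src/main/java"));
                // 2. Recompile the project and run the target class test suite.
                int exit = new ProcessBuilder("mvn", "-q", "test")
                        .directory(project.toFile()).inheritIO().start().waitFor();
                // 3. Any test failure means the seeded fault has been detected.
                System.out.println(version.getFileName() + (exit != 0 ? ": detected" : ": not detected"));
                // (Restoring the original sources before the next iteration is omitted here.)
            }
        }
    }

    private static void copyInto(Path from, Path to) throws IOException {
        // Recursively copy the modified .java files over the originals.
        Files.walk(from).filter(Files::isRegularFile).forEach(p -> {
            try {
                Path target = to.resolve(from.relativize(p));
                Files.createDirectories(target.getParent());
                Files.copy(p, target, StandardCopyOption.REPLACE_EXISTING);
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        });
    }
}
```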
4.5. Results and Discussion
Table 3 shows the experimental results for the different faulty versions of the target classes. For each target class, in addition to the project name and the target class name, this table contains the following columns:
- OCov: The object statement coverage level of the used test suite for the target class;
- Cov: The traditional statement coverage level of the used test suite for the target class;
- Detected Faults (All): The number of faulty versions detected by the test suite vs. the total number of all faulty versions (either produced by MuJava or generated manually);
- Detected Faults (Auto): The number of auto-generated faulty versions detected by the test suite vs. the total number of auto-generated faulty versions (mutants generated by MuJava);
- Detection Ratio (All): The ratio of the number of detected faulty versions to the number of all faulty versions (either produced by MuJava or generated manually);
- Detection Ratio (Auto): The ratio of the number of detected faulty versions to the number of auto-generated faulty versions (mutants generated by MuJava).
Table 3
Experimental results for each target class
Project | Target Class | OCov | Cov | Detected Faults (All) | Detected Faults (Auto) | Detection Ratio (All) | Detection Ratio (Auto)
--- | --- | --- | --- | --- | --- | --- | ---
Google Graph | StandardValueGraph | 0.435 | 0.917 | 8 / 21 | 6 / 16 | 0.381 | 0.375
 | StandardMutableValueGraph | 0.400 | 0.662 | 6 / 30 | 4 / 16 | 0.200 | 0.250
 | StandardMutableNetwork | 0.406 | 0.746 | 7 / 36 | 5 / 30 | 0.194 | 0.167
 | StandardMutableGraph | 0.233 | 1.000 | 3 / 22 | 1 / 15 | 0.136 | 0.067
 | NetworkBuilder | 1.000 | 1.000 | 6 / 7 | 5 / 6 | 0.857 | 0.833
 | ImmutableValueGraph | 0.225 | 0.750 | 3 / 28 | 2 / 15 | 0.107 | 0.133
 | GraphBuilder | 0.958 | 0.967 | 6 / 8 | 5 / 5 | 0.750 | 1.000
JTetris | Square | 0.138 | 1.000 | 2 / 18 | 2 / 18 | 0.111 | 0.111
 | Screen | 0.563 | 1.000 | 5 / 10 | 5 / 10 | 0.500 | 0.500
 | S | 0.137 | 1.000 | 2 / 22 | 1 / 19 | 0.091 | 0.053
 | ReverseL | 0.135 | 1.000 | 1 / 21 | 1 / 18 | 0.048 | 0.056
 | Rectangle | 0.235 | 1.000 | 3 / 20 | 3 / 19 | 0.150 | 0.158
 | Block | 0.805 | 1.000 | 9 / 14 | 8 / 13 | 0.643 | 0.615
Apache Validator | TimeValidator | 0.522 | 1.000 | 5 / 17 | 4 / 12 | 0.294 | 0.333
 | PercentValidator | 0.625 | 1.000 | 5 / 13 | 4 / 10 | 0.385 | 0.400
 | LongValidator | 0.540 | 1.000 | 5 / 17 | 5 / 12 | 0.294 | 0.417
 | CurrencyValidator | 0.542 | 1.000 | 8 / 16 | 4 / 12 | 0.500 | 0.333
 | CalendarValidator | 0.645 | 1.000 | 2 / 16 | 1 / 10 | 0.125 | 0.100
Apache Collections | TransformedSortedMap | 0.507 | 1.000 | 11 / 24 | 8 / 16 | 0.458 | 0.500
 | StaticBucketMap | 0.815 | 0.817 | 5 / 7 | 5 / 7 | 0.714 | 0.714
 | ReferenceMap | 0.068 | 0.455 | 2 / 25 | 0 / 23 | 0.080 | 0.000
 | PredicatedSortedMap | 0.535 | 1.000 | 7 / 17 | 4 / 8 | 0.412 | 0.500
 | PassiveExpiringMap | 0.847 | 0.903 | 7 / 15 | 4 / 10 | 0.467 | 0.400
 | LRUMap | 0.438 | 0.642 | 8 / 26 | 6 / 18 | 0.308 | 0.333
 | LazySortedMap | 0.471 | 1.000 | 6 / 16 | 3 / 9 | 0.375 | 0.333
 | FixedSizeMap | 0.531 | 0.769 | 11 / 16 | 8 / 10 | 0.688 | 0.800
Apache BCEL | Method | 0.439 | 0.702 | 6 / 13 | 6 / 12 | 0.462 | 0.500
 | Field | 0.276 | 0.462 | 7 / 12 | 6 / 11 | 0.583 | 0.545
 | EnumElementValue | 0.640 | 0.647 | 4 / 9 | 3 / 6 | 0.444 | 0.500
 | ConstantInvokeDynamic | 0.444 | 1.000 | 5 / 13 | 1 / 4 | 0.385 | 0.250
 | ConstantFieldref | 0.500 | 1.000 | 5 / 15 | 1 / 5 | 0.333 | 0.200
 | Code | 0.606 | 0.612 | 5 / 12 | 4 / 10 | 0.417 | 0.400
 | ArrayElementValue | 0.686 | 0.741 | 2 / 6 | 2 / 5 | 0.333 | 0.400
Ta4J | ClosePriceIndicator | 0.148 | 1.000 | 0 / 16 | 0 / 9 | 0.000 | 0.000
 | MMAIndicator | 0.179 | 1.000 | 2 / 19 | 1 / 8 | 0.105 | 0.125
 | ATRIndicator | 0.136 | 0.500 | 2 / 15 | 0 / 9 | 0.133 | 0.000
 | SMAIndicator | 0.246 | 0.800 | 5 / 14 | 3 / 10 | 0.357 | 0.300
 | DPOIndicator | 0.234 | 0.889 | 4 / 15 | 3 / 10 | 0.267 | 0.300
 | RWIHighIndicator | 0.485 | 0.689 | 9 / 14 | 7 / 9 | 0.643 | 0.778
 | MACDIndicator | 0.206 | 0.750 | 2 / 13 | 0 / 8 | 0.154 | 0.000
To informally check whether there is a positive correlation between the object coverage level and the faulty version detection ratio, we use a scatter diagram. This diagram, based on the columns OCov and Detection Ratio (All) in Table 3, is depicted on the left side of Fig. 3. According to this scatterplot, there is a positive linear correlation between the object coverage level and the percentage of OO faulty versions detected by the test suite.
Although the scatterplot shows a positive correlation, to find the strength of this correlation, we should obtain the correlation coefficient value. To choose the appropriate method to calculate this coefficient, we examine some of the characteristics of the resulting experimental data.
As the first characteristic, the left side of the scatter diagram shows an approximately linear relationship between the two variables OCov and Detection Ratio (All). Next, using the Shapiro-Wilk test, which is one of the most powerful normality tests (Razali and Wah 2011), we check whether both variables follow a normal distribution. The null hypothesis \({H}_{0}\) of this test is that the variable is normally distributed. According to the test, the p-values for the variables OCov and Detection Ratio (All) are 0.056 and 0.096, respectively. Since the p-value for each variable is greater than the significance level α (0.05), we do not reject \({H}_{0}\) and treat both variables as normally distributed. Another characteristic of the data observed in the scatterplot is the absence of considerable outliers.
Based on these characteristics, we use the Pearson method to obtain the correlation coefficient between the object coverage level and the faulty version detection ratio. The Pearson correlation coefficient for these variables is 0.792, with 38 degrees of freedom (df); df is the number of samples minus 2. These values indicate a high positive correlation between the object coverage level and the failure detection ratio. This result is also statistically significant, because the critical value for a one-tailed test with significance level α = 0.05 and df = 38 is 0.257, and our Pearson correlation coefficient (0.792) is much greater than 0.257. If we use the results in Table 3 to draw a scatterplot for the traditional statement coverage level (column Cov in Table 3) and the faulty version detection ratio, as depicted on the right side of Fig. 3, no considerable correlation is observed. Moreover, the Pearson correlation coefficient between these two variables is -0.010, which is considered a negligible correlation (Razali and Wah 2011).
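For readers who wish to recompute these statistics, the snippet below sketches the coefficient calculation using Apache Commons Math (version 3.x assumed); the arrays are truncated placeholders that would hold one entry per target class, taken from the OCov and Detection Ratio (All) columns of Table 3.

```java
// Sketch of the correlation computation (Apache Commons Math 3.x assumed).
import org.apache.commons.math3.stat.correlation.PearsonsCorrelation;

public class CorrelationCheck {
    public static void main(String[] args) {
        // One entry per target class (only the first five rows of Table 3 are shown here).
        double[] oCov      = { 0.435, 0.400, 0.406, 0.233, 1.000 /* ... */ };
        double[] detection = { 0.381, 0.200, 0.194, 0.136, 0.857 /* ... */ };

        double r = new PearsonsCorrelation().correlation(oCov, detection);
        // With all 40 target classes, the degrees of freedom for the significance test are 40 - 2 = 38.
        System.out.printf("Pearson r = %.3f (df = %d)%n", r, oCov.length - 2);
    }
}
```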
As mentioned, we used a manual approach to generate more diverse mutants with better coverage of object-oriented faults. However, it may be argued that the manually generated mutants could bias the results (Fig. 3). To address this concern, first note that most mutants have been generated automatically by MuJava (as shown in Table 2, 72% of all mutants have been produced automatically). Second, the positive correlation remains significant if the evaluation is restricted to the auto-generated mutants only.
To examine this issue in more detail, the columns "Detected Faults (Auto)" and "Detection Ratio (Auto)" in Table 3 show the results of our experimental evaluation when only the 473 auto-generated mutants are taken into account. Also, Fig. 4 shows the scatterplot between the object coverage level in column OCov and the faulty version detection ratio (calculated over the auto-generated mutants only) in column Detection Ratio (Auto). The scatterplot reveals a positive correlation between these two variables.
To compare this with the overall correlation presented before, we again used the Pearson correlation coefficient to determine the strength of the correlation for the auto-generated mutants. The value of this coefficient is 0.779, which is slightly less than the correlation coefficient over all mutants, i.e., 0.792. In addition, this correlation coefficient is statistically significant because it is much greater than 0.257 (as mentioned before, the critical value for a one-tailed test with significance level α = 0.05 and df = 38). The right side of Fig. 4 shows the scatterplot between the traditional coverage level and the fault detection ratio over the auto-generated mutants; like the corresponding diagram that takes all mutants into account (right side of Fig. 3), it shows no significant correlation between the traditional coverage level and the detected OO-related faults.
4.5.1. Poly-object Coverage Criteria Evaluation
In addition to the main object coverage criteria, we proposed the "poly-object coverage criteria" to address problems related to polymorphism and dynamic binding. These criteria are essentially the same as the object coverage criteria, but they are defined for a class (as a base class) by considering its child and descendant classes. To evaluate the proposed poly-object coverage criteria, we considered 24 base classes from the benchmark projects introduced in Subsection 4.1. Each base class in our experiment has at least one child class and may also have other descendant classes. For each base class bc, we have selected the following versions from the set of faulty versions created for the descendant classes of bc (constructed as described in Subsection 4.3):
- Faulty versions of descendant classes of bc which have been generated by fault injection into the code of bc; and
- Faulty versions of descendant classes of bc that fall into the "Polymorphism and using inconsistent types" category (based on our fault categories specified in Subsection 4.3); this category addresses polymorphism and dynamic binding problems.
Table 4 shows the results of this evaluation for the base classes. Column NOD indicates the number of descendant classes of the base class (including its children). Column PolyCov shows the poly-object coverage level of the base class, considering its descendant classes listed in column Descendants. The next column, Detected Faults, reports the number of detected faulty versions vs. the number of all faulty versions. Finally, the last column gives the detection ratio of the faulty versions.
Using an approach similar to that of the previous subsection, we observe a high positive correlation between the columns PolyCov and Detection Ratio in Table 4. The scatter diagram shown on the left side of Fig. 5 clearly shows the correlation between the poly-object coverage level and the polymorphic-fault detection ratio. As before, due to the characteristics of the data in columns PolyCov and Detection Ratio, we used the Pearson correlation coefficient to determine the strength of the observed correlation. The value of this coefficient for the resulting data is 0.821. This correlation coefficient is statistically significant because the critical value for a one-tailed test with significance level α = 0.05 and df = 22 is 0.330, and our Pearson correlation coefficient (0.821) is greater than 0.330.
Table 4
Experimental results for the poly-object coverage level
Project | Base Class | NOD | Descendants | PolyCov | Detected Faults | Detection Ratio
--- | --- | --- | --- | --- | --- | ---
Google Graph | StandardValueGraph | 1 | StandardMutableValueGraph | 0.278 | 1 / 9 | 0.111
 | AbstractValueGraph | 2 | StandardValueGraph, StandardMutableValueGraph | 0.083 | 0 / 14 | 0.000
 | AbstractBaseGraph | 3 | StandardMutableValueGraph, StandardMutableGraph, StandardValueGraph | 0.161 | 2 / 16 | 0.125
 | StandardNetwork | 1 | StandardMutableNetwork | 0.326 | 2 / 7 | 0.286
JTetris | Grid | 6 | Screen, S, ReverseL, Rectangle, Square, Block | 0.244 | 3 / 8 | 0.375
 | Block | 4 | S, ReverseL, Rectangle, Square | 0.125 | 1 / 5 | 0.200
 | Polygon | 2 | S, ReverseL | 0.125 | 1 / 6 | 0.167
Apache Collections | TransformedMap | 1 | TransformedSortedMap | 0.484 | 3 / 8 | 0.375
 | PredicatedMap | 1 | PredicatedSortedMap | 0.548 | 2 / 5 | 0.400
 | LazyMap | 1 | LazySortedMap | 0.250 | 5 / 9 | 0.556
 | AbstractMapDecorator | 5 | PassiveExpiringMap, FixedSizeMap, LazySortedMap, PredicatedSortedMap, TransformedSortedMap | 0.374 | 10 / 23 | 0.435
 | AbstractLinkedMap | 1 | LRUMap | 0.292 | 4 / 14 | 0.286
 | AbstractReferenceMap | 1 | ReferenceMap | 0.054 | 2 / 20 | 0.100
Apache BCEL | Attribute | 1 | Code | 0.630 | 1 / 2 | 0.500
 | ElementValue | 2 | EnumElementValue, ArrayElementValue | 0.563 | 2 / 4 | 0.500
 | ConstantCP | 2 | ConstantFieldref, ConstantInvokeDynamic | 0.382 | 3 / 9 | 0.333
 | Constant | 2 | ConstantFieldref, ConstantInvokeDynamic | 0.333 | 2 / 5 | 0.400
 | FieldOrMethod | 2 | Field, Method | 0.484 | 1 / 2 | 0.500
Apache Validator | AbstractCalendarValidator | 2 | TimeValidator, CalendarValidator | 0.530 | 9 / 16 | 0.563
 | BigDecimalValidator | 2 | PercentValidator, CurrencyValidator | 0.500 | 2 / 5 | 0.400
 | AbstractNumberValidator | 3 | PercentValidator, CurrencyValidator, LongValidator | 0.479 | 6 / 11 | 0.545
Ta4J | RecursiveCachedIndicator | 1 | MMAIndicator | 0.167 | 0 | 0.000
 | CachedIndicator | 7 | ATRIndicator, RWIHighIndicator, MACDIndicator, MMAIndicator, ClosePriceIndicator, DPOIndicator, SMAIndicator | 0.111 | 2 | 0.133
 | AbstractIndicator | 7 | ATRIndicator, RWIHighIndicator, MACDIndicator, MMAIndicator, ClosePriceIndicator, DPOIndicator, SMAIndicator | 0.595 | 12 | 0.522
4.6. Limitations
The object code coverage criteria emphasize the execution of different parts of the code that represent the state and behavior of an object. These parts include the code of the main class along with the code of the inherited classes, while considering the actual type of the object under test at runtime. These criteria can only show that the different parts of the code related to the object are executed at least once. In other words, these criteria, like the traditional code coverage criteria, only enforce the execution of different parts of the code and do not necessarily show the presence or absence of failures in the program.
One of the main means for revealing failures (especially failures rooted in the logic of programs or classes) is the assertion part of test cases, which plays the role of the test oracle. Like the traditional code coverage criteria, the object coverage criteria are not able to effectively evaluate the test case assertions. This is why our experimental evaluation has been conducted on very simple faults and mutants, which are more likely to be detected by the assertions of auto-generated tests when the parts of the code that contain the faults are executed.
Therefore, our evaluation results cannot show that achieving a high object coverage level for a test suite necessarily leads to revealing real OO failures. This claim is beyond the scope of this research, and examining the correlation between high object coverage levels and the ability to detect real-world OO failures could be the subject of future work.
Since we have used OO mutations, i.e., MuJava class mutations, to generate mutants, and we have shown the correlation of the object coverage criteria with the detection ratio of these mutants, one could claim that the object coverage criteria are not necessarily required, when there exist mutation analysis techniques which support OO-related mutants. However, in addition to existing OO mutation approaches, like MuJava, we have used various techniques to generate faulty versions. These techniques, which are not practically used in existing OO mutation tools, make simultaneous changes in some part of the class under test, as well as its parent and ancestor classes. They also apply simultaneous changes in several family classes. Furthermore, like the traditional code coverage approaches, our "object coverage" approach could be used very simply, with high automation and with negligible execution cost. In contrast, there are not many automated tools for OO mutations, and also, like all mutation techniques, OO mutation methods have high execution cost (Segura et al. 2011).
4.7. Threats to Validity
One of the internal validity threats could be potential faults in the OCov4J tool implementation. We have tested this tool thoroughly. In addition, we have published the tool as an open source software for further evaluation and extension. As part of testing OCov4J, we have compared this tool with a mature traditional code coverage tool, JaCoCo. To do so, we have applied both tools to some classes without any parent or child. Based on our arguments in Subsection 3.1, applying the object coverage criteria and the traditional code coverage criteria to these classes must lead to the same results. Fortunately, OCov4J and JaCoCo tools have performed according to this expectation.
There are two points to note with respect to external validity threats. The first concerns the benchmarks chosen for our evaluation (refer to Subsection 4.1). We attempted to select as many different benchmark projects as possible and chose real-world, widely used projects in different domains such as validators, file decoders, data structures, and analysis software. Moreover, in selecting classes from each project, we have chosen classes in different inheritance hierarchies with different depths. The second threat to external validity could stem from our approach to generating faulty versions (refer to Subsection 4.3), which raises the possibility of bias and may make the generality of the results questionable. To address this concern, we followed highly cited approaches such as (Offutt et al. 2001; Ma et al. 2006; Offutt et al. 2006) to manually seed different types of OO faults. In addition to the manual faulty versions, we also used mutants automatically generated by MuJava with its different mutation operators for OO features such as inheritance and polymorphism. We also tried to avoid equivalent mutants by examining all auto-generated mutations. We have published our faulty versions as a benchmark alongside our open source tools, OCov4J and MuRunner, to facilitate the reproduction of the results and further experimental evaluation.
[4] https://github.com/BasLeijdekkers/MetricsReloaded
[5] https://github.com/sbu-test-lab/jtetris
[6] https://commons.apache.org/validator/
[7] https://commons.apache.org/bcel/
[8] http://commons.apache.org/collections/
[9] https://github.com/google/guava
[10] https://github.com/ta4j/ta4j
[11] https://github.com/sbu-test-lab/mu-runner