Object coverage criteria for supporting object-oriented testing

Code coverage criteria are widely used in object-oriented (OO) domains as test quality indicators. However, these criteria are based on the procedural point of view and therefore do not address the specific features of OO programs. In this article, we extend the code coverage criteria and introduce a new set of criteria, called "object coverage criteria," which copes with OO features such as object instantiation, inheritance, polymorphism, and dynamic binding. Unlike previous criteria, the new criteria consider the actual type of the object under test as well as the code inherited from parent/ancestor classes that represents the object's states and behaviors. The new criteria have been implemented in a prototype tool called OCov4J for the Java language. Using this tool and conducting an empirical study on 270 classes (with about 50 K lines of code, excluding blank lines and comments) from several large and widely used open source projects, we have found a considerable positive correlation between the object coverage level (defined via the newly proposed criteria) and the number of detected OO-specific failures. Not only do the proposed criteria provide ease of use, high automation, and low execution cost, but they can also be applied effectively to real-world OO programs.


Introduction
Code coverage criteria or metrics, such as statement or branch coverage, have been widely used in object-oriented (OO) domains to select, stop, and validate tests (Binder, 2000), or to guide the automated test generation process for various OO languages (Fraser & Arcuri, 2015; Gay et al., 2015; Gopinath et al., 2014; Schwartz et al., 2018). Most existing automated test generation tools use code coverage criteria as a quality indicator to find a set of test inputs that cause high code coverage. For example, these criteria have been widely used to design fitness functions in search-based test techniques, in order to guide the test data generation process as well as to evaluate generated tests. One of the prominent works in this area is the EvoSuite approach, which generates JUnit tests for Java (Fraser & Arcuri, 2011).
Regardless of how effective the code coverage criteria are compared to other criteria (such as data-flow coverage and mutation-based criteria), they are commonly used in the software industry as unit test quality measures (Hemmati, 2015). This popularity can be attributed to their ease of use, in addition to the availability of fully automated tools that support them for different programming languages and various technologies, with negligible execution cost.
Existing code coverage criteria have been derived according to the procedural point of view. Therefore, they determine which parts of the program code (mainly the code of the program's procedures/functions) are executed at least once during the tests. In the OO programming context, code coverage indicates which parts of class methods (instead of procedures in procedural programs) are executed. However, OO programs are not just sets of methods embedded in different classes. OO programs rely considerably on user-defined types introduced according to the concepts of abstraction and encapsulation. These types include state space and behavior and form different objects in the execution phase. Existing types or classes may be reused to define new types; this reuse can be realized through inheritance or aggregation. Also, polymorphism, along with dynamic binding mechanisms, can cause different behaviors from a single object.
All these facilities may result in new types of errors and make testing OO software more complex than testing conventional procedural programs (Alexander et al., 2002; Ghoreshi & Haghighi, 2016; Ma et al., 2006; Offutt et al., 2001; Perry & Kaiser, 1990). Consequently, traditional code coverage criteria might not be equally effective for the OO domain and the procedural domain. The following are some of the main issues with using code coverage criteria for OO programs:

• Methods/classes interactions: In comparison to procedural programs, OO programs have significantly shorter and simpler methods; instead of using lengthy and complex methods, these programs implement the needed functionalities via interactions among dozens of simple methods and relationships between classes (Alexander et al., 2010). For example, consider a very common approach in OO programs in which a method in a class uses an overridden method from one of its parent or ancestor classes to do a piece of logic and then does its own logic. This is an example of interaction between two methods of different classes to implement a single functionality. Traditional code coverage criteria do not consider such interactions and regard each method separately and in isolation. This way, if an overridden method is fully executed during the tests, the used code coverage criterion (like statement coverage) is satisfied while the interaction between methods is not addressed at all. By ignoring interactions between different classes and methods, we may easily achieve a very high code coverage level in OO programs, because most methods are usually small and simple.

• Object instantiations and different object types: Apart from special cases (such as classes with static methods in Java or C++), we require executable instances to execute classes and their related methods. These instances are created through "object instantiation" in OO programs. Because of object instantiation, code in a common class may be executed by different objects of different classes; for example, suppose the class of the object under test has been derived from a common parent/ancestor class. Since the actual type of the object executing the code of a common class is not considered by the code coverage criteria, a class that is never executed by objects of its own type may achieve a high code coverage level.

• Inheritance issues: Inheritance is one of the fundamental concepts in object orientation. Many classes reuse the definitions of other classes to define themselves. An inheriting class may change the definition of some inherited methods or extend their definitions. With inheritance, therefore, it is more likely to have new types of faults in programs (Offutt et al., 2001). The likelihood of these faults increases with the level of inheritance (Aziz et al., 2019). Since the code coverage criteria only examine the code of the class under test and do not consider the inherited classes, they probably result in insufficient tests for inherited methods.

• Polymorphism and dynamic binding issues: Polymorphism and dynamic binding allow an object to take other forms of descendant classes in different executions. These powerful features can lead to various faults, like yo-yo problems, in OO programs (Offutt et al., 2001). However, faults resulting from polymorphism fall outside the scope of the traditional code coverage criteria, because these criteria do not take into account what type of object executes which part of the code. In this paper, we intend to address these polymorphism issues in a basic structural coverage approach. It should be noted that some coverage criteria have previously been proposed for such issues, primarily based on the data-flow testing notion; for instance, we can mention the work of Orso and Pezze (1999) or Alexander and Offutt (2000). These studies will be reviewed in detail in Sect. 5.
This paper provides new test adequacy criteria that address the specific OO features mentioned above. In fact, the newly proposed coverage criteria consider issues related to object instantiation, inheritance, polymorphism, and dynamic binding, in addition to the issues of traditional procedural programs. We call these new coverage criteria "object coverage criteria". These criteria generally act like code coverage criteria. However, they especially consider the actual type of the object under test as well as the parts of the code from parent/ancestor classes that represent or affect the object's states or behaviors. Thus, the new criteria are extensions of the traditional code coverage criteria and subsume them. The process of measuring the object coverage criteria is similar to the measurement of the traditional criteria, and they can easily be applied to any OO program with a fully automated tool. Given these features, the object coverage criteria can completely replace the code coverage criteria for OO programs.
In addition to introducing the new criteria, we have implemented a prototype tool, called OCov4J, for automatically measuring and reporting these criteria for the Java language. Although OCov4J is a prototype tool, in practice, it can be applied to many real-life Java projects, such as those used in the evaluation section. Using this tool in an empirical evaluation, we applied the object coverage criteria to different inheritance hierarchies, containing a total of 270 Java classes from five large and widely used open source projects along with an educational project. By seeding 668 OO-related faults into some of these classes and analyzing the results, we found a strong positive correlation between the coverage level, defined by the new criteria, and the number of OO faults revealed.
The content of this article is organized as follows. In the next section, we describe the problem context in three areas with some examples. Section 3 introduces our new test adequacy criteria. Section 4 covers our empirical evaluation process and results. Next, we briefly review some related approaches in Sect. 5. Finally, Sect. 6 concludes the paper and introduces some ideas for future work.

Problem context and motivating examples
The code coverage criteria are very common. They are indeed basic criteria for evaluating tests of a program. These criteria are also widely used in industry to determine the effectiveness of unit tests. There are various tools supporting the whole process of instrumenting code, running unit tests, measuring code coverage criteria, and finally, reporting their values. For example, the JCov (part of OpenJDK) and JaCoCo tools are very popular in Java and are used by many open source projects.
The process of calculating coverage metrics for an OO program, such as Java code, often begins with instrumentation. This can be done by instrumenting either the source code or the final executable bytecode. Code instrumentation usually adds extra statements to each line of the program (or to each main block, or before each branch of the program). This way, the execution of the added statements implies the execution of the instrumented part of the code. After code instrumentation and test execution, the ratio of the number of code sections executed by tests to the total number of code sections is calculated and reported as the code coverage level. Various coverage criteria, such as "statement coverage," "line coverage," and "branch coverage," can be defined according to the code structures that are targeted to be covered by tests.
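To make the idea concrete, the following hypothetical Java snippet illustrates what line-level source instrumentation might look like; the CoverageProbes helper and the Account class are invented for illustration, and tools like JaCoCo actually instrument bytecode rather than source code:

    // Hypothetical probe store: one flag per instrumented line.
    class CoverageProbes {
        static final boolean[] executedLines = new boolean[32];
        static void hit(int line) { executedLines[line] = true; }
    }

    // A class after (hypothetical) source-level instrumentation: a probe
    // call has been inserted before every original statement.
    class Account {
        private int balance;

        void deposit(int amount) {
            CoverageProbes.hit(6);                 // probe for the 'if' line
            if (amount <= 0) {
                CoverageProbes.hit(7);             // probe for the 'throw' line
                throw new IllegalArgumentException("amount must be positive");
            }
            CoverageProbes.hit(9);                 // probe for the assignment line
            balance = balance + amount;
        }
    }

After the tests run, the line coverage level is simply the number of true entries in executedLines divided by the number of instrumented lines.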
In the following, we illustrate some issues and problems of applying the traditional code coverage criteria to OO programming languages through simple examples. Although we use the Java and C++ programming languages in our examples, the mentioned issues are language independent and are common to most OO languages.

Issue 1: the executor object type
Unlike procedural languages, which use procedures and functions as abstraction mechanisms, the most important abstraction mechanisms in OO languages are user-defined types that define state (data) and behavior. These types are usually created in most OO languages using the notion of "class". In defining new classes, previous classes can be reused via the aggregation and inheritance mechanisms. Objects are created dynamically through a mechanism called "instantiation", which realizes an object at runtime. In fact, the different behaviors of a class (commonly known as class methods) require an object in order to be executed. In this paper, we call this object "the executor object". The traditional code coverage criteria only consider the static space of a class and do not consider runtime objects. This can cause a test suite to achieve high code coverage for a class even though no object of the class type executes the code of this class, or only a small part of the class is executed. To clarify this issue, consider the two sample classes in Java sketched below, which model two types of stacks. The Stack class models a simple stack in Java. Objects of this class set the maximum stack length using the class constructor during instantiation. This class defines two methods, push and pop, to add/remove elements to/from the stack. The CircularStack class models a simple LIFO fixed-length buffer. According to this class, when the element array of the buffer is full, new elements are placed at the beginning of this array; and when the index of the current value of the buffer is zero, it jumps to the end of the element array. This class is defined using the previous Stack class as its parent class. In fact, it inherits and redefines the parent's methods push and pop.
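The two classes are sketched below; for brevity, only the essential structure is shown, and the line numbers cited in the text refer to the full original listing:

    class Stack {
        protected int[] elements;
        protected int index;

        public Stack(int length) {
            elements = new int[length];   // line 6
            index = 0;                    // line 7
        }

        public void push(int value) {
            // Seeded fault: the overflow check (lines 10-11) is commented out.
            // if (index >= elements.length)
            //     throw new IllegalStateException("stack is full");
            elements[index++] = value;    // line 12
        }

        public int pop() {
            // Seeded fault: the underflow check (lines 15-16) is commented out.
            // if (index <= 0)
            //     throw new IllegalStateException("stack is empty");
            return elements[--index];     // line 17
        }
    }

    class CircularStack extends Stack {
        public CircularStack(int length) { super(length); }

        @Override
        public void push(int value) {
            if (index >= elements.length)
                index = 0;                // wrap to the beginning when full
            super.push(value);            // reuse the parent's push logic
        }

        @Override
        public int pop() {
            if (index <= 0)
                index = elements.length;  // jump to the end when at the start
            return super.pop();           // reuse the parent's pop logic
        }
    }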
It should be noted that we have seeded two faults by commenting out lines 10-11 and 15-16 of class Stack, which can result in failures at runtime. Now, consider the test suite CircularStack_TestSuite below, which contains only one test case for CircularStack. This test passes and reveals no bug in the stack implementation. Using the line coverage criterion, this test case results in 100% code coverage for class Stack. This means that although class Stack has not been tested by actual Stack objects at all, its code coverage level is 100%. However, if we run a similar test using an object instantiating the Stack class, we may encounter a runtime exception due to our seeded faults. For example, the simple test suite Stack_TestSuite (also sketched below) leads to an IndexOutOfBoundsException error in Java, indicating an attempt to access an invalid index within the element array. As shown in this example, the traditional code coverage criterion incorrectly assumes coverage of a class while it has not been tested directly. This condition can mislead programmers and cause them not to write separate unit tests for the Stack class. In the next section, we consider the type of the executor object in defining object coverage criteria in order to address this issue.
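The two test suites are sketched below in JUnit 4 style (again, only the essential structure is shown):

    import static org.junit.Assert.assertEquals;
    import org.junit.Test;

    public class CircularStack_TestSuite {
        @Test
        public void CircularStack_Test() {
            Stack s = new CircularStack(2);
            s.push(1);
            s.push(2);
            s.push(3);                 // wraps around instead of overflowing
            assertEquals(3, s.pop());
            assertEquals(2, s.pop());
            assertEquals(3, s.pop());  // jumps to the end instead of underflowing
        }
    }

    class Stack_TestSuite {
        @Test
        public void Stack_Test() {
            Stack s = new Stack(2);
            s.push(1);
            s.push(2);
            s.push(3);   // with the seeded-out overflow check, this accesses
                         // elements[2] and throws ArrayIndexOutOfBoundsException
        }
    }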

Issue 2: inheritance and ancestor classes
Inheritance is one of the basic concepts and one of the main integration mechanisms in OO programming languages. Inheritance differs from the other main type of integration in OO, namely aggregation, in several ways. A fundamental difference is that the encapsulation of an ancestor class (ancestor classes are the types that a class indirectly inherits through its parent or super class) may not be preserved through inheritance, meaning that the new inherited class can access and change the internal representation of the ancestor classes. Offutt et al. (2001) have interpreted inheritance as "internal representation integration" and have enumerated problems that arise from the combination of the new class's states/behavior and the states/behavior inherited from the ancestor classes.
The code coverage criteria regard the execution of each class's code separately and in isolation, and do not consider the inherited parts of the parent/ancestor classes. Therefore, problems related to how the class under test interacts with the states and behaviors of the inherited classes are excluded from the scope of these criteria. To explain the issue more precisely, consider the following example containing two classes, List and ClearableList. The former models a simple list backed by an array, and the latter models a simple list with an extra method, named clear, for deleting all elements of the list at once. ClearableList inherits the class List and adds the clear method to clear the list. We have seeded a bug into ClearableList by commenting out line 7 of the class ClearableList.
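The two classes are sketched below; only the essential structure is shown, and the line numbers cited in the text (e.g., line 7 of ClearableList) refer to the full original listing:

    class List {
        protected int[] elements;
        protected int index;

        public List() {
            elements = new int[100];
            index = 0;
        }

        public void add(int value) {
            int pos = index;
            elements[pos] = value;
            index = pos + 1;
        }

        public int remove() {
            if (index <= 0)
                throw new IllegalStateException("list is empty");
            int value = elements[index - 1];
            index = index - 1;
            return value;
        }

        public int size() {
            return index;
        }
    }

    class ClearableList extends List {
        public void clear() {
            for (int i = 0; i < elements.length; i++)
                elements[i] = 0;   // discard all stored elements
            // index = 0;          // seeded fault: resetting the inherited index
                                   // (line 7 of the original listing) has been
                                   // commented out
        }
    }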

Now consider the test suite List_TestSuite1 below, which contains four test cases to validate the implementations of the above classes. Test cases List_Test1, List_Test2, and List_Test3 pass and achieve 100% line coverage for class List. The ClearableList_Test1 test, which also passes, provides 100% line coverage for class ClearableList and cannot reveal our seeded bug. Although ClearableList_Test1 only tests the method defined in ClearableList and does not test the inherited methods (like add or remove), it results in 100% line coverage. Nevertheless, the seeded bug in the ClearableList class can easily be detected by a simple test that uses methods inherited from the parent class. For example, imagine the test case ClearableList_Test2 (also sketched below), another test for the ClearableList class that, in addition to testing the child state space, tests the parent state space by calling the inherited method add. Unlike the previous test, ClearableList_Test2 fails and reveals a failure in the implementation. Using this test, after the execution of the add method (line 4 of the test), one unit is added to the index variable; but when the method clear is called, although it resets the parent state variable elements, it does not reset the value of the parent's state variable index (as mentioned, line 7 of ClearableList was commented out to create this fault). Hence, the list length in the assertion section of the test becomes equal to one, which causes the test to fail.
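The test suite is sketched below, together with the additional test ClearableList_Test2 discussed above (essential structure only):

    import static org.junit.Assert.assertEquals;
    import org.junit.Test;

    public class List_TestSuite1 {
        @Test public void List_Test1() {            // covers the constructor and add
            List list = new List();
            list.add(7);
            assertEquals(1, list.size());
        }

        @Test public void List_Test2() {            // covers remove's normal path
            List list = new List();
            list.add(7);
            assertEquals(7, list.remove());
            assertEquals(0, list.size());
        }

        @Test public void List_Test3() {            // covers remove on an empty list
            List list = new List();
            try { list.remove(); } catch (IllegalStateException expected) { }
        }

        @Test public void ClearableList_Test1() {   // only exercises the child's own code
            ClearableList list = new ClearableList();
            list.clear();
            assertEquals(0, list.size());
        }
    }

    class ClearableList_Test2 {
        @Test public void clearAfterAdd() {
            ClearableList list = new ClearableList();
            list.add(7);     // increments the inherited index variable
            list.clear();    // resets elements but (due to the seeded fault) not index
            assertEquals(0, list.size());   // fails: size() returns 1
        }
    }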
In light of this example, when defining our new coverage criteria, we should consider the parts of the class state and behavior that are inherited from parent or ancestor classes.

Issue 3: polymorphism and descendant classes
Like inheritance, polymorphism is one of the key concepts in OO languages. It allows an object to take many different shapes. In an inheritance hierarchy, polymorphism allows an object of a particular class to bind to objects of its child or descendant classes. In other words, an object can take the form of its upper classes in the inheritance hierarchy. In addition to polymorphism, dynamic binding allows an object to bind to any descendant type dynamically at runtime. Dynamic binding causes the internal representation of a type to change dynamically at runtime. Inheritance, polymorphism, and dynamic binding create a very flexible type of integration in OO languages, which is called "abstract integration" by some authors (for example, Alexander et al., 2010). Although one strength of abstract integration is robust and flexible design, its complexity can yield new faults that are not easily detectable by conventional test methods (Alexander et al., 2010). Some possible faults in OO programs that occur due to polymorphism and dynamic binding are categorized in (Offutt et al., 2001).
Since the code coverage criteria consider neither the inherited classes in a class's inheritance hierarchy nor the actual type bound to an object, polymorphism and dynamic binding issues are outside the scope of the code coverage criteria. In the following, we present an example in the C++ language to highlight the issues with the traditional code coverage criteria. Consider a hierarchy of classes Counter, ResetCounter, and OneBasedCounter, sketched below. The class Counter models a simple counter that has the inc method to increase the counter and the value method to get the current value of the counter. The next class in the inheritance hierarchy is the ResetCounter class, which acts like the previous one but has one additional method, called reset, which resets the counter value. At the end of the inheritance hierarchy, there is a class called OneBasedCounter. It inherits from the ResetCounter class, but it has been changed so that its initial value starts at one instead of zero.

Now, consider the test suite Counter_TestSuite. This test suite consists of three tests that provide 100% line coverage for all three mentioned classes. These tests also pass and do not show any failure. Next, consider the test CounterPoly_Test (included in the sketch below), which is like OneBasedCounter_Test but uses polymorphism to shape an object of OneBasedCounter as a ResetCounter. As shown in CounterPoly_Test, in line 2, an object of the type ResetCounter is declared, but it is bound to the OneBasedCounter class. Interestingly, this test does not pass. By reviewing the source code of ResetCounter, a bug related to polymorphism and dynamic binding issues is found: the ResetCounter class does not use the virtual keyword to define its reset method (line 4 in ResetCounter). To fix this fault, we should declare reset as virtual. If we do not use the virtual keyword in line 4 of class ResetCounter in the definition of the reset method, the dynamic call of the overridden method in the subclasses is disabled when an object is used in a polymorphic manner. In our example, in line 4 of CounterPoly_Test, when the method reset is called on the object c, which is dynamically bound to the descendant class OneBasedCounter, the original definition in ResetCounter is called instead of the reset method in the OneBasedCounter class; this call resets the value of the counter to zero rather than one. Although this is a simple example, similar issues lead to faults in real-world applications; Mcheick et al. (2010) have shown that misuse of the virtual keyword is one of the sources of common bugs in C++. As this example shows, all classes in an inheritance hierarchy may be tested, and a high level of a traditional code coverage criterion may be obtained, yet there may still be problems with using these classes in a polymorphic manner. The traditional code coverage criteria do not help programmers in this regard and do not provide information about whether classes are sufficiently tested in a polymorphic manner.
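The hierarchy and the polymorphic test are sketched below; only the essential structure is shown, and the line numbers cited in the text refer to the full original listing:

    #include <cassert>

    class Counter {
    protected:
        int count;
    public:
        Counter() { count = 0; }
        void inc() { count = count + 1; }
        int value() { return count; }
    };

    class ResetCounter : public Counter {
    public:
        // Seeded fault: reset is not declared virtual, so a call through a
        // ResetCounter pointer never dispatches to an overriding definition.
        // The fix is to declare it as: virtual void reset() { count = 0; }
        void reset() { count = 0; }
    };

    class OneBasedCounter : public ResetCounter {
    public:
        OneBasedCounter() { count = 1; }
        void reset() { count = 1; }   // a one-based counter restarts at one
    };

    // CounterPoly_Test: like OneBasedCounter_Test, but the object is used
    // through a ResetCounter pointer (polymorphism + dynamic binding).
    int main() {
        OneBasedCounter obc;
        ResetCounter* c = &obc;    // declared ResetCounter, bound to OneBasedCounter
        c->inc();
        c->reset();                // statically bound: ResetCounter::reset runs,
                                   // setting count to zero
        assert(c->value() == 1);   // fails: value() returns 0 instead of 1
        return 0;
    }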

Object coverage criteria
In the previous section, we reviewed several problems showing that, due to their procedural nature, the code coverage criteria are not aligned with key concepts of the OO paradigm, such as object instantiation, inheritance, polymorphism, and dynamic binding. In this section, we propose an approach to adapt the traditional code coverage criteria to these specific OO concepts, such that the new coverage criteria, called "object coverage criteria," keep the advantages of their previous counterparts. These strengths include simplicity, automation capability, and usability with low execution cost.
The object coverage criteria generally work like the traditional criteria; however, the following two aspects are considered when measuring the newly proposed criteria:

• The type of the executor object: Unlike the traditional criteria, the type of the object that executes each unit of the code (for example, each line, statement, or branch of code) is considered. For example, a part of a class's code may be executed by objects of the same class or by objects of another class. The proposed criteria distinguish between these two cases.

• All inherited classes forming the internal representation of the class under test: Unlike the traditional criteria, for each class under test, the parts of the inherited classes (parent and ancestor classes) that represent the whole states/behaviors of the class are used to calculate the object coverage level.

Object coverage criteria definition
Consider an inheritance hierarchy like the one shown in Fig. 1. Class C inherits the class Parent_C and n ancestors Ancestor_1, …, Ancestor_n. The object coverage criteria can informally be introduced as follows. First, imagine a new class C_f as a flattened version of class C together with its parent and all its ancestors.

In other words, C_f contains the code of C along with all the accessible inherited code from Parent_C and the n ancestors Ancestor_1, …, Ancestor_n. By "all inherited code," we mean all methods (class constructors are also considered methods) that are defined in Parent_C or in any of the n ancestors. By "accessible inherited code," we mean those inherited methods that are accessible in C. For example, in Java, the accessible inherited code includes all methods or constructors that are defined in the parent or ancestor classes as non-private, i.e., public, protected, or package-private. Now, the object coverage criteria for class C are equivalent to the traditional code coverage criteria for class C_f, provided that all pieces of code (like statements, lines, blocks, or branches) in C_f are executed at least once by an object instantiated from class C.
Like the traditional code coverage criteria, the object coverage criteria comprise a family of criteria, each emphasizing a particular unit/structure of the code. A coverage criterion is measured according to the percentage of the corresponding units/structures that are executed by the generated test data. For example, we can define object coverage criteria for statements, lines, branches, and other parts of either the code or the control flow graph derived from the source code. For the sake of simplicity, in the following, we only formally define an object coverage criterion for statements; criteria for other structures like lines, basic blocks, and branches can be defined formally in a similar way.
Definition 1. Object statement coverage criterion. According to this criterion, the set of test requirements (abbreviated TR) for a given class C is equal to all statements of C, along with all statements that

• exist in either its parent or its ancestor classes, and
• are accessible from C.

By this criterion, each statement in TR should be executed at least once by an object bound to class C using the provided test suite.
The given formal definition is based on the notion of test requirements introduced in Ammann and Offutt (2016). As mentioned there, for every coverage criterion, we can also define a coverage level for each test suite TS. This is simply the ratio of the number of test requirements in TR that are covered by TS to the size of TR.
Definition 2. Object statement coverage level. Consider a class C, a test suite TS, and TR as the set of test requirements derived based on the "object statement coverage criterion" for class C. The "object statement coverage level" of TS is the percentage of elements in TR that are executed by an object bound to C during the execution of TS.
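Written out explicitly (with Stmts(X) used here as a shorthand for the set of statements defined in class X), Definitions 1 and 2 amount to:

    TR(C) = Stmts(C) ∪ { s ∈ Stmts(Parent_C) ∪ Stmts(Ancestor_1) ∪ … ∪ Stmts(Ancestor_n) : s is accessible from C }

    object statement coverage level(C, TS)
        = |{ s ∈ TR(C) : s is executed by an object of type C during TS }| / |TR(C)| × 100%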
The concept of "criteria subsumption" makes it possible to compare various coverage criteria (Ammann & Offutt, 2016).The coverage criterion x subsumes the coverage criterion y if and only if every test suite that satisfies x also satisfies y (Ammann & Offutt, 2016) (here, satisfaction means achieving 100% coverage level).By this definition, we can simply argue that the "object statement coverage criterion" subsumes the traditional "statement coverage criterion" because, using the former, the set of test requirements will include all statements of the class under test and will impose these statements to be executed at least once by the given test suite.These are indeed the statements that should be covered by the traditional statement coverage criterion.
Obviously, the object coverage criteria work exactly like the traditional code coverage criteria for every class outside any inheritance hierarchy (i.e., a class that neither inherits from any other class nor is inherited by any other class). This is because such a class has no inherited code (it does not inherit from any class), and its code can only be executed by objects of the class's own type (it is not inherited by any other class).
Another point to consider is that some OO languages, like Java, C++, and Python, provide static methods belonging to a class in order to support utility or helper functions and procedures. These act like procedures or functions in the procedural programming paradigm. Therefore, we do not need an instance object to invoke a static method; instead, we can invoke it statically through the name of its class. Since these methods do not belong to any object, there is no executor object executing them; hence, for them, the proposed criteria act like the traditional criteria, and only the execution of static code (regardless of the type of the executor object) is considered.
The last point to note is that, when we refer to an inheritance hierarchy that ends with the class under test (see Fig. 1), only domain classes are included; domain classes are classes designed and implemented to solve the domain-dependent problem. Accordingly, basic classes, such as the Object class, from which all Java classes implicitly inherit, are not included. In addition, if a class inherits from an external class (from an external library), that class will not be part of the inheritance hierarchy of the class under test. This type of inheritance is known as cross-domain inheritance. We do not consider classes that are inherited in a cross-domain manner for the following two reasons:

1. As with traditional code coverage approaches, external library code is not considered in our proposed approach, in order to keep the proposed criteria focused on the classes under test and their relationships, and to avoid the complexity of inherited classes outside the classes under test.

2. Cross-domain inheritance relationships are generally not recommended in object-oriented programming. This is because they can create tight couplings between unrelated classes, making code harder to maintain and modify. Additionally, this type of inheritance often violates the "Single Responsibility Principle". Therefore, it is commonly recommended to avoid this type of inheritance and to use composition or relevant design patterns instead. This recommendation is given in popular books such as "Effective Java" by Bloch (2008) and "Design Patterns" by Gamma et al. (1995).

Example: measuring object coverage level
We now show how the "object statement coverage level" can be measured for some Java classes represented by examples in Sect. 2. Consider the Stack class, which is in a simple inheritance hierarchy with the CircularStack class.As seen before, the CircularStack_TestSuite test suite achieved 100% line coverage level for the Stack class, while it was unable to reveal the fault in line 10 of this class code.However, since this test suite does not create any object of the Stack class, the object statement coverage level is equal to 0 for this class.Considering Stack_TestSuite, an object of the Stack class type executes statements in lines 6, 7, 12, and 17 of the class, so the object statement coverage level is 100% for the Stack class (note that Stack does not have any parent; hence, we only consider statements in the Stack class itself for measuring object coverage).In addition, this test is failed and indicates a failure in the class.As shown in this example, the traditional criterion shows a high percentage of coverage for the Stack class, which can mislead developers and prevent them from writing sufficient tests for this class; the new coverage criterion addresses this shortcoming.
As another example, consider the List class and its child, ClearableList, presented in Sect. 2.2. Here, we want to measure the object line coverage level for ClearableList. We first do so considering List_TestSuite1. Using this test suite, the traditional line coverage level becomes 100% for ClearableList because all lines of this class are executed by the tests in List_TestSuite1. Since ClearableList inherits class List, in order to determine the object line coverage level, we should first flatten List with ClearableList in the form of a new temporary class, called ClearableList_Flat, sketched below (the sketch is pseudocode and not valid Java code, because the :: operator is not supported in Java in this way; we use it only to illustrate how the coverage is obtained). As shown in the sketch, the accessible inherited code from class List is added to ClearableList_Flat using the scope resolution operator "::". Tests in List_TestSuite1 execute the statements in lines 3, 4, and 19 of ClearableList_Flat (which are from the inherited code) along with the statements in lines 22 and 25 from the main class ClearableList. Although this test suite results in a 100% traditional line coverage level, the object coverage level is only about 38.5%, because lines 7, 8, and 9 of the inherited method List::add, together with lines 12, 13, 14, 15, and 16 of List::remove, are not executed by any object of class ClearableList; hence, the object coverage level is 5/13 ≈ 38.5%. Furthermore, this test suite is unable to reveal the seeded fault in the ClearableList class. Now, consider the ClearableList_Test2 test suite, which, in addition to the methods executed by the previous tests, executes the inherited method List::add. Therefore, it achieves a 61.5% object line coverage level. ClearableList_Test2 fails and reveals the fault in the code.
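The flattened pseudocode is sketched below; the flat line numbers are indicative of the full original listing:

    // Pseudocode only: the "::" prefix marks where each member is inherited from.
    class ClearableList_Flat {
        int[] List::elements;
        int List::index;

        List::List() {
            elements = new int[100];                   //  3: executed
            index = 0;                                 //  4: executed
        }

        void List::add(int value) {
            int pos = index;                           //  7: not executed by any
            elements[pos] = value;                     //  8: ClearableList object
            index = pos + 1;                           //  9:
        }

        int List::remove() {
            if (index <= 0)                            // 12: not executed by any
                throw new IllegalStateException();     // 13: ClearableList object
            int value = elements[index - 1];           // 14:
            index = index - 1;                         // 15:
            return value;                              // 16:
        }

        int List::size() {
            return index;                              // 19: executed
        }

        void clear() {                                 // ClearableList's own code
            for (int i = 0; i < elements.length; i++)  // 22: executed
                elements[i] = 0;                       // 25: executed
            // index = 0;                              // seeded fault (commented out)
        }
    }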

Example: challenging test generation tools
One of the main applications of the code coverage criteria is their use as an indicator to guide the automated generation of test data. In recent years, various approaches and tools have been introduced to generate test data for OO programs based on the code coverage criteria (Gay et al., 2015). One of the prominent tools in this field is EvoSuite, a whole-test-suite generation tool. EvoSuite has been selected as a highly efficient tool with a high level of code coverage in various challenges and studies (Devroey et al., 2020; Fraser & Arcuri, 2015; Kifetew et al., 2019; Molina et al., 2018).
Although the examples presented in Sect. 2 have very simple structures and were only used to explain and clarify issues related to applying the code coverage criteria to OO languages, they may challenge test generation tools in practice. To demonstrate this issue, we use EvoSuite to generate a test suite for one of these examples. Consider the ClearableList class once again. As mentioned before, this class inherits from the List class and adds a new method that can delete all contents of the list. We extracted tests for this class using the latest version of the EvoSuite tool, applied to the correct version of the class (with line 7 uncommented). The extracted test suite includes a single test case that results in 100% traditional statement/line coverage for the class ClearableList; its essential content is sketched below. However, the object line coverage of this test suite is only about 42%. This indicates that although the class code has been fully executed during the test run, the inherited methods have not been fully tested. If we run this test suite against the faulty version of ClearableList (with line 7 commented out), the test suite passes and does not reveal the injected fault. As this simple example shows, automated tools may have serious problems in revealing OO-related faults; this issue should be examined in further research.
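The essential content of the generated test is sketched below (the actual EvoSuite output contains additional scaffolding):

    import static org.junit.Assert.assertEquals;
    import org.junit.Test;

    public class ClearableList_ESTest {
        @Test
        public void testClear() {
            ClearableList list = new ClearableList();
            list.clear();                  // executes all of ClearableList's own code
            assertEquals(0, list.size());  // passes even for the faulty version,
                                           // because add() is never called
        }
    }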

Poly-object coverage criteria definition
As stated in the previous section, two key points should be taken into account when defining an object coverage criterion: one is which object type executes the class code, and the other is the code inherited from the parent and ancestor classes in addition to the main class code. Since problems related to polymorphism and dynamic binding usually depend on the object type and the inherited state/behavior code, the defined object coverage criteria can already address some of these problems. However, we have defined the object coverage criteria based on the parent/ancestor classes of the class under test, while polymorphism problems for a class usually arise from its child/descendant classes. Therefore, in order to use the object coverage criteria for such problems, it is necessary to consider the object coverage level of the child and descendant classes as well. To simplify the use of our new criteria for addressing polymorphism issues, in this section, we define the "poly-object coverage criteria," which specifically consider a class in its possible different polymorphic uses. These new criteria are based on our proposed object coverage criteria.
The poly-object coverage criteria work in the same way as the object coverage criteria. However, unlike the latter, which are defined for a class by considering its super-classes (i.e., parent and ancestor classes), the former are defined for a class with respect to its subclasses (i.e., its children and descendant classes). To define the poly-object coverage criteria, imagine class C has a child or descendant class, called D. The poly-object coverage criteria for base class C and subclass D are fully satisfied if and only if every part of C that is accessible from D is executed at least once by an object with type D during test execution. Using the notion of test requirements (Ammann & Offutt, 2016), we can formally define these criteria as follows. Similar to Definition 1, for the sake of simplicity, we only formally define a criterion for statements; criteria for other structures like lines, basic blocks, and branches can be defined in a similar way.

Definition 3. Poly-object statement coverage criterion. According to this criterion, the set of test requirements (TR) for a given class C and one of its subclasses, D, is equal to all statements that are defined in C and accessible from D. By this criterion, each statement in TR should be executed at least once by an object bound to class D using the provided test suite.
We can generalize the above definition by considering a set of classes, D_1, D_2, …, D_n, as subclasses or descendant classes of the base class C:

Definition 4. Poly-object statement coverage criterion for a set of subclasses. According to this criterion, the set of test requirements (TR) for a given class C and a set of classes {D_1, D_2, …, D_n} of subclasses or descendant classes of C in an inheritance hierarchy is equal to all statements that are defined in C and accessible from any of the classes {D_1, D_2, …, D_n}. By this criterion, for each class D_i in {D_1, D_2, …, D_n}, each statement in TR should be executed at least once by an object bound to D_i using the provided test suite.
The coverage level for all poly-object coverage criteria can be defined similarly to Definition 2. In general, the coverage level is equal to the ratio of the number of satisfied test requirements to the total number of test requirements.
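Written out explicitly for Definition 4 (with Stmts(C) as before, and one test requirement counted per statement-subclass pair, which matches the 4/6 computation in the next subsection):

    TR(C, {D_1, …, D_n}) = { (s, D_i) : s ∈ Stmts(C) and s is accessible from D_i }

    poly-object statement coverage level(C, {D_1, …, D_n}, TS)
        = |{ (s, D_i) ∈ TR : s is executed by an object of type D_i during TS }| / |TR| × 100%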

Example: measuring poly-object coverage level
We now give an example of how to measure the poly-object coverage level using the Counter and OneBasedCounter classes presented in Sect. 2.3. As stated earlier, although these classes work properly when they act separately, a polymorphic bug exists in the ResetCounter class that occurs if an object of the OneBasedCounter class is bound to a variable of type ResetCounter. In addition, the test suite Counter_TestSuite, which achieves 100% traditional line/statement coverage for both classes, is not able to reveal this bug. To calculate the poly-object coverage level of the Counter base class under the Counter_TestSuite test suite, we do the following:

1. As Counter is the base class, and {ResetCounter, OneBasedCounter} is the set of descendant classes, the accessible code (lines) of Counter for the descendant classes consists of all statements in the constructor and in the inc and value methods.

2. For each descendant class DC, we determine which of the accessible lines (calculated in the previous step) are executed by objects of type DC:

• For descendant class ResetCounter: the statements in lines 5 and 6 of the Counter class are executed by ResetCounter objects.

• For descendant class OneBasedCounter: the statements in lines 5 and 6 of the Counter class are executed by OneBasedCounter objects.
3. The poly-object coverage level is the ratio of the number of executed statements to all accessible statements of the base class, counted for each descendant class; here it is equal to 4/6 (about 67%).

If we now add the test case CounterPoly_Test to the previous test suite Counter_TestSuite, the poly-object coverage level increases to 5/6 (about 83%). Furthermore, this new test suite fails and reveals the seeded polymorphic fault. We also used the EvoSuite tool to generate a test suite for our three counter classes (without any seeded fault). After seeding the polymorphic bug (by removing the virtual keyword in ResetCounter), this tool was unable to detect the bug, and all of the test cases in the generated test suite passed.

The OCov4J tool
To support the object coverage criteria for the Java language, we have implemented OCov4J as a prototype tool. OCov4J is available as an open source project, published on the GitHub online code repository. The source files, the executable version of the tool (jar files), and a guide on how to install and use the tool can be found in the GitHub repository.
OCov4J receives as inputs a target code (or an executable project as a jar file) and one or more test suites in the JUnit format (though the tool can also be used with other unit test frameworks in Java). Next, it instruments the code on the fly and then calculates the coverage levels for the different object coverage criteria by analyzing the information gathered during unit test execution and examining the class hierarchy.
Unlike regular instrumentation libraries, which record execution information for lines (or other parts of the code), the instrumentation approach in OCov4J additionally records information related to the executor object and the runtime context. OCov4J instrumentation is applied at the bytecode level; therefore, it does not require the source code and can be applied to any compiled Java application, as well as to other bytecode languages like Groovy and Scala. This allows us to apply the object coverage criteria when only the executable files of the program are available, i.e., when we do not have access to the source code. Nevertheless, OCov4J extracts the embedded information about the source code from the given bytecode. Using this information (like line numbers), it can generate reports that are useful for developers and debuggers. Moreover, by instrumenting the bytecode on the fly (in-memory), OCov4J saves execution time, as there is no need to re-compile or save files on disk.
Figure 2 shows the OCov4J architecture. This tool uses the Java agent architecture to modify bytecode on the fly. Java agents are programs that run within the Java virtual machine (JVM). These programs can be embedded into the JVM to perform a variety of tasks, such as gathering information about a running application or monitoring parts of an application. OCov4J attaches to the JVM execution process as an external agent and acts as an interface for loading class bytecode into memory. During this process, OCov4J first loads the bytecode of a class. Then, by changing this bytecode, it adds the additional code needed to calculate the object coverage criteria. Finally, the instrumented bytecode is delivered to the Java class loader. ASM, a framework for the analysis and manipulation of Java bytecode, is used to modify and transform the bytecode. Using the Java agent architecture results in low execution cost, due to the avoidance of re-compiling and reloading classes. In addition, this architecture provides high flexibility and allows OCov4J to easily attach to any execution process in the JVM and extract the data required to compute the coverage criteria.
The initial version of OCov4J supports the "statement coverage criterion" and the "line coverage criterion" of both the "object coverage criteria" and "poly-object coverage criteria" categories. Other related criteria, such as branch coverage and decision coverage, will be added in future versions. The tool is compatible with most Java language features, such as inner classes, generic classes/methods, and lambda expressions, so it can be applied to large, real applications, as we show in our evaluation.
As an example of how to use OCov4J, consider the class ClearableList, for which we manually measured the object line coverage criterion in Sect. 3.1.1. To calculate this criterion automatically, we assume that the ClearableList_Test1 class is a JUnit test. We first attach the OCov4J jar file to the Java process while executing the tests, using the commands sketched below. Lines 1 and 2 execute the unit tests on class ClearableList: the option -javaagent attaches the jar file of OCov4J to the JVM process, and the command then causes the JUnit core to run the tests specified in the test suite ClearableList_Test1. After running these tests, OCov4J saves the coverage information in comma-separated values (CSV) files in the current directory. These CSV files can be used for later processing in spreadsheet tools. OCov4J also provides some commands for viewing coverage level values; for example, executing line 03 below prints the object line coverage level on the terminal.
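The invocation is sketched below; the exact jar names and the reporting command in line 03 are placeholders, and the tool's GitHub guide documents the actual options:

    01: java -javaagent:ocov4j.jar \
    02:      -cp .:junit.jar:hamcrest.jar org.junit.runner.JUnitCore ClearableList_Test1
    03: java -jar ocov4j.jar report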

Empirical evaluation
The main purpose of this research is to adapt the traditional code coverage criteria to address specific OO features while, at the same time, keeping the resulting new criteria highly automated, simple, and cheap to execute, as before. In this section, the proposed object coverage criteria are empirically evaluated using a set of different OO classes as benchmarks. These classes have been selected from different open source Java programs. We use our prototype tool, OCov4J, to measure the object coverage criteria. The results are then compared with the results of the traditional coverage criteria (obtained with the JaCoCo tool) to determine which criteria better reveal OO-related failures.
The major goal of this empirical evaluation is to investigate the correlation between the object coverage criteria and the ability to reveal specific OO-related failures. The secondary goal is to examine how our poly-object coverage criteria address polymorphism and dynamic binding issues. To evaluate these goals clearly and accurately, we define three research questions:

• RQ1: Is the effectiveness of a test suite for detecting OO-related failures more correlated with the "object coverage level" than with the "traditional coverage level"?

• RQ2: Is the effectiveness of a test suite for detecting polymorphic failures more correlated with the "poly-object coverage level" than with the "traditional coverage level"?

• RQ3: Do the "new object coverage criteria" outperform the "traditional coverage criteria" in evaluating tests to find OO-related failures and problems?
To answer these questions, several classes under test (CUTs) from different open source Java programs/libraries have been selected. We have then generated a test suite for each CUT such that the resulting test suites yield high levels of coverage for the traditional code coverage criteria. Next, a set of OO faults has been seeded into the source code of every CUT or its related classes in the inheritance hierarchy to make OO-specific faulty versions (for this purpose, we have used auto-generated mutants along with some manually seeded faults). Finally, we have calculated the object coverage criteria for each test suite as well as the number of faulty versions detected by the test suite, in order to examine the correlation between these two categories of values. In the following subsections, we review the details of the benchmarks, the evaluation process, and finally, the results of the evaluation.

Benchmark projects
A total of 270 classes (excluding private, intermediary, and internal classes) from six different open source projects have been included in the empirical evaluation. Apart from JTetris, which is a small project for educational purposes, five open source and widely used projects with active communities have been selected.
Table 1 shows the project name, the package used during evaluation, the number of classes, the depth of the inheritance tree (DIT), and the lines of code (LOC) (excluding comment and blank lines) for each selected project. The MetricsReloaded tool has been used to calculate the LOC and DIT of each class. A summary description of the benchmark projects is provided below:

• JTetris: This project is a simple implementation of the Tetris computer game in Java. This implementation is based on the formal specification provided in Smith (2012).

To examine the "object coverage criteria," 40 classes from the mentioned projects have been selected. We call these CUTs "target classes". These classes lie in different inheritance hierarchies. Using EvoSuite, a test suite has been generated for each target class to measure its ability to detect OO-related failures. In addition, 24 classes, most of which are different from the target classes, have been used to evaluate the "poly-object coverage criteria". We call these 24 classes "base classes". Note that each "target class" has its own inheritance hierarchy, and to calculate the "object coverage criteria," the parent/ancestor classes of each target class are used. Moreover, each "base class" has its own inheritance hierarchy, and for measuring the "poly-object coverage criteria," the children and descendant classes are used. The set of classes involved in calculating our coverage criteria, including the target and base classes, consists of a total of 270 classes. Target, base, and other classes in the inheritance hierarchies are used to create faulty versions of each project; the details are presented in Sect. 4.3. It should be noted that each target class has at least one parent (otherwise, as mentioned in Sect. 3.1, for classes without a parent, the new object coverage criteria work like the standard coverage criteria). Moreover, we use some target classes with several ancestor classes so that different types of OO problems can be modeled in a faulty version. Also, each base class has at least one child so that it can be used in a polymorphic manner. Finally, we use base classes with different numbers of descendant classes to cover more polymorphic faults. By selecting different real projects and different target and base classes in various inheritance hierarchies, we attempt to reduce the threats to the external validity of our empirical evaluation.

Test suites
For each target class in each inheritance hierarchy, a test suite has been created using the EvoSuite tool. For half of the selected classes, i.e., 20 classes, the generated test suite has 100% traditional statement coverage. Eleven other classes have coverage above 70%. Seven classes have coverage between 50 and 70%, and only for two classes is the coverage slightly less than 50%. Thus, in total, most of the target classes, i.e., more than 77% of them, achieve high code coverage levels (between 70 and 100%). We aimed for tests with a high level of code coverage so that they would be more likely to detect failures in our faulty versions of the projects.
We have used EvoSuite 1.1.0 with the default configuration to generate the test suites, with a maximum search budget of 1 min per test suite. After generating a test suite for each target class, the object coverage criteria for the target class are measured by adding the OCov4J agent to the JUnit test execution process.

Faulty versions
Fault seeding, in which artificial faults are inserted into programs, is commonly used to compare testing approaches (Papadakis et al., 2019). To measure how well a test suite reveals OO-related failures in a target class, small changes have been made to the target class code along with its parent and ancestor classes. This way, fault-seeded versions of the projects, called "faulty versions," have been built. To inject different types of OO defects into the faulty versions, we have used the classification of OO faults in (Alexander et al., 2002; Offutt et al., 2001) as well as the approaches for generating OO mutations in (Kim et al., 2001; Ma et al., 2006; Offutt et al., 2006). According to the mentioned approaches, OO faults and problems (related to concepts such as inheritance, polymorphism, and dynamic binding) can be categorized as follows:

• Inappropriate and inconsistent use of inherited state variables: Many inheritance-related errors occur when the child class misuses the inherited state space of the parent or ancestor classes. In the fault model introduced in (Offutt et al., 2001), these errors are classified as "state definition anomaly" and "state definition inconsistency". The IHD and IHI operators have been used in mutation approaches such as that of Offutt et al. (2006) in order to generate such errors. These operators hide an inherited state variable by adding or deleting a variable whose name is the same as the name of a variable in the parent or ancestor classes. IHI hides the state variable by adding a variable with the same name, while IHD hides it by removing the variable declaration in the child class. The former operator adds a declaration of a variable with the same name as the inherited variable to the child class; hence, references to the inherited variable in the child class now refer to the new variable and are not passed to the ancestor class. The latter operator removes the definition of a variable v from the child class (for example, by commenting out the declaration line) when a variable with the same name exists in the parent or ancestor classes; by removing this definition, references to v in the child class are mistakenly transferred to the parent or ancestor classes. A sketch of an IHI-style mutation is given after this list.

• Incompatible invocations of inherited methods: Some other inheritance-related errors result from the incorrect usage of the parent state space in defining overridden methods, and from the direct or indirect invocation of these methods. In the fault model of Offutt et al. (2001), these errors are respectively classified as "state definition inconsistency due to state variable hiding" and "indirect inconsistent state definition". To create such errors, mutation approaches such as that of Offutt et al. (2006) use the operators IOD and IOR, which cause failures by deleting and renaming overridden methods in the child class, respectively. For instance, the IOD operator removes the definition of an overridden method in the child class; therefore, any calls to this method in the child class are now transferred to the parent/ancestor method.

• Incorrect and inconsistent invocation of parent/ancestor constructors: The way a class is initialized can cause some common, potential OO failures. Executing the constructor of a class results in executing the constructors of the parent and ancestor classes until execution reaches the root class. Also, each class may explicitly execute a specific version of its parent class constructor, so an inappropriate invocation within these sequences of constructor invocations may lead to data anomalies. In addition, calling other methods of a class in the constructor can cause potential failures, because these methods can be overridden by subclasses. In the fault model of Offutt et al. (2001), these errors are classified as "anomalous construction behavior" and "incomplete construction". The operators IPC and JDC model such errors in the OO mutation approach of Offutt et al. (2006): the former eliminates the parent constructor call, and the latter forces the default constructor to be executed. These changes model the problems that occur because of the interaction between the class constructor and the inherited constructors.

• Polymorphism and use of inconsistent types: Polymorphism and dynamic binding can lead to potential problems through the execution of overridden methods in different contexts (i.e., on different instances of descendant classes' objects). These problems may occur especially when the inheritance depth is more than two. For example, the yo-yo problem is a well-known case of this failure type that is reported to be difficult to find (Alexander et al., 2010). In the fault model of Offutt et al. (2001), these errors are classified as "inconsistent type use". Several operators have been introduced to generate such errors in the mutation approach of Offutt et al. (2006). For example, the PNC operator changes an object instantiation from a class to a subclass of this class, and the PMD operator changes the declared type of a variable to the parent or an ancestor of the class.

• Misusing programming language keywords in accessing object state and inherited state: Although this error type is not included in the fault model of Offutt et al. (2001), Ma and Offutt consider these errors common among developers and introduce mutation operators for them (Offutt et al., 2006). These operators include ISK, JTD, and JSC, which delete the keywords super (for problems related to inherited code access), this (for problems related to the object's state), and static (for problems in accessing the space shared between classes), respectively. Note that these types of faults may lead to errors that are semantically equivalent to the errors in the aforementioned categories; therefore, it is necessary to check the generated faulty versions to avoid equivalent mutants.
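As a concrete illustration of the first category, the following hypothetical snippet (the BankAccount and BonusAccount classes are invented for illustration) shows the effect of an IHI-style mutation:

    class BankAccount {
        protected int balance = 0;
        public void deposit(int amount) { balance += amount; }
        public int getBalance() { return balance; }
    }

    class BonusAccount extends BankAccount {
        // IHI-style mutation: this declaration hides the inherited 'balance',
        // so the method below updates the new field while the inherited
        // getBalance() still reads the parent's (never-updated) field.
        protected int balance = 0;

        public void depositWithBonus(int amount) {
            balance += amount + 1;
        }
    }

A test that calls depositWithBonus(10) on a BonusAccount and then checks getBalance() observes 0 instead of 11, exposing the state definition anomaly.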
Although we could have implemented our fault seeding process using only one mutation tool that supports OO mutation operators (e.g., MuJava (Ma et al., 2006), which implements the mutation approach of (Offutt et al., 2006)), we also used a manual approach alongside MuJava to produce faulty versions. This was done due to the lack of mature tools in this field. Among all the OO mutation tools, only MuJava is accessible and usable for real cases, such as the benchmarks presented in the previous subsection. However, MuJava, like other OO mutation approaches, generates mutants by applying only a tiny change to one statement of a single class file; this has some shortcomings for modeling specific OO faults, as listed below:

1. While making a small change in one statement of one class file usually yields a valid mutant in procedural mutation, this approach may lead to invalid mutants in the OO paradigm. For example, in MuJava, the JTD operator tries to create an OO mutant by removing the keyword this from the beginning of a variable name, so that the value assigned to this variable remains local and does not bind to the class state variable (a common fault, as mentioned in our OO fault categories). Now, consider class Bars in the following example, where applying the JTD operator in line 6 could yield a mutant. This mutant cannot be compiled because of the final initialization rule implied by declaring the state variable barcount with the final keyword (line 2); this rule forces a final state variable to be initialized when the object is instantiated. Therefore, to create such a mutant, we have to apply two tiny changes to the code at the same time: removing the final keyword in line 2 and removing the this keyword in line 6. Hence, in some situations, performing OO mutation requires multiple changes in different parts of the class.
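The original listing for class Bars is not reproduced in this version of the text; the following is a hypothetical reconstruction consistent with the description above (the final state variable in line 2 and the this-qualified assignment in line 6):

```java
public class Bars {                  // line 1
    private final int barcount;      // line 2: the final keyword must also be removed
    private String name;             // line 3 (illustrative filler field)
                                     // line 4
    public Bars(int barcount) {      // line 5
        this.barcount = barcount;    // line 6: JTD removes "this"; the assignment then
    }                                //         targets the parameter, leaving the final
}                                    //         field uninitialized -> compile error
```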
2. The existing approaches generate OO mutants for a particular class by considering this class in isolation and applying their operators only to its code. However, many OO failures (especially polymorphic failures) originate from a fault in the parent or ancestor classes. Although such a fault may not affect the correctness of the class itself, it may lead to OO failures in the descendant classes (for example, refer to the counters example in Sect. 2.3). Thus, when we generate mutants for a class, we may need to apply changes in its parent or ancestor classes.

3. To model some OO faults, it is necessary to make small changes in the target class and its parent/ancestor classes simultaneously. For example, the IPC operator in MuJava generates mutants by removing the invocation of one of the parent class constructors from the constructor of the class. When we remove this invocation, the Java compiler implicitly replaces the removed call with a call to the default (parameterless) constructor of the parent class; if the parent class does not have a default constructor, the change made to the child class causes a compile error. In this situation, in addition to changing the child class file (removing the parent constructor call), we need to add a default constructor to the parent class at the same time. Consider the class ClassUnderTest below, for which we want to create a mutant by removing the parent constructor invocation in line 4, as highlighted. Suppose ParentClass does not have a default constructor; then, to make the mutant valid and compilable, we have to add a default constructor to ParentClass at the same time, as highlighted in line 3 of the ParentClass code.
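These listings were also lost in this version of the text; a hypothetical reconstruction matching the description (the parent constructor call in line 4 of ClassUnderTest, and the added default constructor in line 3 of ParentClass) could look as follows:

```java
// ClassUnderTest.java (hypothetical reconstruction)
public class ClassUnderTest extends ParentClass {  // line 1
    private boolean ready;                         // line 2
    public ClassUnderTest(int capacity) {          // line 3
        super(capacity);                           // line 4: removed by the IPC mutant
        this.ready = true;                         // line 5
    }                                              // line 6
}

// ParentClass.java (hypothetical reconstruction)
public class ParentClass {                         // line 1
    protected int capacity;                        // line 2
    public ParentClass() { }                       // line 3: default constructor added so
                                                   //         the IPC mutant compiles
    public ParentClass(int capacity) {
        this.capacity = capacity;
    }
}
```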
Considering the above points, we used a hybrid approach for the fault seeding process. In addition to using the MuJava tool to generate OO mutants, further faulty versions were created manually by analyzing the target classes along with their parent/ancestor classes. In general, to generate faulty versions covering all the aforementioned types of OO faults, we followed these steps:

1. First, we generated all the OO mutants that can be produced using only the inheritance, polymorphism, and dynamic binding operators of MuJava.
2. Next, we manually generated faulty versions in the different categories specified at the beginning of this subsection, by changing two or more parts of a class at the same time, modifying the parent/ancestor classes, or changing several classes simultaneously, including the target class and the classes in its inheritance hierarchy.
3. Finally, we checked all the faulty versions generated for a class (either automatically or manually) to remove equivalent versions.
Below are two examples of the automatic and manual mutants used in the evaluation process. Class Field of the Apache Validator project represents a mutant automatically generated by MuJava: line 3 has been removed from this class so that the parent constructor is not called. The DPOIndicator class (from the Ta4J project) shows a manual mutant generated by the approach described in this section. Two simultaneous changes have been applied to this class: first, the this keyword has been removed from the beginning of variable barCount in line 6 to change the scope of this variable; second, since variable barCount is declared with the final keyword in line 2, this keyword has also been removed so that the mutant compiles.
Performing the above steps and discarding the equivalent faulty versions, 668 faulty versions were generated for the target classes; on average, about 16.7 faulty versions per class. Most of these mutants (more than 72%) were generated automatically by the MuJava tool. Table 2 shows the details of the generated faulty versions for each target class. In addition to the project name and the target class name, the table contains the following columns:

• DIT: The inheritance depth of the target class in the inheritance hierarchy
• All faulty versions: The total number of faulty versions generated for the target class, either using MuJava or manually
• Auto-generated faulty versions: The number of faulty versions automatically generated by MuJava for the target class
• Manual faulty versions: The number of faulty versions manually generated by the approach introduced in this section
• Auto-generated ratio: The ratio of auto-generated faulty versions to all faulty versions

It should be noted that the number of faulty versions for each class depends on the total LOC of the target class and its parent/ancestor classes, the number of overridden methods in the class, its inheritance depth, and the number of variables inherited from the parent/ancestor classes.
Another point worth mentioning is that, for most classes, a high percentage of the faulty versions were generated automatically; this percentage is below 50% for only 4 classes. The reason is the small code size of these target classes: most of their functionality is provided through ancestor classes, so the MuJava tool cannot generate many mutants, and faulty versions can only be produced through the manual process discussed in this section, by applying changes in the ancestor classes. We should also note that, in general, the number of OO mutants is significantly smaller than the number of mutants generated by common procedural mutation approaches. For example, for the famous triangle program, which has 30 lines of code, about 950 mutants can be created using different types of traditional mutation operators, such as changing arithmetic or logical operators (Ma et al., 2006). In contrast, a study of 256 classes of the Apache BCEL framework showed that only about 14.5 OO mutants can be generated per class using the MuJava tool (Offutt et al., 2006).

Running tests and checking faulty versions
Each faulty version generated by the process described in the previous subsection is either a single modified Java class file or a collection of several modified Java class files contained in a directory. For each faulty version, we placed the modified Java files in the source directory of the related project. Then, we compiled the modified project and ran the target class test suite against it. If any test failed, we labeled the faulty version as "detected," meaning that the test suite had been able to detect the seeded fault in the project.
We had to repeat the same steps for all remaining faulty versions to determine whether each faulty version was detectable. Because these steps were repetitive and costly, we developed a helper tool, called MuRunner, to automate the process. This tool is integrated with the native Java compiler and the Maven build tool for compiling projects, and it supports different versions of JUnit for running tests. It is a command line tool that receives the root directory of faulty versions as input and retrieves the list of all faulty versions in that directory. The tool then performs all the required steps: replacing the modified Java classes in the project, recompiling the modified project, running the JUnit test suite against the compiled project, and, finally, collecting the results, which show which faulty versions have been detected by the tests. MuRunner is publicly available through GitHub. In addition, all faulty versions produced during this evaluation are available alongside the tool as a sample project. Using MuRunner, OCov4J, and the provided samples, anyone can reproduce the experiments of this research or perform additional ones.
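For illustration, the sketch below shows the kind of loop that MuRunner automates. It is not MuRunner's actual implementation; the directory layout, the Maven invocation, and all names are assumptions made for the example:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.*;
import java.util.stream.Stream;

// Illustrative sketch only: iterate over faulty versions, rebuild the project,
// run its JUnit suite via Maven, and record which versions were detected.
public class FaultyVersionChecker {

    public static void main(String[] args) throws IOException, InterruptedException {
        Path faultyRoot = Paths.get("faulty-versions");   // one subdirectory per version
        Path projectDir = Paths.get("project");           // the benchmark project
        Path sourceDir  = projectDir.resolve("src/main/java");

        try (DirectoryStream<Path> versions = Files.newDirectoryStream(faultyRoot)) {
            for (Path version : versions) {
                copyTree(version, sourceDir);             // overwrite originals with mutants

                // "mvn test" recompiles and runs the suite; a non-zero exit
                // code means at least one test failed, i.e., fault detected.
                Process mvn = new ProcessBuilder("mvn", "-q", "test")
                        .directory(projectDir.toFile())
                        .inheritIO()
                        .start();
                boolean detected = mvn.waitFor() != 0;

                System.out.println(version.getFileName()
                        + (detected ? ": detected" : ": NOT detected"));
                // Restoring the pristine sources between versions is omitted here.
            }
        }
    }

    private static void copyTree(Path from, Path to) throws IOException {
        try (Stream<Path> files = Files.walk(from)) {
            files.filter(Files::isRegularFile).forEach(p -> {
                try {
                    Path dest = to.resolve(from.relativize(p));
                    Files.createDirectories(dest.getParent());
                    Files.copy(p, dest, StandardCopyOption.REPLACE_EXISTING);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
    }
}
```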

Results and discussion
In this subsection, we answer the three research questions raised at the beginning of Sect. 4. To do so, as discussed in the previous subsections, we generated automated test suites and then obtained the "traditional coverage level," "object coverage level," and "poly-object coverage level" for these test suites. In addition, we generated mutants related to OO issues for each SUT, both automatically and manually, and determined how many of these issues were revealed by running the test suites.

Evaluation of RQ1
Table 3 shows some of the data that we gathered through the evaluation phase to answer question RQ1. For each target class, in addition to the project name and the target class name, the following fields are specified in this table:

• OCov: The object statement coverage level of the used test suite for the target class
• Cov: The traditional statement coverage level of the used test suite for the target class
• Detected faults (all): The number of faulty versions detected by the test suite vs. the total number of faulty versions (either produced by MuJava or generated manually)
• Detected faults (auto): The number of auto-generated faulty versions detected by the test suite vs. the total number of auto-generated faulty versions (mutants generated by MuJava)
• Detection ratio (all): The ratio of the number of detected faulty versions to the number of all faulty versions (either produced by MuJava or generated manually)
• Detection ratio (auto): The ratio of the number of detected faulty versions to the number of auto-generated faulty versions (mutants generated by MuJava)

To informally check whether there is a positive correlation between the object coverage level and the faulty version detection ratio, we use a scatter diagram, depicted in the left side of Fig. 3, based on columns OCov and Detection ratio (all) in Table 3. According to this scatterplot, there is a positive linear correlation between the object coverage level and the percentage of OO faulty versions detected by the test suite.
Although the scatterplot shows a positive correlation, to determine the strength of this correlation, we need the correlation coefficient. To choose the appropriate method for calculating this coefficient, we examine some characteristics of the resulting experimental data.
As the first characteristic, the left side of the scatter diagram shows an approximately linear relationship between the two variables OCov and Detection ratio (all). Next, using the Shapiro-Wilk test, one of the most powerful normality tests (Razali & Wah, 2011), we check whether both variables follow a normal distribution. The null hypothesis H0 of this test is that the variable is normally distributed. According to the test, the p-values for OCov and Detection ratio (all) are 0.056 and 0.096, respectively. Since the p-value for both variables is greater than the significance level α (0.05), we cannot reject H0; therefore, we treat both variables as normally distributed. Another characteristic observed in the plot is the absence of considerable outliers.
Based on these characteristics, we use the Pearson method to obtain the correlation coefficient between the object coverage level and the faulty version detection ratio. The Pearson correlation coefficient for these variables is 0.792, where the degrees of freedom (df) is 38; df is the number of samples minus 2. These values indicate a high positive correlation between the object coverage level and the failure detection ratio. This result is also statistically significant: the critical value for a one-tailed test with significance level α = 0.05 and df = 38 is 0.257, and our Pearson correlation coefficient (0.792) is much greater than 0.257. If we use the results in Table 3 to draw a scatterplot for the traditional statement coverage level (column Cov in Table 3) and the faulty version detection ratio, as depicted in the right side of Fig. 3, no considerable correlation is observed. Moreover, the Pearson correlation coefficient between these two variables is −0.010, which is considered a negligible correlation (Razali & Wah, 2011).
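For reference, the Pearson coefficient used above is the standard sample statistic, computed over the n = 40 target classes (hence df = n − 2 = 38), with x the OCov values and y the Detection ratio (all) values:

$$
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^{2}}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^{2}}}
$$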
• Answer 1. The results, based on large Java programs, suggest that there is a significant positive correlation between the level of our new "object coverage criteria" and the "number of OO failures found by tests". Moreover, the results show that this correlation is very low for the "traditional coverage criteria".
As mentioned in Sect. 4.3, we developed a manual approach to generate more diverse mutants with better coverage of OO faults. However, it may be argued that the manually generated mutants could bias Answer 1. In response, first note that most mutants were generated automatically by MuJava (as shown in Table 2, 72% of all mutants were produced automatically). Second, as discussed below, the positive correlation remains significant even if the evaluation is limited to the auto-generated mutants.
Fig. 3 The scatterplot between object coverage level/traditional coverage level and faulty versions detection ratio

To examine this issue in more detail, columns "Detected faults (auto)" and "Detection ratio (auto)" in Table 3 show the results of our experimental evaluation when only the 473 auto-generated mutants are taken into account. Also, Fig. 4 shows the scatterplot between the object coverage level in column OCov and the faulty version detection ratio (calculated over the auto-generated mutants only) in column Detection ratio (auto). The depicted scatterplot reveals a positive correlation between these two variables.
To compare this correlation with the overall correlation presented before, we again used the Pearson correlation coefficient. Its value for the auto-generated mutants is 0.779, slightly less than the coefficient for all mutants (0.792). This correlation coefficient is also statistically significant, because it is much greater than 0.257 (as mentioned before, the critical value for a one-tailed test with significance level α = 0.05 and df = 38). The right side of Fig. 4 shows the scatterplot between the traditional coverage level and the fault detection ratio over all auto-generated mutants; like the corresponding diagram for all mutants (right side of Fig. 3), it shows no significant correlation between the traditional coverage level and the detected OO-related faults.

Evaluation of RQ2
In addition to the main object coverage criteria, we proposed the "poly-object coverage criteria" in order to address problems related to polymorphism and dynamic binding. These criteria are essentially the same as the object coverage criteria, but they are defined for a class (as a base class) by considering its child and descendant classes.
To evaluate the effectiveness of these new criteria with respect to polymorphic problems, we raised research question RQ2. To answer this question, we considered 21 classes from the benchmark projects introduced in Sect. 4.1. Each base class in our experiment has at least one child class and may also have other descendant classes. For each base class bc, we selected the following versions from the set of faulty versions created for the descendant classes of bc (constructed as described in Sect. 4.3):

• Faulty versions of descendant classes of bc that were generated by injecting faults into the code of bc
• Faulty versions of descendant classes of bc that fall into the "Polymorphism and inconsistent types use" category (based on the fault categories specified in Sect. 4.3), which addresses polymorphism and dynamic binding problems
Fig. 4 The scatterplot between object coverage level/traditional coverage level and faulty versions detection ratio (limited to auto-generated mutants)

Table 4 shows the results of this evaluation for the base classes. Column NOD indicates the number of descendant classes of the base class (including its children). Column PolyCov shows the poly-object coverage level of the base class, considering its descendant classes as listed in column Descendants. The next column, Detected faults, reports the number of detected faulty versions vs. the number of all faulty versions. Finally, the last column gives the detection ratio of the faulty versions.
Using an approach similar to the previous section, we observe a high positive correlation between columns PolyCov and Detection ratio in Table 4. The scatter diagram in the left side of Fig. 5 clearly shows the correlation between the poly-object coverage level and the polymorphic-fault detection ratio. As in the previous section, given the characteristics of the data in columns PolyCov and Detection ratio, we used the Pearson correlation coefficient to determine the strength of the observed correlation. Its value for the resulting data is 0.821. This correlation coefficient is statistically significant, because the critical value for a one-tailed test with significance level α = 0.05 and df = 22 is 0.330, and our Pearson correlation coefficient (0.821) is greater than 0.330.
• Answer 2. The results, based on large Java programs, suggest that there is a significant positive correlation between the level of our new "poly-object coverage criteria" and the "number of polymorphic failures found by tests". Furthermore, the results show that this correlation is very low for the "traditional coverage criteria".

Evaluation of RQ3
The final research question, RQ3, has been raised to assess how well our approach meets the primary objective of this research.
As argued in Sect. 3, the proposed object criteria subsume the traditional coverage criteria. This means that when, for example, the "object statement coverage" of a test suite approaches 100%, its corresponding traditional criterion necessarily approaches 100% as well. Therefore, it is difficult to use cases with a high object coverage level to decide whether the new criteria are more effective than the traditional criteria in detecting OO-related failures. Instead, it is more informative to compare cases that have a low object coverage level but, at the same time, a high traditional coverage level. In the following, we demonstrate that, in these situations, not only do the new object criteria perform better than the traditional criteria, but the traditional criteria are in fact completely misleading. Consider the target classes in our experiment that have a high traditional coverage level, i.e., more than 90%. Among these, we selected those whose object coverage level is very low, that is, less than 20%. Table 5 lists the chosen target classes along with their object coverage level, traditional coverage level, and OO-related fault detection ratio.
These classes make up 27% of all target classes. Although they have an average traditional coverage level of 93%, they detect only a small number of OO-related failures: on average, 14% of the failures. The traditional coverage level is therefore completely misleading and does not indicate the OO-related failure detection ratio. Meanwhile, the average object coverage level of these classes is 19%, which is as low as their failure detection ratio. To examine the correspondence between the object coverage level and the detection rate of OO-related failures more closely, Table 6 divides the results of all target classes into five groups, according to the object coverage level, from very low to very high. For each group, the average object coverage level and the average failure detection ratio are reported. As the table shows, as the object coverage increases, the number of detected failures (reflected by the "Detection ratio average" column) also increases.
• Answer 3. The object coverage criteria are more suitable than the traditional coverage criteria when measuring the effectiveness of a test suite in detecting OO-related failures.

Limitations
Fig. 5 The scatterplot between poly-object coverage level/traditional coverage level and faulty detection ratio

The object coverage criteria emphasize the execution of the different parts of the code that represent the state and behavior of an object. These parts include the main class code along with the code of the inherited classes, while considering the actual type of the object under test at runtime. The criteria can only show that the different parts of the code related to the object are executed at least once. In other words, like the traditional code coverage criteria, they only enforce the execution of different parts of the code and do not necessarily show the presence or absence of a program's failures. One of the main means of revealing failures (especially failures rooted in the logic of programs or classes) is the assertion part of test cases, which plays the role of the test oracle. Like the traditional code coverage criteria, the object coverage criteria cannot effectively evaluate test case assertions. This is why our experimental evaluation has been conducted with very simple faults and mutants, which are likely to be detected by the assertions of auto-generated tests once the faulty parts of the code are executed.
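To make the oracle's role concrete, consider a minimal JUnit test for the hypothetical SavingsAccount class sketched earlier: merely executing the faulty code proves nothing; the assertion is what surfaces the failure.

```java
import static org.junit.Assert.assertEquals;
import org.junit.Test;

public class SavingsAccountTest {

    // The assertion below is the test oracle: without it, the IHI mutant's
    // faulty code would run to completion without any visible failure.
    @Test
    public void depositUpdatesAvailableBalance() {
        SavingsAccount account = new SavingsAccount();
        account.deposit(100.0);
        assertEquals(100.0, account.available(), 0.001);  // fails on the IHI mutant
    }
}
```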
Therefore, our evaluation results cannot show that achieving a high object coverage level for a test suite necessarily leads to revealing real OO failures. This claim is beyond the scope of this research; examining the correlation between high object coverage levels and the capability to detect real-world OO failures could be the subject of future work.
Since we used OO mutations, i.e., MuJava class mutations, to generate mutants, and we have shown the correlation of the object coverage criteria with the detection ratio of these mutants, one could argue that the object coverage criteria are not necessary when mutation analysis techniques supporting OO-related mutants already exist. However, in addition to existing OO mutation approaches such as MuJava, we used various techniques to generate faulty versions. These techniques, which are not practically supported by existing OO mutation tools, make simultaneous changes in parts of the class under test as well as its parent and ancestor classes, and also apply simultaneous changes across several classes of the same family. Furthermore, like the traditional code coverage approaches, our object coverage approach can be used very simply, with high automation and negligible execution cost. In contrast, there are few automated tools for OO mutation, and, like all mutation techniques, OO mutation methods have a high execution cost (Segura et al., 2011).
The last limitation of this work is the absence of a comparison between the proposed tool, OCov4J, and competing tools. In our evaluation, we conducted an empirical analysis using OCov4J to show that the object coverage criteria generally outperform the traditional coverage criteria for assessing effectiveness in detecting OO-related failures. To provide further evidence of the effectiveness of our approach, it would be important to compare its performance against that of its competitors. Unfortunately, to the best of our knowledge, there is no OO-related coverage-based tool in the literature that is comparable to the proposed approach and applicable to the large benchmarks used in our empirical evaluation. More precisely, we encountered the following challenges for such a comparison:

• First, our proposed coverage criteria are based on simple structural coverage of the code and do not depend on any semantic information about the code (such as data flow). However, most approaches for OO testing are based on notions such as data flow. Although an approach such as that of Smith and Robson (1990) could be compared to ours, it is an old approach and no tools are currently available to support it.
• Although some data flow-based approaches work differently from the proposed approach, they share the same goal. For example, the approach of Alexander et al. (2010) was explicitly developed to better detect errors associated with polymorphism and provides coverage criteria that could be compared with our poly-object coverage criteria. However, we were unable to find or access a tool that automates this approach, and re-implementing such a tool would be very time-consuming and error-prone.
• Finally, many test tools have been developed for a specific technology or programming language. For example, the method of Harold et al. (1992) can only be applied to the C++ language, and the approach of Zou et al. (2014) is based on the concept of the "Document Object Model" and is only applicable to web applications based on JavaScript and PHP. Consequently, they cannot be applied to the Java programs used in our evaluation process.

Threats to validity
One of the internal validity threats could be potential faults in the OCov4J tool implementation. We have tested this tool thoroughly and have published it as open source software for further evaluation and extension. As part of testing OCov4J, we compared it with a mature traditional code coverage tool, JaCoCo. To do so, we applied both tools to some classes without any parent or child. Based on our arguments in Sect. 3.1, applying the object coverage criteria and the traditional code coverage criteria to such classes must lead to the same results, and the OCov4J and JaCoCo tools indeed behaved according to this expectation.

There are three points to note with respect to external validity threats. First, there are the benchmarks chosen for our evaluation (refer to Sect. 4.1). We attempted to select as many different benchmark projects as possible and chose real-world, widely used projects in different scopes, such as validators, file decoders, data structures, and analytic software. Moreover, in selecting classes from each project, we chose classes in different inheritance hierarchies with different depths. The second threat to external validity stems from our approach to generating faulty versions (refer to Sect. 4.3), which raises the possibility of bias and could make the generality of the results questionable. Regarding this concern, we followed highly cited approaches, namely Offutt et al. (2001), Ma et al. (2006), and Offutt et al. (2006), to manually seed different types of OO faults. In addition to the manual faulty versions, we used mutations automatically generated by MuJava with its different mutation operators for OO features, such as inheritance and polymorphism. We also examined all auto-generated mutations to avoid using equivalent mutants. We have published our faulty versions as a benchmark alongside our open source tools, OCov4J and MuRunner, to facilitate the reproduction of the results and further experimental evaluation. Finally, the third point, discussed in detail in the previous subsection, is the impossibility of fully comparing the proposed method with existing OO testing methods, which constitutes a threat to external validity.

Related works
Initially, Smith and Robson (1990) showed that testing OO programs or classes differs from testing procedural programs; one of the most important differences is the indirect testing of a class through different objects made from the class and its family. They proposed a method for testing classes that combined specification-based and attribute-based approaches. It addressed object instantiation in OO programs, but did not specifically consider other OO features, such as inheritance and polymorphism.
One of the early beliefs about OO testing was that testing time and cost are reduced when inheritance is used: once a parent class is adequately tested, testing its child classes should be much easier, because it suffices to test just the new methods added to the child classes. These beliefs were shattered by Perry and Kaiser (1990), who analyzed the adequacy of tests for OO programs, such as programs using single inheritance, method overriding, or multiple inheritance. Their analysis showed that the effort of testing OO programs is not reduced; on the contrary, achieving "test adequacy" costs even more. Harold et al. (1992) expanded the work of Perry and Kaiser by determining which features should be retested in a C++ inheritance hierarchy. They presented an algorithm that categorizes the inherited methods of a child class and shows which of them should be re-tested in this class from scratch.
Another approach to test adequacy criteria is based on mutation analysis, in which artificial defects are seeded into the program and test adequacy is then evaluated by examining whether the tests detect these defects. Although most of the research in this area concerns the procedural paradigm (Papadakis et al., 2019), a few works have dealt with specific features of OO programs. One of the first efforts to analyze mutations for OO-related problems was the work of Kim et al. (2001), who created several operators, called "class mutators," to insert faults related to OO problems. Their approach was evaluated on a few classes with low LOC, and they did not introduce any tool to support it. Offutt et al. (2006) provided a more complete approach for analyzing mutations for OO features: a set of 25 class mutation operators in two subsets, one corresponding to general OO features like inheritance and polymorphism, and the other related to Java-specific OO features, such as default constructor replacement. This approach was extended by Ma et al. (2006), who introduced MuJava, a complement to the previous approach that also includes an automated tool for the Java language. Although this tool did not initially support common Java language features, such as generic classes/methods, its subsequent updates have covered some of them. Therefore, this tool is now more practical than other tools for analyzing mutations of specific OO features, such as inheritance, polymorphism, and dynamic binding (Segura et al., 2011). We can also mention Judy (Madeyski & Radyk, 2010) as another tool for the mutation of Java programs; it is claimed that, compared to MuJava, Judy produces and evaluates mutants more efficiently, but it only supports traditional mutation operators and disregards inheritance, polymorphism, and dynamic binding operators. Segura et al. (2011) applied existing mutation approaches and tools to a real, large-scale project. Unlike previous works that had analyzed mutation techniques on educational programs or sample classes in a real project, they did so for all classes of a real project. They reported several practical limitations in the current tool set for OO mutation analysis and concluded that, without fast and fully automated tools, mutation testing is impractical because it is very time-consuming. Finally, as an efficient and up-to-date tool that integrates with build and continuous integration tools, we can mention the tool "Pit" for Java (Coles et al., 2016); it only supports traditional, procedural mutation operators, though.
Some test adequacy approaches use data flow criteria, which are based on the relationships between program/class state variables. Various graph-based methods have also been used for coverage analysis of OO programs. In these approaches, a graph (such as a control flow graph, data flow graph, or call graph) is first constructed from the program elements; then, tests are executed to cover different parts of the constructed graph with respect to several defined coverage criteria. One of the most popular categories of these criteria is the data flow coverage criteria, which are based on the notion of def-use pairs. Such a pair couples the location where a variable is defined with a location where it is used elsewhere in the program, and each pair constitutes a test requirement: a data flow testing approach designates test data to satisfy the def-use pairs imposed by a particular data flow criterion.
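As a minimal illustration (a hypothetical class, not taken from the cited works), the pair below forms one def-use test requirement on the state variable count:

```java
// Hypothetical example of a def-use pair on a class state variable.
class Counter {
    private int count;

    void reset()   { count = 0; }      // def: defines the state variable
    int  current() { return count; }   // use: reads the same variable
}

// The pair (reset, current) induces one test requirement: a data flow
// criterion asks for a test that calls reset() and then current() with
// no intervening redefinition of count.
```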
For OO programs, data flow approaches have usually been derived based on their procedural nature; only a few works have adapted the traditional data flow techniques to the new OO features (Su et al., 2017).
One of the first works that addressed polymorphism issues based on data flow analysis is that of Orso and Pezzi (1999). They extended the classical data flow criteria by considering the polymorphic calls of each method in a def-use path, i.e., a sequence of method invocations in which a class variable is defined and its value is subsequently used. If one of these methods can be executed polymorphically, the approach derives a new test case for each polymorphic execution by substituting that execution in the def-use path. This approach has a simple, well-defined basis; however, the authors evaluated it using only a single simple example, did not investigate it on a real case, and introduced no tool for automating it.
Another widely cited approach, called "coupling-based testing," has been proposed by Alexander et al. (2010) for testing polymorphic features. This study extends a previous work by the same authors (Alexander & Offutt, 2000), adding more coverage criteria and an evaluation on multiple cases. In this approach, a data flow technique is applied to a graph of method calls in the inheritance hierarchy. First, all "couple methods" are extracted from the program: these consist of two method invocations on an object, where the first invocation defines a state variable of the class and the second uses it, with no intervening change to the variable's value. The path between these two invocations is called a "coupling sequence". The authors used coupling sequences to test polymorphic features: all class types that could be bound to an object are considered, and all coupling sequences are examined by substituting the different class types for the object under test. Based on this idea, some coverage criteria have been introduced to cover the different parts of coupling sequences. One weakness of this approach is that it has not been evaluated on large, real-world projects; another is the lack of a tool supporting its automation, which makes it impractical. The lack of efficient, automated tools is a common challenge in most data flow approaches (Su et al., 2017).
It should also be noted that traditional data flow approaches usually derive def-use pairs through static analysis of the source code. Due to the dynamic nature of OO programs, Denaro et al. (2015) used a dynamic analysis approach to test OO programs. They did not introduce new coverage criteria; rather, they employed additional information gathered from dynamic analysis to find effective def-use pairs for state variables. In evaluating their work, they used traditional, procedural mutation analysis and did not apply specific OO mutations.
Najumudheen et al. (2019) presented a method for constructing a single coherent graph by combining three graphs: a data flow graph, a control flow graph, and an object dependency graph. Using this graph, coverage analysis becomes feasible for OO features like inheritance and polymorphism. Based on the graph, the authors defined polymorphism and inheritance coverage criteria: the former represents the ratio of the number of polymorphic method invocations exercised by the tests to all possible polymorphic invocations; the latter requires that all methods with the required access level in the descendant classes be executed by some test. This approach has been evaluated only on a few simple benchmark programs with few lines of code. Given the complexity and size of the constructed graphs even for simple examples, it is unclear how this approach could be practical for large, real-world programs.
The approaches outlined in this section provide general guidelines for most OO programs. Although some of them have been implemented as tools for specific programming languages, they can be applied to various OO programming languages. Nevertheless, some approaches have been introduced to enrich test adequacy criteria for specific applications, such as dynamic web applications. For example, Zou et al. (2014) provided test adequacy criteria for dynamic web applications written in PHP and JavaScript. To evaluate tests, instead of using code coverage criteria on the server side or UI element coverage on the client side separately, they considered both at the same time. In their approach, a virtual Document Object Model (DOM) is first created by analyzing the server-side and client-side code, and then criteria for covering different parts of this DOM are provided. The authors conducted a study and concluded that their criteria outperformed the existing server-side code coverage and client-side UI element coverage criteria because they found more faults. However, approaches like that of Zou et al. (2014) depend on specific elements of certain applications, such as the DOM, which is an OO representation of an HTML document in a dynamic web application; hence, they cannot be used in other OO domains and programs.
As described, some related approaches have addressed OO-specific features by introducing test adequacy criteria, but they have not been evaluated on large software and no automated tools support them, which makes these approaches impractical. Our approach adapts the very popular code coverage criteria to address specific OO features while maintaining their strengths, including simplicity, low execution cost, and high automation.

Conclusion and future works
This paper presented test adequacy criteria for better testing of OO programs, with emphasis on OO features such as object instantiation, inheritance, polymorphism, and dynamic binding. The proposed approach adapts the notion of traditional code coverage so that the proposed coverage criteria can better address problems arising from these OO features. In this adaptation, we have attempted to preserve the strengths of the traditional code coverage criteria, such as ease of use, high automation, and low execution cost.
Compared to the traditional code coverage criteria, the proposed criteria, called object coverage criteria, take two key points into account: first, they regard the part of the class code that is executed by each object type; second, they consider all code inherited from the parent/ancestor classes that represents the states and behaviors of the class under test, either directly or indirectly. Based on these ideas, we first proposed the basic criteria, called "object coverage," which are defined over a class and its parent/ancestor classes to address specific OO issues. We then defined auxiliary criteria, called "poly-object coverage," which consider a class together with its child/descendant classes, specifically to address problems related to polymorphism and dynamic binding.
The above criteria have been implemented in a prototype tool, called OCov4J, supporting the Java language. Using this tool, the introduced criteria have been empirically applied to several different open source, widely used projects. In this experiment, artificial faulty versions were created from different classes in various inheritance hierarchies, and it was then examined how many of these faulty versions could be detected by the test suite of each class. The results demonstrated a strong positive correlation between the faulty version detection ratio and the object coverage level. To complete this research, the following topics can be addressed in future work.

Validation of automated OO test generation tools
In recent years, several tools have been introduced to automate test generation for OO programs, especially Java programs, and a number of experiments have been conducted to evaluate their ability to generate effective tests. For example, an annual competition has been held to evaluate the best Java-based automated test generation tools. None of these evaluations has addressed the common issues of OO programs, such as encapsulation, inheritance, and polymorphism: in the mentioned competitions, the traditional code coverage criteria, along with mutants generated by the Pit tool (2016) (which only generates traditional, procedural mutants), have been used to evaluate the effectiveness of tests. Although the effectiveness of the EvoSuite tool, as one of the main automated test generation tools, was explored using simple examples during the presentation of our work (Sects. 3.1.2 and 3.2.1), we could further evaluate EvoSuite and other related tools, such as T3 (2015), Tardis (2019), ART-CovPS (2019), and Randoop (2007), with respect to specific OO faults by building a benchmark based on the object and poly-object coverage criteria.

Test generation using object coverage criteria
The traditional code coverage criteria are used as indicators for generating tests in automated test generation tools. For example, many search-based approaches, which generate tests using metaheuristic algorithms, employ code coverage indicators (along with other indicators) to design the fitness functions used by these algorithms; in effect, the code coverage criteria guide the test generation process. Therefore, one line of future work is to use the proposed object coverage criteria for test data generation, in order to better address problems related to OO features. In this regard, extending the EvoSuite approach by adding the object coverage criteria as a new fitness indicator could be the subject of future work.
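As a rough illustration of this direction, the sketch below shows a fitness function that rewards object coverage. TestSuite and its methods are assumptions made for the example and do not correspond to EvoSuite's real API:

```java
// Hypothetical sketch: object coverage as a fitness indicator in a
// search-based test generator. All names here are illustrative.
interface TestSuite {
    double objectStatementCoverage();  // fraction in [0, 1], e.g., measured via OCov4J
    int length();                      // total number of statements in the suite
}

class ObjectCoverageFitness {
    // Higher is better: maximize object coverage, with a small size bonus
    // so the search prefers the smaller of two equally covering suites.
    double evaluate(TestSuite suite) {
        double coverage = suite.objectStatementCoverage();
        double sizeBonus = 1.0 / (1.0 + suite.length());
        return coverage + 0.01 * sizeBonus;
    }
}
```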

Implementation of various object coverage criteria
In this study, to focus on the main stream of our idea, only some basic criteria, such as object statement coverage and object line coverage, were proposed and implemented in the OCov4J tool. Since the object coverage criteria are extensions of their code coverage counterparts, other criteria, such as object branch coverage, object decision coverage, and object modified condition/decision coverage (MC/DC), can be defined and implemented in OCov4J. A complementary empirical evaluation could then determine the effectiveness of these new object coverage criteria. Furthermore, some test adequacy criteria in the context of data flow analysis can be combined with the concept of object coverage, using different def-use pairs of class variables and inherited class variables in the related inheritance hierarchy.
Evaluation of the proposed criteria using real-world faults

The purpose of this study has not been to show the effectiveness of the object coverage criteria in detecting real-world OO failures; instead, the emphasis has been on the execution of different parts of the object under test. Therefore, to evaluate this research, we used artificial faulty versions constructed by OO mutation approaches. Evaluating the effectiveness of the proposed criteria in finding real OO failures can be one of the future works. Although approaches such as Defects4J (2014) and Bugs.jar (2018) have provided sets of real bugs for Java programs to facilitate controlled testing studies, none of them has considered OO-related faults and problems; most of the provided real-world faults are procedural in nature. To evaluate the object coverage criteria on real-world failures, similar to approaches such as Defects4J, we could first gather a set of OO-related faults from widely used, large-scale open source projects, then categorize these faults into groups such as object instantiation, inheritance, and polymorphism, and finally use this set of faults as a benchmark for better evaluating all works related to OO testing, including our object coverage criteria.
Improving OO mutation approaches

As discussed in Sect. 4.3, existing OO mutation approaches (with support for operators that model OO faults) have some shortcomings, such as not generating mutants via simultaneous changes within a class file or not building mutants by changing the code of the parent/ancestor classes. To compensate for these shortcomings, in this study, in addition to using MuJava as an existing OO mutation tool, we manually created a series of mutants. We also implemented an auxiliary tool, called MuRunner, to facilitate analyzing these manually generated mutants; it automates the replacement of the original classes with the faulty ones, the recompilation of the modified classes, test execution, and the reporting of mutation results. One direction for future work is to complete our improved mutation approach by upgrading the MuRunner tool to support mutant generation in a fully automated manner. This would facilitate further empirical evaluations of our object coverage criteria and could also serve as a mutation analysis approach with emphasis on OO-specific features.

Fig. 1 An inheritance hierarchy ending with class C


Mohammad Ghoreshi is a PhD student in Software Engineering at Shahid Beheshti University. He is passionate about exploring new ways to improve software quality and reliability through innovative testing methods and tools.

Hassan Haghighi received his PhD degree in Computer Engineering-Software from Sharif University of Technology in 2009 and is currently a Professor in the Faculty of Computer Science and Engineering at Shahid Beheshti University, Tehran, Iran. His main research interests include formal methods, software testing, and data quality.

Table 1 A summary of benchmark projects

Table 2 Target classes and related faulty versions

Table 3 Experimental results for each target class

Table 4 Experimental results for the poly-object coverage level

Table 5 Experimental results for target classes with a high Cov and a low OCov

Table 6 Experimental results for target classes grouped by the OCov level