Processes are central elements of software development, and they have changed over the past decades, from views inspired by manufacturing to more innovative approaches, like Agile and DevOps. New approaches tend to focus more on people and practice (Jacobson et al., 2007), and have usually been created by practitioners, not methodologists. Processes and practices are tools (Cockburn, 2004; Pfleeger, 1999), and there is evidence that the interactions between users and their methods resemble their interactions with their tools (Riemenschneider et al., 2002). Given that usability characterizes artifacts that are easier and more attractive to use, usability might improve process and practice adoption, and also make adopted processes and practices more sustainable. Therefore, applying usability principles and heuristics to software development processes and practices might help adoption initiatives and improve the experience of the people involved. For example, feedback is a usability principle applied in iterative processes, allowing teams to collect information about their product, processes, and practices in order to learn and improve.
There is little research on process and practice usability. Although several researchers have proposed process quality models (Feiler & Humphrey, 1992; Guceglioglu & Demirors, 2005; Kroeger et al., 2014), very few of these models include usability among their quality attributes.
To assess the impact opportunities addressed by this research, it suffices to consider how real-world teams and organizations strive to adopt Agile and DevOps practices and embrace their mindsets. Many agile transformation initiatives struggle to accomplish their objectives (Dikert et al., 2016), and practice adoption levels are not what might be expected given the popularity of agile methods (Kuhrmann et al., 2019; Paez et al., 2018). This undermines process improvement initiatives and negatively affects costs and motivation. Moreover, many improvement initiatives are planned and conducted in a top-down fashion without considering the perspective of the people affected (J. S. Brown & Duguid, 2000). Process and practice improvement through adoption is hard, even for effective organizations, and the lack of clear and concrete guidance makes these challenges even more difficult.
To support the application of usability concepts to processes and practices, the Usability Model for Software Development Process and Practice (UMP) was developed following the Design Science Research approach (Johannesson & Perjons, 2014). The UMP comprises characteristics and metrics, and can be applied to specific processes and practices to evaluate their usability and identify related improvement opportunities. UMP’s goal is to improve the usability of software development processes and practices, thereby enhancing the work experience of software developers and the overall effectiveness of process and practice improvement and adoption initiatives.
One of the main requirements established for the UMP is that it should be reliable, meaning that evaluations by different people should produce consistent results when applied to the same process or practice. For example, if several people evaluate a given process or practice, the values produced for each metric should not present high variance, because high variance might imply that the UMP is not able to reliably describe the usability features of the process or practice. This type of reliability is usually defined as inter-rater reliability (Hallgren, 2012). Reliability matters in this context because the model, to be effective, must be used by different people, particularly given that software development is, in general, a collective endeavor (this also applies to improvement initiatives in organizations). If different people produce very different metric values for the same process or practice, there is a high risk that the values cannot be meaningfully aggregated, which would impair interpretation of the metric values and prevent further improvement actions. In particular, low reliability might point to “instability of the measuring instrument when measurements are made between coders” (Hallgren, 2012), and thus provide an opportunity for improving the UMP.
The Design Science Research approach guides the development and validation of artifacts aimed at solving specific practical problems. The purpose of this paper is to present two validation studies conducted to assess inter-rater reliability for the UMP’s 23 metrics. We also share the challenges that emerged when applying statistics traditionally recommended in the software engineering literature, such as those from the Kappa family, the lessons those challenges provided, and the alternative statistics eventually applied to resolve them.
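To make the notion of chance-corrected agreement concrete, the following sketch illustrates Cohen’s kappa, one of the Kappa-family statistics mentioned above, computed by hand in Python. It is illustrative only: the two-rater setup, the 1–5 ordinal scale, and the rating values are hypothetical and do not come from our studies.

from collections import Counter

# Hypothetical ratings: two raters score the same UMP metric on a 1-5
# ordinal scale for eight processes/practices.
rater_a = [4, 5, 3, 4, 2, 5, 4, 3]
rater_b = [4, 4, 3, 5, 2, 5, 4, 2]
n = len(rater_a)

# Observed agreement: proportion of items rated identically.
p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

# Expected chance agreement, from each rater's marginal distribution.
freq_a, freq_b = Counter(rater_a), Counter(rater_b)
categories = set(rater_a) | set(rater_b)
p_expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in categories)

# Cohen's kappa corrects observed agreement for agreement by chance.
kappa = (p_observed - p_expected) / (1 - p_expected)
print(f"observed={p_observed:.2f} expected={p_expected:.2f} kappa={kappa:.2f}")

Note that kappa treats the scale as purely nominal: a disagreement of one scale point counts the same as a disagreement of four. This insensitivity to ordinal distance is one of the issues discussed later when motivating the alternative statistics.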
The first study was conducted by applying the UMP to Scrum, which was selected for the initial study for several reasons. First, it pursued the research line initiated with the UMP feasibility study, in which two external experts evaluated Scrum (Fontdevila et al., 2017). Second, being a process framework, Scrum was considered an object of evaluation larger than most practices, yet simple enough for holistic evaluation. It is also the most popular agile method, and some of its stated pillars and values resonate positively with usability. For example, transparency, one of Scrum’s pillars (Schwaber & Sutherland, 2017), matches visibility, one of the UMP characteristics.
The second study was conducted on TDD and BDD, selected to complement Scrum (a process-oriented example) with two practices, which tend to be more fine-grained. Also, some of the UMP’s characteristics, like Feedback, seemed to describe part of the appeal of these practices, while some specific metrics, like Conceptual model correspondence, seemed to highlight some of the challenges perceived with TDD adoption, particularly issues with the test-first approach (Beck & Andres, 2004).
The rest of this paper is structured as follows: Section 2 presents the background and related work; Section 3 describes the overall research strategy applied to develop the UMP following the DSR framework, and the research method applied to assess the inter-rater reliability of the UMP’s metrics; Section 4 describes the UMP’s characteristics and metrics in detail; Section 5 and Section 6 present the Scrum study and the TDD-BDD study, respectively. Finally, Section 7 outlines the conclusions and the lines pending for future research.