The literature emphasizes that end-users record a large volume of feedback on social media platforms [3] [17], which makes it challenging, time-consuming, and difficult to manually identify and capture rationale information that could improve requirements decision-making and enhance user satisfaction [18]. For this purpose, we collected 77,202 end-user comments for 59 different software applications in the Amazon store and selected 11,416 end-user comments using a stratified random sampling approach. We prepared a test sample to evaluate the performance of different ML algorithms in automatically identifying rationale elements in the crowd-user comments. For this, we selected ML algorithms from the literature based on their strong performance in mining information from textual documents. The algorithms shortlisted for this purpose are Multinomial Naïve Bayes (MNB), Linear Support Vector Machine (SVM), Logistic Regression (LR), KNN, the MLP classifier, Gradient Boosting, the Voting Classifier, Random Forest, and Ensemble Methods. Also, to balance the data set, we used standard resampling approaches, i.e., oversampling and under-sampling; Fig. 2 shows the number of comments identified for each rationale label. Further, we used the standard 10-fold cross-validation approach [34] to train and validate the ML algorithms. Additionally, instead of systematically evaluating all possible configurations reported in the relevant research, one of our main objectives was to find configurations that result in accurate classifiers. This article aims to analyze and appraise the accuracy of different ML techniques. When deploying the distinct machine learning models with different configurations, we achieved relatively high precision, recall, and F1 scores. The details of the machine learning experiment are discussed below:
4.1 Experimental Setup
For this ML experiment, we employed text preprocessing and feature engineering techniques to assess the various classifiers and reveal their performance in automatically identifying rationale elements in end-user reviews. Before text preprocessing, feature engineering, data balancing, and training, we first selected machine learning algorithms based on their good performance on textual data reported in the literature [15, 17, 35]. The selected algorithms are MNB, SVM, LR, KNN, MLP, GB, the Voting classifier, RF, and Ensemble Methods. A voting classifier is a classification approach that makes predictions by combining the results of multiple classifiers and performs well on several classification tasks. The Voting Classifier supports two voting methods: in hard voting, the predicted output class is the one that receives the largest number of votes, whereas in soft voting, the output class is predicted from the averaged probability assigned to each class. In contrast, ensemble techniques are ML algorithms that build a collection of classifiers and then categorize incoming data points based on a (weighted) vote over their predictions; some ensemble methods combine data splits and multiple algorithms to produce results with higher accuracy. Furthermore, each ML algorithm was trained and validated on the end-user rationale feedback to automatically classify crowd-user reviews into five categories of software rationale: attacking-claim, issue, supporting-claim, neutral-claim, and decision. The experiments were conducted in a Python environment, as discussed below:
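As an illustration, the following sketch shows how such a voting ensemble over a subset of the selected classifiers could be configured with scikit-learn; the estimator pool and hyperparameters shown here are assumptions, not the exact configuration used in the experiment.

```python
# Hypothetical voting ensemble over a subset of the selected classifiers.
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier, VotingClassifier

estimators = [
    ("mnb", MultinomialNB()),
    ("lr", LogisticRegression(max_iter=1000)),
    ("rf", RandomForestClassifier()),
    ("mlp", MLPClassifier(max_iter=500)),
]

# Hard voting: the class receiving the most votes wins.
hard_vote = VotingClassifier(estimators=estimators, voting="hard")
# Soft voting: classes are ranked by the averaged predicted probabilities.
soft_vote = VotingClassifier(estimators=estimators, voting="soft")
```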
1) Preprocessing
For an ML experiment, cleaning the input (textual) data is pivotal before developing an ML model or classifier. For this purpose, we performed a series of preprocessing steps: remove HTML tags, if any, from the crowd-user comments; filter out URLs if they exist in the end-user comments; transform the comment text to lowercase; and finally, remove brackets, punctuation, alphanumeric noise, and other special symbols or characters from the textual documents. Also, to improve the performance of the ML algorithms, we reduced the words in the data set to their root forms using a text normalization technique called lemmatization. Through manual analysis of the end-user comments with a content analysis approach, we identified that certain stop words in the user comments commonly indicate particular rationale elements. For example, the stop words “does” and “could” are possible indicators of the issue rationale element, while “not” and “doesn’t” are potential indicators of attacking claims. Similarly, the stop words “will,” “was,” and “have” appear to be possible indicators of the decision rationale element. A similar practice is also reported in the literature [17, 18, 36].
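A minimal sketch of these preprocessing steps is shown below, assuming NLTK for lemmatization and stop-word lists; the exact regular expressions and the decision to drop all stop words except the rationale indicators are illustrative assumptions based on the steps described above.

```python
# Preprocessing sketch (assumes nltk.download("wordnet") and nltk.download("stopwords")).
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Stop words deliberately retained because they can indicate rationale elements
# (e.g., "does"/"could" -> issue, "not"/"doesn't" -> attacking, "will"/"was"/"have" -> decision).
KEPT_STOPWORDS = {"does", "could", "not", "doesn't", "will", "was", "have"}
STOPWORDS = set(stopwords.words("english")) - KEPT_STOPWORDS

def preprocess(comment: str) -> str:
    text = re.sub(r"<[^>]+>", " ", comment)              # remove HTML tags
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # remove URLs
    text = text.lower()                                  # lowercase
    text = re.sub(r"[^a-z'\s]", " ", text)               # drop punctuation, digits, special symbols
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(lemmatizer.lemmatize(t) for t in tokens)
```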
Additionally, we used the LabelEncoder class from Python's "sklearn.preprocessing" module, which translates each text value in a column into a number, making the data ready and usable for the machine learning algorithms. Furthermore, each ML algorithm was experimented with using distinct textual features and parameters to capture and reveal its performance in automatically classifying rationale elements. For instance, we computed the TF-IDF features for each textual document, returning a (documents, features) matrix that can be used as input to an ML algorithm. The objective is to transform the text into meaningful numbers that are used to fit machine learning algorithms for prediction. We also employed the CountVectorizer technique to represent the users' rationale documents, which gives equal weight to each word occurrence in the corpus.
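A sketch of both feature representations and the label encoding follows; `clean_comments` and `rationale_labels` are hypothetical placeholders for the preprocessed corpus and its annotations.

```python
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.preprocessing import LabelEncoder

clean_comments = ["the app crash on startup", "i will keep using it"]  # hypothetical corpus
rationale_labels = ["issue", "decision"]                                # hypothetical annotations

# (documents x features) matrices fed to the ML algorithms.
X_tfidf = TfidfVectorizer().fit_transform(clean_comments)
X_counts = CountVectorizer().fit_transform(clean_comments)

# Encode the rationale labels as integers.
y = LabelEncoder().fit_transform(rationale_labels)
```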
2) Data Imbalance
In supervised ML, imbalanced data sets are considered a critical technical challenge [37]. Data imbalance reflects the lack of an equal distribution of annotation classes within a data set. Our data set is somewhat imbalanced, as shown in Fig. 5: when annotating the end-user comments, most user comments (45%) were categorized as supporting claims, while only 4% were identified as decision rationale elements. Training a machine learning classifier on an imbalanced textual data set would force it to skew towards the majority class samples and disregard the minority classes, i.e., the classes with a limited number of occurrences in the data set. To handle the imbalanced data set, we employed the two balancing approaches that are frequently and widely used in the software literature [37], i.e., oversampling and under-sampling. We used these two data balancing techniques to improve the performance of the ML models and obtain more accurate results on the minority classes compared with the imbalanced data set. Oversampling is a non-heuristic technique that achieves a balanced class distribution by randomly repeating minority class examples [38], whereas under-sampling is a non-heuristic technique that achieves a balanced class distribution by arbitrarily excluding or eliminating majority class samples [39]. Furthermore, to decide which data balancing approach (oversampling or under-sampling) is more suitable for training the ML classifiers in the experiment, we utilized Receiver Operating Characteristic (ROC) [40] and Precision-Recall [41] curves. For this purpose, we plotted the True Positive (TP) rate against the False Positive (FP) rate for each ML classifier used in the experiment. In line with that, Fig. 6 shows the ROC curves of the MLP and RF classifiers under both over- and under-sampling, used to determine the best resampling approach. We selected these two classifiers as examples because of their better performance in classifying crowd-user reviews into rationale elements in the experiment. From the experiment, we found that ML classifiers using oversampling consistently outperform those using under-sampling, which may be due to the loss of critical and valuable textual information when under-sampling discards samples [56].
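The resampling step could be implemented as in the sketch below, assuming the imbalanced-learn package; the library choice and random seed are assumptions, not details given in the text.

```python
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# X_tfidf and y come from the feature-engineering step above.
X_over, y_over = RandomOverSampler(random_state=42).fit_resample(X_tfidf, y)      # repeat minority samples
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X_tfidf, y)   # drop majority samples
```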
3) Assessment & Training
To train and validate the supervised ML algorithms, we applied a stratified 10-fold cross-validation approach to the textual data set from the Amazon store. Nine folds were used to train the ML algorithms and one fold was used to validate them; the training and testing process was repeated ten times by rotating the training and testing folds. The benefit of cross-validation for training and validating an ML classifier is that it shows how well a model works when only limited data is available, since it doubles as a resampling method for assessing a model on a limited amount of data. Stratified k-fold cross-validation is currently the most commonly used approach to train and validate ML classifiers; each fold has roughly the same proportion of labels representing each class. To assess the effectiveness of the classifiers, we computed and summarized the average results obtained over the 10 cross-validation runs. For this purpose, we utilized Precision (P), Recall (R), and F1-score measures to evaluate the supervised ML algorithms and compare their performance. P and R are computed with the formulas below:
$$P_{k} = \frac{TP_{k}}{TP_{k}+FP_{k}}, \qquad R_{k} = \frac{TP_{k}}{TP_{k}+FN_{k}}$$
P_k is the ratio of true positives (correctly classified end-user rationale comments of type k) to all crowd-user comments classified as type k (both correctly and incorrectly classified). Similarly, R_k measures the reliability of the machine learning classifiers in recognizing the relevant information. TP_k represents the number of end-user rationale comments correctly classified as type k, FP_k represents the number of comments wrongly classified as type k, and FN_k represents the number of comments of type k incorrectly classified as another type. The F1-score is the harmonic mean of P_k and R_k.
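A sketch of the stratified 10-fold evaluation is shown below; the MLP classifier and macro-averaged scorers are used as an example, and the hyperparameters are assumptions.

```python
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neural_network import MLPClassifier

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_validate(
    MLPClassifier(max_iter=500),
    X_over, y_over,                 # oversampled features and labels from the previous step
    cv=cv,
    scoring=["precision_macro", "recall_macro", "f1_macro", "accuracy"],
)

# Average the per-fold results, as summarized in Table 3.
for metric, values in scores.items():
    print(metric, round(values.mean(), 3))
```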
4) Results
The optimized results of the distinct machine learning classifiers used to identify rationale elements in end-user feedback are shown in Table 3. As the table shows, the MLP, Voting, and RF classifiers achieve the highest accuracy (93%, 93%, and 90%, respectively) in capturing and classifying rationale elements in end-user feedback from the Amazon software application store, outperforming the other machine learning classifiers. The results obtained from the MLP, Voting, and RF classifiers are quite similar, as shown in Table 3. The MLP-TFIDF, MLP-CountVectorizer, and Voting-CountVectorizer classifiers yield the highest precision, recall, and F-measure values for classifying supporting rationale elements: 98%, 94%, and 96%, respectively. Next, the MLP-CountVectorizer, RF-TFIDF, and Voting-CountVectorizer classifiers give the highest precision, recall, and F-measure values for classifying neutral rationale elements: 75%, 90%, and 81%, respectively. Moreover, MLP-TFIDF gives a higher recall value (91%) for classifying neutral rationale elements but performs poorly on precision and F-measure (66% and 76%, respectively). Also, the MLP-TFIDF, Voting-CountVectorizer, and MLP-CountVectorizer algorithms outperform the other machine learning algorithms when classifying issue rationale elements, giving the highest F-measure value of 93%. Similarly, the MLP-TFIDF and RF-TFIDF classifiers outperform the other classifiers when identifying decision rationale elements, reaching precision, recall, and F-measure values of 97%, 96%, and 96%, respectively. Finally, MLP-TFIDF, Voting-CountVectorizer, and MLP-CountVectorizer achieve the highest precision, recall, and F-measure values for identifying attacking rationale elements: 93%, 91%, and 92%, respectively.
Table 3. The performance of different machine learning algorithms (precision, recall, and F1) in classifying user comments into rationale elements
Labeled Tags | ML Algorithms and Features | Precision (%) | Recall (%) | F-Measure (%) |
Supporting | MLP-TFIDF | 98 | 94 | 96 |
Supporting | Random Forest-TFIDF | 97 | 89 | 93 |
Supporting | MLP-CountVectorizer | 98 | 94 | 96 |
Supporting | Voting-CountVectorizer | 98 | 94 | 96 |
Supporting | Random Forest-CountVectorizer | 96 | 90 | 93 |
Neutral | MLP-CountVectorizer | 74 | 90 | 81 |
Neutral | Random Forest-TFIDF | 75 | 85 | 80 |
Neutral | MLP-TFIDF | 66 | 91 | 76 |
Neutral | Voting-CountVectorizer | 74 | 90 | 81 |
Neutral | Support Vector-CountVectorizer | 52 | 83 | 64 |
Issue | MLP-TFIDF | 92 | 94 | 93 |
Issue | Random Forest-TFIDF | 94 | 86 | 90 |
Issue | Voting-CountVectorizer | 93 | 93 | 93 |
Issue | Random Forest-CountVectorizer | 90 | 90 | 90 |
Issue | MLP-CountVectorizer | 93 | 93 | 93 |
Decision | Random Forest-TFIDF | 97 | 95 | 96 |
Decision | MLP-TFIDF | 96 | 96 | 96 |
Decision | Voting-CountVectorizer | 81 | 94 | 88 |
Decision | Support Vector-TFIDF | 80 | 92 | 86 |
Decision | MLP-CountVectorizer | 81 | 96 | 88 |
Attacking | MLP-TFIDF | 93 | 90 | 91 |
Attacking | Voting-CountVectorizer | 93 | 91 | 92 |
Attacking | Random Forest-TFIDF | 80 | 93 | 86 |
Attacking | MLP-CountVectorizer | 93 | 91 | 92 |
Attacking | Random Forest-CountVectorizer | 91 | 84 | 87 |
Stratified K-Fold Cross-Validation (Split Size = 10)
Machine Learning Classifiers | Accuracy (%) |
Multinomial NB Classifier | 72 |
Logistic Regression Classifier | 85 |
Linear Support Vector Machine | 84 |
Random Forest Classifier | 90 |
Multi-Layer Perceptron Classifier | 93 |
Voting Classifier | 93 |
In a nutshell, all the classifiers selected for the machine learning experiment perform well and produce high accuracy in the multi-class classification of crowd-user reviews collected from the Amazon software App Store to identify rationale elements. In particular, the MLP, Voting, and RF algorithms perform relatively better and predict higher precision, recall, and F-measure values for the distinct rationale elements (supporting, decision, attacking, neutral, and issue) identified by the proposed approach to improve the performance of low-ranked software applications in the Amazon software app store, as shown in Table 3. Based on these experimental results, we conclude that MLP, Voting, or RF can be selected as the best ML classifier to identify the various rationale elements in crowd-user comments on social media platforms and to improve the performance of low-ranked software applications by surfacing the large volume of relevant information for requirements and software engineers. Furthermore, our proposed approach outperforms previous, similar research approaches [15, 17, 20] regarding classification accuracy, precision, recall, and F-measure; as shown in Table 3, we achieved higher accuracy, precision, recall, and F-measure values than the previous rationale mining approaches.
Furthermore, to analyze the baseline configuration of the machine learning algorithms for classifying crowd-user comments into different rationale elements, we investigated the learning curves of the best-performing classifiers, i.e., MLP and RF.
Using a learning curve, we can visualize how the size of the training set impacts classification accuracy; we also assessed how much training time each training size requires. Figures 7-a and 7-c show the learning curves of the RF and MLP classifiers when classifying crowd-user comments into different rationale elements. The MLP classifier (Fig. 7-c) offers the best configuration and was also selected as the best classifier for classifying user comments into the various rationale elements. Similarly, Figs. 7-b and 7-d show the time required to train the RF and MLP classifiers; here, too, the MLP classifier (Fig. 7-d) offers the best configuration.
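A sketch of how such learning curves and training times could be produced with scikit-learn follows; the training-size grid and classifier settings are assumptions.

```python
import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.neural_network import MLPClassifier

# Accuracy and fit time as a function of the training-set size.
train_sizes, train_scores, val_scores, fit_times, _ = learning_curve(
    MLPClassifier(max_iter=500),
    X_over, y_over,
    cv=10,
    train_sizes=np.linspace(0.1, 1.0, 5),
    return_times=True,   # also records the training time for each size
)
```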