The literature emphasizes that end-users record a large volume of feedback on social media platforms [3] [17], which makes it challenging, time-consuming, and difficult to manually identify and capture rationale information that could improve requirements decision-making and enhance user satisfaction [18]. For this purpose, we collected 77,202 end-user comments for 59 different software applications in the Amazon store and selected 11,416 end-user comments using a stratified random sampling approach. We prepared a test sample to evaluate the performance of different ML algorithms in automatically identifying rationale elements in the crowd-user comments. For this, we selected ML algorithms from the literature based on their strong performance in mining information from textual documents. The algorithms shortlisted for this purpose are Multinomial Naïve Bayes (MNB), Linear Support Vector Machine (SVM), Logistic Regression (LR), KNN, the MLP classifier, Gradient Boosting, the Voting Classifier, Random Forest, and Ensemble Methods. Also, to balance the data set, we used standard resampling approaches, i.e., oversampling and under-sampling; Fig. 2 shows the number of comments identified for each rationale label. Further, we used the standard 10-fold cross-validation approach [34] to train and validate the ML algorithms. Additionally, instead of systematically evaluating all possible configurations reported in the relevant research, one of our main objectives was to find configurations that result in accurate classifiers. This article aims to analyze and appraise the accuracy of different ML techniques. When deploying the distinct machine learning models with different configurations, we achieved relatively high precision, recall, and F1 scores. The details of the machine learning experiment are discussed below:
4.1 Experimental Setup
For this ML experiment, we employed text preprocessing and feature engineering techniques to assess the various classifiers and reveal their performance in automatically identifying rationale elements in end-user reviews. Before text preprocessing, feature engineering, data balancing, and training, we first selected machine learning algorithms based on their good performance on textual data reported in the literature [15, 17, 35]. The selected algorithms are MNB, SVM, LR, KNN, MLP, GB, the Voting classifier, RF, and Ensemble Methods. A voting classifier is a classification approach that makes predictions by combining the results of multiple classifiers and performs well on several classification tasks. The Voting Classifier supports two voting methods: in hard voting, the predicted output class is the one that receives the largest number of votes, whereas in soft voting, the output class is predicted from the averaged probability assigned to each class. In contrast, ensemble techniques are ML algorithms that build a collection of classifiers and then categorize incoming data points based on a (weighted) vote over their predictions; some ensemble methods combine data splits and multiple algorithms to produce results with higher accuracy. Furthermore, each ML algorithm was trained and validated on the end-user rationale feedback to automatically classify crowd-user reviews into five categories of software rationale: attacking-claim, issue, supporting-claim, neutral-claim, and decision. The experiments were conducted in a Python environment, as discussed below:
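As an illustration, the following sketch shows how such a voting ensemble over a subset of the selected classifiers could be configured with scikit-learn; the estimator pool and hyperparameters shown here are assumptions, not the exact configuration used in the experiment.

```python
# Hypothetical voting ensemble over a subset of the selected classifiers.
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier, VotingClassifier

estimators = [
    ("mnb", MultinomialNB()),
    ("lr", LogisticRegression(max_iter=1000)),
    ("rf", RandomForestClassifier()),
    ("mlp", MLPClassifier(max_iter=500)),
]

# Hard voting: the class receiving the most votes wins.
hard_vote = VotingClassifier(estimators=estimators, voting="hard")
# Soft voting: classes are ranked by the averaged predicted probabilities.
soft_vote = VotingClassifier(estimators=estimators, voting="soft")
```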
1) Preprocessing
For an ML experiment, cleaning the input (textual) data is pivotal before developing an ML model or classifier. For this purpose, we performed a series of preprocessing steps: remove HTML tags, if any, from the crowd-user comments; filter out URLs if they exist in the end-user comments; transform the comment text to lowercase; and finally, remove brackets, punctuation, alphanumeric noise, and other special symbols or characters from the textual documents. Also, to improve the performance of the ML algorithms, we reduced the words in the data set to their root forms using a text normalization technique called lemmatization. Through manual analysis of the end-user comments with a content analysis approach, we identified that certain stop words in the user comments commonly indicate particular rationale elements. For example, the stop words “does” and “could” are possible indicators of the issue rationale element, while “not” and “doesn’t” are potential indicators of attacking claims. Similarly, the stop words “will,” “was,” and “have” appear to be possible indicators of the decision rationale element. A similar practice is also reported in the literature [17, 18, 36].
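A minimal sketch of these preprocessing steps is shown below, assuming NLTK for lemmatization and stop-word lists; the exact regular expressions and the decision to drop all stop words except the rationale indicators are illustrative assumptions based on the steps described above.

```python
# Preprocessing sketch (assumes nltk.download("wordnet") and nltk.download("stopwords")).
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Stop words deliberately retained because they can indicate rationale elements
# (e.g., "does"/"could" -> issue, "not"/"doesn't" -> attacking, "will"/"was"/"have" -> decision).
KEPT_STOPWORDS = {"does", "could", "not", "doesn't", "will", "was", "have"}
STOPWORDS = set(stopwords.words("english")) - KEPT_STOPWORDS

def preprocess(comment: str) -> str:
    text = re.sub(r"<[^>]+>", " ", comment)              # remove HTML tags
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # remove URLs
    text = text.lower()                                  # lowercase
    text = re.sub(r"[^a-z'\s]", " ", text)               # drop punctuation, digits, special symbols
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(lemmatizer.lemmatize(t) for t in tokens)
```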
Additionally, we used the LabelEncoder class from Python's "sklearn.preprocessing" module, which translates each text value in a column into a number, making the data ready and usable for the machine learning algorithms. Furthermore, each ML algorithm was experimented with using distinct textual features and parameters to capture and reveal its performance in automatically classifying rationale elements. For instance, we computed the TF-IDF features for each textual document, returning a (documents, features) matrix that can be used as input to an ML algorithm. The objective is to transform the text into meaningful numbers that are used to fit machine learning algorithms for prediction. We also employed the CountVectorizer technique to represent the users' rationale documents, which gives equal weight to each word occurrence in the corpus.
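A sketch of both feature representations and the label encoding follows; `clean_comments` and `rationale_labels` are hypothetical placeholders for the preprocessed corpus and its annotations.

```python
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.preprocessing import LabelEncoder

clean_comments = ["the app crash on startup", "i will keep using it"]  # hypothetical corpus
rationale_labels = ["issue", "decision"]                                # hypothetical annotations

# (documents x features) matrices fed to the ML algorithms.
X_tfidf = TfidfVectorizer().fit_transform(clean_comments)
X_counts = CountVectorizer().fit_transform(clean_comments)

# Encode the rationale labels as integers.
y = LabelEncoder().fit_transform(rationale_labels)
```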
2) Data Imbalance
In supervised ML, imbalanced data sets are considered a critical technical challenge [37]. Data imbalance reflects the lack of an equal distribution of annotation classes within a data set. Our data set is somewhat imbalanced, as shown in Fig. 5: when annotating the end-user comments, most user comments (45%) were categorized as supporting claims, while only 4% were identified as decision rationale elements. Training a machine learning classifier on an imbalanced textual data set would force it to skew towards the majority class samples and disregard the minority classes, i.e., the classes with a limited number of occurrences in the data set. To handle the imbalanced data set, we employed the two balancing approaches that are frequently and widely used in the software literature [37], i.e., oversampling and under-sampling. We used these two data balancing techniques to improve the performance of the ML models and obtain more accurate results on the minority classes compared with the imbalanced data set. Oversampling is a non-heuristic technique that achieves a balanced class distribution by randomly repeating minority class examples [38], whereas under-sampling is a non-heuristic technique that achieves a balanced class distribution by arbitrarily excluding or eliminating majority class samples [39]. Furthermore, to decide which data balancing approach (oversampling or under-sampling) is more suitable for training the ML classifiers in the experiment, we utilized Receiver Operating Characteristic (ROC) [40] and Precision-Recall [41] curves. For this purpose, we plotted the True Positive (TP) rate against the False Positive (FP) rate for each ML classifier used in the experiment. In line with that, Fig. 6 shows the ROC curves of the MLP and RF classifiers under both over- and under-sampling, used to determine the best resampling approach. We selected these two classifiers as examples because of their better performance in classifying crowd-user reviews into rationale elements in the experiment. From the experiment, we found that ML classifiers using oversampling consistently outperform those using under-sampling, which may be due to the loss of critical and valuable textual information when under-sampling discards samples [56].
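The resampling step could be implemented as in the sketch below, assuming the imbalanced-learn package; the library choice and random seed are assumptions, not details given in the text.

```python
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# X_tfidf and y come from the feature-engineering step above.
X_over, y_over = RandomOverSampler(random_state=42).fit_resample(X_tfidf, y)      # repeat minority samples
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X_tfidf, y)   # drop majority samples
```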
3) Assessment & Training
To train and validate the supervised ML algorithms, we applied a stratified 10-fold cross-validation approach to the textual data set from the Amazon store. Nine folds were used to train the ML algorithms and one fold was used to validate them; the training and testing process was repeated ten times by rotating the training and testing folds. The benefit of cross-validation for training and validating an ML classifier is that it shows how well a model works when only limited data is available, since it doubles as a resampling method for assessing a model on a limited amount of data. Stratified k-fold cross-validation is currently the most commonly used approach to train and validate ML classifiers; each fold has roughly the same proportion of labels representing each class. To assess the effectiveness of the classifiers, we computed and summarized the average results obtained over the 10 cross-validation runs. For this purpose, we utilized Precision (P), Recall (R), and F1-score measures to evaluate the supervised ML algorithms and compare their performance. P and R are computed with the formulas below:
$$P_{k} = \frac{TP_{k}}{TP_{k}+FP_{k}}, \qquad R_{k} = \frac{TP_{k}}{TP_{k}+FN_{k}}$$
P_k is the ratio of true positives (correctly classified end-user rationale comments of type k) to all crowd-user comments classified as type k (both correctly and incorrectly classified). Similarly, R_k measures the reliability of the machine learning classifiers in recognizing the relevant information. TP_k represents the number of end-user rationale comments correctly classified as type k, FP_k represents the number of comments wrongly classified as type k, and FN_k represents the number of comments of type k incorrectly classified as another type. The F1-score is the harmonic mean of P_k and R_k.
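A sketch of the stratified 10-fold evaluation is shown below; the MLP classifier and macro-averaged scorers are used as an example, and the hyperparameters are assumptions.

```python
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neural_network import MLPClassifier

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_validate(
    MLPClassifier(max_iter=500),
    X_over, y_over,                 # oversampled features and labels from the previous step
    cv=cv,
    scoring=["precision_macro", "recall_macro", "f1_macro", "accuracy"],
)

# Average the per-fold results, as summarized in Table 3.
for metric, values in scores.items():
    print(metric, round(values.mean(), 3))
```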
4) Results
The optimized results of the distinct machine learning classifiers used to identify rationale elements in end-user feedback are shown in Table 3. As the table shows, the MLP, Voting, and RF classifiers achieve the highest accuracy (93%, 93%, and 90%, respectively) in capturing and classifying rationale elements in end-user feedback from the Amazon software application store, outperforming the other machine learning classifiers. The results obtained from the MLP, Voting, and RF classifiers are quite similar, as shown in Table 3. The MLP-TFIDF, MLP-CountVectorizer, and Voting-CountVectorizer classifiers yield the highest precision, recall, and F-measure values for classifying supporting rationale elements: 98%, 94%, and 96%, respectively. Next, the MLP-CountVectorizer, RF-TFIDF, and Voting-CountVectorizer classifiers give the highest precision, recall, and F-measure values for classifying neutral rationale elements: 75%, 90%, and 81%, respectively. Moreover, MLP-TFIDF gives a higher recall value (91%) for classifying neutral rationale elements but performs poorly on precision and F-measure (66% and 76%, respectively). Also, the MLP-TFIDF, Voting-CountVectorizer, and MLP-CountVectorizer algorithms outperform the other machine learning algorithms when classifying issue rationale elements, giving the highest F-measure value of 93%. Similarly, the MLP-TFIDF and RF-TFIDF classifiers outperform the other classifiers when identifying decision rationale elements, reaching precision, recall, and F-measure values of 97%, 96%, and 96%, respectively. Finally, MLP-TFIDF, Voting-CountVectorizer, and MLP-CountVectorizer achieve the highest precision, recall, and F-measure values for identifying attacking rationale elements: 93%, 91%, and 92%, respectively.
Table 3. The performance of different machine learning algorithms (precision, recall, and F1) in classifying user comments into rationale elements
Labeled Tags | ML Algorithms and Features | Precision (%) | Recall (%) | F-Measure (%) |
Supporting | MLP-TFIDF | 98 | 94 | 96 |
Supporting | Random Forest-TFIDF | 97 | 89 | 93 |
Supporting | MLP-CountVectorizer | 98 | 94 | 96 |
Supporting | Voting-CountVectorizer | 98 | 94 | 96 |
Supporting | Random Forest-CountVectorizer | 96 | 90 | 93 |
Neutral | MLP-CountVectorizer | 74 | 90 | 81 |
Neutral | Random Forest-TFIDF | 75 | 85 | 80 |
Neutral | MLP-TFIDF | 66 | 91 | 76 |
Neutral | Voting-CountVectorizer | 74 | 90 | 81 |
Neutral | Support Vector-CountVectorizer | 52 | 83 | 64 |
Issue | MLP-TFIDF | 92 | 94 | 93 |
Issue | Random Forest-TFIDF | 94 | 86 | 90 |
Issue | Voting-CountVectorizer | 93 | 93 | 93 |
Issue | Random Forest-CountVectorizer | 90 | 90 | 90 |
Issue | MLP-CountVectorizer | 93 | 93 | 93 |
Decision | Random Forest-TFIDF | 97 | 95 | 96 |
Decision | MLP-TFIDF | 96 | 96 | 96 |
Decision | Voting-CountVectorizer | 81 | 94 | 88 |
Decision | Support Vector-TFIDF | 80 | 92 | 86 |
Decision | MLP-CountVectorizer | 81 | 96 | 88 |
Attacking | MLP-TFIDF | 93 | 90 | 91 |
Attacking | Voting-CountVectorizer | 93 | 91 | 92 |
Attacking | Random Forest-TFIDF | 80 | 93 | 86 |
Attacking | MLP-CountVectorizer | 93 | 91 | 92 |
Attacking | Random Forest-CountVectorizer | 91 | 84 | 87 |
Stratified K-Fold Cross-Validation (Split Size = 10)
Machine Learning Classifiers | Accuracy (%) |
Multinomial NB Classifier | 72 |
Logistic Regression Classifier | 85 |
Linear Support Vector Machine | 84 |
Random Forest Classifier | 90 |
Multi-Layer Perceptron Classifier | 93 |
Voting Classifier | 93 |
In a nutshell, all the classifiers selected for the machine learning experiment perform well and produce high accuracy in the multi-class classification of crowd-user reviews collected from the Amazon software App Store to identify rationale elements. In particular, the MLP, Voting, and RF algorithms perform relatively better and predict higher precision, recall, and F-measure values for the distinct rationale elements (supporting, decision, attacking, neutral, and issue) identified by the proposed approach to improve the performance of low-ranked software applications in the Amazon software app store, as shown in Table 3. Based on these experimental results, we conclude that MLP, Voting, or RF can be selected as the best ML classifier to identify the various rationale elements in crowd-user comments on social media platforms and to improve the performance of low-ranked software applications by surfacing the large volume of relevant information for requirements and software engineers. Furthermore, our proposed approach outperforms previous, similar research approaches [15, 17, 20] regarding classification accuracy, precision, recall, and F-measure; as shown in Table 3, we achieved higher accuracy, precision, recall, and F-measure values than the previous rationale mining approaches.
Furthermore, to analyze the baseline configuration of the machine learning algorithms for classifying crowd-user comments into different rationale elements, we investigated the learning curves of the best-performing classifiers, i.e., MLP and RF.
Using a learning curve, we can visualize how the size of the training set impacts classification accuracy; we also assessed how much training time each training size requires. Figures 7-a and 7-c show the learning curves of the RF and MLP classifiers when classifying crowd-user comments into different rationale elements. The MLP classifier (Fig. 7-c) offers the best configuration and was also selected as the best classifier for classifying user comments into the various rationale elements. Similarly, Figs. 7-b and 7-d show the time required to train the RF and MLP classifiers; here, too, the MLP classifier (Fig. 7-d) offers the best configuration.
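A sketch of how such learning curves and training times could be produced with scikit-learn follows; the training-size grid and classifier settings are assumptions.

```python
import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.neural_network import MLPClassifier

# Accuracy and fit time as a function of the training-set size.
train_sizes, train_scores, val_scores, fit_times, _ = learning_curve(
    MLPClassifier(max_iter=500),
    X_over, y_over,
    cv=10,
    train_sizes=np.linspace(0.1, 1.0, 5),
    return_times=True,   # also records the training time for each size
)
```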