The steps involved in the evaluation process are discussed briefly:
i. Dataset and attribute selection
The dataset contains 24 instances and 29 attributes, and it also has some missing values. The data file is stored in either ‘CSV’ or ‘ARFF’ format.
The dataset in ‘ARFF’ format is shown in Figure 1 below.
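As a concrete illustration of the ARFF structure, a sketch of such a file is given below. This is a hedged example: only a handful of the 29 attributes are shown, the attribute names are taken from the association rules reported later in this section, and the value sets and data rows are assumed. Note that ‘?’ marks a missing value.

```
@relation 'software-reuse'

@attribute 'Top Management Commitment' {yes, no}
@attribute 'Key Reuse Roles Introduced' {yes, no}
@attribute 'Repository'                 {yes, no}
@attribute 'Type of Software Production' {product-family, isolated}
@attribute 'Success or Failure'          {success, failure}

@data
yes, yes, yes, product-family, success
no,  ?,   yes, isolated,       failure
```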
iii. Filters
The Preprocess section allows filters to be defined that transform the data in various ways. The Filter box is used to set up the required filters. For this study, filters from the Unsupervised category are selected. One filter applied is ‘ReplaceMissingValues’, which replaces all missing values in the dataset and so makes it possible to perform ‘Approximate Association Rule Generation’, used later in the study.
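The effect of the ‘ReplaceMissingValues’ filter can be sketched in a few lines. This is an illustrative re-implementation, not WEKA’s code: for nominal attributes the filter fills each ‘?’ with the column’s most frequent observed value (the mode); the toy data below is assumed.

```python
from collections import Counter

def replace_missing(rows):
    """Mimic WEKA's ReplaceMissingValues for nominal attributes:
    fill '?' in each column with that column's mode."""
    cols = list(zip(*rows))
    modes = []
    for col in cols:
        observed = [v for v in col if v != "?"]
        modes.append(Counter(observed).most_common(1)[0][0])
    return [[m if v == "?" else v for v, m in zip(row, modes)]
            for row in rows]

# assumed toy rows; '?' marks a missing value as in ARFF
data = [
    ["yes", "yes", "product-family"],
    ["yes", "?",   "product-family"],
    ["no",  "no",  "?"],
]
print(replace_missing(data))
```

After filtering, every instance is complete, which is what later allows association rules to be generated over the whole dataset.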
iv. Classification
To predict nominal or numeric quantities, the classifier J48 in WEKA is used for this study. It is also used for prediction purposes.
From the above figure it can be seen that J48 is a good classifier, as it gives an accuracy of 95.83%. The correctly and incorrectly classified instances show the percentage of test instances that were correctly and incorrectly classified. The raw numbers are shown in the confusion matrix, with ‘a’ and ‘b’ representing the class labels.
Some other measures in the classifier output seen in the above figure are TP Rate, FP Rate, Precision, Recall and F-Measure.
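These measures all follow from the confusion matrix. The sketch below uses a hypothetical split of the counts (14/1/0/9) chosen only so that 23 of the 24 instances are correct, matching the 95.83% accuracy reported above; the actual per-class counts are in the figure, not reproduced here.

```python
# Hypothetical 2x2 confusion matrix consistent with 95.83% accuracy
# (23 of 24 instances correct); 'a' and 'b' are the class labels.
tp, fn = 14, 1   # class 'a': 14 correct, 1 misclassified as 'b' (assumed)
fp, tn = 0, 9    # class 'b': all 9 correct (assumed)

accuracy  = (tp + tn) / (tp + fn + fp + tn)
tp_rate   = tp / (tp + fn)          # recall for class 'a'
fp_rate   = fp / (fp + tn)
precision = tp / (tp + fp)
recall    = tp_rate
f_measure = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.2f} "
      f"recall={recall:.4f} F={f_measure:.4f}")
```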
v. Prediction of result
In this step, it is important that the dataset with the cases to predict has the same structure as the dataset used to learn the model. The only difference is that the value of the result attribute is “?” for all instances.
a) Train Dataset in ARFF format: This format is shown in the figure below.
b) Test Dataset in ARFF format: The figure below shows the Test Dataset used for further study.
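Producing such a test file from the training file is mechanical, and the step can be sketched as below. This is an illustrative helper, not part of WEKA; it assumes the class (‘Result’) attribute is the last column of each data row and blanks it with ‘?’.

```python
def make_test_arff(train_arff):
    """Build a test ARFF from a training ARFF by replacing the class
    value of every data row with '?'; the header (and hence the
    structure) is kept identical. Assumes the class is the LAST column."""
    out, in_data = [], False
    for line in train_arff.splitlines():
        if in_data and line.strip() and not line.startswith("%"):
            head, _, _ = line.rpartition(",")
            line = head + ",?"
        out.append(line)
        if line.strip().lower() == "@data":
            in_data = True
    return "\n".join(out)

# assumed minimal training file
train = ("@relation reuse\n"
         "@attribute Repository {yes,no}\n"
         "@attribute 'Success or Failure' {success,failure}\n"
         "@data\n"
         "yes,success\n"
         "no,failure")
print(make_test_arff(train))
```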
vi. Manual Prediction in WEKA
Next, a manual prediction of the success or failure of Software Reuse is made. The steps are discussed briefly below.
- The dataset is loaded in the WEKA Explorer and the Classify tab is selected. Under the Classify tab, the test option ‘Use training set’ is selected and the ‘Success or Failure’ attribute is chosen as the class.
- Classification of the training set data by the J48 classifier is then performed.
- The test option is changed to ‘Supplied test set’.
- Click on ‘Supplied test set’, which makes the ‘Set’ button appear.
- Click the ‘Set’ button and, under Test instances, select ‘Open file’ to load the test set, which is the same dataset as the training set except that in the ‘Result’ attribute the values are removed and replaced with ‘?’ for all instances.
- After loading the test data, perform the classification of the test dataset with the J48 classifier.
- After classification of the test data, select ‘Visualize classifier errors’ in the result list. This shows a graph known as the WEKA classifier visualization. The visualization result is then saved, in ARFF format. This is shown in Figure 9.
vii. Inbuilt Prediction by WEKA
- Load dataset into WEKA Explorer.
- Go to the Classify tab and start classification with the J48 classifier. Under Test options, choose ‘Cross-validation’.
- Then change the test option to ‘Supplied test set’ and load the same dataset as the test file.
- After loading the test file, in the Classify tab under Test options, select ‘More options’ to open the classifier evaluation options.
- Under the classifier evaluation options, select ‘Output predictions’ and choose ‘PlainText’ as the prediction format.
- Then perform the classification of the test set with the J48 classifier. In the classifier output it can now be seen that WEKA makes predictions on the test set. In the result, the ‘predicted’ column contains the predicted value of the ‘Success or Failure’ attribute, i.e. the predicted counterpart of the original result in the training dataset. Hence, WEKA performed the prediction.
Here one more column is generated, named ‘prediction’, which holds a probability value for every instance. In the following picture, a ‘+’ sign against an instance signifies that the WEKA prediction fails to match the actual result.
However, the two methods of prediction in WEKA give the same predicted result, and the difference between the probability predicted for the actual result and the highest probability predicted for the other results is the same for both methods.
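The layout of WEKA’s plain-text predictions can be sketched as follows. This is an imitation for illustration only, with assumed toy labels; as in WEKA’s output, a ‘+’ flags each instance whose predicted class differs from the actual one.

```python
def prediction_table(actual, predicted):
    """Mimic WEKA's plain-text output predictions: a '+' flags
    instances where the predicted class differs from the actual one."""
    lines = ["inst  actual    predicted  error"]
    for i, (a, p) in enumerate(zip(actual, predicted), 1):
        flag = "+" if a != p else " "
        lines.append(f"{i:<5} {a:<9} {p:<10} {flag}")
    return "\n".join(lines)

actual    = ["success", "failure", "success"]   # assumed toy labels
predicted = ["success", "success", "success"]
print(prediction_table(actual, predicted))
```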
viii. Association Rule Mining
Association is a data mining function that discovers the probability of the co-occurrence of items in a collection. The relationships between co-occurring items are expressed as association rules. Association rules are often used to analyze transactions. The Apriori algorithm uses frequent item sets to generate association rules, and it is designed to work on databases that contain transactions. With the help of these association rules, it can be determined how strongly or how weakly two objects are connected.
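The strength measures Apriori reports for each rule can be computed directly from the transactions. The sketch below shows support, confidence and lift for a rule ‘antecedent ⇒ consequent’ over a small assumed set of transactions (attribute=value pairs treated as items).

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence and lift of antecedent => consequent,
    the measures Apriori reports for each association rule."""
    n = len(transactions)
    a  = sum(antecedent <= t for t in transactions)                 # LHS count
    ac = sum((antecedent | consequent) <= t for t in transactions)  # joint count
    c  = sum(consequent <= t for t in transactions)                 # RHS count
    support    = ac / n
    confidence = ac / a
    lift       = confidence / (c / n)
    return support, confidence, lift

# assumed toy transactions using attribute names from this study
ts = [{"Repository=yes", "Human Factors=yes"},
      {"Repository=yes", "Human Factors=yes"},
      {"Repository=yes", "Human Factors=no"},
      {"Repository=no",  "Human Factors=no"}]
print(rule_metrics(ts, {"Human Factors=yes"}, {"Repository=yes"}))
```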
ix. Apriori algorithm in WEKA:
It is quite difficult to detect the associations between such a large number of attributes. Fortunately, this task is automated with the help of the Apriori algorithm. The steps followed using the WEKA tool are discussed briefly.
- In the WEKA Explorer, open the Preprocess tab, click on the Open file button and select the file in ARFF format. The database contains 24 instances and 29 attributes.
- Click on the Associate tab, then click on the Choose button and select the Apriori association.
- The parameters for the Apriori algorithm are set by clicking on its name; a window will pop up as shown below.
For this study, minimum support = 0.1 is set, because a low threshold generates more frequent item sets. Minimum confidence is set to 0.9; this bound can be set high because a higher bound yields fewer, stronger rules.
x. Result of Apriori Algorithm
The result obtained by running the Apriori algorithm in WEKA can be observed in the figures below:
Next, the values of the parameters are listed in Table I.
Table I
Parameters obtained

-N (required number of rules to output) | 10
-C (the minimum confidence of a rule) | 0.9
-D (delta by which the minimum support is decreased in each iteration) | 0.05
-U (upper bound for minimum support) | 1.0
-M (lower bound for minimum support) | 0.1
The following rules, selected from the experiment, are discussed. The best rules found are listed below:
- Key Reuse Roles Introduced=yes 19 ==> Repository=yes 19 <conf:(1)> lift:(1.04) lev:(0.03) [0] conv:(0.79)
- Software and Product=product 17 ==> Repository=yes 17 <conf:(1)> lift:(1.04) lev:(0.03) [0] conv:(0.71)
- Top Management Commitment=yes Key Reuse Roles Introduced=yes 17 ==> Repository=yes 17 <conf:(1)> lift:(1.04) lev:(0.03) [0] conv:(0.71)
- Non-Reuse Processes Modified=yes 16 ==> Type of Software Production=product-family 16 <conf:(1)> lift:(1.2) lev:(0.11) [2] conv:(2.67)
- Non-Reuse Processes Modified=yes 16 ==> Repository=yes 16 <conf:(1)> lift:(1.04) lev:(0.03) [0] conv:(0.67)
- Human Factors=yes 16 ==> Repository=yes 16 <conf:(1)> lift:(1.04) lev:(0.03) [0] conv:(0.67)
- Type of Software Production=product-family Software and Product=product 16 ==> Repository=yes 16 <conf:(1)> lift:(1.04) lev:(0.03) [0] conv:(0.67)
- Non-Reuse Processes Modified=yes Repository=yes 16 ==> Type of Software Production=product-family 16 <conf:(1)> lift:(1.2) lev:(0.11) [2] conv:(2.67)
- Type of Software Production=product-family Non-Reuse Processes Modified=yes 16 ==> Repository=yes 16 <conf:(1)> lift:(1.04) lev:(0.03) [0] conv:(0.67)
- Non-Reuse Processes Modified=yes 16 ==> Type of Software Production=product-family Repository=yes 16 <conf:(1)> lift:(1.26) lev:(0.14) conv:(3.33)
The support for this rule is computed by dividing the count on the right-hand side of the rule, 16, by the total number of instances considered in generating the association rules, 24; this rule therefore has a support of 66%. The count following the antecedent on the left-hand side, also 16, indicates the number of instances covered by the antecedent. The confidence is computed by dividing the count on the right-hand side of the rule by the count on the left-hand side (16/16 = 1).
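The arithmetic above can be checked in a few lines, using the counts taken from the highlighted rule:

```python
# Counts from the rule:
#   Non-Reuse Processes Modified=yes 16 ==> ... Repository=yes 16
n_instances  = 24   # total instances in the dataset
n_antecedent = 16   # instances covered by the left-hand side
n_rule       = 16   # instances covered by the whole rule

support    = n_rule / n_instances    # 16/24, about 66%
confidence = n_rule / n_antecedent   # 16/16 = 1
print(f"support={support:.0%} confidence={confidence:.0%}")
```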
There are rules that reveal information relevant to the success of software reuse, which can be useful to developers in making decisions about activities and in detecting attributes with ‘failure’ problems. Starting from this information, developers can pay more attention to those attributes because they are prone to failure.
xi. Approximate Association Rule Mining
The next goal of this study is to develop an association rule algorithm that accepts partial support from data. By generating these "approximate" rules, data can contribute to the discovery despite the presence of noisy or missing values. The approximate association rule algorithm is built upon the Apriori algorithm and uses two main steps to handle missing and noisy data. First, missing values are replaced with a probability distribution over possible values represented by existing data. Second, all data contributes probabilistically to candidate patterns. Patterns which receive enough full or partial support are kept and expanded. To demonstrate the capabilities, the algorithm is incorporated into the Weka implementation of Apriori.
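The core idea, letting incomplete data contribute probabilistically instead of being discarded, can be sketched as follows. This is an illustrative computation of partial (expected) support for a single item, not the algorithm itself; the probability for a missing value is estimated from the observed rows, as described above, and the toy rows are assumed.

```python
from collections import Counter

def expected_support(rows, attr, value):
    """Partial support: a row with a missing value ('?') contributes
    the probability of `value` estimated from the observed rows,
    instead of contributing 0 or 1."""
    col = [r[attr] for r in rows]
    observed = [v for v in col if v != "?"]
    p = Counter(observed)[value] / len(observed)
    return sum(p if v == "?" else (v == value) for v in col) / len(col)

# assumed toy rows; the fourth instance has a missing value
rows = [{"Repository": "yes"}, {"Repository": "yes"},
        {"Repository": "no"},  {"Repository": "?"}]
print(expected_support(rows, "Repository", "yes"))
```

Patterns whose full plus partial support clears the threshold are kept and expanded, exactly as complete patterns are in plain Apriori.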
xii. Visualization:
- Visualization Chart between two attributes
This visualization is based on two attributes, with ‘Staff Experience’ along the X axis plotted against ‘Success or Failure’ along the Y axis. More visualizations can be produced for other attribute pairs; here only two visualization charts are shown.
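The information such a two-attribute chart conveys is essentially a cross-tabulation of counts, which can be sketched in text form. The attribute values below are assumed toy data, not the study’s actual instances.

```python
from collections import Counter

def crosstab(pairs):
    """Text cross-tabulation of two attributes -- the counts behind a
    WEKA visualization chart of one attribute against another."""
    counts = Counter(pairs)
    xs = sorted({x for x, _ in pairs})
    ys = sorted({y for _, y in pairs})
    lines = ["\t" + "\t".join(ys)]
    for x in xs:
        lines.append(x + "\t" + "\t".join(str(counts[(x, y)]) for y in ys))
    return "\n".join(lines)

# assumed toy values for 'Staff Experience' vs 'Success or Failure'
pairs = [("high", "success"), ("high", "success"),
         ("low", "failure"), ("low", "success")]
print(crosstab(pairs))
```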