Dataset Exploration and Analysis
Based on the previously listed input variables (input features), the researchers built a regression predictive model using a variety of machine learning methods to forecast the court verdict. The dataset covers every significant criminal case that occurred in the Jimma Zone between 2010 and 2014 E.C., together with comprehensive details about the offender, the nature of the offense, and the court's verdict. Prior to developing a model, it is important to explore and analyze the dataset, since the modeling approach should be compatible with the observed distribution.
To explore and analyze the dataset, it is first loaded as a DataFrame using the Pandas read_csv() function, specifying the location of the dataset file. Running the code below loads the dataset and confirms the number of rows and columns: 911 rows and 18 variables (17 input variables: age, sex, job, marital_status, education, trail_estatement, material_used, victim_status, typeofcrime_num, time, attorney_witness, defender_witness, evidence, mitigation_level, aggravation_level, decision_article_level, decision_sublevel; and 1 target variable, decision_in_months). Their data types are also shown. See Annex 3 for details.
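The loading step can be sketched as follows. Since the actual path of the 911-row file is not given, an in-memory stand-in with the 18 listed columns and two illustrative rows (taken from Table 3) is used in its place:

```python
import io
import pandas as pd

# Stand-in for the real CSV file: two rows taken from Table 3.
# In the study, read_csv() points at the full 911-row dataset file.
csv_text = """age,sex,job,marital_status,education,trail_estatement,\
material_used,victim_status,typeofcrime_num,time,attorney_witness,\
defender_witness,evidence,mitigation_level,aggravation_level,\
decision_article_level,decision_sublevel,decision_in_months
21,0,0,0,2,1,0,0,15,1,1,0,0,1,0,4,11,20
32,1,0,0,1,1,0,0,17,1,0,0,0,2,0,4,10,18
"""
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)    # (2, 18) here; the full dataset gives (911, 18)
print(df.dtypes)   # the data type inferred for each of the 18 variables
```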
The following result shows the top 20 instances of each variable, produced with the DataFrame's head() function.
Reviewing the summary of each variable, it appears that all the variables are severely skewed. The following result describes each variable and summarizes its distribution using the DataFrame's describe() function.
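A minimal sketch of both calls, using a small stand-in frame with a few of the 18 variables (values from Table 3) in place of the full dataset:

```python
import pandas as pd

# Stand-in frame; in the study df is the full 911-row dataset.
df = pd.DataFrame({
    "age": [21, 32, 30, 27, 38],
    "time": [15, 17, 26, 6, 27],
    "decision_in_months": [20, 18, 30, 30, 1],
})

print(df.head(20))     # top instances (up to 20) of each variable
print(df.describe())   # count, mean, std, min, quartiles, max per variable
```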
The researchers examined the distribution of all variables by creating a histogram for each attribute. Most of the variables have a multimodal distribution; moreover, most do not follow a Gaussian distribution and are highly skewed to the left or right. The values of the variables are depicted in the following histograms.
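The histogram step can be sketched with the DataFrame's hist() method; the stand-in values below come from Table 3, while the study plots all 18 variables of the full dataset:

```python
import matplotlib
matplotlib.use("Agg")            # file-only backend; no display needed
from matplotlib import pyplot
import pandas as pd

# Stand-in frame (values from Table 3); the study plots all 18 variables.
df = pd.DataFrame({
    "age": [21, 32, 30, 27, 38, 22, 48, 16, 32, 18],
    "decision_in_months": [20, 18, 30, 30, 1, 18, 24, 18, 27, 21],
})

df.hist(bins=10, figsize=(12, 8))   # one histogram per attribute
pyplot.savefig("histograms.png")
```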
A pairplot is useful when the investigator wants to carry out exploratory data analysis: it visualizes the data to reveal the relationships between variables, which can be continuous or categorical. pairplot is part of the Seaborn library, which provides a high-level interface for drawing attractive and informative statistical graphics.
In Fig. 2 above, variation can be observed in each plot. The plots are arranged in matrix form, where each subplot's x-axis is the column variable and its y-axis the row variable. The main-diagonal subplots show the univariate distribution of each attribute. From the pairplot it is readily apparent that most of the variables do not follow a Gaussian distribution.
To explore the correlations, the dataset is first loaded and then the corr() function from the pandas library and the heatmap() function from the seaborn library are applied.
To compute the Spearman correlation instead, it is only necessary to pass method="spearman" to the corr() function in the above code.
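Both steps can be sketched together; the stand-in frame below (values from Table 3) replaces the full dataset:

```python
import matplotlib
matplotlib.use("Agg")            # file-only backend; no display needed
from matplotlib import pyplot
import pandas as pd
import seaborn as sns

# Stand-in frame (values from Table 3); the study uses the full dataset.
df = pd.DataFrame({
    "age": [21, 32, 30, 27, 38, 22, 48, 16, 32, 18],
    "time": [15, 17, 26, 6, 27, 17, 17, 17, 31, 17],
    "decision_in_months": [20, 18, 30, 30, 1, 18, 24, 18, 27, 21],
})

pearson = df.corr(method="pearson")
spearman = df.corr(method="spearman")  # only the method argument changes

sns.heatmap(pearson, annot=True, vmin=-1, vmax=1, cmap="coolwarm")
pyplot.savefig("pearson_heatmap.png")
```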
The output of both correlations is shown in Figs. 3 and 4 above: Fig. 3 shows the Pearson and Fig. 4 the Spearman relationship between each input variable and the target variable. A correlation of exactly 0 would mean the two variables have no relationship (totally independent), while exactly −1 or 1 would mean one wholly determines the other (totally dependent); in practice the values fall somewhere between −1 and 1. A value near 1 indicates a positive linear relationship, i.e. as the predictor attribute increases, the target variable increases at a proportional rate.
Evaluation of Regression Predictive Models
- Linear Regression (Linear): Linear regression is a method for modeling the relationship between one or more independent variables and a dependent variable. It is a staple of statistics and is often considered a good introductory machine learning method.
- Huber Regression (Huber): Huber regression is a type of robust regression that is aware of the possibility of outliers in a dataset and assigns them less weight than other examples in the dataset.
- RANSAC Regression (RANSAC): Random Sample Consensus, or RANSAC for short, is another robust regression algorithm. RANSAC tries to separate data into outliers and inliers and fits the model on the inliers.
- Theil Sen Regression (TheilSen): Theil Sen regression involves fitting multiple regression models on subsets of the training data and combining the coefficients at the end.
- XGBoost Regression (XGB): Extreme Gradient Boosting (XGBoost) is an open-source library that provides an efficient and effective implementation of the gradient boosting algorithm and can be used for regression predictive modeling.
For testing, the k-fold cross-validation procedure is employed. It provides a good general estimate of model performance that is not too optimistically biased, at least compared with a single train-test split. The researchers used k = 5, meaning each fold contains about 911/5, or roughly 182, examples. Repeated means the evaluation process is performed multiple times to help avoid fluke results and better capture the variance of the chosen model; the researchers used three repeats. A single model is therefore fit and evaluated 5 × 3, or 15, times, and the mean and standard deviation of these runs are reported. Because the target here is continuous, this can be achieved with the RepeatedKFold scikit-learn class (the RepeatedStratifiedKFold variant, which keeps the same mixture of examples by class in each fold, applies to classification targets).
For the evaluation, default model hyperparameters are employed. Each model is defined in turn and added to a list so that the models can be evaluated sequentially. The get_models() function below defines the list of models to evaluate, along with a list of short model names for plotting the results later. The list of models is then enumerated, each one is evaluated, and its scores are stored for later comparison. The following Python code applies the proposed machine learning models to the given dataset using the MAE evaluation metric.
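A sketch of the evaluation loop. Because the court dataset itself is not reproduced here, a synthetic 17-feature regression problem from make_regression stands in for it; XGB is included only if the optional xgboost package is installed, and RepeatedKFold is used since scikit-learn's stratified variant applies to classification targets:

```python
from numpy import mean, std
from sklearn.datasets import make_regression
from sklearn.linear_model import (HuberRegressor, LinearRegression,
                                  RANSACRegressor, TheilSenRegressor)
from sklearn.model_selection import RepeatedKFold, cross_val_score

def get_models():
    # Each model with default hyperparameters, plus a short plotting name.
    models = [LinearRegression(), HuberRegressor(),
              RANSACRegressor(), TheilSenRegressor()]
    names = ['Linear', 'Huber', 'RANSAC', 'TheilSen']
    try:
        from xgboost import XGBRegressor   # optional extra package
        models.append(XGBRegressor())
        names.append('XGB')
    except ImportError:
        pass
    return models, names

# Stand-in data mirroring the dataset's 17 input features.
X, y = make_regression(n_samples=200, n_features=17, noise=0.5,
                       random_state=1)

# 5 folds, 3 repeats: each model is fit and scored 15 times.
cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=1)
results, names = [], []
for model, name in zip(*get_models()):
    scores = cross_val_score(model, X, y,
                             scoring='neg_mean_absolute_error', cv=cv)
    results.append(scores)
    names.append(name)
    print('%s: MAE %.3f (%.3f)' % (name, -mean(scores), std(scores)))
```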
The complete results of evaluating the proposed machine learning models on the given dataset using the MAE evaluation metric are shown below. The researchers compared the mean performance of each method and, for more clarity, used a box-and-whisker plot to compare the distribution of scores across the repeated cross-validation folds.
Table 2: Evaluation result of the proposed machine learning models
At the end of the Python run, each sample of scores is plotted as a box-and-whisker plot on the same scale so that the distributions can be compared directly. The figure shows one box-and-whisker plot for each algorithm's sample of evaluation results: the box covers the middle 50 percent of the data, the orange line in the middle of each box marks the median of the sample, and the green triangle marks its mean, as presented in Fig. 5 below.
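The plotting step can be sketched as follows. The score samples here are random stand-ins centred on the MAEs reported in Table 2, since the actual fold scores are not reproduced:

```python
import matplotlib
matplotlib.use("Agg")            # file-only backend; no display needed
from matplotlib import pyplot
import numpy as np

names = ['Linear', 'Huber', 'RANSAC', 'TheilSen', 'XGB']
# Stand-in score samples: 15 values per model (5 folds x 3 repeats),
# centred on the mean MAEs reported in Table 2.
rng = np.random.default_rng(1)
results = [rng.normal(m, 1.0, 15)
           for m in (14.899, 13.195, 16.020, 17.146, 4.080)]

# showmeans=True adds the green triangle marking each sample's mean.
pyplot.boxplot(results, showmeans=True)
pyplot.xticks(range(1, len(names) + 1), names)
pyplot.ylabel('MAE (months)')
pyplot.savefig('model_comparison.png')
```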
As indicated in Fig. 5 above, the XGBoost regression algorithm appears to be the best-performing, with an MAE of about 4.080, its box sitting and extending lower than those of the other algorithms.
Make Predictions on Sample Data
The XGBoost regression (XGB) model was the final model the researchers fitted and used to make predictions on individual rows of the sample dataset. On the provided dataset the model produced an MAE of roughly 4.080, better than all other evaluated techniques. After fitting the model to the dataset, the researchers used the predict() method to make predictions for a sample dataset. To illustrate this procedure, they made predictions with the fitted model for scenarios where the outcome was already known.
The following 10 cases with known outcomes, drawn at random from the dataset using the sample() function, are used to make predictions with the XGBoost regression (XGB) model. Here the column names are replaced with their indices, so the last index (index 17) is the expected outcome.
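A sketch of the fit-and-predict step. The xgboost package's XGBRegressor is the model named in the text; synthetic stand-in data replaces the real dataset, and scikit-learn's GradientBoostingRegressor substitutes if xgboost is not installed:

```python
from sklearn.datasets import make_regression

try:
    from xgboost import XGBRegressor as Booster   # model used in the study
except ImportError:
    from sklearn.ensemble import GradientBoostingRegressor as Booster

# Stand-in data with the dataset's 17 input features.
X, y = make_regression(n_samples=200, n_features=17, noise=0.5,
                       random_state=1)

# Fit the final model on the whole dataset, then predict known rows,
# as done for the 10 randomly drawn cases in Table 3.
model = Booster()
model.fit(X, y)

for row, expected in zip(X[:3], y[:3]):
    predicted = model.predict(row.reshape(1, -1))[0]
    print('Expected: %.1f, Predicted: %.1f' % (expected, predicted))
```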
Table 3 below summarizes the randomly generated samples and their expected (E) and predicted (P) values using the XGB model. The samples in Table 3 are coded in Python in Code 4 below the table.
Table 3: Randomly generated samples and their expected (E) and predicted (P) values using the XGB model

Sample   0  1  2  3  4  5  6  7   8  9 10 11 12 13 14 15 16   E   P
   390  21  0  0  0  2  1  0  0  15  1  1  0  0  1  0  4 11  20  20
   482  32  1  0  0  1  1  0  0  17  1  0  0  0  2  0  4 10  18  18
    68  30  1  2  1  1  0  3  2  26  0  1  0  1  2  0  1 14  30  30
   612  27  0  2  0  1  1  3  1   6  1  1  0  1  1  0  1 13  30  30
   638  38  1  2  1  1  0  0  0  27  1  1  0  1  1  0  1  4   1   3
   783  22  1  3  0  1  1  0  0  17  1  0  0  0  2  0  4 10  18  18
   738  48  1  3  1  0  0  0  0  17  1  1  0  0  1  0  4 11  24  24
   346  16  0  0  0  1  1  0  0  17  1  0  0  0  2  0  4 10  18  18
   354  32  1  2  0  1  1  0  0  31  1  0  0  0  3  0  1 13  27  27
   425  18  1  3  0  1  0  0  0  17  1  1  0  0  2  0  4 11  21  21

Column indices 0 to 16 correspond to the 17 input variables in the order listed earlier (0 = age through 16 = decision_sublevel); Sample is the case's row index in the dataset.
The above table lists the randomly selected samples with their expected (E) and predicted (P) values from the XGB model; running the code to make predictions with the XGBoost regression (XGB) model on each sample case produced the following result.
As the results show, XGBoost regression made correct predictions for all the sample cases except one (Sample 638); in other words, it achieved 9/10 correct predictions.
Discussion of the Findings
The algorithms were Linear Regression (Linear), Huber Regression (Huber), Random Sample Consensus Regression (RANSAC), Theil Sen Regression (TheilSen), and Extreme Gradient Boosting (XGB), evaluated on the dataset using the k-fold cross-validation procedure with k = 5. The results of evaluating the proposed machine learning models on the given dataset using the MAE evaluation metric reveal that the XGBoost regression algorithm is the best-performing, scoring an MAE of about 4.080. TheilSen performed the worst, with an MAE of about 17.146. Linear, Huber, and RANSAC also showed little skill on the given dataset, scoring MAEs of about 14.899, 13.195, and 16.020, respectively. This experimentation therefore shows that the third research question is also addressed explicitly.