We first discuss our experimental setup in section 5.1 and then present the results in sections 5.2 and 5.3. In section 5.2, we measure the error margins of our machine learning predictor, i.e., the difference between the predicted and the actual execution time of a program on a device (CPU or GPU). In section 5.3, we compare the performance of the work stealing scheduling algorithm (as a part of the proposed solution) with other scheduling schemes.
5.1 Experimental Setup
We use two computer systems to perform the experiments. The first system is used to generate the dataset. The generated dataset is then passed to the second system, on which the machine learning predictor is implemented. Once the predictor has decided which device a task is suited to, the tasks are handed over to the scheduler. The scheduler (work stealing, and the other schemes used for comparison in section 5.3) is implemented on the first machine, the same one used earlier to generate the dataset.
The hardware specification of the computer system used to implement the machine learning predictor is provided in Table 5.1. The machine learning model is implemented in Python on the Windows platform, using PyCharm 2019.2.3 Professional Edition and the Jupyter Notebook IDE with Python 3.6. As mentioned earlier, we evaluate three regression models: multiple linear regression, random forest, and gradient boosting.
Table 5.1
Hardware specification of computer system used to implement ML predictor
| Component | Specification |
| --- | --- |
| Operating System | Windows 10 Pro |
| CPU | Intel Core M-7Y30 |
| System type | 64-bit |
| Processor speed | 1.00 GHz and 1.6 GHz |
| Primary Memory | 8.00 GB |
| Secondary Memory | 1 TB |
The hardware specifications of the computer system used to implement the task scheduler are provided in Table 5.2. It is a CPU (Intel)-GPU (NVIDIA) system running the Linux Ubuntu 16.04 operating system.
Table 5.2
Hardware specification of computer system used to implement task scheduler
| Specification | CPU | GPU |
| --- | --- | --- |
| Architecture | Haswell | Kepler |
| Processor model | Intel Core i5-4460 | GeForce GTX 760 |
| Number of cores | 4 | 1152 |
| Number of threads | 4 | - |
| Memory | 8.00 GB | 2 GB |
| Memory bandwidth | 25.6 GB/s | 192.2 GB/s |
| Peak performance | 409.6 GFLOPS | 2257.9 GFLOPS |
| ISP | 32 | 2 |
| Memory speed | 1600 MHz | 6.0 Gbps |
| OpenCL SDK | Intel SDK for OpenCL 2016 | CUDA 8.0 |
| Compiler | GCC 5.4.0 | nvcc |
5.2 Machine Learning Predictor
We execute applications on both CPU and GPU devices and record their execution times. This information is used to create two datasets, i.e., a CPU dataset and a GPU dataset. The CPU dataset contains the values of nine important features, the input size, and the execution time on the CPU device, whereas the GPU dataset contains the same nine features, the input size, and the execution time on the GPU device. These datasets are used for training and testing the machine learning models that predict the execution time of programs on the CPU and GPU devices. Each dataset consists of 1165 tuples (rows) and 11 columns (the nine selected code features listed in Table 4.2, the input size, and the execution time). We train two models, i.e., a CPU model and a GPU model. The CPU model is trained on the CPU dataset to predict execution time on the CPU, whereas the GPU model is trained on the GPU dataset to predict execution time on the GPU. We split each dataset (CPU and GPU) into two parts with a ratio of 80% to 20%. As shown in Table 5.3, 80% of the dataset is provided to the machine learning model for training. The remaining 20% is used for testing and evaluating the trained model.
Table 5.3
Detail of Dataset used for Prediction
| Set | Instances | Percentage |
| --- | --- | --- |
| Training set | 931 | 80% |
| Testing set | 234 | 20% |
| Complete dataset | 1165 | 100% |
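As an illustrative sketch (not the exact training script used in this work), and assuming the CPU dataset is stored in a CSV file whose last column is the measured execution time, the 80/20 split and random forest training could look like the following; the file and column names are assumptions:

```python
# Illustrative sketch only: the file name and column names are assumed,
# not taken from the actual dataset files used in this work.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

cpu_data = pd.read_csv("cpu_dataset.csv")        # 1165 rows, 11 columns (assumed file name)
X = cpu_data.drop(columns=["execution_time"])    # nine code features + input size
y = cpu_data["execution_time"]                   # measured CPU execution time

# 80% training / 20% testing split, as in Table 5.3
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)

cpu_model = RandomForestRegressor(n_estimators=100, random_state=42)
cpu_model.fit(X_train, y_train)
predicted_cpu_time = cpu_model.predict(X_test)
# The GPU model is trained in the same way on the GPU dataset.
```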
Performance evaluation in regression is based on comparing the predicted values with the actual values. The execution time on the CPU device predicted by each of the three machine learning models is compared against the actual execution time, which is measured directly and recorded in the dataset.
Fig. 5.1 (a-f) shows the correlation between the actual and predicted values of the machine learning models. When the actual and predicted curves overlap, the model has predicted the execution time with high accuracy. In Fig. 5.1 (a-c) we compare actual and predicted execution times on the CPU device using the random forest, gradient boosting, and multiple linear regression models, respectively.
Similarly, in Fig. 5.1 (d-f), we compare actual and predicted execution times on the GPU device. In both cases, the actual and predicted curves largely overlap for random forest and gradient boosting, whereas a clear gap between predicted and actual values is visible for multiple linear regression. These results indicate that the predictions of random forest and gradient boosting are better than those of multiple linear regression. However, no clear difference in accuracy is visible between random forest and gradient boosting.
To gain better insight into the accuracy of the machine learning models, we measure their error rates using multiple regression performance metrics: the coefficient of determination (R²), Root Mean Square Error (RMSE), and Mean Absolute Error (MAE).
Mean Absolute Error (MAE) is the average absolute difference between the actual and predicted values over the dataset. The formula of Mean Absolute Error is:
MAE = \(\frac{1}{N}\sum_{i=1}^{N}\left|y_i-\widehat{y}_i\right|\) (2)
\(N\) – total number of tuples
\(y_i\) – actual value
\(\widehat{y}_i\) – predicted value of \(y_i\)
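A minimal sketch of Eq. (2), assuming the actual and predicted execution times are available as NumPy-compatible arrays:

```python
import numpy as np

def mean_absolute_error(y_actual, y_pred):
    """MAE as in Eq. (2): average absolute difference over the N tuples."""
    y_actual, y_pred = np.asarray(y_actual), np.asarray(y_pred)
    return np.abs(y_actual - y_pred).mean()
```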
Figure 5.2 presents the MAE values. The lower the mean absolute error, the more precise the model, and vice versa. In our prediction analysis, random forest has the smallest MAE compared to both gradient boosting and multiple linear regression.
Root Mean Square Error (RMSE) is the square root of the Mean Square Error (MSE):
RMSE = \(\sqrt{MSE}\) = \(\sqrt{\frac{1}{N}\sum_{i=1}^{N}{\left(y_i-\widehat{y}_i\right)}^{2}}\) (3)
It is the standard deviation of the prediction errors about the line of best fit. The smaller the RMSE, the lower the error of the model, and vice versa. Figure 5.3 compares the RMSE values of the machine learning models. Random forest performs best, with a minimum value of 1.63 for the GPU and 8.68 for the CPU.
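Eq. (3) can be computed in the same way; a minimal sketch under the same assumptions:

```python
import numpy as np

def root_mean_square_error(y_actual, y_pred):
    """RMSE as in Eq. (3): square root of the mean squared prediction error."""
    y_actual, y_pred = np.asarray(y_actual), np.asarray(y_pred)
    return np.sqrt(np.mean((y_actual - y_pred) ** 2))
```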
R² is the coefficient of determination. It indicates how well the predicted values fit the actual values. R² normally lies in the range of 0 to 1: the higher the R² value, the better the model, and the smaller the R² value, the worse the model. The formula for R² is:
R² = 1 - \(\frac{\sum_{i=1}^{N}{\left(y_i-\widehat{y}_i\right)}^{2}}{\sum_{i=1}^{N}{\left(y_i-\bar{y}\right)}^{2}}\) (4)
\(\bar{y}\) – mean value of \(y\)
R² ranges from 0 to 1, i.e., from 0% to 100%, but this percentage does not directly express accuracy; it only indicates how closely the predicted values fit the actual values. Fig. 5.4 compares the R² values of the multiple linear regression, gradient boosting, and random forest models. In our case, the random forest and gradient boosting models fit far more accurately than multiple linear regression, with random forest slightly ahead of gradient boosting. The R² of the multiple linear regression model is almost 88% lower than that of the other two models. This poor performance of multiple linear regression is likely because the data is not linear.
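Eq. (4) can be computed directly or with scikit-learn's r2_score; a minimal sketch, assuming the actual and predicted execution times are available as arrays:

```python
import numpy as np
from sklearn.metrics import r2_score

def r_squared(y_actual, y_pred):
    """R² as in Eq. (4): 1 minus residual sum of squares over total sum of squares."""
    y_actual, y_pred = np.asarray(y_actual), np.asarray(y_pred)
    ss_res = np.sum((y_actual - y_pred) ** 2)
    ss_tot = np.sum((y_actual - y_actual.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# r2_score(y_actual, y_pred) from scikit-learn gives the same value.
```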
Our results demonstrate that random forest provides more accurate predictions than gradient boosting and multiple linear regression. Therefore, the execution time predicted by the random forest model is used in the design of the task scheduler. More precisely, our proposed solution combines the random forest machine learning model (for device prediction) with a work stealing task scheduler.
5.3 Performance Evaluation of Work Stealing Based Task Scheduler
As discussed in section 3.2, after the predictor's decision, tasks are placed either in the CPU queue or the GPU queue. It is then the responsibility of the task scheduler to select a program for processing on a device. In this context, we compare the performance of the work stealing algorithm with other scheduling alternatives. It is important to mention that the work stealing algorithm itself is not our contribution, since it is part of the existing literature; rather, we propose it as a part of our solution. We believe that work stealing is well suited to our scenario, where the CPU is likely to become idle after running the host part of a program, and work stealing keeps it busy by selecting jobs from the GPU queue. Below we describe several other scheduling alternatives and compare their performance with the work stealing based task scheduler.
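The behaviour described above can be illustrated with a simplified sketch, assuming the predictor has already labelled each task and placed it in the corresponding queue. This is illustrative pseudocode rather than the actual scheduler (which runs one worker per device concurrently); the run_on_cpu and run_on_gpu helpers are hypothetical placeholders:

```python
from collections import deque

def run_on_cpu(task):
    """Placeholder for launching the task's kernel on the CPU (hypothetical)."""
    print(f"CPU executing {task}")

def run_on_gpu(task):
    """Placeholder for launching the task's kernel on the GPU (hypothetical)."""
    print(f"GPU executing {task}")

def work_stealing_schedule(cpu_queue: deque, gpu_queue: deque):
    """Each device prefers tasks from its own queue; an idle device steals a
    task from the other device's queue instead of waiting."""
    def next_task(own_queue, other_queue):
        if own_queue:
            return own_queue.popleft()   # task predicted as suitable for this device
        if other_queue:
            return other_queue.pop()     # steal from the tail of the other queue
        return None

    # In the real scheduler both devices run concurrently; shown sequentially here.
    while cpu_queue or gpu_queue:
        cpu_task = next_task(cpu_queue, gpu_queue)
        if cpu_task is not None:
            run_on_cpu(cpu_task)
        gpu_task = next_task(gpu_queue, cpu_queue)
        if gpu_task is not None:
            run_on_gpu(gpu_task)

# Example: tasks already labelled and queued by the predictor
work_stealing_schedule(deque(["cpu_job_1", "cpu_job_2"]),
                       deque(["gpu_job_1", "gpu_job_2", "gpu_job_3"]))
```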
CPU Only
In CPU Only, all programs are executed on the CPU device alone, without any evaluation of whether a program is CPU suitable or GPU suitable (the GPU remains idle). It is considered a baseline and is used as a standard for the evaluation of the proposed work-stealing scheduler [13].
GPU Only
In GPU Only, all programs are assigned to the GPU device alone, and in this case the CPU remains idle [13]. GPU Only is also considered a baseline and standard for evaluation of the work-stealing scheduler.
First Come First Serve
In First Come First Serve (FCFS) scheduling, programs are processed in order of arrival, and a free device is used to process each program. For instance, both the CPU and GPU are free at the beginning, so program P1 is assigned to the CPU. The only free device is now the GPU, so program P2 is processed on the GPU, and so on.
Device Suitability
In device suitability scheduling, programs are assigned to their suitable devices, i.e., the device on which a job has the lower execution time [13]. It differs from our proposed solution, which combines device suitability with the work stealing algorithm. With device suitability alone, it is not possible to select a job from the other queue; for instance, a job in the GPU queue cannot be executed on the CPU device even if the CPU is idle at that time.
Alternate Assignment
In alternate assignment, the programs (in a queue) are first shuffled and then assigned alternately to the CPU and GPU devices [13]. For instance, the first program is assigned to the CPU, the second to the GPU, the third to the CPU, the fourth to the GPU, and so on. In this research, the programs are shuffled a varying number of times, and we evaluate several combinations: shuffling once is called Alternate Assignment 1, shuffling twice is called Alternate Assignment 2, and similarly we evaluate Alternate Assignment 3, 4, and 5 (i.e., shuffled five times before assignment to a device).
Input Size
In the input size heuristic, all jobs are sorted in descending order of input size. The half of the jobs at the top of the queue is assigned to the GPU device, while the remaining jobs are assigned to the CPU device. Thus, the job with the largest input size is executed on the GPU device, while the job with the smallest input size is executed on the CPU device [2].
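The baseline heuristics described above can be summarised as small assignment functions. The sketch below is illustrative only and assumes each job is represented by a hypothetical record carrying its predicted CPU time, predicted GPU time, and input size; in particular, the FCFS sketch uses the predicted times merely to simulate when a device becomes free:

```python
import random
from dataclasses import dataclass

@dataclass
class Job:                         # hypothetical job record, for illustration only
    name: str
    cpu_time: float                # predicted execution time on CPU (seconds)
    gpu_time: float                # predicted execution time on GPU (seconds)
    input_size: int

def fcfs(jobs):
    """First Come First Serve: each arriving job goes to the device that is free
    (i.e., becomes available first); predicted times simulate availability here."""
    finish = {"cpu": 0.0, "gpu": 0.0}
    assignment = {"cpu": [], "gpu": []}
    for job in jobs:
        device = min(finish, key=finish.get)
        assignment[device].append(job)
        finish[device] += job.cpu_time if device == "cpu" else job.gpu_time
    return assignment

def alternate_assignment(jobs, shuffles=1):
    """Shuffle the job list the given number of times, then alternate CPU/GPU."""
    jobs = list(jobs)
    for _ in range(shuffles):
        random.shuffle(jobs)
    return {"cpu": jobs[0::2], "gpu": jobs[1::2]}

def device_suitability(jobs):
    """Place each job on the device with the lower predicted execution time."""
    return {"cpu": [j for j in jobs if j.cpu_time <= j.gpu_time],
            "gpu": [j for j in jobs if j.gpu_time < j.cpu_time]}

def input_size_heuristic(jobs):
    """Sort by input size (descending); larger half to GPU, smaller half to CPU."""
    ordered = sorted(jobs, key=lambda j: j.input_size, reverse=True)
    half = len(ordered) // 2
    return {"gpu": ordered[:half], "cpu": ordered[half:]}
```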
The performance of the work stealing based task scheduler is compared against the aforementioned scheduling alternatives. In this research, a total of 120 programs are selected from two benchmark suites, Polybench and AMD. The job pool contains 35 CPU suitable jobs and 85 GPU suitable jobs, as determined by the predictor: CPU suitable means the job has a lower predicted execution time on the CPU, while GPU suitable jobs have a lower predicted execution time on the GPU. Three performance metrics are used for evaluation: execution time, throughput, and device utilization. Fig. 5.5 compares all the aforementioned scheduling heuristics in terms of execution time, i.e., the total execution time taken by either the CPU or the GPU to execute all programs assigned to it under a specific scheduling heuristic.
Fig. 5.5 shows that all assigned tasks are completed in 34.5 seconds by the work stealing based scheduler, the best overall performance among the compared approaches. The CPU bar for the work stealing based scheduler shows that all tasks assigned to the CPU are completed in 32.2 seconds, whereas all tasks assigned to the GPU are completed in 34.5 seconds. This means that after 32.2 seconds the CPU becomes idle, but the GPU continues executing jobs until 34.5 seconds. Thus, all assigned tasks are completed in 34.5 seconds, the best result in comparison to the other approaches. The difference between CPU and GPU execution time is marginal for work stealing, whereas it is quite significant for the other approaches, owing to the load balancing property of the work stealing algorithm.
We also note that the CPU execution time of work stealing is lower than that of all other approaches except device suitability, where it is 19.8 seconds compared to 32.2 seconds for work stealing. However, the load is not balanced under device suitability, which results in a higher execution time on the GPU device: 39.6 seconds for device suitability compared to 34.5 seconds for the work stealing based scheduler. Work stealing is therefore better overall, since it completes all tasks in 34.5 seconds whereas device suitability takes 39.6 seconds. Similarly, the GPU execution time of the proposed solution is the lowest except for alternate assignment 5, which has a lower execution time on the GPU device but a higher execution time on the CPU device than the proposed scheduler. As a result, the overall performance of the proposed solution is better.
Fig. 5.6 compares the throughput of the proposed scheduler with the other heuristics. Throughput is defined as follows:
Throughput = total number of programs / total execution time of the programs (5)
Since the total execution time of the work stealing based scheduler is the minimum, whereas the total number of programs is the same for all alternatives, our proposed solution achieves the maximum throughput.
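As a back-of-the-envelope illustration using the numbers reported above (not a value read from Fig. 5.6): the work stealing based scheduler completes 120 programs in 34.5 seconds, which corresponds to a throughput of roughly \(120 / 34.5 \approx 3.5\) programs per second.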
Fig. 5.7 shows the device utilization achieved by the different schedulers, i.e., the fraction of the total time for which each device is busy rather than idle. For instance, in the case of work stealing, the CPU becomes idle at 32.2 seconds and remains idle until 34.5 seconds, when the GPU completes its assigned tasks. The idle time of the CPU is therefore 2.3 seconds, which is about 6.7% of the total time of 34.5 seconds. In this case GPU utilization is 100%, whereas CPU utilization is about 93.3%, shown in Fig. 5.7 as 0.93. The graph shows that none of the other schedulers utilizes the devices more than the proposed scheduler. When all jobs are executed on the CPU only, GPU utilization is 0% since the GPU remains idle all the time; similarly, for GPU only, the CPU remains idle all the time. With the first come first serve scheduler, 82% device utilization is achieved. The device utilization is below 93% for all other schedulers, which demonstrates that the proposed solution achieves the best resource utilization.
The experimental results in Figs. 5.5, 5.6, and 5.7 show that the proposed scheduling scheme outperforms all other schemes in terms of overall job execution time, utilization of processing devices, and throughput. We believe this better performance stems from two factors. First, we use a machine learning based predictor to find the suitable device for each task (the one with the lower execution time). Second, we balance the load using the work stealing algorithm, which exploits both devices (CPU and GPU) efficiently. The other scheduling heuristics do not combine these two factors in their scheduling decisions: device suitability is considered only by the device suitability scheduling algorithm, which in turn does not incorporate load balancing and is therefore less efficient than the proposed solution.