3.1 Study Area
The city of Kigali has a tropical climate. Its population is projected to reach 3.8 million by 2035, up from roughly 1.2 million today. About 65% of the city's residents are youths, and a large proportion of these are students.
The city has introduced car-free days and car-free zones (Kigali car-free day, 2019). In 2019, the City of Kigali launched a bike-sharing scheme to alleviate traffic congestion and greenhouse gas emissions (Guraride, 2019), but no research has yet evaluated this mode.
Bike-share riders use designated bike lanes that are physically separated from the pedestrian walkway and from motorized traffic, which protects riders. Users guard against head injuries by wearing helmets, and the bikes are built with low gears specifically to limit speed. To prevent accidents at night, the bikes are brightly colored and fitted with lights. A deposit is required to use the system, which encourages responsible behavior. Users of the system are primarily young and well educated, so they tend to be well versed in traffic regulations, and the system administrators occasionally host training sessions on the topic. In addition, studies have shown that private riders are more likely to suffer an injury than bike-sharing users, owing to the design of the shared bikes and the accompanying guidance offered to users. Figures I and II show detailed maps of the stations and the corridors connecting them.
As can be seen from Figure III, the maps depict the locations of the stations and corridors. The city is divided into three districts: Kicukiro, Gasabo, and Nyarugenge; bike-sharing is available only in Gasabo and Nyarugenge. Only 9 of the 18 established stations are operational, including 5 stations in the Remera corridor and 4 stations in the Central Business District. Kigali covers an area of around 730 square kilometers and has a population of over 1.2 million.
3.2 Data
Data were gathered from GuraRide from September 12th, 2021, through May 30th, 2022. From a total of 9 active docking stations, we have data on 10,073 bike-share trips. The remaining docking stations were found to be non-functional, so their data were not included in the tally. Eight factors have been used to predict system usage: (1) date, (2) gender, (3) station, (4) corridor, (5) time, (6) fare, (7) age, and (8) education. The date is crucial because it indicates the year and month, whether or not students are on holiday, and how bike-sharing use evolves on a daily basis. Gender identifies whether a user is a man or a woman. The location of a station is crucial, since users are less likely to use a station that is inconveniently situated. System usage also depends on the corridors that connect various types of land use and infrastructure. The amount of time a user spends in the system varies, but prolonged use leads to a decline in productivity. The fare may entice potential users to join the system. Younger individuals are more inclined to ride bikes than elderly people because they have more energy to devote to the activity. Finally, because the system involves logging in and out through applications, education plays a significant role in increasing system utilization.
3.4 Random Forest model
The classification method known as Random Forest is an ensemble approach. It takes a training dataset and uses bootstrapping to generate several decision trees, each trained on a different random subset of all the features. Random Forest has been demonstrated to avoid overfitting and to reduce variance. During the generation of bootstrapped samples (i.e., tree bagging), given a training feature matrix \(Q = [q_1, q_2, \cdots, q_n]^T\), where n is the number of observations and each \(q_i\) is a p-dimensional feature vector, and a response vector \(F = (f_1, f_2, \cdots, f_n)^T\), bagging repeats B times: each round samples the n observations with replacement, subsamples the p features without replacement, and constructs a decision tree on the bootstrapped training data. Typically, \(\sqrt{p}\) features are subsampled to form each of the B trees. Random Forest thus performs both sample bagging and feature bagging, and its final prediction is the majority vote of the classification results of the B trees.
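A minimal sketch of this procedure with scikit-learn's RandomForestClassifier, using synthetic data in place of the trip records (the parameter values shown are illustrative, not the study's configuration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary-classification data standing in for the trip records.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# B = 100 bootstrapped trees; max_features="sqrt" subsamples sqrt(p)
# features at each split (the feature-bagging step described above).
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            bootstrap=True, random_state=0)
rf.fit(X, y)

# The final prediction is the majority vote across the 100 trees.
pred = rf.predict(X[:5])
```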
Given training data \((q_1, f_1), (q_2, f_2), \ldots, (q_n, f_n)\) where \(f_i \in \{0, 1\}\), the class labels are written as \(m_i = 2f_i - 1 \in \{-1, 1\}\), and the initial weight of each data point \(q_i\) is \(M_1(i) = \frac{1}{n}\) for \(i = 1, \ldots, n\). For \(t = 1, \ldots, T\), where T is the total number of iterations, a weak learner \(h_t(q_i) \in \{-1, 1\}\) is trained using the training data and the weights \(M_t\). The classification error is
\(e_t = \sum_{i=1}^{n} M_t(i)\, I\left(h_t(q_i) \ne m_i\right)\), and we select \({\alpha}_t = \frac{1}{2}\ln\left(\frac{1 - e_t}{e_t}\right)\). If a data point was properly classified, its weight decreases in the subsequent iteration; otherwise, its weight increases in the subsequent cycle:
\({M}_{t+1}(i) = \frac{M_t(i)\,\exp\left(-{\alpha}_t\, m_i\, h_t(q_i)\right)}{Z_t}\), where \(Z_t\) is a normalizing constant ensuring \(\sum_{i=1}^{n} M_{t+1}(i) = 1\).
Finally, the prediction is:
\(H(q) = \text{sign}\left(\sum_{t=1}^{T} {\alpha}_t h_t(q)\right)\); the predicted label is 1 if \(H(q)\) is positive and 0 otherwise.
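The weight-update loop above can be sketched in NumPy, with simple threshold "stumps" standing in for the weak learners \(h_t\) (a simplified one-dimensional illustration, not the study's code):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
q = rng.normal(size=n)                                  # 1-D features
m = np.where(q + 0.3 * rng.normal(size=n) > 0, 1, -1)   # noisy labels in {-1, 1}

M = np.full(n, 1.0 / n)               # initial weights M_1(i) = 1/n
alphas, stumps = [], []
for t in range(10):                   # T = 10 iterations
    # Weak learner: the threshold stump with the lowest weighted error.
    thresholds = np.linspace(q.min(), q.max(), 50)
    errs = [np.sum(M * (np.where(q > th, 1, -1) != m)) for th in thresholds]
    th = thresholds[int(np.argmin(errs))]
    h = np.where(q > th, 1, -1)
    e = np.sum(M * (h != m))
    alpha = 0.5 * np.log((1 - e) / max(e, 1e-10))
    # Misclassified points gain weight; correct ones lose weight.
    M = M * np.exp(-alpha * m * h)
    M /= M.sum()                      # Z_t normalization
    alphas.append(alpha)
    stumps.append(th)

# Final prediction: the sign of the weighted vote over all weak learners.
H = np.sign(sum(a * np.where(q > th, 1, -1) for a, th in zip(alphas, stumps)))
accuracy = float(np.mean(H == m))
```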
3.4.1 Train-test split, label encoding and feature selection
The train-test split method is used to evaluate machine learning algorithms. It divides a dataset into two groups: a training set, used to train and fit the model, and a test set, used to test it. The model is first fitted to data whose inputs and outputs are already known; the algorithm then makes predictions for the held-out subset, and these predictions are compared with the known outputs. Scikit-learn is a Python library that provides many methods for clustering, model selection, and classification. Its model-selection module supplies the train-test-split function, which divides the data into the two subsets automatically instead of by hand; by default, it uses random partitioning. This treatment is widely used because it is quick and easy, and it helps obtain accurate results by selecting the right model before applying it to new data.
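A minimal example of this split with scikit-learn, using synthetic data (the 25% test share mirrors the split used later in Section 3.4.6):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the trip dataset.
X, y = make_classification(n_samples=100, random_state=0)

# 75% of rows train the model; 25% are held out for testing.
# Random partitioning is the default behaviour described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)
```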
Machine learning datasets typically have column labels, which consist of words or numbers. Labeling training data with words aids comprehension, but algorithms require numbers, so label encoding translates the labels into a machine-readable form. The label-encoding stage is a component of supervised learning models: the label encoder assigns each class in the dataset a unique number beginning with zero. This creates training-priority concerns, since a model may prefer high-valued labels over low-valued ones even though the numbers carry no inherent order.
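For example, with scikit-learn's LabelEncoder (the corridor names come from the study; the sequence itself is illustrative):

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative categorical column; LabelEncoder assigns integers from 0
# upward in alphabetical order of the class names.
corridors = ["CBD", "REMERA", "CBD", "REMERA", "CBD"]
encoder = LabelEncoder()
encoded = encoder.fit_transform(corridors)
# encoder.classes_ recovers the original class name for each integer.
```

The integer codes carry no inherent order, which is the source of the training-priority concern noted above.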
The technique of feature selection was used to determine the important attributes with the strongest association to the target label. Irrelevant attributes hinder model performance, so features should be selected before the model is built. Feature selection lowers overfitting, increases accuracy, and saves training time. In this research, feature selection uses the chi-square score to identify the pertinent attributes.
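A sketch of chi-square feature selection with scikit-learn's SelectKBest (synthetic non-negative data; k = 4 is an arbitrary illustrative choice):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X = np.abs(X)        # the chi2 score requires non-negative feature values

# Keep the 4 features with the highest chi-square scores against y.
selector = SelectKBest(score_func=chi2, k=4)
X_selected = selector.fit_transform(X, y)
```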
3.4.2 Data preparation and preprocessing
We used data from the GuraRide dataset. Some values in the dataset were missing, so null values were removed; during this stage, the data were also cleaned and wrongly entered values corrected. There were 1,254 non-students, 30 full-time employees, 48 part-time employees, and 68 self-employed individuals, alongside 8,673 students. Education values were recoded to Yes for students and No for all non-students, and the Education column was replaced by the Bike Sharing Student column. After the transformation, the frequencies in Table I changed accordingly: across the 10,073 trips, 8,673 were made by students (Yes cases) and 1,400 by non-students (No cases).
Table I: Python output: data preparation and preprocessing
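The cleaning and recoding steps can be sketched with pandas; the toy frame below mirrors the description above, but the rows and column names are illustrative, not the actual GuraRide export:

```python
import pandas as pd

# Toy frame standing in for the GuraRide data.
df = pd.DataFrame({
    "Education": ["Student", "Full-time", "Student", None, "Self-employed"],
    "Fare": [200, 150, None, 300, 250],
})

df = df.dropna()      # remove rows containing null values

# Recode Education into the binary "Bike Sharing Student" column:
# Yes for students, No for everyone else.
df["Bike Sharing Student"] = df["Education"].map(
    lambda v: "Yes" if v == "Student" else "No")
df = df.drop(columns=["Education"])
```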
Since it was necessary to distinguish 2021 data from 2022 data, the date field was modified to include year and month columns, as shown in Table II.
Table II: Python Output: Data separation (year 2021 and year 2022)
On an annual basis, non-students (No cases) made 136 bike-sharing trips in 2021 while students (Yes cases) made 3,543. In 2022, non-students traveled 1,264 times and students traveled 5,130 times.
3.4.3 Data processing
According to Table III and Figure IV, far fewer non-students than students used bike-sharing. Student behavior was evaluated by station and time for the year 2021.
Table III: Data 2021-2022

| Months | Counts |
|---|---|
| Data of year 2021 | |
| 9 | 555 |
| 10 | 784 |
| 11 | 1286 |
| 12 | 918 |
| Data of year 2022 | |
| 1 | 1491 |
| 2 | 1128 |
| 3 | 678 |
| 4 | 835 |
| 5 | 998 |
Additionally, Table III and Figure IV show how bike-sharing use changed over time. Rwanda has four seasons: a long rainy season, a long dry season, a short rainy season, and a short dry season. The leading month in 2021 was November, the last month of the short rainy season, and the leading month in 2022 was January, the second month of the short dry season. In both months the students were at school, the main reason for the large number of bike-sharing users.
3.4.4 Key performance indicators
Confusion Matrix and Hyperparameter Tuning
For classification problems with multiple output class labels, one common performance indicator is the confusion matrix: a table with one cell for each combination of predicted and actual values. Precision, recall, and f1-score are determined from the confusion matrix. Each technique also has settings that may be tweaked to improve it.
Algorithms are first trained using randomized parameters and their performance is assessed. After the parameters have been fine-tuned to maximize accuracy, they are employed in the final machine learning algorithm to make predictions on the test data.
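This tuning step can be sketched with scikit-learn's GridSearchCV (the candidate parameter values are illustrative, not the grid used in the study):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, random_state=0)

# Refit the model for every parameter combination and keep the one
# with the highest cross-validated accuracy.
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    scoring="accuracy", cv=3)
grid.fit(X, y)
best = grid.best_params_
```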
- Accuracy: the proportion of predictions, across all classes, that match the actual classes. The formula for this is:
$$\frac{TP \left(True Positive\right)+TN \left(True Negative\right)}{TP \left(True Positive\right)+TN \left(True Negative\right)+FN \left(False Negative\right)+FP\left(False Positive\right)}$$
- Precision: the proportion of predicted positive cases that are actually positive. It is determined by: \(\frac{TP \left(True Positive\right)}{TP \left(True Positive\right)+FP\left(False Positive\right)}\)
- Recall: also known as sensitivity, recall is the proportion of actual positive cases that are correctly identified. The formula is: \(\frac{TP\left(TruePositive\right)}{TP\left(TruePositive\right)+FN\left(FalseNegative\right)}\)
- F1-score: the harmonic mean of precision and recall, so it accounts for both false positives and false negatives. The formula for this is: \(2\times\frac{recall \times precision}{recall + precision}\)
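The four metrics above follow directly from the confusion-matrix counts; with illustrative counts (not the study's results):

```python
# Illustrative confusion-matrix counts.
TP, TN, FP, FN = 90, 80, 10, 20

accuracy = (TP + TN) / (TP + TN + FP + FN)   # correct over all predictions
precision = TP / (TP + FP)                   # predicted positives that are right
recall = TP / (TP + FN)                      # actual positives that are found
f1 = 2 * recall * precision / (recall + precision)
```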
3.4.5 Chi-square test
The Chi-square test is used to examine relationships in the data. The test of independence compares the actual pattern of cell counts with the pattern expected if the variables were independent. The Chi-square statistic is compared to a critical value from the Chi-square distribution to determine whether the actual and expected cell counts differ significantly. The table below lists the variables tested at the 95% confidence level.
Table IV: Chi-square test results

| Variables | P-Value | Decision |
|---|---|---|
| 1. Gender versus bike-sharing | 0.00087479 | Reject the null hypothesis: there is evidence of an association between gender and bike-sharing. |
| 2. Station versus bike-sharing | 0.0000000000 | Reject the null hypothesis: there is evidence of an association between station and bike-sharing. |
| 3. Corridor versus bike-sharing | 0.000 | Reject the null hypothesis: there is evidence of an association between corridor and bike-sharing. |
| 4. Time versus bike-sharing | 0.3951566 | Fail to reject the null hypothesis: there is no evidence of an association between time and bike-sharing. |
| 5. Fare versus bike-sharing | -0.08590316 | Reject the null hypothesis: there is evidence of an association between fare and bike-sharing. |
| 6. Age versus bike-sharing | 0.0395898 | Reject the null hypothesis: there is evidence of an association between age and bike-sharing. |
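The independence tests reported in Table IV can be reproduced with SciPy's chi2_contingency; the contingency table below is hypothetical, not the study's data:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 table: gender (rows) versus student status (columns).
observed = np.array([[520, 130],
                     [310, 140]])

chi2_stat, p_value, dof, expected = chi2_contingency(observed)
# Reject the null hypothesis of independence when p_value < 0.05.
associated = p_value < 0.05
```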
3.4.6 Building model
Table V presents a sample of the dataset after applying dummy variables and balancing the classes.
Table V: model construction
This dataset contains 8,400 No cases (non-students) and 8,673 Yes cases (students), for a final dataset of 17,073 observations. 25% of the data were used for model testing and the remaining samples for model training. A Random Forest (RF) model was applied and evaluated, and the results are presented in terms of f1-score, recall, precision, and accuracy.
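The dummy-variable encoding and class balancing can be sketched with pandas and scikit-learn (toy frame with illustrative column names; upsampling with replacement is one common balancing approach, and the study's exact method may differ):

```python
import pandas as pd
from sklearn.utils import resample

# Toy imbalanced frame.
df = pd.DataFrame({
    "Corridor": ["CBD", "REMERA", "CBD", "CBD", "REMERA", "CBD"],
    "Student":  ["Yes", "Yes", "Yes", "Yes", "No", "No"],
})

# One-hot (dummy) encoding of the categorical predictor.
dummies = pd.get_dummies(df, columns=["Corridor"])

# Upsample the minority "No" class to match the majority class size.
majority = dummies[dummies["Student"] == "Yes"]
minority = dummies[dummies["Student"] == "No"]
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=0)
balanced = pd.concat([majority, minority_up])
```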
3.4.7 Discussion of the results
Table VI shows that the random forest model correctly identifies 2,001 of the 2,105 student trips as "Yes", i.e., it classifies 95% of student trips correctly. The same model classifies 1,570 of the 2,164 non-student trips as "No", a classification accuracy of 73% for non-student trips. Overall, the model's accuracy across both categories is 84%.
Table VI: Classification results

| Observations \ Predictions | Student trips (Yes) | Non-student trips (No) | % correct |
|---|---|---|---|
| Student trips (Yes) | 2001 | 104 | 95% |
| Non-student trips (No) | 594 | 1570 | 73% |

Overall accuracy: 84%
For the developed model, Table VII summarizes the performance metrics: the f1-score on student trips is 81.8%, with an overall accuracy across student and non-student trips of 84%.
Table VII: Model performance

| | Precision | Recall | F1-score | Accuracy |
|---|---|---|---|---|
| Student trips (Yes) | 72.60% | 93.80% | 81.80% | 84% |
| Non-student trips (No) | 95.10% | 60.50% | 73.90% | |
Table VII details the outcomes of the machine learning experiment. We used a random forest (RF) whose parameters were tuned, and the method reports the parameter values that produce the best results. The table shows accuracy, precision, recall, and f1-score for each category of person accessing the bike-sharing service (student or other), along with the overall accuracy. The results show that the model's performance is high.
Table VIII illustrates the relationship between independent variables and a dependent variable.
Table VIII: Independent variable coefficients

| Criteria | Values | Coefficients |
|---|---|---|
| GENDER | F | 0.26047474 |
| | M | 0.35092256 |
| STATIONS | ARENA | 0.84465993 |
| | CHUK | 0.10064695 |
| | ENGEN | 0.63766107 |
| | I&M BANK | 0.40591029 |
| | MARRIOT | -1.42763083 |
| | MTN | 0.11908014 |
| | NORSKEN | -1.03437044 |
| | REB | 0.10644164 |
| | SERENA | 0.85899855 |
| CORRIDOR | CBD | 1.99499702 |
| | REMERA | -1.38359972 |
| FARE | | -0.00234912 |
| AGE | | 0.00750487 |
When it comes to likelihood of using the bike-share program, there is little distinction between male and female students. Users of the Arena and Serena stations are more likely to be students than users of any other station, followed by the ENGEN station. Customers at the MARRIOT and NORSKEN stations are less likely to be students. Central Business District corridor users tend to be students, while Remera corridor users tend to be others.