Agglomerative hierarchical clustering and mapping the dimensions of the urban slum population proportion
The overall interrelationships among the 106 key indicators (see Supplementary Information), comprising the target variable 'urbanslum' and 105 features, were assessed to identify clusters of variables that are related or behave similarly, resulting in the extraction of 35 features (Table 1).
These results are presented as a heatmap of the correlation matrix created for the urban slum population proportion (Fig.1). Among these indicators, manufacturing (value added as a % of GDP), population growth (annual %), urban population growth (annual %), poverty gap (%), carbon emissions, and mortality rate all show a strong positive correlation with the urban slum population proportion (Table 2). These correlations suggest that as urban population growth and the poverty gap increase, accompanied by more carbon emissions and a higher mortality rate, the urban slum population proportion also increases.
Table 1. Basic information for target variable and feature selection of 35 key variables with four dimensions. The name and code of the data series align with the World Bank database at https://databank.worldbank.org/source/world-development-indicators. The series code name for “urbanslum” is assigned by the author.
Series Code | Series Name
Target variable
urbanslum | Proportion of Urban Population Living in Slum Households by Country or area 2000 - 2020 (Percent)
Economic Development Dimension
NV.IND.MANF.ZS | Manufacturing, value added (% of GDP)
NE.CON.TOTL.KD.ZG | Final consumption expenditure (annual % growth)
NY.GDP.MKTP.KD.ZG | GDP growth (annual %)
BX.KLT.DINV.WD.GD.ZS | Foreign direct investment, net inflows (% of GDP)
FP.CPI.TOTL.ZG | Inflation, consumer prices (annual %)
NV.IND.TOTL.ZS | Industry (including construction), value added (% of GDP)
CM.MKT.LCAP.GD.ZS | Market capitalization of listed domestic companies (% of GDP)
GC.XPN.TOTL.GD.ZS | Expense (% of GDP)
FS.AST.CGOV.GD.ZS | Claims on central government, etc. (% GDP)
NE.GDI.FTOT.ZS | Gross fixed capital formation (% of GDP)
BG.GSR.NFSV.GD.ZS | Trade in services (% of GDP)
DT.ODA.ODAT.PC.ZS | Net ODA received per capita (current US$)
IT.CEL.SETS.P2 | Mobile cellular subscriptions (per 100 people)
BM.TRF.PWKR.CD.DT | Personal remittances, paid (current US$)
NY.GDP.PCAP.PP.KD | GDP per capita, PPP (constant 2017 international $)
NY.GNP.MKTP.PP.CD | GNI, PPP (current international $)
SL.GDP.PCAP.EM.KD | GDP per person employed (constant 2017 PPP $)
NE.RSB.GNFS.ZS | External balance on goods and services (% of GDP)
NE.CON.GOVT.ZS | General government final consumption expenditure (% of GDP)
NY.GNS.ICTR.ZS | Gross savings (% of GDP)
NY.GDP.MINR.RT.ZS | Mineral rents (% of GDP)
Health and Social Dimension
SP.ADO.TFRT | Adolescent fertility rate (births per 1,000 women ages 15-19)
SH.STA.MALN.ZS | Prevalence of underweight, weight for age (% of children under 5)
SH.DYN.MORT | Mortality rate, under-5 (per 1,000 live births)
SH.IMM.MEAS | Immunization, measles (% of children ages 12-23 months)
SH.MMR.RISK | Lifetime risk of maternal death (%)
SP.DYN.LE00.IN | Life expectancy at birth, total (years)
Urban and Environmental Dimension
AG.SRF.TOTL.K2 | Country's Surface area (sq. km)
SH.STA.SMSS.UR.ZS | People using safely managed sanitation services, urban (% of urban population)
EN.ATM.CO2E.KD.GD | CO2 emissions (kg per 2015 US$ of GDP)
Demographic Dimension
SP.POP.GROW | Population growth (annual %)
SP.URB.GROW | Urban population growth (annual %)
SP.URB.TOTL | Urban population
SE.ENR.PRSC.FM.ZS | School enrollment, primary and secondary (gross), gender parity index (GPI)
Fig.1. Clustered heatmap of the overall interrelationships among the variables. Combining a heatmap with a clustering algorithm yields a matrix in which the cell at the intersection of row i and column j represents the correlation between the ith and jth features. The color scale diverges around a center value (0 in this case), with two contrasting colors at the ends. In the clustered heatmap, not only are the individual values represented as colors, but the rows and columns are also ordered (clustered) so that similar rows and columns are near each other. The color of each cell corresponds to the correlation value: a value close to 1 indicates a strong positive correlation, a value close to -1 indicates a strong negative correlation, and a value near 0 indicates no linear relationship between the features.
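To illustrate how the row and column ordering of a clustered correlation heatmap can be obtained, the following is a minimal Python sketch and not the code used in this study. It assumes Pearson correlation, a dissimilarity of 1 - r, and average linkage; none of these choices is stated in the text. The data and the feature names (a1, a2, b1, b2) are synthetic placeholders, and the helper name clustered_correlation is hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import leaves_list, linkage
from scipy.spatial.distance import squareform


def clustered_correlation(X, names):
    """Reorder a correlation matrix so that similar features sit together."""
    corr = np.corrcoef(X, rowvar=False)   # Pearson correlation between columns
    dist = 1.0 - corr                     # assumed dissimilarity: 1 - r
    dist = (dist + dist.T) / 2.0          # enforce exact symmetry
    np.fill_diagonal(dist, 0.0)           # remove floating-point noise on the diagonal
    # Agglomerative (bottom-up) clustering on the condensed distance vector.
    Z = linkage(squareform(dist, checks=False), method="average")
    order = leaves_list(Z)                # leaf order of the resulting dendrogram
    return corr[np.ix_(order, order)], [names[i] for i in order]


# Synthetic stand-in data: two pairs of strongly related features.
rng = np.random.default_rng(0)
base_a = rng.normal(size=200)
base_b = rng.normal(size=200)
X = np.column_stack([
    base_a + 0.1 * rng.normal(size=200),
    base_b + 0.1 * rng.normal(size=200),
    base_a + 0.1 * rng.normal(size=200),
    base_b + 0.1 * rng.normal(size=200),
])
names = ["a1", "b1", "a2", "b2"]

corr_sorted, ordered_names = clustered_correlation(X, names)
print(ordered_names)
```

After reordering, related features occupy adjacent rows and columns, which is what produces the visible blocks along the diagonal of a clustered heatmap. Plotting is omitted here; any diverging colormap centered at 0 can be applied to corr_sorted.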
The indicators negatively correlated with the urban slum population proportion are listed in Table 3. This result shows that as these variables increase, the proportion of the urban slum population tends to decrease. The indicators include socio-economic development factors such as net ODA (official development assistance)[6] received per capita, school enrollment (primary and secondary), mobile cellular subscriptions, personal remittances, life expectancy at birth, and GDP per capita. Additionally, health and well-being indicators such as immunization coverage against measles and the lifetime risk of maternal death are negatively correlated with the urban slum population proportion. Moreover, urban infrastructure and resource indicators, namely the proportion of people using safely managed sanitation services in urban areas, the surface area of countries, external balance on goods and services, general government final consumption expenditure, gross savings, and mineral rents as a percentage of GDP, are also negatively associated with the urban slum population proportion. In summary, these indicators show that improvements in socio-economic development, health and well-being, and urban infrastructure are associated with a lower urban slum population proportion.
Table 2. Positively correlated variables with the urban slum population proportion in the clustered heatmap.
Series code | Series name
NV.IND.MANF.ZS | Manufacturing, value added (% of GDP)
SP.POP.GROW | Population growth (annual %)
SP.URB.GROW | Urban population growth (annual %)
SI.POV.UMIC.GP | Poverty gap at $6.85 a day (2017 PPP) (%)
EN.ATM.CO2E.PP.GD | CO2 emissions (kg per 2017 PPP $ of GDP)
SH.DYN.MORT | Mortality rate, under-5 (per 1,000 live births)
Table 3. Negatively correlated variables with the urban slum population proportion in the clustered heatmap.
Series code | Series name
DT.ODA.ODAT.PC.ZS | Net ODA received per capita (current US$)
SH.IMM.MEAS | Immunization, measles (% of children ages 12-23 months)
SE.ENR.PRSC.FM.ZS | School enrollment, primary and secondary (gross), gender parity index (GPI)
IT.CEL.SETS.P2 | Mobile cellular subscriptions (per 100 people)
BM.TRF.PWKR.CD.DT | Personal remittances, paid (current US$)
SH.MMR.RISK | Lifetime risk of maternal death (%)
SP.DYN.LE00.IN | Life expectancy at birth, total (years)
NY.GDP.PCAP.PP.KD | GDP per capita, PPP (constant 2017 international $)
SL.GDP.PCAP.EM.KD | GDP per person employed (constant 2017 PPP $)
SH.STA.SMSS.UR.ZS | People using safely managed sanitation services, urban (% of urban population)
AG.SRF.TOTL.K2 | Country's Surface area (sq. km)
NE.RSB.GNFS.ZS | External balance on goods and services (% of GDP)
NE.CON.GOVT.ZS | General government final consumption expenditure (% of GDP)
NY.GNS.ICTR.ZS | Gross savings (% of GDP)
NY.GDP.MINR.RT.ZS | Mineral rents (% of GDP)
Feature transformation and reduction
The PCA[7] focuses on capturing the underlying structure of the data, reducing its dimensionality, and mitigating the collinearity problem[8]. The process creates a smaller set of variables that captures the most useful information from the original variables for predicting outcomes. This is achieved by applying a transformation to the original variables, resulting in transformed variables that represent projections onto a new variable space. In this new space, the distinct outcome groups are better separated than in the original variable space. The results (Table 4) show the Explained Variance Ratio and the Most Important Feature for each principal component. The Explained Variance Ratio is the proportion of variance in the original dataset that is explained by each principal component. The principal component with the highest Explained Variance Ratio (0.6779) is primarily explained by the feature "Urban population (% of total population)", which suggests that this feature contributes most to the variance captured by that component. Similarly, "Services, value added (% of GDP)", "General government final consumption expenditure (% of GDP)", and "Domestic credit to private sector (% of GDP)" are identified as the most important features of the second, third, and fourth components, respectively.
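The two quantities reported for each component can be sketched as follows. This is an illustrative Python example under stated assumptions, not the pipeline used in this study: it assumes the features are standardized before PCA and takes the "most important feature" of a component to be the one with the largest absolute loading. The data, the feature names (f1, f2, f3), and the helper name pca_summary are hypothetical.

```python
import numpy as np


def pca_summary(X, names, n_components):
    """Return (component index, explained variance ratio, top feature) rows."""
    # Standardize each feature (an assumption; the scaling used is not stated).
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    # PCA via singular value decomposition of the standardized data.
    _, S, Vt = np.linalg.svd(Xs, full_matrices=False)
    variances = S ** 2                      # proportional to component variances
    ratios = variances / variances.sum()    # explained variance ratio
    rows = []
    for k in range(n_components):
        top = int(np.argmax(np.abs(Vt[k])))  # largest absolute loading
        rows.append((k + 1, float(ratios[k]), names[top]))
    return rows


# Synthetic stand-in data: f1 and f2 share a latent factor, f3 is pure noise.
rng = np.random.default_rng(42)
latent = rng.normal(size=300)
X = np.column_stack([
    latent + 0.3 * rng.normal(size=300),
    latent + 0.3 * rng.normal(size=300),
    rng.normal(size=300),
])
names = ["f1", "f2", "f3"]

rows = pca_summary(X, names, n_components=3)
for pc, ratio, feature in rows:
    print(pc, round(ratio, 4), feature)
```

The ratios are sorted in decreasing order and sum to 1 when all components are kept, so the leading rows show how much of the total variance a reduced representation retains.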
Table 4. Top 10 principal components and explained variance ratios of the selected series, with their most important features. The 'Most Important Feature' is the feature that contributes most to the direction of the principal component in the multidimensional feature space.
PC | Explained Variance Ratio | Most Important Feature | Series name
1 | 0.677898298 | SP.URB.TOTL.IN.ZS | Urban population (% of total population)
2 | 0.147463761 | NV.SRV.TOTL.ZS | Services, value added (% of GDP)
3 | 0.068850840 | NE.CON.GOVT.ZS | General government final consumption expenditure (% of GDP)
4 | 0.031012423 | FS.AST.PRVT.GD.ZS | Domestic credit to private sector (% of GDP)
5 | 0.020778243 | BX.KLT.DINV.CD.WD | Foreign direct investment, net inflows (BoP, current US$)
6 | 0.016872068 | MS.MIL.XPND.ZS | Military expenditure (% of general government expenditure)
7 | 0.014434908 | NV.IND.MANF.KD.ZG | Manufacturing, value added (annual % growth)
8 | 0.008164319 | NY.GDP.COAL.RT.ZS | Coal rents (% of GDP)
9 | 0.004163555 | NY.GDP.FRST.RT.ZS | Forest rents (% of GDP)
10 | 0.003105314 | SE.XPD.TOTL.GD.ZS | Government expenditure on education, total (% of GDP)
Modeling evaluation
The study used an 80/20 split between training and testing datasets, with a random seed of 42, to develop generalized machine learning models for predicting the urban slum population proportion. The models were assessed using Mean Absolute Error (MAE), Mean Squared Error (MSE), and the R-squared score.
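The evaluation protocol described above can be sketched in Python with scikit-learn. This is a minimal illustration, not the code used in this study: the data are synthetic, the use of scikit-learn is an assumption, and all hyperparameters (for example alpha=1.0 and n_estimators=100) are library defaults chosen for the example, since the text does not report them. Only the 80/20 split, the seed of 42, the five model families, and the three metrics come from the text.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the feature matrix and the target variable.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.5, -2.0, 0.5, 0.0, 1.0]) + rng.normal(scale=0.5, size=200)

# 80/20 train/test split with a fixed random seed of 42.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Hyperparameters are illustrative assumptions, not the study's settings.
models = {
    "Ridge Regression": Ridge(alpha=1.0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)          # predictions on the held-out 20%
    results[name] = {
        "MAE": mean_absolute_error(y_test, pred),
        "MSE": mean_squared_error(y_test, pred),
        "R2": r2_score(y_test, pred),
    }

for name, m in results.items():
    print(f"{name:18s} MAE={m['MAE']:.4f} MSE={m['MSE']:.4f} R2={m['R2']:.4f}")
```

Because the synthetic target is linear in the features, the ranking printed by this sketch says nothing about the ranking on the real indicators; it only shows how the three metrics are computed on a held-out test set.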
Ridge Regression performed best among the evaluated models. It achieved an MAE of 0.4545, an MSE of 0.2707, and an R-squared score of 0.9532[9]. These metrics indicate that Ridge Regression produced the most accurate predictions, with the smallest errors. The high R-squared score indicates that the model captured most of the relationship between the input features and the proportion of the urban slum population.
The Gradient Boosting model achieved an MAE of 0.6547, an MSE of 0.8153, and an R-squared score of 0.8591, demonstrating good predictive accuracy and a strong ability to capture the variation in the urban slum population, despite higher prediction errors than Ridge Regression. The Random Forest model yielded a higher MAE of 0.9185, an MSE of 1.3164, and an R-squared score of 0.7726, showing reasonable accuracy but facing challenges in capturing complex features. Linear Regression performed worse, with an MAE of 0.9764, an MSE of 1.6766, and an R-squared score of 0.7103, indicating its limitations in predicting and explaining the variability in the urban slum population proportion. The Decision Tree model, with an MAE of 1.5517, an MSE of 3.0269, and an R-squared score of 0.4770, had the lowest performance among the assessed models. Its higher prediction errors and lower R-squared score suggest a limited ability to explain the variability in the target variable.
Table 5. Performance evaluation of the machine learning methods in this study. The R-squared score is used here as a general metric for evaluating the performance and effectiveness of a model (James et al. 2013). A model recommendation should not be based solely on this score but also on the specific use case and requirements. It is often advisable to try multiple models and perform cross-validation to see how they perform on unseen data before making a final decision. Taking into account interpretability, training time, and complexity, in addition to accuracy, is crucial for a comprehensive evaluation and an informed decision-making process (Linardatos, Papastefanopoulos, & Kotsiantis, 2020).
Model | MAE | MSE | R-squared score
Ridge Regression | 0.45451654 | 0.27067628 | 0.95323646
Gradient Boosting | 0.65472268 | 0.81528271 | 0.85914724
Random Forest | 0.91853373 | 1.3164061 | 0.77257038
Linear Regression | 0.97636901 | 1.67657814 | 0.71034507
Decision Tree | 1.55168906 | 3.02694294 | 0.47704856