Machine Learning Insights into Multifaceted Social Impacts on Global Urban Slum Population

doi:10.21203/rs.3.rs-3137402/v1

The growth of the urban slum population proportion highlights the extent of urban inequality. This article uses machine learning to surpass the limitations of traditional static linear regression models. By handling large data volumes and uncovering multidimensional relationships, the study illuminates the complexities of the urban slum population proportion. The study leverages multivariate features from the World Development Indicators (WDI) and the Urban Indicators Database from UN-Habitat, identifying 105 key features associated with urban inequality. The top 35 determinants are selected using agglomerative hierarchical clustering, with Principal Component Analysis applied for feature extraction, spanning four major dimensions: economic progress, social health and wellness, the urban milieu, and demographic attributes. Results show that Ridge Regression excels in managing complex relationships among various socio-economic, environmental, and health variables in slum populations. The study aims to offer a deeper understanding of the multidimensional social factors influencing global urban slum population expansion and urban inequality dynamics.

Social science/Development studies

Social science/Environmental studies

Social science/Sociology

In the context of urban studies, inequality represents disparities in the distribution of resources and health outcomes among different groups within a city or town, extending to any imbalances in the decision-making process (Portneyc 2003:158-169). Aligned with “UN SDG 10” and “UN SDG 11”[1], the global urban slum population proportion is a measure used to track the challenges faced by marginalized communities in substandard urban living conditions[2]. Inextricably tied to the urban environment, slums share common characteristics such as heightened health risks stemming from environmental pollution, substantial waste accumulation, overcrowding, inadequate housing, and exposure to physical hazards (UN 2021, 2020, 2003). The proximity and density of these settlements often exacerbate these negative factors, and socio-economic disparity intensifies the environmental burdens further, widening the gap in urban inequality. This study aims to delve into these aspects, shedding light on the magnitude and implications of urban inequality as observed through the lens of slum population growth.

Indeed, the legibility, or implicit understanding, of the urban slum issue is often emphasized in academic discourse as a critical factor in explaining why later-developing countries struggle to rapidly enhance the overall quality of urban life. For example, it is proposed that the expansion of urban slums, or low spatial quality, is crucial to development and equality. The commonly referenced example in this context is the concept of city space legibility, where increased spatial planning specialization propels the character of the city and urban life (Lynch 1964, Lefebvre and Nicholson-Smith 1991)[3]. Concurrently, the expansion of urban slums and the consequent surge in their population are commonly viewed as important reflections of the intricate complexities inherent in urban justice (Beatley and Wheeler 2004:18, UN 2020).

Traditionally, analyzing urban slum population proportion and its associations with socio-economic factors has relied on static linear regression models. However, the complexity of urban inequality and its implications extend beyond simple linear relationships.

Machine learning techniques have become powerful tools for unpacking the intricate dynamics of complex systems, including those found in social phenomena (Grimmer et al., 2021). Such complexities frequently surpass straightforward linear relationships, which calls for a more robust analytical method (Bar-Yam, 2004). When paired with dimensionality reduction methods such as PCA for feature extraction, these techniques enhance data mining efficiency by identifying the most consequential variables (Awan et al. 2019). This pairing not only reduces the variable count but also aids in disentangling the dynamics inherent in complex systems (Hall and Smith 1998).

The objective of the research is to surpass the limitations of traditional static linear regression models and achieve a deeper understanding of the societal influences on the urban slum population proportion. By utilizing machine learning methods, this research captures multidimensional relationships and interactions among various social variables, encompassing economic and political aspects, which shape the complexity of the urban slum population. Additionally, it sheds light on the influential multifaceted factors, handles large amounts of data through data mining and transformation, and uncovers multidimensional relationships that may not be evident through traditional statistical methods. The study aims to contribute to the expanding knowledge base on the urban slum population by providing more insights into the underlying dynamics of complex social systems.

By leveraging data resources from the World Development Indicators (WDI) [4] and the Urban Indicators Database from UN-Habitat, this study identifies 105 key features associated with urban inequality[5]. The top 35 determinants influencing the urban slum population proportion are selected using the agglomerative hierarchical clustering algorithm, encompassing variables such as health expenditure, economic contributions from sectors like urbanization and agriculture, manufactured exports, current health expenditure, and access to clean fuels and cooking technologies. Principal Component Analysis is used to further extract features, apply linear transformation, and reduce dimensionality for model preparation. The results reveal that Ridge Regression outperforms other regression models, providing stable and interpretable results. It effectively handles complex relationships among various socio-economic, environmental, and health variables prevalent in the slums population. The balance that Ridge Regression maintains between complexity (fitting numerous parameters) and simplicity (avoiding overfitting by regularizing parameters) proves particularly beneficial for urban slum population studies, making it a valuable tool in this research context.

Agglomerative hierarchical clustering and mapping the dimensions of the urban slum population proportion

The overall interrelationships among the 106 key indicators (See Supplementary Information), including the target variable 'urbanslum' and 105 features, have been assessed to identify clusters of variables that are related or behave similarly, resulting in the extraction of 35 features (Table 1).

These results are presented in a heatmap of the correlation matrix created for the urban slum population proportion (Fig.1). Among these indicators, manufacturing (value added as a % of GDP), population growth (annual %), urban population growth (annual %), poverty gap (%), carbon emissions, and mortality rate all show a strong positive correlation with the urban slum population proportion (Table 2). This suggests that as urban population growth and the poverty gap increase, accompanied by more carbon emissions and a higher mortality rate, the urban slum population proportion also increases.

Table 1. Basic information for target variable and feature selection of 35 key variables with four dimensions. The name and code of the data series align with the World Bank database at https://databank.worldbank.org/source/world-development-indicators. The series code name for “urbanslum” is assigned by the author.

Series Code		Series Name
*Target variable*
urbanslum		Proportion of Urban Population Living in Slum Households by Country or area 2000 - 2020 (Percent)
*Economic* *Development Dimension*
NV.IND.MANF.ZS		Manufacturing, value added (% of GDP)
NE.CON.TOTL.KD.ZG		Final consumption expenditure (annual % growth)
NY.GDP.MKTP.KD.ZG		GDP growth (annual %)
BX.KLT.DINV.WD.GD.ZS		Foreign direct investment, net inflows (% of GDP)
FP.CPI.TOTL.ZG		Inflation, consumer prices (annual %)
NV.IND.TOTL.ZS		Industry (including construction), value added (% of GDP)
CM.MKT.LCAP.GD.ZS		Market capitalization of listed domestic companies (% of GDP)
GC.XPN.TOTL.GD.ZS		Expense (% of GDP)
FS.AST.CGOV.GD.ZS		Claims on central government, etc. (% GDP)
NE.GDI.FTOT.ZS		Gross fixed capital formation (% of GDP)
BG.GSR.NFSV.GD.ZS		Trade in services (% of GDP)
DT.ODA.ODAT.PC.ZS		Net ODA received per capita (current US$)
IT.CEL.SETS.P2		Mobile cellular subscriptions (per 100 people)
BM.TRF.PWKR.CD.DT		Personal remittances, paid (current US$)
NY.GDP.PCAP.PP.KD		GDP per capita, PPP (constant 2017 international $)
NY.GNP.MKTP.PP.CD		GNI, PPP (current international $)
SL.GDP.PCAP.EM.KD		GDP per person employed (constant 2017 PPP $)
NE.RSB.GNFS.ZS		External balance on goods and services (% of GDP)
NE.CON.GOVT.ZS		General government final consumption expenditure (% of GDP)
NY.GNS.ICTR.ZS		Gross savings (% of GDP)
NY.GDP.MINR.RT.ZS		Mineral rents (% of GDP)
*Health and Social* *Dimension*
SP.ADO.TFRT		Adolescent fertility rate (births per 1,000 women ages 15-19)
SH.STA.MALN.ZS		Prevalence of underweight, weight for age (% of children under 5)
SH.DYN.MORT		Mortality rate, under-5 (per 1,000 live births)
SH.IMM.MEAS		Immunization, measles (% of children ages 12-23 months)
SH.MMR.RISK		Lifetime risk of maternal death (%)
SP.DYN.LE00.IN		Life expectancy at birth, total (years)
*Urban and Environmental* *Dimension*
AG.SRF.TOTL.K2		Country's Surface area (sq. km)
SH.STA.SMSS.UR.ZS		People using safely managed sanitation services, urban (% of urban population)
EN.ATM.CO2E.KD.GD		CO2 emissions (kg per 2015 US$ of GDP)
*Demographic* *Dimension*
SP.POP.GROW		Population growth (annual %)
SP.URB.GROW		Urban population growth (annual %)
SP.URB.TOTL		Urban population
SE.ENR.PRSC.FM.ZS		School enrollment, primary and secondary (gross), gender parity index (GPI)

Fig.1. Clustered heatmap of the overall interrelationships among the variables. Combining a heatmap with a clustering algorithm yields a matrix in which the cell at the intersection of row i and column j represents the correlation between the ith and jth features. The color scale diverges around a center value (0 in this case), with two contrasting colors at the ends. In the clustered heatmap, not only are the individual values represented as colors, but the rows and columns are also ordered (clustered) so that similar rows and columns are near each other. The color of each cell corresponds to the correlation value. A value close to 1 indicates a strong positive correlation. A value close to -1 indicates a strong negative correlation. A value near 0 indicates no linear relationship between the features.

The indicators negatively correlated with the urban slum population proportion are listed in Table 3. As these variables increase, the proportion of the urban slum population tends to decrease. The indicators include socio-economic development factors such as net ODA (official development assistance)[6] received per capita, school enrollment (primary and secondary), mobile cellular subscriptions, personal remittances, life expectancy at birth, and GDP per capita. Additionally, health and well-being indicators such as immunization coverage against measles and the lifetime risk of maternal death are negatively correlated with the urban slum population proportion. Moreover, urban infrastructure and resource indicators like the proportion of people using safely managed sanitation services in urban areas, the surface area of countries, external balance on goods and services, general government final consumption expenditure, gross savings, and mineral rents as a percentage of GDP are also negatively associated with the urban slum population proportion. In summary, these indicators demonstrate that improvements in socio-economic development, health and well-being, and urban infrastructure are associated with a lower urban slum population proportion.

Table 2. Positively correlated variables with the urban slum population proportion in the clustered heatmap.

Series code	Series name
NV.IND.MANF.ZS	Manufacturing, value added (% of GDP)
SP.POP.GROW	Population growth (annual %)
SP.URB.GROW	Urban population growth (annual %)
SI.POV.UMIC.GP	Poverty gap at $6.85 a day (2017 PPP) (%)
EN.ATM.CO2E.PP.GD	CO2 emissions (kg per 2017 PPP $ of GDP)
SH.DYN.MORT	Mortality rate, under-5 (per 1,000 live births)

Table 3. Negatively correlated variables with the urban slum population proportion in the clustered heatmap.

Series code	Series name
DT.ODA.ODAT.PC.ZS	Net ODA received per capita (current US$)
SH.IMM.MEAS	Immunization, measles (% of children ages 12-23 months)
SE.ENR.PRSC.FM.ZS	School enrollment, primary and secondary (gross), gender parity index (GPI)
IT.CEL.SETS.P2	Mobile cellular subscriptions (per 100 people)
BM.TRF.PWKR.CD.DT	Personal remittances, paid (current US$)
SH.MMR.RISK	Lifetime risk of maternal death (%)
SP.DYN.LE00.IN	Life expectancy at birth, total (years)
NY.GDP.PCAP.PP.KD	GDP per capita, PPP (constant 2017 international $)
SL.GDP.PCAP.EM.KD	GDP per person employed (constant 2017 PPP $)
SH.STA.SMSS.UR.ZS	People using safely managed sanitation services, urban (% of urban population)
AG.SRF.TOTL.K2	Country’s Surface area (sq. km)
NE.RSB.GNFS.ZS	External balance on goods and services (% of GDP)
NE.CON.GOVT.ZS	General government final consumption expenditure (% of GDP)
NY.GNS.ICTR.ZS	Gross savings (% of GDP)
NY.GDP.MINR.RT.ZS	Mineral rents (% of GDP)

Feature transformation and reduction

The PCA[7] focuses on capturing the underlying structure of the data while reducing its dimensionality and the collinearity problem[8]. The process involves creating a smaller set of variables that captures the most useful information from the original variables for predicting outcomes. This is achieved by applying a transformation to the original variables, resulting in transformed variables that represent projections onto a new variable space. In this new space, the distinct outcome groups are better separated than in the original variable space. The results (Table 4) show the Explained Variance Ratio and Most Important Feature for each principal component obtained through PCA. The Explained Variance Ratio represents the proportion of variance in the original dataset that is explained by each principal component. The principal component with the highest Explained Variance Ratio (0.6779) is primarily explained by the feature "urban population (% of total population)". This suggests that this feature makes the largest contribution to the variance captured by this principal component. Similarly, other important features such as "Services, value added (% of GDP)", "general government final consumption expenditure (% of GDP)", and "domestic credit to private sector (% of GDP)" are identified as the dominant features of the subsequent components.

Table 4. Top 10 principal components and their explained variance ratios, with the most important feature of each. The 'Most Important Feature' is the feature that contributes most to the direction of the principal component in the multidimensional feature space.

PC	Explained Variance Ratio	Most Important Feature	Series name
1	0.677898298	SP.URB.TOTL.IN.ZS	Urban population (% of total population)
2	0.147463761	NV.SRV.TOTL.ZS	Services, value added (% of GDP)
3	0.068850840	NE.CON.GOVT.ZS	General government final consumption expenditure (% of GDP)
4	0.031012423	FS.AST.PRVT.GD.ZS	Domestic credit to private sector (% of GDP)
5	0.020778243	BX.KLT.DINV.CD.WD	Foreign direct investment, net inflows (BoP, current US$)
6	0.016872068	MS.MIL.XPND.ZS	Military expenditure (% of general government expenditure)
7	0.014434908	NV.IND.MANF.KD.ZG	Manufacturing, value added (annual % growth)
8	0.008164319	NY.GDP.COAL.RT.ZS	Coal rents (% of GDP)
9	0.004163555	NY.GDP.FRST.RT.ZS	Forest rents (% of GDP)
10	0.003105314	SE.XPD.TOTL.GD.ZS	Government expenditure on education, total (% of GDP)
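As a worked reading of Table 4, the explained variance ratio of component k can be written as that component's share of the total variance. The symbols below ($\lambda_k$ for the k-th eigenvalue of the covariance matrix of the scaled features, $p$ for the number of input features) are notation introduced here for illustration and do not appear in the original analysis; the numeric sums use the values listed in Table 4.

```latex
\mathrm{EVR}_k = \frac{\lambda_k}{\sum_{j=1}^{p} \lambda_j},
\qquad
\sum_{k=1}^{2} \mathrm{EVR}_k \approx 0.6779 + 0.1475 = 0.8254,
\qquad
\sum_{k=1}^{10} \mathrm{EVR}_k \approx 0.9927
```

Under this reading, the first two components account for roughly 83% of the variance and the ten listed components for roughly 99%.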

Modeling evaluation

The study used an 80/20 split for the training and testing datasets, respectively, with a random seed of 42, to develop generalized machine learning models for predicting the urban slum population proportion. The models were assessed using three metrics: Mean Absolute Error (MAE), Mean Squared Error (MSE), and R-squared score.
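The split-and-score procedure described above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the author's code: the data, the feature count, and `alpha=1.0` (scikit-learn's default) are assumptions, while the 80/20 split, the seed of 42, and the three metrics come from the text.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 200 rows, 10 features, linear signal plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.5, size=200)

# 80/20 split with the fixed seed reported in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# alpha=1.0 is the library default, not a value taken from the study.
model = Ridge(alpha=1.0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# The three metrics named in the text, computed on the held-out 20%.
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MAE={mae:.4f}  MSE={mse:.4f}  R2={r2:.4f}")
```

Lower MAE and MSE and an R-squared closer to 1 indicate a better fit on the held-out data.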

Ridge Regression exhibited the highest performance among the evaluated models. It achieved an MAE of 0.4545, an MSE of 0.2707, and an R-squared score of 0.9532[9]. These metrics highlight the accuracy of Ridge Regression's predictions, with minimal errors observed. The high R-squared score indicates that the model successfully captured the intricate relationships between the input features and the proportion of the urban slum population.

The Gradient Boosting model achieved an MAE of 0.6547, an MSE of 0.8153, and an R-squared score of 0.8591, demonstrating good predictive accuracy and a strong ability to capture the variation in the urban slum population, despite higher prediction errors than Ridge Regression. The Random Forest model yielded a higher MAE of 0.9185, an MSE of 1.3164, and an R-squared score of 0.7726, showing reasonable accuracy but facing challenges in capturing complex features. Linear Regression demonstrated poorer performance, with an MAE of 0.9764, an MSE of 1.6766, and an R-squared score of 0.7103, indicating its limitations in predicting and explaining the variability in the urban slum population proportion. The Decision Tree model, with an MAE of 1.5517, an MSE of 3.0269, and an R-squared score of 0.4770, exhibited the lowest performance among the assessed models. Its higher prediction errors and lower R-squared score suggest a limited ability to explain the variability in the target variable.

Table 5. Performance evaluation of the machine learning methods in this study. The R-squared score is used here as a general measure of the performance and effectiveness of each model (James et al. 2013). The recommendation is not based solely on the score but also on the specific use case and requirements. It is often advisable to try multiple models and perform cross-validation to see how they perform on unseen data before making a final decision. Taking into account interpretability, training time, and complexity, in addition to accuracy, is crucial for a comprehensive evaluation and an informed decision-making process (Linardatos, Papastefanopoulos, & Kotsiantis, 2020).

Model	MAE	MSE	R-squared score
Ridge Regression	0.45451654	0.27067628	0.95323646
Gradient Boosting	0.65472268	0.81528271	0.85914724
Random Forest	0.91853373	1.3164061	0.77257038
Linear Regression	0.97636901	1.67657814	0.71034507
Decision Tree	1.55168906	3.02694294	0.47704856

The study yields three contributions. First, it forms a systematic approach to machine learning usage for the social sciences, combining agglomerative clustering and principal component analysis for feature selection, transformation, and reduction to address the multidimensional social data of slum population growth. These methods quantify and emphasize the most significant features of the social data under investigation. The analysis highlights the substantial impact of socio-economic factors on the global urban slum population proportion. Moreover, pairing PCA with machine learning is beneficial for classifying high-dimensional data (Howley et al. 2005:363). The study underscores the necessity of a comprehensive understanding of the societal origins of the urban slum population proportion and the importance of inter-sectoral collaboration. These socially significant feature categories underscore the interconnected nature of factors such as urban environmental conditions, economic attributes, health and social expenditures, and demographic attributes.

The study uses agglomerative hierarchical clustering and mapping to analyze key indicators associated with the proportion of the urban population living in slums. By leveraging 106 key indicators, the research finds 35 features that strongly correlate with the target variable, 'urbanslum'. The results include variables such as manufacturing (expressed as a percentage of GDP), population growth, urban population growth, poverty gap, carbon emissions, and mortality rate, all of which exhibit a strong positive correlation. These findings suggest that increases in these indicators correspond to an increase in the proportion of the urban population living in slums, indicating a significant social challenge that requires intervention. Through this analysis, the research offers a multidimensional perspective on the dynamics of urban slum growth.

Additionally, the study highlights several indicators that negatively correlate with the urban slum population proportion. These indicators include net ODA received per capita, primary and secondary school enrollment, mobile cellular subscriptions, personal remittances, life expectancy at birth, GDP per capita, and several others, in addition to certain health and urban infrastructure indicators. As these variables improve, the proportion of the urban slum population decreases, underlining the critical role of socio-economic development, health and well-being, and infrastructure in alleviating slum conditions. Consequently, efforts to decrease the proportion of the urban slum population could be more effectively oriented by focusing on the enhancement of these aspects.

Lastly, the study employs Principal Component Analysis (PCA) for feature transformation and reduction to capture the underlying structure of the data and mitigate the issue of collinearity. The PCA results demonstrate that the principal component with the highest explained variance ratio is primarily driven by the feature "urban population (% of total population)" (Table 4). This signifies that the feature contributes significantly to the variance captured by this principal component, emphasizing the influence of urban population growth on the prevalence of slums. From a statistical standpoint, "urban population (% of total population)" was identified as the dominant feature of the principal component that explains the most variance in the data. The result suggests a strong linear relationship between the urban population and the urban slum population. That is, as the urban population increases, so does the slum population. From a socio-economic perspective, rapid urbanization can result in a lack of affordable housing in cities. When urban infrastructure development cannot keep up with population growth, it often leads to the development of informal settlements or slums (Malik and Wahid 2014). Hence, a larger urban population could lead to a larger slum population if there is insufficient housing and infrastructure to accommodate the influx of residents (Nassar and Elsayed 2018)[10].

As previously described, the analysis identified the predominant themes and grouped them into four categories, each representing a distinct facet of the social data (Table 1). This approach highlights the pressing global issue of the urban slum population proportion and its detrimental effects on human health and the environment. Traditional linear regression models have limitations in capturing the complexity of urban inequality in a social context. To overcome these limitations, machine learning techniques were employed to analyze the dataset and uncover the determinants of the urban slum population proportion. Which model or models to recommend, given the performance evaluation above, depends on several aspects, including the particular objectives, requirements, and limitations of the task in question. Table 6 lists several points to weigh when choosing a machine learning model in this study.

In summary, the research enriches the understanding of the intricate interactions between socio-economic variables and the growth of the urban slum population proportion. It emphasizes the necessity for a comprehensive understanding of the social origins of the urban slum population proportion and the importance of holistic approaches that confront the multifaceted challenge of urban slums. By utilizing machine learning techniques and considering multiple dimensions, policymakers are better equipped to make informed decisions and implement interventions to address urban slum population growth on a global scale.

Table 6. The general comparison of five machine learning models used in this study.

Option description		Mechanics	Pros and cons
Ridge Regression		Minimizes the sum of squared residuals with an added penalty proportional to the square of the magnitude of the coefficients.	Pros: Handles multicollinearity well; Simpler and computationally efficient. Cons: Might not perform well with non-linear data; Model complexity can be increased due to the introduction of regularization.
Decision Tree Regression		Splits the dataset into subsets based on feature values, and this process is recursively repeated until the tree reaches a predefined depth or purity.	Pros: High interpretability; Can capture non-linear relationships. Cons: Prone to overfitting, especially with complex datasets; Can create overly complex trees.
Random Forest Regression		Creates a set of decision trees from randomly selected subsets of the training set and averages their predictions.	Pros: High predictive accuracy; Less prone to overfitting compared to a single decision tree. Cons: Computationally more intensive than a single decision tree; Less interpretable than a single decision tree but more interpretable than Support Vector Regression.
Gradient Boosting		Iteratively builds new models that focus on reducing the errors made by the previous models.	Pros: Effective for capturing complex relationships and patterns; Provides insights into feature importance. Cons: Computationally intensive for large datasets; Prone to overfitting; Requires careful hyperparameter tuning; Challenging to interpret due to its ensemble nature.
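The Ridge Regression mechanics summarized in Table 6 correspond to the standard penalized least-squares objective, written out here for reference. The notation ($y_i$ for the target, $x_i$ for the feature vector of observation $i$, $\beta_0$ and $\beta$ for the intercept and coefficients, $\alpha$ for the penalty strength) is introduced here for illustration; the study does not report the penalty value it used.

```latex
\hat{\beta}^{\mathrm{ridge}}
  = \arg\min_{\beta_0,\,\beta}\;
    \sum_{i=1}^{n} \bigl( y_i - \beta_0 - x_i^{\top}\beta \bigr)^{2}
    + \alpha \sum_{j=1}^{p} \beta_j^{2},
\qquad \alpha \ge 0
```

Setting $\alpha = 0$ recovers ordinary least squares, while larger values of $\alpha$ shrink the coefficients toward zero, which is what stabilizes the estimates when features are collinear.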

The research method design (Fig.2) outlines a systematic three-step approach to analyzing the multidimensional social causes contributing to the urban slum population proportion. First, a comprehensive dataset comprising 105 features is processed through a hierarchical clustering algorithm known as agglomerative clustering. This process produces a clustered heatmap, allowing for an extensive analysis of the numerous features and their classification into distinct groups, thereby capturing different dimensions of the social data. Second, PCA is applied to the processed data, transforming the 35 selected features into principal components while preserving most of the original data variance. These components then serve as the basis for the development and training of different machine learning models, effectively quantifying and highlighting the most critical features influencing the urban slum population proportion. Lastly, the models are evaluated and compared based on predefined metrics like Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and R-squared score, and robustness is ensured through 10-fold cross-validation. This comparative study aids in determining the most effective model to predict the urban slum population proportion.

Data preparation: the target variable (urban slum population proportion) and filling its gaps with multiple imputation (MI)

As indicated at the outset, the target variable "Proportion of Urban Population Living in Slum Households" is a key metric used by the United Nations to assess urban development and inequality. It represents the fraction of a city's population living in slums, areas marked by substandard living conditions and a lack of basic amenities. Data on this metric, which embodies four out of the five UN-Habitat defined household deprivations, has been collected biannually since 2000. To address data gaps and form a complete yearly series of “Proportion of Urban Population Living in Slum Households” (Fig.3), an MI strategy is used, yielding imputed values for the missing years. The code, starting with library imports, identifies available and missing years and incorporates the proportion of the urban slum population. Missing data is handled by combining 'years' and 'proportion' into a 'data' array and applying MI via IterativeImputer, thus systematically filling the gaps[11].
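The gap-filling step described above might look roughly like the sketch below. The series values are invented for illustration, and the imputer settings are scikit-learn defaults rather than the study's configuration; only the `years`, `proportion`, and `data` names and the use of `IterativeImputer` follow the text. Note that with default settings `IterativeImputer` returns a single imputation; drawing several imputations would require `sample_posterior=True` with different seeds.

```python
import numpy as np
# IterativeImputer is still experimental in scikit-learn and must be enabled explicitly.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical series for one country: observed every second year, gaps stored as NaN.
years = np.arange(2000, 2011)
proportion = np.array(
    [60.0, np.nan, 58.0, np.nan, 56.0, np.nan, 54.0, np.nan, 52.0, np.nan, 50.0]
)

# Combine both columns so the imputer can model proportion as a function of year.
data = np.column_stack([years, proportion])

# Default estimator (BayesianRidge); random_state fixed for reproducibility.
imputer = IterativeImputer(random_state=42)
filled = imputer.fit_transform(data)

for year, value in filled:
    print(int(year), round(float(value), 2))
```

Observed values pass through unchanged; only the NaN entries are replaced by model-based estimates.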

Data splitting

This study uses a data division strategy before implementing PCA to maintain data integrity and the independence of testing data during training. It allocates 80% of data to training and reserves 20% for testing, using a fixed random seed of 42 for consistency and model reliability (Bisong, 2019).
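One way to keep the test data independent, as described above, is to fit the scaling and PCA steps on the training rows only and then reuse the fitted transform on the test rows. The sketch below assumes placeholder data and an arbitrary `n_components=3`; neither is taken from the study.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder features and target (assumption), standing in for the real indicators.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
y = rng.normal(size=100)

# Split first: 80% training, 20% testing, seed 42 as stated in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit scaling and PCA on the training rows only ...
reducer = make_pipeline(StandardScaler(), PCA(n_components=3))
Z_train = reducer.fit_transform(X_train)

# ... then apply the already-fitted transform to the test rows.
Z_test = reducer.transform(X_test)

print(Z_train.shape, Z_test.shape)
```

Because the test rows never influence the fitted means, variances, or component directions, the held-out evaluation is not contaminated by information from the test set.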

Ensuring similarity in the distribution of the training and testing sets is crucial for the model's effective learning of relevant patterns during the training stage and its accurate application to the test data (Bisong, 2019). The analysis investigates the distribution characteristics of the two datasets. Histograms and fitted curves display the distribution shape, while the probability plot assesses how closely the data aligns with a normal distribution. This examination confirms there are no significant differences in feature distribution between the training and testing sets (Fig.4). Furthermore, the Kolmogorov-Smirnov (KS) test compares the distributions of the training and testing datasets, judging whether they differ significantly at a significance level of 0.05 (Fig.4). The findings indicate that the test data accurately mirrors the training data's distribution, with no significant skewness or abnormalities observed. The results suggest that additional data cleaning or transformations are not required, enabling further training, validation, and evaluation processes.
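A two-sample KS comparison of this kind could be sketched as follows, using `scipy.stats.ks_2samp` feature by feature. The arrays are synthetic stand-ins and the per-feature loop is an assumption about how the comparison was organized; only the 0.05 threshold comes from the text.

```python
import numpy as np
from scipy.stats import ks_2samp

# Synthetic stand-ins (assumption) for the training and testing feature matrices.
rng = np.random.default_rng(42)
train_set = rng.normal(size=(800, 3))
test_set = rng.normal(size=(200, 3))

alpha = 0.05  # significance level used in the text
results = []
for j in range(train_set.shape[1]):
    # Null hypothesis: both samples are drawn from the same distribution.
    statistic, p_value = ks_2samp(train_set[:, j], test_set[:, j])
    results.append((j, statistic, p_value, bool(p_value < alpha)))

for j, statistic, p_value, differs in results:
    verdict = "significantly different" if differs else "no significant difference"
    print(f"feature {j}: KS={statistic:.3f}, p={p_value:.3f} -> {verdict}")
```

A p-value at or above the threshold means the test finds no evidence that the two samples come from different distributions.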

Hierarchical clustering heatmap: relationships between each feature and the target variable, the urban slum population proportion

The Seaborn ‘clustermap’ function employs an agglomerative hierarchical clustering algorithm, which iteratively merges the closest clusters, updating the distance matrix after each merge until only one cluster remains. The resulting dendrogram, which can be reordered for optimal leaf placement, visualizes data relationships, aiding in discerning hidden patterns and defining a sensible number of clusters, although it does not directly perform dimensionality reduction like PCA. The data analysis process involves the construction of correlation matrices to visualize the relationships between pairs of variables, including each feature's relationship to the target variable, the urban slum population proportion. The correlation coefficient ranges from -1 to 1, with negative values indicating negative correlations, positive values indicating positive correlations, and 0 indicating no linear relationship.
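The clustering-and-reordering idea behind the clustered heatmap can be sketched with SciPy directly, which avoids any plotting dependency. The data frame and its `feat_*` column names are hypothetical, and the average-linkage and Euclidean settings are assumed here because they are Seaborn's `clustermap` defaults; the study does not state which settings it used.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import leaves_list, linkage

# Hypothetical data: two latent drivers generate correlated groups of columns.
rng = np.random.default_rng(0)
base = rng.normal(size=(300, 2))
df = pd.DataFrame({
    "urbanslum": base[:, 0] + 0.1 * rng.normal(size=300),
    "feat_a": base[:, 0] + 0.3 * rng.normal(size=300),
    "feat_b": -base[:, 0] + 0.3 * rng.normal(size=300),
    "feat_c": base[:, 1],
    "feat_d": base[:, 1] + 0.3 * rng.normal(size=300),
})

corr = df.corr()  # Pearson correlation matrix, values in [-1, 1]

# Agglomerative clustering on the rows of the correlation matrix.
Z = linkage(corr.values, method="average", metric="euclidean")

# Leaf order of the dendrogram, used to reorder rows and columns together.
order = leaves_list(Z)
clustered = corr.iloc[order, order]
print(clustered.round(2))

# Correlation of each feature with the target, sorted from negative to positive.
print(corr["urbanslum"].drop("urbanslum").sort_values())

# With seaborn installed, seaborn.clustermap(corr, cmap="vlag", center=0)
# would draw a comparable clustered heatmap with a diverging color scale.
```

Reordering rows and columns by the dendrogram leaves places similarly behaving variables next to each other, which is what makes the blocks in the heatmap visible.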

PCA: dimension projection and reduction

The dimension projection and reduction step converts all columns to a numeric format, normalizes them, and performs PCA. Each component's explained variance is displayed, representing the portion of the original dataset's variance it captures (Fig. 5). The PCA components, stored in the ‘components_’ attribute of the fitted PCA object, are lower-dimensional linear representations of the features. While they represent a mixture of all input features and lack direct interpretability, the most influential features within each component can be identified by examining the largest absolute values in each component's vector. See Table 4 for details.
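Reading the most influential feature off each component, as described above, can be sketched as follows. The feature names, the synthetic data, and `n_components=3` are assumptions for illustration; `components_` and `explained_variance_ratio_` are the attributes of a fitted scikit-learn `PCA` object.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical feature table; feature_1 is made collinear with feature_0.
rng = np.random.default_rng(0)
names = [f"feature_{i}" for i in range(6)]
X = pd.DataFrame(rng.normal(size=(200, 6)), columns=names)
X["feature_1"] = X["feature_0"] + 0.1 * rng.normal(size=200)

# Normalize, then fit PCA.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=3).fit(X_scaled)

# components_ has shape (n_components, n_features); the largest absolute
# loading in each row identifies that component's most influential feature.
top_idx = np.abs(pca.components_).argmax(axis=1)

summary = pd.DataFrame({
    "PC": np.arange(1, pca.n_components_ + 1),
    "explained_variance_ratio": pca.explained_variance_ratio_,
    "most_important_feature": [names[i] for i in top_idx],
})
print(summary)
```

The resulting table has the same layout as Table 4: one row per component, its explained variance ratio, and the feature with the largest absolute loading.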

Machine learning modeling and evaluation

The modeling stage uses cross-validation to train and assess multiple regression models. It begins by loading the dataset, identifying 'urbanslum' as the target variable, and scaling all features for uniformity. Six regression models are defined, including Linear, Ridge, Decision Tree, Random Forest, and Gradient Boosting. During model training and evaluation, 5-fold cross-validation is applied. This technique assesses model performance using Mean Absolute Error (MAE), Mean Squared Error (MSE), and R-squared metrics, and it helps detect overfitting by checking model performance on held-out data. It also promotes efficient data use and aids model selection by allowing fair performance comparison across models. The average performance of each model on these metrics is stored in a dataframe and displayed.
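A compact sketch of this evaluation loop is given below. It is illustrative only: the data come from `make_regression` rather than the study's dataset, only the five model families named in the text are included, and all hyperparameters are assumptions. One deliberate design choice differs from scaling the whole dataset up front: the scaler sits inside a pipeline so that it is refit on each training fold, which keeps held-out folds from influencing the scaling.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic placeholder data (NOT the study's data); y stands in for 'urbanslum'.
X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

# Model families named in the text; hyperparameters are assumed defaults.
models = {
    "Linear": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Decision Tree": DecisionTreeRegressor(random_state=0),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
}

# scikit-learn reports errors as negated scores, so they are flipped back below.
scoring = {"MAE": "neg_mean_absolute_error", "MSE": "neg_mean_squared_error", "R2": "r2"}

rows = []
for name, model in models.items():
    pipe = make_pipeline(StandardScaler(), model)
    cv = cross_validate(pipe, X, y, cv=5, scoring=scoring)
    rows.append({
        "Model": name,
        "MAE": -cv["test_MAE"].mean(),
        "MSE": -cv["test_MSE"].mean(),
        "R2": cv["test_R2"].mean(),
    })

results = pd.DataFrame(rows).set_index("Model")
print(results.round(4))
```

Each row of `results` holds the fold-averaged metrics for one model, which allows a like-for-like comparison.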

DATA AVAILABILITY

The data that support the findings of this study are available at https://github.com/shanshanfy/UrbanSlumPopProp.git

CODE AVAILABILITY

The code for analysis in the current study can be accessed through https://github.com/shanshanfy/UrbanSlumPopProp.git

DEVELOPER ENVIRONMENT AVAILABILITY

The developer environment file ‘UrbanSlumPopPropPackages.yaml’ for the Conda open-source package management system is provided at https://github.com/shanshanfy/UrbanSlumPopProp.git. Conda allows isolated environments to manage packages without interference. The file contains the configuration of the project’s Python environment, including channels, dependencies, and library versions.

AUTHOR CONTRIBUTIONS

As the sole author of this paper, S.S. was responsible for all aspects of the research, including but not limited to its conception, design, execution, data analysis, manuscript drafting, revision, and finalization.

COMPETING INTERESTS

The author declares no competing financial or non-financial interests related to this work.

Awan, S. E., Bennamoun, M., Sohel, F., Sanfilippo, F. M., Chow, B. J., & Dwivedi, G. (2019). Feature selection and transformation by machine learning reduce variable numbers and improve prediction for heart failure readmission or death. PloS one, 14(6), e0218760.
Bar‐Yam, Y. (2004). Multiscale variety in complex systems. Complexity, 9(4), 37-45.
Beatley, T., & Wheeler, S. M. (Eds.). (2004). The sustainable urban development reader. London, UK: Routledge.
Bisong, E. (2019). Introduction to Scikit-learn. Building Machine Learning and Deep Learning Models on Google Cloud Platform: A Comprehensive Guide for Beginners, 215-229.
Bisong, E. (2019). More supervised machine learning techniques with scikit-learn. Building Machine Learning and Deep Learning Models on Google Cloud Platform: A Comprehensive Guide for Beginners, 287-308.
Bisong, E. (2019). Regularization for Linear Models. Building Machine Learning and Deep Learning Models on Google Cloud Platform: A Comprehensive Guide for Beginners, 251-254.
Bolleyer, N., & Börzel, T. A. (2010). Non-hierarchical policy coordination in multilevel systems. Eur. Polit. Sci. Rev., 2, 157–185.
Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123–140. https://doi.org/10.1007/BF00058655.
Breiman, L. (2001). Random Forests. Machine Learning, 45, 5–32. Available: https://doi.org/10.1023/A:1010933404324.
Bohr, J., & Dunlap, R. E. (2018). Key Topics in Environmental Sociology, 1990–2014: Results from a Computational Text Analysis. Environmental Sociology, 4(2), 181–195.
Brady, D., Beckfield, J., & Seeleib-Kaiser, M. (2005). Economic Globalization and the Welfare State in Affluent Democracies, 1975–2001. American Sociological Review, 70(6), 921–948.
Brady, D., Beckfield, J., & Zhao, W. (2007). The Consequences of Economic Globalization for Affluent Democracies. Annual Review of Sociology, 33, 313–334.
Braswell, T. (2022). Extended Spaces of Environmental Injustice: Hydrocarbon Pipelines in the Age of Planetary Urbanization. Social Forces, 100(3), 1025–1052.
Bro, R., & Smilde, A. K. (2014). Principal component analysis. Analytical methods, 6(9), 2812-2831.
Cohen, A. J., Brauer, M., Burnett, R., Anderson, H. R., Frostad, J., Estep, K., & Forouzanfar, M. H. (2017). Estimates and 25-year trends of the global burden of disease attributable to ambient air pollution: an analysis of data from the Global Burden of Diseases Study 2015. The Lancet, 389(10082), 1907-1918.
Demirtas, H., Freels, S. A., & Yucel, R. M. (2008). Plausibility of multivariate normality assumption when multiply imputing non-Gaussian continuous outcomes: a simulation assessment. Journal of Statistical Computation and Simulation, 78(1), 69-84.
Grimmer, J., Roberts, M. E., & Stewart, B. M. (2021). Machine learning for social science: An agnostic approach. Annual Review of Political Science, 24, 395-419.
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
Hall, M. A., & Smith, L. A. (1998). Practical feature subset selection for machine learning.
Hill, T. D., Jorgenson, A., Ore, P., Balistreri, K., & Clark, B. (2019). Air Quality and Life Expectancy in the United States: An Analysis of the Moderating Effect of Income Inequality. SSM – Population Health, 7, 100346. https://doi.org/10.1016/j.ssmph.2018.100346
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Science & Business Media.
Howley, T., Madden, M. G., O’Connell, M. L., & Ryder, A. G. (2005, December). The effect of principal component analysis on machine learning accuracy with high dimensional spectral data. In International Conference on Innovative Techniques and Applications of Artificial Intelligence (pp. 209-222). London: Springer London.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning (Vol. 112, p. 18). New York: Springer.
Karagulian, F., De Vito, S., Karatzas, K., Bartonova, A., & Fattoruso, G. (2023). New Challenges in Air Quality Measurements. In Air Quality Networks: Data Analysis, Calibration & Data Fusion (Environmental Informatics and Modeling). Springer.
Kumar, P., Druckman, A., Gallagher, J., Gatersleben, B., Allison, S., Eisenman, T. S., Hoang, U., Hama, S., Tiwari, A., Sharma, A., Abhijith, K. V., Adlakha, D., McNabola, A., Astell-Burt, T., Feng, X., Skeldon, A. C., de Lusignan, S., & Morawska, L. (2019). The nexus between air pollution, green infrastructure and human health. Environment International, 133(Part A), 105181. https://doi.org/10.1016/j.envint.2019.105181
Linardatos, P., Papastefanopoulos, V., & Kotsiantis, S. (2020). Explainable AI: A Review of Machine Learning Interpretability Methods. Entropy (Basel, Switzerland), 23(1), 18. https://doi.org/10.3390/e23010018
Lefebvre, H., & Nicholson-Smith, D. (1991). The production of space (Vol. 142). Blackwell: Oxford.
Lynch, K. (1964). The image of the city. MIT press.
Lim, W. H., Yamazaki, D., Koirala, S., Hirabayashi, Y., Kanae, S., Dadson, S. J., & Sun, F. (2018). Long‐term changes in global socio-economic benefits of flood defenses and residual risk based on CMIP5 climate models. Earth's Future, 6(7), 938-954.
Malik, S., & Wahid, J. (2014). Rapid urbanization: Problems and challenges for adequate housing in Pakistan.
Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press.
Nassar, D. M., & Elsayed, H. G. (2018). From informal settlements to sustainable communities. Alexandria engineering journal, 57(4), 2367-2376.
NCBI Resource Coordinators, Database Resources of the National Center for Biotechnology Information, Nucleic Acids Research, Volume 45, Issue D1, January 2017, Pages D12–D17, https://doi.org/10.1093/nar/gkw1071
Normann, H. E. (2017). Policy networks in energy transitions: The cases of carbon capture and storage and offshore wind in Norway. Technol. Forecast. Soc. Chang., 118, 80–93.
NIST/SEMATECH e-Handbook of Statistical Methods,(2022) http://www.itl.nist.gov/div898/handbook/ -- Kernel Density Plot. In DataPlot Reference Manual. -- Quantile-Quantile Plot.
Portney, K. E. (2013). Taking sustainable cities seriously: Economic development, the environment, and quality of life in American cities. MIT Press.
Rice, J., & Rice, J. S. (2009). The concentration of disadvantage and the rise of an urban penalty: urban slum prevalence and the social production of health inequalities in the developing countries. International journal of health services : planning, administration, evaluation, 39(4), 749–770. https://doi.org/10.2190/HS.39.4.i
Soomai, S. S., MacDonald, B. H., & Wells, P. G. (2013). Communicating environmental information to the stakeholders in coastal and marine policy-making: Case studies from Nova Scotia and the Gulf of Maine/Bay of Fundy region. Mar. Policy, 40, 176–186.
Stolz, A., & Hepp, M. (2015). Towards Crawling the Web for Structured Data: Pitfalls of Common Crawl for E-Commerce. In COLD.
Teoh, T. T., & Rong, Z. (2022). Regression. In Artificial Intelligence with Python (pp. 163-181). Singapore: Springer Singapore.
UNICEF. (2021). Children uprooted in a changing climate: Turning challenges into opportunities with and for young people on the move. UNICEF.
United Nations Department of Economic and Social Affairs. (2020). Shaping the Trends of Our Time: Report for the UN 75th Anniversary. United Nations. https://doi.org/10.18356/d81797b7-en
UN-Habitat. (2003). Global Report on Human Settlements 2003. UN-Habitat.

[1] United Nations Sustainable Development Goal (SDG) 10 (“UN SDG 10”) is “Reduce inequality within and among countries”: https://unstats.un.org/sdgs/report/2021/goal-10/. “UN SDG 11” is “Make cities and human settlements inclusive, safe, resilient and sustainable”: https://unstats.un.org/sdgs/report/2021/goal-11/

[2] https://data.unhabitat.org/pages/housing-slums-and-informal-settlements

[3] Urban legibility and spatial planning can inadvertently perpetuate inequality. Distinct zoning often leads to socio-economic segregation, confining low-income families to less favorable areas. According to Kevin Lynch, perceptions of the city, or 'mental maps', can vary significantly with socio-economic status, influencing experiences of urban life. Henri Lefebvre argued for the democratization of the production of space, envisaging an ideal city where all residents contribute to its shape and character. Yet, in practice, power and wealth disparities often skew this process, leading to urban spaces that reinforce existing socio-economic inequalities.

[4] WDI: https://datatopics.worldbank.org/world-development-indicators/; Urban indicators database: https://data.unhabitat.org

[5] The selection aligns with the United Nations' Sustainable Development Goals (SDGs) 10 and 11. https://www.un.org/sustainabledevelopment/inequality/

[6] Official Development Assistance (ODA) is the governmental aid aimed at fostering economic development and improving the well-being of developing nations. The OECD collects, verifies, and publicly disseminates data on ODA. (OECD https://www.oecd.org/dac/financing-sustainable-development/development-finance-standards/official-development-assistance.htm )

[7] To perform PCA, the data is first normalized to have a zero mean. Next, the covariance matrix is computed, which measures the joint variability across the variables in the dataset. The covariance matrix contains covariance scores for every variable with every other variable, including itself. The eigenvectors of the covariance matrix are then determined. Finally, the original data is multiplied by the eigenvectors to obtain the transformed data (Bro and Smilde 2014).
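The steps in this footnote can be written out directly with NumPy. The following is a minimal sketch on synthetic data; the function name `pca_via_covariance` and the demo array are hypothetical, and the projection is applied to the mean-centered data.

```python
import numpy as np


def pca_via_covariance(X, n_components):
    """PCA by eigen-decomposition of the covariance matrix.

    Returns (scores, variances, W): the projected data, the eigenvalues of the
    retained components, and the matrix whose columns are the eigenvectors.
    """
    # Step 1: center each variable to zero mean.
    X_centered = X - X.mean(axis=0)
    # Step 2: covariance of every variable with every other variable.
    cov = np.cov(X_centered, rowvar=False)
    # Step 3: eigen-decomposition (eigh suits symmetric matrices).
    eigvals, eigvecs = np.linalg.eigh(cov)
    # eigh returns ascending eigenvalues; sort to descending variance.
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Step 4: project the centered data onto the leading eigenvectors.
    W = eigvecs[:, :n_components]
    return X_centered @ W, eigvals[:n_components], W


# Synthetic placeholder data (NOT the study's data).
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))
scores, variances, W = pca_via_covariance(X_demo, n_components=2)
print(scores.shape, variances.round(3))
```

The sample variance of each column of `scores` equals the corresponding eigenvalue, which is why eigenvalues are read as the variance explained by each component.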

[8] The presence of strong correlations among social attributes can adversely impact the accuracy of predictions. PCA is employed to address the challenges of high dimensionality and collinearity by reducing the number of predictor attributes (Howley et al. 2005:364).

[9] Specifically, Ridge Regression achieved an MAE of 0.4545, indicating that, on average, the predicted values deviated by 0.4545 units from the actual values. Additionally, it achieved an MSE of 0.2707, which represents the average squared difference between the predicted and actual values. Lastly, Ridge Regression obtained an R-squared score of 0.9532, indicating that approximately 95.32% of the variance in the dependent variable can be explained by the independent variables in the model. Overall, these results suggest that Ridge Regression performed well and demonstrated a strong ability to predict the outcome.

[10] It is important to note that while the statistical analysis in this research can identify relationships and correlations, it does not prove causation. While the data suggest that urban population growth is strongly associated with slum population growth, the method adopted here does not definitively prove that one causes the other.

[11] The shared code begins by importing the necessary libraries. It then defines the available and missing years in the 'years' and 'missing_years' arrays respectively, and the proportion of the urban population living in slums in the 'proportion' array. To handle the missing data, the 'years' and 'proportion' arrays are combined using 'np.column_stack()' to generate a 'data' array. An IterativeImputer instance is fitted to this array to perform the MI. This procedure fills the missing values for the monitored years using sklearn's MI technique.
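A sketch of what such an imputation step could look like is shown below. The array names 'years', 'missing_years', 'proportion', and 'data' follow the footnote, but every value is a made-up placeholder, and the way the missing years are appended as rows with NaN proportions is an assumption about how the arrays are combined, not a description of the repository code.

```python
import numpy as np
# This import is required to expose the experimental IterativeImputer class.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Placeholder values (NOT the study's data).
years = np.array([2000, 2005, 2010, 2015, 2020], dtype=float)
missing_years = np.array([2002, 2008, 2012, 2018], dtype=float)
proportion = np.array([40.0, 37.5, 35.0, 32.5, 30.0])  # % of urban population in slums

# Observed rows: (year, proportion).
observed = np.column_stack((years, proportion))
# Assumed layout for unobserved years: known year, NaN proportion.
to_fill = np.column_stack((missing_years, np.full(missing_years.shape, np.nan)))
data = np.vstack((observed, to_fill))

# Each feature with missing values is modeled from the other features;
# the default estimator is BayesianRidge.
imputer = IterativeImputer(max_iter=10, random_state=0)
imputed = imputer.fit_transform(data)

filled = imputed[len(years):]  # rows corresponding to missing_years
for year, value in filled:
    print(f"{int(year)}: {value:.2f}")
```

With only the year as a predictor, the imputed values follow a fitted linear trend between the observed points.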
