This study breaks new ground by utilizing cluster analysis to identify sub-phenotypes of behavioral risk profiles related to youth substance use on population-based longitudinal health surveys. Traditional studies in this field have focused on identifying individual risk factors associated with substance use among youth and investigating the statistical power of each factor. Some studies have also used clustering algorithms to identify patterns of substance use. However, no existing studies have assessed risk profile phenotypes of youth substance use involving substance use indicators and associated health risk behaviors with a data-driven approach. As the first study to ascertain sub-phenotypes of risk profiles among Canadian youth, our research sheds light on risk profiling with cross-sectional and longitudinal evidence. Our findings are consistent with previous studies that employ statistical modelling approaches in determining risk factors16,17,18,19,31,36,38.
Our study results showed that the number of smoking friends, number of classes skipped, and weekly allowance were consistently among the top factors associated with substance use among Canadian youth. These findings are consistent with previous studies that have identified peer influence, academic problems, and access to money as key risk factors for substance use among young people39,40. Studies have shown that peers play a significant role in youth substance use. Adolescents with more friends who use substances are more likely to use substances themselves39. This can be attributed to peer pressure, social norms, and the influence of social networks on behavior40. In the context of smoking, one study found that the number of smoking friends was a strong predictor of adolescent smoking41. Skipping classes is also associated with increased substance use among youth. Studies have shown that students who skip classes are more likely to use substances, including alcohol and marijuana42,43. This could be due to a lack of supervision, boredom, or increased opportunities to use substances when not in school. Finally, the association between weekly allowance and substance use is complex. While some studies have found that higher levels of allowance are associated with increased substance use44, others have found no significant association45. It is possible that the relationship between allowance and substance use may be mediated by other factors, such as peer influence or parental monitoring.
Our finding agrees with Halladay et al. (2020), who revealed an association between substance use and mental health46. Many studies have found that individuals in the multi-use group report more psychiatric symptoms, including depression and anxiety, than those in the single-use group23,24. Radloff (1977) states that a CESD score ≥ 10 represents clinically relevant depressive symptoms47. In our study, students in SP4 had a CESD score of 10.8 ± 6.70 at WI, suggesting that, on average, this group already experienced clinically relevant depressive symptoms. In contrast, those in SP1 had the lowest CESD score (7.08 ± 5.39). On average, individuals in SP3 reached the threshold for clinically relevant depressive symptoms at WIII, with an average CESD score of 10.0 ± 6.15.
A sedentary lifestyle was associated with a high risk of substance use. Our finding agrees with West et al. (2020) that sedentary behavior was positively associated with adolescent drinking and marijuana consumption48. At WI, students in SP1 had, on average, the smallest sedentary time (170 ± 61.3 minutes), whereas their peers in SP4 averaged 1012 ± 269 minutes, a 5.95-fold difference. The SP4-to-SP1 ratio in the following two years was similar to that at WI: 6.09 at WII and 5.70 at WIII.
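The WI fold difference follows directly from the two group means. A minimal Python check, using only the means quoted above (the variable names are illustrative; the WII and WIII group means are not reported in this section, so those ratios are not recomputed here):

```python
# Mean sedentary time at WI, in minutes, as reported in the text.
sp1_mean_minutes = 170.0
sp4_mean_minutes = 1012.0

# Fold difference between the highest-risk (SP4) and lowest-risk (SP1) groups.
fold_difference = sp4_mean_minutes / sp1_mean_minutes
print(f"SP4 / SP1 at WI: {fold_difference:.2f}")  # prints "SP4 / SP1 at WI: 5.95"
```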
Similarly, eating breakfast was associated with a low risk of substance use among Canadian youth. Our result is consistent with the literature linking youth substance use to nutrition-related attitudes; for example, substance users were more likely to have poor eating habits, including not eating breakfast26. The prevalence of eating breakfast decreased as the risk level increased from SP1 to SP4. The longitudinal evidence suggests that the prevalence of eating breakfast in the SP1 group decreased across the three years, and the same trend was observed in the other risk profile groups (SP2 to SP4).
Utilizing the top factors correlated with youth substance use, this study identifies four sub-phenotypes of risk profiles, SP1 through SP4, with risk escalating from one group to the next. These sub-phenotypes provide a more comprehensive overview and quantification of the prominent characteristics at each level of risk of engaging in substance use among youth. Note that the prevalence of marijuana consumption increased 3.98-fold, from 4.0% at WI to 15.9% at WIII. This increase, the largest among the substances examined, may be partly due to the legalization of the recreational use of marijuana in October 2018.
This study has implications from public health and health policy perspectives. Our findings suggest that the correlates of youth substance use are multifaceted, spanning individual-level factors, peer influence, and environmental effects. Understanding risk profiles related to youth substance use will help school program managers and policymakers identify and characterize valuable measures for evaluating risk reduction interventions. At the population level, differences between sub-phenotypes of risk profiles may have essential effects on subjective youth behavioral and mental health. This can be further conceptualized as different preventive capabilities against addictive behaviors in school settings. The diverse associations between youth substance use and multifaceted health-related behaviors should be considered by decision-makers who want to invest in risk reduction interventions targeting multiple health risk behaviors among youth.
Cluster analysis is a valuable tool for this study, identifying hidden patterns of risk profile related to youth substance use. This approach enabled the identification of subgroups of individuals with similar patterns of risk factors, providing a more nuanced understanding of the complex factors associated with substance use. By understanding the unique risk profiles of different subgroups of adolescents, interventions can be tailored to address the specific needs of each group, leading to more effective prevention and intervention strategies. For example, a prevention program that addresses peer influence may be particularly effective for a subgroup of youth who are heavily influenced by their smoking friends, while a program that focuses on academic support may be more effective for a sub-cohort who are skipping classes and struggling with their schoolwork.
Cluster analysis has both advantages and limitations. For example, different clustering algorithms often produce very different results because they use distinct criteria for forming clusters. Although cluster analysis is well suited to revealing “hidden” patterns and unexpected associations among variables, hierarchical methods cannot revise merges made in earlier steps. To mitigate these limitations, we implemented several types of clustering algorithms: fuzzy, partitioning-based, and hierarchical. The last two are hard clustering methods that assign each data element to exactly one cluster. In contrast, fuzzy clustering algorithms are soft clustering methods that assign each object a membership coefficient for every cluster. We focused on fuzzy clustering algorithms because of the overlapping nature of the risk profiles observed in the longitudinal samples of COMPASS data. Our results show that fuzzy clustering outperforms hierarchical and partitioning clustering on COMPASS data, demonstrating its appropriateness for uncovering the overlapping nature of risk profile phenotypes related to youth substance use. Furthermore, we used the t-SNE algorithm to project the high-dimensional data into a lower-dimensional space for visualization. In addition, silhouette plots provide an internal index for clustering validation, and the different risk levels with their associated characteristics represent the risk profiling.
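The distinction between hard and soft assignment can be illustrated with a minimal fuzzy c-means sketch in pure Python. This is not the pipeline used in this study: the data are synthetic, and the function name `fuzzy_c_means`, the fuzzifier `m = 2`, the fixed iteration count, and the seeds are all assumptions made for illustration.

```python
import math
import random


def fuzzy_c_means(points, n_clusters, m=2.0, n_iter=100, seed=0):
    """Illustrative fuzzy c-means; returns (centers, memberships)."""
    rng = random.Random(seed)
    # Random initial memberships, normalised so each row sums to 1.
    U = []
    for _ in points:
        row = [rng.random() for _ in range(n_clusters)]
        total = sum(row)
        U.append([u / total for u in row])
    dim = len(points[0])
    centers = []
    for _ in range(n_iter):
        # Centers are membership-weighted means, with weights u ** m.
        centers = []
        for j in range(n_clusters):
            w = [U[i][j] ** m for i in range(len(points))]
            centers.append(
                [sum(wi * p[k] for wi, p in zip(w, points)) / sum(w)
                 for k in range(dim)]
            )
        # Memberships are proportional to distance ** (-2 / (m - 1)).
        for i, p in enumerate(points):
            inv = [max(math.dist(p, c), 1e-12) ** (-2.0 / (m - 1.0))
                   for c in centers]
            total = sum(inv)
            U[i] = [v / total for v in inv]
    return centers, U


# Synthetic data: two well-separated groups plus one point midway between them.
data_rng = random.Random(1)
low = [(data_rng.gauss(0, 0.5), data_rng.gauss(0, 0.5)) for _ in range(20)]
high = [(data_rng.gauss(10, 0.5), data_rng.gauss(10, 0.5)) for _ in range(20)]
border = [(5.0, 5.0)]
points = low + high + border

centers, U = fuzzy_c_means(points, n_clusters=2)

# Hard clustering keeps only the most likely cluster for each point.
hard_labels = [row.index(max(row)) for row in U]

print("memberships of the midway point:", [round(u, 2) for u in U[-1]])
print("hard label of the midway point:", hard_labels[-1])
```

A hard label forces the midway point into a single cluster, whereas its membership coefficients show that it belongs to both clusters to a similar degree. This is the kind of overlap that motivates soft clustering when risk profiles are not cleanly separated.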
The major strengths of this study derive from the COMPASS data and methodologies. COMPASS data are high-dimensional population-level health surveys with large sample sizes and reliable data quality, with measures based on national surveillance instruments49. The Cq employs active-information passive-consent protocols, which help achieve high participation rates and reduce sampling bias while preserving student confidentiality50. COMPASS methodologies are sufficiently robust given the delicate balance between data accuracy and participant anonymity in longitudinal studies concerning youth health behaviors51. The multiple sources of COMPASS data bring new insights that strengthen the cross-sectional and longitudinal evidence of risk profiling. Another strength is the applied methods, which include comprehensive data preprocessing steps and advanced unsupervised machine learning (ML) models, such as fuzzy clustering, to discover hidden patterns. The ML pipeline developed in this study can be used in real-world decision support as a clinical tool that determines risk profile phenotypes, taking real-world raw data through to final analysis results that are interpretable to stakeholders.
This study has certain limitations22. First, it would be preferable to have external data to validate the identified clusters. However, the subsequent data collection cycles were challenging due to the COVID-19 pandemic. Second, COMPASS data are subject to self-report bias, which may lead to inaccurate measures. Third, many participating schools in the COMPASS study were recruited by convenience sampling, and the sample does not cover all Canadian provinces. The sampled schools may therefore not be truly representative, limiting the results’ generalizability. Moreover, omitting participants with irregular patterns in their grades may also compromise generalizability. Future work is warranted to evaluate whether there were any systematic differences between those retained in the analyses and those excluded. Last, as with any large-scale health survey, non-responses introduce missing values. We applied various imputation techniques to impute these missing values.
In conclusion, this study provides novel insights into the identification of sub-phenotypes of behavioral risk profiles related to substance use among Canadian youth using cluster analysis. Our findings have practical implications for practitioners in school settings, who can use the evidence provided to develop targeted risk reduction interventions for at-risk adolescents. Furthermore, this study provides valuable information for stakeholders to guide the implementation of school policies and procedures to improve youth health behaviors. This research contributes to a better understanding of the complex relationships between substance use and associated health risk behaviors among adolescents and can inform the development of effective prevention strategies.