The reviewed studies focus predominantly on higher education students, who account for 78.6% of the studies, compared with 17.9% for secondary education students. This discrepancy highlights the emphasis on career analysis within higher education. However, career planning services could also be extended to students at an earlier stage, particularly those in secondary education. Enabling career planning at this earlier stage holds promise for giving students more time and resources for informed decision-making (Magnuson & Starr, 2000). Moreover, career analysis could be particularly beneficial for students with special needs (Sobnath et al., 2020a). This suggests an opportunity for inclusive career guidance strategies that cater to a diverse range of students, ultimately fostering a more equitable and effective educational experience.
The increasing number of studies on career prediction employing data analysis, as revealed in this systematic review (Fig. 3), signifies growing interest in, and recognition of, the potential of data-driven approaches to career prediction. The continuous increase in research on this topic since 2016 highlights a sustained commitment to leveraging advanced computational methods to understand and enhance career outcomes for students. This trend not only reflects a maturing field of study but also suggests that the integration of data science techniques in career analysis is becoming more established within academic and professional communities.
Furthermore, the prevalence of studies centered in Asia (Fig. 2) reveals a geographical trend. The heightened interest in this topic within Asian contexts may stem from a combination of factors, including rapid technological advancement, a strong emphasis on education, and a burgeoning job market (Kurniawati, 2021; Woetzel & Seong, 2021; Zhao, 2017). In contrast, the relative scarcity of similar studies in Western regions may be attributed to differing educational and career development paradigms. This regional disparity highlights the need for a more comprehensive and global perspective on the application of data science in career analysis. It presents an opportunity for cross-cultural insights and collaboration to further refine and tailor data-driven approaches to career planning on a global scale.
Random Forest was the most commonly used data science algorithm in the studies examined (RQ1). Its popularity can be attributed to several reported advantages (Fawagreh et al., 2014; Y. Liu et al., 2012). It outperforms single classifiers in terms of classification rate and accuracy. It has a shorter training time than bagging or boosting. It is effective on large data sets and is robust to noise in the data, which helps limit overfitting. It provides internal estimates of error, strength, correlation, and variable importance. It is also straightforward and easily parallelized. Together, these advantages make Random Forest a preferred algorithm and explain its widespread adoption in studies aimed at predicting students' career pathways and employability. Random Forest is an ensemble learning method, and ensemble learning was also the most commonly used category of methods (Table 2). This is consistent with the observation that, in machine learning, ensemble approaches often produce more accurate results than a single model (Dietterich, 2000).
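The intuition behind the ensemble advantage can be illustrated with a short calculation. The sketch below is not taken from the reviewed studies. It assumes an idealized ensemble of independent binary classifiers that each have the same probability of being correct and that are combined by majority vote. The function name `majority_vote_accuracy` and the values used are hypothetical. Trees in a real Random Forest are correlated, so the gain in practice is smaller than this idealized figure.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p_correct: float) -> float:
    """Probability that a strict majority of n independent classifiers is correct.

    Assumption (idealized): every classifier is correct with the same
    probability p_correct, and their errors are statistically independent.
    """
    if n_models % 2 == 0:
        raise ValueError("use an odd number of models so that ties cannot occur")
    needed = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct) ** (n_models - k)
        for k in range(needed, n_models + 1)
    )


if __name__ == "__main__":
    # Hypothetical base accuracy of 0.6 for each individual classifier.
    for n in (1, 5, 21, 101):
        print(n, round(majority_vote_accuracy(n, 0.6), 4))
```

Under these assumptions the accuracy of the vote grows with the number of models when each model is better than chance, and shrinks when each model is worse than chance. This is why ensembles are usually built from diverse base learners that are individually at least somewhat accurate.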
In student career analyses, academic performance emerged as the most commonly included feature, followed by student demographics (RQ2). It is therefore important for educational institutions to document these variables properly. An accurate record of academic achievements and a comprehensive understanding of each student's demographic background not only aid in tailoring educational experiences but also serve as a cornerstone for effective career guidance. This recommendation is reinforced by the fact that, besides surveys, most of the data in the reviewed studies came from school or university databases. Additionally, future studies should incorporate feature importance analysis, which allows the identification of the key predictors that influence career predictions. Academic performance, achievements, work experience, and performance and behavior on online learning platforms emerged as the most significant predictors in this systematic review, so educational institutions should prioritize the systematic documentation of these variables as well. Teachers, armed with this information, can offer more targeted guidance to students, nurturing their strengths and addressing areas for development. For students, recognizing the weight of these variables in career predictions provides valuable insight, encouraging them to engage actively in their academic pursuits and to seek out additional opportunities for growth and recognition. This mutually reinforcing relationship between schools, teachers, and students, supported by comprehensive data collection and analysis, lays a solid foundation for informed career planning and attainment.
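One common way to carry out feature importance analysis is permutation importance: a feature's values are shuffled and the resulting drop in model accuracy is measured. The sketch below is an illustration under stated assumptions and not the method of any reviewed study. The toy model, the feature names `gpa` and `age`, and all data values are hypothetical and synthetic. The labels are generated from the toy model itself, so baseline accuracy is 1.0 by construction.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows for which the model's prediction equals the label."""
    return sum(model(row) == label for row, label in zip(rows, labels)) / len(rows)


def permutation_importance(model, rows, labels, feature_names, n_repeats=20, seed=0):
    """Mean drop in accuracy when one feature's column is randomly shuffled."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    baseline = accuracy(model, rows, labels)
    importances = {}
    for name in feature_names:
        drops = []
        for _ in range(n_repeats):
            column = [row[name] for row in rows]
            rng.shuffle(column)
            # Copy each row, replacing only the feature under test.
            shuffled_rows = [{**row, name: value} for row, value in zip(rows, column)]
            drops.append(baseline - accuracy(model, shuffled_rows, labels))
        importances[name] = sum(drops) / n_repeats
    return importances


def toy_model(row):
    """Hypothetical classifier: predicts 1 ('employed') from GPA alone."""
    return 1 if row["gpa"] >= 3.0 else 0


# Synthetic, hypothetical student records (not data from any reviewed study).
rows = [
    {"gpa": g, "age": a}
    for g, a in [(3.8, 21), (2.1, 22), (3.2, 20), (2.7, 23),
                 (3.5, 22), (2.4, 21), (3.9, 24), (2.9, 20)]
]
labels = [toy_model(row) for row in rows]  # labels follow the model by construction

importances = permutation_importance(toy_model, rows, labels, ["gpa", "age"])

if __name__ == "__main__":
    print(importances)
```

Because the toy model ignores `age`, shuffling that column never changes a prediction and its importance is exactly zero, whereas shuffling `gpa` degrades accuracy. With a real trained model, the same procedure ranks predictors by how much the model depends on each one.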
The most frequently used evaluation metric in the prediction models is accuracy (RQ3). Accuracy represents the proportion of correctly predicted instances out of the total number of instances. While it provides a straightforward and easily interpretable assessment of model performance, it may not always be the most suitable metric, particularly with imbalanced or multiclass datasets (Ranawana & Palade, 2006; Wilson, 2001). Consequently, alternative evaluation metrics are often employed to complement the assessment of prediction performance. Recall is the second most frequently used metric, followed by precision, F1-measure, time, AUC, RMSE, MAE, Matthews correlation coefficient, specificity, Kappa, hit ratio, negative predictive value (NPV), and log-loss.
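The weakness of accuracy on imbalanced data can be shown with a small worked example. The sketch below is an illustration only. The class counts and both sets of predictions are hypothetical, and the helper `binary_metrics` is not drawn from any reviewed study. It computes several of the metrics listed above from the confusion-matrix counts, returning 0.0 by convention when a denominator is zero.

```python
from math import sqrt


def binary_metrics(y_true, y_pred, positive=1):
    """Compute common binary classification metrics from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    def safe_div(num, den):
        # Convention for this sketch: an undefined ratio is reported as 0.0.
        return num / den if den else 0.0

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "accuracy": safe_div(tp + tn, len(pairs)),
        "precision": precision,
        "recall": recall,
        "specificity": safe_div(tn, tn + fp),
        "npv": safe_div(tn, tn + fn),
        "f1": safe_div(2 * precision * recall, precision + recall),
        "mcc": safe_div(tp * tn - fp * fn,
                        sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))),
    }


# Hypothetical imbalanced sample: 10 positive cases out of 100.
y_true = [1] * 10 + [0] * 90

# Model A (hypothetical): predicts the majority class for everyone.
always_negative = [0] * 100

# Model B (hypothetical): finds 6 of the 10 positives, with 9 false alarms.
some_positives = [1] * 6 + [0] * 4 + [1] * 9 + [0] * 81

metrics_a = binary_metrics(y_true, always_negative)
metrics_b = binary_metrics(y_true, some_positives)

if __name__ == "__main__":
    for name, metrics in (("A", metrics_a), ("B", metrics_b)):
        print(name, {key: round(value, 3) for key, value in metrics.items()})
```

In this example Model A has the higher accuracy even though it never identifies a positive case, while Model B scores higher on recall, F1, and the Matthews correlation coefficient. Reporting these complementary metrics alongside accuracy makes such differences visible.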
Several limitations were observed across the studies reviewed in this systematic analysis. One of them was insufficient sample size (ElSharkawy et al., 2022; Guleria & Sood, 2023; José-García et al., 2022; Y. Wang et al., 2022; Yeung & Yeung, 2019). This limitation can hinder the robustness and generalizability of the findings, pointing to the importance of adequate sample representation for accurate predictions. Moreover, a prevalent challenge was the restricted number of features considered as predictors (Guleria & Sood, 2023; Maaliw et al., 2022; Mandalapu & Gong, 2019; Saidani et al., 2022; Yang & Chang, 2023). While many studies made commendable use of the available data, the omission of potentially influential variables may lead to an incomplete understanding of the complex interplay between the factors that influence career trajectories. Thus, future research should comprehensively evaluate and incorporate a broader array of features to enhance the accuracy and effectiveness of career prediction models.
4.1 Limitations
This study, like any other literature review, is limited by the search terms used. To overcome this limitation, we used synonymous and interchangeable phrases as well as performed an additional search within various journals. However, it is probable that some relevant works did not match our search criteria.