The outbreak of the recent coronavirus disease (COVID-19) pandemic disrupted life across the globe. The virus was first identified in December 2019 in Wuhan, China, and, owing to its deadly impact and rapid spread, the outbreak was later declared a pandemic by the World Health Organization (WHO) [1]. COVID-19 is caused by a novel coronavirus known as severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), which became a global threat [2]. It has affected widely disparate sectors of life and had infected around 756,581,850 individuals and caused 6,844,267 deaths by February 17, 2023 [3]. Subsequently, significant progress was observed after the COVID-19 vaccination process began, which resulted in a remarkable decline in the number of positive cases. A recent WHO report suggests that the number of weekly cases is steadily declining and that almost 13,195,832,385 vaccine doses had been administered around the world by February 17, 2023 [3]. With this progress, recent studies indicate that economic activity is gradually returning to normal [4].
During this pandemic, the scientific and research community, with the support of paramedical staff, has played an important role in highlighting different factors such as cures, precautions, testing, medications, and vaccine development. For instance, researchers have presented several ideas to help various sectors such as education [5], national economies [6], corporations [7], and healthcare [8] react quickly and protect their communities throughout the crisis. In addition, the Artificial Intelligence (AI) and Machine Learning (ML) communities have made major contributions, resulting in several digital solutions that help officials reduce COVID-19’s impact on society. Developing smart applications [9], predicting future cases [10], applying smart surveillance systems for contact tracing [11], improving testing capabilities [12], and advancing smart diagnostic systems [13] are some of the ideas proposed by the AI and ML research communities.
Additionally, the availability of public datasets has encouraged computing researchers and statisticians to investigate COVID-19 characteristics systematically and present valuable recommendations. In this regard, the GitHub[1] and Kaggle[2] research communities have provided collaborative platforms for researchers and organizations. Similarly, data.world[3] and HDX[4] are online web portals that provide public COVID-19 datasets. These platforms are a helpful source for researchers to analyze and evaluate data, identify new patterns, and offer recommendations that help authorities respond efficiently during this crisis. Many researchers have contributed to these data platforms by uploading datasets, such as chest X-ray images [14], daily incidence data [15], and daily reported cases [16], that are publicly available for future research.
Therefore, this research takes a step toward delivering a statistical and ML model that can assist healthcare professionals in treating patients of different age groups. The research problem addressed in this study is “to predict the age group of COVID-19 patients using different attributes by applying statistical and machine learning approaches”. To this end, the authors applied multiple strategies in selecting appropriate datasets. In addition to using a previously published dataset [17], the researchers collected a new dataset from the repositories of different Pakistani hospitals. Furthermore, the datasets were collected after the identification of the Delta variant in Pakistan [18], which highlights the data’s importance in identifying symptoms associated with this recent mutation of COVID-19. Firstly, this study presents a detailed analysis and investigation of two separate datasets collected from different countries. The idea is to compare the results generated from both datasets in light of the situation in each country. Secondly, the statistical analysis identifies the appropriate dataset for ML implementation. Finally, ML algorithms are applied to predict the age groups of people infected with COVID-19 using five common symptoms: (i) cough, (ii) fever, (iii) sore throat, (iv) shortness of breath, and (v) headache [19,20].
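To make the prediction task concrete, the following minimal sketch frames it as a classification problem: each patient is represented by five binary symptom indicators, and the target is an age-group label. It is an illustration only and not the method used in this study. The age-group labels, the records, and the function names `fit_pattern_baseline` and `predict` are hypothetical, and the lookup-table baseline stands in for the ML algorithms discussed later.

```python
from collections import Counter, defaultdict

# Order of the five binary symptom indicators in each pattern tuple.
SYMPTOMS = ("cough", "fever", "sore_throat", "shortness_of_breath", "headache")


def fit_pattern_baseline(records):
    """Learn the most frequent age group for each observed symptom pattern.

    records: iterable of (pattern, age_group), where pattern is a tuple of
    five 0/1 values ordered as in SYMPTOMS.
    Returns (lookup_table, default_label).
    """
    by_pattern = defaultdict(Counter)
    overall = Counter()
    for pattern, age_group in records:
        by_pattern[pattern][age_group] += 1
        overall[age_group] += 1
    # Fallback for unseen patterns: the most frequent age group overall.
    default = overall.most_common(1)[0][0]
    table = {p: c.most_common(1)[0][0] for p, c in by_pattern.items()}
    return table, default


def predict(model, pattern):
    """Return the predicted age group for one symptom pattern."""
    table, default = model
    return table.get(pattern, default)


# Synthetic records with hypothetical age-group labels (not study data).
train = [
    ((1, 1, 0, 0, 0), "18-39"),
    ((1, 1, 0, 0, 0), "18-39"),
    ((1, 1, 0, 0, 0), "40-59"),
    ((1, 0, 0, 0, 0), "18-39"),
    ((1, 1, 0, 1, 0), "60+"),
    ((0, 1, 0, 1, 0), "60+"),
    ((0, 0, 1, 0, 1), "0-17"),
]
model = fit_pattern_baseline(train)
```

With only five binary features there are at most 32 distinct symptom patterns, so a baseline of this kind gives a simple reference point against which more expressive classifiers can be compared.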
The following are the main contributions of this research article:
- Curation of a ready-to-use dataset collected from medical institutions operating in Pakistan.
- Publication of this ready-to-use dataset on public research platforms for potential use and implementation by researchers around the world in the future.
- Proposal of a statistical and machine learning approach to extract the association between COVID-19 symptoms and a patient’s age group.
- Identification of the likelihood and significance of each symptom in infected people.
- Application of ANOVA and t-tests to the five investigated symptoms across the age groups.
- Implementation of multiple machine learning algorithms and ensemble approaches to compare results and identify the optimal machine learning approach for prediction.
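As an illustration of the statistical testing named in the contributions above, the following sketch computes the one-way ANOVA F-statistic for a single binary symptom indicator across age groups. It is a minimal example under stated assumptions and not the analysis pipeline of this study: the function name `one_way_anova_f` and the sample values are hypothetical. Obtaining a p-value additionally requires the F distribution, which a statistics library such as SciPy (`scipy.stats.f_oneway`) provides.

```python
from statistics import fmean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    groups: list of lists, one list of observations per age group.
    """
    k = len(groups)                      # number of groups
    n = sum(len(g) for g in groups)      # total number of observations
    grand_mean = fmean(x for g in groups for x in g)

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((x - fmean(g)) ** 2 for g in groups for x in g)

    df_between = k - 1
    df_within = n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Synthetic fever indicators (1 = present) for three hypothetical age groups.
fever_by_age_group = [
    [1, 1, 1, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
]
f_stat, df_b, df_w = one_way_anova_f(fever_by_age_group)
```

A large F-statistic relative to the F distribution with `df_b` and `df_w` degrees of freedom would suggest that the prevalence of the symptom differs between age groups.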
The rest of the paper is structured as follows: Section 2 formalizes the context of this research study. Section 3 discusses the methodological steps. Section 4 explains the statistical analysis conducted in this study. Section 5 demonstrates the machine learning implementation. Finally, the last section concludes the research.