An Algorithm for Notifiable Disease Modeling and Prediction Using Artificial Intelligence Techniques: A Case of Kenya.


Abstract

Background
The disease outbreak management operations of most countries (notably Kenya) present numerous novel ideas of how best to make use of notifiable disease data to effect proactive interventions.
Notifiable disease data is reported, aggregated and variously consumed. Over the years, there has been a deluge of notifiable disease data, and the challenge for the entities that manage it has been how to aggregate such data objectively and dynamically so that it can be consumed efficiently to inform relevant mitigation measures. Various models have been explored, tried and tested with varying results; some are purely mathematical and statistical, others quasi-mathematical and software model-driven.

Methods
One of the tools that has been explored is Artificial Intelligence (AI). AI is a technique that enables computers to intelligently perform and mimic actions and tasks usually reserved for human experts. AI presents a great opportunity for redefining how the data is more meaningfully processed and packaged. This research explores AI's Machine Learning (ML) theory as a differentiator in the crunching of notifiable disease data and adding perspective. An algorithm has been designed to test different notifiable disease outbreak data cases, marking a shift to managing disease outbreaks via the symptoms they generally manifest.
Each notifiable disease is broken down into a set of symptoms, dubbed symptom burden variables, which are then categorized into eight clusters: Bodily, Gastro-Intestinal, Muscular, Nasal, Pain, Respiratory, Skin and, finally, Other Symptom Clusters. ML's decision tree theory has been utilized in the determination of the entropies and information gains of each symptom cluster based on select test data sets.

Results
Once the entropies and information gains have been determined, the information gain variables are ranked in descending order, from the variables with the highest information gains to those with the lowest, thereby giving a clear-cut criterion for how the variables are ordered. The ranked variables are then utilized in the construction of a binary decision tree, which graphically and structurally represents the variables. Should any variables tie in the information gain rankings, they are given equal importance in the construction of the binary decision tree. From the presented data, the computed information gains are ordered as: Gastro-Intestinal, Bodily, Pain, Skin, Respiratory, Others, Muscular and, finally, Nasal Symptoms. The corresponding binary decision tree is then constructed.

Conclusions
The algorithm successfully singles out the disease burden variable(s) that are most critical as the point of diagnostic focus, enabling the relevant authorities to take the necessary, informed interventions. This algorithm provides a good basis for a country's localized diagnostic activities driven by data from the reported notifiable disease cases. The algorithm presents a dynamic mechanism that can be used to analyze and aggregate any notifiable disease data set, meaning that the algorithm is not fixated or locked on any particular data set.

Background
Disease surveillance is an information-based activity involving the collection, analysis and interpretation of large volumes of disease outbreak data from a variety of sources in order to inform and drive objective and informed intervention. The Disease Surveillance and Response Unit (DSRU) is the entity mandated (in Kenya) to monitor and undertake response and mitigation measures in the event of a notifiable disease outbreak; a notifiable disease refers to any disease in a country or community whose occurrence must be reported to the authorities (WHO, 2006). Each time a notifiable disease is reported, the DSRU undertakes the necessary response activities (CDC, 2012; DSRU, 2014).
In Kenya, disease outbreaks are mostly tackled from two perspectives. The first is reactive measures: in the event a notifiable disease outbreak is reported, mitigating steps are undertaken only in response to the particular incident(s) to minimize the potential adverse effects, and little is learned or retained in the aftermath that could meaningfully, incrementally and objectively inform the handling of future outbreaks. The second is proactive measures: anticipatory measures are put into play such that, should an outbreak occur or recur, its adverse effects are greatly minimized, with health personnel taking informed, premeditated and experience-driven steps and being better prepared to cope with every subsequent outbreak.
Infectious diseases have included some of the most contagious and feared plagues of the past, and new strains continue to emerge over time; this warrants a widely co-operative and proactive approach even when disease outbreak response and intervention efforts remain the prerogative of the concerned national government. Global partners (such as the Centers for Disease Control and Prevention [CDC], the United States Agency for International Development [USAID] and the World Health Organization [WHO], among others) have also been seen to play a great role by working in close collaboration to offer much-needed medico-technical and social support from their experienced and seasoned teams, which cut across numerous medical specialities and vast geopolitical backgrounds (Brownstein et al, 2009; Martinez, 2000).
To enable each country's concerned teams to manage its disease outbreaks more efficiently, a notifiable disease list and its epidemiological week (Epi-week) must be defined; an Epi-week is a weekly period in a country within which notifiable disease outbreak data must be recorded and reported to the relevant health authorities. Kenya's Epi-week runs from Monday through Sunday (DSRU, 2014; WHO, 2006).
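Because the Epi-week boundaries decide which reporting period a case falls into, it can help to make the Monday-through-Sunday rule explicit. The following is a minimal Python sketch that assumes only that rule; the function name `epi_week_bounds` is hypothetical, and the sketch deliberately does not assign week numbers, since the numbering convention is not stated here.

```python
from datetime import date, timedelta


def epi_week_bounds(report_date: date) -> tuple[date, date]:
    """Return the (Monday, Sunday) pair that encloses report_date.

    Assumes an Epi-week that runs Monday through Sunday, as stated for Kenya.
    """
    # date.weekday() is 0 for Monday and 6 for Sunday.
    monday = report_date - timedelta(days=report_date.weekday())
    sunday = monday + timedelta(days=6)
    return monday, sunday


# Example: find the reporting window for a case recorded on 1 January 2014.
start, end = epi_week_bounds(date(2014, 1, 1))
print(f"Epi-week window: {start.isoformat()} to {end.isoformat()}")
```

Grouping case records by the returned Monday is one simple way to aggregate reports per Epi-week.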
The efforts to manage disease outbreaks have become a very complex endeavour. Historically, it was easier because populations were smaller and cross-border and cross-territorial movements and interactions were limited and localized, which curtailed the dissemination of any infectious diseases the concerned population may have been harbouring; this has greatly changed with the advent of globalization (Wagner, 2001).
The effects of globalization have brought forth new, dynamic risk factors in disease spread and management. One such factor is the faster and easier cross-border movement of people and animals, which makes diseases spread faster. Urbanization, for instance, remains one of the greatest factors of disease spread: new urban settlements, and a large community of skilled, readily available labour commuting across geopolitical boundaries, can create infection epicentres that, if not well managed, could easily become incubators for new epidemics. Another is zoonotic diseases, which can spread more rapidly, quickly elevating them to global levels of interest and concern (Nsubuga et al, 2010).
Next comes the means of transporting goods or parcels. The efficient and rapid movement of goods may also enhance the spread of diseases, since the goods may harbour and transport an existing disease strain to wherever they are delivered (Mack et al, 2010). Additionally, there is the modern practice of families frequently eating out, where they are more exposed to different infectious disease strains, among other exposures (Zhong et al, 2021). Suddenly, one nation's (seemingly localized) epidemic challenges quickly become the health concerns of other nations, regions and partners; pathogens do not respect geopolitical and human boundaries.
Additionally, in economic and industrial competitive terms, other factors could also come into play; for instance, the economic empowerment or disempowerment of the populations affected by a notifiable disease when skilled, experienced and knowledgeable working personnel are grossly affected by it (Kulldorff, 2001; Morse, 2001; Neiderud, 2015; Pillai et al, 2014). The push and pull factors for disease surveillance also touch on the socio-economic activities of a nation; disease outbreaks have been known to decimate the knowledgeable, skilled and able-bodied working populations of a nation to the point of economic near-standstill, if not total collapse (Roser, 2015).
Further, it has been observed that the progression or retrogression of the economic well-being of a community can now be greatly tied to proper disease outbreak management; if the adverse effects slow down economic activity, then all measures (including the improvement of the health infrastructure and the response and mitigation apparatus of a country) must be called upon to prevent or deal with the adversity of the disease outbreaks (Baker et al, 2002; Roser, 2015). To combat such disease strains, concerted efforts and clear-cut strategies need to be employed; the enhanced use of ICT software and tools has been seen as a great driver and catalyst for the quick aggregation, packaging and dissemination of disease data to the relevant personnel for easier, faster and better-informed interventions (Weinberg et al, 2003).
The disease outbreak data used here is subjected to AI's machine learning theory. Machine learning is a technique that provides systems with the ability to automatically learn and improve from experience (Neiderud, 2015; Roser, 2015). Whilst traditional disease outbreak management assumes the method of relying on past disease data that is seen to point towards what infectious disease strains manifest, this research looks to dig deeper. Using AI, the researcher hopes to drive a different perspective to notifiable disease outbreak management.
Of the two disease outbreak management perspectives outlined earlier, the researcher looks to build on the proactive disease outbreak measures. The main driving question or hypothesis here is whether a different approach could be employed to the processing and packaging of notifiable disease data in order to better inform and drive proactivity in the disease surveillance and response practice.

Method
The methodology used here employs various techniques: quantitative and qualitative research analysis blended with evolutionary and iterative prototyping. The C4.5 decision tree theory in artificial intelligence has been used in the diagnostic analysis efforts, with the computed information gains consequently becoming reliable determinants in informing the structure of the resultant binary decision tree(s).
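To make the entropy and information gain computation concrete, the following is a minimal Python sketch. It is not the authors' implementation. It assumes each record carries a binary indicator per symptom cluster and an outcome label, and the toy records, variable names and function names are all hypothetical. C4.5 as originally described selects splits by gain ratio; the sketch computes plain information gain because that is the quantity the text ranks.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(group) / total * entropy(group) for group in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy records: 1 means the symptom cluster was present in the case.
pain = [1, 1, 0, 0]
nasal = [1, 0, 1, 0]
outcome = ["outbreak", "outbreak", "no outbreak", "no outbreak"]

print("Pain gain:", information_gain(pain, outcome))
print("Nasal gain:", information_gain(nasal, outcome))
```

In this toy data the Pain indicator separates the two outcomes perfectly while the Nasal indicator carries no information about them, so Pain would rank above Nasal.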
Post validation, the algorithm could be further applied to the general notifiable disease list across many counties and regions to handle the variation of the disease outbreak footprints as an additional test measure of the algorithm's efficacy; it is expected that any challenges experienced in the process of the development of the algorithm will be used as a basis for future improvement and to inform policy development and assist in better planning efforts (Childs et al, 2007; Moncayo et al, 2009).
The eight symptom clusters adopted are: Bodily, Gastro-Intestinal, Muscular, Nasal, Pain, Respiratory, Skin and Other.
Once the information gains are computed, their rankings are used to determine the order of the symptom variable clusters in the construction of the binary decision tree, yielding a structure that helps to graphically and visually break down the notifiable disease outbreak data into a meaningful form to guide intervention and proactive action. The ranked symptom burden variables can assist the health personnel in easily mapping what disease symptom variable(s) to lay emphasis upon in their efforts to combat outbreaks.
This means that there is a deviation from the traditional practice of laying the focus upon the singular diseases themselves; the combating of disease outbreaks would mainly be driven by the symptom cluster variables, i.e. it is possible to focus on only those diseases that manifest the symptom cluster variables that the algorithm ranks highly using the computed information gains. Thus, the planning and mitigating measures would mainly be centred on the disease symptom variables, and not necessarily on the raw disease(s) themselves.
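The symptom-driven focus described above can be illustrated with a small data structure. This is a minimal Python sketch under stated assumptions: the disease names are placeholders, their cluster assignments are invented for illustration and are not taken from the study data, and `diseases_manifesting` is a hypothetical helper.

```python
from enum import Enum


class Cluster(Enum):
    """The eight symptom clusters named in the text."""
    BODILY = "Bodily"
    GASTRO_INTESTINAL = "Gastro-Intestinal"
    MUSCULAR = "Muscular"
    NASAL = "Nasal"
    PAIN = "Pain"
    RESPIRATORY = "Respiratory"
    SKIN = "Skin"
    OTHER = "Other"


# Hypothetical mapping of placeholder diseases to the clusters they manifest.
DISEASE_CLUSTERS = {
    "Disease A": {Cluster.GASTRO_INTESTINAL, Cluster.BODILY},
    "Disease B": {Cluster.RESPIRATORY, Cluster.NASAL},
    "Disease C": {Cluster.SKIN, Cluster.PAIN, Cluster.GASTRO_INTESTINAL},
}


def diseases_manifesting(cluster, mapping):
    """Return, in alphabetical order, the diseases that manifest the given cluster."""
    return sorted(name for name, clusters in mapping.items() if cluster in clusters)


# If the Gastro-Intestinal cluster ranks first, these diseases get first attention.
print(diseases_manifesting(Cluster.GASTRO_INTESTINAL, DISEASE_CLUSTERS))
```

Keeping the mapping separate from the ranking means the same disease list can be re-filtered whenever the ranking changes.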
Each notifiable disease data set follows the information gains ranking. For instance, if the Pain symptom variable ranks first in the information gains computation, then it will become the root node in the resultant binary decision tree. The rest of the symptom variables will follow accordingly. If two or more variables tie in the information gain ranks, then they shall jointly be part of a leaf node (or the root node, if they tie on rank one) as the decision tree gets defined and constructed.
[Figure: Algorithmic process flow of the various activities]
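The tie-handling rule can be sketched in a few lines of Python. This is an illustrative reading of the rule and does not reproduce the authors' process flow: the function name `rank_with_ties`, the tolerance parameter and the example gains are all assumptions. Variables whose gains are equal (within the tolerance) are placed in the same level, and the first level corresponds to the root node.

```python
def rank_with_ties(gains, tol=1e-9):
    """Group variables into levels by descending information gain.

    Variables whose gains differ by no more than tol share a level.
    Returns a list of levels; level 0 is the root.
    """
    # Sort by gain (largest first), breaking exact ties alphabetically for stable output.
    ordered = sorted(gains.items(), key=lambda item: (-item[1], item[0]))
    levels = []
    previous_gain = None
    for name, gain in ordered:
        if previous_gain is not None and abs(gain - previous_gain) <= tol:
            levels[-1].append(name)   # tie: join the current level
        else:
            levels.append([name])     # strictly lower gain: start a new level
        previous_gain = gain
    return levels


# Hypothetical gains in which Skin and Nasal tie.
example_gains = {"Pain": 0.9, "Skin": 0.5, "Nasal": 0.5, "Bodily": 0.2}
for depth, level in enumerate(rank_with_ties(example_gains)):
    print(depth, level)
```

Comparing with a tolerance rather than exact equality is a design choice for floating-point gains; setting `tol=0` restores strict equality.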

Results
The information gain scores tabulated here are derived from the data sets prepared from the primary data. The information gain scores are ranked from the largest to the smallest, with the highest information gain score pointing to the symptom variable(s) most critical in the decision tree construction, whilst the smallest information gain score points to the variable that is the least important as a binary decision tree determinant variable. The binary decision tree shall then be constructed according to the information gain rankings tabulated above.
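As a worked illustration, the sketch below takes the information gain scores reported in the Discussion and arranges them into a simple chain-shaped binary tree. The tree shape is an assumption, since the text does not specify it: each node asks whether a case shows a symptom cluster; "yes" ends at that cluster as the point of focus, and "no" descends to the next-ranked cluster. The names `Node`, `build_chain`, `ranked_clusters` and `first_priority` are hypothetical, and ties are not handled here.

```python
from dataclasses import dataclass
from typing import Optional

# Information gain scores as reported in the Discussion section.
REPORTED_GAINS = {
    "Gastro-Intestinal": 4.4366,
    "Bodily": 4.3496,
    "Pain": 3.4801,
    "Skin": 2.8691,
    "Respiratory": 2.3451,
    "Others": 2.2060,
    "Muscular": 1.0144,
    "Nasal": 0.7654,
}


@dataclass
class Node:
    cluster: str                        # symptom cluster tested at this node
    gain: float                         # its information gain score
    if_absent: Optional["Node"] = None  # next node when the cluster is absent


def build_chain(gains):
    """Link nodes so that the highest-gain cluster ends up as the root."""
    root = None
    # Iterate from the lowest gain upwards so each new node points at the previous one.
    for cluster, gain in sorted(gains.items(), key=lambda item: item[1]):
        root = Node(cluster, gain, if_absent=root)
    return root


def ranked_clusters(root):
    """Walk the chain from the root and list the clusters in rank order."""
    order = []
    node = root
    while node is not None:
        order.append(node.cluster)
        node = node.if_absent
    return order


def first_priority(root, clusters_present):
    """Return the highest-ranked cluster that a case manifests, or None."""
    node = root
    while node is not None:
        if node.cluster in clusters_present:
            return node.cluster
        node = node.if_absent
    return None


tree = build_chain(REPORTED_GAINS)
print(" -> ".join(ranked_clusters(tree)))
print(first_priority(tree, {"Nasal", "Pain"}))
```

Under this reading, a case showing both Nasal and Pain symptoms is handled under Pain first, because Pain has the higher reported score.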

Discussion
It is the prerogative of every nation to focus on the strengthening of its public health infrastructure to protect its citizens' health, especially in combating disease outbreaks (Baker et al, 2002). With such infrastructure in place, the disease outbreak factors can be dealt with more easily.
This research looks to present a good push in innovating new approaches and methodologies in the development of a proactive, early warning system in the response and intervention efforts to support some medium- and long-term mechanisms for the processing of the disease data with the focus on specific trends to inform policy development and planning, thereby boosting decision-making at the DSRU in collaboration with other concerned partners (Bernardo et al, 2013).
The algorithm demonstrates interesting results. Of the eight disease symptom burden cluster variables, the gastro-intestinal variable emerges as the most prominent, having registered an information gain score of 4.4366. It goes on to form the root node (the first node of a binary decision tree). It is closely followed by the bodily symptom variable, with an information gain score of 4.3496. The others follow in this order (based on the information gain scores): Pain (3.4801), Skin (2.8691), Respiratory (2.3451), Others (2.2060), Muscular (1.0144) and finally Nasal manifestations (0.7654). With the gastro-intestinal variable emerging as the most highly ranked variable, disease mitigation efforts and focus should be laid upon those diseases that manifest any gastro-intestinal symptoms. Once these have been exhaustively addressed, next will come those with bodily symptoms. Consideration should proceed from the most highly ranked symptom variable cluster to the lowest in order to objectively guide the diagnostic preparedness of an entity (be it a nation, a province, a county or any other geographical demarcation).
The proposed shift here means there is a deviation from the traditional diagnostic practice of focus and emphasis being laid upon diseases individually; instead, each disease is defined as a set of symptoms within the defined disease symptom clusters. The algorithm can then be applied: determine the information gains of each of the variables, then the rankings and consequent variable classification and, finally, construct the binary decision tree.
Learning is a multi-faceted occurrence, with learning processes involving the acquisition of new declarative knowledge, the development of motor and sensitive skills through instruction and practice, and the ordering of the new knowledge. Tools to drive such new knowledge include artificial intelligence branches such as machine learning, as utilized in this research (Michalski et al, 2013).
The essence of this algorithm is to derive a localized diagnostic framework that enables local medical personnel to easily manage disease outbreaks by predicting what disease symptom variables should be given priority in the fight against outbreaks. This approach assumes that in order to manage disease outbreaks on an ongoing basis, all the diseases should be classifiable within the eight symptom clusters. The algorithm then goes on to cluster the diseases based on their most critical symptom burdens. Emphasis is laid on the disease symptom variables, i.e. the disease(s) that manifest(s) a certain highly ranked symptom variable is given more prominence in the diagnostics and interventional process.
The algorithm's ranking of symptom variables is purely data-driven, i.e. as new data is posted, the symptom variables' information gains are expected to keep changing and assuming new ranks, thereby dynamically changing the order of importance of the symptom variables of focus. As such, the fight against disease outbreaks focuses not on the diseases, but on the symptoms that drive these diseases.
Of great importance is the management of disease outbreaks by providing an objective basis for crunching and aggregating the data in a novel and objective way to easily inform decision-making.

Conclusion
In conclusion, it has been demonstrated that the disease management efforts of an entity can be purely driven by the disease data presented, aided by the just defined and validated algorithm. This research study ends up creating a case for disease diagnostics mainly using symptom burden variables. Notably, a case for the machine learning driven algorithm has been presented together with its validation process. Additionally, the algorithm has been used in the computation of the information gains and their rankings.
Finally, the just defined, computed and ranked information gains have been shown to form a basis for the definition and construction of the binary decision tree.
In the end, the algorithm has been designed, constructed and validated. The whole process enables the disease outbreak management exercise of any local authority to be home-grown, i.e. the basis of the disease outbreak management can be guided and driven by the local disease data being captured and continuously crunched to keep the disease diagnostics exercise as fluid and as objective as the data that drives it.

Declarations

Ethics Approval and Consent to Participate
This research and its publication have been cleared by both participating entities, Strathmore University (SU) and the Kenya Medical Research Institute (KEMRI).

Consent for Publication
The researcher has granted consent for this publication to be undertaken in the cover letter. He agrees to be bound by the journal's rules and regulations as the case may be.

Availability of Data and Materials
The data sets used in this research have been provided as part of the uploaded files.

Competing Interests
The researcher wishes to declare that there are no known competing interests to this research article.

Funding
This research has been undertaken with no funding whatsoever.

Authors' Contributions
IAL undertook the study, critique and review of the AI techniques and principles that were eventually used in the research study. Finally, he helped get the ethical review clearance from Strathmore University for this study to be carried out.
MM assisted in the study, classification and documentation of notifiable diseases and their breakdown into the symptom cluster variables that were eventually used in the study. He was also very instrumental in helping this research team with the engagement with health personnel from KEMRI as well as getting clearance for us to access the data used in the study.
NNM undertook the study, led discussions and designed the multidisciplinary approach that brought together medical scholars and practitioners (like Prof. Matilu) and ICT and computer science scholars and practitioners (like Prof. Ateya) to undertake this research. He also assisted in the design of the research and the processes that informed the successful completion of this study. He was also responsible for the design and implementation of the data models that helped in sifting through the data that informed the eventual results presented in this study. Finally, he was also responsible for the design, implementation and validation of the algorithm as well as documentation of the processes and outcomes that informed this research writeup.