Systematic human learning by literature and data mining for feature selection in machine learning

We proposed a learning algorithm for humans to conduct literature and data mining for causal factor discovery. It is applicable to selecting features for a machine learning prediction model, including but not limited to models using real-world, time-varying data from electronic health records. This protocol is relatively quick at finding potentially actionable predictors for a clinical prediction while dealing with high dimensionality in big data. However, it might not find a potentially novel cause, since it only exhaustively examines the existing evidence within a single study. The key stages consist of systematic human learning, causal diagram construction, data preprocessing, causal inference modeling, and development and validation of a prediction model to describe the explainability.


Introduction
Insurance-based healthcare has been widely implemented worldwide, which urges more preventive interventions to improve patient outcomes and reduce healthcare utilization. Clinical prediction models are needed to achieve this purpose. Although several machine learning algorithms have shown satisfactory predictive performance for health outcomes, [1][2][3] these were exposed to optimistic bias due to no independent test set, no data partition, or high dimensionality relative to sample size. [4] Clinical prediction should also be actionable. Identifying causal effects on a predicted health outcome enables targeted interventions for that condition. However, machine learning cannot infer causality yet.
The best practices established in medicine are developed using hypothetico-deductive reasoning. Prior knowledge is collected by human learning through the literature to generate a hypothesis. Subsequently, statistical methods are used to verify the assumption using available data. In the traditional approach, a randomized controlled trial (RCT) design is used to collect data in a causal inference study because of its robustness in removing effects from common causes or confounding factors. However, this design is limited by some considerations (e.g., ethical issues). Even if an RCT is possible, a preliminary study of a causal effect using observational data is still warranted to reduce resource waste and potential harm in human research. To conduct such a study, a causal diagram is constructed based on prior knowledge as a central assumption. [5] Available data (e.g., electronic health records) may be used to verify this assumption. Solely using statistical analysis on available data without contextual knowledge can introduce data-driven bias. [6] Meanwhile, although human learning can use contextual knowledge to prevent such bias, we still need machines to deal with big data.
To solve data-driven bias using contextual knowledge, we proposed a learning algorithm for humans to systematically construct a causal diagram by literature mining. Then, one of the generalized (G) methods, i.e., inverse probability weighting (IPW), is used to verify each causal factor using available data. [7,8] This method was designed for time-varying exposures, which are typical of the data available from electronic health records. [9,10] Eventually, all causal factors are included in a prediction model to describe the explainability. A model that uses causal factors is not necessarily the most predictive. Better prediction performance is normally achieved by exploiting confounding factors. [5] However, differentiating causal from non-causal predictors can warn a human user when conducting a critical appraisal of a machine learning model. If a causal effect is large, this also provides a pathway for a preventive intervention. In addition, compared to systematic review and meta-analysis, our learning algorithm is relatively quick and less labor-intensive, and it follows human intuition in learning through the literature.
We applied this protocol to several studies that were parts of a project applying our human-machine learning algorithm to a variety of predicted outcomes. Human learning by this protocol was one of the comparators, besides those applying standard machine learning prediction following the PROBAST guidelines. [11] Ethical clearance was waived by the Taipei Medical University Joint Institutional Review Board (TMU-JIRB number: N202106025). This protocol aimed to propose a procedure for feature selection in statistical machine learning using an algorithm for humans to systematically construct a causal diagram by literature mining, and to verify the causal assumptions from prior knowledge by statistical modeling, including but not limited to modeling that uses real-world, time-varying data from electronic health records.

Reagents

Equipment
We used the R 4.0.2 programming language (R Foundation, Vienna, Austria) to conduct data analysis. The integrated development environment software was RStudio 1.3.959 (RStudio PBC, Boston, MA, USA). To ensure reproducibility, we used Bioconductor 3.11; [12] thus, versions of the included R packages were all in sync with this Bioconductor version. For statistical machine learning, we used the R package caret 6.0.86, which wraps R packages for modeling algorithms; the modeling package was glmnet 4.1. We created R packages for many steps in the data analysis, which are medhist 0.1.0 and gmethods 0.1.0. All of these packages are available for download from this repository: https://github.com/herdiantrisufriyana. Details on other R package versions and all of the source codes (vignette) for the data analysis are available at https://github.com/herdiantrisufriyana/shl.
To reproduce our work, a set of hardware requirements may need to be met. We used a single machine equipped with a 3.40 GHz central processing unit (CPU) with 8 logical processors (Core(TM) i7-4770, Intel®, Santa Clara, CA, USA) and 16 GB RAM. However, one can use a machine with only 4 logical processors and 4 GB RAM if the sample size is smaller than that of the dataset we used in this protocol.

Procedure
1. Choose one or more literature databases
Systematic human learning was conducted by literature mining over a particular period. Our assumptions of causality were drawn from this learning. For simplicity and to avoid redundant records, we only used PubMed because it is the most frequently updated (daily), has the longest period coverage (1950 to the present), and is a life science-focused literature database. [13] This database also allows the use of specific terms in the Medical Subject Headings (MeSH) vocabulary thesaurus from the National Library of Medicine, National Institutes of Health (Bethesda, MD, USA).

2. Look for a document from an authoritative institution
We adopted a snowball sampling method, starting with convenience sampling of a document from an authoritative institution, [14,15] to mimic human intuition when learning through the literature. We used the keywords '"Fetal Membranes, Premature Rupture"[Mesh]' to find the document for an outcome of prelabor rupture of membranes (PROM) in the literature database, as the convenience sampling step. This led to Practice Bulletin No. 172 from the American College of Obstetricians and Gynecologists (ACOG). [16] We only considered studies whose population was pregnant women. The initial document was denoted d_0 (Algorithm 1). We denoted causal factors of PROM as A, while the confounders were denoted L. Confounders are causal factors of a causal factor of PROM. This means L represents the factors that cause both A and PROM. Initially, there was no A or L. By reading the article/document d_0, we identified a ∈ A to determine the k_0 keywords that refer to a at stage s=0. The next steps were iterated until no k_s keywords referred to any a ∈ A.

Search for causal factors for each causal factor of the outcome from either the initial document or the subsequent documents
We assigned k_s to a_s and searched for the document d_s using k_s for causal factors of a_s. If a document was found, then we continued; otherwise, the iteration ended. We continued by reading d_s to determine the k_{s+1} keywords. These refer to a causal factor of the a ∈ A that is referred to by the previous k_s keywords.

Identify whether the causal factors from previous step are also causal factors of the outcome
Documents were searched and read to check whether the k_{s+1} keyword also refers to a causal factor of PROM. If yes, then the k_{s+1} keyword was passed to stage s+1; otherwise, we assigned k_{s+1} to l_s and the iteration ended.
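The iteration described in the steps above can be sketched in code. This is a minimal Python illustration, not the authors' Algorithm 1: the human reading of documents is replaced by two hypothetical lookup tables (`REPORTED_CAUSES` and `CAUSES_OF_OUTCOME`), and the factor relationships in them are illustrative assumptions rather than extracted evidence.

```python
from collections import deque

# Hypothetical stand-in for reading documents d_s: maps a factor to the
# causes that the literature reports for it (illustrative values only).
REPORTED_CAUSES = {
    "PROM": ["asthma", "influenza"],
    "asthma": ["influenza", "varicella"],
    "influenza": [],
    "varicella": [],
}
# Hypothetical stand-in for checking whether a keyword is a cause of the outcome.
CAUSES_OF_OUTCOME = {"asthma", "influenza"}


def snowball(outcome, reported_causes, causes_of_outcome):
    """Return (A, L, edges): first-level factors, second-level factors, edges."""
    A = set(reported_causes.get(outcome, []))   # factors found in d_0
    L = set()
    edges = {(a, outcome) for a in A}
    queue = deque(sorted(A))                    # keywords k_s still to search
    searched = set()
    while queue:
        a_s = queue.popleft()
        if a_s in searched:
            continue                            # avoid searching a keyword twice
        searched.add(a_s)
        for k_next in reported_causes.get(a_s, []):   # keywords k_{s+1} from d_s
            edges.add((k_next, a_s))
            if k_next in causes_of_outcome:
                A.add(k_next)                   # also a cause of the outcome
                queue.append(k_next)            # passed to stage s+1
            else:
                L.add(k_next)                   # assigned to l_s; this branch ends
    return A, L, edges


A, L, edges = snowball("PROM", REPORTED_CAUSES, CAUSES_OF_OUTCOME)
print(sorted(A), sorted(L))
```

The queue plays the role of the keywords waiting to be searched, and the `searched` set guarantees that the iteration ends even if the literature reports circular relationships.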

Construct a causal diagram for each proposed causal factor of the outcome
Factors in A and L are called first- and second-level factors of PROM, respectively; only first-level ones are causal factors. This determined the position of factors within a circular network depicting a causal diagram, which we used for causal inferences. Since first-level factors may emerge from second-level factors in the process, we could also find relationships between causal factors. We included these relationships as edges in the network because they are needed to construct the causal inference formulas. For each causal factor and each of its common causes with available data, we drew an additional node and an edge from the variable's node to this additional node. The additional node represented the measured variable. We then drew another node with an edge pointing to the measured-variable node. This node represented unmeasured variables that can affect the measurement error of the measured variable.
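As a rough illustration of the diagram construction described above, the following Python sketch stores the causal diagram as a set of directed edges and adds the measurement nodes. The function names, the `*` suffix for measured variables, and the `U_` prefix for unmeasured error sources are hypothetical conventions chosen for this sketch; the edges are illustrative, and the authors' own implementation is the R vignette.

```python
def build_diagram(causal_edges, measured):
    """Add measurement nodes to a causal diagram given as (cause, effect) pairs.

    For each variable with available data, add a node for the measured
    variable (suffix '*') and a node for unmeasured sources of measurement
    error (prefix 'U_'), both pointing into the measured-variable node.
    """
    edges = set(causal_edges)
    for v in sorted(measured):
        edges.add((v, v + "*"))          # true variable -> measured variable
        edges.add(("U_" + v, v + "*"))   # unmeasured error source -> measured variable
    return edges


def parents(edges, node):
    """Direct causes of a node."""
    return sorted(cause for cause, effect in edges if effect == node)


def factor_levels(edges, outcome):
    """First-level factors cause the outcome; second-level ones cause only those."""
    first = parents(edges, outcome)
    second = sorted({c for f in first for c in parents(edges, f)} - set(first))
    return first, second


# Illustrative edges only (assumed for the sketch).
causal_edges = [
    ("asthma", "PROM"),
    ("influenza", "PROM"),
    ("influenza", "asthma"),
    ("varicella", "asthma"),
]
diagram = build_diagram(causal_edges, measured={"asthma", "influenza"})
first_level, second_level = factor_levels(diagram, "PROM")
print(first_level, second_level)
print(parents(diagram, "asthma*"))
```

Keeping the diagram as a plain edge set makes the inter-factor relationships (here, influenza to asthma) directly available when the causal inference formulas are defined later.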
Further explanation of constructing a causal diagram is available in this reference. [5] The source codes (vignette) for this step are available at https://github.com/herdiantrisufriyana/shl.

7. Split a dataset randomly into a discovery set and a validation set, and define variables in the dataset that can represent each causal factor and the common causes
Only the discovery set was used for causal inferences. Later, we also used it as the training set of a prediction model. We represented demographics and medical histories as candidate causal factors where applicable. These were binarized into 0 and 1 for negative and positive factors, respectively. Details of the ICD-10 codes and demographic variables we assigned to each causal factor are available in the source codes. We provide an R package, medhist 0.1.0, consisting of functions to extract, preprocess, and transform data from nationwide health insurance claims into each causal factor.
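The split and the binarization can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the medhist package: the function names are hypothetical, the 80/20 split fraction and the seed are arbitrary choices not given in the protocol, and the ICD-10 prefixes (J45 for asthma, J10 and J11 for influenza) are illustrative rather than the code lists the authors used.

```python
import random

# Illustrative ICD-10 prefixes only; the authors' actual code lists are in
# their source codes.
FACTOR_CODES = {
    "asthma": ("J45",),
    "influenza": ("J10", "J11"),
}


def binarize(history, factor_codes):
    """Map one subject's ICD-10 code history to 0/1 indicators per factor."""
    return {
        factor: int(any(code.startswith(prefixes) for code in history))
        for factor, prefixes in factor_codes.items()
    }


def split(ids, discovery_fraction=0.8, seed=1):
    """Randomly split subject IDs into discovery and validation sets.

    The fraction and seed are assumptions made for this sketch.
    """
    rng = random.Random(seed)        # fixed seed for a reproducible split
    shuffled = list(ids)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * discovery_fraction)
    return shuffled[:cut], shuffled[cut:]


discovery_ids, validation_ids = split(range(10))
print(len(discovery_ids), len(validation_ids))
print(binarize(["J45.9", "K21.0"], FACTOR_CODES))
```

Splitting by subject ID before any modeling keeps the validation set untouched by the causal inference step.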

8. Define a causal inference formula for each proposed causal factor of the outcome based on the causal diagram and available data
Only first-level factors with available data were included in the formulas. For example, both asthma and influenza are first-level factors of PROM, while varicella is a second-level factor of PROM via asthma. To determine the formula for the causal inference of asthma, we included only asthma and influenza. We used only the significance of asthma to determine whether asthma was a causal factor of PROM. Only the causal factor of interest and its confounding factors or common causes were included in the causal formula. We avoided including common effects, to prevent collider-stratification bias, and avoided unnecessary inclusion of second-level factors. [5]

9. Conduct causal inference modeling by a generalized (G) method
To verify our assumptions of PROM causality, we applied one of the generalized (G) methods, i.e., IPW, for each causal factor. [7,8] This method was designed for time-varying exposures. [9,10] However, we also conducted outcome regressions for causal inferences, since this is one of the more commonly used methods, although it does not work in general. [5,17] Another common method is propensity-score matching, which has various versions, but we did not apply it, for simplicity. While adjusting for all confounding factors is difficult, if not impossible, we disclosed open backdoors (confounding factors that were not blocked) that resulted from limited data availability for each causal factor. This helps in interpreting the results of the study with caution. [18,19] An R package, gmethods 0.1.0, is provided for causal inference modeling using G-methods.
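To make the weighting idea concrete, here is a minimal Python sketch of IPW. It is a deliberate simplification and not the gmethods package: it handles a single binary exposure at one time point with one binary confounder (not time-varying exposures), estimates the propensity from stratum frequencies instead of a fitted model, and uses made-up counts. All function names are hypothetical.

```python
from collections import defaultdict


def make_rows(l, a, n, n_events):
    """Build n records (confounder L, exposure A, outcome Y) with n_events outcomes."""
    return [(l, a, 1)] * n_events + [(l, a, 0)] * (n - n_events)


# Made-up counts: the confounder raises both exposure and outcome risk.
rows = (
    make_rows(0, 1, 20, 4) + make_rows(0, 0, 80, 8)
    + make_rows(1, 1, 80, 40) + make_rows(1, 0, 20, 8)
)


def crude_risk_difference(rows):
    """Unadjusted risk difference between exposed and unexposed."""
    risk = {}
    for a in (0, 1):
        ys = [y for _, a_i, y in rows if a_i == a]
        risk[a] = sum(ys) / len(ys)
    return risk[1] - risk[0]


def ipw_risk_difference(rows):
    """Risk difference weighted by 1 / P(A = a | L = l), with normalized weights.

    Assumes positivity: both exposure levels occur in every confounder stratum.
    """
    n = defaultdict(int)
    n_exposed = defaultdict(int)
    for l, a, _ in rows:
        n[l] += 1
        n_exposed[l] += a
    num = {0: 0.0, 1: 0.0}
    den = {0: 0.0, 1: 0.0}
    for l, a, y in rows:
        p_exposed = n_exposed[l] / n[l]                  # propensity in stratum l
        w = 1.0 / (p_exposed if a == 1 else 1.0 - p_exposed)
        num[a] += w * y
        den[a] += w
    return num[1] / den[1] - num[0] / den[0]


print(round(crude_risk_difference(rows), 3))
print(round(ipw_risk_difference(rows), 3))
```

In this toy dataset the risk difference within each confounder stratum is 0.10, which the weighted estimate recovers, while the crude estimate is about 0.28 because the confounder is left open.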

10. Develop and validate a prediction model to describe the explainability
After verifying causal factors, we included only those in a prediction model that applied logistic regression with a shrinkage method, as recommended by PROBAST, instead of a stepwise selection method. [11] We chose ridge regression (RR), which applies L2-norm or beta regularization, because this method retains all causal factors within the model after the weights are updated by training. [20] The model was evaluated by the area under the receiver operating characteristic curve to determine the predictive performance using all confirmed causal factors.
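The modeling step can be illustrated with a small, self-contained Python sketch of L2-penalized logistic regression and the area under the receiver operating characteristic curve. It is not the caret/glmnet workflow used in the protocol: the model is fitted by plain gradient descent, the penalty `lam`, learning rate, and iteration count are arbitrary illustrative values, the toy data are made up, and the evaluation reuses the training data only for brevity (a real validation would use the validation set).

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_ridge_logistic(X, y, lam=0.1, lr=0.5, n_iter=2000):
    """Logistic regression with an L2 penalty on the slopes (intercept unpenalized)."""
    n, p = len(X), len(X[0])
    b0, b = 0.0, [0.0] * p
    for _ in range(n_iter):
        g0, g = 0.0, [0.0] * p
        for xi, yi in zip(X, y):
            err = sigmoid(b0 + sum(bj * xj for bj, xj in zip(b, xi))) - yi
            g0 += err
            for j in range(p):
                g[j] += err * xi[j]
        b0 -= lr * g0 / n
        # The penalty gradient lam * bj shrinks every slope toward zero.
        b = [bj - lr * (gj / n + lam * bj) for bj, gj in zip(b, g)]
    return b0, b


def predict(b0, b, X):
    return [sigmoid(b0 + sum(bj * xj for bj, xj in zip(b, xi))) for xi in X]


def auroc(scores, labels):
    """Probability that a positive case scores higher than a negative (ties count 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Made-up binary causal factors (columns) and outcome.
X = [[1, 0], [1, 1], [0, 1], [0, 0], [1, 1], [0, 0], [1, 0], [0, 1]]
y = [1, 1, 0, 0, 1, 0, 1, 1]

b0, b = fit_ridge_logistic(X, y)
auc = auroc(predict(b0, b, X), y)
print([round(bj, 3) for bj in b], round(auc, 3))
```

Because the L2 penalty shrinks coefficients without setting them exactly to zero, every factor passed to the model stays in it, which is the property the protocol relies on.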

Anticipated Results
Several causal diagrams are expected to be constructed for an outcome, beginning from outcome-related guidelines by an authoritative institution. A study may indicate more than one relationship between a causal factor and either another causal factor or the outcome. Some covariates in a causal diagram may not have available data, but the diagram is still shown in order to disclose potential backdoors or confounding factors that are not controlled. The outcome regression may show larger effects than those estimated by IPW. Since the latter, which is a G-method, can estimate a causal effect even when the causal model is misspecified, [17] the confounding effects are well removed, resulting in a smaller estimated effect of a causal factor.

Troubleshooting
A. Step 1
Problem: No exact MeSH term for the outcome
Possible reason: The outcome term may be one of the entry terms.
Solution: Choose either the main or entry term. Browse all MeSH categories within the page of the most similar term to find an alternative term.

B. Step 2
Problem: No document from an authoritative institution
Possible reason: The outcome may be either a novel condition or a multidisciplinary problem.
Solution: [...]

E. Step 10
Problem E.1: No confirmed causal factors for a prediction model
Possible reason E.1.1: Diagnosis or procedure codes may not represent a causal factor.
Solution E.1.1: Consider other codes that may represent a causal factor.
Possible reason E.1.2: There is no relevant variable in a dataset.
Solution E.1.2: Consider obtaining another secondary dataset or collecting primary data.

[...] time: 20 minutes to 2 hours per causal factor
Step 9 Approximate time: 1 to 20 minutes per causal factor
Step 10 Approximate time: 1 to 20 minutes per causal factor