Naive Bayesian Machine Learning to Diagnose Breast Cancer

A novel MLAC (Machine Learning Against Cancer) method to distinguish between cancerous and noncancerous RNA genomic data is developed and tested, reaching 100% accuracy on all healthy and cancerous breast tissue samples. A Naive Bayesian ML (Machine Learning) system is trained using high-level WES (Whole Exome Sequencing) data, i.e. normalized quantification of RNAs, obtained from the WES files of 1091 breast cancer samples from TCGA (The Cancer Genome Atlas) and of 179 healthy samples from the GTEx (Genotype-Tissue Expression) project. We show that both the sensitivity and the specificity of the method in classifying cancerous and noncancerous cells are 100%.

childbearing and breastfeeding have a protective effect. [2] [3] In diagnosis and cancer identification, histological examination is used, which is a slow process that needs technical experts and suffers from large variation among observers. In recent years, thanks to high-throughput Omics technologies, we no longer lack data but need novel methods and techniques to handle and analyze them; thus bioinformatics and computers have found solid ground to contribute to the Life Sciences. One of the most applicable ways to benefit from Computer Science in Physiology and Medicine is to use AI (Artificial Intelligence) and ML to extract knowledge from the Big Data generated by Omics technologies. [4] In this work, we have developed and trained a new ML-based system using general new-generation RNA-Seq data that can detect breast cancer even in very early stages, and hence could decrease the risk of mortality through early treatment.
ML is rapidly establishing its position in the medical and pharmaceutical sciences. Different ML models have been tested in the last few decades and have returned great results in different fields of medicine, including but not limited to cancer identification. [5] [6] Naive Bayes, Support Vector Machines, Random Forest, Logistic Regression, K Nearest Neighbors and Neural Networks (NN) are examples of general supervised ML algorithms that have reportedly been successful in different medical and pharmaceutical projects. [6] In this work we present a novel approach to applying ML for cancer detection that is effective and robust. Using our method, cancerous tissue can be identified easily at any stage, which provides an opportunity to control it in time. This approach also offers a new direction for disease diagnosis while providing a new method to predict traits based on genomic information.

Method
In this project, we applied the Naive Bayes algorithm from scikit-learn to 1270 samples from The Cancer Genome Atlas (TCGA) research network and the Genotype-Tissue Expression (GTEx) project portal, feeding the genome data directly to the machine and leaving the heavy statistical calculations on our high-dimensional data to it. The different parts of the method are clarified below.
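The workflow described here can be sketched as follows, assuming NumPy and scikit-learn are installed. The expression matrix is synthetic (random numbers standing in for normalized RNA quantities), and the sample counts, the 70/30 hold-out split and all variable names are illustrative assumptions, not the data or the evaluation protocol of this work.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the expression matrix: rows are samples, columns are genes.
# The study itself reports 1091 cancerous and 179 healthy samples with 19627 genes;
# smaller, made-up sizes are used here so the sketch runs quickly.
rng = np.random.default_rng(0)
n_cancer, n_healthy, n_genes = 120, 30, 500
X = np.vstack([
    rng.normal(loc=1.0, scale=1.0, size=(n_cancer, n_genes)),   # "cancerous"
    rng.normal(loc=0.0, scale=1.0, size=(n_healthy, n_genes)),  # "healthy"
])
y = np.array([1] * n_cancer + [0] * n_healthy)

# Hypothetical hold-out split; stratify keeps the class imbalance in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Default settings written out explicitly: priors=None, var_smoothing=1e-9.
clf = GaussianNB(priors=None, var_smoothing=1e-9)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
accuracy = float((y_pred == y_test).mean())
print(f"held-out accuracy on synthetic data: {accuracy:.3f}")
```

All features are passed to the classifier at once, without any feature selection step, which mirrors the approach taken in this paper.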

Bayes' theorem
Bayes' theorem was proposed by the English statistician Thomas Bayes and published in 1763, reportedly in an attempt to prove the existence of God by means of statistical inference. [7] Bayesian statistics are used in estimates based on anticipated subjective knowledge. Implementations of the theorem therefore adapt with use, and they allow data from two or more different sources to be fused and expressed in terms of likelihood. The Naive Bayesian Classifier is an implementation of Bayes' theorem with an additional simplifying hypothesis: independence between the predictor variables. The word "Naive" is added to the name because the classifier assumes that the features of a class or object are not related to each other, i.e. the presence of a particular feature is not related to the presence or absence of another. In this way each feature contributes independently to the probability of a given class. In return, Bayes classifiers can be trained easily, require little data to train, and can classify big data quickly. Although naive Bayes classifiers are remarkably simple, they have worked quite well in many real-world situations, including our cancerous/healthy tissue classification, which required a small amount of training data and was fast and accurate, as reflected in the Results section. On the flip side, although naive Bayes is known as a decent classifier, it is known to be a bad estimator, in the sense that one cannot rely on its parameters for extraction of feature importance. [9]

Model function
More formally, as shown by equations 1-6, Bayesian classifiers are probabilistic classifiers using Bayes' rule, i.e.
$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)} \tag{1}$$
For example, A can be the event that a patient has cancer and B a positive cancer test result: the posterior probability of cancer given a positive test, P(A | B), is proportional to the product of the prior probability of cancer, P(A), and the sensitivity, P(B | A), i.e. the chance of a positive result given cancer. Indeed, a naive Bayesian classifier performs statistical inference based on maximum likelihood estimation, i.e. it sets the parameters of the probability distribution so as to maximize the goodness of fit of the statistical model to the training data via the joint probability distribution of the training samples.
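The cancer-test example above can be made concrete with a few lines of Python. The prior, sensitivity and specificity below are hypothetical values chosen only for illustration, and the function name is an assumption of this sketch.

```python
def posterior_given_positive(prior, sensitivity, specificity):
    """P(cancer | positive test) computed with Bayes' rule."""
    # P(positive) sums over both ways of getting a positive result:
    # true positives among cancer cases and false positives among the rest.
    p_positive = sensitivity * prior + (1.0 - specificity) * (1.0 - prior)
    return sensitivity * prior / p_positive


# Hypothetical values, not estimates from this study.
posterior = posterior_given_positive(prior=0.01, sensitivity=0.90, specificity=0.95)
print(f"P(cancer | positive) = {posterior:.3f}")
```

With these made-up numbers the posterior is roughly 15%, far below the sensitivity, because the prior is small and false positives from the large healthy group dominate the denominator.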
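In technical terms, the likelihood function describes a hypersurface whose peak, if it exists, is the combination of model parameter values and coefficients that maximizes the probability of drawing the obtained sample. [8] In its more general form, according to the scikit-learn documentation, Bayes' theorem states the following relationship, given class variable $y$ and dependent feature vector $x_1$ through $x_n$:
$$P(y \mid x_1, \ldots, x_n) = \frac{P(y)\, P(x_1, \ldots, x_n \mid y)}{P(x_1, \ldots, x_n)} \tag{2}$$
Using the naive conditional independence assumption that
$$P(x_i \mid y, x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n) = P(x_i \mid y) \tag{3}$$
for all $i$, this relationship is simplified to
$$P(y \mid x_1, \ldots, x_n) = \frac{P(y) \prod_{i=1}^{n} P(x_i \mid y)}{P(x_1, \ldots, x_n)} \tag{4}$$
Since $P(x_1, x_2, \ldots, x_n)$ is constant given the input, we can use the following classification rule:
$$P(y \mid x_1, \ldots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y) \;\;\Rightarrow\;\; \hat{y} = \arg\max_{y} P(y) \prod_{i=1}^{n} P(x_i \mid y) \tag{5}$$
and we can use Maximum A Posteriori (MAP) estimation to estimate $P(y)$ and $P(x_i \mid y)$; the former is then the relative frequency of class $y$ in the training set. The different naive Bayes classifiers differ mainly by the assumptions they make regarding the distribution of $P(x_i \mid y)$. [9] GaussianNB implements the Gaussian Naive Bayes algorithm for classification. The likelihood of the features is assumed to be Gaussian:
$$P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^2}} \exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma_y^2}\right) \tag{6}$$
where the parameters $\sigma_y$ and $\mu_y$ are estimated using maximum likelihood.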
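To illustrate the classification rule and the Gaussian likelihood, here is a minimal from-scratch sketch in plain Python. It is not the scikit-learn implementation: the function names, the toy data and the `var_floor` constant are assumptions made for this example, and the product of probabilities is evaluated as a sum of logarithms for numerical stability.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y, var_floor=1e-9):
    """Estimate the class prior and per-feature mean/variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # Maximum-likelihood variance (divide by n); var_floor is a hypothetical
        # small constant that keeps every variance strictly positive.
        variances = [
            sum((v - m) ** 2 for v in col) / n + var_floor
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(X), means, variances)
    return model


def predict_gaussian_nb(model, x):
    """Return the class maximizing log P(y) + sum_i log P(x_i | y)."""
    def log_posterior(label):
        prior, means, variances = model[label]
        total = math.log(prior)
        for xi, m, var in zip(x, means, variances):
            total += -0.5 * math.log(2.0 * math.pi * var) - (xi - m) ** 2 / (2.0 * var)
        return total
    return max(model, key=log_posterior)


# Toy data with two features; class 1 has larger values than class 0.
X = [[1.0, 2.1], [0.9, 1.9], [1.2, 2.0], [3.0, 4.1], [3.2, 3.9], [2.9, 4.0]]
y = [0, 0, 0, 1, 1, 1]
model = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(model, [1.1, 2.0]), predict_gaussian_nb(model, [3.1, 4.0]))
```

The denominator of Bayes' theorem is never computed, since it is the same for every class and does not change which class attains the maximum.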

Feature selection
In supervised ML classification methods, as well as in the unsupervised K-Means clustering algorithm, each input data point is viewed as a p-dimensional vector (an array or ordered list of p numbers). The classifiers then rely on more or less similar criteria: a Bayesian classifier looks for a hypersurface that maximizes the likelihood of drawing the sample, while an SVM looks for a hyperplane that optimally separates the points of one class from those of the other, possibly after the points have been projected to a higher-dimensional space. There are misconceptions in the ML community that have prevented potential achievements, and we obtained great results by disregarding such supposed red lines. One of them concerns the number of features, for instance the claim that "it is obviously impractical to select all of the point mutations as dimensions for the model because mass dimensions will increase the computation cost." As a result, researchers usually try to relieve the assumed learning pressure that highly redundant dimensions place on the machine by selecting a subset of features, i.e. genes, to reduce the number of features and dimensions. [10][11][12] A strength of our work is that we treat ML as a powerful, advanced statistical tool that performs heavy statistical analyses that people cannot do themselves. Accordingly, we gave all the data corresponding to the WES as feature inputs to the ML system at once, and it returned almost perfect results quickly and precisely. We thought of the 19627 different genes not as too many features but as the pixels of a photo smaller than 141*141 pixels; analyzing such a low-resolution image is a very light task for the machine, and it took only seconds to classify the cancerous and noncancerous cells with 100% precision.
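The image analogy rests on simple arithmetic, which can be checked with a short sketch; the variable names are illustrative.

```python
import math

n_genes = 19627                      # number of gene features reported in the text
side = math.isqrt(n_genes - 1) + 1   # smallest integer side with side * side >= n_genes
padding = side * side - n_genes      # unused "pixels" if the genes fill a square grid

print(side, side * side, padding)    # prints: 141 19881 254
```

A 141*141 grid therefore has room for all 19627 values with 254 cells to spare, which is why the feature vector is described as smaller than a 141*141-pixel photo.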

Model optimization and settings
We used the Gaussian Naive Bayes classifier from scikit-learn 0.23.1 with its default settings, i.e. priors equal to None and var_smoothing equal to 1e-9, where var_smoothing is the portion of the largest variance of all features that is added to the variances for calculation stability. As noted above, naive Bayes classifiers require only a small amount of training data and can be extremely fast compared to other ML classifiers, but naive Bayes is known to be a bad estimator, so one cannot rely on its parameters for extraction of feature importance. [9]

Model evaluation
Model evaluation produces measures to approximate a classifier's reliability.
To distinguish between cancerous and noncancerous cells, which is a binary classification, we use accuracy, precision, specificity, sensitivity, f1 score, several averaging techniques and the ROC curve to evaluate the model. Specifically, we use the scikit-learn metrics classification report, which returns precision, recall and f1 score for each of the two classes. In binary classification, the recall of the positive class is called "sensitivity" and the recall of the negative class is called "specificity".
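As a small illustration of how sensitivity and specificity are the recalls of the positive and negative class, the following plain-Python sketch counts the confusion-matrix entries from two label lists. The labels are hypothetical (1 = cancerous, 0 = healthy) and are not results from this study; the helper name is an assumption.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fn = fp = tn = 0
    for t, p in zip(y_true, y_pred):
        if t == positive and p == positive:
            tp += 1
        elif t == positive:
            fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn


# Hypothetical labels, with one missed cancer case and one false alarm.
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0, 0, 0, 0, 1]

tp, fn, fp, tn = confusion_counts(y_true, y_pred)
sensitivity = tp / (tp + fn)   # recall of the positive class
specificity = tn / (tn + fp)   # recall of the negative class
precision = tp / (tp + fp)
accuracy = (tp + tn) / len(y_true)
f1 = 2 * precision * sensitivity / (precision + sensitivity)
print(tp, fn, fp, tn)
print(f"sensitivity={sensitivity:.3f} specificity={specificity:.3f} "
      f"precision={precision:.3f} accuracy={accuracy:.3f} f1={f1:.3f}")
```

The sketch does not guard against empty classes; with no positive or no negative samples some of the ratios would divide by zero.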
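In what follows, the terms derived from the confusion matrix, such as accuracy, specificity, sensitivity and f1 score, are given for review and comparison. Here TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives, and P = TP + FN and N = TN + FP are the numbers of actual positive and actual negative samples.
Sensitivity, recall or true positive rate (TPR):
$$\mathrm{TPR} = \frac{TP}{P} = \frac{TP}{TP + FN} = 1 - \mathrm{FNR}$$
Specificity, selectivity or true negative rate (TNR):
$$\mathrm{TNR} = \frac{TN}{N} = \frac{TN}{TN + FP} = 1 - \mathrm{FPR}$$
Precision or positive predictive value (PPV) is the ratio of the samples correctly labeled as positive by our program to all samples it labeled as positive:
$$\mathrm{PPV} = \frac{TP}{TP + FP} = 1 - \mathrm{FDR}$$
Precision can be calculated only for the positive class, i.e. class 1, which indicates cancer, or it can be evaluated for each of the two classes independently, treating each class in turn as the positive class; the latter is what the scikit-learn metrics classification report does, as shown in table 1.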
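Negative predictive value (NPV):
$$\mathrm{NPV} = \frac{TN}{TN + FN} = 1 - \mathrm{FOR}$$
Miss rate or false negative rate (FNR):
$$\mathrm{FNR} = \frac{FN}{P} = \frac{FN}{FN + TP} = 1 - \mathrm{TPR}$$
Fall-out or false positive rate (FPR):
$$\mathrm{FPR} = \frac{FP}{N} = \frac{FP}{FP + TN} = 1 - \mathrm{TNR}$$
False discovery rate (FDR):
$$\mathrm{FDR} = \frac{FP}{FP + TP} = 1 - \mathrm{PPV}$$
False omission rate (FOR):
$$\mathrm{FOR} = \frac{FN}{FN + TN} = 1 - \mathrm{NPV}$$
Accuracy (ACC):
$$\mathrm{ACC} = \frac{TP + TN}{P + N} = \frac{TP + TN}{TP + TN + FP + FN}$$
The harmonic mean of precision and sensitivity or f1-score (F1):
$$F_1 = 2 \cdot \frac{\mathrm{PPV} \cdot \mathrm{TPR}}{\mathrm{PPV} + \mathrm{TPR}} = \frac{2\,TP}{2\,TP + FP + FN}$$
Since we are using the scikit-learn metrics classification report to show the results, as in table 1, we also describe the meaning of the micro avg, macro avg and weighted avg used in the report. The macro-average precision and the macro-average recall are the unweighted means of the per-class precision and recall values, respectively. The macro-average of the F-score (MAAF) would be the harmonic mean of these two numbers; note that scikit-learn itself reports the macro-averaged f1-score as the unweighted mean of the per-class f1-scores.
The macro-average is suitable for judging how the system performs overall across different sets of data, but it should not be relied on for any specific decision making, because it calculates the metrics for each label and takes their unweighted mean, i.e. it does not take label imbalance into account, while in our case the labels are highly imbalanced, i.e. 1091 vs. 179. The micro-average, on the other hand, is a useful tool for decision making, especially when the datasets vary in size, because it calculates the metrics globally by counting the total true positives, false negatives and false positives. Finally, the weighted average, according to the scikit-learn documentation on f1-score metrics, calculates the metrics for each label and finds their average weighted by support (the number of true instances for each label). This alters "macro" to account for label imbalance; consequently, it can result in an F-score that is not between precision and recall.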
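The difference between the three averages can be seen in a short plain-Python sketch. The confusion counts below are hypothetical: they only reuse the class sizes quoted above (1091 vs. 179) and are not results from this study. The macro and weighted f1 values follow scikit-learn's convention of averaging the per-class f1-scores.

```python
def precision_recall_f1(tp, fp, fn):
    """Per-class precision, recall and f1 from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical counts: 11 cancer samples mislabeled healthy, 9 healthy mislabeled cancer.
counts = {
    "cancer":  {"tp": 1080, "fp": 9,  "fn": 11, "support": 1091},
    "healthy": {"tp": 170,  "fp": 11, "fn": 9,  "support": 179},
}

f1_per_class = {
    name: precision_recall_f1(c["tp"], c["fp"], c["fn"])[2]
    for name, c in counts.items()
}
total_support = sum(c["support"] for c in counts.values())

# Macro: unweighted mean over classes, blind to class imbalance.
macro_f1 = sum(f1_per_class.values()) / len(f1_per_class)

# Weighted: mean over classes weighted by support.
weighted_f1 = sum(
    f1_per_class[name] * c["support"] for name, c in counts.items()
) / total_support

# Micro: pool the counts of all classes first, then compute the metric once.
tp_all = sum(c["tp"] for c in counts.values())
fp_all = sum(c["fp"] for c in counts.values())
fn_all = sum(c["fn"] for c in counts.values())
micro_f1 = precision_recall_f1(tp_all, fp_all, fn_all)[2]

print(f"macro={macro_f1:.4f} weighted={weighted_f1:.4f} micro={micro_f1:.4f}")
```

In this made-up case the minority class has the lower f1-score, so the macro average comes out below the weighted average, and the micro average equals the overall accuracy, as it does for any single-label classification.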

Results
Genomic variation files for healthy people (179 persons) and cancer patients (1091 samples) were obtained from the GTEx project and the TCGA online database. The system detected all cancerous and noncancerous samples correctly: as seen in the classification report shown in table 1, the classifier reaches an accuracy and precision of 100% and a sensitivity and specificity of 1. Not only is the accuracy 100%, but the area under the Receiver Operating Characteristic curve (ROC AUC) computed from the prediction scores is also 1, as seen in figure 1.
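For reference, the ROC AUC can be read as the probability that a randomly chosen cancerous sample receives a higher prediction score than a randomly chosen healthy one, so it equals 1 exactly when the scores separate the two classes completely. The plain-Python sketch below uses this pairwise definition; the scores are hypothetical and the function name is an assumption.

```python
def roc_auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    # A tie between a positive and a negative score counts as half a win.
    wins = sum(
        1.0 if p > n else (0.5 if p == n else 0.0)
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical prediction scores (1 = cancerous, 0 = healthy).
labels = [1, 1, 1, 0, 0]
separated = roc_auc(labels, [0.9, 0.8, 0.7, 0.3, 0.1])    # every positive above every negative
overlapping = roc_auc(labels, [0.9, 0.8, 0.4, 0.5, 0.2])  # one positive below one negative
print(separated, overlapping)
```

This pairwise form is quadratic in the number of samples, which is fine for a sketch; library implementations typically work from sorted scores instead.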

Discussion and Conclusions
The classifier did its task perfectly with no error, at least on our available data.
There are still some aspects to reflect on. Although most of the TCGA Breast Cancer (BRCA) samples come from white women, the collection also contains samples from Asian, Black and Latin women, as well as from men. Since our method classifies all cancerous and noncancerous samples correctly using the information available in the genomic variation, this suggests that the genetic signatures of cancer are detected universally, without the need to consider racial or sexual differences. The samples also come from four different stages, and the trained machine detected cancerous samples from all stages correctly. This means that even cancerous samples from the earliest stages were distinguished correctly from noncancerous samples, which is extremely important for effective treatment, because the cancer stage plays an important role in determining treatment options and patients' survival: the earlier the diagnosis, the higher the chance of successful treatment.
Our work provides a new approach to applying ML to medical data, which resulted in excellent classification between cancerous and noncancerous cells of the breast. In this work, we did not reduce the dimension of the input data and left all the statistical analysis to the ML system, which did its job very well and distinguished the cancerous samples from healthy cells almost perfectly. We did not even need to balance the number of samples in each class, which shows that the difference between the two classes is so large that providing hundreds of samples enables the machine to distinguish between the two categories without any mistake. Nor did we need to pre-process the data obtained from GTEx and TCGA, despite the fact that their data are not perfect and there are some rows of missing data for some gene quantities in some samples; the data provided by these two projects are fairly clean and reliable, and they were enough for our classifier to do its classification 100% correctly.
This ML system is now trained to receive any new person's RNA-seq data and recognize whether the patient's breasts are cancerous or not. It can detect the problem accurately in different stages of cancer; therefore, it can be helpful in the early diagnosis of cancer. The limitation of our model is that it needs data from organ samples, and the labs involved should follow the same protocols for obtaining the transcriptomics data of 19627 genes as GTEx and TCGA did on samples obtained from people's breasts. The new-generation RNA-seq protocols followed by GTEx and TCGA are well known and standard. Thus, the next step can be finding suitable biomarkers in the blood that can distinguish healthy people from patients by blood tests alone.

Availability of data and materials
The data used in this project are publicly available on www.gtexportal.org and https://portal.gdc.cancer.gov/, and all ethical issues are strictly observed by these sources. This project does not need any extra personal or patient consent approval either, because the data are normalized and do not reveal any private information, and whatever is legally required is observed by the institutes publishing them. All software can be made available on GitHub after paper approval.

Authors' contributions, competing interests and consent for publication
I, Arash Hooshmand, as the only author of this article, have submitted it to the BMC Bioinformatics journal; hence I reiterate my consent to publish it in this journal. There are no competing interests, and there is no need for any other consent approvals.

Funding and acknowledgement
The library of KTH Royal Institute of Technology has Open Access publication agreements with BMC Bioinformatics and has agreed to pay for its publications. I also thank and acknowledge the Houshmand family and their companies, especially Mr. Eng. GHolamAbbas Houshmand, Atash Houshmand, Shahab Houshmand, Shahin Houshmand and Shadab Houshmand, for their financial support and contribution to the project.