Inside-out Vision for Treatment Identification in Dental Setup using Machine Learning

Dr. Shaheena Noor, Computer Engineering Department, Sir Syed University of Engineering & Technology. Tel.: +92-300-278066, E-mail: shanoor@ssuet.edu.pk
Dr. Humera Noor Minhas, CTO and Co-Founder, 4pilots, Jena, Germany. E-mail: humera.noor@gmail.com
Engr. Muhammad Imran Saleem, Computer Engineering Department, Sir Syed University of Engineering & Technology. E-mail: isaleem@ssuet.edu.pk
Dr. Vali Uddin, Vice Chancellor, Sir Syed University of Engineering & Technology. E-mail: vc@ssuet.edu.pk
Dr. Najma Ismat, Computer Engineering Department, Sir Syed University of Engineering & Technology. E-mail: nismat@ssuet.edu.pk

Abstract: Smart clinics have gained much popularity due to technological advancements in areas like computer vision. The recognition of objects and activities, and perception of the environment in general, lies at the core of such systems. This is essential not just for autonomous systems, but also for Human-Machine Interaction, especially in scenarios with small work areas like dental treatment. In this paper, we compare a number of machine learning models (including Multinomial Logistic Regression, Lazy Instance-based Learning (IBk), Sequential Minimal Optimization (SMO), Hoeffding Tree and Random Tree) for robustly identifying dental treatments. We take the objects in focus as input, covering parameters like the material, the symptoms of the patient's teeth and the tools used by the dentist. We take advantage of the fact that the problem of identifying a particular treatment can be solved by recognizing the objects seen during an activity. We collected a dental dataset in-the-wild and ran our tests to find that integrating different parameters improves accuracy relative to using each one separately. However, we also noted that in certain cases using the symptoms stand-alone gave better results. Also, with respect to RMS error convergence, symptoms showed lower error compared to the combined approach. Finally, we noticed that the combined approach led to longer build and test times for the machine learning models. This shows that in machine learning applications in general, and in medical/dental applications in particular, adding more parameters does not always lead to improved results. Rather, it depends on the ML tool used, the parameters considered and the data given as input.


Introduction
Healthcare heavily relies on image processing, not only for medical diagnosis, but also for Human-Computer Interaction systems and smart robots. This requires sophisticated image-capturing setups on the one hand, and efficient machine learning models on the other. As wearable technology becomes more pervasive and advanced, it has become possible to perceive the environment from the point of view of the actor. This has changed the working paradigm significantly, because the scene can now be perceived from the actor's point of view, giving hints into his cognitive process, and it also saves us from installing fixed camera setups around the actor. This mode of data perception, also called First Person Vision (FPV), allows us to interpret complicated scenarios by breaking them into their constituent objects as seen and handled by the actor.
The paper is organized as follows: In Section 2, we provide an overview of the literature on inside-out vision and machine learning for computer vision applications. In Section 3, we elaborate our system based on treatment identification using inside-out vision and introduce the ML models used in our work. In Section 4, we present our dataset, provide details of the experiments and a discussion of the results. Finally, in Section 5, we conclude our findings.

Inside-out vision
Inside-out vision associates human eye movements with the human mind. There is a lot of research on human working behavior and its connection to eye movements, which in fact follow our mind and thoughts [1], [2]. The concept of eye movements and trajectories was proposed by Yarbus [3], and became the basis for future research [4], [5], [6]. Even for the same setup, if the task is given with different goals, the eye movement trajectories and patterns will differ. Inside-out vision differs from traditional wall-mounted cameras. Here, the actor wears the camera on the shoulder, forehead or eyes, and observes the scenario from his own point of view. Traditional wall-mounted cameras, on the other hand, observe the overall situation of the environment, and are hence referred to as outside-in vision. They capture the overall activity, focusing especially on human activity and behavior.
The outside-in view is well-suited for person recognition, identifying current activities, predicting future activities, observing human behavior and many similar tasks. However, it runs into problems when we want to capture items that are far away from the camera or small in size; missing these items from the environment reduces the overall accuracy. For such scenarios, inside-out vision cameras are much more reliable, as they are easy to install and can capture the fine details of the scene. In short, an inside-out vision camera is easily worn by the person and focuses exactly where the person is looking, so the relevant information is captured and additional, irrelevant information is discarded.
Research has been done on scenario detection using inside-out vision cameras to detect objects and activities. In [7], the importance of eye movements when perceiving a scene was investigated. Similarly, in [8] gaze patterns were observed during the execution of a scenario and linked to the task being performed. Object and activity recognition using deep features [9] is widely used in computer vision and machine learning. It has been shown that eye movements are closely linked to the current environment, job or operation, which is why the objects seen aid in situation recognition. This is particularly helpful when identification needs to be achieved without large movements and when the interacting objects are in close proximity to each other. Inside-out vision is widely used in smart homes for identifying objects [10] and locations, observing human behavior [11], research on the nervous system [4], and predicting mental disorders and illness. According to Furnari et al. [12], inside-out vision views are used to estimate future actions based on the interactions with objects.

Machine Learning for Understanding and Prediction
As image-capturing devices become more pervasive and sophisticated, much of the input we get from the environment is visual, in the form of images and videos. Understanding these images to perceive the environment is, therefore, an interesting and challenging task. Machine learning researchers are putting much effort into computer vision and image understanding applications. This includes not only identifying the low-level features apparent from the visual data, e.g. recognizing objects and activities, but also analysing them for scene understanding and behavior analysis. Such automated analysis is being used in diverse domains like healthcare, marketing, sports and security [13].
In machine learning, decision making and behavioral analysis is essentially a classification problem, where prediction implies assigning the correct label [14], [15]. In [16], Chen et al. proposed an unsupervised deep learning solution to forecast scenarios based on multi-dimensional sensor data representing spatio-temporal correlations. Their system is data-driven and essentially model-free. In [17], Alonso et al. used machine learning techniques for predicting anomalies in software. They used Random Forest and achieved a validation error of less than 1%. Isupova in [18] did a deep dive into machine learning for behavior analysis and the detection of anomalies in videos. The author introduced a novel approach based on Bayesian topic models to model the activities and used them to identify changes. Arac et al. in [19] explored the brain-to-behavior connection from both a neurological and an imaging point of view and presented a deep learning toolbox using three convolutional neural network architectures for automated analysis.
Statistical machine learning methods are also used for behavior and scenario prediction; however, not much attention has been paid to video analysis. Umana-Hermosilla et al. in [20] use multinomial logistic regression to predict how individuals and companies perceive the new normal under the COVID-19 pandemic. Westlinder in [21] explored instance-based learning and trained a Support Vector Machine with packet-based features for video traffic classification. In [22], Chakrabarti et al. proposed a fast training method using Sequential Minimal Optimization (SMO) to handle large training sets and solve the quadratic programming problem that becomes a bottleneck for Support Vector Machines. Bifet et al. in [23] explained the application of Hoeffding Trees to training on and classifying data streams.

Gaze-based Information
Information regarding the objects seen in FPV images is important and is used to define the focus of attention. This information is more comprehensive than the regular outside-in view, because it does not suffer from occlusion problems. As an example, consider Figure 1 [24], captured via a mounted camera, where we see a dentist performing a treatment on a patient. The treatment cannot be identified correctly because of the occlusion of objects (like tools) by the dentist's hands. Also, the state of the patient's teeth (i.e. the symptom), another significant parameter in identifying the overall treatment, cannot be detected from the outside-in view. Now consider Figure 2 [25], [26], in which a similar scene is observed from the first-person view of the dentist. Here we can clearly see the tools used along with the condition of the teeth (i.e. the symptoms). These inside-out camera observations provide information about the objects seen and help to understand the current scenario.

Fig. 1: Outside-in view: Patient being treated by a dentist [24]. No information about the condition of teeth, tools or material used can be extracted from the image due to self-occlusion.

The treatments and their respective objects in focus are shown in Table 1. As described above, three parameters are used to identify the dental treatment: material, symptom and tools.
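The three parameters can be combined into a single feature vector for the ML models. A minimal sketch of one possible encoding is given below; the category lists (MATERIALS, SYMPTOMS, TOOLS) are hypothetical placeholders for illustration, not the actual vocabulary of our dataset.

```python
# Hypothetical vocabularies -- illustrative only, not the paper's actual categories.
MATERIALS = ["amalgam", "composite", "none"]
SYMPTOMS = ["cavity", "plaque", "swelling"]
TOOLS = ["drill", "scaler", "forceps"]

def one_hot(value, vocabulary):
    """One-hot encode a single categorical value over a fixed vocabulary."""
    return [1 if value == v else 0 for v in vocabulary]

def encode(material, symptom, tool):
    """Concatenate the one-hot encodings of the three parameters."""
    return one_hot(material, MATERIALS) + one_hot(symptom, SYMPTOMS) + one_hot(tool, TOOLS)

print(encode("composite", "cavity", "drill"))
```

Using only one of the three slices of this vector corresponds to the stand-alone experiments, while the full vector corresponds to the combined approach.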

Machine Learning Techniques for Treatment Identification
In this section, we explore a number of machine learning methods for treatment identification in a dental setup. We examine Multinomial Logistic Regression, Lazy Instance-based Learning (IBk), Sequential Minimal Optimization (SMO), Hoeffding Tree and Random Tree, and provide a comparative analysis of performance, latency and error metrics. In the remainder of the section, we provide a theoretical and mathematical overview of these techniques.

Multinomial Logistic Regression
Multinomial Logistic Regression is a modeling technique to predict the classification of a dependent variable based on multiple independent variables, which can be binary or continuous.
To calculate the parameters and reduce errors, a number of techniques can be used, one of them being the ridge estimator [31]. The ridge estimator is used for multinomial logistic regression as follows. Given k classes for n instances, each with m attributes, an m x (k-1) parameter matrix B is calculated.

The probability of each class j except the last is given in Eq. (1):

P_j(x_i) = exp(x_i^T B_j) / (1 + sum_{l=1}^{k-1} exp(x_i^T B_l))    (1)

The last class has the probability given in Eq. (2):

P_k(x_i) = 1 / (1 + sum_{l=1}^{k-1} exp(x_i^T B_l))    (2)

Thus, the (negative) multinomial log-likelihood is calculated as in Eq. (3):

L = - sum_{i=1}^{n} sum_{j=1}^{k} y_ij ln P_j(x_i)    (3)

A Quasi-Newton method is used to search over the m x (k-1) variables for the matrix B for which L is minimized. We use it to estimate the local minimum, since other comprehensive methods can become computationally expensive at each iteration.
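The class probabilities of Eqs. (1)-(2) can be sketched in a few lines of numpy. The matrix B and instance x below are toy values, not fitted parameters:

```python
import numpy as np

def class_probabilities(x, B):
    """Probabilities of Eqs. (1)-(2): B is m x (k-1); the last class
    is the reference class with an implicit score of 0."""
    scores = x @ B                   # one score per class 1..k-1
    exps = np.exp(scores)
    denom = 1.0 + exps.sum()
    # classes 1..k-1 first, then the last (reference) class
    return np.append(exps / denom, 1.0 / denom)

# Toy example: m = 2 attributes, k = 3 classes
B = np.array([[0.5, -0.2],
              [0.1,  0.3]])
x = np.array([1.0, 2.0])
p = class_probabilities(x, B)
print(p)  # three probabilities summing to 1
```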

Lazy Instance-based Learning (IBk)
Instance-based Learning (IBk) is an implementation of the K-nearest neighbors (k-NN) classifier, with the possibility to select an appropriate value of k based on cross-validation and to apply distance weighting. A comparative analysis of Bayes and Lazy classification algorithms shows [32] that the lazy classifier is more efficient than the Bayesian classifier. The simple classification procedure of IBk is given in Algorithm 1.

Algorithm 1: K-Nearest Neighbor Algorithm [32]
Given a query instance x_q to be classified:
1. Let x_1, ..., x_k denote the k instances from the training set D that are nearest to x_q.
2. Return the majority label argmax_v sum_{i=1}^{k} delta(v, f(x_i)),
where f(x_i) is the label of x_i, and delta(a, b) = 1 if a = b, and delta(a, b) = 0 otherwise.
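As an illustration, Algorithm 1 with a majority vote can be sketched as follows; the feature values and labels are made up for the example and are not from our dataset.

```python
import numpy as np
from collections import Counter

def knn_classify(query, data, labels, k=1):
    """Return the majority label among the k training instances
    nearest (Euclidean distance) to the query -- Algorithm 1."""
    dists = np.linalg.norm(data - query, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy feature vectors (hypothetical encodings of material/symptom/tool)
X = np.array([[0.0, 1.0], [0.0, 2.0], [5.0, 5.0], [6.0, 5.0]])
y = ["filling", "filling", "extraction", "extraction"]
print(knn_classify(np.array([0.0, 1.5]), X, y, k=3))  # majority of 3 nearest
```

With k = 1, as used in our WEKA configuration, the vote degenerates to returning the label of the single nearest instance.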

Sequential Minimal Optimization (SMO)
The originally available training methods for the Support Vector Machine (SVM) were complex and required computationally expensive Quadratic Programming (QP) solvers. Hence, the Sequential Minimal Optimization (SMO) algorithm was proposed by Platt [33] for solving the QP problem arising during SVM training. It is implemented by the LIBSVM tool [34] and has gained much popularity. Training a Support Vector Machine can be regarded as an optimization problem which requires solving a quadratic program. Consider, e.g., a binary classification problem [35] with a dataset (x_1, y_1), ..., (x_n, y_n), where x_i is an input vector and y_i is the corresponding binary label. The dual of the soft-margin SVM is expressed in Eq. (4):

max_alpha sum_{i=1}^{n} alpha_i - (1/2) sum_{i=1}^{n} sum_{j=1}^{n} y_i y_j K(x_i, x_j) alpha_i alpha_j    (4)

subject to:

0 <= alpha_i <= C for all i,  and  sum_{i=1}^{n} y_i alpha_i = 0

where C is an SVM hyperparameter, K(x_i, x_j) is the kernel function and the alpha_i are Lagrange multipliers. SMO is an iterative algorithm that solves this optimization problem by breaking it into the smallest possible sub-problems and solving them analytically. At each step only two Lagrange multipliers alpha_1 and alpha_2 are optimized, so the equality constraint simplifies to:

y_1 alpha_1 + y_2 alpha_2 = constant

This reduced problem can be solved analytically by finding the minimum of a one-dimensional function, with the remaining terms of the equality constraint held fixed.
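One building block of this two-multiplier step is the box [L, H] within which the second multiplier must stay so that both box constraints and the equality constraint hold. A sketch following Platt's update step (the alpha values below are arbitrary examples):

```python
def clip_bounds(alpha1, alpha2, y1, y2, C):
    """Bounds [L, H] for alpha2 under y1*a1 + y2*a2 = const
    and 0 <= a_i <= C (Platt's SMO clipping step)."""
    if y1 != y2:
        L = max(0.0, alpha2 - alpha1)
        H = min(C, C + alpha2 - alpha1)
    else:
        L = max(0.0, alpha1 + alpha2 - C)
        H = min(C, alpha1 + alpha2)
    return L, H

print(clip_bounds(0.25, 0.5, 1, -1, C=1.0))  # (0.25, 1.0)
```

The new alpha_2 obtained from the one-dimensional minimization is clipped to [L, H], and alpha_1 is then recovered from the equality constraint.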

Hoeffding Tree
The Hoeffding Tree [36], also known as the Very Fast Decision Tree (VFDT), is a decision tree for classification tasks on data streams. It is an incremental tree-induction algorithm that can learn from massive datasets, provided that the distribution of samples does not change over time. Hoeffding trees exploit the fact that if an appropriate splitting attribute is chosen, then even a small sample can be sufficient. Mathematically, Hoeffding trees are based on the Hoeffding bound [37], which quantifies the number of observations required to estimate a statistic to within a prescribed precision. It can be shown that, given enough (in the limit, infinitely many) examples, its output is identical to that of a non-incremental learner.
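The Hoeffding bound states that, with probability 1 - delta, the true mean of a random variable with range R lies within epsilon = sqrt(R^2 ln(1/delta) / (2n)) of the mean of n independent observations. A small sketch (the R, delta and n values below are illustrative):

```python
import math

def hoeffding_bound(R, delta, n):
    """Hoeffding bound: epsilon such that, with probability 1 - delta,
    the observed mean of n samples of range R is within epsilon of
    the true mean."""
    return math.sqrt((R ** 2) * math.log(1.0 / delta) / (2 * n))

# e.g. a statistic with range R = 1 after 200 observations,
# with allowed failure probability delta = 1e-7
eps = hoeffding_bound(R=1.0, delta=1e-7, n=200)
print(eps)  # about 0.2
```

In tree induction, a split is made once the observed gap between the two best splitting attributes exceeds this epsilon, which is why a bounded number of stream examples suffices per decision.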

Random Tree
A random tree, which may also be regarded as a decision tree, is essentially the outcome of a stochastic process. It is constructed by considering k randomly chosen attributes at each node, without any pruning. A random tree also has an option to estimate class probabilities based on back-fitting.
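The per-node behavior, scoring k randomly chosen attributes and picking the best split, can be sketched as below. Gini impurity is used here for scoring, the toy data is made up, and this covers only the split-selection step, not a full tree builder:

```python
import numpy as np

def random_split(X, y, k, rng):
    """Pick the best threshold split over k randomly chosen attributes,
    scoring candidates by weighted Gini impurity (lower is better)."""
    def gini(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return 1.0 - (p ** 2).sum()

    best = None  # (impurity, attribute index, threshold)
    for attr in rng.choice(X.shape[1], size=k, replace=False):
        for t in np.unique(X[:, attr])[:-1]:
            left, right = y[X[:, attr] <= t], y[X[:, attr] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, attr, t)
    return best

rng = np.random.default_rng(1)
X = np.array([[0.0, 3.0], [1.0, 2.0], [4.0, 2.5], [5.0, 3.5]])
y = np.array([0, 0, 1, 1])
print(random_split(X, y, k=2, rng=rng))
```

Repeating this at every node on the instances that reach it, without pruning, yields a random tree in the sense described above.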

Dataset
For the purpose of the experiments, we collected our own dataset in the wild. Such a data collection is inherently complicated because of the randomness and unstructured nature of the input. However, it serves as a reliable test set for the evaluation of ML models. The data was extracted from real and synthetic videos of dental treatments and used for training and testing the models. A subset of the images is shown in Figure 5. Next, we explain the configuration of the machine learning models used for treatment prediction. We used the ML tool WEKA [38] for implementing the models and running the simulations.

Multinomial Logistic Regression
We used WEKA's logistic regression class with a modified ridge estimator to train and run a logistic model. Missing values (if any) were replaced using a missing-values filter, and nominal classes were converted to binary. For calculations, we used up to 4 decimal places, a batch size of 100, and left the maximum number of iterations unspecified. The ridge value was initialized to 1.0E-8.

Lazy Instance-based Learning (IBk)
The WEKA implementation of Lazy IBk is based on the Euclidean distance for nearest-neighbor search. We used a batch size of 100, no distance weighting, and up to 2 decimal places for calculations. We used a single neighbor for the model (k = 1) and trained it without restricting the maximum number of instances in the training pool.

Sequential Minimal Optimization (SMO)
The SMO function of WEKA is an implementation of Platt's originally proposed algorithm for training an SVM classifier. In this implementation, all missing values are globally replaced and the nominal attributes are transformed into binary ones. Also, in the case of multi-class problems, the solution is implemented using pairwise (one-vs-one) classification.
For our calculations we use a batch size of 100, a random seed for the cross-validation and up to 2 decimal places. We use the polynomial kernel, the logistic calibrator, and normalize the training data. The epsilon for round-off error is left at the default value of 1.0E-12.

Hoeffding Tree
We use Adaptive Naive Bayes as the leaf prediction strategy in our implementation of the Hoeffding Tree. We set the number of instances a leaf should observe between split attempts to 200. The split is done according to the information gain split criterion, and the allowable error in a split decision is 1.0E-7.
For our calculations we use a batch size of 100 and up to 2 decimal places. The minimum fraction of weight required to go down a branch is set to 0.01, and the Hoeffding threshold below which a split will be forced to break ties is 0.05.

Random Tree
To train a random tree-based model in WEKA, we started by specifying a random seed value of 1 for selecting attributes, and did not allow unclassified instances during training.
For our calculations, we set the minimum total weight of the instances in a leaf to 1.0, did not apply back-fitting, specified the batch size as 100, and used up to 2 decimal places. In case several attributes looked equally good, we did not enable random tie-breaking, and we allowed the tree to grow to an unlimited depth. The minimum proportion of the variance was set to 0.001, and the number of randomly chosen attributes is calculated as int(log2(No. of predictors) + 1).
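For reference, the attribute-count formula from the configuration above evaluates as follows; the predictor counts used here are purely illustrative.

```python
import math

def k_attributes(num_predictors):
    """Number of randomly chosen attributes per node: int(log2(p) + 1)."""
    return int(math.log2(num_predictors) + 1)

print(k_attributes(3))  # e.g. three predictors: material, symptom, tools
print(k_attributes(8))
```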
We present the results in Table 2. We note that the objects in gaze are a strong predictor for recognizing the treatment even if only one of the three criteria, i.e. material, symptom or tools, is taken into account. Moreover, combining multiple parameters improves the overall accuracy. With the extracted object information in focus, the ML models were regenerated and trained. As expected, in most cases the precision and recall improved when considering the combined information, compared to when the parameters were used individually. It is surprising to note that for the Multinomial Logistic Regression and Hoeffding Tree methods, the best precision and recall were achieved when the Symptoms parameter was used independently.
We also looked at the training and testing times (in seconds) of each of these models and present them in Table 3. Here, times smaller than 0.01 sec were automatically rounded off to zero. We notice that the combined models were the slowest to build and test, and hence are at a disadvantage for computationally intensive applications or for huge datasets. Among these, SMO took the longest to build and the Hoeffding Tree took the longest to test. Finally, we compared the Root Mean Squared (RMS) error of the different models and found that in most cases the Symptoms parameter returned the minimum RMS error.

Conclusion
In this paper, we used the dental setup and presented a treatment identification system based on the objects in focus, which include materials, symptoms and tools.
We trained and tested various machine learning models including Multinomial Logistic Regression, Lazy Instance-based Learning (IBk), Sequential Minimal Optimization (SMO), Hoeffding Tree and Random Tree. Combining the observations w.r.t. performance, speed and error convergence, we conclude that combining multiple parameters results in better performance in some cases; however, it comes at the cost of poorer error convergence and slower speed. Also, not all machine learning techniques give improved performance with the combined approach. Rather, the best trade-off is obtained when we use symptoms as a stand-alone parameter. This shows that in machine learning applications in general, and in medical/dental applications in particular, adding more parameters does not always lead to improved results. Rather, it depends on the ML tool used, the parameters considered and the data given as input.

Fig. 5: Subset of the dataset collected in-the-wild for dental treatment.