3.1 General Architecture
This paper defines an architecture for integrating a Clustering Algorithm into a Domain Ontology based on Cloud Computing (CADOCC) to support various applications involved in intelligent mobile app development. We argue that CADOCC can be employed as a common framework that integrates clustering algorithms and domain ontologies uniformly on top of a cloud computing environment. The proposed CADOCC consists of a data preparation module, a machine learning module, a semantic web module, and a cloud computing module. Figure 1 displays the CADOCC application framework at a high level of abstraction.
3.2 Data Preparation Module
In the data processing stage, the data preparation module performs data acquisition, filtering, and formatting operations to generate training data for the machine learning module. In the data acquisition operation, the data source can be web-based data collected from different web sources; the data collected in this study included OGD, LTC-related articles, and online news. In the data filtering operation, Jieba [40] is employed for automatic Chinese word segmentation. In the data formatting operation, Term Frequency–Inverse Document Frequency (TF-IDF) [41] is used to evaluate the weight of each word in an article. A practical case description of the data preparation module is given in Sections 4.1 and 4.2.
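As a minimal sketch of these two operations, assuming Jieba and scikit-learn are available (the two document strings below are hypothetical placeholders for collected LTC articles, not data from this study):

```python
# Sketch of data filtering + formatting: Jieba segmentation, then TF-IDF.
import jieba
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical LTC article texts.
docs = [
    "長期照顧服務涵蓋居家服務與日間照顧",
    "政府開放資料提供長照機構的位置資訊",
]

# Segment each Chinese document into space-separated tokens.
segmented = [" ".join(jieba.cut(doc)) for doc in docs]

# Weight each token by TF-IDF; rows are documents, columns are terms.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(segmented)
print(tfidf.shape)  # (num_documents, vocabulary_size)
```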
3.3 Machine Learning Module
The machine learning module adopts mathematical and statistical techniques, for example, K-means clustering, the K-NN (K Nearest Neighbor) classifier [42], fuzzy c-means, Bayesian networks, neural networks, Latent Semantic Indexing (LSI) [43], etc. Using these algorithms, programs can learn rules from data for use in decision-making and prediction. This study employed the K-NN classifier to categorize each class in the domain ontology and K-means clustering to extract hidden topics from articles posted on the social web. In the training process, the machine learning module is used to build a predictive model that, once established, can be reused by the semantic web module. Therefore, the training process is not required every time the App interacts with a user.
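A minimal sketch of the two learners used in this study, assuming documents are already represented as TF-IDF vectors (the toy matrix and class labels below are hypothetical):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical TF-IDF matrix: 6 documents x 4 terms.
X = np.array([
    [0.9, 0.1, 0.0, 0.0],
    [0.8, 0.2, 0.1, 0.0],
    [0.0, 0.9, 0.8, 0.1],
    [0.1, 0.8, 0.9, 0.0],
    [0.0, 0.1, 0.1, 0.9],
    [0.1, 0.0, 0.2, 0.8],
])

# K-means extracts hidden topics by grouping similar articles.
topics = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# K-NN assigns a new article to an ontology class learned from labeled data
# (the class names here are illustrative only).
labels = ["home_care", "home_care", "day_care", "day_care", "subsidy", "subsidy"]
knn = KNeighborsClassifier(n_neighbors=3).fit(X, labels)
print(topics, knn.predict(X[:1]))
```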
LSI associates words with similar semantics in documents. Generally, the correlation between individual words is ignored when searching for keywords; in practical applications, however, individual words may be related to each other. LSI uses singular value decomposition (SVD) [44], dimension reduction (DR), and a vector space model to capture these correlations. SVD is highly useful in machine learning for gathering intelligence in various applications [45]; the association between an article and a word is used in SVD to determine the category of an article. In this study, we used clustering and feature extraction to identify LTC information and LSI to determine the similarities among OGD, news, and LTC information.
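The LSI step can be sketched as a truncated (rank-k) SVD over the TF-IDF matrix, followed by cosine similarity in the reduced space; the matrix and dimensions below are illustrative, not the study's actual data:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical TF-IDF matrix (documents x terms), e.g. produced by the
# data preparation module in Section 3.2.
tfidf = np.array([
    [0.9, 0.1, 0.0, 0.0, 0.2],
    [0.8, 0.2, 0.1, 0.0, 0.1],
    [0.0, 0.9, 0.8, 0.1, 0.0],
    [0.1, 0.0, 0.1, 0.9, 0.8],
])

# LSI = rank-k SVD of the TF-IDF matrix; k=2 latent dimensions here.
svd = TruncatedSVD(n_components=2, random_state=0)
lsi_vectors = svd.fit_transform(tfidf)

# Cosine similarity in the latent space captures word/document correlations
# that plain keyword matching would miss.
print(cosine_similarity(lsi_vectors))
```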
3.4 Semantic Web Module
The semantic web module is composed of related web technologies, including XML-based markup languages [15], open data, ontologies, and logic rules. The XML-based markup languages include the Extensible Markup Language (XML), the Resource Description Framework (RDF), RDF Schema (RDFS), the Web Ontology Language (OWL), etc. Web-based open data can be accessed using a URL. An ontology is developed with an RDF-based markup language to define a shared conceptualization of a specific domain; a detailed definition of the ontology is given in our previous paper [46]. Complex semantics can be expressed with logic rules, for example, the Semantic Web Rule Language (SWRL), Jena-based rules, etc.
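A minimal RDF sketch, using Python's rdflib rather than Jena for brevity (the namespace, class, and individual names are hypothetical and are not the ontology defined in [46]):

```python
from rdflib import Graph, Literal, Namespace, RDF, RDFS

# Hypothetical namespace for a small LTC ontology fragment.
LTC = Namespace("http://example.org/ltc#")

g = Graph()
g.bind("ltc", LTC)

# Class hierarchy: HomeCareService is a kind of LTCService.
g.add((LTC.LTCService, RDF.type, RDFS.Class))
g.add((LTC.HomeCareService, RDFS.subClassOf, LTC.LTCService))

# A hypothetical organization offering a home-care service.
g.add((LTC.OrgA, RDF.type, LTC.HomeCareService))
g.add((LTC.OrgA, RDFS.label, Literal("Example home-care organization")))

print(g.serialize(format="turtle"))
```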
3.5 Cloud Computing Module
The cloud computing model can be represented as a three-tier architecture comprising Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS); developers typically deploy their applications on the infrastructure. The cloud computing module is used to improve the performance of the machine learning and semantic web modules, because excessive data can lead to slow computation or even system failure, problems that cloud computing technology can mitigate.
Hadoop is a cloud computing platform that consists of MapReduce and the HDFS (Hadoop Distributed File System) [47]. MapReduce enables distributed computing, and the HDFS provides a distributed file system with high fault tolerance. YARN [48] is a cluster manager introduced in Hadoop 2. Apache Spark [49] is an open-source platform that adopts an in-memory computing technique and the Resilient Distributed Dataset (RDD); data are processed in memory and finally written to disk for later access. In this study, Spark is used for the computation of LSI and K-means clustering.
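As a minimal PySpark sketch of the MapReduce pattern over an RDD, with cache() keeping intermediate results in memory (the HDFS input path is a placeholder, not a path from this study):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cadocc-sketch").getOrCreate()
sc = spark.sparkContext

# Placeholder input: one segmented article per line on HDFS.
lines = sc.textFile("hdfs:///path/to/segmented_articles.txt")

# Map phase: emit (word, 1); reduce phase: sum counts per word.
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b)
               .cache())  # RDD kept in memory for reuse

print(counts.take(5))
spark.stop()
```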
3.6 Long-Term Care Application Platform
To verify the feasibility of CADOCC, a Long-Term Care Application Platform (LTCAP) was developed based on CADOCC, as shown in Fig. 2.
1. User
1.1 The user operates the App to access LTC policies and information in the LTCAP.
1.2 The LTCAP presents the corresponding information and replies as required.
2. Developer
2.1 The system manager imports LTC-related OGD and news into the database.
3. Computing
3.1 The user's access records are stored in the database through the LTCAP for subsequent recommendation.
3.2 The Google Maps API is invoked to retrieve OGD information, and the LTC article information is processed into the format required by the machine learning module.
3.3 The user profile is converted into an RDF file that excludes personal security data. The Jena inference engine uses the RDF file to infer the LTC services and organizations that match the user's needs. The detailed flow chart of the Jena inference engine is introduced in Sections 4.5 and 4.6.
3.4 Machine learning technology, including LSI, K-means, and K-NN, is used to build a predictive model that can be reused once it is established. An example of machine learning processing is introduced in Section 4.3.
3.5 The LSI, K-means, and K-NN algorithms are implemented in the Spark cloud environment to improve machine learning performance; a minimal sketch is shown after this list.
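As a sketch of step 3.5, K-means over TF-IDF features can be expressed with Spark's ML pipeline; the three short documents, the feature size, and the cluster count below are illustrative assumptions, not the platform's actual configuration:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF, IDF
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("ltcap-kmeans").getOrCreate()

# Hypothetical pre-segmented articles (space-separated tokens from Jieba).
df = spark.createDataFrame([
    ("ltc home care service",),
    ("ltc day care center",),
    ("open data organization location",),
], ["text"])

# Tokenize, compute TF-IDF features, then cluster with K-means.
pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="words"),
    HashingTF(inputCol="words", outputCol="tf", numFeatures=1 << 10),
    IDF(inputCol="tf", outputCol="features"),
    KMeans(k=2, seed=1, featuresCol="features"),
])

model = pipeline.fit(df)
model.transform(df).select("text", "prediction").show(truncate=False)
spark.stop()
```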