Validity as a Measure of Data Quality in Internet of Things Systems

Data quality became significant with the emergence of data warehouse systems. While accuracy is an intrinsic data quality dimension, validity of data presents a wider perspective, more representational and contextual in nature. In this article we present a different perspective on data collection and collation. We focus on faults experienced in data sets and present validity as a function of allied parameters such as completeness, usability, availability and timeliness for determining data quality. We also analyze the applicability of these metrics and modify them so that they conform to IoT applications. Another major focus of this article is to verify these metrics on aggregated data sets instead of separate data values. This work focuses on using the different validation parameters for determining the quality of data generated in a pervasive environment. The analysis approach presented is simple and can be employed to test the validity of collected data, isolate faults in the data set and measure the suitability of data before applying analysis algorithms. On analyzing the data quality of the two data sets on the basis of the above-mentioned parameters, we show that validity was 75% for data set 1 but only 67% for data set 2. The availability and data freshness metrics were analyzed graphically: data freshness was better for data set 1, while availability was better for data set 2. The usability obtained for data set 2 was 86%, higher than the 69% obtained for data set 1. Thus, this work presents methods for estimating data quality that can benefit various IoT-based industries, which are essentially data centric and whose decisions depend upon the validity of data.


Introduction
Internet of Things (IoT) is a network of physical devices or things embedded with electronics such as sensors, actuators and software, together with an interconnecting network, which enables it to collect and manage data without involving human interaction [1]. Common to all the mentioned applications, an IoT application comprises four major segments [2]. The sensors and actuators with communication capabilities are the integral, mostly hardware-oriented part. The software segment consists of the user interface modules, and the connectivity and adaptability of the apps across different machine capabilities and interfaces [3]. IoT combines various technologies into a single intelligent unit. This intelligence in IoT networks is attributed to the amount and quality of data. IoT devices generate varied amounts of data at varied levels of scalability and time dependence, whether a point-of-sale terminal, distributed sensors or industrial machinery [4]. These data range from mission critical to custom business logic. Also, with the proliferation of organizations into IoT domains, enabled through hardware, software services and connectivity, the complexity of applications has increased. In addition to handling increased software complexity, IoT performance also depends on responsiveness and real-time digital services. Since the software is as important a part of an IoT application as its hardware, evaluating the performance of an IoT application becomes necessary [5]. Software metrics therefore need to be developed for IoT applications, in order to monitor IoT performance while fulfilling the following requirements:
• Able to monitor devices that run on different processor architectures.
• Able to monitor IoT applications written in different programming languages.
• Minimal monitoring overhead.
• Able to receive the data generated.
It is recognized that a software metric is a measure of quantifiable or countable characteristics of software. The main objective of any software metric is to analyze the product or process, determine its quality, suggest improvements, and predict when the software development process is over [6][7][8].
In this paper, we try to determine data quality in an IoT application on the basis of different metrics, namely validity, data freshness, completeness, usability and availability. These metrics have been studied using two separate data sets. The major focus of this paper is to verify these metrics on aggregated data sets instead of separate data values. For determining the performance of a software model, the predefined ISO/IEC 9126 software quality characteristics are used [9].

Role of Data and its Quality in IoT Applications
Analysis of IoT data can be helpful in many ways, such as optimizing operations and making efficient use of energy and resources. Combining IoT with data analytics is effective in real-time applications such as healthcare, telematics and smart cities [10]. A large number of real-time applications on the market depend fully on data for their functioning [11,12]; if data quality degrades, it has a negative impact on the end-user relationship. To implement any IoT-based application, a large number of sensors are deployed, which in turn generate a huge amount of data. This data generally falls under the category of big data, since it exhibits the big data properties of volume, velocity and variety. Data obtained from the sensors needs to be validated before it is used in any decision-making process [13]. IoT application domains such as utilities, mobile and enterprise also require continuous monitoring of the data, as any fault can lead to wrong analysis and thereby a decrease in end-user trust. Apart from affecting reasoning capability, faulty data also wastes energy, since for wireless sensor nodes every transmitted bit consumes energy [14]. In addition, IoT-based monitoring applications require sophisticated and customizable alerting capabilities, considering the highly decentralized network of devices and its inherent dynamics. The volume of telemetry data collected from IoT devices is affected by intermittent connectivity issues [15]. Traditional alerting tools therefore face challenges, which can be overcome to an extent by measuring the quality of the collected data. We are also aware that every IoT application is unique, with a specific functionality, and rarely uses off-the-shelf products [16]. Therefore, a generalized "one-size-fits-all" solution is not applicable.

Methodology for Analysis of Metrics
We consider "temperature" as a common attribute that forms a part of most applications, varying from personalized healthcare to environment monitoring. The data quality dimensions related to the parameter "temperature" are illustrated in Figure 1.
Whenever we apply the concept of IoT to studying environmental behavior in any geographical location, a large number of sensors are deployed [17][18][19]. The data obtained from these sensor nodes are subject to various issues such as hardware and software problems, poor connectivity and environmental effects [20]. (Figure 1 maps the quality dimensions to application types: availability, usability, accuracy and timeliness for monitoring; completeness and data volume for environment modelling; timeliness, accuracy and confidence for event-based systems.) The data samples obtained therefore need to be checked for faults, of which two types are commonly encountered:
1. Out-of-range faults occur when the data obtained from the sensors do not lie within an expected range of values. Faults of this type are generally detected by comparing data against a threshold limit, usually based on domain knowledge.
2. Stuck-at faults, or constant faults, occur when the data obtained from the sensor nodes show no change, or very little change, for a significant amount of time. The data behave as if frozen and show no variation.
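The two fault checks above can be sketched as follows; the range limits, window size and tolerance are illustrative assumptions, not values from this study:

```python
# Sketch of the two common sensor fault checks described above.
def out_of_range_faults(readings, low=-40.0, high=60.0):
    """Indices of readings outside the expected [low, high] interval."""
    return [i for i, r in enumerate(readings) if not (low <= r <= high)]

def stuck_at_faults(readings, window=5, tolerance=0.01):
    """Start indices of windows where readings stay (almost) constant."""
    faults = []
    for i in range(len(readings) - window + 1):
        chunk = readings[i:i + window]
        if max(chunk) - min(chunk) <= tolerance:
            faults.append(i)
    return faults

readings = [21.3, 21.5, 21.5, 21.5, 21.5, 21.5, 95.0, 22.1]
print(out_of_range_faults(readings))  # flags the 95.0 reading
print(stuck_at_faults(readings))      # flags the frozen stretch of 21.5s
```

In practice the range limits come from domain knowledge and the stuck-at window from the expected update rate of the sensor.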

Metrics Used
The methodology and quality metrics considered in this research are depicted in Figure 2. Our analysis starts with the number of updates and the maximum update value on a time-series framework. The metrics that depend on the frequency of updates received are data freshness, completeness and availability, as shown in Figure 2. Validity, on the contrary, is determined from the maximum update value of the parameter considered.

Validity
When we deal with data related to a pervasive environment, validity is defined on the basis of some practical considerations [21]. Various techniques are used to determine the validity of a measure. For example, a database application comprising student details can be validated by cross-checking against other available data sources [22]. In IoT applications, however, it is not easy to establish the correctness of data, since these data are generated over short spans of time. Therefore, for defining validity, we first need to define validity rules. A validity rule (V) can be expressed as a Boolean function: if rule V is satisfied by the data object, its value is 1, else 0. These rules are domain specific [23][24][25][26][27][28]. The two most common kinds of validity rules are:
a. Static rules, which are satisfied by verifying the data set itself. For example, the temperature in a polar region will be negative or around zero degrees Celsius.
b. Dynamic rules, which are used to verify changes in the data; for example, a drop of 10 degrees is not possible in half an hour.
Validity can also be expressed as a stochastic measure when it is evaluated for a particular observation period. Let n be the number of updates that have occurred and n_valid be the total number of valid instances; the probability of validity can then be expressed as:

Prob. Validity = n_valid / n    (1)

In research application areas, validity refers to how accurately a particular method measures a quantity. So, when we are dealing with data related to a pervasive environment, validity is defined on the basis of practical considerations that we explore in our work. The requirements of IoT applications are varied, and the data quality dimensions align accordingly. For example, timely data may not be accurate, while accurate data may not be complete or on time. These trade-offs force us to outline a metric system that can map the single attribute "validity" in terms of timeliness, completeness and usability [29].
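The rule-based view of validity and Eq. (1) can be sketched together; the particular rule limits below are illustrative assumptions, not this study's rules:

```python
# Minimal sketch of Boolean validity rules and the probability of
# validity, P = n_valid / n, from Eq. (1).
def static_rule(value):
    # Assumed plausible range for a daily maximum temperature (deg C).
    return -10.0 <= value <= 50.0

def dynamic_rule(prev, curr):
    # Assumed limit: a jump of more than 10 degrees between
    # consecutive updates is implausible.
    return prev is None or abs(curr - prev) <= 10.0

def probability_of_validity(updates):
    n_valid, prev = 0, None
    for v in updates:
        if static_rule(v) and dynamic_rule(prev, v):
            n_valid += 1
        prev = v
    return n_valid / len(updates)

print(probability_of_validity([30.0, 31.0, 55.0, 32.0, 18.0]))  # 2 of 5 valid
```

Each rule returns a Boolean (1/0), and the metric is simply the fraction of updates satisfying all rules over the observation period.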
We assume IoT monitoring applications that employ several sensors collecting temperature data for storage and analysis. For ease of elaboration, our study focuses on the 'maximum' temperature. Since we consider data for an entire year, we also focus on evaluating metrics for aggregated data. We have considered maximum temperature as the measure for determining validity. To decide the validity of a data set, we first need to calculate the update_value, which serves as an important entity for validating any data. To obtain the update_value, we followed these steps:
• Step 1: Obtain the mean of the maximum temperature for each month and calculate the standard deviation (SD) for the data set. On the basis of these values, the max update_value for each month was determined, as depicted in Fig. 3. From the figure we see two discrepancies in the graph, in the months of February and July, which should ideally be aligned as they belong to the same period and the same geographical location. The spikes obtained in the graph for these months are due to the significant difference between the maximum update_values of the two data sets, which creates doubt regarding the correctness of the data. Thus, any such irrelevance is an anomaly.
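Step 1 above can be sketched as follows; the data are illustrative, not the study's data sets:

```python
# Per-month mean and standard deviation of the maximum-temperature
# updates, from which a max update_value per month is taken.
from statistics import mean, stdev

def monthly_stats(updates_by_month):
    """updates_by_month: dict month -> list of max-temperature updates."""
    stats = {}
    for month, values in updates_by_month.items():
        mu = mean(values)
        sd = stdev(values) if len(values) > 1 else 0.0
        stats[month] = {"mean": mu, "sd": sd, "max_update": max(values)}
    return stats

data = {"Jan": [12.0, 14.0, 13.0], "Feb": [15.0, 40.0, 16.0]}
for month, s in monthly_stats(data).items():
    print(month, s)
```

A month whose max update_value lies far from its mean relative to the SD (like the assumed 40.0 in "Feb" here) is the kind of spike that Fig. 3 surfaces as a candidate anomaly.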
To determine validity, we used weather data [32] and our algorithm compared the maximum temperature values. We then isolated the particular day in each month where the maximum update was found, and compared the maximum temperature for that day with the reading from the reference website. Where there was a considerable difference, we calculated the average and standard deviation of those differences. As stated earlier, to determine validity we first need to find the validity rules that hold true (1) or false (0); they are represented as Boolean values.

Result 1
To determine the probability of validity, we used the formula in Eq. (1), applied to the two data sets as shown in Tables 1 and 2. The probability of validity obtained for the two data sets was 75% and 67% respectively, as mentioned in Table 3, which shows that data set 1 has a higher chance of validity than data set 2.

Data Freshness
The timeliness of data processing has also been referred to in terms of "perishable insights": if data are not collected and analyzed within real-time constraints, they no longer remain valid and become unusable. Data freshness thus becomes one of the most important attributes for measuring data quality. It is an especially beneficial measure for IoT applications, due to the heterogeneity of data sources, high interoperability and the rapid generation of data from the sensors. Playing a major role in the analysis of data, it mainly refers to how often new data are obtained.
As we have considered two separate data sets, we obtained the number of updates occurring in each month for both data sets, shown in Fig. 4. Through this figure, we can clearly observe the number of updates occurring in each month for the two data sets. It was observed that in the months of March, August and October, the difference in update frequency between the data sets was quite significant. Since the validity of data also depends upon the frequency of updates occurring in the data set, i.e. the more updates, the more valid the data, a guarantee of data freshness is thereby established.
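The per-month update counts underlying this comparison can be computed with a small sketch (illustrative timestamps, not the study's data):

```python
# Count updates per month for a data set and compare two data sets
# to flag months with notably different update frequencies.
from collections import Counter

def updates_per_month(timestamps):
    """timestamps: list of (month, day) tuples, one per update."""
    return Counter(month for month, _ in timestamps)

set1 = [("Mar", d) for d in range(1, 29)] + [("Aug", d) for d in range(1, 31)]
set2 = [("Mar", d) for d in range(1, 20)] + [("Aug", d) for d in range(1, 26)]
c1, c2 = updates_per_month(set1), updates_per_month(set2)
diff = {m: c1[m] - c2[m] for m in c1}
print(diff)  # months where data set 1 received more updates
```

Months with a large positive difference correspond to the freshness gaps visible in Fig. 4.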

Result 2
From Fig. 4 we see that for the majority of months, the number of updates is higher for data set 1 than for data set 2. Thus, data freshness was found to be greater in data set 1 than in data set 2.

Availability
Data availability refers to ensuring that data is available to the user when it is required [33,34]. As we consider a pervasive environment, we define availability with respect to an observation period and the intervals between updates.
Availability = (1/OP) × Σ_{i=0}^{n} min(t_{i+1} − t_i, T_exp)    (2)

where n is the number of updates occurring in a month, t_i is the time at which an update occurred and t_{i+1} is the time at which the next update occurred. OP refers to the observation period (30 days in our case), while T_exp is the expiration time for evaluating the availability of the data and is user dependent. As we can clearly observe from Eq. (2), availability is directly proportional to the number of updates: the larger the value of n, the larger Σ_{i=0}^{n} (t_{i+1} − t_i) becomes. Also, since the measure covers all the updates occurring for a given T_exp, it accommodates every update, and the contributions are additive in nature. A value near 0 indicates that very few updates have taken place in the data set, which is not good for any data set.
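The availability computation can be sketched as follows, under the assumption (our reading of Eq. (2)) that each update contributes the interval to the next update, capped at the expiry time T_exp, and the last update runs to the end of the observation period:

```python
# Hedged sketch of Eq. (2): coverage of the observation period by
# non-expired updates, normalised by the period length OP.
def availability(update_times, op=30.0, t_exp=2.0):
    """update_times: sorted update times (days) within the period."""
    covered = 0.0
    for i in range(len(update_times)):
        if i + 1 < len(update_times):
            gap = update_times[i + 1] - update_times[i]
        else:
            gap = op - update_times[i]  # last update runs to period end
        covered += min(gap, t_exp)     # an update expires after t_exp
    return covered / op

print(availability([0, 1, 2, 5, 10], op=30.0, t_exp=2.0))
```

With this formulation, more frequent updates raise the covered fraction, and a smaller T_exp (stricter freshness requirement) lowers it, matching the behaviour of the two expiry times plotted in Fig. 5a, b.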
Result 3
The performance of the availability metric for the two data sets at two different expiration times can be viewed in Fig. 5a, b. The availability metric obtained using Eq. (2) for the two data sets was plotted for two different expiration times. The two curves in both plots demonstrate the calculated availability over an entire year for the two data sets. Apart from the months of November and May, the two data sets behaved similarly. The deviation in those months is due to the difference in the update intervals of the two data sets, showing that similar updates were not registered under the same set of conditions; they can hence be treated as anomalous if accuracy is the determining criterion. The results of computing availability via Eq. (2) for each month, for two different expiry times and both data sets, are shown in Tables 4 and 5. From the availability metric in Fig. 5a, b and Tables 4 and 5, it was concluded that data set 2 is better than data set 1 in terms of availability.

Completeness
For the contextual and representational quality of data, concise representation, completeness, value added and relevancy are most important. Completeness generally refers to the property of a system that does not suffer from data loss [35]. It defines whether the data source is capable of providing all the information that it advertises and that the application requires [36]. Completeness can be evaluated at different levels. Mathematically, it can be defined as:

Completeness = 1 − (no. of incomplete items / total no. of items)    (3)

In our study, completeness is directly proportional to the number of updates: with a greater number of updates, the chance of missing updated data is lower. As we compute completeness annually, it depends on the update counts of the individual months and is calculated as a relative term with respect to the maximum number of updates found in the yearly data set, i.e.

Completeness_rel = (1/12) × Σ_{i=1}^{12} (u_i / u_max)    (4)

where u_i is the number of updates occurring in month i and u_max is the maximum number of updates occurring in any month.
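A minimal sketch of the relative completeness measure of Eq. (4), assuming it averages each month's update count u_i against the busiest month's count u_max (the exact formula is our reading of the description above; the counts are illustrative):

```python
# Relative completeness of a yearly data set: each month's update
# count u_i is normalised by the maximum monthly count u_max.
def relative_completeness(updates_per_month):
    """updates_per_month: 12 monthly update counts."""
    u_max = max(updates_per_month)
    return sum(u / u_max for u in updates_per_month) / len(updates_per_month)

counts = [28, 30, 25, 30, 27, 29, 30, 26, 28, 30, 24, 29]
print(f"{relative_completeness(counts):.0%}")
```

A data set whose every month matches the busiest month would score 100%; months with missing updates pull the measure down.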

Result 4
For evaluating the completeness of the data sets, the formula in Eq. (4) was used. The number of updates obtained for each month can be observed in Table 6. Using these update values, it was found that data set 1 had a relative completeness of 75% while data set 2 had 69%. Thus, it was concluded that data set 1 is more complete than data set 2.

Usability
Data may be incomplete by design or due to operational requirements. Moreover, when data quality is measured on the basis of context, the same data may be needed for multiple tasks with different required characteristics. Depending on application requirements, the same data representation may be rendered useless if the application requires aggregates or fields of data that do not exist. This causes poor relevancy to the application, and the data is considered incomplete for analysis, thereby reducing its usability. Usability has also been correlated with the interpretability and ease of understanding of data, which may require expert opinion due to enormous volume or simply a lack of common interpretability. We measure the usability of data by measuring the deviation of individual values from their aggregates. The degree of aggregation and the variation in their representations have been computed as the maximum tolerable anomaly, which is represented in Table 7.
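The usability check can be sketched as follows; the ±2-degree tolerance is an assumed stand-in for the interval limits of Table 7, and the values are illustrative:

```python
# Usability as the fraction of entries whose deviation from a
# reference stays within a maximum tolerable interval.
def usability_rate(entries, reference, tolerance=2.0):
    """entries, reference: paired readings for the same days."""
    usable = sum(1 for e, r in zip(entries, reference)
                 if abs(e - r) <= tolerance)
    return usable / len(entries)

entries = [20.0, 21.5, 30.0, 19.0]
reference = [20.5, 21.0, 24.0, 18.5]
print(f"{usability_rate(entries, reference):.0%}")
```

Entries whose deviation exceeds the tolerable interval (the assumed 30.0 vs 24.0 pair here) are excluded from the set of valid entries, lowering the usability rate.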

Result 5
To decide usability, a month was selected randomly (February in our case). All the entries of that month were compared with the entries in Ref. [21] for the same month. The set of valid entries was determined by the interval limits mentioned in Table 7. Usability rates of 69% and 86% were found for data sets 1 and 2 respectively, which shows that data set 2 had better usability than data set 1.

Conclusion
Our research work presents an approach for studying data quality characteristics specific to IoT applications. We present a method to measure sensor data quality, which can in turn be correlated with estimating accuracy. We also show that data sets can be compared through metrics and that, depending on the IoT application requirements, end users can choose an appropriate data set for further analysis. In this work we define metrics for measuring the data quality of an IoT application that uses weather data, and we show the significance of the analysis. Existing IoT applications depend fully on the accuracy of sensor-generated data for correct decision making. Since such data falls under the category of big data, validating it is not an easy task; nor is there a unique solution for validating the data, because the data vary from application to application. We therefore present methods to determine data validity in a uniform manner. In addition, contrary to earlier research works, we propose data quality metrics for aggregated data. We used five metrics, namely validity, data freshness, availability, completeness and usability, for observing data quality on a collective basis. The results show that the validity and freshness metrics for data set 1 exceed those for data set 2, while availability is better for data set 2. We also show that the completeness and usability of collected data may vary, but are always aligned with the validity of the data: more valid data will be more complete, more usable, more available and more up to date. Through this work, data accuracy for IoT applications can be estimated using the defined metrics. Since IoT is a major part of Industry 4.0, the validity of data is essential for the effective establishment of any IoT-based business model. This approach can therefore be used to verify data sets, so that decisions made by analyzing the data are correct.
In future we aim to implement other context-based metrics on data obtained from IoT based applications.