Characteristic analysis of fingerprint datasets from a pragmatic view of indoor localization using machine learning approaches

Because they are available on commercial smartphones, WiFi, Bluetooth, and magnetometers are commonly used for indoor localization, as indoor spaces are GPS-deprived. Indoor localization falls into the category of data-intensive applications. In this domain, most recent solutions apply machine learning (ML) and deep learning (DL) techniques to the data collected through these sensors. However, the publicly available benchmark datasets on indoor localization suffer from issues that require complicated, customized preprocessing for each dataset before a common ML/DL technique can be applied. Thus, a fair comparison of ML/DL methods across indoor localization datasets, and hence a check of whether a solution generalizes across different indoor regions, becomes infeasible. In this comparative study, we investigate three key challenges of fingerprint datasets that should be addressed for real-life localization applications, namely (i) repetitive site survey, (ii) device heterogeneity, and (iii) granularity and subregion-specific performance variation. To demonstrate how these attributes might impact localization performance, experimental analysis is performed using five benchmark datasets. The novelty of the work is that it not only highlights the challenges but also analyzes, through implementation results, the feasibility of possible future directions to address them. In particular, the application of a generative adversarial network to the issue of repetitive site surveys is formulated and discussed with implementation results.


Introduction
In the modern period, the needs of a tech-savvy society have been met in many different ways by a variety of location-based services (LBS) in fields such as health care [1], user navigation [2], and smart cities [3]. The Global Navigation Satellite System (GNSS) handles the acquisition of a geographic location outdoors successfully, but its signal cannot pass through the obstacles in a complicated indoor region. However, urban citizens spend most of their time indoors for their daily activities and business. Fortunately, the popularity of smart lifestyles among consumers pushed makers of commercial smartphones to incorporate several sensors into their products that were later discovered to be useful for location prediction in enclosed indoor spaces. Some examples of the applications of indoor localization in LBS are summarized in Table 1.
For indoor localization, smartphone-based technologies such as WiFi, Bluetooth, and inertial sensors are frequently employed. For location prediction in an indoor context, distance-based [12,13] or angle-based [14,15] techniques were applied in the past. Multipath propagation of the WiFi or Bluetooth signal due to varied obstacles is a significant issue with these systems [16]. A far more recent technique is fingerprinting, which uses a collection of vectors made up of signal strength measurements from different fixed sources taken at a certain location to uniquely identify that particular location. Along with WiFi and Bluetooth, the magnetometer (one of the inertial sensors) plays a major role in identifying locations, since the strength of the earth's magnetic field varies depending on the location [17].
The fingerprint-based localization method needs to address a few issues in order to be successful. Users may visit a location at different times of the day, in a crowded or empty setting, while carrying smartphones with varying configurations, holding them at varied angles, and so forth. All of these dynamic factors must be taken into account to determine the user's location accurately. This is where machine learning (ML) is required. ML can automatically learn dynamic feature values and their nonlinear relations, which makes the localization system more adaptable [18]. Researchers have employed several machine learning and very few deep learning (DL) techniques in recent indoor localization works [19,20]. ML and DL also have great potential to lead the way from indoor localization to indoor-outdoor detection and seamless navigation [21]. The amount of labeled data needed to apply ML or DL is substantial, though. It takes a lot of time and effort to manually collect a significant number of well-annotated data points for each location across a vast experimental region. Therefore, repetitive site survey is a major research challenge for the commercial adoption of fingerprint-based indoor localization applications.

Table 1 Examples of the applications of indoor localization in LBS
Application area | Reference | Year | Application
Healthcare services | [4] | 2015 | Tracking patients in hospital
Healthcare services | [5] | 2016 | Automatically calling nurses
Healthcare services | [6] | 2018 | Monitoring elderly people and fall detection
User tracking and navigation | [7] | 2017 | Estimating users' location based on activity recognition
User tracking and navigation | [8] | 2018 | User tracking in emergency situations
User tracking and navigation | [9] | 2022 | User tracking for smart energy management
Crowd detection/sensing | [10] | 2016 | Crowd sensing based localization by calibrating user locations using landmarks
Crowd detection/sensing | [11] | 2021 | Maintaining social distance based on indoor localization
Although it is a promising area of research, we were unable to discover any study that enumerated the key features of fingerprint datasets. Nor could any work be found in the literature that addressed the problem of repetitive site survey w.r.t. the available benchmark datasets. This motivated us to conduct this survey. In this paper, five benchmark datasets are considered, consisting of fingerprint data from WiFi, Bluetooth, and inertial sensors. With a brief description of each dataset, important characteristics and affecting factors are identified. The effects of these characteristics on localization performance are discussed with experimental results from the perspective of related research challenges such as context heterogeneity, device heterogeneity, and subregion-specific performance variation. How a generative adversarial network (GAN) can be applied to augment collected fingerprint data to address the challenge of repetitive site survey is analyzed from an implementation perspective along with experimental results. Consequently, the contributions of this paper are summarized as follows:
• Important characteristics and affecting factors are highlighted concerning the WiFi-, Bluetooth-, and magnetic field-based fingerprint data for their importance to location estimation in an indoor environment.
• A comparative analysis of five benchmark datasets has been conducted from the aspect of machine learning-based data analysis. The need for customization of the preprocessing techniques for each of them is presented.
• The impact of the generative adversarial network model in indoor localization, especially to solve the problem of repetitive data collection tasks, has been presented with implementation results.
The rest of the paper is structured as follows. Section 2 provides an overview of some recent indoor localization works based on low-cost technologies frequently used in commercial smartphones. Section 3 briefly presents the overview, important characteristics, and the preprocessing techniques required for each dataset. The machine learning techniques that are frequently used to assess indoor localization strategies are described in Sect. 4, including the workflow of data augmentation using GAN. The experimental findings presented in Sect. 5 show how localization performance varies with dataset characteristics and data augmentation. Finally, in Sect. 6 the core insights obtained from the analysis and experimentation are summarized, followed by a conclusion in Sect. 7.

Existing indoor localization works based on low-cost technologies
This section provides a summary of the localization technologies popularly employed by recent works. These technologies are widely available and do not require expensive equipment. The location estimation is done by applying ML and/or DL techniques, detailed as follows.

Wireless fidelity (WiFi)
The majority of indoor public spaces have WiFi infrastructure. Numerous localization strategies have been put forth by researchers using WiFi routers or access points (APs) as fixed sources. Subregions (such as a room, stairwell, or other specified zones) are found to be clearly identifiable when received signal strength indicator (RSSI) values are gathered over a large experimental region. The reason is that RSSI values received close to an AP are significantly higher (typically, −30 to −40 dBm) than those received at a distance (as low as −90 to −100 dBm). Since a subregion covers a significant area, it is possible to distinguish between two or more subregions using the RSSI values obtained from various APs. Utilizing this property, researchers have proposed room-level localization from crowdsourced data using adaptive k-means clustering [22]. Even when experiments are conducted using various device configurations, as shown in [23], localization accuracy is anticipated to be adequate. Here, a normalized support vector machine (SVM) was used to classify the various rooms of a shopping mall. The researchers in [24] suggested a weighted ensemble method to accomplish localization with respect to varying granularity. Additionally, the system evaluation took place in a different context than the training. The choice of significant features can be used to deal with the presence or absence of humans as environmental artifacts [25].

Bluetooth low energy (BLE)
Bluetooth is one of the successful short-range wireless technologies. According to Bluetooth standard 4.0, the short-range data transfer distance can reach up to 100 m and the battery-driven beacons can work for approximately 1-2 years [26]. Traditionally, Bluetooth took a significantly long time for scanning, but the newer protocol, Bluetooth low energy (BLE), has overcome this limitation [27,28] while consuming a minimal amount of power, motivating researchers to utilize it in many domains including indoor localization systems. To identify a specific location point, RSSI data is gathered from various fixed BLE beacon sources at that location to form the fingerprint. Researchers in [29] tested two transmission powers, 0 dBm and 4 dBm, and discovered that transmission at 4 dBm yields more accurate localization than at 0 dBm, at the cost of higher power consumption. Similar to WiFi, localization precision varies with the BLE signal. This problem was looked into in [30] using rooms of various sizes as experimental regions. The goal was to use a fuzzy logic system to choose the most appropriate localization algorithm based on the size of a particular room and a few other factors. Researchers in [31] proposed a crowdsourced localization system based on BLE fingerprint data to reduce the expense of collecting labeled data. Their goal was to detect online attacks, keeping the system running with legitimate beacons while blocking malicious ones, and to identify fake fingerprints using a BERT model during the database update phase. Since active user tracking systems tend to be highly intrusive, making these BLE-based active tracking systems less intrusive and more user-friendly has been the focus of many research efforts [32,33].

Magnetometer
An accelerometer, a magnetometer, and a gyroscope make up the inertial sensors. The majority of commercial cellphones and other wearable smart devices have them. The user's movement pattern is detected by the accelerometer and gyroscope, while a location point can be identified by the magnetometer. However, it is questionable how much accuracy can be attained by solely magnetometer-based indoor localization, because moving items made of ferromagnetic materials and other electrical gadgets may generate noisy fingerprints [34]. Researchers in [34] observed that even with the use of an accelerometer to determine the direction of gravity, the magnetometer's magnetic north is unknown. Therefore, the tri-axial intensity and device inclination values play a major role in location estimation. After three months, the researchers repeated their experiments at the same location and noticed a variation in the intensity values. Intriguingly, the inclination values were steady, and the cross-correlation coefficients for total intensity and inclination were found to be more than 0.99 and 0.97, respectively. As a result, when localizing using magnetic field data, inclination is crucial. However, researchers have discovered that the performance of magnetic field-based localization in varying contexts is also acceptable [35]. An orientation-free localization approach was proposed in [36] using a CNN trained on hybrid fingerprint images of WiFi and magnetic field signals.
Table 2 presents a brief summary of the recent indoor localization approaches based on low-cost technologies. It can be observed that the granularity of the localization and the context of the users are the prominent challenges addressed by the literature. The devices used, and most importantly how they were carried by the users, also played a significant role. Interestingly, a single challenge is addressed by different methods depending on the dataset characteristics, as shown in the table. This leads us to explore the characteristics of the benchmark datasets of indoor localization technologies, since data plays the major role in any ML- or DL-based approach.
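As a rough illustration of the two magnetic quantities discussed above, the following sketch computes total intensity and inclination from one tri-axial reading. It assumes, as a simplification not stated in the source, that the reading has already been rotated into a frame whose z axis is vertical (for example, by using the gravity direction from the accelerometer); the function name is hypothetical, and the sign of the angle depends on whether z points up or down.

```python
import math

def intensity_and_inclination(bx, by, bz):
    """Return (total intensity, inclination in degrees) for one sample.

    Assumption: (bx, by, bz) is expressed in a frame whose z axis is vertical.
    """
    horizontal = math.hypot(bx, by)                  # strength in the horizontal plane
    total = math.sqrt(bx * bx + by * by + bz * bz)   # total field intensity
    inclination = math.degrees(math.atan2(bz, horizontal))
    return total, inclination
```

Because inclination is a ratio-based angle, it is unaffected by a uniform scaling of the three axis values, which is one plausible reason it can remain steadier than raw intensity.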

Overview of indoor localization system (ILS) datasets and associated preprocessing techniques
The machine learning formulation of any research problem goes through three major steps: (i) data collection, (ii) data preprocessing, and (iii) classification/regression. Data can be collected by dedicated volunteers, from the crowd, or, The importance of the preprocessing techniques and the problem of selecting different preprocessing strategies for different ILS datasets is also discussed.

Preprocessing techniques for indoor localization fingerprint datasets
If a location point can be distinguished at a time instant by the RSSI vector obtained from fixed signal sources (such as WiFi/BLE), then that RSSI vector can be regarded as a fingerprint having the location point as its class label. However, the fingerprints could be noisy, some RSSI values could be missing, and so on. Below is a brief description of a few preprocessing methods that are frequently used on fingerprint datasets.

Close-pack related data
Depending on the various data collector applications that are used by the volunteers, the raw dataset can be divided up and kept in separate files. The data needs to be reorganized to meet the requirements of the localization problem at hand. For instance, there could be separate files consisting of the fingerprints for each location point (class label). These should be merged into a single file having a well-designated format of feature attributes along with the corresponding class labels. While merging, beacons from some transmitter may not have been received at a given location point. This leads to the following challenge of how to interpret such missing entries.
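The merging step can be sketched as follows. This is a minimal illustration under assumed data structures (one list of scans per location label, each scan a mapping from a hypothetical source id to its RSSI); it is not the format of any of the benchmark datasets. Sources that were not heard in a scan are left as `None`, to be interpreted in a later step.

```python
def merge_location_files(per_location):
    """Merge per-location scans into one table with a shared column layout.

    per_location: {label: [{source_id: rssi, ...}, ...]}  (assumed structure)
    Returns (header, rows); unheard sources are stored as None.
    """
    # The union of all transmitters seen anywhere defines the feature columns.
    sources = sorted({s for scans in per_location.values()
                      for scan in scans for s in scan})
    rows = []
    for label, scans in per_location.items():
        for scan in scans:
            rows.append([scan.get(s) for s in sources] + [label])
    return sources + ["label"], rows
```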

Interpretation of missing values
The WiFi or Bluetooth signal from each of the fixed sources may not reach each and every location point. As a result, at certain locations, the data collector might not receive any RSSI, which would result in a missing entry. Since a dataset with null entries cannot be fed to an ML/DL model, a dummy value indicating a negligible signal strength is chosen and placed in every instance that contains a missing entry. This problem is unlikely when trail-based localization takes place through accelerometers.
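A minimal sketch of this substitution is shown below, assuming missing entries are represented as `None` in each fingerprint. The function name is hypothetical, and the dummy value is deliberately a required argument because it is dataset-specific.

```python
def fill_missing(rows, dummy):
    """Replace missing RSSI entries (None) with a dataset-specific dummy value."""
    return [[dummy if value is None else value for value in row] for row in rows]
```

For example, `fill_missing(rows, dummy=-200)` marks unheard sources with a value far below any realistic RSSI reading.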

Noise removal
The WiFi and BLE signals are prone to environmental noise, especially in crowded environments. WiFi hotspots are also sources of errors for WiFi fingerprints. Sometimes, the transient presence of nearby ferromagnetic materials influences the fingerprints, especially the magnetometer readings. Thus, suitable filters could be applied for noise removal. Identification of outliers is also a challenging issue. The presence of such data points could be confusing for the ML/DL techniques.

Divide data based on requirements
Data is frequently gathered across a large indoor area, including multiple floors of different buildings. A common localization system trained on the entire dataset may perform at different efficiency levels in different portions of the indoor region because the structural properties of each building vary from place to place. In order to clearly understand how effective the localization system is at various locations throughout the region, the data may be divided into subgroups. Additionally, the system could be trained considering various device-specific variants using data collected from different smartphones and/or smartwatches.

Feature extraction
After the initial steps of data cleaning and preparation, the data instances are often segmented for feature extraction and selection, as discussed below.
Feature extraction is the process of deriving a meaningful representation of the data. For time-series data, instances are grouped into segments, and statistical functions such as mean and standard deviation can be applied to the individual segments to form the feature set. For WiFi and BLE datasets, the RSSIs are treated as features themselves, and thus, data segmentation is not performed. However, for inertial sensors, features are often extracted.
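The segment-wise statistics described above can be sketched as follows. The non-overlapping window, the choice of population standard deviation, and the function name are assumptions made for illustration; overlapping windows or other statistics are equally possible.

```python
from statistics import mean, pstdev

def segment_features(samples, window):
    """Split a 1-D time series into non-overlapping windows and
    return a (mean, standard deviation) pair for each full window."""
    features = []
    for start in range(0, len(samples) - window + 1, window):
        segment = samples[start:start + window]
        features.append((mean(segment), pstdev(segment)))
    return features
```

Any trailing samples that do not fill a complete window are dropped in this sketch.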
Each fixed source in a WiFi or Bluetooth fingerprint dataset contributes a feature. However, some sources might not be helpful because they frequently offer little or no signal to the area. Depending on the problem definition, this absence of signal could itself be a useful feature, or a less useful one that can be removed. So, a feature should be selected for classification only if it helps to distinguish a class from the others. There is a vast literature on feature selection techniques. For ILS, mostly filter-based techniques are employed. For instance, the importance of the APs is determined by mutual information gain in [37].
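To make the filter-based idea concrete, the sketch below computes the mutual information (in bits) between one discrete feature and the class labels. It is a generic textbook computation, not the procedure of [37]; discretizing a source as "heard" (1) or "not heard" (0) is an assumption made here for simplicity.

```python
import math
from collections import Counter

def mutual_information(feature, labels):
    """Mutual information in bits between a discrete feature and class labels."""
    n = len(labels)
    joint = Counter(zip(feature, labels))
    feature_counts = Counter(feature)
    label_counts = Counter(labels)
    mi = 0.0
    for (f, l), count in joint.items():
        # p(f, l) * log2( p(f, l) / (p(f) * p(l)) ), written with raw counts
        mi += (count / n) * math.log2(count * n / (feature_counts[f] * label_counts[l]))
    return mi
```

A source whose presence pattern is independent of the location label scores 0 and is a candidate for removal, while a source heard only in one subregion scores high.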

Overview, analysis, and preprocessing of selected datasets
For comparative research, five benchmark indoor localization datasets have been chosen based on low-cost technologies that are commonly utilized in previous studies.Their specifications are listed in Table 3.These five benchmark datasets were chosen from a pool of accessible benchmark datasets because they are distinctly different from one another and enable exploration of a wide range of issues that are typically present in other localization datasets as well.These datasets span a wide range of differences, including those related to technology, sample count, device heterogeneity, ambient circumstances, and so on.After a brief overview, each dataset is analyzed with statistical measurements that help in the preprocessing before feeding them to ML methods.

Dataset_1: UJIIndoorLoc Dataset
This is one of the first publicly available WiFi fingerprint datasets, used by many researchers [38]. RSSI fingerprints were collected by 20 subjects using 25 devices from multiple floors of three buildings. Location ids were extracted from three different features: building id, floor id, and space id. These were merged together to form the class label. For each room, data were collected at the door and inside the room. The device and user heterogeneity of the training dataset is represented in Fig. 1. As mentioned in Table 3, there are 19,937 samples in the training dataset, most of which were collected using Device 13 and Device 14, as indicated in Fig. 1. User 1 and User 11 are reported to provide a large share of the samples in the training data. This indicates two aspects for data analysis: (i) while investigating device heterogeneity, this imbalance should be normalized, and (ii) the crowded locations could be identified. RSSI values were received within a range of −104 dBm to 0 dBm from 520 APs at 933 location points. Missing values were replaced by a dummy value, 100. Each sample represents the real-world longitude-latitude coordinates along with the floor number, user id, phone id, and timestamps. The sample distribution for different floors of different buildings is represented in Fig. 2. Since WiFi APs are installed not for indoor localization but for providing network connectivity, not all floors would require an equal distribution of APs, depending on the structural properties and usage of the building. Thus, the minimum and maximum RSSI received at different locations across the floors would vary too.
The WiFi signal behavior as captured in the dataset w.r.t. time and devices is summarized in Fig. 3. In Fig. 3a, we can observe that the RSSI levels of an AP may vary for the same location with change in time. The variation is prominent for AP1, AP3, and AP4. The timestamps t1 to t8 shown in the figure are consecutive timestamps. Again, in Fig. 3b, we can observe how the RSSI levels of each AP differ for the same location point when data is collected with different devices. Thus, to make a localization system work efficiently, the training data must include the possible variations of RSSI values for a specific location from a specific AP, collected through multiple common device configurations. Otherwise, the system may not be able to identify the location effectively during certain times of the day. Since the RSSI values were recorded over many days and at different times of the day, it is important to detect the stable APs that were active for a long duration. The signal coverage area of a specific AP depends on various issues, including the strength of the AP, the number of receiving devices, and obstacles present on the path of signal propagation. Among the 172 APs, 42 APs are found to be strong enough to cover a minimum of 10% of the whole training data samples, i.e., 2390 data samples, as represented in Fig. 5. Three APs (AP003, AP006, AP007) are found to be strong enough to cover almost all of the location points, though variations of the RSSI values across the location points are expected. If a certain AP is damaged in a running localization system, this analysis indicates, based on the importance of the damaged AP, whether the system's performance can be maintained, so that action may be taken accordingly.
The RSSI levels at a specific location vary within a certain range due to the effect of different device configurations and other ambient situations. The variation range is usually less than 10 units, as shown in Fig. 6, but there can be exceptions. For example, the RSSI level of AP004 is less than −70 dBm

Dataset_3: BLE Dataset
Researchers built a Bluetooth-based dataset [3] for their experimentation on a semi-supervised learning approach. They collected labeled and unlabeled data from an area of 200 ft × 180 ft from 13 fixed iBeacon sources. Consecutive iBeacons were situated at a distance of 30-40 ft, and the region was divided into grid cells of size 10 ft × 10 ft. There are 1420 labeled and 5191 unlabeled data samples. In our experiment, we have used the labeled data samples only, since only the base classifiers are applied.
The dataset has been divided into various granularity levels for evaluation. At the finest granularity level, the complete region is divided into 16 location points, and the corresponding sample distribution is represented in Fig. 7a. Since the range of the Bluetooth signal is short, each beacon covers a limited region around its fixed position. As a result, there are many missing values (interpreted with the dummy value −200 dBm), and the number of samples for which RSSI is received from the respective beacon is really small, as shown in Fig. 7b.
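One simple way to obtain coarser granularity levels from grid-based labels is to merge neighboring cells. The sketch below assumes, hypothetically, that each sample is labeled with a (row, column) grid cell; the source does not specify how its granularity levels were constructed, so this only illustrates the general idea.

```python
def coarsen_label(cell, factor):
    """Map a fine grid cell (row, col) to the coarser block that contains it."""
    row, col = cell
    return (row // factor, col // factor)

def relabel(samples, factor):
    """Relabel (features, cell) samples at a coarser granularity."""
    return [(features, coarsen_label(cell, factor)) for features, cell in samples]
```

With `factor=2`, every 2 × 2 group of fine cells becomes a single class, which reduces the number of classes and increases the number of samples per class.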

Dataset_4: Shopping Mall Dataset
A multi-sensor dataset for indoor localization was obtained from the competition "Indoor Location and Navigation" hosted by Microsoft and announced on the Kaggle website. The dataset contains WiFi and Bluetooth RSSI values and inertial sensor readings in text format. A large portion of the data was collected from WiFi. We extracted the WiFi data and stored it in Excel format for analysis. The WiFi data of five buildings has been used in our experiment. The total number of data samples for each individual site is represented in Fig. 8a. It can be observed that Site_5 contains the largest number of samples among the five sites. The number of stable APs per site is represented in Fig. 8b. Here, APs providing signal to at least 70% of the total number of samples are identified as stable. Other APs, which provide signals to very few data samples, have been removed before experimental analysis. Although the quantity of samples will be higher due to the larger coverage area, there will not be many APs that are stable throughout a large building. This characteristic is prominently reflected for Site_5.
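The stability criterion above can be sketched as a simple coverage filter. The function name and the representation of fingerprints as equal-length lists are assumptions; `missing` is whatever dummy value marks an unheard AP in the dataset at hand.

```python
def stable_aps(rows, missing, min_coverage=0.7):
    """Return the column indices of APs heard in at least `min_coverage`
    of the samples. `rows` is a non-empty list of equal-length RSSI vectors."""
    n_samples = len(rows)
    stable = []
    for j in range(len(rows[0])):
        heard = sum(1 for row in rows if row[j] != missing)
        if heard / n_samples >= min_coverage:
            stable.append(j)
    return stable
```

Lowering `min_coverage` (for example to 0.1) gives the kind of per-AP coverage analysis used to judge how important an individual AP is to the system.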

Dataset_5: IPIN 2016 Dataset
In [40], researchers proposed a dataset for indoor positioning and indoor navigation. In general, localization and/or positioning is mostly based on WiFi or Bluetooth technology, whereas indoor navigation strongly involves inertial sensors. This dataset comprises both. WiFi and geo-magnetic field fingerprints were collected with respect to distinct location points using smartphones and smartwatches. The direction or path of the moving user can be obtained from the inertial sensor data, and localization can be done based on WiFi data for static positions. Also, location points can be classified based on inertial sensor data, as the geometric position has an impact on the three-dimensional magnetic field values. Place IDs and collected signal values were stored in different files. By mapping the timestamps, the collected sensor values were first stored with their respective place IDs in csv format. After that, the magnetic field data were used to identify separate location points. The average magnetic field strengths along the three axes with respect to different consecutive place IDs are represented in Fig. 9. Sometimes, a sudden spike in the positive or negative direction can be observed, usually caused by the presence of some object sensitive to the magnetic field. For better interpretation of the data, inclination information is also considered along with magnetic intensity values, as detailed with experiments in Sect. 5.
Thus, it can be observed that different datasets required different preprocessing strategies to prepare the data for further analysis. Close-packing related data is found to be a common preprocessing approach for dataset preparation, whereas missing value interpretation is particularly important for WiFi and BLE fingerprint data. Careful selection of device inclination features may increase the accuracy of inertial sensor-based analysis, and merging consecutive locations to consider coarse granularity has the capability to increase accuracy for various kinds of fingerprint-based analysis. When developed with these findings in mind, a localization system could be more useful and generalized for use in common public indoor spaces.
Fig. 9 Dataset_5: average magnetic field strengths at different location points
A brief background on the major ML classification approaches used in the literature for indoor localization is discussed with their impacts on indoor localization works in the following section.These are also selected for our ML-based analysis of the five benchmark datasets.

Summary of ML/DL techniques commonly used for Indoor Localization
During the last decade, machine learning techniques have been used extensively in indoor localization works. Unlike in other application domains, deep learning techniques have not yet been applied by many works, due to the limited availability of benchmark datasets having adequate data samples for analysis. The classification techniques that are applied on the benchmark fingerprint datasets of indoor localization can be divided into two categories: supervised single ML classifiers and ensemble models.
There are a few works on semi-supervised approaches [41,42] and deep learning approaches [43,44]. However, such approaches are specifically designed and lack the generality required for applying them to a wide variety of benchmark sensor datasets for a comparative analysis. In this section, a brief background on the three categories (single classifiers, ensemble methods, and deep learning methods) is given in the first three subsections. After that, the generative adversarial network (GAN) is discussed in detail, along with the workflow for data augmentation to solve a range of problems related to the fingerprinting approach and to enhance the applicability of deep learning-based analysis.

Supervised ML classifiers
In supervised classification, the whole dataset is divided into train data and test data subsets. At first, the model is trained on the train data and then tested using the test data. Researchers have used several classification algorithms for indoor localization, sometimes in an unchanged format and sometimes updated with their own specified rules and assumptions.

k-Nearest Neighbor (kNN)
kNN is one of the most popular approaches to provide baseline accuracy for the fingerprint-based benchmark datasets. In the kNN method, the distance between train data and test data points is calculated using a distance metric such as Minkowski, Manhattan, or Euclidean distance. Each test data point is then classified based on the classes of its k nearest train data points. kNN has been used many times in indoor localization for predicting unknown location points from received sensor data values. In [45], the authors proposed an algorithm named LL-kNN, based on kNN, to obtain location information on an individual floor from RSSI WiFi fingerprints. A weighted kNN algorithm was proposed in [46]. To solve the problem of changing APs, similarities were found among APs using weighted distances during the online phase, and neighbor APs were obtained. A Spearman distance-based kNN algorithm was proposed in [47] for localization using RSSI WiFi fingerprinting.
However, kNN works best for small training sets. Being a lazy classifier, it defers computation to prediction time, so location estimation becomes highly time-consuming as the number of samples increases.
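The basic kNN decision rule can be sketched in a few lines. This is a plain majority-vote variant with Euclidean distance, not any of the modified algorithms in [45-47]; the function name and the toy fingerprints in the usage note are illustrative assumptions.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest
    training fingerprints (Euclidean distance)."""
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]
```

Every prediction scans the whole training set, which illustrates why the cost of location estimation grows with the number of stored fingerprints. For instance, with two-AP fingerprints such as `(-40, -80)` labeled `"room_a"` and `(-85, -45)` labeled `"room_b"`, a query of `(-43, -77)` would be assigned to `"room_a"`.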

Support Vector Machine (SVM)
In a support vector machine (SVM), data points are classified using hyperplanes. For a classification task with only two features, a hyperplane is a line that linearly separates and classifies a set of data. Intuitively, the farther from the hyperplane the data points lie, the more likely it is that they have been correctly classified. With multiple features, the dataset is plotted in an N-dimensional space, where N is the number of features. Then, a set of hyperplanes separates the different classes of data. The aim is to choose the hyperplane that leaves the largest margin to the nearest data points of distinct classes [48]. In [49], the authors proposed an SVM-based indoor localization method. They used a multi-class SVM along with RSSI measurements to perform zone-based localization. When evaluated on real-world datasets, this approach performed better than an ANN-based method. Researchers used SVM in [50] to train on the fingerprint data and find the relationship with channel state information (CSI) for positioning. The fingerprint database was obtained beforehand by performing endpoint clipping using CSI.
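For reference, the standard hard-margin formulation of a two-class linear SVM is given below. The notation is introduced here and is not taken from the source: $\mathbf{x}_i$ is the $i$-th fingerprint, $y_i \in \{-1, +1\}$ its class, and $\mathbf{w}$, $b$ define the hyperplane $\mathbf{w}^{\top}\mathbf{x} + b = 0$.

```latex
\min_{\mathbf{w},\, b} \ \frac{1}{2}\lVert \mathbf{w} \rVert^{2}
\quad \text{subject to} \quad
y_i\left(\mathbf{w}^{\top}\mathbf{x}_i + b\right) \ge 1, \qquad i = 1, \dots, n
```

The margin between the two classes equals $2 / \lVert \mathbf{w} \rVert$, so minimizing $\lVert \mathbf{w} \rVert$ maximizes the margin. Practical systems typically use a soft-margin variant and combine several binary classifiers for multi-class localization.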

Naive Bayes
The Naive Bayes classifier is a probabilistic method based on Bayes' theorem that performs classification under two simplifying assumptions. It assumes that the features are independent of each other and that all of the features are equally important in obtaining the outcome. However, the assumption of independence may not hold in real-world indoor positioning scenarios, and performance will degrade in that case. The Naive Bayes classifier is generally used in research works where a simple and fast classifier with a smaller requirement of training samples is preferred. For example, in the work proposed in [51], researchers used this classifier for CSI-based passive indoor localization, combined with confidence level information. In [52], a multinomial Naive Bayes technique was proposed for indoor localization based on WiFi RSSI data. The authors modified the classical independence assumption of the Naive Bayes technique by including the number of occurrences of a particular RSSI value and assigned weights accordingly.
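Under the independence assumption, the decision rule takes the standard form below. The symbols are notation introduced here: $c$ ranges over the candidate location labels, $x_j$ is the $j$-th feature (for example, the RSSI from the $j$-th fixed source), and $d$ is the number of features.

```latex
\hat{c} = \arg\max_{c} \; P(c) \prod_{j=1}^{d} P(x_j \mid c)
```

Each factor $P(x_j \mid c)$ can be estimated separately from the training samples of class $c$, which is why the method needs comparatively few samples.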
Decision tree
A decision tree is constructed in two stages: tree building and tree pruning. The problem is represented by a tree structure in which leaf nodes represent distinct class labels and internal nodes represent individual features. To build the tree, the dataset is divided into two subsets, and the division continues until all samples within each subset belong to the same class. Each node implements a test function with discrete outcomes labeling the branches. At the end, all leaf nodes represent unique classes. Decision trees allow fast classification of instances in large datasets. A Big-Data-driven method was proposed in [53] to detect a user's location in a certain 'activity-based zone' by analyzing the multimodal components of user interactions and the Bluetooth data. The performance was evaluated with several ML techniques, among which random forest and decision tree were reported to produce the minimum errors.

Artificial Neural Network (ANN)
An ANN is a computing model that can efficiently identify nonlinear class boundaries [54]. It can be viewed as a network of perceptrons connected through edges that are assigned weights. The model computes a nonlinear weighted sum of the input feature values. The loss is back-propagated during training to tune the weights effectively. An ANN consists of an input layer, an output layer, and one or more hidden layers. As the number of hidden layers increases, the network becomes deeper, and through different organizations of the layers, high-dimensional features are extracted from the data. ANN can increase the flexibility of a system by selecting the important parameters required for location estimation [55].

Ensemble models
The WiFi, Bluetooth, and inertial sensor readings for a location point may vary depending on context. Thus, if the classification approach is tuned for a particular context, the classifier may not perform well when the testing context changes. To remain effective across a variety of contexts, including emergency scenarios, the classification approach must retain its generality through a balance between bias and variance. For this reason, many works on indoor localization utilize ensemble models. Ensemble techniques can be broadly divided into two categories: boosting and bagging. Random forest is the most frequently used bagging-based ensemble technique and utilizes the decision tree as its base classifier. AdaBoost and XGBoost have been applied in a few localization approaches [56,57].

Random forest
The random forest classifier consists of a large number of decision trees. Each individual tree in the random forest outputs a class prediction, and the class with the most votes becomes the prediction of the forest. The randomness makes the algorithm robust against noise and outliers. To enhance the smart building concept, researchers proposed a random forest-based indoor localization method in [58], in which this classifier is reported to outperform other ML classifiers. A high-precision indoor positioning system was built in [59], based on multivariable fingerprints and random forest variable selection (RFVS).

Majority Voting
Majority voting is one of the most widely used techniques for combining the outcomes of the base classifiers of an ensemble model. To achieve better classification performance, the predictions for each distinct class are summed up, and the label with the majority of votes is predicted. However, the overall improvement in performance depends on the selection of the base classifiers [60]. In various indoor localization works [45,61], researchers have used majority voting ensembles specifically to widen the applicability of the localization model. Weighted ensemble models have also been designed, in which the base classifiers are assigned weights based on their training performance. In [24], Dempster-Shafer belief theory is combined with majority voting to handle context heterogeneity. That work is implemented on JUIndoorLoc, which is reported as Dataset_2 in this work.
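The two combination rules described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names, labels, and weights are hypothetical, and the tie-breaking rule is an assumption rather than something specified in the cited works.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by the largest number of base classifiers.

    Ties are resolved in favor of the label that appears first (an assumption).
    """
    return Counter(predictions).most_common(1)[0][0]

def weighted_vote(predictions, weights):
    """Return the label with the largest total weight.

    Each weight could be, for example, the training accuracy of the
    corresponding base classifier.
    """
    totals = {}
    for label, weight in zip(predictions, weights):
        totals[label] = totals.get(label, 0.0) + weight
    return max(totals, key=totals.get)

# Hypothetical outputs of three base classifiers for one test fingerprint.
votes = ["LP3", "LP1", "LP3"]
plain = majority_vote(votes)                       # two of three vote for LP3
weighted = weighted_vote(votes, [0.2, 0.9, 0.3])   # LP1 wins: 0.9 > 0.2 + 0.3
```

The example shows that the two rules can disagree: a single strongly weighted classifier can override a majority of weakly weighted ones.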

Deep learning models
Artificial neural networks with multiple layers are the foundation of the machine learning subfield known as deep learning (DL). DL can recognize complicated patterns in data with a large number of samples and features. Owing to these properties, researchers apply different DL models to fingerprint data when sufficient data is available [62,63]. Convolutional neural networks (CNN), autoencoders (AE), and generative adversarial networks (GAN) are three DL methods commonly used in recent fingerprint-based research.

Convolutional Neural Networks (CNN)
A CNN is a neural network model built from three types of layers and is mainly used for image data. The convolutional layer is the core of a CNN; it identifies features in pixels, or in the corresponding cell values of a matrix in the case of numeric data. The pooling layer abstracts these features, and a fully connected layer uses them for classification.
Using CNN to derive the temporal fluctuation patterns of RSSI, researchers have learned nonlinear mappings from the temporal and spatial properties of the signal to position coordinates [64]. Localization performance was found to improve when the correlation between RSSI in time and space is taken into account. Researchers in [65] converted Bluetooth-based fingerprint data into images by simulating the diffusion behavior of the signal. By mitigating the influence of multipath fading, the proposed method achieved good positioning accuracy using CNN. To overcome the effects of multipath fading and signal fluctuation, researchers in [66] used a combination of an extreme learning machine and an autoencoder to extract important features and reduce the dimension of the data before feeding it to a CNN for location estimation.

Autoencoders (AE)
The autoencoder is an ANN-based model with many neurons in the input and output layers and few neurons in the middle layer, also referred to as the bottleneck layer. By encoding unlabeled input, the model reduces its dimension; by decoding, it reproduces similar data with a similar distribution from the reduced representation.
A denoising AE has been used to extract reliable fingerprint patterns from RSSI measurements [67], and a fingerprint database containing reference locations in 3-D space has been created. After applying this feature extraction method, localization accuracy improved in both the horizontal and vertical directions. In [68], an AE was used to extract latent codes from multi-sensor fingerprint data, and based on these latent codes, three tasks were performed: location estimation, environment detection, and identification of the most useful sensor.

Generative Adversarial Networks (GAN)
A GAN is a combination of two neural networks, namely the generator and the discriminator. It can be used to generate data while maintaining the original data distribution. A detailed explanation is provided in the next subsection.
Researchers in [69] generated RSSI fingerprint data using GAN and predicted their labels (pseudo-labels) using semi-supervised learning. Apart from RSSI, channel state information (CSI) data has also been augmented using GAN [70]. In this case, the CSI data was converted into an amplitude feature map before augmentation to strengthen the pattern in the data.

Data augmentation using Generative Adversarial Network (GAN)
Since a site survey is a time-consuming and laborious task, collecting a sufficient amount of data from each location point is difficult in practice. The problem is that ML methods, especially deep learning approaches, demand a sufficient amount of well-labeled data for good performance. In certain cases, researchers combine a small number of readily available labeled data points with a sizable amount of crowdsourced unlabeled data from indoor locations [71]. Another feasible option is data augmentation. Researchers have used GAN in a variety of fields for data augmentation [72] and data imputation [73]. In this section, we present the basic architecture of GAN and propose a framework for augmenting labeled data using GAN.
Basic Architecture of GAN
GAN [74] is a special kind of machine learning algorithm that works on a game-theoretic principle, namely the zero-sum principle. In this framework, two neural networks, known as the generator and the discriminator, serve as the basic building blocks. The generator produces new data samples from random input, without understanding the semantics of the input data features. Both the original and the generated data are fed as input to the discriminator, which is responsible for distinguishing real data from fake data. It closely learns the representation of the original data samples and guides the generator by sending feedback. In other words, the generator is trained indirectly through the discriminator, which predicts how realistic its input seems. The two blocks work in an adversarial way and improve each other. The generator learns from the received feedback and tries to create new data samples very close to the original data samples. The model converges when the discriminator identifies most of the generated data as real data. At this point, both the generator and the discriminator reach a stable minimum training loss value. The basic architecture of GAN is represented in Fig. 10.
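For reference, the zero-sum game described above is commonly written as the minimax objective of the original GAN formulation [74]. The symbols are notation introduced here for illustration: $G$ is the generator, $D$ the discriminator, $x$ a real fingerprint drawn from the data distribution $p_{\text{data}}$, and $z$ the random input drawn from a noise prior $p_z$.

```latex
\min_{G}\,\max_{D}\; V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_{z}(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

In the idealized analysis, the game reaches equilibrium when the generated distribution matches the data distribution, at which point the discriminator can do no better than output $D(x) = 1/2$ for every input.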
Proposed workflow for data augmentation using GAN
We demonstrate the utility of GAN on two datasets, namely Dataset_2 and Dataset_4, both based on WiFi fingerprints. For Dataset_2, data from two specific rooms are considered, including fingerprints for all devices and ambient conditions. For Dataset_4, the data of a single floor of a single building is considered. The general workflow for augmenting the datasets is represented in Fig. 11. Each dataset is divided into separate groups, each containing the data for a specific class label (location point). From each location point, approximately 75% to 80% of the samples are selected for training, and the rest are kept for testing. Only the training samples are fed as input to the generator of the GAN, one location point at a time, to produce new data samples for that class. Once samples have been generated for all location points, they are merged with the original training samples to form the complete augmented training dataset. Augmented samples are added only to the training data; the test data remains unchanged.
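The data flow of this workflow (per-location split, per-class generation, merge into the training set only) can be sketched as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, and `jitter_generator` is a placeholder that merely perturbs existing samples so that the example runs without a trained model. It stands in for the per-class GAN generator and is not a GAN.

```python
import random
from collections import defaultdict

def split_per_location(samples, labels, train_fraction=0.8, seed=0):
    """Split the samples of every location point into train and test parts."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for x, label in zip(samples, labels):
        by_label[label].append(x)
    train, test = {}, {}
    for label, rows in by_label.items():
        rows = rows[:]
        rng.shuffle(rows)
        cut = max(1, int(round(train_fraction * len(rows))))
        train[label], test[label] = rows[:cut], rows[cut:]
    return train, test

def jitter_generator(rows, count):
    """Placeholder generator (NOT a GAN): perturbs randomly chosen real samples."""
    rng = random.Random(1)
    return [[v + rng.uniform(-2.0, 2.0) for v in rng.choice(rows)]
            for _ in range(count)]

def augment_training_set(train, generate):
    """Generate new samples class by class and merge them with the originals."""
    return {label: rows + generate(rows, len(rows))
            for label, rows in train.items()}

# Hypothetical two-AP fingerprints (dBm) for two location points.
samples = ([[-40 - i, -70 + i] for i in range(10)]
           + [[-80 + i, -50 - i] for i in range(10)])
labels = ["LP1"] * 10 + ["LP2"] * 10

train, test = split_per_location(samples, labels, train_fraction=0.8)
augmented = augment_training_set(train, jitter_generator)
```

Keeping the test partition out of `augment_training_set` mirrors the requirement that generated samples never reach the test data.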
The discriminator model takes an n-dimensional input and produces a one-dimensional output, where n is the number of features (APs in this case) of the dataset. The discriminator consists of four hidden layers, composed of 256, 128, 64, and 32 neurons, respectively, with the rectified linear unit (ReLU) activation function. Dropout is applied after each hidden layer to avoid overfitting. The output layer is composed of a single neuron with sigmoid activation to represent a probability value. The structure of the discriminator model is represented in Fig. 12.
The generator model takes an n-dimensional input, which receives sample points of a specific class label $(p_1, p_2, \ldots, p_n)$, and produces an n-dimensional output $(q_1, q_2, \ldots, q_n)$ resembling the training data. The generator is composed of two hidden layers with 128 and 256 neurons, respectively, each followed by the ReLU activation function, and an output layer with linear activation that consists of n neurons. In this way, the output is a vector of n elements. The model is trained iteratively with the binary cross-entropy loss function. Binary cross-entropy is selected for training the discriminator because it addresses a binary classification task, namely, whether a sample from a particular class label can be augmented or not. It is also suitable for training the generator, since the generator's output is fed as input to the discriminator, which in turn provides binary feedback as output. Finally, both the discriminator and the generator are optimized with the Adam algorithm.
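The role of binary cross-entropy in both networks can be illustrated with a small numerical sketch. The discriminator outputs below are invented values, and training the generator against the target label 1 (the commonly used non-saturating variant) is an assumption, since the exact generator target is not stated above.

```python
import math

def binary_cross_entropy(probabilities, targets, eps=1e-12):
    """Mean binary cross-entropy between predicted probabilities and 0/1 targets."""
    total = 0.0
    for p, t in zip(probabilities, targets):
        p = min(max(p, eps), 1.0 - eps)   # clamp to keep log() finite
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(probabilities)

# Hypothetical sigmoid outputs of the discriminator.
d_real = [0.9, 0.8]   # on original fingerprints (target 1)
d_fake = [0.3, 0.1]   # on generated fingerprints (target 0)

# Discriminator: label real samples 1 and generated samples 0.
disc_loss = binary_cross_entropy(d_real + d_fake, [1, 1, 0, 0])

# Generator: rewarded when the discriminator labels its samples as real.
gen_loss = binary_cross_entropy(d_fake, [1, 1])
```

With these values the discriminator loss is small and the generator loss is large, which corresponds to a discriminator that still separates real from generated samples easily.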

Experimental results
We have evaluated the datasets with supervised and ensemble classifiers, considering different ambient conditions, different subregions (buildings and floors), device heterogeneity, and varying granularity, as applicable. Thus, the important challenges faced in the widespread implementation of indoor localization have been investigated for the datasets. Later, we applied GAN to augment the data of subparts of Dataset_2 and Dataset_4 and compared the classification performance between the original and augmented data.
[…] be observed for building id 1, as represented in Fig. 14. Comparing with the sample count distribution per floor shown in Fig. 2, it can be seen that what matters for good localization accuracy is not the count of the samples but the distribution of the samples across different conditions and, more importantly, the availability of strong signals received from a larger number of APs. However, most benchmark datasets report only the count, and there is no unified metric to show how many strong APs are detected per location point. One thing to note is that the accuracy obtained for floor 2 of building 1 is 86%, whereas the maximum accuracy obtained for floor 4 of building 2 is 62%. This reflects the fact that, for a fairly large indoor area, we must accept that a common localization system offers only a moderate level of localization accuracy.
The accuracy obtained for all 16 sub-datasets of Dataset_2 with different base classifiers and majority voting is reported in Table 5, with an 80-20 train-test split. We can observe that in 13 of the 16 cases, an accuracy greater than 90% is obtained. The lowest accuracy is obtained for device D1, as it exhibits lower WiFi sensitivity than the other devices, as shown in Fig. 4. The table again indicates that a good benchmark dataset should report not the total number of samples but the AP coverage per location point.

Result for
The results obtained for device D3 under different ambient conditions are represented in Fig. 15. Comparable accuracy values are reported across the changes of conditions. This is because, for rooms with standard windows and doors, signal strengths do not vary much with minor changes of context, as indicated in Fig. 4. Thus, not only device sensitivity but also building properties matter for localization performance. This is investigated in the next experiment. The results obtained for different rooms and corridors and the corresponding device configurations are shown in Table 6, with an 80-20 train-test split. For long corridors, it can be observed that accuracy may vary for the same region when RSSI is collected with different devices.

Result for Dataset_3 (BLE Dataset)
Researchers have found that localization accuracy varies across granularity levels in the case of WiFi [75] and inertial sensor [76] data. In our analysis, we experimented on the BLE dataset. The experimental region is divided into location points four times, with four different granularity levels, and the classification result is obtained for each, as shown in Fig. 16. At the finest granularity level, there are 16 subregions or location points of approximately the same size. At the coarsest granularity, the region is subdivided into only three horizontal parts. It can be observed that the performance of the classifiers increases with an increase in the size of the location points.
Although the Bluetooth signal is stable over a comparatively larger area, the maximum distance covered by the signal is shorter. That is why we found that seven of the 13 beacons contributed less than 5% of the total samples in the dataset. However, they cannot be ignored: each of them significantly identifies one or a few regions that are not significantly identified by any other beacon. Hence, removing even one such beacon decreases the overall localization performance, as represented in Table 7. Here, the beacons received from one transceiver are removed in each case to assess its impact on localization performance. Removing many beacons that provide signals to diverse small areas significantly reduces performance, as represented in Fig. 17. For this analysis, accuracy has been computed at the finest level of granularity.
Result for Dataset_4 (Shopping Mall Dataset)
An experiment is conducted for the different sites (buildings) reported in Dataset_4 with a train-test split of 75-25. Accuracy above 90% is obtained for Site_3 and Site_5. Among the base classifiers, the best overall accuracy is obtained with the kNN classifier, within a range of 68-93%, as shown in Fig. 18. Although Site_3 has the minimum number of stable APs (Fig. 8), it consistently provides the best accuracy for all classifiers. The number of samples for this building is also low; thus, it may be a small building (building sizes are not provided). For a stable AP, a strong signal such as −30 to −40 dBm is received near the AP, and it gradually weakens (down to −85 to −95 dBm) with increasing distance. Site_3 may have fewer APs installed, resulting in fewer stable APs throughout the building; however, some nearby APs provide a comparatively weaker signal (−70 to −80 dBm) at certain locations and no signal at all (a missing entry filled with a dummy value, e.g., −110 dBm) at the subsequent location. This clearly distinguishes the fingerprint patterns of two adjacent location points, improving the overall localization performance.
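The effect of filling missing entries with a dummy value can be illustrated with a short sketch. The AP identifiers and RSSI readings are hypothetical, and −110 dBm is used only because it is the example value mentioned above.

```python
import math

MISSING_RSSI_DBM = -110  # dummy value for APs that are not heard at a location

def to_feature_vector(scan, ap_ids, missing=MISSING_RSSI_DBM):
    """Map a scan {ap_id: rssi} to a fixed-length vector ordered by ap_ids."""
    return [scan.get(ap, missing) for ap in ap_ids]

ap_ids = ["AP1", "AP2", "AP3"]

# Hypothetical scans at two adjacent location points: AP2 is heard weakly at
# the first point and not at all at the second, and vice versa for AP3.
lp_a = to_feature_vector({"AP1": -35, "AP2": -75}, ap_ids)
lp_b = to_feature_vector({"AP1": -60, "AP3": -78}, ap_ids)

# The dummy entries enlarge the distance between the two fingerprints.
separation = math.dist(lp_a, lp_b)
```

In this toy case the two vectors differ by more than 30 dB on AP2 and AP3 purely because of the dummy entries, which is the kind of separation between adjacent location points described above.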

Result for Dataset_5 (IPIN 2016)
As the smartwatch-based dataset was sufficiently large and covered most of the 324 place ids, we selected its magnetometer readings for evaluation. The magnetic field intensities along the three axes are considered as features. Since the magnetic field intensity is affected by temporary issues (as explained with Fig. 19), the device's inclination information was also considered, in terms of azimuth, pitch, and roll. The accuracy obtained with the different classifiers, within the range of 87% to 97%, is represented in Fig. 19, with a train-test split of 80-20. To investigate how the device inclination information affects localization performance, we omitted each possible combination of the inclination features one by one and observed the change in accuracy, as shown in Table 8. The accuracy is highest when all inclination features are used to estimate the location. Removing any one or two of these features gradually degrades the performance, by 4% (decision tree) to 20% (Naive Bayes) depending on the classifier, clearly indicating the impact of each feature on location estimation.
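The ablation over inclination features can be enumerated systematically, as in the sketch below. The feature names are hypothetical column names, and the sketch only lists the feature subsets; training and scoring a classifier on each subset is left out.

```python
from itertools import combinations

MAGNETIC = ["mag_x", "mag_y", "mag_z"]        # always kept
INCLINATION = ["azimuth", "pitch", "roll"]    # dropped in every combination

def ablation_feature_sets():
    """List (dropped, kept) pairs for every combination of inclination features."""
    feature_sets = []
    for k in range(len(INCLINATION) + 1):
        for dropped in combinations(INCLINATION, k):
            kept = [f for f in INCLINATION if f not in dropped]
            feature_sets.append((dropped, MAGNETIC + kept))
    return feature_sets

feature_sets = ablation_feature_sets()
for dropped, features in feature_sets:
    # A classifier would be trained and evaluated on `features` here.
    print("dropped:", dropped or "none", "-> features:", features)
```

With three inclination features there are eight subsets, ranging from the full six-feature set to the three magnetic components alone.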

Implementation results of GAN
In our experiment, the GAN model is implemented using the PyTorch library, an open source ML library for designing and training neural network-based deep learning models. To make the randomness deterministic, a static seed value has been specified. In machine learning, a seed is used to initialize a pseudo-random number generator, and a static seed value yields the same pattern of data on every run. With a learning rate of 0.001, the model is trained for 200 epochs with the binary cross-entropy loss function. Different fingerprint patterns can be obtained at different locations from a set of APs (features) within a smaller region, such as a room or a moderately sized floor. The pattern of the data depends on this limited feature set. Since the learning of GAN depends on the data pattern, and data augmentation is practically applied to smaller datasets, it is beneficial to examine the data augmentation process on different smaller data portions. We have selected two rooms from Dataset_2 and one floor from Dataset_4 for the experiment.
For each location point, we store the data samples separately and generate new data samples using GAN, as shown in Fig. 11. The number of neurons in the hidden layers of the generative and discriminative models is adjusted so that the generated data is as close as possible to the original data. In Fig. 20, it can be observed that the loss of the discriminator is initially high and that of the generator is low. The reason is that the generator produces random samples on its own and the discriminator is completely unaware of the expected pattern of the generated data. After just a few epochs, the discriminator starts to learn from the original and generated data samples and sends feedback to the generator accordingly. After approximately 125 epochs, both the discriminator and the generator converge.
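The effect of a static seed can be shown with Python's standard `random` module. The helper below is a hypothetical illustration of drawing the generator's random input; in PyTorch, the analogous step is a call to `torch.manual_seed`.

```python
import random

def sample_noise(n_samples, n_features, seed):
    """Draw a reproducible batch of Gaussian noise vectors."""
    rng = random.Random(seed)   # seeded pseudo-random number generator
    return [[rng.gauss(0.0, 1.0) for _ in range(n_features)]
            for _ in range(n_samples)]

first_run = sample_noise(2, 3, seed=42)
second_run = sample_noise(2, 3, seed=42)    # identical to first_run
other_seed = sample_noise(2, 3, seed=43)    # a different sequence
```

Fixing the seed makes repeated runs comparable, since any change in the results can then be attributed to the data or the model rather than to the random initialization.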
For Dataset_2, the augmented training dataset is prepared by combining approximately 70-75% of the original data of each label with an equal amount of newly generated data. The test set is prepared with the remaining 25-30% of the original data samples. Classifiers are trained separately on the original data and the augmented data. […] It can be observed from Fig. 23 that the accuracy improves with a spike after adding 25% data, after which the rate of improvement slows down. The maximum increment is found for ANN. Although there may be a few cases where the augmentation is not sufficient to improve the accuracy, for example, SVM in Fig. 21, the overall implementation results of GAN indicate that the data generated for each location strongly inherits the fingerprint features of the original data of that location. To validate the closeness of the statistical distributions of the generated data and the original data, the Kullback-Leibler divergence (KL-divergence) [77] is used, as shown in Table 9. KL-divergence measures the dissimilarity between the data distributions of two vectors, with smaller values indicating more similar distributions. After generating an equal amount of data, the KL-divergence is computed from each i-th RSSI vector of the original dataset to the i-th RSSI vector of the generated dataset, and its mean is obtained. For each of the three cases, the mean KL-divergence is between 0.4 and 0.5, indicating a negligible difference between the data distributions of the two datasets.
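A mean pairwise KL-divergence of this kind can be computed as in the following sketch. How an RSSI vector is turned into a probability distribution is not specified above, so the shift-and-normalize step in `rssi_to_distribution` is purely an assumption, as are the function names and the sample values. The absolute numbers it produces are therefore not comparable with Table 9.

```python
import math

def rssi_to_distribution(rssi, floor_dbm=-110.0, eps=1e-9):
    """Assumed normalization: shift RSSI above a floor and scale to sum to 1."""
    shifted = [max(v - floor_dbm, 0.0) + eps for v in rssi]
    total = sum(shifted)
    return [s / total for s in shifted]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with strictly positive q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mean_kl(original, generated):
    """Mean KL-divergence between the i-th original and i-th generated vectors."""
    values = [
        kl_divergence(rssi_to_distribution(o), rssi_to_distribution(g))
        for o, g in zip(original, generated)
    ]
    return sum(values) / len(values)

# Hypothetical original and generated RSSI vectors (dBm).
original = [[-40, -70, -90], [-85, -50, -60]]
generated = [[-42, -68, -91], [-83, -52, -61]]

score = mean_kl(original, generated)
```

KL-divergence is zero for identical distributions and grows as they diverge; it is also asymmetric, so the direction (original to generated) matters.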

Comparison of classification performance for different datasets
In this section, the different classification approaches are compared across the five datasets. As is evident, each dataset requires different preprocessing for better results. Thus, finding a common basis for testing the performance of any classification algorithm is difficult in this research field. For comparison, for each classifier, we consider the case in which the best accuracy is obtained for the dataset when experimenting with different characteristics, as shown in Fig. 24.
The Naive Bayes classifier is based on two basic assumptions: each feature is independent of the others and contributes equally to the outcome. The first assumption holds for any WiFi or Bluetooth beacon source, i.e., one source does not dominate the signal strength of another source. The magnetic intensities along each axis and the inclination information, on the other hand, are related to one another, but each feature contributes significantly to the outcome; hence, the second assumption holds for them. In contrast, the second assumption is questionable in the case of WiFi- or Bluetooth-based localization. A highly fluctuating or intermittently active WiFi AP does not contribute as much as an always active and stable AP. A beacon placed near the center of the experimental region contributes significantly to a larger number of locations, but the other beacons do not, due to the short range of the Bluetooth signal. The maximum accuracy obtained for the BLE dataset (Dataset_3) is 80%, slightly lower than that of the other classifiers, at the maximum level of granularity. Performance on Dataset_4, which includes a large number of WiFi APs, is similar. However, for Dataset_1, performance degrades further because the number of features was so large and varied that equal or nearly equal contributions from each were impossible, resulting in a maximum accuracy of 73%. Satisfactory performance was observed for the magnetic field-based Dataset_5, whereas the WiFi-based Dataset_2 provided significantly better performance, as represented in Fig. 24a. This clearly shows that Naive Bayes may function well for a small or moderate number of data samples and a limited feature set, even though a real-world fingerprint-based dataset does not strictly follow the assumptions.
SVM is primarily intended for binary classification, but it can classify multi-class datasets as well, with considerable accuracy, as shown in Fig. 24b. For fingerprint datasets, there are usually many fixed sources (especially for WiFi), generating a large feature set and making the data high dimensional. The problem with high-dimensional data is overfitting. When there are many features and relatively few instances, it is easy for the ML model to pick up spurious relationships between the features and the target instances. SVM handles this problem through regularization, i.e., penalizing the model for added complexity and utilizing the features according to their importance. However, there must be a greater number of training instances than features. In Dataset_1, there are 529 features. The dataset was divided into subregions (refer to Fig. 2), and 80% of each subregion's data was considered for training. Hence, the ratio between the training instances and the features becomes significantly smaller. For this reason, the maximum accuracy found with SVM on Dataset_1 is comparatively poor, and a detailed experiment is not provided.
Fingerprint datasets are usually small to moderate in size, and kNN works well for such datasets, as represented in Fig. 24c. However, Dataset_1 contains many more features than the other datasets, and the degradation of performance with increasing dimensionality is visible.
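For illustration, a kNN classifier over RSSI fingerprints fits in a few lines of Python. The Euclidean distance, the value of k, and the sample fingerprints are assumptions made for this sketch and are not the settings used in the experiments.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Predict the location label of `query` from its k nearest fingerprints."""
    neighbors = sorted(
        zip(train_X, train_y),
        key=lambda pair: math.dist(pair[0], query),  # Euclidean distance
    )[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

# Hypothetical two-AP fingerprints (dBm) for two location points.
train_X = [[-40, -70], [-42, -68], [-41, -71],
           [-80, -50], [-82, -48], [-79, -51]]
train_y = ["LP1", "LP1", "LP1", "LP2", "LP2", "LP2"]

predicted = knn_predict(train_X, train_y, [-43, -69], k=3)
```

Because every AP contributes one squared term to the distance, many weak or irrelevant APs can dominate it, which is one way to read the drop in accuracy on the high-dimensional Dataset_1.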
In the decision tree algorithm, the best features are selected using attribute selection measures (ASM) to split the dataset. Classification is then performed using those features as decision nodes. Since each feature has a different impact on the classification in fingerprint-based localization, this ASM-based decision results in considerable accuracy for all the datasets, as shown in Fig. 24e. The performance becomes more stable and reliable when the decisions of multiple decision trees are combined in a random forest, as shown in Fig. 24d.
ANN uses different weights for different training instances and gradually improves the weights. In the case of fingerprint datasets, it sometimes becomes difficult for the model to identify the noisy instances and give them a smaller weight, because the overall number of samples in the dataset is not large, which may result in a near-equal ratio of good instances to noisy instances. For Dataset_2, Dataset_3, and Dataset_5, ANN yields satisfactory results (Fig. 24f), while for Dataset_4 ANN performs as accurately as kNN. In general, more fingerprint instances per location point are needed to obtain good classification accuracy with ANN.
Considering that the best accuracy of kNN is above 90% for most of the datasets, it can be considered an effective option for fingerprint datasets.

Discussion
We experimented with different classifiers on the selected datasets based on some key characteristics. From the experimental study, our main observations are as follows:
(i) It is more efficient to create custom localization systems for each building than to create a common localization system for an entire indoor region with multiple buildings. Different building structures and feature sets may affect how well a particular ML method performs. It is therefore best to test a localization system's performance on various floors or rooms before installing it in a building, to see whether it is sufficiently accurate in the various subregions.
(ii) When developing an ML-based user tracking system for a public indoor space, it is preferable to collect training data using various common device configurations, because different devices have different sensor sensitivities, which affects the system's outcome.
(iii) For Bluetooth-based systems, ML methods identify locations more accurately when individual location points are moderately large, for example, a small room or two to three parts of a large room. Performance decreases at finer granularity, that is, as the size of the location points shrinks, because the Bluetooth signal remains stable over a comparatively larger area than WiFi but fades away completely after a shorter distance than WiFi. Hence, fine-grained localization works better with WiFi than with Bluetooth.
(iv) To improve the performance of magnetometer-based localization, device inclination information can be used as a feature along with magnetic intensity to maintain consistent localization performance and reduce the effect of the sporadic appearance of ferromagnetic materials.
(v) The challenge of repetitive site surveys can be addressed with GAN. New fingerprint data samples can be generated while maintaining the RSSI characteristics of the original data, which in turn improves the accuracy of an ML-based localization system.

Conclusion
Fingerprinting is an emerging approach to indoor localization based on commercial smartphone technologies. Despite the acceptable performance of many localization techniques, some challenges remain because of the various traits of the fingerprint datasets. In this experimental study, we have investigated three key challenges that should be addressed to enhance real-life localization applications, namely: (i) repetitive site surveys; (ii) device heterogeneity; and (iii) granularity and subregion-specific performance variation. Five benchmark datasets are divided based on device heterogeneity, subregions, granularity of locations, and a few other ambient conditions. Machine learning methods were used in experiments to show how these characteristics affect localization performance. To address the challenge of repetitive site surveys, a data augmentation process using a GAN was discussed along with implementation results. Based on the results, kNN and decision tree are found to be effective options for indoor localization. When GAN is applied to augment data on a per-class basis, it is found to improve localization performance.

In the future, we plan to extend the GAN-based augmentation strategies to better handle the effect of context heterogeneity. Fusion of different sensing modalities based on deep learning toward a stable localization system is another viable future research direction.
Author contributions Both authors have contributed to the paper.

Funding
The authors received no financial support for the research, authorship, and publication of this paper.

Fig. 1 Dataset_1: distribution of number of samples across different users and devices

Fig. 2 Dataset_1: number of samples for each building and floor ID

Fig. 4 Dataset_2: distribution of samples across various devices for different conditions: human presence/absence and open/closed room

Fig. 6 Dataset_2: effect of changing devices and presence/absence of human on RSSI levels of APs at a specific location

Fig. 7 Dataset_3: sample count for location points and signal received/not received per beacon

Fig. 10 Basic architecture of GAN

Fig. 11 Proposed workflow for data augmentation

Fig. 12 Structure of discriminator model used in experiment

Fig. 14 Experimental result of Dataset_1 for different floors of Building 1

Dataset_2 (JUIndoorLoc): The complete training dataset of JUIndoorLoc has been divided into 16 sub-datasets, based on device configuration, presence/absence of human, and open/closed condition of the room. The accuracy obtained for all 16 combinations with different base classifiers and majority voting is represented in Table 5.

Fig. 15 Experimental result of Dataset_2 with device D3 and other ambient conditions. H0: absence of human; H1: presence of human; R0: open room; R1: closed room

Fig. 16 Experimental result of Dataset_3 with different granularities; LP: location points

Fig. 17 Decrement in localization performance with elimination of beacons providing signals to less than 5% of total sample count of Dataset_3

Fig. 21 Localization performance is improved for augmented data for first room of Dataset_2

Fig. 24 Performance of ML classifiers on different datasets

Table 1
Some applications of indoor localization in location-based services

Table 2
Key challenges of fingerprint datasets addressed in some recent low-cost technology-based indoor localization works

Table 4
Experimental result of Dataset_1 for different subregions

Table 6
Experimental result of Dataset_2 with different rooms and device

Table 7
Impact of each beacon source of Dataset_3

Table 8
Effect of device inclination information on magnetic field-based localization; evaluation with Dataset_5