In his work, Bosch-Capblanch (8) defined three key characteristics necessary for the harmonisation of variables: a unique identifier, a semantically identical description, and consistent statistical metrics for their values. Cunningham et al. (9) further define semantic harmonisation as the process of collating these data into a singular consistent logical view. Although harmonisation and curation tools, such as the BiobankConnect software (10), SORTA (11) and DataSHaPER (12), exist, their operation is underpinned by expert-crafted ontology- and schema-based data annotations, which are difficult to create. Simpler rule-based approaches have also been employed, but these rely on variable name similarity and do not generalise (8). An alternative that can overcome these challenges is the use of data-driven artificial intelligence (AI) and machine learning (ML) algorithms (9). Using techniques such as natural language processing (NLP) and unsupervised learning, we demonstrate tools that support semantic data harmonisation and curation. We evaluate the accuracy and time savings of two semantic harmonisation automation pipelines: (1) semantic search for domain-relevant variables and (2) semantic clustering of semantically similar variables.
Evaluation dataset
We use the English Longitudinal Study of Ageing (ELSA) (13) datasets to evaluate the semantic data harmonisation process. The ELSA study surveyed households with at least one adult aged over 50 years with the aim of gaining insight into all aspects of the UK’s ageing population. The study was conducted in a series of 10 stages, commencing in 1998, with the most recent stage ending in 2019. Each wave took place 2 years after the previous one, surveying the same participants subject to consent and other extenuating circumstances. In total, over 18,000 people participated in the study, with a consistent population of over 8,000 throughout the last 9 waves. The sample is based on respondents to the Health Survey for England (HSE), which annually surveys health and lifestyle changes. Local area data also enable linkage with census data concerning income, education, and employment.
Although attempts have been made by Lee et al. (14) to harmonise the ELSA datasets, not all available data have been incorporated. Additionally, no use of harmonisation tools is reported. In the ELSA, 94,037 variables are recorded across 67 tabular files, leading to significant difficulties when navigating and analysing the datasets. This complexity makes the ELSA an ideal use case for testing the proposed semantic harmonisation methodology.
The number of variables across all waves of the ELSA study is shown in Fig. 1. A significant portion of the variables across the ELSA datasets for waves 1–9 longitudinally capture the same information but are not named consistently between waves. Following the Bosch-Capblanch definition for harmonisation of variables (8), we perform ELSA identifier-level harmonisation by matching variable identifiers in a case-insensitive manner. This initial step ensures that variables with the same identifier are recognised and treated as identical despite variations in letter case. The identifier harmonisation eliminated duplicate variable identifiers, reducing the 94,037 variables to 22,402 unique variables.
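As a minimal sketch of this identifier-level step, the snippet below deduplicates variable identifiers case-insensitively using pandas; the column names and example identifiers are illustrative assumptions rather than the exact ELSA metadata layout.

```python
import pandas as pd

# Illustrative metadata: one row per variable occurrence across waves/files.
metadata = pd.DataFrame({
    "variable": ["indager", "INDAGER", "scorg01"],          # hypothetical identifiers
    "description": ["Age at interview", "Age at interview",
                    "Organisation membership"],
})

# Treat identifiers as identical regardless of letter case, keeping the first occurrence.
metadata["variable_key"] = metadata["variable"].str.lower()
harmonised = metadata.drop_duplicates(subset="variable_key", keep="first")

print(len(metadata), "->", len(harmonised), "unique variables")
```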
Semantic harmonisation methodology
The focus of our study is on the semantic analysis of variable descriptions to identify semantically identical variables via NLP and ML technologies. We discuss the state-of-the-art semantics-aware text embedding technologies that underpin our approach. We then detail the design and implementation of the two semantic harmonisation pipelines: (1) semantic search to identify domain-relevant variables and (2) semantic clustering of similar variables.
Efficient Semantics-Aware Text Embedding
We investigate NLP technologies that can efficiently generate text embeddings that capture semantic context for our harmonisation pipelines. NLP embeddings (i.e., dense vector representations) have gained prominence in medical research for analysing unstructured textual data from electronic health care records (EHRs), intensive care units (ICUs), social media and the scientific literature (15, 16). Embedding models are trained in an unsupervised manner, capturing knowledge from large unlabelled corpora in high-dimensional vector spaces. These embeddings can be leveraged in semantics-aware clustering and search tasks.
Numerous methods of sentence embedding have been proposed. Skip-Thought (17) trains an encoder-decoder gated recurrent unit (GRU) architecture to predict the surrounding sentences of a given passage in an unsupervised manner. Using the encoder, a latent space of semantically similar sentences is created, enabling its use in semantic similarity tasks. Universal Sentence Encoders (USE) (18) improve upon Skip-Thought by introducing a transformer network for significant performance gains at the expense of model complexity, computation time and memory usage. Contextual embeddings that are aware of the ordering and identity of each word are first computed and then summed at each word position into a fixed-size 512-dimensional vector. The encodings are designed to be general purpose and applicable to a wide range of domains. Chen et al. (16) utilised USE within the healthcare domain to find similar sentences in EHRs. However, testing on the BEIR dataset (19) indicates subpar performance compared with other neural methods.
Bidirectional encoder representations from transformers (BERT) models (20) are pretrained transformer networks that produce contextual embeddings. Words are tokenised using WordPiece (21) with a 30,000-token vocabulary, after which 12 layers of multihead attention are applied and passed to a simple regression function. RoBERTa demonstrated further improvements by refining the training process, tuning hyperparameters and expanding the training sets. Although BERT-based models can be adapted to embed sentences by iteratively processing individual words, they are limited to a predetermined, fixed sentence length, restricting comparison performance and increasing storage requirements. The sequence of BERT word embeddings may be averaged into a single sentence vector (22, 23); however, this results in significant performance degradation.
The Sentence-BERT (SBERT) (24) model has demonstrated good performance in semantic textual similarity (STS) tasks, producing semantically meaningful embeddings. It can map textual sentence input, up to 250 words in length, to a single fixed-size vector. The BERT architecture (20) is modified with Siamese and triplet networks and subsequent pooling. A cosine similarity objective function (24) is used to calculate the similarity between processed sentences. Other metrics, such as the dot product, have been shown to outperform cosine similarity on specific datasets; however, on average, cosine similarity performs marginally better (19).
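The following minimal sketch illustrates how SBERT embeddings and cosine similarity are used in practice with the sentence-transformers library; the checkpoint name and example descriptions are illustrative assumptions, not the exact configuration of this study.

```python
from sentence_transformers import SentenceTransformer, util

# Any SBERT-style checkpoint can be used here; this name is an assumption.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

descriptions = [
    "Age at interview",
    "Whether respondent has difficulty climbing stairs",
    "Total net household income",
]

# Normalised embeddings make cosine similarity equivalent to a dot product.
embeddings = model.encode(descriptions, normalize_embeddings=True)

# Pairwise cosine similarity between all description embeddings.
similarity = util.cos_sim(embeddings, embeddings)
print(embeddings.shape, similarity.shape)
```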
We leverage the SBERT architecture to underpin our semantic data harmonisation and curation solutions. We analyse and compare four pretrained SBERT-based language models to empirically investigate the impacts of model size and training set domain on harmonisation performance. These four models are MiniLM, MPNet, Sentence-T5-xxl and BioLinkBERT, and their specific training details are described below.
MiniLM
MiniLM (25) was proposed by Wang et al. and implements an SBERT architecture (24). The model compresses large Transformer models into smaller, more efficient models through deep self-attention distillation. Leveraging subsequent development by Reimers et al. (26), the MiniLM model was reduced to only six layers with an embedding vector size of 384. This yields the fastest inference of the four models, at 14,200 sentences/sec on a V100 graphics processing unit (GPU). Training used 100 thousand steps on a tensor processing unit (TPU) v3-8 with 1.17 billion sentence pairs, the majority drawn from Reddit Comments (27), S2ORC (28), WikiAnswers (29) and PAQ (30).
MPNet
MPNet (31) by Song et al. improves upon the BERT (20) and SBERT pretraining methods by reducing positional discrepancies and leveraging dependencies among all tokens in a sentence through permuted language modelling. Further fine-tuning of MPNet resulted in all-mpnet-base-v1 (32), which, as with MiniLM, was pretrained on 1.1 billion sentence pairs. This model has increased complexity, with a 768-dimensional embedding space, slowing inference to 2,800 sentences/sec on a V100 GPU.
Sentence T5-xxl
The Text-to-Text Transfer Transformer (T5) introduced by Raffel et al. (33) excels in a variety of NLP tasks by leveraging the Colossal Clean Crawled Corpus (34) and harnessing transfer learning. Ni et al. (35) scaled the T5 model up to 11 billion parameters and incorporated an SBERT architecture to develop the Sentence-T5-xxl model. Sentence-T5-xxl retains state-of-the-art performance in sentence embedding tasks, with 768-dimensional embeddings, but at the expense of very slow inference (50 sentences/sec on a V100 GPU). The model is trained on a corpus of two billion question-answer pairs from various online communities as well as the Stanford Natural Language Inference (SNLI) dataset (36).
BioLinkBERT
Yasunaga et al. proposed the LinkBERT (37) pretraining method, which leverages links between documents by viewing a text corpus as a graph of documents and creating document contexts. This approach is especially relevant for pretraining domain-specific models. BioLinkBERT is a language model pretrained with LinkBERT on PubMed that achieves state-of-the-art performance in BioNLP tasks such as BioASQ (38) and USMLE (39). The model uses a 512-dimensional embedding space and has inference times comparable to those of MPNet.
Language models vector space comparison
To gain insight into the models’ vector spaces, we computed and plotted the cosine distance distributions of the embeddings for all variable descriptions in our datasets (see Fig. 2). The plot reveals important similarities and differences in the vector spaces of the four models. MiniLM (M = 0.869, SD = 0.142) and MPNet (M = 0.856, SD = 0.133) have similar distributions, whereas T5 (M = 0.346, SD = 0.055) and BioLinkBERT (M = 0.189, SD = 0.067) have significantly lower means and denser distributions. The wider cosine distance distributions of MiniLM and MPNet provide greater discriminative ability in downstream tasks than those of T5 and BioLinkBERT.
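A sketch of how such distance distributions could be computed is given below; the random array stands in for the real description embeddings of one model, and subsampling is an assumption made here to keep the pairwise computation small.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Placeholder for the (n_variables x dim) embedding matrix of one model.
embeddings = np.random.default_rng(0).normal(size=(22402, 384))

# Subsample variables so the pairwise matrix stays manageable (illustrative choice).
rng = np.random.default_rng(42)
sample = embeddings[rng.choice(len(embeddings), size=2000, replace=False)]

dist = pairwise_distances(sample, metric="cosine")   # cosine distance = 1 - cosine similarity
upper = dist[np.triu_indices_from(dist, k=1)]        # keep each pair once

print(f"mean = {upper.mean():.3f}, sd = {upper.std():.3f}")
```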
Semantic Search for Domain-relevant Variables
Semantic harmonisation is the process of collating data into a singular consistent logical view (9). Often, this logical view is the collation of variables relevant to domains of interest. Semantic search can automate the suggestion of variables within a domain.
Guha et al. (40) introduced semantic search methodologies for improved web search results on the semantic web. Unlike previous approaches that merged textual and semantic information into a single search index, their approach uses inverted indices for searching textual content, in contrast to forward indices, which fetch information using unique identifiers.
Traditional keyword-based retrieval models require explicit observation of search terms, thereby increasing the index size and total query time. In contrast, neural embedding-based methods alleviate these inefficiencies by utilising a single embedding space that unifies textual and semantic information (41). For instance, word embeddings have been used successfully to extend full-text search for legal document collections (42).
In the current work, we propose a neural embedding-based solution for automating semantics-aware search for variables relevant to a given domain of interest. The user specifies a phrase whose embedding is compared against all variable description embeddings, enabling the closest matches to be selected. By leveraging semantic context, this significantly reduces the time taken for variable selection and improves performance over basic approaches such as keyword search. Based on our analysis of efficient semantics-aware text embedding technologies, we utilise the SBERT model architecture and evaluate the MiniLM, MPNet, BioLinkBERT and T5-XXL pretrained models.
As illustrated in Fig. 3, we incorporate the SBERT model into the proposed semantic search pipeline. Embeddings of variable meta-data descriptions are precomputed, enabling the use of efficient semantic search methods. We use the cosine similarity function to compare qualitative domain-specific phrase embeddings to all variable embeddings. Although other metrics, such as the dot product, are also appropriate, it has been shown that the cosine distance has the best performance on average (19).
Finally, to select the domain-relevant variables, the proposed pipeline outputs the top N descriptions with the greatest similarity to the search phrase. An alternative to this functionality would be to apply a threshold on the distance between each variable’s embedding and the search phrase embedding. However, as shown in Fig. 2, the models’ embeddings vary in sparsity; thresholds would therefore need to be adapted for each model.
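A minimal sketch of the search step is given below, assuming the sentence-transformers library, an illustrative checkpoint name, a toy description list and an example domain phrase; the real pipeline operates over the precomputed embeddings of all variable descriptions.

```python
import numpy as np
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # assumed checkpoint

descriptions = [                                   # illustrative variable descriptions
    "Total net household income",
    "Age at interview",
    "Difficulty climbing stairs",
]

corpus_emb = model.encode(descriptions, normalize_embeddings=True)     # precomputed once
query_emb = model.encode("household finances and income", normalize_embeddings=True)

# Rank all variables by cosine similarity to the domain phrase and keep the top N.
scores = util.cos_sim(query_emb, corpus_emb)[0]
top_n = np.argsort(-scores.numpy())[:2]
for i in top_n:
    print(descriptions[i], float(scores[i]))
```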
Semantic Clustering of Variables into Domains
Building on the pipeline for identifying variables relevant to a specific domain of interest, we propose a new pipeline for the unsupervised grouping of variables into semantically cohesive domains. We base this pipeline on unsupervised ML methods for dimension reduction and clustering to enable a fully automated grouping of semantically similar variables based on the sentence embeddings of the variable descriptions in the dataset metadata. Figure 4 depicts the pipeline for unsupervised variable domain clustering, which, in addition to the text embedding algorithm, incorporates an algorithm for dimensionality reduction of the high-dimensional embedding space and an algorithm for clustering. Variables within the same cluster are semantically similar and are harmonised together in the same domain.
Previous efforts have been made to cluster the embeddings of supervised models, with varying levels of success. Nikifarjam et al. (43) embedded short-form tweets using Word2Vec (44) and clustered them using K-means, after which a conditional random fields classification model was trained. Xu et al. (45) used K-means to cluster dense neural embeddings with a unique convolutional neural network model. Bodrunova et al. (46) used hierarchical agglomerative clustering to group universal sentence encoder embeddings, with the addition of the Markov stopping moment to choose the optimal number of clusters. Similarly, An et al. (47) used a range of both static and dynamic sentence embeddings, which are clustered with K-means into a specified number of groups by spatial histogram analysis. Gupta et al. (48) reported that lowering the embedding dimensionality prior to clustering using an encoder-decoder model improves the clustering performance.
The above unsupervised clustering algorithms require pairwise dissimilarity to be computed for every pair of description embeddings. As stated previously, we use cosine similarity for the comparison of the SBERT embeddings. Because the clustering algorithms use a distance measure, cosine similarity is converted to cosine distance via \( \text{cosine distance} = 1 - \text{cosine similarity} \). Furthermore, embedding vectors are normalised prior to the cosine distance calculations to ensure consistency between the embedding models.
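As a minimal illustration of this preparation step (with a random placeholder array standing in for the real embeddings), the vectors are L2-normalised and the similarity matrix is converted to a distance matrix:

```python
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder embeddings; in the pipeline these are the SBERT description embeddings.
emb = normalize(np.random.default_rng(0).normal(size=(100, 384)))

# Cosine distance = 1 - cosine similarity; clip guards against tiny negative values
# caused by floating-point error on the diagonal.
distance = np.clip(1.0 - cosine_similarity(emb), 0.0, None)
print(distance.shape)
```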
For the pipeline in Fig. 4, we compared three dimensionality reduction algorithms, namely, PCA, t-SNE and UMAP, and three clustering algorithms, namely, K-means, Hierarchical Agglomerative Clustering and HDBSCAN.
Dimensionality Reduction Algorithms Selection
Gupta et al. (48) found that naive clustering of high-dimensional contextual BERT embeddings produces deficient results. An et al. (47) reinforced this finding by assessing embedding models’ clustering ability using spatial histograms and reported that high-dimensional dynamic SBERT embeddings cluster less readily than low-dimensional static GloVe embeddings. We argue that reducing the embedding dimensionality, and therefore the clustering complexity, can improve clustering performance.
Established techniques such as principal component analysis (PCA) (49) identify the principal components with maximal variance in an unsupervised manner. These methods seek to preserve the global pairwise distance structure of the data (50).
Van der Maaten et al. introduced t-distributed stochastic neighbour embedding (t-SNE) (51). The algorithm maps high-dimensional elements to a 2- or 3-dimensional representation while preserving distances to neighbouring elements. In contrast to PCA, t-SNE favours the preservation of local distances over global distances (50). It is widely used for visualising high-dimensional vector spaces. However, t-SNE performs poorly when mapping to more than 3 dimensions, as it frequently converges to local minima. This limited range of output dimensions prohibits its use for clustering description embeddings.
Uniform manifold approximation and projection (UMAP) (50), unlike t-SNE, performs nonlinear mappings to arbitrary lower dimensions. The algorithm preserves global structure while displaying superior time efficiency, enabling scaling to significantly larger datasets, which is vital for the Big Data healthcare domain. Although UMAP is a stochastic algorithm, it may be initialised with a predefined seed to ensure deterministic execution. Superior performance over t-SNE and PCA has been shown when classifying the MNIST and Fashion-MNIST datasets (50).
We leverage UMAP’s superior performance and adaptability to map variable embeddings across various dimensions: 10, 50, 100, 200 and 300.
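A sketch of this reduction step using the umap-learn package is shown below; the placeholder embeddings, cosine metric and fixed seed are assumptions consistent with the description above rather than the exact pipeline configuration.

```python
import numpy as np
import umap

# Placeholder for the normalised SBERT description embeddings.
embeddings = np.random.default_rng(0).normal(size=(2000, 384))

# Reduce to each target dimensionality; a fixed seed makes UMAP deterministic.
for n_dims in (10, 50, 100, 200, 300):
    reducer = umap.UMAP(n_components=n_dims, metric="cosine", random_state=42)
    reduced = reducer.fit_transform(embeddings)
    print(n_dims, reduced.shape)
```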
Clustering Algorithms Selection
K-means clustering is a prominent method of vector quantisation introduced by MacQueen et al. (52). Datapoints are assigned to a fixed number of clusters by minimising the intracluster distances between each centroid and its assigned datapoints. The assignment is refined iteratively, either until convergence under Lloyd’s expectation-maximisation algorithm (53) or until a maximum number of iterations is reached.
In an unsupervised setting, when the number of domains is not predefined, it is challenging to find the optimal number of clusters. This often necessitates reliance on labour-intensive methods such as visualisation and human judgement to infer groupings of variables (54). Moreover, this approach lacks adaptability in modifying the number of clusters; it requires the number of clusters to be specified beforehand and necessitates a complete re-computation of the model for minor adjustments in hyperparameters. Hierarchical clustering alleviates this inefficiency.
Hierarchical agglomerative clustering (HAC) (55) groups high-dimensional embeddings into a hierarchical structure based on any distance information, which can then be truncated at a desired level into distinct clusters. The algorithm is highly flexible, performing satisfactorily with any distance metric, as opposed to centroid- and median-based algorithms. It offers significant adaptability over simpler methods such as K-means by allowing fine-grained adjustments through the linkage threshold. Stepwise dendrograms enable the visualisation of the hierarchical tree structure for comprehensive analysis of variable similarity irrespective of the linkage threshold. Computational efficiency is greatly increased for a lower linkage threshold, which requires only shallow inspection of the hierarchical tree structure and offers major time reductions compared with K-means. However, HAC fully partitions the space, forcing outliers to be assigned to clusters, which reduces cluster cohesiveness and decreases harmonisation performance. This inefficiency can be addressed by allowing some points to be treated as noise and left unassigned.
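A minimal sketch of HAC over a precomputed cosine distance matrix is given below (scikit-learn ≥ 1.2 is assumed, where the keyword is `metric`); the placeholder embeddings and linkage threshold are illustrative, not the tuned pipeline values.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize

# Placeholder embeddings; in the pipeline these come from the SBERT model (and UMAP).
emb = normalize(np.random.default_rng(0).normal(size=(500, 384)))
distance = np.clip(1.0 - cosine_similarity(emb), 0.0, None)

hac = AgglomerativeClustering(
    n_clusters=None,           # cut the hierarchy by distance rather than cluster count
    metric="precomputed",      # supply pairwise cosine distances directly
    linkage="average",
    distance_threshold=0.5,    # illustrative linkage threshold, tuned per embedding model
)
labels = hac.fit_predict(distance)
print(len(set(labels)), "clusters")
```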
HDBSCAN (56) extends density-based spatial clustering of applications with noise (DBSCAN) (57) by building a clustering hierarchy and allowing noise points, i.e., outliers, to remain unassigned to clusters. Empirical testing demonstrated substantial performance gains over competing algorithms such as OPTICS (58) in the majority of cases. The algorithm’s complexity incurs a considerable computational cost, and it is slower than K-means; nevertheless, it remains significantly faster than HAC for large datasets.
Density-based algorithms, such as DBSCAN, can efficiently identify anomalies in low-density regions and discard them according to a single parameter, the minimum number of samples, which dictates how many neighbouring points are required for a core point to be established. HDBSCAN generalises this with an additional minimum cluster size parameter: candidate clusters with fewer members are deemed spurious and are not established. By forgoing clustering completeness, stronger harmonisations may be achieved. An extension of Prim’s algorithm is used to construct a minimum spanning tree from the density-based groupings in order to extract the HDBSCAN hierarchy, and an optimisation method extracts a globally optimal solution from the hierarchical structure (47–51).
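The sketch below applies the hdbscan package to UMAP-reduced embeddings; the placeholder array and the `min_cluster_size` and `min_samples` values are illustrative assumptions rather than the tuned pipeline settings.

```python
import numpy as np
import hdbscan

# Placeholder for the UMAP-reduced description embeddings.
reduced = np.random.default_rng(0).normal(size=(2000, 50))

clusterer = hdbscan.HDBSCAN(
    min_cluster_size=10,   # candidate clusters smaller than this are deemed spurious
    min_samples=5,         # neighbours required for a point to count as a core point
    metric="euclidean",
)
labels = clusterer.fit_predict(reduced)   # label -1 marks noise points

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters, "clusters;", int(np.sum(labels == -1)), "points labelled as noise")
```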
Clustering Goodness Metrics Selection
Evaluating the goodness of clustering results across various clustering algorithms, hyperparameters and dimensional mappings has long been considered a vital issue that is essential to the success of clustering applications (59). Internal clustering validation evaluates the goodness of clustering results (60) without the need for external validation measures such as labelled validation datasets.
Liu et al. (61) reviewed 11 metrics and analysed properties such as monotonicity, noise, density and subcluster criteria, in addition to the criteria of compactness and separation. Empirical evidence suggests that the silhouette score (62) correctly identifies optimal clustering in most cases; however, for datasets with prominent subclusters it promotes the merging of nearby subclusters to maximise intercluster separation. In contrast, S_Dbw (63) satisfies all five criteria at the expense of computational complexity. However, this property may not be desirable for the sparse embeddings produced by SBERT models, as it may prioritise smaller subclusters, dividing semantically similar variables into separate clusters. Nisha et al. (64) also promoted the use of the silhouette score for evaluating the goodness of clustering. The silhouette score is valued in clustering analysis for measuring both the cohesion within clusters and the separation between them, providing a combined metric that ranges from −1 to 1. It is applicable to various clustering methods without requiring ground truth labels, making it suitable for unsupervised learning scenarios, although it can be computationally intensive.
We adopt the silhouette score as the goodness-of-clustering metric because of its favourable properties (61) and reported performance. The metric computes the pairwise difference between intracluster (within-cluster) and intercluster (between-cluster) distances (62).
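A minimal sketch of this evaluation with scikit-learn is shown below; the placeholder embeddings and labels are illustrative, and the exclusion of HDBSCAN noise points (label −1) before scoring is an assumption of this sketch.

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Placeholders for reduced embeddings and the cluster labels produced by the pipeline.
reduced = np.random.default_rng(0).normal(size=(500, 50))
labels = np.random.default_rng(1).integers(0, 5, size=500)

mask = labels != -1   # drop noise points before computing the score
score = silhouette_score(reduced[mask], labels[mask], metric="cosine")
print(f"silhouette score: {score:.3f}")   # ranges from -1 (poor) to 1 (well separated)
```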
Validation Approach
To analyse and validate the performance of the Semantic Search and Semantic Clustering pipelines, we created a testing dataset by manually partitioning a set of variables, which is an appropriate approach when ground truth data are absent (47). We developed validation domains built on the Simpson et al. (5) Delphi study, which identifies 31 domains related to determinants of improved care in multimorbidity. We identified a subset of 12 validation domains relevant to the ELSA variable descriptions: finance; housing; engagement in meaningful activities and social participation; access to social care, community-based services and other provisions; use of technologies to support individuals at home; recognition of and support with lifestyle factors; prescribing and medication management; enhanced support from family and other informal carers; person-centred and holistic care; supporting self-management of conditions; support with daily living and independent living; and environmental factors and wider social determinants of health. A random sample of 2,000 variables from the ELSA dataset was taken and manually labelled with the 12 validation domains to create a test set for comparison. Manual labelling used only the variable descriptions and no other external information, allowing comparison between human and automated pipeline performance.
For the Semantic Search pipeline evaluation, the resulting cosine similarity score for each variable is evaluated with the AUC metric (65), which calculates the area under the receiver operating characteristic (ROC) curve against the labelled test set. This ensures that performance is measured for a given validation domain and search phrase irrespective of the chosen similarity threshold.
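The sketch below shows how this threshold-independent evaluation could be computed for one domain and search phrase; the similarity scores and binary domain labels are illustrative placeholders for the pipeline outputs and the manual test-set labels.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Cosine similarity of each variable to the search phrase (illustrative values).
similarity_scores = np.array([0.71, 0.12, 0.55, 0.33])
# Manual test-set labels: 1 = variable belongs to the validation domain.
in_domain_labels = np.array([1, 0, 1, 0])

auc = roc_auc_score(in_domain_labels, similarity_scores)
print(f"AUC = {auc:.3f}")   # area under the ROC curve, independent of any threshold
```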
For the Semantic Clustering pipeline evaluation, we first use the silhouette score (62) to converge on the optimal set of clusters and then use the V-measure (66, 67) to evaluate clustering performance against the test set. Standard pairwise comparison is not possible because the arbitrary number of clusters does not equal the fixed number of 12 validation domains in our test set, requiring an alternative approach. Therefore, for a given domain, the cluster with the maximum V-measure is assumed to match that domain. To quantify harmonisation performance across the embedding dimensions and clustering algorithms, the mean of the maximal V-measures is taken across all domains to enable thorough comparison. Boltužić et al. (67) utilised the V-measure metric (66), which is the harmonic mean of homogeneity and completeness, aspects of clustering that are more desirable than accuracy. In contrast to precision and recall, the V-measure is not influenced by incomplete clustering, where some elements are not clustered. The measure is independent of the dataset and clustering algorithm used, and it scores errors that are grouped coherently more favourably than errors scattered across clusters. Similar measures, such as Q2 (68), depend on the number of clusters and do not explicitly calculate completeness, whereas the V-measure (69) is invariant to the number of clusters. Empirical evidence has demonstrated effective evaluation of high-dimensional TF-IDF vectors (66), as well as transcriptomic data for breast and lung cancer (70), using the V-measure.
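One possible reading of this procedure is sketched below using scikit-learn's v_measure_score; the toy labels and the per-domain versus per-cluster binary comparison are assumptions about how the maximal V-measure is obtained, not a definitive implementation of the evaluation.

```python
import numpy as np
from sklearn.metrics import v_measure_score

# Illustrative pipeline output (cluster per variable) and manual test-set labels.
cluster_labels = np.array([0, 0, 1, 1, 2, 2, 2, 3])
domain_labels = np.array(["finance", "finance", "housing", "housing",
                          "finance", "housing", "housing", "finance"])

# For each domain, compare its binary indicator against each candidate cluster and
# keep the maximal V-measure; the mean over domains summarises overall performance.
per_domain_max = []
for domain in np.unique(domain_labels):
    is_domain = (domain_labels == domain).astype(int)
    scores = [v_measure_score(is_domain, (cluster_labels == c).astype(int))
              for c in np.unique(cluster_labels)]
    per_domain_max.append(max(scores))

print(f"mean of per-domain maximal V-measures: {np.mean(per_domain_max):.3f}")
```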