Transformer-based Denoising Adversarial Variational Entity Resolution

Entity resolution (ER), precisely identifying different representations of the same real-world entities, is critical for data integration. ER has been studied for many years, and many methods have been proposed to solve it. Although deep learning has achieved good performance on ER tasks, challenges remain regarding manual labeling and model transfer. This paper proposes a novel ER model, Transformer-based Denoising Adversarial Variational Entity Resolution (TdavER). For entity embedding, we develop an unsupervised entity embedding model based on denoising autoencoders and pre-trained language models, which takes corrupted input as training data to push the encoder to generate stable, robust, high-quality entity representations. Furthermore, we propose an unsupervised entity feature transformation model based on adversarial variational autoencoders to ease the constraints that training data place on entity representations. This transformation model converts low-level entity embeddings into high-level probability distributions, which are not constrained by the source data and contain deep similarity features. To better implement the feature transformation, we adopt adversarial networks to optimize the variational autoencoder's training process and help it learn the correct posterior distribution. Extensive experiments confirm that the performance of our proposed TdavER is comparable with current state-of-the-art ER methods and that its entity feature transformation model is transferable.


Introduction
Entity resolution (ER), the process of identifying different representations of the same real-world entity (Garcia-Molina, 2004), has been studied for over 70 years. Consider the example shown in Fig. 1: $T_1$ and $T_2$ are two sets of tuples about academic papers from different databases. ER identifies tuple pairs that refer to the same paper, e.g., tuple $r_{11}$ from $T_1$ and tuple $r_{21}$ from $T_2$ are identified as co-referent. As an integral part of data cleaning and data integration, ER plays an important role in various fields, such as national censuses, fraud detection, and online shopping (Christen, 2012), and it has been studied extensively with a variety of methods. Early studies used rule-based approaches to solve the ER problem (Fan et al., 2011; Guo et al., 2010; Whang & Garcia-Molina, 2013). Since the matching rules must be acquired manually, these approaches have poor scalability and robustness. With the development of machine learning (ML), many learning-based methods have been proposed. These methods require manual feature extraction, which increases the cost of human involvement and fails to extract high-level features (Arasu et al., 2010; Ebraheem et al., 2018).
In recent years, deep learning (DL) techniques have been applied to ER and achieved good performance (Ebraheem et al., 2018; Mudgal et al., 2018). They typically follow a representation-then-interaction scheme: first, each tuple (or attribute) is encoded as a semantic representation by a deep neural network; then, the representations of the two tuples are processed with similarity functions (On et al., 2014) or learnable classifiers to obtain the final matching result. Although this DL-based approach is very effective, it requires much manual work to label data and is not transferable. In addition, ER data is characterized by low quality and unbalanced categories (Barlaug & Gulla, 2021). Therefore, ER remains a challenging task.
This paper proposes a novel ER model, Transformer-based Denoising Adversarial Variational Entity Resolution (TdavER). The processing flow of TdavER is depicted in Fig. 2. To prevent the DL-based ER model from becoming overly integrated and complex, TdavER divides ER into three processing stages: entity embedding, entity feature transformation, and entity matching. In the entity embedding phase, entity record pairs are converted into their corresponding representation vectors. We adopt an entity embedding model based on an unsupervised denoising autoencoder to obtain high-quality entity representations. This model trains the autoencoder on corrupted input data, which pushes the encoding network to better extract valuable features from the entity records. The entity representations produced by the embedding model are low-level and limited by the training data, so in the entity feature transformation phase we propose a transformation model based on an adversarial variational autoencoder. With this model, low-level entity embeddings are transformed into high-level features (e.g., probability distributions), which are not constrained by the training data and contain deep-level entity similarity information. For entity matching, we adopt a supervised approach to adjust the parameters of the entity feature transformation model to the actual matching scenario. Our contributions are summarized as follows:

1. We propose an unsupervised entity embedding model based on pre-trained language models and a denoising autoencoder. By using corrupted entity records as training data, this model pushes the encoder network to better extract valuable feature information from the inputs.
2. We propose an unsupervised entity feature transformation model that converts low-level entity embeddings into high-level features (e.g., Gaussian distributions) based on an adversarial variational autoencoder. The addition of the adversarial network optimizes the training process of the variational autoencoder and helps it learn the accurate posterior distribution. The resulting high-level features are not constrained by the original data, so the model is transferable and can be reused in other ER scenarios without repeated training.
3. Extensive experiments are conducted to verify the effectiveness and relevant properties of the TdavER model. The experimental results show that TdavER has excellent transferability and outperforms existing solutions on some datasets.

Background and related works
The typical ER approach for structured data is divided into two main phases (Elmagarmid et al., 2006): blocking (Fellegi & Sunter, 1969; Uppada et al., 2022) and matching (Maskat et al., 2016). In general, entity matching requires each entity to be compared with all other entities, so the ER problem has inherent quadratic complexity (Vieira et al., 2019). Many methods have been proposed to reduce this complexity, the most prominent of which is blocking. Blocking techniques drastically reduce the number of unnecessary comparisons and yield a small set of entity pairs that need to be compared precisely. The matching phase, the exact comparison of the remaining entity pairs, is the focus of this paper.
For the currently popular ER methods, matching models fall into several main categories, including classification models (Ebraheem et al., 2018; Bilenko & Mooney, 2003; Konda et al., 2016), graph-boosted methods (Primpeli & Bizer, 2021), and generative models (Wu et al., 2020). Classification-based matching methods, which have become a hot topic of recent research, are the central focus of this paper. Pixton and Giraud-Carrier (2006) used a structured neural network model to predict the final match results based on the similarity score corresponding to each attribute. Gottapu et al. (2016) used a convolutional neural network to learn the similarity function for entity matching and combined the model's output with crowdsourcing to further improve matching accuracy. DeepMatcher (Mudgal et al., 2018) offers a three-part ER model architecture: attribute embedding, similarity characterization, and classification, where each part corresponds to multiple optional DL methods. Based on combinations of different DL techniques, DeepMatcher defines a design space to explore the application of DL in ER models.
Several efficient pre-trained language models (e.g., BERT (Devlin et al., 2018), DistilBERT (Sanh et al., 2019), and RoBERTa (Liu et al., 2019)) have been proposed with the significant development of natural language processing. These pre-trained language models can effectively extract semantic information from entity records and facilitate learning similarity functions between them. DeepER (Ebraheem et al., 2018) transforms entity records into distributed representations based on recurrent neural networks (RNN) and proposes a blocking technique for such representations. Kasai et al. (2019) proposed an ER model transferable from a high-resource setting to a low-resource one based on pre-trained language models and active learning. DITTO (Li et al., 2020) achieves significant performance based on a pre-trained language model and three optimization methods. A pre-trained language model (e.g., BERT) is also applied in our proposed TdavER model.
Although DL-based ER models achieve significant performance, they have become very complex. Complex models are often highly dependent on large amounts of labeled data and do not facilitate transfer learning. Some recent work addresses this issue. Bogatu et al. (2021) proposed a transferable ER model, Variational Active Entity Resolution (VAER), based on a deep autoencoder. VAER separates feature learning from similarity learning and implements an unsupervised entity representation model based on the variational autoencoder (VAE) (Kingma & Welling, 2013) to alleviate model complexity. While VAER reduces the cost of human involvement and achieves good performance, there is room for improvement. On the one hand, VAER merely adopts common embedding models (e.g., LSA (Dumais et al., 2004), W2V, BERT (Devlin et al., 2018), EmbDI (Cappuzzo et al., 2020)), and the stability and robustness of the obtained entity representations need to be improved. On the other hand, the VAE used in VAER cannot capture the true posterior distribution because its inference model lacks sufficient expressive power (Mescheder et al., 2017); the VAE-based representation model therefore cannot accurately map entity records to the latent space (Gaussian distribution).
In this paper, we propose a new ER model, TdavER, to alleviate the above problems. In the entity embedding phase, we employ an unsupervised entity embedding model based on denoising autoencoders and pre-trained language models. This embedding model utilizes corrupted entity records as training data, which pushes the encoder to generate stable, robust, high-quality entity representations. To enhance the mapping capability of the VAE, an adversarial network is used to optimize its training process so that it can learn a more accurate posterior distribution. On this basis, we propose an entity feature transformation model based on an adversarial variational autoencoder, which can accurately map entity embedding vectors to the corresponding latent space variables (Gaussian distributions).

Unsupervised denoising entity embedding
High-quality representations are essential for many tasks, including text classification, image processing, and ER. DL-based models take numerical data as input, and the quality of that data depends on how much information it contains that is valid for the target task. In the ER task, the primary purpose of the entity embedding model is to generate high-quality entity representations. Real-world ER data contains a large amount of noise of various forms, such as human-entered typos, missing data, alternative data schemas, diverse syntax standards, and varying content formats. Traditional embedding models, even pre-trained language models, are easily disturbed by these noises and fail to produce high-quality entity representations.
The reconstruction criterion on which the traditional autoencoder relies does not guarantee that the encoder extracts useful features from the input data (Vincent et al., 2010). To make the reconstruction criterion more challenging, we use corrupted input as the training data for the autoencoder. The traditional reconstruction objective is thereby augmented with the task of repairing the corrupted input (denoising). This modification pushes the encoder to learn good features that can be generated from corrupted inputs and that help reconstruct the clean input, so the entity representations generated by the resulting encoder are more stable and robust. Inspired by (Vincent et al., 2010; Wang et al., 2021), we propose an unsupervised denoising entity embedding model based on denoising autoencoders and pre-trained language models. The structure of the entity embedding model is depicted in Fig. 3. The overall architecture is based on a denoising autoencoder, where the encoder network consists of a pre-trained language model and a pooling module.
For a given structured entity record $e$ containing $m$ attribute values $\{Attr_1, Attr_2, ..., Attr_m\}$, we first serialize it, because the pre-trained language model takes token sequences as input. The serialization module splits each attribute value in $e$ into individual tokens and then concatenates all tokens into a token sequence $ts$. We train the entity embedding model on noise-added input data. Noise is added by the noise-adding module, which randomly removes a certain percentage of tokens from the input token sequence $ts$. The noise-processed token sequence is passed to the encoder as input.
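As a concrete illustration, the following Python sketch shows how such a noise-adding module might corrupt a serialized record. The helper name and signature are our own; the 60% drop ratio follows the setting reported later in the implementation section.

```python
import random

def add_noise(tokens, drop_ratio=0.6, seed=None):
    """Randomly drop a fraction of tokens from a serialized entity record.
    drop_ratio=0.6 mirrors the 60% removal rate used in the experiments."""
    rng = random.Random(seed)
    kept = [t for t in tokens if rng.random() >= drop_ratio]
    return kept if kept else tokens[:1]  # keep at least one token

# Serialize a record's attribute values into one token sequence, then corrupt it.
ts = "Attention Is All You Need NeurIPS 2017".split()
noisy_ts = add_noise(ts, seed=42)
```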
In the encoding phase, the semantic information in the input token sequence is first extracted by the pre-trained language model. The pooling module then processes this semantic information to produce a fixed-size latent space vector $se$, the embedding vector corresponding to the entity record $e$. In our experiments, the pooling module uses mean pooling. Finally, the decoder reconstructs the latent space vector $se$ into a noise-free token sequence $rts$. Formally, the unsupervised training objective is:

$$\min_{E, D} \; d\big(D(E(\tilde{x})),\, x\big) \tag{1}$$

where $x = \{token_1, token_2, ..., token_k\}$ denotes the clean input token sequence, $\tilde{x}$ denotes the noisy token sequence, $E$ denotes the encoder, $E(\tilde{x})$ is the latent space vector obtained by encoding $\tilde{x}$, $D$ represents the decoder, $D(E(\tilde{x}))$ denotes the reconstruction of $x$, and $d$ is the function used to measure the reconstruction error.
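A minimal sketch of this objective, assuming a PyTorch-style encoder/decoder pair and token-level cross-entropy as the error function $d$ (the paper's implementation uses TensorFlow 1.15 and does not pin down $d$, so both choices are assumptions):

```python
import torch.nn.functional as F

def denoising_objective(encoder, decoder, noisy_ids, clean_ids):
    """Eq. 1 as a training loss: encode the corrupted sequence, decode it,
    and measure the error d(.,.) against the *clean* sequence.

    `encoder` is assumed to wrap the pre-trained language model plus mean
    pooling (one fixed-size vector per record); `decoder` is assumed to map
    that vector to per-token vocabulary logits of shape (batch, seq, vocab).
    """
    se = encoder(noisy_ids)                 # E(x~): latent entity embedding
    logits = decoder(se)                    # D(E(x~))
    # d: token-level cross-entropy against the uncorrupted token ids
    return F.cross_entropy(logits.flatten(0, 1), clean_ids.flatten())
```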
Traditional feature engineering extracts the semantic information of entity records into a fixed-size vector (entity embedding). Although entity embeddings can be used as feature vectors to train matching models, a fixed-size vector can only contain limited, low-level information. Furthermore, low-level features hinder the transferability of the matching model, because the training data limits the entity embeddings. In this section, we propose an unsupervised entity feature transformation model based on adversarial variational autoencoders to obtain more viable entity representations. The feature transformation model transforms low-level, fixed-size entity embeddings into higher-level, more expressive probability distributions. Moreover, the obtained probability distribution features are not constrained by the training data or data domain, giving the ER model transferable potential.

Model architecture for entity feature transformation
Based on the monotonicity assumption of precision (Hou et al., 2019; Arasu et al., 2010), we assume that the embedding vectors corresponding to duplicate entities are derived from similar prior distributions, and we approximate the prior distribution corresponding to an entity embedding as a Gaussian distribution. The effectiveness of the feature transformation depends on whether the model can learn the exact posterior distribution, but the inference models used in the VAE fail to capture it accurately (Mescheder et al., 2017). Inspired by (Mescheder et al., 2017), we adopt an adversarial network to optimize the training process of the VAE so that it learns an accurate posterior distribution. The architecture of the unsupervised entity feature transformation model is depicted in Fig. 4.
The given entity embedding $se$, generated by the unsupervised denoising entity embedding model, is first transformed by the encoder into a multidimensional Gaussian distribution $\mathcal{N}_h(\vec{\mu}, \vec{\sigma}) = \{(\mu_1, \sigma_1), (\mu_2, \sigma_2), ..., (\mu_h, \sigma_h)\}$, where $h$ denotes the dimensionality of the multidimensional Gaussian distribution, $\vec{\mu} = \{\mu_1, \mu_2, ..., \mu_h\}$ denotes the mean of the Gaussian distribution in each dimension, and $\vec{\sigma} = \{\sigma_1, \sigma_2, ..., \sigma_h\}$ denotes the standard deviation in each dimension. Each entity record corresponds to a multidimensional Gaussian distribution, and the similarity between entity records is determined by the distance between their corresponding distributions. The sampling module then performs ancestral sampling from $\mathcal{N}_h(\vec{\mu}, \vec{\sigma})$ using the reparameterization trick from the VAE, producing a fixed-size vector $z = \{z_1, z_2, ..., z_h\}$. Finally, the decoder reconstructs the sampled vector $z$ into an entity embedding, the reconstructed $se$.
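The reparameterized sampling step can be written in a few lines; this is a generic sketch of the standard trick, not code from the paper:

```python
import torch

def sample_z(mu, sigma):
    """Ancestral sampling via the reparameterization trick:
    z = mu + sigma * eps, with eps ~ N(0, I). The randomness is isolated
    in eps, so gradients can flow from z back into the encoder outputs
    mu and sigma during training."""
    eps = torch.randn_like(sigma)
    return mu + sigma * eps
```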
For the above processing flow, we introduce an adversarial network to optimize the training of the variational autoencoder. The input of the adversarial network is a vector pair, either $(se, z)$ or $(se, z_{sample})$, where $se$ denotes the input entity embedding, $z$ denotes the vector sampled from the multidimensional Gaussian distribution $\mathcal{N}_h(\vec{\mu}, \vec{\sigma})$, and $z_{sample}$ denotes a vector randomly sampled from the standard Gaussian distribution $\mathcal{N}(0, 1)$. The goal of the adversarial network is to correctly distinguish $(se, z)$ from $(se, z_{sample})$, while the goal of the encoder network is to generate pairs $(se, z)$ that the adversarial network cannot tell apart. As a result, a two-player game is formed between the encoder network and the adversarial network.

Model training for entity feature transformation
For a given set of entity embeddings corresponding to $n$ entity records, $\{se_1, se_2, ..., se_n\}$, we consider each $se$ as data generated by a latent space variable $z$. The latent variable $z$ originates from a prior distribution $p(z)$ containing high-level, similarity-salient information. For a given $se$, we want to reason about the properties of the latent variable $z$; in other words, we want to calculate the posterior distribution $p(z|se)$ according to $p(z|se) = \frac{p(se|z)\, p(z)}{p(se)}$ with $p(se) = \int p(se|z)\, p(z)\, dz$. However, since both $se$ and $z$ are multidimensional vectors, computing $p(se)$ is intractable. Alternatively, we can use a tractable distribution (e.g., Gaussian) $q(z|se)$ to approximate the true posterior $p(z|se)$ by variational inference (Beal, 2003). In practice, this translates into minimizing the dissimilarity between $q(z|se)$ and $p(z|se)$, which can be achieved by maximizing the objective in Eq. 2.
$$\max_{\theta, \phi} \; \mathbb{E}_{p_D(se)}\, \mathbb{E}_{q_\phi(z|se)} \big[ \log p_\theta(se|z) - \big( \log q_\phi(z|se) - \log p(z) \big) \big] \tag{2}$$

In Eq. 2, $p_D$ is the data distribution; the first term represents the expected log-likelihood of faithfully reconstructing $se$ given some $z$ from $q(z|se)$, and the second term represents the difference between the approximate distribution $q(z|se)$ and the actual prior distribution $p(z)$. So that Eq. 2 can be optimized by stochastic gradient descent, we implicitly represent the second term,

$$\log q_\phi(z|se) - \log p(z), \tag{3}$$

as the optimal value of an additional real-valued adversarial network $T(se, z)$. For a given $q_\phi(z|se)$, the optimization objective for $T$ is:

$$\max_T \; \mathbb{E}_{p_D(se)}\, \mathbb{E}_{q_\phi(z|se)} \big[ \log \sigma(T(se, z)) \big] + \mathbb{E}_{p_D(se)}\, \mathbb{E}_{p(z)} \big[ \log \big( 1 - \sigma(T(se, z)) \big) \big] \tag{4}$$

where $\sigma$ indicates the sigmoid function, $\sigma(t) = \frac{1}{1 + e^{-t}}$. Intuitively, $T(se, z)$ attempts to distinguish pairs $(se, z)$ sampled from $p_D(se)\, p(z)$ from pairs sampled from the current inference model $p_D(se)\, q_\phi(z|se)$. In the nonparametric limit (Mescheder et al., 2017; Goodfellow et al., 2020), the adversary $T(se, z)$ is flexible enough to approximate any function of the variables $se$ and $z$. By introducing the adversarial model $T(se, z)$, the objective in Eq. 2 can be rewritten as:

$$\max_{\theta, \phi} \; \mathbb{E}_{p_D(se)}\, \mathbb{E}_{q_\phi(z|se)} \big[ \log p_\theta(se|z) - T^*(se, z) \big] \tag{5}$$

where $T^*$ denotes the optimal adversary for the current inference model. In practice, the above inference process is implemented with neural networks. Specifically, the encoder network maps the $d$-dimensional input $se$ to two $h$-dimensional variables, $\vec{\mu}$ and $\vec{\sigma}$, which parameterize the latent Gaussian distribution $q(z|se)$ (we assume the latent distribution is multivariate Gaussian). The input $se$ is reconstructed through the decoder network from the latent variable $z$ sampled from $q(z|se)$. Intuitively, each input $se$ is mapped not to a single point but to a region satisfying some latent distribution, i.e., a series of variations of the input $se$. This reflects the variability and uncertainty between duplicate records.
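The following sketch shows how one adversarial-variational training step could be wired up. It assumes a PyTorch implementation with an encoder returning $(\vec{\mu}, \vec{\sigma})$, a Gaussian decoder (so the log-likelihood term reduces to an MSE), and illustrative optimizer handling; it instantiates Eqs. 4 and 5 but is not the paper's code.

```python
import torch
import torch.nn.functional as F

def avb_step(encoder, decoder, T, se, opt_vae, opt_T):
    """One training step of the adversarially optimized VAE (Eqs. 2-5).
    `encoder(se)` is assumed to return (mu, sigma); `T(se, z)` is the
    real-valued adversarial network. Optimizer wiring is illustrative."""
    mu, sigma = encoder(se)
    z_q = mu + sigma * torch.randn_like(sigma)   # z ~ q(z|se), reparameterized
    z_p = torch.randn_like(z_q)                  # z ~ p(z) = N(0, I)

    # Eq. 4: T learns to separate (se, z~q) from (se, z~p).
    # Note: log(1 - sigmoid(t)) == logsigmoid(-t).
    loss_T = -(F.logsigmoid(T(se, z_q.detach())).mean()
               + F.logsigmoid(-T(se, z_p)).mean())
    opt_T.zero_grad(); loss_T.backward(); opt_T.step()

    # Eq. 5: encoder/decoder maximize log p(se|z) - T(se, z); assuming a
    # Gaussian decoder, -log p(se|z) reduces to a reconstruction MSE.
    loss_vae = F.mse_loss(decoder(z_q), se) + T(se, z_q).mean()
    opt_vae.zero_grad(); loss_vae.backward(); opt_vae.step()
    return loss_T.item(), loss_vae.item()
```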

Transferability
Since the entity feature transformation model takes entity embeddings (numerical vectors) as input, the trained model is not tied to domain-specific information (e.g., domain-specific terms or field data types). Inheriting the advantages of the unsupervised entity representation model in VAER (Bogatu et al., 2021), our unsupervised entity feature transformation model is also transferable: a model trained on the data of one ER task can be directly applied to another ER task without retraining. This property reduces the cost of applying the model in different realistic scenarios.

Supervised entity matching
Our proposed unsupervised entity feature transformation model (Fig. 4) maps entity embeddings to Gaussian distributions. For most entity records, the similarity between the probability distributions is a good measure of the similarity between the records. However, if the differences between duplicate records are significant, the probability distributions corresponding to those records may be far apart, especially when the entity feature transformation model is transferred to other ER tasks. In addition, the entity feature transformation model is trained without supervision, so the learned transformation (from entity embedding to probability distribution) may differ from the actual mapping. Therefore, the model needs to be further tuned by supervised training to better represent the similarity between actual records.
The architecture of the matching model is shown in Fig. 5. The primary function of the matching model is to determine whether the two input records represent the same entity. It takes a pair of entity embeddings, $se_i$ and $se_j$, as input and transforms them into the corresponding multidimensional Gaussian distributions, $\mathcal{N}_i(\vec{\mu}, \vec{\sigma})$ and $\mathcal{N}_j(\vec{\mu}, \vec{\sigma})$, via the encoder. The initialized encoder is derived from the unsupervised entity feature transformation model introduced in Section 4. The two encoders in Fig. 5 have the same initial network parameters and remain consistent during supervised training.
The similarity measure between the entity embeddings $se_i$ and $se_j$ is converted into a distance measure between the Gaussian distributions $\mathcal{N}_i(\vec{\mu}, \vec{\sigma})$ and $\mathcal{N}_j(\vec{\mu}, \vec{\sigma})$. Two common distance measures for probability distributions are the 2-Wasserstein distance (Mallasto & Feragen, 2017) and the Mahalanobis distance (Gallego et al., 2013). This paper employs the former. The 2-Wasserstein distance between two probability distributions is the minimum cost of transforming one distribution into the other. For the Gaussian distributions $\mathcal{N}_i(\vec{\mu}, \vec{\sigma})$ and $\mathcal{N}_j(\vec{\mu}, \vec{\sigma})$, the 2-Wasserstein distance is:

$$W_2\big(\mathcal{N}_i, \mathcal{N}_j\big) = \left( \sum_{k=1}^{h} \Big[ \big(\mu_k^i - \mu_k^j\big)^2 + \big(\sigma_k^i - \sigma_k^j\big)^2 \Big] \right)^{1/2} \tag{6}$$

where $W_2$ denotes the 2-Wasserstein distance, $h$ denotes the dimensionality of the multidimensional Gaussian distributions, $\mu_k^i$ and $\sigma_k^i$ denote the mean and standard deviation of the $k$th dimension of $\mathcal{N}_i(\vec{\mu}, \vec{\sigma})$, and $\mu_k^j$ and $\sigma_k^j$ denote the mean and standard deviation of the $k$th dimension of $\mathcal{N}_j(\vec{\mu}, \vec{\sigma})$.
In addition, we compute a distance vector as the input for the subsequent classification network. Based on Eq. 6, we compute the 2-Wasserstein distance of each dimension of $\mathcal{N}_i(\vec{\mu}, \vec{\sigma})$ and $\mathcal{N}_j(\vec{\mu}, \vec{\sigma})$:

$$\vec{d}_{(i,j)} = \Big( W_2\big(\mathcal{N}(\mu_1^i, \sigma_1^i), \mathcal{N}(\mu_1^j, \sigma_1^j)\big), \; ..., \; W_2\big(\mathcal{N}(\mu_h^i, \sigma_h^i), \mathcal{N}(\mu_h^j, \sigma_h^j)\big) \Big) \tag{7}$$

where $\vec{d}_{(i,j)}$ represents the 2-Wasserstein distance vector corresponding to $\mathcal{N}_i(\vec{\mu}, \vec{\sigma})$ and $\mathcal{N}_j(\vec{\mu}, \vec{\sigma})$, and $h$ indicates the dimensionality of the multidimensional Gaussian distributions. The distance vector $\vec{d}_{(i,j)}$ is then passed to the classification network to predict the final matching result.
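Both distances are straightforward to compute for diagonal Gaussians; here is a small NumPy sketch of Eqs. 6 and 7 (the function names are ours, not the paper's):

```python
import numpy as np

def w2_distance(mu_i, sig_i, mu_j, sig_j):
    """Eq. 6: 2-Wasserstein distance between two h-dimensional diagonal Gaussians."""
    return np.sqrt(np.sum((mu_i - mu_j) ** 2 + (sig_i - sig_j) ** 2))

def w2_vector(mu_i, sig_i, mu_j, sig_j):
    """Eq. 7: per-dimension W2 distances that feed the classification network."""
    return np.sqrt((mu_i - mu_j) ** 2 + (sig_i - sig_j) ** 2)

# Example with h = 3 dimensions
mu_i, sig_i = np.array([0.2, 1.0, -0.5]), np.array([0.9, 1.1, 1.0])
mu_j, sig_j = np.array([0.1, 1.2, -0.4]), np.array([1.0, 1.0, 1.2])
d_ij = w2_vector(mu_i, sig_i, mu_j, sig_j)   # input vector for the classifier
```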
The matching model is trained by supervised learning, and its optimization objective has two aspects. Objective 1 is to minimize the 2-Wasserstein distance between duplicate entity records and maximize it between non-duplicate records, i.e., to optimize the parameters of the encoder network taken from the unsupervised entity feature transformation model. Objective 2 is to minimize the classification error of the classification network by optimizing its parameters. Inspired by (Bogatu et al., 2021), to optimize both objectives simultaneously we adopt a contrastive loss function (Neculoiu et al., 2016):

$$\mathcal{L}(se_i, se_j, y) = w \, y \, W_2^2\big(E_\theta(se_i), E_\theta(se_j)\big) + (1 - y) \Big[ \max\big(0,\, M - W_2(E_\theta(se_i), E_\theta(se_j))\big) \Big]^2 + \mathcal{L}_{CE}\big(y,\, B_\phi(\vec{d}_{(i,j)})\big) \tag{8}$$

where $se_i$ and $se_j$ denote the two input entity embeddings, $w$ represents a weight parameter used to mitigate the category imbalance problem, $y$ indicates the true label of the entity record pair corresponding to $se_i$ and $se_j$, $W_2$ is the 2-Wasserstein distance metric, $E_\theta$ refers to the encoder network with parameters $\theta$, $E_\theta(se_i)$ and $E_\theta(se_j)$ denote the multidimensional Gaussian distributions corresponding to $se_i$ and $se_j$, $B_\phi$ represents the classification network with parameters $\phi$, $\vec{d}_{(i,j)}$ indicates the distance vector corresponding to $E_\theta(se_i)$ and $E_\theta(se_j)$, $B_\phi(\vec{d}_{(i,j)})$ represents the predicted probability that $se_i$ and $se_j$ are duplicate records, and $\mathcal{L}_{CE}$ is the classification (cross-entropy) error. Inspired by (Bogatu et al., 2021), we also set a margin parameter $M$ that controls how the encoder parameters are adjusted according to the labeled training data. The margin ensures that the matching model does not over-separate the probability distributions of non-duplicate records during training but instead focuses on hard pairs of duplicate records. The first two terms in Eq. 8 correspond to objective 1 of the matching model, and the third term corresponds to objective 2.
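A sketch of this loss, assuming a PyTorch implementation; the exact weighting of the three terms follows our reading of Eq. 8 rather than verified code. Here `y` is a float tensor of 0/1 labels and `logit` is the classifier output before the final activation.

```python
import torch
import torch.nn.functional as F

def matching_loss(mu_i, sig_i, mu_j, sig_j, logit, y, w=3.0, M=1.0):
    """Contrastive objective of Eq. 8 (illustrative). y = 1 for duplicate
    pairs, 0 otherwise; w counters class imbalance; M is the margin;
    `logit` stands in for the classification network output B_phi(d_(i,j))."""
    w2 = torch.sqrt(((mu_i - mu_j) ** 2 + (sig_i - sig_j) ** 2).sum(-1))
    pull = w * y * w2.pow(2)                   # objective 1: pull duplicates together
    push = (1.0 - y) * F.relu(M - w2).pow(2)   # objective 1: push non-duplicates beyond M
    ce = F.binary_cross_entropy_with_logits(logit, y)  # objective 2: classification error
    return (pull + push).mean() + ce
```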

Experiments and evaluation
In this section, we conduct various experiments to evaluate the performance of our proposed TdavER model. First, the optimal value of the margin parameter M is set based on experimental results. Second, the performance of TdavER is demonstrated by comparing it with other ER methods. Third, ablation experiments demonstrate the effectiveness of our proposed entity feature transformation model and entity embedding model. Finally, we demonstrate the transferability of the unsupervised entity feature transformation model.

Benchmark datasets
Eight public datasets from the ER benchmark collection are used to evaluate TdavER. The details of the datasets are described in Table 1, including the name of each dataset (Dataset), the data domain (Domain), the number of labeled tuple pairs (Size), the number of positive instances (#Pos.), and the number of attributes (#Attr.). These datasets come from multiple domains, such as music and restaurants.
The proportion of matching pairs ranges from 11.6% (FZ) to 24.4% (iTA1). The benchmark contains three dirty datasets, iTA2, DA2, and DS2, each derived from the corresponding structured dataset above: the value of each attribute is randomly moved to the title attribute of the same tuple with 50% probability. This process imitates typical dirty-data problems in the real world. Each dataset is partitioned into training, validation, and test sets with a 3:1:1 ratio.
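For illustration, the dirty-data construction might look like the following sketch. The field names and helper function are hypothetical; only the 50% move probability comes from the benchmark description.

```python
import random

def make_dirty(record, seed=None):
    """Imitate the benchmark's dirty-dataset construction: each non-title
    attribute value is moved into the title attribute of the same tuple
    with probability 0.5. Field names here are illustrative."""
    rng = random.Random(seed)
    dirty = dict(record)
    for attr, value in record.items():
        if attr != "title" and value and rng.random() < 0.5:
            dirty["title"] = f"{dirty['title']} {value}"
            dirty[attr] = ""
    return dirty

row = {"title": "Thriller", "artist": "Michael Jackson", "year": "1982"}
dirty_row = make_dirty(row, seed=7)
```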

Compared methods and evaluation measures
To evaluate the performance of TdavER, we compare it with several current approaches: Magellan (Konda et al., 2016), DeepMatcher (Mudgal et al., 2018), DITTO (Li et al., 2020), and VAER (Bogatu et al., 2021). The compared methods are described below: Magellan. Magellan is a state-of-the-art ML-based ER ecosystem that provides various tools for ER tasks, including blocking techniques, matching models, data cleaning, and information extraction.
DeepMatcher. DeepMatcher is a DL-based ER model that defines a design space of DL solutions and proposes four solutions with different attribute summarizations: SIF, RNN, Attention, and Hybrid.
DITTO. DITTO is a state-of-the-art ER system based on pre-trained language models (e.g., BERT) that treats the ER task as a sequence-pair classification problem. In addition, DITTO employs three optimization methods to enhance its results: text summarization, data augmentation, and domain knowledge.
VAER. VAER is a transferable DL-based ER model that separates feature learning from similarity learning. VAER uses LSA (Dumais et al., 2004) to generate entity embeddings and then employs a variational autoencoder to map the entity embeddings to the latent space.
As usual, the matching performance of a model is measured by precision (P), recall (R), and F1-score (F1), defined as $P = \frac{tp}{tp + fp}$, $R = \frac{tp}{tp + fn}$, and $F_1 = \frac{2 \times P \times R}{P + R}$, where $tp$ denotes the number of record pairs for which both the prediction and the label are matched, $fp$ denotes the number of record pairs labeled as mismatched but predicted as matched, and $fn$ denotes the number of record pairs labeled as matched but predicted as mismatched.
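These metrics reduce to a few lines of Python; a quick sketch with example counts:

```python
def prf1(tp, fp, fn):
    """Precision, recall, and F1 from pairwise match counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

# e.g., 90 true matches found, 10 false matches, 20 missed matches
p, r, f1 = prf1(tp=90, fp=10, fn=20)   # -> 0.90, ~0.818, ~0.857
```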

Implementation and experimental setup
The TdavER model proposed in this paper is implemented on the TensorFlow 1.15.0 framework. For the entity embedding model, we adopt bert-base-uncased as the pre-trained language model and mean pooling as the pooling process. We add noise by randomly removing tokens, with the percentage of removed tokens set to 60%. The dimensionality of the entity representation vector is 768. For the entity feature transformation model, the dimensionality of the multidimensional Gaussian distribution obtained from the entity embedding is 128. For the matching model, the classification network consists of multiple linear layers and a softmax layer, and the weight parameter w is set to 3.0. The other components of the overall ER architecture are composed of multiple linear layers.
The evaluation of TdavER is divided into four stages: testing the effect of the margin parameter, comparing the performance of TdavER with other ER models, ablation experiments, and evaluating model transferability.

Setting of margin parameters
TdavER has one essential hyperparameter, the margin parameter M. For the same dataset, different values of M yield different experimental results. In this section, we test diverse values of M to select the best setting for each dataset, trying five values: 0.1, 0.5, 1, 5, and 10. The experimental results are depicted in Fig. 6 and Table 2. The value of M significantly influences the results on the Br, iTA1, and iTA2 datasets, while the results fluctuate only slightly across M values on the FZ, DA1, DA2, DS1, and DS2 datasets. As Table 1 shows, the Br, iTA1, and iTA2 datasets each have no more than 550 training examples, whereas the other datasets contain a large amount of training data, especially DA1, DA2, DS1, and DS2. The overall results show that varying M only impacts the small datasets. This is because changes to M affect the magnitude of the model parameter updates; for small datasets, changes in model parameters can lead to significant changes in the results because there are too few test cases. As seen in Fig. 6, the model achieves the best performance on the small datasets (Br, iTA1, and iTA2) when M equals 1, so we set M = 1 for the subsequent experiments. In addition, the model's performance tends to drop at M = 5 on some datasets, especially the very small ones (Br, iTA1, and iTA2). However, when the training and test sets contain only a very small amount of data, the training process is somewhat unstable and the results are inconclusive, so we cannot definitively state that the model's performance drops at M = 5 on small datasets.

Performance comparison
This section compares the TdavER model with other ER methods, including Magellan, DeepMatcher, DITTO, and VAER, on the eight benchmark datasets. The comparison results are described in Table 3. For datasets Br, iTA1, and iTA2, our proposed TdavER model achieves the best scores, improving by 1.41%, 3.67%, and 5.59%, respectively, over the highest scores among the compared methods.
For datasets FZ, DA1, and DA2, the difference between the scores achieved by TdavER and the highest scores among the compared models is very small: only 0.00%, 0.08%, and 0.60%, respectively. For datasets DS1 and DS2, TdavER scores slightly below the best compared method, by 3.70% and 3.50%, respectively. There are two reasons for this result. First, datasets DS1 and DS2 contain a large amount of data of poor quality, which prevents our unsupervised entity embedding model from obtaining high-quality entity representations. Second, DeepMatcher and DITTO are fully supervised deep learning models that rely on larger amounts of labeled data; their higher degree of integration also makes transfer learning inconvenient.
The overall experimental results show that our proposed TdavER achieves performance comparable to that of current state-of-the-art ER methods.

Ablation experiments
To demonstrate the effectiveness of our proposed entity embedding model and entity feature transformation model, we design two variants of the TdavER model: TdavER (BERT) and TdavER (VAE). TdavER (BERT) is obtained by replacing the original entity embedding model with a pre-trained language model (bert-base-uncased). TdavER (VAE) is created by replacing the adversarial variational autoencoder with a plain VAE.
All other network structures and parameters of TdavER (BERT) and TdavER (VAE) are consistent with the TdavER model. The comparative results of the TdavER model and its two variants are described in Table 4.
The experimental results in Table 4 show that TdavER significantly outperforms its two variants, especially TdavER (BERT). A simple statistical analysis of the results shows that, over the eight test datasets, TdavER improves on TdavER (BERT) by an average of 7.8%, with the greatest improvement, 32.8%, on the iTA2 dataset. Compared to TdavER (VAE), TdavER achieves an average improvement of 2.2% per dataset, with the biggest improvement, 8.16%, on the iTA2 dataset. These results illustrate the effectiveness of our proposed unsupervised entity embedding model and unsupervised entity feature transformation model. In addition, the comparison between TdavER (BERT) and TdavER shows that high-quality entity representation vectors are very important for subsequent entity matching, and the comparison between TdavER (VAE) and TdavER illustrates that a variational autoencoder with an adversarial network maps entity embeddings to the probability distribution space more accurately. To further evaluate the effectiveness of our proposed adversarial variational autoencoder, TdavER (VAE) and TdavER are compared from additional aspects below.
Data sensitivity. We evaluate the data sensitivity of TdavER and its variant TdavER (VAE). Specifically, two groups of datasets are used: the original structured datasets (iTA1, DA1, DS1) and the corresponding dirty datasets (iTA2, DA2, DS2). Each dataset is divided into four training proportions, i.e., 10%, 30%, 50%, and 60%, and the model's performance is evaluated at each proportion. The results are depicted in Fig. 7. Overall, the matching performance of the model improves as the proportion of training data increases. On datasets iTA1, iTA2, DS2, and DA2, TdavER consistently outperforms TdavER (VAE) as the training set grows. On datasets DA1 and DS1, TdavER (VAE) outperforms TdavER when the training data is below 20%, and TdavER becomes comparable or even superior as the training proportion increases.

To demonstrate that our proposed entity feature transformation model maps entity embeddings to the corresponding Gaussian distributions more accurately, we compare the feature transformation results of the adversarial variational autoencoder and the VAE. First, the adversarial variational autoencoder-based and VAE-based entity feature transformation models are trained on the training datasets. Then, we use each resulting encoder to transform the entity embeddings of the test sets into the corresponding multidimensional Gaussian distributions and compute the corresponding 2-Wasserstein distances of the entity pairs. Finally, the mean 2-Wasserstein distance is calculated for entity pairs of each class (matching or mismatching). The experiment uses five benchmark datasets, and the comparison results are described in Table 5.
For datasets Br, iTA1, DA1, and FZ, the average 2-Wasserstein distance of the matched entity pairs under the TdavER model is significantly smaller than under TdavER (VAE). For dataset DS1, although the average 2-Wasserstein distance of matched pairs under TdavER (VAE) is smaller than under TdavER, the gap between the average distances of non-matched and matched pairs under TdavER (VAE) is very small, which is not conducive to training the subsequent entity matching model. For the feature transformation process, we want matched pairs to have small distances while keeping a noticeable gap between the distances of matched and non-matched pairs. From the overall comparison, the feature transformation performance of our proposed adversarial variational autoencoder is better than that of the VAE.

Transferability experiments
Our proposed unsupervised entity feature transformation model uses numerical data as input. Therefore, the trained model is not constrained by the training data and can be transferred to other ER tasks without retraining. In this section, we perform a series of experiments to evaluate the transferability of the entity feature transformation model.
For the local experimental results of TdavER, the entity feature transformation model trained on the local data is used. For the transfer experiments, the entity feature transformation model trained on dataset DS1 is used for datasets DA1 and DA2, and the model trained on dataset DA1 is used for all other datasets. The local and transfer results of the TdavER model are shown in Table 6. The results based on the transferred entity feature transformation model are close to, or even slightly better than, those based on the local model. Thus, transferability allows the entity feature transformation model to be reused for other ER tasks, reducing the cost of model training without impacting effectiveness.
Furthermore, we compare the transferability results of TdavER and VAER. The comparison results are shown in Table 7 (the results for the VAER model are from (Bogatu et al., 2021)). The results show that the transfer performance of TdavER is significantly higher than that of VAER on datasets Br and iTA1, and equal to or slightly higher than VAER on datasets FZ, DA1, and DS1. Overall, the transferability of the TdavER model is better than that of VAER. This result also demonstrates that, compared to VAER, our proposed entity feature transformation model learns a more accurate posterior distribution and better maps entity embeddings to the probability distribution space.

Conclusion
In this work, we propose a novel ER model, TdavER, which divides the ER task into entity embedding, entity feature transformation, and entity matching. For entity embedding, TdavER uses an unsupervised entity embedding model that converts entity records into high-quality embedding vectors to ensure the validity of the subsequent models. TdavER then applies an unsupervised entity feature transformation model based on an adversarial variational autoencoder, which transforms low-level entity embeddings into high-level probability distributions that are not constrained by the training data and contain high-level similarity features. The experimental comparisons demonstrate that our proposed TdavER performs comparably to current state-of-the-art ER algorithms and achieves the best performance on some domain datasets. In addition, the experiments show that the unsupervised entity feature transformation model used in TdavER has better transferability than the entity representation model used in VAER.

Fig. 3
The architecture of the unsupervised entity embedding model

Fig. 6
The variation in the model's performance based on different M values for all baseline datasets

Table 1
Datasets for our experiments

Table 2
F1-scores of TdavER based on different M values

Table 3
Experimental results (F1-scores) of various ER methods

Table 5
The average 2-Wasserstein distances corresponding to matched and unmatched entity pairs

Table 6
Recall/F1-score with local/transferred representation models

Table 7
Recall/F1-scores with transferred representation models for AVER and TdavER