Matching Document Pairs using Multi-Feature Semantic Fusion Based on Knowledge Graph

Determining whether two documents are homologous or heterogeneous is an important and difficult step in information retrieval. Apart from manual verification, which is costly in human resources, existing methods mainly focus on word-based duplicate checking or on sentence-pair matching. Word-based duplicate checking cannot judge the similarity of two documents at the semantic level, and sentence-pair matching methods cannot effectively mine the semantic information of long texts, which are frequent among retrieval results. This paper proposes a concept-based Multi-Feature Semantic Fusion Model (MFSFM). It employs multi-feature enhanced semantics to construct a concept map that represents the document, and employs a multi-convolution mixed residual CNN module to introduce a local attention mechanism that improves sensitivity to conceptual boundary information. To validate the feasibility of the proposed MFSFM based on concept maps, two multi-feature document datasets are built, each consisting of about 500 actual scientific and technological project feasibility reports. Experimental results on these datasets show that the proposed MFSFM converges quickly while exceeding the accuracy of the latest natural language matching methods.


Introduction
Recognizing the relationship between document pairs is an indispensable Natural Language Understanding (NLU) task, which is essential for document deduplication and document search. For example, a project system needs to review newly declared projects to check whether there are duplicate declarations. Early document recognition methods were based on term similarity and rules. Traditional term-based matching methods appraise the semantic information between document pairs through unsupervised indicators [1], e.g., TF-IDF vectors [2], BM25 [3], and LDA [4]. These approaches have been successful in document querying, information retrieval, and search [1]. Rule-based methods require experienced experts to summarize the rules [5,6]; the stability of the model depends on the knowledge structure of the experts, and the rules given by different experts may contradict each other. To overcome the shortcomings of rule-based methods, Bengio et al. proposed a document recognition method based on machine learning [7]. The main idea of such methods is to divide documents into multiple categories and then classify them. Classic machine learning classifiers that can be used for document recognition include the Hidden Markov Model (HMM) [8], Maximum Entropy Model (MEM) [9], Maximum Entropy Markov Model (MEMM) [10], Conditional Random Field (CRF) [11], and Support Vector Machine (SVM) [12]. These methods have achieved good results on corpora from different fields, but features must first be designed for each specific field before training. The effect of the model therefore depends mainly on the selection of features, and its generalization ability is weak.
In recent years, a variety of deep neural network models for text matching have also been proposed [13,14], in which recurrent or convolutional neural network layers capture the semantic dependence (especially the order dependence) in natural language. Lample et al. proposed a general multilingual BiLSTM-CRF model that uses word embeddings as features to identify named entities [13]. Pinherio et al. [14] first combined CNN with CRF and achieved good results on the CONLL2003 corpus. Huang et al. [15] constructed a BiLSTM-CRF model with manually designed spelling features, which achieved an F1-measure of 88.83% on the CONLL2003 corpus. Chiu and Nichols established the BiLSTM-CNNs model, which achieved an F1-measure of 91.62% on the CONLL2003 corpus [16]. Dernoncourt et al. designed an easy-to-use neural network entity recognition tool named NeuroNER [17], which allows users to directly tag entities and perform training and prediction through a web graphical interface. Crichton et al. proposed a multi-task learning method for biomedical named entity recognition [18], which increased the average F1-measure by 0.8% compared to single-task learning. Shen et al. proposed a named entity recognition method based on deep active learning [19]. Deep active learning has also achieved good results in the fields of medicine and imaging [20][21][22]; compared with standard deep learning, it requires only a small amount of training data to reach the same effect.
However, existing deep models mainly address sentence-pair matching tasks, such as paraphrase recognition and answer selection, and omit the keywords, entities, and complex interactions between sentences that occur in longer documents. Therefore, although document matching is important, it has not been fully studied.
Semantic matching between long documents is a largely untapped area, although there are many datasets for sentence matching. As far as we know, there is no public labeled dataset for matching long documents. To facilitate the evaluation and further study of document matching, this paper creates two labeled datasets: one annotates whether pairs of project feasibility report documents (from different projects) belong to the same project, and the other annotates whether the document pairs belong to the same topic. These documents are the scientific and technological projects declared by the subsidiaries of State Grid Hunan Electric Power Co., Ltd. Note that, similar to most other natural language matching models, all the methods proposed in this article can also be easily applied to other languages. Specifically, we make the following contributions:
(1) First, we propose the Concept Graph (CG), which treats a document as a weighted graph of concepts. A keyword or a group of closely connected keywords represents a concept vertex. The sentences in the document associated with each concept are used for local comparison with the same concept appearing in another document. In addition, weighted edges connect pairs of concept vertices in the document and indicate their interaction strength. The CG not only captures the essential semantic units in the document, but also provides a way to anchor the comparison between two documents on the discovered concepts.
(2) Second, we propose a divide-and-conquer framework to match a pair of documents based on the constructed CG and a Graph Convolution Network (GCN). The idea is that, for each concept vertex appearing in the two documents, we first obtain a local matching vector via a series of text encoding schemes (including neural encoding and term-based encoding). Further, the multi-convolution mixed residual CNN (MCMR-CNN) module is used to obtain local attention information and improve the sensitivity to concept boundary information.
(3) Finally, taking the output of MCMR-CNN as input, we propose a concept-based Multi-Feature Semantic Fusion Method (MFSFM), in which we first design a Contextual Multi-Feature Embedding (CMFE) structure to improve text representation. CMFE performs multi-feature semantic enhancement through the multiple features in the dataset, and then performs multi-level feature enhancement through a CNN. Compared with RNN-based sequential modeling, MFSFM decomposes the matching process into partial matching sub-problems on the graph. Extensive experiments show that the proposed algorithm brings significant improvements in matching document pairs. Specifically, the classification accuracy of the proposed MFSFM on the two datasets is improved by 13.17% and 19.82%, respectively.

Related Works
Traditional document representation methods mainly build vectors such as Bag Of Words (BOW) [23], Term Frequency Inverse Document Frequency (TF-IDF) [2], and Latent Dirichlet Allocation (LDA) [4], and compute the distance between the vectors. However, these representations cannot capture semantic information and generally fail to achieve good performance.
Graphical document representations have been proposed to better capture semantic distance. Most existing graphical document representations fall into four categories: word, text, concept, and hybrid graphs. In word graphs, the words in the document are used as vertices, and edges are constructed based on syntax analysis [24], co-occurrence [1,25], or the preceding relationship [26]. In text graphs, sentences, paragraphs, or documents are used as vertices, and edges are built from word co-occurrence, location [27], text similarity [28], or hyperlinks among documents [29]. In concept graphs, the terms of documents are linked to real-world concepts based on knowledge bases such as DBpedia [30], and edges are constructed based on semantic and syntactic rules. Hybrid graphs [31,32] are composed of vertices and edges of different types.
In recent years, different neural network architectures have emerged for document pair matching tasks [18,33,34]. Representation-focused models usually convert each document of the pair into a context vector via a Siamese network, and then use a fully connected layer or a scoring function on the context vectors to give a matching result [13,35]. Interaction-focused models extract all the features of pairwise interactions between words in a document pair, and combine the interaction information through a Deep Neural Network (DNN) to derive matching results [14]. However, these neural network models do not make full use of the inherent structural characteristics of long text documents, and therefore underperform in matching long text pairs. Some studies also use knowledge [34], hierarchical attributes [36], or graph architectures [25] for matching long text. In contrast, the proposed MFSFM represents the document through a novel graphical notation and then combines this notation with GCN. More recently, pre-trained models such as BERT [37] have emerged, which can also be used for text matching. However, these models have high complexity and struggle to meet the speed requirements of practical applications.
Previous GCN architectures were mainly used to complete missing attributes or links [38], for classification [39], or for node clustering, but they all operated within a single graph, such as a knowledge graph, a social network, or a citation network. In this paper, the proposed CG uses a simple method to represent project documents as weighted undirected graphs, which helps to decompose these documents into subsets of sentences, each of which focuses on a different concept or subtopic. Compared with previous uses of Natural Language Processing (NLP) to deconstruct documents, our method better reflects the semantics of the document.
In addition, manual review is not applicable to large amounts of data, and word-based duplicate checking cannot mine the in-depth semantic information of a document. Neither can effectively mine the semantic information of long texts. To match document pairs more accurately and easily, and considering the location and part-of-speech features of keywords, this paper proposes a Multi-Feature Semantic Fusion Model (MFSFM) to identify citation entities. The model does not require manual rules or templates, and it can better identify citation entities based on the extracted generic features.

Concept Graph
As mentioned earlier, the Concept Graph (CG) represents a document as a weighted undirected graph. The document is first decomposed into subsets of sentences, each of which aligns to a different concept. Given a document $D$, we define a graph $G_D$ as its CG. Each vertex of $G_D$ is called a concept, and it is a keyword or a group of highly related keywords [1]. Each sentence in $D$ is attached to the concept vertex it is most related to, which is usually the concept mentioned most often in the sentence. Therefore, each vertex has its own set of sentences, and these sets are disjoint.
Fig. 1 describes how a document is transformed into a Concept Graph. Standard keyword extraction algorithms such as TextRank [27] can be used to extract keywords such as "变压器" (transformer) and "电气设备" (electrical equipment), along with the remaining concepts, from the document. In a CG, each concept is a subset of closely related keywords. We first group keywords into concepts, and then attach each sentence to its most relevant concept vertex [1]. For example, in Fig. 1, sentence 1 and sentence 2 mainly discuss the relationship between "变压器" (transformer) and "电气设备" (electrical equipment), so they are attached to the concept (transformer, electrical equipment). In this way, a concept map of key concepts denotes the original document. Each concept map consists of subsets of sentences and the topological relationships between them. Fig. 2 shows the alignment of the discovered concepts and the construction of the CG of the document. The detailed steps for splitting the document and merging the CG are as follows:
(1) Constructing KeyGraph: Given a document $D$, we apply TextRank [27] to extract named entities and keywords. We then build a keyword co-occurrence graph on the set of extracted keywords, called the Key Graph (KG), in which each keyword is a vertex. If two keywords appear in the same sentence, we connect them with an edge. To further improve the model, coreference analysis and synonym analysis could be applied to merge keywords with the same meaning. Because of their time complexity, these operations are not applied here.
(2) Concept detection: The structure of the KG reveals the interaction relationships between keywords. When a subset of keywords is highly correlated, it forms a densely connected subgraph in the KG, which we call a concept [1]. We use a community detection algorithm to extract concepts. The community detection algorithm divides the KG into a group of communities $C = \{C_1, C_2, \ldots, C_{|C|}\}$, where each community contains the keywords of a certain concept. With overlapping community detection, a keyword may appear in multiple concepts. Since the number of concepts varies greatly between documents, we use an algorithm based on the betweenness centrality score [40] to detect keyword communities in the KG. It is worth noting that this step is optional: each keyword can also be used directly as a concept. The advantage of concept detection is that it reduces the number of vertices and increases the matching speed.
(3) Attaching sentences: After discovering concepts through keywords, we group sentences by concept in a similar way. We represent each sentence and each concept by TF-IDF vectors [2] and calculate the cosine similarity between them. Each sentence is attached to the concept it is most similar to. Sentences that match no concept are attached to a virtual (dummy) vertex, which does not contain any keywords.
(4) Constructing edges: The relationship between concepts is reflected by the edges between them. For each vertex, we represent its sentence set as the concatenation of the sentences attached to it, and use the TF-IDF similarity between the sentence sets of two vertices as the weight of the edge between them. Other ways of determining the edge weight are possible, but constructing edges through TF-IDF similarity works better because it generates a more densely connected CG.
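Step (2) above can be illustrated with a small example. The following is a minimal sketch, assuming an unweighted keyword graph stored as an adjacency dictionary and a Girvan-Newman-style procedure (repeatedly removing the edge with the highest betweenness until the graph splits). The paper only states that a betweenness-based algorithm [40] is used, so the procedure, the function names `edge_betweenness` and `split_once`, and the toy keywords are illustrative assumptions.

```python
from collections import defaultdict, deque

def edge_betweenness(adj):
    """Brandes-style edge betweenness for an unweighted, undirected graph."""
    bet = defaultdict(float)
    for s in adj:
        dist = {s: 0}
        sigma = defaultdict(float)   # number of shortest paths from s
        sigma[s] = 1.0
        preds = defaultdict(list)    # predecessors on shortest paths
        order = []
        queue = deque([s])
        while queue:                 # breadth-first search from s
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = defaultdict(float)
        for w in reversed(order):    # accumulate dependencies back to s
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                bet[frozenset((v, w))] += share
                delta[v] += share
    return bet

def components(adj):
    """Connected components as a list of vertex sets."""
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps

def split_once(adj):
    """Remove highest-betweenness edges until one more component appears."""
    adj = {v: set(ns) for v, ns in adj.items()}   # work on a copy
    start = len(components(adj))
    while len(components(adj)) == start:
        bet = edge_betweenness(adj)
        if not bet:                               # no edges left to remove
            break
        u, v = tuple(max(bet, key=bet.get))
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)

# Hypothetical keyword graph: two triangles joined by one bridge edge.
keyword_graph = {
    "transformer": {"noise", "vibration"},
    "noise": {"transformer", "vibration"},
    "vibration": {"transformer", "noise", "cable"},
    "cable": {"vibration", "insulation", "aging"},
    "insulation": {"cable", "aging"},
    "aging": {"cable", "insulation"},
}
concepts = split_once(keyword_graph)
```

In this toy graph the bridge between "vibration" and "cable" carries every shortest path between the two triangles, so it is removed first and each triangle becomes one concept. A full implementation would keep splitting until the communities satisfy the size limits chosen for the experiments.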
As shown in Fig. 2(a), we apply the above steps to a pair of documents $D_A$ and $D_B$ when performing matching. The difference is that, for each concept vertex common to both documents, we align the CGs of the two documents along the concept vertices and, for local comparison, merge the sentence sets from $D_A$ and $D_B$.
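The sentence attachment, edge construction, and alignment steps can be sketched in plain Python. This is a minimal sketch under stated assumptions: sentences are already tokenized, concepts are given as keyword tuples, the smoothed IDF formula is one common choice rather than the paper's, and the names `build_concept_graph`, `merge_sentence_sets`, and `"<dummy>"` are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(token_lists):
    """Sparse TF-IDF vectors (dicts) for a list of token lists."""
    n = len(token_lists)
    df = Counter(tok for toks in token_lists for tok in set(toks))
    # Smoothed IDF; the exact weighting scheme is an assumption.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: f * idf[t] for t, f in Counter(toks).items()} for toks in token_lists]

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def build_concept_graph(sentences, concepts):
    """Attach each sentence to its most similar concept, then weight edges.

    sentences: list of token lists; concepts: non-empty list of keyword tuples.
    Returns (sentence sets per vertex, weighted edges between vertices).
    """
    vecs = tfidf_vectors(sentences + [list(c) for c in concepts])
    sent_vecs, concept_vecs = vecs[:len(sentences)], vecs[len(sentences):]
    attached = {c: [] for c in concepts}
    attached["<dummy>"] = []          # virtual vertex for unmatched sentences
    for sent, sv in zip(sentences, sent_vecs):
        scores = [cosine(sv, cv) for cv in concept_vecs]
        best = max(range(len(concepts)), key=scores.__getitem__)
        key = concepts[best] if scores[best] > 0 else "<dummy>"
        attached[key].append(sent)
    # Edge weight: TF-IDF cosine between the concatenated sentence sets.
    vertices = [v for v in attached if attached[v]]
    merged = tfidf_vectors([[t for s in attached[v] for t in s] for v in vertices])
    edges = {}
    for i in range(len(vertices)):
        for j in range(i + 1, len(vertices)):
            weight = cosine(merged[i], merged[j])
            if weight > 0:
                edges[(vertices[i], vertices[j])] = weight
    return attached, edges

def merge_sentence_sets(attached_a, attached_b):
    """Align two documents by concept, keeping both sentence sets per vertex."""
    vertices = list(attached_a) + [v for v in attached_b if v not in attached_a]
    return {v: (attached_a.get(v, []), attached_b.get(v, [])) for v in vertices}

concepts = [("transformer", "noise"), ("cable", "insulation")]
doc_a = [
    ["the", "transformer", "noise", "exceeds", "limits"],
    ["cable", "insulation", "ages", "near", "the", "transformer"],
    ["budget", "and", "schedule"],
]
attached_a, edges_a = build_concept_graph(doc_a, concepts)
```

Here the third sentence shares no term with any concept, so it lands on the virtual vertex and gets no edges, while the two concept vertices are linked because their sentence sets share terms.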

Document Pair Matching
Given the merged CG $G_{AB}$ of the two documents $D_A$ and $D_B$ introduced above, a pair of documents is matched by matching the sentence sets. As shown in Fig. 2, we match the sets of sentences in $D_A$ and $D_B$ related to each concept. Then, we use multiple graph convolutional layers to aggregate the local matching results into the final result, matching a pair of documents in a "divide and conquer" manner [1]. To overcome the disadvantages of previous algorithms and capture more semantic interactions in longer texts, we represent the text from a graph perspective rather than a grid perspective.

Fig. 2. The outline of the way we construct a CG from document pairs and classify it through a GCN (similar to Ref. [1]).
Fig. 2 presents the overall architecture of the proposed MFSFM, which includes four steps: a) representing a document pair through a single merged CG, b) learning multi-view matching features for each concept vertex, c) structurally transforming the local matching features through graph convolutional layers, and d) aggregating the local matching features to obtain the final result. The four steps can be trained end to end.
Given the merged CG $G_{AB}$, MFSFM first learns a fixed-length matching vector $m_{AB}(v)$ for each concept $v \in G_{AB}$ to represent the semantic similarity between $S_A(v)$ and $S_B(v)$, the sentence sets of concept $v$ from documents $D_A$ and $D_B$, respectively. The matching of two documents is thereby converted into matching a pair of sentence sets at each vertex. Specifically, local matching vectors are generated by both neural networks and term-based techniques.

A Siamese network encoder [41] is applied to each vertex $v \in G_{AB}$ to transform the word embeddings of $\{S_A(v), S_B(v)\}$ into a fixed-size hidden feature vector $m_{AB}(v)$. The Siamese structure takes $S_A(v)$ and $S_B(v)$ as inputs and encodes them into two context vectors through context layers that share the same weights, as shown in Fig. 2(b). The context layer contains one or more BiLSTM or CNN layers, whose purpose is to capture the contextual information in $S_A(v)$ and $S_B(v)$. Let $c_A(v)$ and $c_B(v)$ denote the context vectors obtained for $S_A(v)$ and $S_B(v)$, respectively. The matching vector $m_{AB}(v)$ for $v$ is then computed by the subsequent aggregation layer [1], which concatenates the element-wise absolute difference and the element-wise multiplication of the two context vectors, i.e.,

$$m_{AB}(v) = \left(\,|c_A(v) - c_B(v)|,\; c_A(v) \circ c_B(v)\,\right),$$

where $\circ$ represents the Hadamard product [1].

Different similarity measures yield different matching vectors. Four indicators (TF-IDF cosine similarity, TF cosine similarity, BM25 cosine similarity, and Jaccard similarity of 1-grams) are commonly used to calculate the term-based similarity between $S_A(v)$ and $S_B(v)$. As shown in Fig. 2(b), we concatenate these four similarity scores into another matching vector $m'_{AB}(v)$ for $v$, which differs from Ref. [1].

Matching aggregation through GCN aggregates the local matching vectors into the final matching score of the document pair. Following Ref. [38], GCN filters are used to capture the patterns exhibited in the CG at multiple scales. In general, the input of a GCN is a graph $G = (V, E)$ with $N$ vertices $v_i \in V$ and edges $e_{ij} = (v_i, v_j) \in E$, together with a vertex feature matrix $F = \{f_i\}_{i=1}^{N}$, where $f_i$ is the feature vector of vertex $v_i$. For a document pair $D_A$ and $D_B$, the CG $G_{AB}$ with the concatenated matching vectors on each vertex is fed into the GCN, so that the feature vector of vertex $v$ in the GCN is

$$f_v = \left(m_{AB}(v),\; m'_{AB}(v)\right).$$

Next, we briefly describe the GCN layer used in Fig. 2(c) [38]. The weighted adjacency matrix of $G$ is $A \in \mathbb{R}^{N \times N}$, where $A_{ij} = w_{ij}$ is the TF-IDF similarity between vertices $v_i$ and $v_j$. Let $D$ be the diagonal degree matrix with $D_{ii} = \sum_j A_{ij}$. The input layer of the GCN is $H^{(0)} = X$, where $X$ contains the original vertex features $F$. Let $H^{(l)} \in \mathbb{R}^{N \times M_l}$ denote the hidden representation matrix in the $l$-th layer. Each GCN layer applies the following graph convolution filter to the previous hidden representation:

$$H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right),$$

where $\tilde{A} = A + I_N$, $I_N$ is the identity matrix, and $\tilde{D}$ is the diagonal matrix with $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$. $W^{(l)}$ is the trainable weight matrix of the $l$-th layer, and $\sigma(\cdot)$ is an activation function (e.g., Sigmoid or ReLU). This graph convolution rule is inspired by the first-order approximation of localized spectral filters on the graph $G$ [38]. Applied recursively, it extracts the interaction patterns between vertices [1].

Finally, we merge the hidden representations in the final GCN layer into a single fixed-length vector $m_{AB}$ by taking the mean of the hidden vectors of all vertices in the last layer. A classifier, such as a Multi-Layer Perceptron, then computes the final matching score based on $m_{AB}$. Apart from the matching vector $m_{AB}$, other global matching features are appended to $m_{AB}$ to further expand the feature set. To calculate these additional global features, we encode the two documents with a recent language model such as BERT [37]. In addition, term-based similarities can also be calculated as global features.
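The local matching and graph aggregation described in this section can be sketched numerically. The following is a minimal pure-Python sketch, assuming the context vectors are already produced by some encoder, the aggregation concatenates the absolute difference and the Hadamard product, and each layer uses the renormalized propagation rule of Ref. [38] with ReLU. The function names and the toy numbers are illustrative, and no training is shown.

```python
import math

def matching_vector(c_a, c_b):
    """Concatenate |c_A - c_B| and the Hadamard product c_A * c_B."""
    diff = [abs(a - b) for a, b in zip(c_a, c_b)]
    prod = [a * b for a, b in zip(c_a, c_b)]
    return diff + prod

def matmul(x, y):
    """Dense matrix product for lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def gcn_layer(h, adj, weight):
    """One graph convolution: ReLU(D~^-1/2 (A + I) D~^-1/2 H W)."""
    n = len(adj)
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    # Row sums are at least 1 because of the added self-loops.
    d_inv_sqrt = [1.0 / math.sqrt(sum(row)) for row in a_tilde]
    a_hat = [[d_inv_sqrt[i] * a_tilde[i][j] * d_inv_sqrt[j] for j in range(n)]
             for i in range(n)]
    z = matmul(matmul(a_hat, h), weight)
    return [[max(0.0, value) for value in row] for row in z]

def mean_pool(h):
    """Average the hidden vectors of all vertices into one vector."""
    return [sum(col) / len(h) for col in zip(*h)]

# Two concept vertices with hypothetical 2-dimensional context vectors.
features = [
    matching_vector([1.0, 2.0], [3.0, -1.0]),
    matching_vector([0.5, 0.5], [0.5, 1.5]),
]
adjacency = [[0.0, 0.4], [0.4, 0.0]]          # TF-IDF edge weight (toy value)
weight = [[0.1, 0.0], [0.0, 0.1], [0.1, 0.0], [0.0, 0.1]]
pooled = mean_pool(gcn_layer(features, adjacency, weight))
```

The pooled vector plays the role of the fixed-length graph-level matching vector that is passed, together with any global features, to the final classifier.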

Model construction
The architecture of MFSFM is shown in Fig. 3. Based on the citation dataset constructed in Sec. 3, MFSFM first designs a Contextual Multi-Feature Embedding (CMFE) structure to obtain word vectors that better express semantic information. It then uses the designed residual CNN module to obtain entity boundary information of variable length and an LSTM module to further obtain context information and clarify ordering, and finally uses a CRF module to perform entity recognition. Secondly, the entity boundaries in citations are uncertain under a given division granularity. For example, each author in the Author list entity is generally composed of 2 to 3 divisions, whereas the Title entity may be composed of 4 to 20 divisions (according to the division granularity of Chinese and English citations, authors in Chinese citations consist of 2 to 3 characters and authors in English citations consist of 2 to 3 words; the Title entity is similar). MFSFM therefore constructs a multi-convolution-kernel mixed residual CNN module to obtain local attention and entity boundary information. Thirdly, MFSFM uses an LSTM module composed of a BiLSTM and a one-way LSTM to enhance the learning of timing information. Finally, MFSFM uses the CRF module to identify the citation entities.
Citation entity recognition first needs to generate a text representation for the words or characters of the citations according to the division granularity. The main options are one-hot representation and distributed representation [42]. Because the one-hot representation does not take into account the relevance between words and may suffer from the "curse of dimensionality", we choose distributed representation. Existing distributed representation models include neuro-probabilistic language models [43], word2vec [44], BERT [37], XLNet [45], and so on.
Word2vec represents each word as a low-dimensional vector, which compresses the data scale, although it captures less contextual information. Because it is small, fast, and easy to train, this paper uses word2vec to obtain a preliminary text representation. As the citation dataset in Sec. 2 has not only character (word) features but also part-of-speech features and relative position features, this paper proposes the CMFE method. CMFE mainly includes two processes: Multi-Feature Semantic Enhancement (Fig. 3) and Multi-Level Feature Enhancement (Fig. 4). The main steps are as follows.
Multi-Feature Semantic Enhancement steps:
i) The word vector matrix $W_i$, $i = 1, 2, \ldots, k$ ($k$ represents the number of features) of each feature in the dataset is obtained using the word2vec model.
ii) Each feature of each division is looked up in the matrix $W_i$, $i = 1, 2, \ldots, k$, to obtain the corresponding feature vector. A fully connected (FC) layer without a bias vector then produces the weighted word vector. The FC layer is trainable and reflects the semantic influence of the different features.
iii) The window parameter $n$ determines how much context information is included in the final multi-feature semantic enhancement vector. The vector is obtained either by applying an FC layer (without bias vector) to the weighted word vectors of the $n$ divisions or by concatenating them. The $n$ divisions are split as follows: the front has n-(n-1)/2 divisions, the rear has (n-1)/2 divisions, and the current division (the division index is not smaller than 1 and not larger than the maximum division number).
Multi-Level Feature Enhancement steps: Multi-feature semantic enhancement only extracts shallow features and does not specifically capture deep features (relevant information) between data divisions. To better express the different semantics of a word in different data with a matrix of word vectors, deeper features are needed, and the convolution operation in a CNN can obtain the relevant information between data divisions by expanding the receptive field. Multi-level feature enhancement therefore uses a two-layer CNN (with one-dimensional convolution) to obtain two-level feature vectors, which are combined with the multi-feature semantic enhancement vector to obtain the CMFE vector.
In Fig. 5, $x$ is the output of multi-feature semantic enhancement, $T$ is the maximum division number, and the outputs $h^{(1)}$ and $h^{(2)}$ of the one-dimensional convolutions are the two-level feature vectors. The CMFE vector is given by Eq. (7), where $\oplus$ is the concatenation operation:

$$\mathrm{CMFE} = x \oplus h^{(1)} \oplus h^{(2)}. \qquad (7)$$
The CMFE method yields a word vector representation for each data division. The multiple features strengthen the semantic information and the simple CNN strengthens the hierarchical information, which makes the subsequent learning process easier.
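The multi-level feature enhancement can be sketched as two stacked one-dimensional convolutions whose outputs are concatenated with their input at every division. This is a minimal pure-Python sketch, not the authors' implementation: the zero padding, the ReLU activation, the kernel layout (one `d_in x d_out` matrix per kernel position), and the names `conv1d_same` and `cmfe_vectors` are all assumptions made for illustration.

```python
def conv1d_same(x, kernel):
    """1-D convolution along the sequence axis with zero padding.

    x: list of T vectors of size d_in.
    kernel: list of k matrices (k odd), each of shape d_in x d_out.
    Returns T vectors of size d_out after a ReLU (assumed activation).
    """
    k = len(kernel)
    d_in, d_out = len(kernel[0]), len(kernel[0][0])
    half = k // 2
    out = []
    for t in range(len(x)):
        y = [0.0] * d_out
        for offset in range(k):
            pos = t + offset - half
            if 0 <= pos < len(x):            # positions outside are zero padding
                for i in range(d_in):
                    for o in range(d_out):
                        y[o] += x[pos][i] * kernel[offset][i][o]
        out.append([max(0.0, value) for value in y])
    return out

def cmfe_vectors(x, kernel1, kernel2):
    """Concatenate x with the first- and second-level conv features."""
    h1 = conv1d_same(x, kernel1)
    h2 = conv1d_same(h1, kernel2)
    return [a + b + c for a, b, c in zip(x, h1, h2)]   # per-division concatenation

# Toy input: three divisions with 1-dimensional enhancement vectors.
x = [[1.0], [2.0], [3.0]]
sum_kernel = [[[1.0]], [[1.0]], [[1.0]]]      # sums each window of size 3
pass_kernel = [[[0.0]], [[1.0]], [[0.0]]]     # passes the centre value through
embedded = cmfe_vectors(x, sum_kernel, pass_kernel)
```

With these toy kernels the first layer sums each window of three divisions and the second layer copies its input, so every division ends up with its own value followed by two copies of the window sum. In a trained model both kernels would be learned, and the second layer would see a wider receptive field than the first.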

Results
We evaluate the proposed method on identifying whether a pair of feasibility study project reports belong to the same project (or event), and whether they have the same theme. The proposed document pair matching scheme has in fact been deployed in the project declaration application for project verification. Traditional project document review methods include manual verification and character-based duplicate checking. Although manual verification has a high accuracy rate, it requires a lot of human resources. Character-based duplicate checking only judges the repetition rate at the character level and cannot infer, at the semantic level, whether a document pair belongs to the same project or the same topic. Therefore, manual verification and character-based duplicate checking are not suitable here. It is not even possible to determine how many project clusters exist. This is different from news document pair matching: the number of project categories is uncertain, while the topic of a project document is fixed. The task of classifying whether two project declaration documents belong to the same project or the same subject is therefore crucial.
In our tasks, "project" refers to a task set up to solve a certain problem. Documents with different narratives and wordings may be submitted for the same project. Note that our goal is different from traditional event references [46] or SemEval-2018 Task 5 [47], whose task is to detect all events (or actually "actions", such as shootings or car accidents) mentioned in a document [1]. In contrast, although a project document may mention multiple entities or even domain-specific terms, the "project" in our datasets always refers to the problem that the document intends to study. Our task is to determine whether two documents intend to study the same topic.
(3) Based on the large-scale pre-trained language model BERT [37]. The basic idea is that a bidirectional Transformer is responsible for extracting features, and a fully connected linear layer is added on top of the network for fine-tuning.
In this paper, we focus on how to better match long text. Therefore, in our method and in the baselines, any short text information (such as headings) is discarded. In practice, the interaction between two projects is not limited to "whether they belong to the same project". The proposed MFSFM can identify general interactions between projects, for example, whether two transformer feasibility reports both describe transformer noise. We use the labeled training data to define and supervise the interaction, which here is either the same project or the same topic. We do not assume the availability of other information (such as titles) in these experiments. Table 2 evaluates different variants of the proposed MFSFM to show the impact of the different sub-modules. In the model names (see also Fig. 2), "Siam" means an encoder using a Siamese network, and "Sim" means an encoder using term-based similarity. "CG" means that community detection is not used in the Concept Graph (CG) and keywords are directly used as concepts, and " CG " means that each concept vertex in the CG contains a set of keywords grouped by community detection. In either case, a matching vector is produced for each vertex. "GCN" indicates that the local matching vectors on the vertices are convolved through GCN layers. Finally, " " means using the additional global features provided by BERT, and " " means using the above four term-based similarity measures as global features. These features are appended to the graph-level matching vector m for final classification [1].

Datasets
As far as we know, there is no publicly available dataset for long document matching tasks. In this paper, we construct two datasets: the Chinese Feasibility Study Same Project Data Set (CNSR) and the Chinese Feasibility Study Same Subject Data Set (CNSI). Both datasets are labeled by professional editors. They contain long-form feasibility technology documents collected from State Grid Hunan Electric Power Co., Ltd., covering various topics in various areas of the company. The CNSR dataset contains 4678 labeled pairs of feasibility study reports. The labels indicate whether a pair of project documents describe projects in the same field. Similarly, the CNSI dataset contains 2464 labeled document pairs, where the labels indicate whether the two projects belong to the same subject. The average number of words per document in the datasets is 9034, and the maximum is 32461.
In these datasets, we only labeled the main research items of the feasibility study reports. Note that the negative samples in the two datasets are not generated randomly. Instead, we choose project document pairs that include similar items (keywords), and exclude samples whose TF-IDF similarity is below a certain threshold. Table 1 shows the detailed statistics of the two datasets.

Table 1
Dataset   Pos. samples   Neg. samples   Train   Dev   Test
CNSR      2010           2668           3275    702   701
CNSI      1200           1264           1725    369   370

For both datasets, we use 70% of all samples as the training set, 15% as the validation set, and the remaining 15% as the test set. We ensure that the different splits do not overlap, which avoids data leakage. The indicators used for performance evaluation are the accuracy of the binary classification results and the F1 score. For each evaluated method, we train for 10 epochs and then select the epoch with the best validation results for evaluation on the test set.
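The split and the two evaluation indicators can be sketched as follows. This is a minimal sketch, assuming a simple shuffled split with rounded partition sizes and labels encoded as 0/1. The helper names `split_dataset` and `accuracy_and_f1`, the seed, and the rounding rule are illustrative assumptions and are not claimed to reproduce the exact partition sizes in Table 1.

```python
import random

def split_dataset(pairs, train=0.70, dev=0.15, seed=0):
    """Shuffle once, then cut into disjoint train/dev/test partitions."""
    items = list(pairs)
    random.Random(seed).shuffle(items)
    n_train = round(train * len(items))
    n_dev = round(dev * len(items))
    return (items[:n_train],
            items[n_train:n_train + n_dev],
            items[n_train + n_dev:])

def accuracy_and_f1(y_true, y_pred):
    """Accuracy and F1 of the positive class for binary labels (0 or 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    accuracy = correct / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, f1

# Pair identifiers stand in for labeled document pairs.
train_set, dev_set, test_set = split_dataset(range(4678))
acc, f1 = accuracy_and_f1([1, 1, 0, 0], [1, 0, 0, 1])
```

Because the three partitions are slices of one shuffled list, no pair can appear in more than one of them, which is the leakage guarantee the text asks for.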

Experimental setting
We use Stanford CoreNLP for word segmentation (Chinese text) and named entity recognition. For concept graph construction with community detection, we set the minimum community size (the number of keywords contained in a concept vertex) to 2 and the maximum size to 6. Our neural network model consists of word embedding layers, a Siamese encoder, graph convolution layers, and a classification layer. In the word embedding layers, the pre-trained word vectors are loaded and kept fixed during training. The embedding of out-of-vocabulary words is set to a zero vector. In the Siamese encoder, we employ 1-dimensional convolution with 64 filters, followed by ReLU and a max pooling operation. For graph convolution, we use a 3-layer GCN [38] in the experiments on both the CNSR and CNSI datasets. The output size of the GCN layer is set to 32 when the vertex encoder produces the 4-dimensional term-based feature, and to 128 when the vertex encoder is a Siamese encoder. Note that, except for the last layer, we always set the output size of the GCN layers to 32. The final classification module consists of a linear layer with an output size of 32 followed by a ReLU layer. It is worth noting that this classifier is also used for the SimNet baseline. We use TensorFlow 2.0 to implement the proposed MFSFM. The experiments without BERT were performed on a MacBook Pro equipped with a 2 GHz Intel Core i7 processor and 8 GB of memory. L2 weight decay is applied to all trainable variables, with parameter λ = 2e-16. The dropout rate between every two layers is 0.005. Gradient clipping with a maximum gradient norm of 5.0 is used, and the ADAM optimizer [51] is applied with β1 = 0.85, β2 = 0.99, and ϵ = 1e-8.
We use a learning rate warm-up scheme with an inverse exponential increase from 0.0 to 0.001 over the first 1500 steps, after which the learning rate is kept constant for the rest of training. The maximum number of training epochs is set to 20 in all experiments.
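The warm-up schedule can be sketched as follows. The paper gives only the end points (0.0 and 0.001) and the number of warm-up steps (1500). The logarithmic shape used here is one common way to realize an inverse exponential increase and is an assumption, as is the function name `warmup_learning_rate`.

```python
import math


def warmup_learning_rate(step, peak_lr=0.001, warmup_steps=1500):
    """Learning rate at a 0-based training step.

    Assumed shape: peak_lr * log(step + 1) / log(warmup_steps), which rises
    quickly at first and flattens out, then stays constant at peak_lr.
    """
    if step >= warmup_steps - 1:
        return peak_lr  # constant learning rate after warm-up
    return peak_lr * math.log(step + 1) / math.log(warmup_steps)
```

With this shape the rate starts at exactly 0.0 on the first step and reaches the peak value at the end of the warm-up phase.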

Analysis
To verify the effectiveness of the Contextual Multi-Feature Embedding (CMFE) proposed in this paper, the BiLSTM-CRF model proposed in [16] was used as the citation entity recognition model, and the CMFE was compared with CBOW and Skip-gram. The parameters of the CBOW and Skip-gram algorithms were set as follows: context window = 5, number of negative samples = 10, word2vec size = 128. The CMFE was built on the CBOW and Skip-gram algorithms (using the same parameters as above) with word2vec size = 128, of which 64 dimensions were used for multi-feature semantic enhancement (the parameter used for multi-feature semantic enhancement was 3) and 64 dimensions for multi-level feature enhancement (the size of the convolution kernel used was 3). It can be observed in the table that the model using the CMFE achieves significantly better entity recognition results than the other methods. CMFE (Skip-gram) obtained the best recognition effect: with it, the BiLSTM-CRF model obtains an average F1-measure of 88.80% on the Chinese citation dataset and an average F1-measure of 88.84% on the Chinese-English mixed citation dataset. Compared with the original CBOW and Skip-gram methods, the average F1-measure is increased by more than 10%.
The performance of all compared methods on the two datasets is summarized in Table 2. Note that the idea of vectorizing documents through concept graphs comes from Ref. [1]. Building on this, this paper proposes the MFSFM model, which takes the result of document vectorization through the concept graphs as input. The proposed MFSFM achieves the best performance on both datasets and is significantly better than all other methods, for two reasons. First, because the input document pair is reorganized into a CG, the two documents are aligned along the corresponding semantic units, which facilitates conceptual comparison. Second, the proposed MFSFM encodes the local comparisons around different semantic units into local matching vectors and aggregates them through graph convolution, taking the semantic topology into account. It therefore solves the document matching problem by divide and conquer and is suitable for processing long texts.
The effect of graphical decomposition: We compare method No.11 in Table 3 with method No.6. Both use the same word vectors and encode text with neural networks (NN). The key difference is that No.11 compares the document pair vertex by vertex on the decomposed CG. It can be observed that algorithm No.11 outperforms algorithm No.6. Similarly, algorithms No.14 and No.9 apply the same term-based similarity, yet our algorithm No.14 greatly outperforms No.9 by using graphical decomposition. Thus, it can be concluded that graph decomposition significantly improves matching performance for long text. It is worth noting that algorithm No.6 performs poorly. This is because it is a deep text matching algorithm; such algorithms are designed mainly for sequence matching and can hardly capture meaningful semantic information in project document pairs.
When the context is too long, it is difficult to obtain a suitable context representation for matching document pairs. For NN models that focus on interaction, most interactions among words are meaningless for two long documents.
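The vertex-by-vertex comparison can be illustrated with a small sketch that computes one local matching vector per concept vertex from term-based similarities. The two features used here (Jaccard overlap and term-frequency cosine), the input layout and all function names are assumptions for illustration. They are not the 4-dimensional feature set of the paper, which is not listed in this section.

```python
import math
from collections import Counter


def jaccard(tokens_a, tokens_b):
    """Jaccard overlap of the two token sets."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def tf_cosine(tokens_a, tokens_b):
    """Cosine similarity of raw term-frequency vectors."""
    tf_a, tf_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(count * tf_b[term] for term, count in tf_a.items())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def local_matching_vectors(concept_graph):
    """Map each concept vertex to a local matching vector.

    concept_graph (hypothetical layout): {vertex: (tokens_a, tokens_b)},
    where tokens_a / tokens_b are the tokens of the sentences from
    document A / document B attached to that vertex.
    """
    return {
        vertex: [jaccard(a, b), tf_cosine(a, b)]
        for vertex, (a, b) in concept_graph.items()
    }
```

Each vertex is compared independently, so the cost of matching grows with the number of concept vertices rather than with the number of word pairs across two long documents.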
The effect of graph convolution: We compare algorithm No.12 with algorithm No.11, and algorithm No.15 with algorithm No.14. It can be observed that adding GCN layers significantly improves performance on both datasets. Each GCN layer updates the hidden vector of every vertex by integrating the vectors of its neighboring vertices. In this way, the GCN layers learn how to aggregate the local matching features graphically into the final result. We also compare algorithm No.13 with algorithm No.12, and algorithm No.16 with algorithm No.15. It can be seen that community detection leads to slightly worse performance, because concept vertices that directly use keywords offer more anchor points for comparing document pairs. As mentioned earlier, with community detection a concept is formed by a group of keywords instead of a single keyword. However, community detection groups highly coherent keywords together, and the average size of a CG is reduced from 35 vertices to 16. The total training time of the proposed MFSFM is reduced by 53.6%, and the same holds for test time. Therefore, one can choose whether to use community detection, trading some accuracy for speed.
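How one GCN layer integrates neighboring vertices can be sketched in plain Python. The sketch assumes the widely used propagation rule with self-loops and symmetric normalization, H' = ReLU(D^(-1/2) (A + I) D^(-1/2) H W), and non-negative edge weights. Whether the model in this paper uses exactly this normalization is not stated in this section, and the function names are illustrative.

```python
import math


def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)]
            for row in x]


def gcn_layer(adjacency, hidden, weight):
    """One graph convolution step over a (weighted) concept graph.

    adjacency: n x n non-negative edge weights without self-loops
    hidden:    n x d_in vertex vectors (e.g. local matching vectors)
    weight:    d_in x d_out trainable projection (fixed in this sketch)
    """
    n = len(adjacency)
    # Add self-loops so that each vertex keeps its own information.
    a_hat = [[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    degree = [sum(row) for row in a_hat]
    # Symmetric normalization by the degrees of both end points.
    normalized = [[a_hat[i][j] / math.sqrt(degree[i] * degree[j])
                   for j in range(n)] for i in range(n)]
    aggregated = matmul(normalized, hidden)   # mix neighboring vertices
    projected = matmul(aggregated, weight)    # linear projection
    return [[max(0.0, value) for value in row] for row in projected]  # ReLU
```

Stacking several such layers lets information travel over multi-hop neighborhoods of the concept graph.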
Time complexity: In a real-world science and technology project declaration system, the keywords of technical project documents are usually extracted with efficient tools and predefined vocabulary rules. Below we explain the time complexity of the proposed MFSFM by following the process of constructing the CG. For two documents, let $N_s$ denote the number of sentences, $N_w$ the number of unique words, and $N_k$ the number of unique keywords. Community detection requires $O(N_k^3)$, and constructing the keyword map requires $O(N_s N_k + N_w^2)$. Attaching sentences and calculating weights requires $O(N_s N_k + N_k^2)$. For the final step, result classification, the complexity can be ignored because the proposed MFSFM is small and can process document pairs efficiently.
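The keyword-map construction step can be illustrated with a small sketch that scans the sentences once and links keywords appearing together. Linking keywords by sentence-level co-occurrence, and the function name `build_keyword_graph`, are assumptions made for illustration. This section does not spell out the exact linking rule.

```python
from itertools import combinations


def build_keyword_graph(sentences, keywords):
    """Return {(keyword_u, keyword_v): co-occurrence count}.

    sentences: list of token lists from both documents
    keywords:  set of extracted keywords
    Assumption: two keywords are linked when they occur in the same sentence.
    """
    edges = {}
    for tokens in sentences:
        present = sorted(set(tokens) & keywords)  # keywords in this sentence
        for u, v in combinations(present, 2):
            edges[(u, v)] = edges.get((u, v), 0) + 1
    return edges
```

For example, with the sentences `[["graph", "matching", "model"], ["graph", "model"]]` and the keywords `{"graph", "model"}`, the two keywords are linked by one edge with count 2.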
The effect of multi-view matching: We compare algorithm No.17 with algorithm No.15. It can be observed that concatenating the matching vectors from different views (such as Siamese-encoded features and term-based features) further outperforms the other algorithms, which proves the benefit of concatenating multi-view matching vectors. We also compare algorithms No.18, No.19 and No.20 with algorithm No.17. It can be concluded that adding more global features, i.e. document pair similarities and/or encodings, does not help: these variants always underperform. This indicates that the main factors in improving performance are graph decomposition and graph convolution, which is similar to the finding in Ref. [1]. The reason is that the model has already learned to aggregate local comparisons into global semantic relations, so additionally designed global features bring no further benefit.
Model size and parameter sensitivity: In our experiments, the largest model without BERT is No.18, which has only about 54K parameters. In contrast, BERT has 130M-340M parameters. Nevertheless, the proposed MFSFM performs considerably better than BERT. In addition, we test the sensitivity of the model to different parameters. We find that 2 to 3 GCN layers perform best: adding more GCN layers does not improve on this, while using zero or only one GCN layer gives worse performance. Hidden vector sizes between 32 and 256 perform well in the GCN, and larger sizes do not bring a significant improvement. When constructing the CG, we need to select the community size for the optional community detection step. Experiments show that the performance becomes worse when the maximum size is set to 8 to 10 and the minimum size to 2 to 3. This shows that the proposed MFSFM is stable and insensitive to these parameters. All in all, the concept-map-based MFSFM proposed in this paper is better than the other algorithms.
Table 3. Comparison of accuracy and F1-score under different methods based on CNSR and CNSI datasets.

Conclusion
This article studies document pair matching. First, we propose a concept map, which represents a document as a weighted graph of concepts. Second, we propose a divide-and-conquer framework based on the constructed CG and a graph convolutional network to match a pair of documents. Finally, we propose a multi-feature semantic fusion model called MFSFM. Compared with sequential modeling based on RNN, MFSFM decomposes the matching process into local matching sub-problems on the graph. In addition, with the help of professional editors, we created two new datasets for matching long documents, containing about 7100 samples, on which we conducted extensive evaluations. Experimental results show that the proposed MFSFM is significantly better than a wide range of recent solutions, including term-based and deep-learning-based text matching algorithms.
However, the expressive power of document matching lies in the understanding of the document. Although this paper represents a document in the form of a concept map, the construction of the concept map is based on document keywords, so its accuracy depends on the semantic understanding of the document. In future work, we will focus on new model frameworks that improve the semantic understanding of text.