Seq2EG: a novel and effective event graph parsing approach for event extraction

Event extraction is a fundamental task in information extraction. Most previous approaches transform event extraction into two subtasks, trigger classification and argument classification, and solve them via classification-based methods, which suffer from some inherent drawbacks. To overcome these issues, in this paper we propose a novel event extraction model, Seq2EG, by first formulating event extraction as an event graph parsing problem, and then exploiting a pre-trained sequence-to-sequence (seq2seq) model to transduce an input sentence into an accurate event graph without the need for trigger words. Based on the generative event graph parsing formulation, our model Seq2EG can explicitly model multiple event correlations and argument sharing, and can naturally incorporate graph-structured features and the rich semantic information conveyed by the labels of event types and argument roles. Extensive experimental results on the public ACE2005 dataset show that our approach outperforms all previous state-of-the-art models for event extraction by a large margin, obtaining improvements of 3.4% F1 score for event detection and 4.7% F1 score for argument classification over the best baselines.


Introduction
Event extraction (EE) is an essential and challenging information extraction (IE) task for natural language understanding. Event extraction has been shown to benefit a wide range of downstream tasks, such as document summarization and question answering [1,2]. Technically speaking, as defined by the ACE 2005 dataset, a benchmark for event extraction [3], the event extraction task can be divided into two subtasks, i.e., event detection (identifying instances of specified types of events) and argument extraction (identifying the arguments of each event and labeling their roles). An example event extraction instance is shown in Fig. 1.
EE is an actively studied task in IE, where deep learning models have been the dominant approach for delivering state-of-the-art performance. Nevertheless, most previous work treats EE as a classification problem. Specifically, most existing approaches transform the event extraction task into two subtasks, trigger classification and argument classification, and then perform the two subtasks in a joint or pipelined fashion [4][5][6][7][8]. Some recent works focus on using syntactic dependency structure or external knowledge to boost classification performance [9][10][11][12]. More recently, Li et al. [13] propose to first perform trigger classification and then reformulate argument extraction as a machine reading comprehension (MRC) task, so as to utilize sophisticated MRC methods and large annotated external MRC data.
Methodologically speaking, the existing event extraction approaches suffer from the following inherent drawbacks. Firstly, most previous approaches depend heavily on trigger words. On the one hand, triggers are nonessential to event detection and event extraction; on the other hand, the identification and classification of trigger words may, to some extent, impede the accurate recognition of events, since some events may be expressed by multiple discontinuous words or phrases in one sentence (see more illustrations in Sect. 3.2). In particular, trigger-based models are prone to suffer from the long tail issue [14]. To the best of our knowledge, Liu et al. [15] present the only prior work on event detection without trigger words, by simply casting event detection as a multi-label classification problem over input sentences; this, however, cannot address the other inherent issues of trigger-based approaches, as illustrated below.
Secondly, current EE models lack good solutions for explicitly modeling the correlations between multiple events in one sentence and between multiple arguments of different roles, as well as the event argument sharing issue. Although some existing works have investigated the multiple-events phenomenon [7,10], these approaches aggregate contextual information from surrounding trigger candidates to generate a powerful representation vector for the current trigger candidate, employing a self-attention mechanism or a hierarchical tagging scheme, and then predict the trigger label [7,10]. Note, however, that modeling the associations between triggers is not equivalent to modeling the correlations between events. That is to say, the existing models cannot explicitly model the correlations between multiple events and multiple arguments.
Lastly, the existing approaches cannot leverage the semantic information of the labels of event types and argument roles. As a matter of fact, both are informative and conducive to event extraction, yet such rich semantic information is neglected by the existing approaches.
To address the issues stated above simultaneously, we take a fresh look at event extraction and formulate it as a graph parsing problem. By regarding the multiple events expressed by one sentence as a whole, we argue that the goal of the EE task is to output an event graph, as shown in Fig. 2. On the one hand, the event graph is constructed to model the potential interactions between the multiple events; on the other hand, this graph parsing formulation can flexibly integrate graph-structured features. Furthermore, we employ a pre-trained sequence-to-sequence model to generate the event graph, without the need for identifying trigger words. The experimental results demonstrate that our method substantially outperforms all previous state-of-the-art models on the public ACE2005 dataset.
To sum up, this paper makes the following contributions:
• We innovatively formulate the event extraction task as graph parsing, which delivers some typical benefits compared to the existing EE models. First, this graph parsing formulation can naturally model the correlations between multiple events in one sentence as well as argument sharing; second, the event graph can be flexibly constructed to utilize more useful information, such as the semantic representations of the event type labels and argument role labels.
• We propose a transformer-based encoder-decoder model to derive the events from the global contextual information in the input sentences without relying on trigger words. Furthermore, we propose some skillful strategies for event graph linearization and an effective decoding algorithm to boost the generation performance.
• Extensive experiments on the public ACE2005 dataset demonstrate that the proposed simple approach outperforms the previous state-of-the-art models for event extraction by a large margin. In particular, our model does not use any syntactic dependency information or external knowledge.
This paper is a significant extension of our conference paper [16]. However, the work in this paper differs substantially from the previous work in the following three aspects: (1) firstly, this paper focuses on the event extraction task, which is more complicated than the event detection problem in our previous paper [16]. Specifically, the event graph constructed for the event extraction task is a general graph structure, not a simple tree structure containing only the event type nodes as in the event detection problem; therefore, adapting the graph parsing approach to the event extraction problem is not trivial; (2) secondly, in order to deal with the argument sharing issue in the event graph for event argument identification, we innovatively propose the use of special pointer symbols to represent argument nodes in the linearized sequence, and this special-symbol approach is used in combination with graph traversal techniques; (3) due to the richness of the event graph structure in the event extraction task, we further present some skillful strategies for event graph linearization, and in particular, we propose a novel and effective decoding algorithm, a nested constrained beam search algorithm, to greatly improve the generation performance of the event graph.
The rest of this article is organized as follows. Section 2 discusses the related work. Section 3 describes the novel view of event extraction. Section 4 presents the detailed event parsing method via a seq2graph transducer. Sections 5 and 6 describe the experiment settings and report the experimental results and model analysis. In Sect. 7, we summarize the proposed approach and describe future work.

Related work

Event extraction
In this paper, we focus on the event extraction task, which includes two basic subtasks: event detection and argument extraction. Most recent works have applied neural networks to this task and have achieved significant progress. We roughly divide the recent approaches into three categories as follows:
• Sequence-based models: This line of research operates on word sequences using deep neural networks. Chen et al. [4] devise a dynamic multi-pooling convolutional neural network to capture more information. Nguyen et al. [5] present a joint model based on a bidirectional RNN for event extraction. Sha et al. [6] add weighted dependency arcs to a BiLSTM to make use of tree structure and sequence structure simultaneously.
• GCN-based models: This line of research applies graph convolutional networks (GCNs) over the dependency tree of a sentence to boost performance. Nguyen et al. [9] make the first attempt to use GCNs in ED. Liu et al. [10] employ a syntactic GCN and a self-attention mechanism to model multiple-event extraction. Yan et al. [11] improve the GCN by combining multi-order word representations from different GCN layers.
• Machine reading comprehension (MRC)-based models: Span-based MRC tasks involve extracting a span from a paragraph [17] or multiple paragraphs [18]. Du et al. [19] introduce a new paradigm for event extraction by formulating it as a question answering (QA) task. Liu et al. [20] and Li et al. [13] propose to first perform trigger classification and then reformulate argument extraction as an MRC task, so as to utilize sophisticated MRC methods and large annotated external MRC data.
Unlike the existing EE models based on trigger classification and argument classification, we formulate EE as a novel graph parsing problem; our model can therefore explicitly model multiple event correlations and incorporate graph-structured features as well as the rich information conveyed by the event types and arguments.
Recently, Lu et al. [21] propose a sequence-to-structure model, Text2Event, for EE, which directly extracts events from text in an end-to-end manner. This work shares a similar spirit with ours and was developed roughly concurrently. However, unlike the tree structure framework proposed by Lu et al. [21], we use a more powerful event graph structure to model the correlations between events, which provides a natural formulation for expressing argument sharing relations between different event types in a sentence.
Additionally, slightly different from the aim of this paper, another recent line of research explores joint entity recognition and event extraction [19,22,23].

Graph parsing
The goal of graph parsing is to transform a natural language word sequence into a graph structure such as a dependency tree or an AMR graph. Typically, Li et al. [24] present a sequence-to-sequence dependency parser that directly predicts the relative position of the head for each given word. Recently, Zhang et al. [25] and Blloshmi et al. [26] perform sequence-to-graph transduction on the AMR parsing task and achieve excellent performance.
On the other hand, some NLP tasks have been skillfully cast as graph parsing problems in order to take advantage of the merits of the graph structure. Barnes et al. [27] model sentiment analysis as a dependency graph parsing problem, where the sentiment expression is the root node and arcs between the other elements model the relationships between them. Qiu et al. [28] transform a social network into an and-or graph to enforce the consistency of relations among a group and to leverage attributes as inference cues. Similarly, Paolini et al. [29] propose a new framework, translation between augmented natural languages (TANL), which encodes structured information (such as relevant entities) in the input and decodes the output text into structured information. Clearly, the work in this paper falls into the second category.

Pre-trained Seq2seq models
Pre-training a universal model and then fine-tuning it on a downstream task has recently become a popular strategy in natural language processing [30]. Recent studies also propose approaches to pre-training seq2seq models, such as MASS [31], PoDA [32], PEGASUS [33], BART [34], and T5 [35].
In this paper, our experiments only examine BART. We leave exploration of the other pre-trained seq2seq models for future work.

Task description
Given a text document, an event extraction system should predict the specified types of events mentioned in the input text and their arguments in each sentence. The most commonly used benchmark dataset in previous work is the ACE 2005 corpus. The task defines eight event types, such as Life and Business, and 33 subtypes, such as Attack and End-Position. Table 1 summarizes the relevant terminology.
Following some previous work [8,10,20], we also assume that gold-standard entity mentions are provided as argument candidates to the event extraction systems.

Formulating EE as event graph parsing
Traditionally, an event extraction system first recognizes a single word or phrase as the trigger in order to predict the event types of interest, and then identifies the event arguments for each derived event type [7,10,15]. However, as pointed out by Liu et al. [15], triggers are nonessential to event detection and event extraction. To a certain extent, the dependence on triggers may impede the accurate recognition of the events in a sentence.
In particular, some events may be triggered by multiple discontinuous words or phrases in one sentence, rather than by a single word or phrase. Take a concrete example from the ACE 2005 dataset: She lost her seat in the 1997 election. In this sentence, an event of type Personnel:Elect is mentioned, and its gold trigger is labeled as the word lost. In effect, to correctly recognize the event type Personnel:Elect in this sentence, we should comprehensively consider both the phrase lost her seat and the word election (see more cases in Sect. 6.5). Therefore, it does not seem plausible to reduce the problem of predicting an event from a whole sentence to the representation learning of a single trigger word for trigger classification or sequence labeling. Additionally, trigger-based models are prone to suffer from the long tail issue, which makes supervised methods prone to overfitting and poor performance on unseen or sparsely labeled triggers [14].
In this paper, we look at the EE task from a new perspective. Given an input text, EE aims to recognize and predict the mentioned event types and their corresponding arguments. Intuitively, the multiple events derived from the same sentence should exhibit a certain degree of correlation. Therefore, to model these correlations, we view the multiple events expressed by the same sentence as a whole, linking them together into a single graph, as shown in Fig. 2.
Specifically, we first introduce a special node as the root, and then attach each event type node as a child of the root; further, the multiple arguments of each specific event type are linked as its children, with the edges labeled by the argument roles. It is worth noting that the root of this event graph is not a virtual node. The root can take two possible values: EVTS and NA. When the input sentence does not contain any event, the root is assigned the value NA; otherwise, it is assigned the value EVTS. Therefore, the prediction of the root value judges whether the input sentence expresses any events. In addition to facilitating the modeling of multiple event correlations, our graph parsing formulation for EE also allows for the straightforward inclusion of other types of graph-structured features:
• First of all, our event graph can be flexibly constructed to exploit more useful information. For example, we know that each argument candidate is an entity mention with a specific entity type, such as person, location, or vehicle. This argument type is a critical feature in predicting the role of an argument candidate. It is common practice to employ an auxiliary feature embedding to encode the argument type for each argument candidate [7,10]. In our event graph parsing formulation, in order to make full use of the argument type feature, we introduce a kind of argument type node in the event graph to represent the entity type of the argument node to be generated next, as shown in Fig. 2.
• Another important benefit of our event extraction paradigm is that it provides a natural formulation for expressing argument sharing relations between different event types in an event graph, which is the exact reason why the constructed event graph is a graph instead of a tree. For instance, the entity mention Baghdad is an argument of the event type Life:Die, and it is also an argument of another event type, Conflict:Attack, as shown in Fig. 2.
• Additionally, our approach can effectively utilize the semantic representations of event type labels and argument role labels. Most previous classification-based approaches to EE view each event type or argument role as a specific class, omitting the semantic information conveyed by these type labels. In fact, a type label itself, such as Divorce or Injure, is informative for the learning of EE models. In our graph parsing formulation, it is straightforward to incorporate the semantic representation of type labels into the model. Specifically, during decoding, we can encode every previously generated node or edge with the corresponding type label embedding to assist the prediction of later nodes.

Event parsing via a seq2graph transducer

The advantages of applying a seq2seq model to event graph parsing are twofold. First, there is no need to use trigger words for event detection. Second, when predicting the next node during decoding, the global contextual information in the input sentence can be taken into consideration via the cross-attention mechanism between the decoder and encoder. The overall framework of the proposed model Seq2EG is shown in Fig. 3.
In this section, we first introduce our strategies for event graph linearization; next, we describe the neural network model adopted for the seq2graph transduction and the decoding algorithm; lastly, we illustrate a simple postprocessing procedure.

The linearization strategies of event graph
To apply a seq2seq paradigm to event graph parsing, we first need to convert the event graph into a sequence of tokens using linearization techniques. We do not particularly consider the order of events in an event graph, even in the special case where two event nodes have the same type. Specifically, we employ depth-first traversal (DFS), as it is closely related to the way natural language syntactic trees are linearized. When building the event graph for each sentence during the training phase, we simply append the event nodes to the graph in the order of their appearance in the ACE annotation. Additionally, when applying DFS to linearize an event graph, we traverse the graph in natural left-to-right order. However, unlike conventional graph traversal procedures, we propose some effective strategies for event graph linearization to boost the generation performance.
Firstly, to tackle the argument sharing problem in the event graph, we innovatively propose the use of special pointer symbols <P0>, <P1>, ..., <Pi> to represent argument nodes in the linearized sequence and to handle shared arguments. Whenever such a special symbol occurs more than once, it indicates that a specific argument node serves multiple roles for multiple different events in the event graph. Our special-symbol approach is used in combination with the graph traversal technique, i.e., DFS.
Secondly, we introduce a flexible linearization ordering strategy for traversing the event graph. Generally speaking, the linearized sequence of a graph consists of the values of nodes and the labels of edges in the visiting order of the traversal procedure; that is, for a given event type node, the edge label (argument role) always comes before its child (argument type node). However, it is intuitively plausible that the argument role should be predicted after both the event type node and its child (argument type node) have been generated. Therefore, we adjust the linearization order by postponing the output of the argument role until after the argument type node and argument node in the linearized sequence of an event graph. For instance, for the example event graph shown in Fig. 2, the linearized representation generated by the standard DFS procedure is "<EVTS>, Life:Die, Place, Location, <P0>, Baghdad, Victim, Person, <P1>, Cameraman, Instrument, Vehicle, <P2>, tank, <stop>, Conflict:Attack, Place, Location, <P0>, Target, Person, <P1>, Instrument, Vehicle, <P2>, Target, Facility, <P3>, hotel, <stop>". By contrast, the sequential representation generated by applying our linearization ordering strategy is also illustrated in Fig. 2. The importance of this linearization ordering strategy is verified by the results of the ablation experiments (see Sect. 6.3).
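To make the two strategies concrete, the following minimal Python sketch linearizes a dict-based event graph with DFS, shared-argument pointers, and the postponed-role ordering. The structure of `graph` and all names here are illustrative assumptions, not the authors' code.

```python
# A minimal sketch of the linearization described above, assuming a
# simple dict-based event graph; the representation is illustrative.

def linearize(graph):
    """DFS linearization with shared-argument pointer symbols and the
    role-postponing ordering strategy."""
    if not graph["events"]:
        return ["NA"]
    tokens = ["<EVTS>"]
    pointer_of = {}  # argument mention -> its pointer symbol, e.g. "<P0>"
    for event in graph["events"]:        # events in annotation order
        tokens.append(event["type"])     # event type node, e.g. "Life:Die"
        for arg in event["args"]:
            mention = arg["mention"]     # entity head, e.g. "Baghdad"
            first = mention not in pointer_of
            if first:                    # reuse the pointer for shared args
                pointer_of[mention] = f"<P{len(pointer_of)}>"
            # Ordering strategy: argument type node and argument node come
            # first; the argument role is postponed to the end.
            tokens += [arg["entity_type"], pointer_of[mention]]
            if first:
                tokens.append(mention)   # the mention is emitted only once
            tokens.append(arg["role"])
        tokens.append("<stop>")          # marks the end of each event
    return tokens
```

Applied to a graph like the one in Fig. 2, this yields a sequence of the form "<EVTS>, Life:Die, Location, <P0>, Baghdad, Place, ...", where <P0> reappears (without its mention) under Conflict:Attack to mark the shared argument.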
Lastly, in preliminary experiments we found that the labels of some edges (e.g., EVT-1, entity, etc.) are not informative for event extraction; we therefore omit these labels when linearizing the event graph.

The transformer-based generation network
Let x = <x1, ..., xn> be an input sentence, where each xi is a token. Also, let E = <e1, ..., ek> be the entity mentions in this sentence (k is the number of entity mentions and can be zero). Each entity mention comes with its head and entity type. Our approach sequentially decodes a list of tokens y = <y1, ..., ym>, where each yi may be an event type, an event argument (i.e., an entity mention ej), an argument role, an entity type, or a special symbol. When generating the argument nodes for a specific event type node, our model predicts the head of each argument as the argument output. Let Y be the output space. The transduction problem is to seek the most likely sequence of nodes given x:

y* = argmax_{y ∈ Y} P(y | x) = argmax_{y ∈ Y} ∏_{i=1}^{m} P(y_i | y_{<i}, x),

where y_{<i} denotes the previously generated tokens. To tackle this transduction problem, we adopt a transformer-based encoder-decoder architecture to generate the event graph [37]. At the encoding stage, we convert the input text into hidden vector representations by employing a multi-layer transformer encoder with the multi-head attention mechanism. It is worth noting that our encoder only encodes the tokens in the input sentence, without using any additional information such as POS tags or syntactic dependency structures. The decoder predicts the output sequence following a scheme similar to the encoder, but includes an encoder-decoder attention sublayer to handle input-output alignment. The generated sequence starts with the special token "BOS" and ends with the special token "EOS". To alleviate data sparsity, we adopt the pre-trained language model BART as our transformer-based encoder-decoder architecture [34], so that we can exploit the model's latent knowledge (e.g., of semantics and linguistic relations) captured through pre-training. The BART architecture can be viewed as a natural progression of the "vanilla transformer" of Vaswani et al. [37], but with pre-training inspired by BERT's masked language model objective.
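As a rough illustration of this setup, the following sketch wires the linearized sequences into a pre-trained BART model via the Hugging Face transformers library. The choice of special tokens follows the linearization above; everything here is an assumption made for illustration, not the authors' released code.

```python
# A hedged sketch of the seq2seq setup with pre-trained BART; the special
# tokens and the example target sequence are assumptions for illustration.
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

# Register the graph symbols so the decoder treats them as atomic tokens.
special = ["<EVTS>", "<stop>"] + [f"<P{i}>" for i in range(10)]
tokenizer.add_special_tokens({"additional_special_tokens": special})
model.resize_token_embeddings(len(tokenizer))

# One training step: the target is the linearized event graph sequence.
src = tokenizer("In Baghdad, a cameraman died when an American tank "
                "fired on the Palestine Hotel.", return_tensors="pt")
tgt = tokenizer("<EVTS> Life:Die Location <P0> Baghdad Place <stop>",
                return_tensors="pt")
loss = model(**src, labels=tgt["input_ids"]).loss  # cross-entropy loss
loss.backward()
```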

A nested constrained beam search decoding algorithm
For decoding in the testing phase, it is natural for our Seq2EG model to use a beam-search-based decoding algorithm that generates the token sequence of an output event graph incrementally. However, in designing the beam search algorithm, we face two practical problems: (1) how to guarantee the generation of a valid event graph; (2) how to achieve a fair and reasonable comparison when picking the top-k best partial graphs among all candidate items at each step of beam search.
The first problem is relatively easy to tackle by incorporating event schema knowledge into the search process, yielding a constrained beam search algorithm. To be specific, at each generation step during the search, we can limit the candidate vocabulary for the current item by referring to the event schema. For example, if the current item to be predicted should be an event type name, we simply set the candidate vocabulary to the set of type names defined by the event schema. Relatively speaking, the second problem is more challenging. Unlike target sentence generation in traditional seq2seq models for machine translation, where all elements in the target sequence are words, the elements in a linearized event graph sequence include many distinct types, such as event types, argument types, argument roles, entity mentions, and special symbols. Thus, at each timestep of beam search, the candidates in the beam may be partial linearized event graph sequences ending with different types of elements, which cannot be directly compared with each other. The standard beam search algorithm therefore does not work well in this scenario. To address this issue, we propose an effective nested beam search strategy for decoding. On the whole, the linearized sequence of an event graph consists of multiple events from a coarse-grained view, and at the fine-grained level, each event may contain a different number of arguments with different roles. Thus, to obtain a fair comparison, we introduce two types of beam search in a single decoding process: inter-event beam search and intra-event beam search. In a nutshell, the inter-event beam search is used to compete over complete event candidates, while when extending an event to identify its type and its arguments, we switch to an intra-event beam search to find the top-k event structures. To facilitate the nested beam search process, we use a special symbol <stop> to indicate the end of each event. Additionally, in the inner beam search, an event with more arguments would receive a lower score; we therefore normalize the score by the number of arguments.
Based on the two considerations mentioned above, we design a nested constrained decoding algorithm to generate a valid and accurate event graph for the given input text. Algorithm 1 shows the pseudocode for the complete procedure of the decoder. For brevity, we introduce some functional symbols in Algorithm 1. The function Normalize(y, score) normalizes the score by the number of arguments in the event structure y. The function CalConstrainedSet(last_token) returns the set of valid candidate tokens for the prediction of the next token, based on the preceding token represented by the parameter last_token. For example, if the parameter last_token represents an argument type, this function returns the set of all argument role names of the event type currently being predicted, by referring to the event schema.
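A simplified sketch of the two ingredients, schema-constrained candidate sets and argument-count normalization, is given below. The toy schema and the state transitions are assumptions modeled loosely on Algorithm 1, not a transcription of it; the full decoder additionally manages the nested inter-event and intra-event beams.

```python
# Toy event schema: event type -> its legal argument roles (assumed).
SCHEMA = {
    "Life:Die": ["Place", "Victim", "Instrument"],
    "Conflict:Attack": ["Place", "Target", "Instrument"],
}
ENTITY_TYPES = ["Person", "Location", "Vehicle", "Facility"]
POINTERS = [f"<P{i}>" for i in range(10)]

def cal_constrained_set(last_token, event_type, candidate_heads):
    """Valid candidate tokens for the next position, given the last one."""
    if last_token in ("<EVTS>", "<stop>"):
        return list(SCHEMA) + ["<EOS>"]       # a new event type, or finish
    if last_token in SCHEMA or last_token in SCHEMA.get(event_type, []):
        return ENTITY_TYPES + ["<stop>"]      # next: an argument type node
    if last_token in ENTITY_TYPES:
        return POINTERS                       # next: a pointer symbol
    if last_token in POINTERS:
        # next: an entity head from the input sentence, or directly the
        # role if the pointer re-refers to an already-introduced argument
        return candidate_heads + SCHEMA[event_type]
    return SCHEMA[event_type]                 # after a mention: its role

def normalize(score, num_args):
    """Intra-event scores sum more log-probabilities when an event has
    more arguments; dividing by the argument count keeps comparisons
    between candidate event structures fair."""
    return score / max(1, num_args)
```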

Postprocessing
In preliminary experiments, we found that our event parsing model has a bias toward identifying the entity name itself as an argument of the predicted event type. For example, in the sentence Powell, the most moderate member of the Bush cabinet, said he fully agreed with the president's policy on Iraq and had no plans to leave, for the gold event type Personnel:End-Position with the trigger word leave, the pronoun he, which is adjacent to the trigger, is annotated as the answer argument with the role Person. However, our trigger-free generative model may tend to predict the entity name Powell as the argument of this event type. Conceptually, the two entity mentions in this example are co-referent and semantically equivalent. One possible reason is that, when our model extracts the arguments for a specific event type, it recognizes the argument relations mainly by inspecting the contextual information surrounding the candidate arguments without depending on the triggers, and the entity name itself may carry richer contextual information than its pronominal mentions. In the ACE2005 dataset, however, the entity mentions closer to the trigger of an event are usually annotated as the gold-standard arguments. Therefore, before the experimental evaluation, we perform a light postprocessing step to recover co-referring nodes in the event graph predicted by our model. Concretely, we perform coreference resolution simply using the coreferee package (running under Python 3.8); the ablation test of the coreference system is shown in Table 2 in Sect. 5.
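For illustration, a rough sketch of this postprocessing step with spaCy and the coreferee plugin is shown below; the spaCy model name and the chain-walking logic are assumptions, as the paper only states that coreferee is used for resolution.

```python
# A rough sketch of the coreference-based postprocessing (assumptions:
# model choice and mapping step; the paper only names the coreferee tool).
import spacy
import coreferee  # registers the "coreferee" pipeline component

nlp = spacy.load("en_core_web_lg")
nlp.add_pipe("coreferee")

doc = nlp("Powell, the most moderate member of the Bush cabinet, said he "
          "fully agreed with the president's policy on Iraq and had no "
          "plans to leave.")

# Each chain links co-referring mentions (e.g. Powell <- he); a predicted
# argument head can then be swapped for the co-referent mention that the
# gold annotation prefers (the one closer to the event in the sentence).
for chain in doc._.coref_chains:
    mentions = [doc[i].text for m in chain for i in m.token_indexes]
    print(mentions)  # e.g. ['Powell', 'he']
```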

Dataset and evaluation metrics
We utilized the ACE 2005 corpus as our dataset. For comparison, as in previous work [8,38,39], we used the same test set of 40 newswire articles and the same development set of 30 other documents randomly selected from different genres; the remaining 529 documents are used for training. The evaluation of event extraction generally includes three subtasks: event detection, argument identification, and argument classification. Following previous work [4,8,15,39], we use the following criteria to evaluate the results:
• An event type is correct if the predicted event type and subtype match those of a reference event. Note that trigger matching is omitted in our trigger-free framework, similar to Liu et al. [15].
• An argument is correctly identified if its event subtype and offsets match those of any of the reference argument mentions.
• An argument is correctly identified and classified if its event subtype, offsets, and argument role match those of any of the reference argument mentions.
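These criteria can be read as set comparisons over tuples, as in the following sketch; the tuple layouts and dict fields are assumptions chosen to mirror the definitions above, not the official scorer.

```python
# Evaluation criteria as set comparisons (tuple layouts are assumptions).

def f1(pred, gold):
    """Precision/recall/F1 over sets of prediction tuples."""
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def evaluate(pred_events, gold_events):
    # Event detection: the (type, subtype) pair must match a reference event.
    det = f1({(e["type"], e["subtype"]) for e in pred_events},
             {(e["type"], e["subtype"]) for e in gold_events})
    # Argument identification: event subtype and argument offsets must match.
    arg_id = f1({(e["subtype"], a["offsets"])
                 for e in pred_events for a in e["args"]},
                {(e["subtype"], a["offsets"])
                 for e in gold_events for a in e["args"]})
    # Argument classification: the argument role must also match.
    arg_cls = f1({(e["subtype"], a["offsets"], a["role"])
                  for e in pred_events for a in e["args"]},
                 {(e["subtype"], a["offsets"], a["role"])
                  for e in gold_events for a in e["args"]})
    return det, arg_id, arg_cls
```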

Implementation details
We adopt BART-Large, which has 12 encoder and 12 decoder layers, 1024 hidden units, and 16 attention heads, as our encoder-decoder model. Other hyper-parameters are tuned on the validation set. Specifically, the models are trained with the cross-entropy loss using RAdam as the optimizer with a learning rate of 5 × 10^-5. Gradients are accumulated over ten batches. Dropout is set to 0.25. Our models are trained for 50 epochs, and the batch size in our training experiments is set to 400. For decoding, we set the beam size to 3.
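Expressed as code, these settings correspond roughly to the following PyTorch fragment, where `train_loader` is a placeholder and torch.optim.RAdam requires PyTorch 1.10 or later.

```python
# The hyper-parameters above as a hedged training fragment; `train_loader`
# is a placeholder for a dataloader over (sentence, linearized graph) pairs.
import torch
from transformers import BartForConditionalGeneration

# Dropout is a config option, so it is set when the model is loaded.
model = BartForConditionalGeneration.from_pretrained(
    "facebook/bart-large", dropout=0.25)
optimizer = torch.optim.RAdam(model.parameters(), lr=5e-5)
ACCUM_STEPS = 10  # gradients accumulated over ten batches

for epoch in range(50):
    for step, batch in enumerate(train_loader):
        loss = model(**batch).loss / ACCUM_STEPS  # cross-entropy objective
        loss.backward()
        if (step + 1) % ACCUM_STEPS == 0:
            optimizer.step()
            optimizer.zero_grad()
```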

Overall performance
In this section, we comprehensively compare our performance with the following state-of-the-art methods that focus on the two event extraction subtasks, event detection and argument extraction:
• JointBeam [8] proposes a structure-based system with manually designed global features that explicitly capture the dependencies of multiple triggers and arguments.
• DMCNN [4] uses dynamic multi-pooling to extract the best features from different parts of a sentence according to the positions of the trigger and argument candidates.
• JRNN [5] proposes a joint framework with bidirectional recurrent neural networks and manually designed features to jointly extract event triggers and arguments.
• dbRNN [6] is an LSTM-based framework that leverages dependency graph information to extract event triggers and argument roles.
• JMEE [10] models dependency relations between words with graph convolutional networks (GCNs) to exploit syntactic information.
• RCEE [20] proposes a new learning paradigm for EE by explicitly casting it as a machine reading comprehension (MRC) problem based on the BERT-Large model.
• EKD [14] leverages the wealth of open-domain trigger knowledge to improve the event detection subtask.
• Text2Event [21] is a sequence-to-structure model for EE that directly extracts events from text in an end-to-end manner.

Table 2 shows the overall performance comparison between our best system and the above state-of-the-art models. From Table 2, we can see that our approach achieves the best precision, recall, and F1 score in event detection, argument identification, and argument classification among all the compared methods. It is worth noting that our model simultaneously and significantly improves both precision and recall without using any additional information, including POS tags, syntactic dependencies, and external knowledge, which shows the superiority of the proposed graph parsing formulation for EE.
In Table 2, we also conduct an ablation study on beam search for our approach, to investigate the contributions of the model architecture itself and of the nested constrained beam search algorithm. Our model Seq2EG without beam search is already better than the previous best models. Further, the proposed decoding algorithm yields a significant improvement of 2.7% F1 score on the final argument classification subtask. In addition, we conduct an ablation test on the use of the coreference system mentioned in the postprocessing section: if our model does not use the coreference system, the F1 score for argument classification is 66.5%, a performance drop of 1.8%. Notably, among all baselines, the Text2Event model is similar in spirit to our approach, though the two methods have different experimental settings. For a fair comparison, we modified their public code to include the gold entity mention information as input by specifying the set of candidate arguments in the decoding algorithm, and present the corresponding results in Table 2. In this same setting, our model significantly outperforms the Text2Event model by 5.4% F1 score for argument classification. The possible reasons are twofold: (1) our trigger-free fashion leads to more accurate event detection; (2) more importantly, our graph parsing framework can naturally model shared argument elements, in contrast to the tree-based model (see more experimental analyses in Sect. 6.2).

Effect of multiple event extraction
Compared to existing work, our EE approach provides a more natural formulation for modeling multiple event correlations. To evaluate the effect of our approach on multiple event recognition, we divide the test data into two parts (1/1 and 1/N) following previous work and perform evaluations separately [4,5]. 1/1 means that one sentence has only one trigger or an argument plays only one role in a sentence; otherwise, 1/N is used. Table 3 reports the performance (F1 scores) of four baseline models, DMCNN [4], JRNN [5], JMEE [10], and HBTNGMA [7], and of our model on the EE task. As shown in Table 3, our model significantly outperforms all the other methods. On the 1/N data split, our method is 7.9% better than the best baseline on event detection and 9.9% better than the best baseline on argument classification. The experimental results demonstrate that our method works well on multiple event extraction.

Effect of exploiting the graph-structured features
As illustrated in Sect. 3.2, our graph parsing formulation allows for incorporating graph-structured features for EE, including exploiting the label semantics of event types and argument roles, and introducing argument type nodes to extend the event graph. In this section, we examine the effects of these graph-structured features via an ablation study. Concretely, the effect of the semantic representations of the event type and argument role labels is verified by treating them as special symbols, without using their word embeddings learned in the pre-trained language model. (In the full model, we utilize both the event type label and the subtype label by averaging their word embeddings, so as to make full use of semantic representations at different granularities.) Additionally, we evaluate the effect of the argument type nodes by removing them from the extended event graph. Table 4 shows the F1 scores of the full Seq2EG model and of variants with different components turned off one at a time. We can observe that ignoring the semantic representations of the event type and argument role labels decreases the F1 score of argument classification by 3.2% and 3.8%, respectively. Furthermore, removing the argument type nodes from the extended event graph results in a 4.2% drop in F1 score for argument classification. This verifies that all these components contribute to the main model, as performance deteriorates when any component is missing.
To further verify the effect of our graph parsing formulation on the argument sharing problem, we first construct a test subset by selecting the sentences exhibiting argument sharing in the test data, and then run our model Seq2EG and the tree-structure-based model Text2Event on this subset. As shown in Table 5, our approach substantially improves argument classification performance by 8.9% F1 score compared to the baseline Text2Event, which demonstrates the great superiority of our graph parsing formulation in dealing with the argument sharing phenomenon.

Do different linearization strategies matter?
In this section, we inspect the effects of the linearization strategies proposed for event graph linearization in Sect. 4.1. Firstly, we evaluate the strategy for handling argument sharing by removing the special pointer symbols <P0>, <P1>, ..., <Pi> that represent the argument nodes in the linearized sequence of an event graph; next, we investigate the performance of our linearization ordering strategy by adopting the conventional graph traversal order, i.e., not postponing the output of the argument role in the linearized sequence. Finally, we also try another traversal method, breadth-first search (BFS), for comparison. From Table 6, we can observe that both linearization strategies are greatly beneficial to the performance of our model. The linearization ordering strategy improves argument classification performance by 4.3% F1 score. A particularly significant performance difference is visible for the argument sharing strategy: removing the argument sharing mechanism leads to a 7.6% drop in F1 score for argument classification. This result indicates that argument sharing plays a key role in the overall performance. Moreover, it is easy to understand that the performance drops for event detection are relatively small compared to those for the argument classification subtask. The results in the last row of Table 6 demonstrate that DFS is a better traversal method than BFS for event graph linearization.

Can our approach alleviate the long tail issue?
Trigger-based event extraction models generally suffer from the long tail issue [14,40]. Taking the ACE2005 benchmark as an example, trigger words with frequency less than 5 account for 78.2% of the total. The long tail issue makes trigger-based models perform poorly on unseen or sparsely labeled trigger words. In this section, we evaluate whether our approach can cope with the long tail issue.
Following previous work [14], we divide the event instances in the test set into three categories, unseen, sparsely labeled, and densely labeled, according to their trigger frequency in the training set. Specifically, the frequency of sparsely labeled triggers is less than 5, and the frequency of densely labeled triggers is more than 30. Also following [14], we choose the following baselines for comparison: (1) DMBERT [4], (2) DGBERT [41], (3) BOOTSTRAP [42], and (4) EKD [14]. Note that the encoders in the first three baselines are replaced with the more powerful BERT to make the baselines stronger.
As shown in Table 7, our approach substantially outperforms all baselines in two of the settings, especially the unseen setting (+14.7%). Why can our approach effectively mitigate the long tail issue? Besides the better generalization afforded by our seq2seq event graph parsing formulation, an important possible reason is that, since our approach detects events in a trigger-free way, the event types corresponding to unseen or sparsely labeled triggers can also be expressed with other, different triggers and thus appear many times in the training set, thereby alleviating the long tail problem. The experimental results clearly indicate that a trigger-free event extraction approach may be a better alternative to traditional trigger-based models.

Analysis of cross-attention mechanism
In the absence of trigger words, can our transformer-based seq2seq event extraction framework capture the key clues in the source sentence that express the target event type? In this section, we answer this question through a case study. Figure 4 presents several examples of the attention distributions learned by our model. In the first case, the target event type is Life:Die, and the gold trigger is the word killed.
We can see that when predicting this event type, the attention not only successfully attends to the trigger word killed, but also attends to another strongly indicative phrase, two people, with a higher score. In the second case, the target event type is Conflict:Attack, and the gold trigger is the word strike. It can be observed that three words, destroyed, houses, and killed, are assigned higher attention scores than the trigger strike, which seems plausible for predicting this target type. In the third case, the target event type is Personnel:Elect, and the gold trigger is the word lost. For this target type, there are relatively strong connections with the phrase lost her seat and another indicative word, election.
These cases demonstrate that although triggers are not used in our model, the cross-attention mechanism between the decoder and encoder can learn to automatically capture the correlation between the target event type and multiple indicative words or phrases in the source sentence. On the other hand, we also found that our model may produce some redundant predictions due to the flexibility of the cross-attention mechanism. For instance, for the sentence the demonstration came as Iraq's top US overseer Paul Bremer began his second week on the job amid continuing lawlessness in the country, the annotated target event type is Conflict:Demonstrate. Given this input sentence, our model predicts an additional event type, Personnel:Start-Position, besides the target event type. Through analysis, we consider that the event Personnel:Start-Position is wrongly predicted presumably because both the words began and job in the source sentence are strongly attended by the attention mechanism. Therefore, in future work we will explore employing multiple attention mechanisms under the encoder-decoder architecture to further enhance prediction accuracy.

Conclusion
This paper presents the first work to formulate event extraction as a graph parsing task, and introduces a novel generation-based method that predicts the event graph using a pre-trained seq2seq model. Our approach is conceptually simple and does not use syntactic dependency information or any other extra knowledge; nevertheless, it significantly outperforms the traditional classification-based encoder-only approaches, advancing the state of the art in event extraction.
In future work, we will integrate the syntactic dependency structure and external knowledge into our model to enhance the event extraction performance; additionally, we will further extend our model to perform the joint entity recognition and event extraction.
Junsheng Zhou is a professor at the School of Computer and Electronic Information, Nanjing Normal University, China. He is also the director of the Research Center for Educational Intelligence Technology and Application. His research focuses on natural language processing and multimodal learning technology, in particular the application of artificial intelligence in education domain. He has published widely in highly refereed venues (EMNLP, ACL, COLING, IJCAI) on topics ranging from information extraction to semantic parsing and NLP application.
Li Kong is a lecturer at the School of Computer and Electronic Information, Nanjing Normal University. She received her PhD from the Software Institute, Nanjing University, and her BE from the School of Software and Microelectronics, Northwestern Polytechnical University. She was a visiting scholar at the University of Texas at Dallas. Her research interests are in natural language processing, sentiment analysis, text mining, and education intelligence.
Yanhui Gu is a professor at the School of Computer and Electronic Information Science, Nanjing Normal University. He received his PhD from the University of Tokyo, and his MSc and BS from Hohai University. He has over 30 publications in databases, natural language processing, and social computing. His research interests are in data analytics, large language models, and other exciting research directions.
Weiguang Qu is a professor at the School of Computer and Electronic Information, Nanjing Normal University, China. He received his PhD from Nanjing Normal University, MSc from Harbin Marine Engineering College, and BS from Dalian Institute of Technology. His research focuses on natural language processing and computational linguistics. He has over 60 publications in natural language processing and computational linguistics.