We reviewed the essays and annotated various types of errors. Most essays were reviewed by one annotator, except for 54 essays that were reviewed by two annotators to assess inter-annotator agreement (see Section 4.3). All annotators were native speakers of Hebrew, with an undergraduate or a graduate degree in linguistics. The remainder of this section details our annotation scheme. Overall, we annotated 1013 essays out of the 3000 non-native ones. Table 4 specifies the number of annotated essays, sentences and tokens per L1. The distribution of annotated essays over test scores is shown in Fig. 3 (this distribution is a subset of the one shown in Fig. 1, which includes non-annotated essays as well).
Table 4 Statistics of annotated non-native essays per L1

L1      | Essays | Sentences | Tokens
--------|--------|-----------|-------
Arabic  | 342    | 2023      | 50304
French  | 338    | 2989      | 48893
Russian | 333    | 2993      | 47213
Total   | 1013   | 8005      | 146410
Our annotation consists of three distinct pieces of information. First, an indication that a sentence is ill-formed, realized by marking the tokens in the sentence that cause the deviation from standard language. Second, target hypotheses proposed to replace the marked tokens (Section 4.1). Third, interpretations pertaining to the presumed cause of these errors, formulated as a basic classification of the errors by type/cause (Section 4.2).
4.1 Principles of the target hypothesis
When correcting a non-native text, it is sometimes assumed that the language used deviates in some way from “typical”, or “standard”, native language use, and that the author’s intended meaning can be recovered and reconstructed according to the norms of the target language. In reality, this is not a straightforward matter. First, the notion of “standard” native language is elusive: native speakers vary greatly in their use of language, and often do not adhere to prescriptive language norms (Dąbrowska, 2018). Second, it is impossible to construct with certainty an utterance in native-like language that retains the author’s intended meaning, simply because this meaning is not part of the text and is thus unknown.
Therefore, generating an equivalent “native-like” version of a non-native text is a difficult, ill-defined task. Instead, we adopt an approach that minimally modifies the non-native texts by associating (ill-formed) constructions with a target hypothesis (Reznicek et al., 2013). Our goal is to introduce a minimal number of changes in an input sentence so as to obtain a grammatically correct utterance in the target language, one that is amenable to automatic language processing tools such as morphological analyzers and parsers.
In this project, we adopted a broad interpretation of the term “grammar”, to potentially cover all levels of linguistic analysis on which native and non-native language use can be distinguished, including orthography, morphology, syntax, semantics, and discourse. This decision was motivated by the theoretical conception of language as a whole, but also by the properties of Hebrew that make it difficult to tease apart different levels of analysis (see Section 2).
With this notion of grammaticality in mind, annotators were guided to rely on their intuitions as native speakers of Hebrew, as well as on their experience as linguists, when determining whether the text is native-like and, if not, to introduce minimal modifications to make it native-like. As noted above, native language use is inherently variable and, thus, any evaluation and adaptation of texts that is based on speakers’ intuitions is bound to yield variable results. Consequently, the annotation process cannot be entirely consistent across (and even within) annotators. Nevertheless, we formulated elaborate guidelines in an attempt to minimize inter-annotator variability, and introduced means to include alternative interpretations in the annotations, thus recognizing the inherent variability in language use. In the following sub-sections, we describe several general principles that guided annotators in deciding whether a fragment of text should be revised and, if so, how to make the most conservative revision.
4.1.1 The grammaticality principle
Annotators were guided to correct any text fragment that deviated markedly from typical native language use, provided that there was a clear grammatically-correct alternative. Moreover, annotators were guided not to dwell on the text trying to guess the intended meaning, but to follow their initial intuition as much as possible.[11] For example, consider the following sentence:
(2) * הוא צריך להשיג משהו רוצה
hu carix lehasig mašehu roce
he need.sg.m.prs achieve.inf something want.sg.m.prs
*‘He needs to achieve something wants’
(2) is ungrammatical. The most conservative interpretation would be to treat משהו [mašehu] ‘something’ as a morpho-orthographic error, an incorrect merging of the words מה שהוא [ma še-hu] ‘what that-he’. The hypothesized target sentence is then:
(2’) הוא צריך להשיג מה שהוא רוצה
hu carix lehasig ma še-hu roce
he need.sg.m.prs achieve.inf what that-he want.sg.m.prs
‘He needs to achieve what he wants’
This is considered a conservative interpretation since it assumes a simple cause for the error: the fact that both phrases are pronounced identically in speech. Additional examples of errors on various levels of linguistic analysis, as well as the treatment of these errors, are provided in Section 4.2.
4.1.2 The cooperative principle
In the spirit of the Gricean cooperative principle (Grice, 1989), we assume that a sensible author is likely to make sensible utterances. In the current framework, a sensible utterance is one that is acceptable on all levels of linguistic analysis by the standards of native speakers (as assessed by the annotators). Under the cooperative principle, we modify sentences that are syntactically and morphologically valid, but inappropriate in the given context (in contrast to the grammaticality principle, which applies to sentences that are unacceptable in any context).[12] The assumption underlying this principle is that the author likely made an error (e.g., orthographic, morphological) rather than intentionally wrote a sentence that does not make sense.
This principle is particularly important given the orthographic and morphological structure of Hebrew, where small errors can generate existing but semantically-unrelated words by pure chance (see Section 2.2), whereas errors of a similar nature typically generate nonwords in other languages. When an error generates a non-existing word, it is easy to agree that the nonword should be corrected. Given the above considerations, we claim the same should also apply when the error generates an existing word (see examples below).
The formal guideline that follows from the cooperative principle is: given a syntactically and morphologically valid sentence that does not make sense – if the sentence can be made sensible via small orthographic/morphological corrections, revising the sentence should be preferred over retaining the original sentence. Orthographic corrections include transposition, insertion, deletion, or substitution of a letter with a phonetically/visually similar letter. Morphological corrections typically involve a change of affix or non-linear pattern (Binyan for verbs, Mishkal for nouns and adjectives) while retaining the consonantal root. The hallmark of cases that are typically corrected under this principle is a small edit distance between the original and revised token but a large semantic distance (the words belong to different semantic fields). The following examples illustrate the application of the cooperative principle:
(3) זה יבוא רק לתולעת המשפחה עצמה
ze yavo rak le-tola’at ha-mišpaxa acma
it come.3sg.m.fut only to-worm.constr the-family herself
‘It will come only to the worm of the family itself’
Sentence (3) is syntactically correct, but does not make sense in the context in which it appeared (e.g., worms are not mentioned anywhere else in the essay). A plausible explanation for this sentence is a letter transposition error: לתולעת ‘to the worm of’ should probably have been לתועלת ‘to the benefit of’. We annotate this as a spelling error and introduce a correction. The hypothesized target sentence is:
(3’) זה יבוא רק לתועלת המשפחה עצמה
ze yavo rak le-to’elet ha-mišpaxa acma
it come.3sg.m.fut only to-benefit.constr the-family herself
‘It will be (lit.: come) only to the benefit of the family itself’
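The hallmark of cooperative-principle corrections, a small edit distance paired with a large semantic distance, can be made concrete with a character-level Damerau–Levenshtein distance covering exactly the four orthographic operations listed above (transposition, insertion, deletion, substitution). The following Python sketch is purely illustrative and is not part of the annotation tooling:

```python
def dl_distance(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, substitutions,
    and transpositions of adjacent letters (optimal string alignment)."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

# The correction in (3) is a single adjacent transposition, although
# 'worm' and 'benefit' belong to unrelated semantic fields.
assert dl_distance("לתולעת", "לתועלת") == 1
```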
4.1.3 The faithfulness principle: Minimal editing and information maximization
The grammaticality and cooperative principles focus mainly on the justification for revising the text. The faithfulness principle provides general guidelines for how the revision should proceed. According to this principle, the revised text should be as close as possible in meaning and form (i.e., faithful) to the original text. In other words, annotators were instructed to keep the editing as minimal and local as possible and to avoid rewriting the text extensively to make it sound “better”. In practice, if there are several more-or-less equivalent ways of revising the text to make it more native-like, annotators should opt for the option that involves fewer changes, in terms of tokenization and the number of altered words. For instance, sentence (4) is clearly missing a preposition before מחשב /maxšev/ ‘computer’, but there are several suitable alternatives, including ב /be-/ ‘in’, מ /mi-/ ‘from’, and באמצעות /be-emca’ut/ ‘using’. In this case, the first two alternatives are preferable, since the prepositions ב and מ are used as clitics and, therefore, do not affect tokenization. This is demonstrated in (4’). By contrast, adding the stand-alone preposition באמצעות is dispreferred, since it increases the number of words in the text (see 4’’).
(4) * היום אפשר לקרוא מה קורה בסין מחשב שנמצא בפריז
hayom efšar likro ma kore be-sin maxšev še-nimca be-pariz
today possible read.inf what happen.sg.m.prs in-China computer that-situated.sg.m.prs in-Paris
*‘Today it is possible to read what happens in China a computer located in Paris’
(4’) היום אפשר לקרוא מה קורה בסין ממחשב שנמצא בפריז
hayom efšar likro ma kore be-sin mi-maxšev še-nimca be-pariz
today possible read.inf what happen.sg.m.prs in-China from-computer that-situated.sg.m.prs in-Paris
‘Today it is possible to read what happens in China from a computer located in Paris’
(4’’) היום אפשר לקרוא מה קורה בסין באמצעות מחשב שנמצא בפריז
hayom efšar likro ma kore be-sin be-emca’ut maxšev še-nimca be-pariz
today possible read.inf what happen.sg.m.prs in-China using computer that-situated.sg.m.prs in-Paris
‘Today it is possible to read what happens in China using a computer located in Paris’
According to the “information maximization” principle, a revised text should retain the maximal amount of information contained in the original text, and add as little information as possible. We assume the following information content hierarchies:
- Content words > function words
- Lexical morphemes > grammatical morphemes
Lexical morphemes include consonantal roots in Semitic languages and monomorphemic content words. Grammatical morphemes include affixes as well as non-linear morphological patterns in Semitic languages.
In practice, the information maximization principle states that changing lower-order elements on the information content hierarchies is preferred to changing higher-order elements. When two alternative corrections are possible, we implement the one requiring minimal assumptions and minimal modifications of the original text. The following example illustrates this principle.
(5) * להספיק להם את כל צרכיהם
lehaspik lahem et kol craxeyhem
suffice.inf to.them acc all needs.poss.3pl.m
*‘To suffice them all their needs’
(5) is ungrammatical due to a mismatch between the verb and its arguments. The verb להספיק [lehaspik] ‘suffice’ is assigned two internal arguments here: להם [lahem] ‘to them’ and את כל צרכיהם [et kol craxeyhem] ‘all their needs’. Of the two arguments, only the first fits into the argument structure of the verb.[13] However, omitting the second argument would lead to a loss of information. Furthermore, the resulting phrase would still be ungrammatical (or at least odd) in the original wider context:
(5’) יש הורים שלהם אין מספיק כסף כדי להספיק להם ???
yeš horim še-lahem eyn maspik kesef kedey lehaspik lahem
exist parents that-to.them there-is-no sufficient money for suffice.inf to.them
??? ‘There are parents who don’t have enough money to suffice for them’
The more plausible correction involves changing the verb להספיק /lehaSPiK/ to a verb of the same root in a different Binyan (verb pattern): לספק /leSaPeK/ ‘to provide’. The revised verb is compatible with the argument structure of the original sentence. Thus, no information is lost in the revised sentence and the correction requires a single morphological change. The hypothesized target phrase is:
(5”) לספק להם את כל צרכיהם
lesapek lahem et kol craxeyhem
provide.inf to.them acc all need.pl.poss.3pl.m
‘To provide them all their needs’
Alternatively, one could opt for replacing the verb in (5) with a semantically similar verb from another root, such as לתת [latet] ‘to give’, as in (5”’). However, (5”’) involves a change in a lexical morpheme (a root) plus a change in a grammatical morpheme (a morphological pattern), which is less conservative than a change in a grammatical morpheme alone, as in (5”). Therefore (5”) is preferred to (5”’).
(5”’) לתת להם את כל צרכיהם
latet lahem et kol craxeyhem
give.inf to.them acc all need.pl.poss.3pl.m
‘To give them all their needs’
4.1.4 Uncertainty
In many cases, the author expresses an idea in a way that is atypical of native language, and there is some uncertainty about the appropriate correction. In some of these cases the intended meaning seems clear, but there are several equally plausible alternative ways of expressing the idea in the target language. In such cases, annotators could specify multiple target hypotheses in their annotation. For example, sentence (6) is awkward, if not ungrammatical. Two equally plausible target hypotheses of (6) are given in (6’) and (6”).
(6) ??? אף אחד לא מסתכל על האחר או נותן לו את העניין
af exad lo mistakel al ha-axer o noten lo et ha-inyan
no one neg look.sg.m.prs on the-other or give.sg.m.prs 3sg.m.dat acc the-interest
??? ‘No one looks at the other or gives him the interest’
(6’) אף אחד לא מסתכל על האחר או נותן לו את תשומת הלב
af exad lo mistakel al ha-axer o noten lo et tsumet ha-lev
no one neg look.sg.m.prs on the-other or give.sg.m.prs 3sg.m.dat acc input.constr the-heart
‘No one looks at the other or gives attention to them’
(6”) אף אחד לא מסתכל על האחר או מתעניין בו
af exad lo mistakel al ha-axer o mitanyen bo
no one neg look.sg.m.prs on the-other or take-interest.sg.m.prs 3sg.m.loc
‘No one looks at the other or takes interest in them’
In the Hebrew Essay Corpus, multiple target hypotheses are indicated in separate columns, as shown in Table 5. The Token column corresponds to sentence (6). TH1 is a modified version of the full text (e.g., sentence 6’), while TH2 indicates only alternatives to corrections made in TH1 (e.g., parts of 6” that are different from 6’), and is otherwise empty.
Table 5 Multiple target hypotheses

Token    | TH1      | TH2
---------|----------|---------
af       | af       |
exad     | exad     |
lo       | lo       |
mistakel | mistakel |
al       | al       |
ha-axer  | ha-axer  |
o        | o        |
noten    | noten    | mitanyen
lo       | lo       | bo
et       | et       | &&
ha-inyan | tsumet   | &&
&&       | ha-lev   | &&
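For downstream processing, the sparse TH2 column can be expanded back into a full second hypothesis: an empty TH2 cell means “same as TH1”, and the dummy token && marks alignment padding. A minimal Python sketch, assuming a hypothetical tab-separated export with columns named Token, TH1 and TH2 (the file name is illustrative):

```python
import csv

DUMMY = "&&"  # alignment placeholder used in the corpus

def expand_hypotheses(rows):
    """Return the two full target hypotheses as token lists."""
    th1, th2 = [], []
    for row in rows:
        t1 = row["TH1"]
        t2 = row["TH2"] or t1  # empty TH2 cell falls back to TH1
        if t1 != DUMMY:
            th1.append(t1)
        if t2 != DUMMY:
            th2.append(t2)
    return th1, th2

# Usage on a hypothetical export of sentence (6):
with open("sentence6.tsv", encoding="utf-8") as f:
    th1, th2 = expand_hypotheses(csv.DictReader(f, delimiter="\t"))
# th1 ends in "... noten lo et tsumet ha-lev" (6’),
# th2 ends in "... mitanyen bo" (6”).
```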
Another kind of uncertainty occurs when the intended meaning is unclear. In such cases, annotators were advised to leave the text uncorrected and, instead, make free-form comments, or assign special error tags to parts of the text during the error annotation process (see Section 4.2.4). For example, consider the following sentence:
(7) * נראה לי שיש משהו שדומה כמו החלה טכנולוגיה
nir’e li še-yeš mašehu še-dome kmo haxala texnologya
seem.sg.m.prs to.me that-exist something that-similar.sg.m like application technology
*‘It seems to me that there is something that is similar like technology application’
The phrase החלה טכנולוגיה [haxala texnologya] ‘technology application’ is ungrammatical, but it is not clear what the intended meaning was (if the author meant ‘application of technology’, some information seems to be missing, e.g., application to what?). In fact, it is not clear at all that the author meant to use the word החלה ‘application’, rather than some other semantically, morphologically, or phonologically similar word. There is not enough information in the sentence to help recover the target word. The word דומה /dome/ ‘similar.sg.m’ suggests a comparison between entities, which could potentially be helpful. However, the compared entities are not mentioned in the sentence and, since the original order of the sentences is unknown, the context cannot help determine what the relevant entities are. In this case, the most suitable solution is to leave the text unaltered and make comments about the problems in the sentence.
4.2 Interpretation
After revising a text (i.e., forming the target hypothesis), the deviations between the original and revised text were analyzed and tagged. The error tags are stored in a separate column alongside the columns of original and revised tokens. If a single token contains multiple independent errors (e.g., a spelling error and a syntactic error), each error is tagged in a separate error column. If there are multiple target hypotheses for a given phrase, each one has its own set of error annotation columns.
Table 6 demonstrates revision and error annotation of a sentence. The “Token” column contains the tokenization of the original sentence (8), the “TH1” column contains the tokenization of the revised sentence (8’), and the columns labelled “Error1_TH1” and “Error2_TH1” contain the error tags. The full list of error tags used in this project is included in an appendix supplied with the online corpus. Note that tilde signs in glosses indicate deliberate misspellings (e.g., teknology) that mirror orthographic errors in the Hebrew text.
(8) * יותר הטחנולוגיה מתפתח יותר קשה זה למצוא משהו שלא מסתכל את הטלפון כל דקה
yoter ha-texnologya mitpateax yoter kaše ze limco mašehu še-lo mistakel et ha-telefon kol daka
more ~the-teknology evolve.sg.m.prs more hard.sg.m this find.inf something that-neg look.sg.m.prs acc the-telephone every minute
*‘More the teknology (f) evolves.m more difficult it is to find something that doesn’t look the phone every minute’
(8’) ככל שהטכנולוגיה מתפתחת יותר קשה למצוא מישהו שלא מסתכל בטלפון כל דקה
kexol še-ha-texnologya mitpataxat yoter kaše limco mišehu še-lo mistakel ba-telefon kol daka
as-much that-the-technology evolve.sg.f.prs more hard.sg.m find.inf someone that-neg look.sg.m.prs in.the-telephone every minute
‘The more the technology (f) evolves.f the more difficult it is to find someone that doesn’t look at the phone every minute’
Table 6 Tokenized, revised and annotated text

   | Token         | TH1              | Error1_TH1       | Error2_TH1
 1 | yoter         | kexol            | wrong(conj)      |
 2 | ha-texnologya | še-ha-texnologya | shouldB(ח,כ)     | miss(conj,##)
 3 | mitpateax     | mitpataxat       | agree(subj,pred) |
 4 | yoter         | yoter            |                  |
 5 | kaše          | kaše             |                  |
 6 | ze            | &&               | redun(dem)       |
 7 | limco         | limco            |                  |
 8 | mašehu        | mišehu           | oMiss(י)         |
 9 | še-lo         | še-lo            |                  |
10 | mistakel      | mistakel         |                  |
11 | et            | &&               | wrong(prep,&&)   |
12 | ha-telefon    | ba-telefon       | wrong(prep)      |
13 | kol           | kol              |                  |
14 | daka          | daka             |                  |
Tags legend: wrong = incorrect element, conj = conjunction, shouldB = element 1 should be element 2, miss = missing element, agree = agreement error, subj = subject of clause, pred = predicate, redun = redundant element, dem = demonstrative, oMiss = missing letter, prep = preposition
4.2.1 Basic error classification
Error tags have the general form function(arguments). This enables tagging a wide array of errors with a relatively small basic vocabulary of codes. In this configuration, functions indicate the nature of the deviation between the original and revised token. Some common functions include: miss (a missing element), redun (a redundant element), and wrong (a wrong element). Arguments to the functions usually denote linguistic categories affected by the error. These categories include, among other things: orthographic elements, various categories of function words (e.g., prepositions, conjunctions), syntactic categories (e.g., subject, predicate), and categories of content words (e.g., noun, adjective). Most functions require only a single argument. For example, row 1 in Table 6 demonstrates the tagging of an incorrect conjunction.
Other error functions require two arguments. This configuration is typically used with agreement errors. In these cases, the arguments to the function denote the categories of the two elements for which there is a lack of agreement in gender, number, or person. For instance, row 3 in Table 6 contains the tag agree(subj,pred), indicating an agreement error between the feminine subject of the clause, טכנולוגיה [texnologya] ‘technology’ and the main predicate of the clause מתפתח [mitpateax] ‘evolve.sg.m.prs’, which is masculine.
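Since every tag shares the function(arguments) shape, tags can be parsed mechanically into structured records. The following sketch is ours; the regular expression and the ErrorTag record are not part of the corpus distribution:

```python
import re
from typing import NamedTuple

TAG_RE = re.compile(r"(?P<func>\w+)\((?P<args>[^)]*)\)")

class ErrorTag(NamedTuple):
    func: str     # nature of the deviation: miss, redun, wrong, agree, ...
    args: tuple   # affected categories, e.g. ('subj', 'pred')

def parse_tag(tag: str) -> ErrorTag:
    """Parse one function(arguments) error tag."""
    m = TAG_RE.fullmatch(tag.strip())
    if m is None:
        raise ValueError(f"not a function(arguments) tag: {tag!r}")
    args = tuple(a.strip() for a in m.group("args").split(",") if a.strip())
    return ErrorTag(m.group("func"), args)

parse_tag("wrong(conj)")       # ErrorTag(func='wrong', args=('conj',))
parse_tag("agree(subj,pred)")  # a two-argument agreement tag
```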
4.2.2 Multiple analyses
If there is more than one likely analysis of a given error, alternative analyses can be indicated side-by-side. For example, (9) contains the word form יוכלים, which does not exist in Hebrew. In (9’), it was corrected to יכולים [yexolim] ‘can.m.pl.prs’, resulting in a grammatical sentence.
(9) * הצעירים לא יוכלים לעבוד
ha-ce’irim lo yoxlim la’avod
the-young.m.pl neg ~abel work.inf
(9’) הצעירים לא יכולים לעבוד
ha-ce’irim lo yexolim la’avod
the-young.m.pl neg able.m.pl.prs work.inf
‘The young are unable to work’
The error in this example can be analyzed on two different levels: at the orthographic level it can be analyzed as metathesis of two adjacent letters (i.e., כו → וכ). Alternatively, it can be analyzed as a morphological error, i.e., selection of an incorrect non-linear pattern. Both the orthographic and morphological accounts are plausible. Table 7 demonstrates alternative analyses of the same error in the Hebrew Essay Corpus. The TH1 column contains the full revised text, as explained earlier. The TH2 column contains a copy of the revised token [yexolim] (this is in contrast to situations described in Section 4.1.4, in which TH2 was different from TH1). Alternative analyses of the error are indicated in the Error1_TH1 and Error1_TH2 columns.
Table 7 Alternative error analyses

Token      | TH1        | Error1_TH1     | TH2     | Error1_TH2
-----------|------------|----------------|---------|---------------
ha-ce’irim | ha-ce’irim |                |         |
lo         | lo         |                |         |
yoxlim     | yexolim    | metathesis(כו) | yexolim | wrong(pattern)
la’avod    | la’avod    |                |         |
4.2.3 Dependent corrections
Occasionally, correction of one error entails additional corrections, often in different tokens (i.e., some corrections are dependent on others). While we tagged every correction made in the corpus, dependent corrections were excluded from statistical analysis in order to avoid overestimation of the number of errors in the corpus.
One type of dependent correction that was not counted involved insertion of a dummy token that accompanied additional modifications. As explained in Section 3.3, in cases such as deletion, insertion, splitting, merging, or movement of words, a dummy token && was inserted in order to maintain the alignment between original and revised tokens. However, this action may result in differences between the token columns on several rows (some reflecting true errors, others reflecting corrections of alignment). Since the multiple differences stem from a single error, counting all these rows will lead to an overestimation of the number of errors. To prevent this overestimation, we used the same error code in all the rows affected by the same error and added && as an argument to the error function in all the rows containing the dummy token &&.
For example, row 8 in Table 8 demonstrates the annotation of a dummy token inserted as part of a preposition correction in sentence (8). The correction replaced the stand-alone preposition את [et] (an accusative marker) with the cliticized preposition ב ‘in’. Overall, the single preposition correction resulted in the change of two tokens. The change in row 9 was tagged wrong(prep), while the change in row 8, which contains the dummy token, was tagged wrong(prep,&&). Thus, every row containing different original and revised tokens was tagged, but multiple tags related to the same error were marked to be excluded from further analysis.
Table 8 Error tags and dummy tokens

   | Token      | TH1        | Error1_TH1
 1 | yoter      | yoter      |
 2 | kaše       | kaše       |
 3 | ze         | &&         | redun(dem)
 4 | limco      | limco      |
 5 | mašehu     | mišehu     | oMiss(י)
 6 | še-lo      | še-lo      |
 7 | mistakel   | mistakel   |
 8 | et         | &&         | wrong(prep,&&)
 9 | ha-telefon | ba-telefon | wrong(prep)
10 | kol        | kol        |
11 | daka       | daka       |
Note that not all dummy tokens were tagged with &&. Row 3 in Table 8 demonstrates deletion of the demonstrative [ze] ‘this.m’. A dummy token was inserted in the TH1 column to maintain alignment between the original and revised texts, but the error tag redun(dem) does not contain &&, since the correction affected only a single row, matching the actual number of errors (one).
Another type of uncounted error tags marks changes that are required due to other obligatory changes. We call such changes “chain corrections”. Chain corrections do not correct things that were considered errors in the original text, but rather things that would have become errors after the application of another correction. We view chain corrections as stemming from a single source and do not count them, in order to avoid overestimating the number of errors in the corpus. Chain corrections are marked in the Hebrew Essay Corpus by adding ## as an argument to the error function.
One type of chain correction is related to a repeated error in multiple linked words. One such common case is a consistent incorrect usage or omission of a grammatical element in coordination or list constructions. In such a case, ## is added to all repeated instances of the tag referring to the relevant error. For example, in (10) an incorrect preposition ב /be/ ‘in’ is repeated instead of the preposition ל /le/ ‘to’ (as in 10’). Since the errors are identical and occur in a coordination construction that complements a single predicate, we consider them as a single error. Consequently, the second occurrence of the error is tagged wrong(prep,##) to indicate that it is dependent on the first occurrence (see Table 9).
(10) * אנשים רוצים להצליח בחיים ושמים לב יותר בעבודה ולא במשפחה
anašim rocim lehacliax ba-xaim ve-samim lev yoter ba-avoda ve-lo ba-mišpaxa
people want.pl.m.prs succeed.inf in.the-life and-put.pl.m.prs heart more in.the-work and-neg in.the-family
*‘People want to succeed in life and pay more attention in work and not in the family’
(10’) אנשים רוצים להצליח בחיים ושמים לב יותר לעבודה ולא למשפחה
anašim rocim lehacliax ba-xaim ve-samim lev yoter la-avoda ve-lo la-mišpaxa
people want.pl.m.prs succeed.inf in.the-life and-put.pl.m.prs heart more to.the-work and-neg to.the-family
‘People want to succeed in life and pay more attention to work and not to the family’
Table 9 Annotation of a “chain correction” in a coordination construction

Token      | TH1        | Error1_TH1
-----------|------------|---------------
ve-samim   | ve-samim   |
lev        | lev        |
yoter      | yoter      |
ba-avoda   | la-avoda   | wrong(prep)
ve-lo      | ve-lo      |
ba-mišpaxa | la-mišpaxa | wrong(prep,##)
Another type of chain correction involves reattachment of clitics. Recall that Hebrew has several function words that are attached as clitics to the following word. Occasionally, error correction requires such a clitic to be detached from one word and reattached to another. This results in changes in two words although there is only a single underlying error. These changes are tagged using complementary operators (i.e., miss and redun), and one of the tags includes ## to indicate that the errors are dependent. For example, the phrase השאר דברים /ha-š’ar dvarim/ ‘the rest of things’ in (11) has the structure of a construct state (i.e., a noun modified by another noun). In definite construct states in formal Hebrew, the definite article should be attached to the modifier (i.e., the second noun) rather than to the modified (i.e., first) noun. When the definite article is attached to the modified noun, the appropriate correction requires the definite article to be detached from the first noun and reattached to the second, as in (11’). However, since these modifications are dependent, we tag the second correction with ##, as in Table 10, to avoid inflating the number of estimated errors in the corpus.
(11) * לטפל בכל השאר דברים בבית
letapel bexol ha-š’ar dvarim ba-bait
handle.inf in-all the-rest things at.the-house
(11’) לטפל בכל שאר הדברים בבית
letapel bexol š’ar ha-dvarim ba-bait
handle.inf in-all rest the-things at.the-house
‘To take care of the rest of the stuff at home’
Table 10 Annotation of clitic reattachment

Token   | TH1       | Error1_TH1
--------|-----------|-------------
ha-š’ar | š’ar      | redun(det)
dvarim  | ha-dvarim | miss(det,##)
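Taken together, the two conventions mean that a count of independent errors should skip any tag whose argument list contains the alignment marker && or the chain marker ##. A hedged sketch, reusing the hypothetical parse_tag helper from the sketch in Section 4.2.1:

```python
DEPENDENT_MARKS = {"&&", "##"}

def count_independent_errors(tag_cells):
    """Count error tags in one error column, excluding dependent
    corrections: tags carrying && or ## describe the same underlying
    error as another row."""
    n = 0
    for cell in tag_cells:
        if cell and DEPENDENT_MARKS.isdisjoint(parse_tag(cell).args):
            n += 1
    return n

# The Error1_TH1 column of Table 8: wrong(prep,&&) on row 8 is excluded,
# so the sentence contributes three independent errors.
table8 = ["", "", "redun(dem)", "", "oMiss(י)", "",
          "", "wrong(prep,&&)", "wrong(prep)", "", ""]
assert count_independent_errors(table8) == 3
```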
4.2.4 Error tags and no correction
In many cases, a text clearly deviates from typical native language, but there is uncertainty about the appropriate correction. This can occur when the text is incomprehensible, or when there are several plausible corrections, each requiring a different major modification (e.g., a change of syntactic structure). In such cases, annotators were advised not to correct the text. Yet, we tagged errors in individual words if the nature of the error was clear enough. Errors that were not accompanied by any revision of the text were marked by adding $$ as an argument to the error function.
An example of an uncorrected but tagged sentence can be seen in (12). The sentence is clearly incomplete. A possible correction would be to insert some deontic element, such as עדיף [adif] ‘preferable’, as in (12’). However, it is unclear whether that was the author’s intention. The solution in such cases is to insert a dummy token in both the original and revised token columns and tag the error miss(lex,$$), i.e., a missing unknown lexical item. This is demonstrated in Table 11.
(12) * לדעתי להיכנס לנושא שיותר קל לך
leda’ati lehikanes le-nose še-yoter kal lexa/lax
in-my-opinion enter.inf to-subject that-more easy.sg.m to-you
*‘In my opinion to get into a subject that is easier for you’
(12’) לדעתי עדיף להיכנס לנושא שיותר קל לך
leda’ati adif lehikanes le-nose še-yoter kal lexa/lax
in-my-opinion preferable enter.inf to-subject that-more easy.sg.m to-you
‘In my opinion it is better to get into a subject that is easier for you’
Table 11 Error tagging of an unknown missing lexical item

Token     | TH1       | Error1_TH1
----------|-----------|-------------
leda’ati  | leda’ati  |
&&        | &&        | miss(lex,$$)
lehikanes | lehikanes |
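Tags carrying $$ can be filtered out in the same manner when studying errors that were diagnosed but left uncorrected (again assuming the hypothetical parse_tag helper from the Section 4.2.1 sketch):

```python
def uncorrected_tags(tag_cells):
    """Collect tags whose argument list contains $$, i.e. errors that
    were tagged without any accompanying revision of the text."""
    tags = [parse_tag(cell) for cell in tag_cells if cell]
    return [t for t in tags if "$$" in t.args]

uncorrected_tags(["", "miss(lex,$$)", ""])
# -> [ErrorTag(func='miss', args=('lex', '$$'))]
```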
4.2.5 Higher-level interpretations
Another feature of the annotation scheme used in this corpus is the inclusion of interpretive (or, explanatory) error tags. In many cases, an error on one level of analysis affects higher linguistic levels as well. In other cases, scrutinizing an error reveals a plausible cognitive cause for the error, which is not captured by the surface description of the error. In such cases, annotators were able to use an additional set of interpretive error tags to specify their observations. The interpretive error tags were added to the annotation separately (i.e., in distinct columns) from the other error tags.
One class of interpretive error tags analyzes the cognitive basis of orthographic errors. Most often, such errors are analyzed from a phonological perspective (i.e., influence of pronunciation on the written form) or from a visual perspective (i.e., substitution of similarly looking letters). Another class of interpretive error tags analyzes lexical and syntactic errors. The analysis can indicate details such as the semantic effect of a wrong lexical item (e.g., selection of a semantically-related, but inappropriate, word), pragmatic effects (inconsistent use of grammatical tense or person throughout a sentence), and even the use of inappropriate register.
For instance, sentence (13) demonstrates several errors that can be analyzed from different perspectives. The corrected sentence is shown in (13’). Table 12 displays the analysis of the errors, where columns labelled “Error” specify the more basic description of errors, and columns labelled “Interp” contain interpretations of individual errors relative to a specific target hypothesis (e.g., Interp2_TH1 is an interpretation of the second error analyzed in target hypothesis 1).
The use of ככה [kaxa] ‘this way’ instead of זה [ze] ‘this’ is analyzed as a wrong demonstrative (row 2). In addition, it can be viewed as a miscollocation – deformation of the collocation בגלל זה [biglal ze] ‘because of that’. The use of יעשה [ya’ase] ‘do.3sg.fut’ instead of אעשה [e’ese] ‘do.1sg.fut’ is a case of letter substitution, which reflects the colloquial pronunciation of the word, and is common even in the writing of native speakers (row 4). Moreover, it is noteworthy that even if the error is tolerable in informal writing, it is inappropriate in formal (e.g., essay) writing. Thus, the error can be further analyzed as a register error. Finally, בסיכומתרי [bsixometri] ‘~bsychomettric (test)’ exhibits two spelling errors (row 5). The פ-ב substitution is a common error in the Hebrew of native speakers of Arabic resulting from the absence of the consonant [p] (represented by the letter פ) in Arabic (Abu Baker, 2016). Thus, it can be analyzed as an error reflecting the common pronunciation of L2 Hebrew speakers (with Arabic L1). The ט-ת substitution is a homophonic letter substitution (both letters represent the consonant /t/).
(13) * בגלל ככה אני יעשה בסיכומתרי
biglal kaxa ani ya’ase bsixomeTri
because-of this-way I do.3sg.m.fut ~bsychomettric (test)
*‘Because of this way I will take (3sg.m) the bsychomettric (test)’
(13’) בגלל זה אני אעשה פסיכומטרי
biglal ze ani e’ese psixometri
because-of this/that I do.1sg.fut psychometric (test)
‘Because of that I will take the psychometric (test)’
Table 12 Annotated text with explanatory error tags

  | Token      | TH1        | Error1_TH1   | Interp1_TH1         | Error2_TH1   | Interp2_TH1
1 | biglal     | biglal     |              |                     |              |
2 | kaxa       | ze         | wrong(dem)   | colloc              |              |
3 | ani        | ani        |              |                     |              |
4 | ya’ase     | e’ese      | shouldB(א,י) | pronuncReg/register |              |
5 | bsixomeTri | psixometri | shouldB(פ,ב) | pronuncL2           | shouldB(ט,ת) | homophone
Tags legend: wrong = incorrect element, dem = demonstrative, colloc = miscollocation, shouldB(x,y) = element x should be element y, pronuncReg = regular pronunciation (of native speakers), pronuncL2 = pronunciation of L2 speakers (with a specific L1), homophone = homophonic letter substitution
In summary, the interpretive error tags represent a more speculative analysis, and can provide valuable insights that would be harder to reach without specific research hypotheses.
4.3 Evaluation
To evaluate the quality of the corrections and annotations, we chose 54 essays, at various proficiency levels and across all three L1s, to be annotated and corrected by two experienced annotators. In total, this evaluation set included 428 sentences comprising 7757 tokens. The size of the evaluation set (in terms of the number of essays, sentences, and tokens) is about 5% of the annotated corpus. The number of words corrected by both annotators was 667, about 9% of all tokens in the evaluation set.
Due to the complexity of the annotation process, the notion of inter-annotator agreement became complex as well. We calculated inter-annotator agreement on several levels: (i) whether annotators agreed that some word or expression contained an error, (ii) whether they applied the same correction, and (iii) whether they annotated the error similarly when the correction was identical. All cases of disagreement between annotators in these files were resolved by consultation with a third annotator.
The first inter-annotator agreement measure looked only at the binary question, whether both annotators treated word tokens in the same way (i.e., left untouched or corrected). The agreement between the two annotators (micro-averaged over all essays) was 95.4% (Range: 90%-99%, SD: 2%); the macro-average was 95.6%.
A second, stricter measure looked at the proportion of tokens that were corrected identically by both annotators. This measure takes into account (in other words, penalizes disagreement on) both the binary decision (whether to correct a token) and the actual correction. That is, the second agreement measure is the number of tokens corrected identically by both annotators divided by the number of tokens corrected by either annotator. Here, since the annotators had more freedom in determining the target hypothesis of an erroneous token, the agreement was only 57% (Range: 11%-83%, SD: 15%); the macro-average was 58%.
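Both measures reduce to simple arithmetic over position-aligned token columns. A sketch under the assumption that each annotator's corrected tokens are aligned one-to-one with the original tokens (an illustration, not the original evaluation script):

```python
def agreement_measures(tokens, corr_a, corr_b):
    """tokens: original tokens; corr_a, corr_b: the two annotators'
    corrected tokens, position-aligned with the original."""
    # Measure 1: same binary decision (token left untouched vs. corrected).
    same_decision = sum((a != t) == (b != t)
                        for t, a, b in zip(tokens, corr_a, corr_b))
    binary_agreement = same_decision / len(tokens)

    # Measure 2: tokens corrected identically by both annotators,
    # divided by tokens corrected by either annotator.
    either = [(a, b) for t, a, b in zip(tokens, corr_a, corr_b)
              if a != t or b != t]
    identical = sum(a == b for a, b in either)
    correction_agreement = identical / max(len(either), 1)
    return binary_agreement, correction_agreement
```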
To understand why the agreement level on the corrections was relatively low, we scrutinized all cases of disagreement. Overall, we identified four types of disagreement. The distribution of correction differences over the various types is listed in Table 13.
Table 13 Categories of disagreements on corrections

Type                              | %
----------------------------------|---
Different target hypothesis       | 47
Annotator error                   | 26
Differences in chain corrections  | 24
Partially overlapping corrections | 3
Differences due to different target hypotheses are cases in which the annotators chose different but valid ways to correct the texts. Such differences reflect the natural variability of the language (see also 4.1.4 above). For example, the phrase in (14) is ungrammatical. Both annotators corrected the word הראשון [ha-rišon] ‘the-first.sg.m’, but each applied a different but equally acceptable correction (see 14’ and 14’’).
(14) * מטרת הראשון של האפליקציות
matrat ha-rišon šel ha-aplikacyot
goal(f).constr the-first.sg.m of the-applications
*‘The goal(f) of first(m) of the applications’
(14’) המטרה הראשונה של האפליקציות
ha-matara ha-rišona šel ha-aplikacyot
the-goal(f) the-first.sg.f of the-applications
‘The first goal of the applications’
(14’’) המטרה המקורית של האפליקציות
ha-matara ha-mekorit šel ha-aplikacyot
the-goal(f) the-original.sg.f of the-applications
‘The original goal of the applications’
The second type of disagreement was due to an error on the part of one of the annotators. Most often, the error was a failure to correct an obvious error in the text (e.g., a spelling error). Such errors cannot be prevented completely, but it is important to estimate their frequency and overall effect on the annotations.
The third type of disagreement was due to differences in chain corrections. As discussed in 4.2.3, chain corrections refer to a series of corrections in a multi-word phrase, such that a correction of one word requires corrections of additional words in the phrase. If the annotators disagreed on the first correction, this could lead to further disagreements. The disagreement on the first word is analyzed according to one of the previous categories (different target hypothesis, annotator error). However, the additional disagreements should be counted separately, since the words in the phrase are inter-dependent. A common case of disagreement in chain corrections involves the alternation between free and bound morphemes that are semantically equivalent. For example, (15) uses an inappropriate phrase to denote causality. Both annotators corrected it by adding a conjunction. However, the correction in (15’’) also required the omission of the bound preposition מ [me] ‘from’, resulting in a difference of two tokens between the two corrections, as demonstrated in Table 14.
(15) * אנשים שהתאבדו מאתרי אינטרנט
anašim še-hit’abdu me-atarey internet
people that-commit-suicide.pl.pst from-sites.constr internet
*‘People who committed suicide from websites’
(15’) אנשים שהתאבדו כתוצאה מאתרי אינטרנט
anašim še-hit’abdu ke-toca’a me-atarey internet
people that-commit-suicide.pl.pst as-result from-sites.constr internet
‘People who committed suicide as a result of websites’
(15’’) אנשים שהתאבדו בגלל אתרי אינטרנט
anašim še-hit’abdu biglal atarey internet
people that-commit-suicide.pl.pst because-of sites.constr internet
‘People who committed suicide because of websites’
Table 14 Disagreement in chain corrections

Token       | Annotator1  | Annotator2
------------|-------------|-----------
anašim      | anašim      | anašim
še-hit’abdu | še-hit’abdu | še-hit’abdu
&&          | ke-toca’a   | biglal
me-atarey   | me-atarey   | atarey      ← chain correction
internet    | internet    | internet
The last type of disagreement on corrections was in partially overlapping corrections. This type refers to cases of multiple errors in a single word where the annotators agreed on the correction of some of the errors, but not on the others. For example, both annotators changed the bound preposition ל [le] ‘to’ in (16) to the bound preposition ב [be] ‘in’. However, the first annotator did not make additional changes (16’), while the second annotator also changed the noun to which the bound preposition is attached (16’’). Thus, the annotators disagreed at the token level ([be-davar] ‘in-thing’ vs. [be-mašehu] ‘in-something’). However, the fact that they did agree on the correction of the preposition should not be overlooked.
(16) * חפץ בכל ליבו לדבר
xafec be-xol libo le-davar
wish.sg.m.prs in-all heart.poss.3sg.m to-thing
*‘Wishes with all his heart to a thing’
(16’) חפץ בכל ליבו בדבר
xafec be-xol libo be-davar
wish.sg.m.prs in-all heart.poss.3sg.m in-thing
‘Wishes with all his heart for a thing’
(16’’) חפץ בכל ליבו במשהו
xafec be-xol libo be-mašehu
wish.sg.m.prs in-all heart.poss.3sg.m in-something
‘Wishes with all his heart for something’
To summarize, when analyzing learner texts that have been corrected, one should keep in mind that the corrections do not represent absolute truth. First, corrected texts may still contain errors. This includes grammatical errors that would be considered errors by any standard, but also expressions that could be considered errors in some register or dialect but not in another. Second, a given correction could be only one of several plausible corrections that was chosen by a specific annotator. The annotation guidelines attempt to minimize such inconsistency (e.g., by including alternative corrections in the annotated text), but some variability in the corrections cannot be avoided.
Next, we discuss the third inter-annotator agreement measure, which was calculated based on the annotations of tokens that were corrected identically by the annotators. Instead of using the actual error tags, we used more general classes of tags, e.g., one class that accounts for all errors involving prepositions (missing, redundant, and wrong prepositions). The overall agreement on the annotations was 80% (Range: 0%-100%, SD: 18%). As in the analysis of the corrections, we distinguished several types of disagreements on annotations. The distribution of differences in errors tags over the various types is listed in Table 15.
Table 15 Categories of disagreements on annotations

Type                                     | %
-----------------------------------------|---
Different interpretations                | 35
Annotator error                          | 29
Annotation difference with no correction | 20
Partially overlapping annotations        | 16
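The collapsing of specific tags into general classes can be expressed as a small mapping applied before comparing the two annotators' tags. A hypothetical rule for the preposition class mentioned above (the full mapping used in the evaluation is not reproduced here):

```python
def tag_class(tag: ErrorTag) -> str:
    """Fold miss(prep), redun(prep) and wrong(prep) into one 'prep'
    class; fall back to the function name for unmapped categories."""
    return "prep" if "prep" in tag.args else tag.func

assert tag_class(parse_tag("miss(prep)")) == tag_class(parse_tag("wrong(prep)")) == "prep"
```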
Differences in the interpretation of errors occur when there is more than one plausible way to analyze a given error. One of the most common cases of this type of disagreement involves errors in letters that represent bound functional morphemes, such as prepositions (see 2.2). Such errors could, in principle, be analyzed as orthographic errors or as errors in a function word. For example, in one case, both annotators corrected the word קרוב [karov] ‘close’ to בקרוב [bekarov] ‘soon’ (lit. ‘in close’). One of them analyzed the error in the original token as a missing letter, while the other analyzed it as a missing preposition.
As in the disagreements on the corrections, many of the annotation differences were due to an error on the part of one of the annotators. Most often, this happened when both annotators applied the same correction, but one of them did not assign an error tag to the revised token.
The third type of disagreements includes cases in which none of the annotators corrected a given word, but one of them assigned an error tag to it. This usually happened when a content word was used inappropriately, but there was no clear target hypothesis. In such cases, the annotation scheme enables annotators to tag a word even if it was left uncorrected (see 4.2.4). However, adding an error tag is optional in these cases, thus, one annotator may choose to tag the error, while the other may choose not to tag it.
The last type of disagreement on annotations is in partially overlapping annotations. These cases involve multiple errors in a single word where the annotators agreed on the annotation of some of the errors, but not on others. One such case involves an error that both annotators corrected and annotated similarly, and an additional error that neither annotator corrected, but one of them tagged nonetheless. For example, in one instance, both annotators corrected the word אש [ʔš] to איש [ʔiš] and analyzed the error as a missing letter. However, one of them commented that even the corrected word was inappropriate in the context, and added an error tag for a wrong lexical item. The other annotator did not tag the lexical error, leading to partial disagreement on the annotation. Since the annotators did agree on one of the errors, the inter-annotator agreement analysis should take this into account.
To conclude, learner language inevitably involves a certain degree of variability, and our annotation scheme and guidelines were designed with this fact in mind. On the one hand, we attempted to minimize the variability of the annotations by providing elaborate guidelines that address common issues encountered during the annotation process. On the other hand, we acknowledged the fact that the variability cannot be eliminated completely. Consequently, we decided to incorporate the variability in the annotation architecture by allowing annotators to specify multiple target hypotheses and multiple error tags whenever there was more than one way to correct and analyze a given error.
[11] A common observation among linguists is that engaging in grammaticality analysis for an extended period of time can affect linguistic intuitions and reduce the speaker’s confidence in them. This is sometimes referred to as scanting out (e.g., Schütze, 2016: 113) or as syntactic satiation (Sprouse, 2009).
[12] In the Hebrew Essay Corpus, the sentences are not given in their original order. Thus, the immediate context of any given sentence is essentially unknown. However, examining the entire essay, we can determine at least a “thematic” context, against which the appropriateness of sentences can be evaluated to some degree. Occasionally, we encountered sentences that were extremely unlikely and could be judged inappropriate even without context (or, with a “zero” context).
[13] The verb להספיק is ambiguous. One of its uses does take a direct object, but its meaning (‘to succeed doing something on time’) is incompatible with the given context.