Note first that we do not use the term vocabulary growth in the sense of language acquisition research, but in the sense of, for example, Baayen (1996), where it is investigated how the vocabulary develops as the corpus grows. Of course, as corpus size increases, we would expect the vocabulary to grow. However, the vocabulary growth curves should differ according to the amount and type of cleaning we apply to the frequency lists, because each additional cleaning step excludes more wordforms from the dataset. For this case study, we apply several cleaning stages:
A) no cleaning at all;
B) exclusion of punctuation, names, and start-end-symbols (all identified via their respective POS tags), URLs, and wordforms only consisting of numbers (both identified by regular expressions);
C) exclusion of wordforms containing numbers;
D) exclusion of wordforms that contain upper-case letters following lower-case letters[11];
E) exclusion of wordforms where the TreeTagger could not assign a lemma;
F) selection of wordforms that are themselves (or the associated lemma[12]) on a basic lemma list (BLL) of New High German standard language to identify a set of conventionalized word forms (Stadler, 2014). For more information regarding this basic lemma list, please refer to Koplenig et al. (2022, p. 2).
Cleaning stages A through D are cumulative. For example, cleaning stage D incorporates stages B and C. Stages E and F, however, each build directly on stage D, because they can be understood as alternatives with an equivalent aim: identifying 'true' lemmas and wordforms of German. We chose these cleaning stages because they represent very general selections/exclusions in potential research projects in corpus or computational linguistics. One could also think of these cleaning stages as becoming more rigorous towards the ‘core vocabulary’ of the language in each step. Of course, the datasets provided can also be used to test other selections adapted to specific research questions (e.g., only selecting certain POS or applying frequency thresholds).
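The cumulative logic of the stages can be sketched as a single filter function. This is a minimal illustration, not the pipeline used to produce the datasets: the `Entry` record, the tag set in `EXCLUDED_POS` (STTS-style tags plus a hypothetical boundary tag), the two regular expressions, and the `<unknown>` lemma marker are all assumptions made for this example.

```python
import re
from typing import AbstractSet, NamedTuple


class Entry(NamedTuple):
    """One row of a frequency list (hypothetical layout)."""
    wordform: str
    pos: str
    lemma: str


# Assumed tags for punctuation, names, and start/end symbols.
EXCLUDED_POS = {"$.", "$,", "$(", "NE", "SENT_BOUNDARY"}
# Assumed patterns for URLs and number-only wordforms.
URL_RE = re.compile(r"^(https?://|www\.)\S+$", re.IGNORECASE)
NUMBER_ONLY_RE = re.compile(r"^\d+$")
# Assumed marker for "tagger could not assign a lemma".
UNKNOWN_LEMMA = "<unknown>"


def passes(entry: Entry, stage: str, bll: AbstractSet[str] = frozenset()) -> bool:
    """Return True if the entry survives the given cleaning stage (A-F)."""
    w = entry.wordform
    if stage == "A":  # no cleaning
        return True
    # Stage B: POS-based exclusions, URLs, number-only wordforms.
    if entry.pos in EXCLUDED_POS or URL_RE.match(w) or NUMBER_ONLY_RE.match(w):
        return False
    if stage == "B":
        return True
    # Stage C (includes B): no digits anywhere in the wordform.
    if any(ch.isdigit() for ch in w):
        return False
    if stage == "C":
        return True
    # Stage D (includes B, C): no upper-case letter directly after a lower-case one.
    if any(a.islower() and b.isupper() for a, b in zip(w, w[1:])):
        return False
    if stage == "D":
        return True
    # Stages E and F each build on D, not on each other.
    if stage == "E":
        return entry.lemma != UNKNOWN_LEMMA
    if stage == "F":
        return w in bll or entry.lemma in bll
    raise ValueError(f"unknown cleaning stage: {stage!r}")
```

Writing the stages as early returns makes the dependency structure explicit: B, C, and D are nested, while E and F branch off after D.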
4.1 Number of wordform types
We will first examine how the number of wordform types develops when including more and more of the 16 corpus folds. Figure 1 shows that the first four cleaning stages (panels A through D) exhibit roughly the same overall pattern for raw and lowered versions: The vocabulary growth curves do not show clear signs of approaching a ceiling value. This finding replicates several studies for English language corpora that are summarized by Brysbaert et al. (2016, p. 2) who also “failed to find any flattening of the predicted linear curve, indicating that the pool of possible word types was still far from exhausted”. There is, in other words, “no indication of a stop to the growth”, which is an instantiation of Herdan’s (1964) or Heaps’ (1978) law.
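For reference, the law mentioned above is usually written as a power law relating the number of types $V$ to the number of tokens $N$. The symbols $K$ and $\beta$ are generic notation introduced here for illustration; they are not parameters estimated in this study.

```latex
V(N) = K \, N^{\beta}, \qquad K > 0,\; 0 < \beta < 1
\quad\Longleftrightarrow\quad
\log V(N) = \log K + \beta \log N
```

On a log-log scale the relationship is a straight line with positive slope $\beta$, so $V(N)$ has no horizontal asymptote. This is what the absence of a ceiling value in the growth curves corresponds to.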
The final two cleaning stages (panels E and F) quickly show asymptotic behavior, but only for the lowered dataset (grey lines). This is especially true for cleaning stage F, where we restrict the corpus to a fixed set of wordforms. For the raw corpus version, many new forms are still observed, even as we approach the full corpus.
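A growth curve of this kind can be sketched in a few lines. The example below is only an illustration under stated assumptions: each fold is represented as a `Counter` of wordform frequencies, folds are added in a fixed order, and lowering is approximated with `str.lower()`. The function name and the toy data are hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def vocabulary_growth(folds: Iterable[Counter], lower: bool = False) -> List[int]:
    """Cumulative number of wordform types after adding each fold."""
    seen = set()
    sizes = []
    for fold in folds:
        for wordform in fold:
            # Optionally collapse case variants into one type.
            seen.add(wordform.lower() if lower else wordform)
        sizes.append(len(seen))
    return sizes


# Toy data: two folds with case variants of the same word.
toy_folds = [
    Counter({"Haus": 2, "haus": 1, "und": 5}),
    Counter({"HAUS": 1, "und": 3, "Baum": 1}),
]
print(vocabulary_growth(toy_folds))              # [3, 5]
print(vocabulary_growth(toy_folds, lower=True))  # [2, 3]
```

Even in this toy case, the raw curve keeps growing through case variants (`HAUS`) that add nothing to the lowered curve.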
The same data can also be visualized as percentage increases as more and more folds are included (Figure 2). There is virtually no difference between the raw vs. lowered versions for the first three cleaning stages (panels A through C) and the number of observed wordforms still increases in the last step (15 to 16 folds) by approx. 4%. This is remarkably close to the figure reported by Miller and Biber (2015, p. 41) who used a corpus of ten introductory psychology textbooks and investigated the growing lexical diversity when adding whole textbooks to their sample one after another.[13]
The results for cleaning stages E and F are different: In panel E, the fourth step (4 to 5 folds) shows a percentage increase of below 1% for the lowered dataset. In panel F, this is already the case at the second step (2 to 3 folds). The lowest percentage increase is observed for the last step of lowered corpora for cleaning stage F: 0.03% (or, in absolute numbers, 157 newly observed wordforms[14] after adding the final fold). So, for cleaning stages E and F, we can conclude that the boundless growth of vocabulary is far less pronounced than in the previous cleaning stages. This makes sense given that E and F are the cleaning stages where we tried to identify a basic set of German wordforms.
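The step-wise percentage increases follow directly from the cumulative type counts. The sketch below uses hypothetical function names and made-up toy counts; it is not the code or data behind Figure 2.

```python
from typing import List, Optional, Sequence, Tuple


def percentage_increases(type_counts: Sequence[int]) -> List[float]:
    """Percentage increase in types for each step from k to k+1 folds."""
    return [
        100.0 * (after - before) / before
        for before, after in zip(type_counts, type_counts[1:])
    ]


def first_step_below(
    type_counts: Sequence[int], threshold: float = 1.0
) -> Optional[Tuple[int, int]]:
    """First step (k, k+1), counted in folds, whose increase is below threshold."""
    for i, increase in enumerate(percentage_increases(type_counts)):
        if increase < threshold:
            return (i + 1, i + 2)
    return None


# Toy cumulative type counts for 1, 2, 3, and 4 folds.
toy_counts = [100, 150, 165, 166]
print(percentage_increases(toy_counts)[:2])  # [50.0, 10.0]
print(first_step_below(toy_counts))          # (3, 4)
```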
4.2 Number of hapax legomena
Hapax legomena (henceforth: HL), i.e. items appearing only once in a frequency list, are often used in calculations of quantitative linguistic measures, for example in analyses concerning productivity (Baayen, 1994). It is therefore interesting to see how the number of HL is influenced by corpus size (= number of folds), cleaning stage and lowering of the dataset.
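Counting HL for a given number of folds amounts to merging the per-fold frequency lists and keeping the items with a total frequency of 1. The following sketch assumes that each fold is a `Counter` of wordform frequencies, that lowering can be approximated with `str.lower()`, and that the reported share is relative to the number of wordform types; the function name and toy data are hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple


def hapax_stats(folds: Iterable[Counter], lower: bool = False) -> Tuple[int, float]:
    """Return (number of HL, HL as a percentage of all types) for merged folds."""
    merged = Counter()
    for fold in folds:
        for wordform, freq in fold.items():
            # Case variants are merged before counting if lower=True.
            merged[wordform.lower() if lower else wordform] += freq
    n_types = len(merged)
    n_hapax = sum(1 for freq in merged.values() if freq == 1)
    share = 100.0 * n_hapax / n_types if n_types else 0.0
    return n_hapax, share


toy_folds = [
    Counter({"Haus": 2, "haus": 1, "und": 5}),
    Counter({"HAUS": 1, "und": 3, "Baum": 1}),
]
print(hapax_stats(toy_folds))  # (3, 60.0)
```

In the toy data, lowering leaves only `baum` as an HL, because `haus` and `HAUS` are merged with the more frequent `Haus`.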
In terms of frequency, we would expect more HL in larger corpora, especially when no or rather light (cleaning stages A through D) cleaning is performed because we observe more and more 'non-canonical' wordforms as the corpus gets larger. However, it is hard to hypothesize what the pattern looks like for the last two cleaning stages E and F.
There are no considerable differences between the first four cleaning stages (panels A through D in Figure 3) and, indeed, none of the curves show any signs of approaching a point where no new HL are being added after a specific corpus size (which would be indicated by the curve approaching a horizontal asymptote). The highest number of HL is observed for the full raw corpus without any cleaning in panel A of Figure 3 (66,149,313 wordforms, 58.0% of all wordforms).
The last two cleaning stages (panels E and F), which we consider quite ‘strict’, show a different pattern compared to the first four stages. In these cleaning stages, the trajectories of the curves for raw and lowered datasets differ. For the lowered dataset, the number of HL steadily decreases, which is what we would expect given datasets where only recognized lemmas (stage E) or elements from a well-defined word list (stage F) are allowed. Consequently, the lowest number of HL is observed for the full corpus in cleaning stage F for a lowered dataset (2,205 wordforms), which make up 0.37% of all wordforms at this point.
For the raw version, it is especially noteworthy that the number of HL first decreases and then increases again. In stage E, the lowest point of the curve is reached for 4 folds (239,461 HL, 8.6% of all wordforms) with steadily rising counts until the corpus is complete (304,293 HL, 9.6%). To get an idea of where this effect comes from, we can look at the HL in the complete 16-fold dataset that were not observed in the 4-fold dataset (247,607 wordforms). These 228,156 HL had to be added at some point during the inclusion of folds 5 to 16. 51.8% of these wordforms (n = 118,076) consist of upper-case letters only.[15] Another 7.9% (n = 18,111) begin with more than one upper-case letter (e.g., COusinen or HERZstiche), also indicating irregular capitalizations of wordforms that would not be hapax legomena if capitalized in a regular way. So, irregular capitalization seems to play a large role in the increasing number of hapax legomena after the few initial folds for the final cleaning stages of raw corpora.
35,014 HL (15.3%) have an upper-case letter in word-initial position only. Most of these wordforms turn out to be sentence-initial or nominalized forms of adjectives (Flauschigsten, Abwischbarer) and verbs (Durchtauchte, Lullten) or compound nouns (Waldgesundheitsprogramm, Dampflokomotivkessel).[16] Since all the effects reported above involve capitalization, they are not observed for lowered corpora. This explains the diverging patterns for the respective curves in panels E and F in Figure 3.
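The three capitalization patterns discussed above can be operationalized roughly as follows. This is a sketch under assumptions: the class labels and function names are hypothetical, and "upper-case letters only" is approximated with Python's `str.isupper()`, which ignores uncased characters such as hyphens. The exact criteria used for the reported counts may differ.

```python
from collections import Counter
from typing import Dict, Iterable


def capitalization_class(wordform: str) -> str:
    """Assign a wordform to one of the capitalization patterns (assumed labels)."""
    if wordform.isupper():
        return "all_upper"            # e.g. HERZ
    if len(wordform) >= 2 and wordform[0].isupper() and wordform[1].isupper():
        return "multi_upper_start"    # e.g. COusinen, HERZstiche
    if wordform[:1].isupper() and not any(ch.isupper() for ch in wordform[1:]):
        return "initial_upper_only"   # e.g. Flauschigsten
    return "other"


def class_shares(hapaxes: Iterable[str]) -> Dict[str, float]:
    """Percentage of hapaxes falling into each capitalization class."""
    counts = Counter(capitalization_class(w) for w in hapaxes)
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.items()}


print(capitalization_class("COusinen"))       # multi_upper_start
print(capitalization_class("Flauschigsten"))  # initial_upper_only
```

Applied to a lowered frequency list, every wordform would fall into the `other` class, which mirrors why these effects cannot arise for lowered corpora.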