The combinatorial analysis of n-gram dictionaries, coverage and information entropy based on the web corpus of English

doi:10.21203/rs.3.rs-237508/v2


Research Article


This work is licensed under a CC BY 4.0 License

Journal Publication

published 08 Sep, 2021

Published version: Baltic Journal of Modern Computing

Version 2


We study n-gram dictionaries and estimate their coverage and entropy using a web corpus of English. We consider a method for estimating the coverage of empirically generated dictionaries and an approach to addressing their low coverage. Following Kolmogorov’s combinatorial approach, we estimate the n-gram entropy of the English language and use mathematical extrapolation to approximate the marginal entropy. In addition, we approximate the number of all possible legal n-grams in the English language for large n-gram orders.

Information Theory

n-gram entropy

n-gram dictionaries

coverage
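
To make the quantities named in the abstract concrete, here is a minimal Python sketch. It is not taken from the paper and rests on stated assumptions: symbols are whitespace-separated words, coverage is the fraction of n-gram occurrences in a held-out text that are present in the dictionary, and the combinatorial entropy is the base-2 logarithm of the dictionary size divided by n, following Kolmogorov's definition of the entropy of a finite set with N elements as log2 N. The function names are hypothetical, and the paper's own definitions and normalization may differ.

```python
import math


def ngrams(tokens, n):
    """Return all consecutive n-grams (as tuples) in a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_dictionary(tokens, n):
    """Empirical n-gram dictionary: the set of distinct n-grams observed."""
    return set(ngrams(tokens, n))


def coverage(dictionary, tokens, n):
    """Fraction of n-gram occurrences in `tokens` found in `dictionary`.

    Assumed definition for illustration, not necessarily the paper's.
    """
    grams = ngrams(tokens, n)
    if not grams:
        return 0.0
    hits = sum(1 for g in grams if g in dictionary)
    return hits / len(grams)


def combinatorial_entropy(dictionary_size, n):
    """Per-symbol combinatorial entropy in bits: log2(size) / n.

    The division by n is an assumed normalization.
    """
    return math.log2(dictionary_size) / n


# Toy data; a real estimate would use a large corpus.
train = "the cat sat on the mat".split()
held_out = "the cat sat on the rug".split()

bigram_dict = build_dictionary(train, 2)
bigram_coverage = coverage(bigram_dict, held_out, 2)
bigram_entropy = combinatorial_entropy(len(bigram_dict), 2)
```

On a toy corpus like this the numbers are not meaningful. The sketch only shows why empirical dictionaries suffer from low coverage: any n-gram that never occurred in the training text counts as a miss, and the share of such misses grows quickly with n.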