In this paper, we formalize practical byte pair encoding (BPE) tokenization as it is used in large language models and other NLP systems. Because implementations differ in subtle ways, we in particular formally define and investigate the semantics of the SentencePiece and HuggingFace tokenizers, and how they relate to each other; the differences depend on the details of how the dictionary of tokenization rules is constructed. Beyond this, we consider how tokenization can be performed incrementally, as well as left-to-right using an amount of memory constant in the length of the string, which enables, e.g., implementation as a finite-state string-to-string transducer.
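For orientation, the following is a minimal sketch of the kind of BPE tokenization the paper formalizes: a learned, priority-ordered dictionary of merge rules is applied greedily, always merging the adjacent pair with the highest-priority rule. The function bpe_encode and the merges list are illustrative assumptions, not the paper's formal definitions, and the sketch deliberately ignores the implementation details (e.g., SentencePiece vs. HuggingFace semantics) that the paper studies.

    # Illustrative sketch only; names and structure are assumptions,
    # not the paper's definitions or any library's API.
    def bpe_encode(text: str, merges: list[tuple[str, str]]) -> list[str]:
        """Tokenize text by repeatedly applying the best-ranked merge.

        merges is the dictionary of tokenization rules, ordered by
        priority: rules appearing earlier in the list are applied first.
        """
        rank = {pair: i for i, pair in enumerate(merges)}
        tokens = list(text)  # start from individual characters (or bytes)
        while len(tokens) > 1:
            # Rank every adjacent pair; unknown pairs get infinite rank.
            pairs = [(rank.get((a, b), float("inf")), i)
                     for i, (a, b) in enumerate(zip(tokens, tokens[1:]))]
            best_rank, i = min(pairs)
            if best_rank == float("inf"):
                break  # no applicable merge rule remains
            tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
        return tokens

    # Example: with merges [("a", "b"), ("ab", "c")],
    # "abc" tokenizes as ["abc"].
    print(bpe_encode("abc", [("a", "b"), ("ab", "c")]))

Note that this sketch re-scans the whole token sequence after each merge; the incremental and constant-memory left-to-right formulations mentioned above avoid exactly this kind of global rework.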