
SuperBPE: Rethinking Tokenization for Language Models
In the domain of language models, tokenization, i.e., the process of breaking text down into manageable units, plays a pivotal role. Traditionally, models rely on subword tokenization, where words are split into smaller units. However, this approach often overlooks the semantic significance of multi-word expressions, and the way text is segmented varies considerably from language to language.
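To make the limitation concrete, here is a minimal sketch using the GPT-2 BPE tokenizer from Hugging Face's transformers library (an assumed example setup, not part of SuperBPE itself). It shows that a rare word gets split into subword pieces, while a common multi-word expression is never merged into a single token; the exact splits in the comments are typical BPE behavior but may vary by tokenizer version.

from transformers import AutoTokenizer

# Load a standard subword (byte-level BPE) tokenizer.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# A less common word is broken into smaller subword units...
print(tokenizer.tokenize("tokenization"))  # e.g. ['token', 'ization']

# ...but a frequent multi-word expression stays word-by-word: each word
# is a separate token ('G' marks a leading space), so the phrase-level
# meaning of "by the way" is never captured as a single unit.
print(tokenizer.tokenize("by the way"))  # e.g. ['by', 'Gthe', 'Gway']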