2D Tokenization for Large Language Models

When text is passed to a Large Language Model (LLM), it is first broken down into a sequence of words and sub-words. This sequence of tokens is then converted into a sequence of integers and passed to the model.

LLMs contain an embedding matrix that stores a representation for each of these tokens. In the case of the [RoBERTa](https://arxiv.org/abs/1907.11692) model, there are 768 numbers representing each of the ~50,000 tokens in its vocabulary.

This approach raises several questions, like *"how should spaces be represented in the sequence of tokens?"* and *"should different capitalization be considered a different word?"*.

If we look in the embedding matrix of the RoBERTa model, there isn't just one representation of the word `dog`; there are separate representations for variations in capitalization, pluralization, and whether or not it's preceded by a space. In total, RoBERTa has seven separate slots in its vocabulary for various forms of `dog`.
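
To make the tokenize-then-embed pipeline concrete, here is a minimal sketch using the Hugging Face `transformers` library with the `roberta-base` checkpoint (the library choice and the example sentence are my own, not from the article). It shows text being split into sub-word tokens, those tokens being mapped to integer IDs, and the shape of the embedding matrix that stores one 768-dimensional vector per vocabulary entry.

```python
# Minimal sketch, assuming Hugging Face `transformers` (pip install transformers torch).
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base")

text = "The dog chased the ball."

# 1. Text is broken into sub-word tokens.
tokens = tokenizer.tokenize(text)
print(tokens)   # e.g. ['The', 'Ġdog', 'Ġchased', 'Ġthe', 'Ġball', '.']

# 2. Each token is replaced by its integer ID before reaching the model.
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

# 3. The embedding matrix holds one 768-dimensional vector per vocabulary entry.
embeddings = model.get_input_embeddings().weight
print(embeddings.shape)   # torch.Size([50265, 768])
```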
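
Those separate `dog` slots can be inspected directly through the tokenizer. The sketch below (again assuming the Hugging Face tokenizer for `roberta-base`; the list of surface forms is illustrative) tokenizes several variants: a form that has its own vocabulary entry comes back as a single token, while any form that doesn't is split into smaller pieces.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# In RoBERTa's byte-level BPE vocabulary, the 'Ġ' prefix marks a token
# that is preceded by a space.
for form in ["dog", " dog", "Dog", " Dog", "dogs", " dogs", " Dogs"]:
    print(f"{form!r:10} -> {tokenizer.tokenize(form)}")
```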