2D Tokenization for Large Language Models

When text is passed to a Large Language Model (LLM), it is first broken down into a sequence of words and sub-words. This sequence of tokens is then converted into a sequence of integers and passed to the model.

LLMs contain an embedding matrix that stores a representation for each of these tokens. In the case of the [RoBERTa](https://arxiv.org/abs/1907.11692) model, there are 768 numbers representing each of the ~50,000 tokens in its vocabulary.

This approach raises several questions, like *"how should spaces be represented in the sequence of tokens?"* and *"should different capitalization be considered a different word?"*.

If we look in the embedding matrix of the RoBERTa model, there isn't just one representation of the word `dog`; there are separate representations for variations in capitalization, pluralization, and whether or not it's preceded by a space. In total, RoBERTa has seven separate slots in its vocabulary for various forms of `dog`.
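
To make the tokenize-then-embed pipeline concrete, here is a minimal sketch using the Hugging Face `transformers` library with the `roberta-base` checkpoint (the library choice and the example sentence are my own, not from the article). It shows text being split into sub-word tokens, those tokens being mapped to integer IDs, and the shape of the embedding matrix that stores one 768-dimensional vector per vocabulary entry.

```python
# Minimal sketch, assuming Hugging Face `transformers` (pip install transformers torch).
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base")

text = "The dog chased the ball."

# 1. Text is broken into sub-word tokens.
tokens = tokenizer.tokenize(text)
print(tokens)   # e.g. ['The', 'Ġdog', 'Ġchased', 'Ġthe', 'Ġball', '.']

# 2. Each token is replaced by its integer ID before reaching the model.
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

# 3. The embedding matrix holds one 768-dimensional vector per vocabulary entry.
embeddings = model.get_input_embeddings().weight
print(embeddings.shape)   # torch.Size([50265, 768])
```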
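
Those separate `dog` slots can be inspected directly through the tokenizer. The sketch below (again assuming the Hugging Face tokenizer for `roberta-base`; the list of surface forms is illustrative) tokenizes several variants: a form that has its own vocabulary entry comes back as a single token, while any form that doesn't is split into smaller pieces.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# In RoBERTa's byte-level BPE vocabulary, the 'Ġ' prefix marks a token
# that is preceded by a space.
for form in ["dog", " dog", "Dog", " Dog", "dogs", " dogs", " Dogs"]:
    print(f"{form!r:10} -> {tokenizer.tokenize(form)}")
```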