2D Tokenization for Large Language Models
<p>Before text is passed to a Large Language Model (LLM), it is broken down into a sequence of words and sub-words. This sequence of tokens is then replaced with a sequence of integers, and it is this integer sequence that the model actually receives.</p>
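<p>A minimal sketch of this text-to-tokens-to-integers pipeline, assuming the Hugging Face <code>transformers</code> library and the <code>roberta-base</code> tokenizer (the exact sub-word splits depend on the tokenizer):</p>
<pre><code># Sketch: tokenize a sentence and map each token to its integer ID,
# assuming `transformers` is installed and roberta-base is available.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

text = "Tokenization splits text into sub-words."

# Step 1: break the text into word/sub-word tokens.
tokens = tokenizer.tokenize(text)
print(tokens)   # e.g. pieces like 'Token', 'ization', ... (splits vary by tokenizer)

# Step 2: replace each token with its integer ID from the vocabulary.
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)      # a list of integers, one per token
</code></pre>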
<p>LLMs contain an embedding matrix to store a representation for each of these tokens. In the case of the <a href="https://arxiv.org/abs/1907.11692" rel="noopener ugc nofollow" target="_blank">RoBERTa</a> model, there are 768 numbers to represent each of the ~50,000 tokens in its vocabulary.</p>
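<p>One way to see this concretely, again assuming <code>transformers</code> (with PyTorch) and the <code>roberta-base</code> checkpoint, is to load the model and inspect the shape of its word-embedding matrix:</p>
<pre><code># Sketch: inspect RoBERTa's embedding matrix, which has one
# 768-dimensional row per entry in its ~50k-token vocabulary.
from transformers import AutoModel

model = AutoModel.from_pretrained("roberta-base")

embedding_matrix = model.embeddings.word_embeddings.weight
print(embedding_matrix.shape)   # torch.Size([50265, 768])
</code></pre>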
<p>This approach raises several questions, like <em>“how should spaces be represented in the sequence of tokens?”</em> and <em>“should different capitalization be considered a different word?”</em>.</p>
<p>If we look in the embedding matrix of the RoBERTa model, there isn’t just one representation of the word <code>dog</code>; there are separate representations for variations in capitalization, pluralization, and whether or not it’s preceded by a space. In total, RoBERTa has seven separate slots in its vocab for various forms of <code>dog</code>.</p>
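<p>A rough way to check this, assuming the same <code>roberta-base</code> tokenizer as above, is to scan the vocabulary for surface forms of <code>dog</code>; in RoBERTa’s byte-level BPE vocabulary, a leading <code>Ġ</code> marks a token that is preceded by a space:</p>
<pre><code># Sketch: find vocabulary entries whose surface form is a variant of "dog",
# assuming the roberta-base tokenizer from the earlier snippets.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
vocab = tokenizer.get_vocab()   # maps token string -> integer ID

dog_tokens = {
    token: idx
    for token, idx in vocab.items()
    if token.lstrip("Ġ").lower() in {"dog", "dogs"}
}
print(dog_tokens)   # separate entries for dog / Dog / dogs, with and without a leading space
</code></pre>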
<p><a href="https://betterprogramming.pub/2d-tokenization-for-large-language-models-46b295dd1904">Read More</a></p>