All Languages Are NOT Created (Tokenized) Equal

<p>Large language models such as ChatGPT process and generate text by first splitting it into smaller units called <strong>tokens</strong>. In the image below, each colored block represents one token. Short or common words such as &ldquo;you&rdquo;, &ldquo;say&rdquo;, &ldquo;loud&rdquo;, and &ldquo;always&rdquo; are each a single token, whereas longer or less common words such as &ldquo;atrocious&rdquo;, &ldquo;precocious&rdquo;, and &ldquo;supercalifragilisticexpialidocious&rdquo; are broken into smaller subwords.</p> <p><img alt="" src="*LlsaxJwwCix3jVAt.png" style="height:551px; width:700px" /></p> <p>Visualization of the tokenization of a short text using OpenAI&rsquo;s tokenizer website. Screenshot taken by author.</p> <p>This process of <strong>tokenization</strong> is not uniform across languages, so equivalent expressions in different languages can produce very different numbers of tokens. For example, <strong>a sentence in Burmese or Amharic may require 10 times more tokens than a similar message in English.</strong></p> <p><img alt="" src="*PqsXeXMRYVfLj-mD.png" style="height:284px; width:700px" /></p> <p>An example of the same message translated into five languages, with the number of tokens required to tokenize each version (using OpenAI&rsquo;s tokenizer). The text comes from Amazon&rsquo;s MASSIVE dataset.</p> <p>In this article, I explore the tokenization process and how it varies across languages:</p> <ul> <li>An analysis of token distributions in a parallel dataset of short messages that have been translated into 52 different languages</li> <li>The finding that some languages, such as Armenian or Burmese, require <strong>9 to 10 times more tokens than English</strong> to tokenize comparable messages</li> <li>A discussion of the impact of this language disparity</li> </ul>
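<p>The per-language comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the analysis in this article: the function names (<code>utf8_byte_tokenize</code>, <code>token_ratios</code>) are hypothetical, the sample phrases are my own rather than MASSIVE data, and the stand-in tokenizer simply counts one token per UTF-8 byte instead of using OpenAI&rsquo;s tokenizer, so the ratios it prints will not match the figures quoted here.</p>

```python
from typing import Callable, Dict, List


def utf8_byte_tokenize(text: str) -> List[int]:
    """Stand-in tokenizer (an assumption for illustration only):
    one token per UTF-8 byte. This is NOT OpenAI's tokenizer."""
    return list(text.encode("utf-8"))


def token_ratios(
    parallel: Dict[str, str],
    tokenize: Callable[[str], List[int]],
    baseline: str = "en",
) -> Dict[str, float]:
    """Return each language's token count divided by the baseline's count."""
    counts = {lang: len(tokenize(text)) for lang, text in parallel.items()}
    base = counts[baseline]
    if base == 0:
        raise ValueError("baseline text produced zero tokens")
    return {lang: n / base for lang, n in counts.items()}


# Illustrative parallel phrases ("thank you"); not taken from MASSIVE.
messages = {
    "en": "thank you",
    "es": "gracias",
    "ru": "спасибо",
    "am": "አመሰግናለሁ",
}

ratios = token_ratios(messages, utf8_byte_tokenize)

# Print languages from fewest to most tokens relative to English.
for lang, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(f"{lang}: {ratio:.2f}x")
```

<p>Because <code>token_ratios</code> accepts any callable that maps a string to a list of tokens, a real tokenizer can be swapped in. For example, if the <code>tiktoken</code> package is installed, passing <code>tiktoken.get_encoding("cl100k_base").encode</code> as the <code>tokenize</code> argument should give counts closer to what OpenAI&rsquo;s tokenizer website reports.</p>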