The Art of Prompt Design: Prompt Boundaries and Token Healing

This post (written jointly with Marco Tulio Ribeiro) is part 2 of a series on **the art of prompt design** (part 1 here), where we talk about controlling large language models (LLMs) with `guidance`.

In this post, we'll discuss how the greedy tokenization methods used by language models can introduce a subtle and powerful bias into your prompts, leading to puzzling generations.

Language models are not trained on raw text, but rather on tokens: chunks of text that often occur together, similar to words. This affects how language models "see" text, including prompts (since prompts are just sets of tokens). GPT-style models use tokenization methods like Byte Pair Encoding (BPE), which map all input bytes to token ids in a greedy manner. This is fine for training, but it can lead to subtle issues during inference, as the example below shows.

## An example of a prompt boundary problem

Consider the following example, where we are trying to generate an HTTP URL string:

```python
import guidance

# we use StableLM as an example, but these issues impact all models to varying degrees
guidance.llm = guidance.llms.Transformers("stabilityai/stablelm-base-alpha-3b", device=0)

# we turn token healing off so that guidance acts like a normal prompting library
program = guidance('The link is <a href="http:{{gen max_tokens=10 token_healing=False}}')
program()
```

*Notebook output.*

Note that the output generated by the LLM does not complete the URL with the obvious next characters (two forward slashes). Instead, it creates an invalid URL string with a space in the middle. This is surprising, because the `//` completion is extremely obvious after `http:`.
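To see concretely how greedy tokenization creates this bias, here is a minimal sketch: a toy longest-match tokenizer over a small, hypothetical vocabulary (not the real StableLM vocabulary). It shows that `http://` is normally encoded with a dedicated `://` token, so a prompt that stops after the bare `:` lands on a token boundary the model rarely saw during training:

```python
# Hypothetical toy vocabulary, chosen only to illustrate the boundary effect.
# Real BPE vocabularies work the same way, just with tens of thousands of entries.
VOCAB = ["http", "://", ":", "/", "www", ".", "com", " "]

def greedy_tokenize(text):
    """Greedily match the longest vocabulary entry at each position."""
    tokens = []
    i = 0
    while i < len(text):
        match = next(
            (tok for tok in sorted(VOCAB, key=len, reverse=True)
             if text.startswith(tok, i)),
            None,
        )
        if match is None:
            raise ValueError(f"no token for {text[i:]!r}")
        tokens.append(match)
        i += len(match)
    return tokens

print(greedy_tokenize("http://www.com"))  # ['http', '://', 'www', '.', 'com']
print(greedy_tokenize("http:"))           # ['http', ':']
```

In training data, `http` is almost always followed by the single token `://`, not by `:` and then something else, so a prompt ending in the lone `:` token pushes the model off the distribution it learned.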
To understand why this happens, let's change our prompt boundary so that our prompt does not include the colon character.
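The mechanism behind the fix named in the title can be sketched with the same toy vocabulary: token healing backs the prompt up by its last token, then constrains the first generated token to ones whose text starts with the removed characters, so `://` becomes reachable even though the prompt text ended in `:`. This is an illustrative sketch with a hypothetical vocabulary, not guidance's actual implementation:

```python
# hypothetical toy vocabulary (real BPE vocabularies are far larger)
VOCAB = ["http", "://", ":", "/", "www", ".", "com", " "]

def heal(prompt_tokens):
    """Back up the last prompt token; return the trimmed prompt and the
    text prefix the first generated token must start with."""
    *head, last = prompt_tokens
    return head, last

def allowed_first_tokens(prefix):
    """Tokens a constrained decoder may emit first after healing."""
    return [t for t in VOCAB if t.startswith(prefix)]

head, prefix = heal(["http", ":"])
print(head)                          # ['http']
print(allowed_first_tokens(prefix))  # ['://', ':'] -- '//' is reachable again
```

After healing, the model generates from the natural boundary `['http']` with `://` among the legal first tokens, rather than being forced to continue past an unnatural `:` boundary.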