AI Telephone — A Battle of Multimodal Models
<p>Generative AI is on fire right now. The past few months especially have seen an explosion in multimodal machine learning: AI that connects concepts across different “modalities” such as text, images, and audio. As an example, <a href="https://www.midjourney.com/" rel="noopener ugc nofollow" target="_blank">Midjourney</a> is a multimodal text-to-image model, because it takes in natural language and outputs images. The magnum opus of this recent renaissance in multimodal synergy is Meta AI’s ImageBind, which can take inputs from six (!) modalities and represent them in the same embedding “space”.</p>
<p>With all of this excitement, I wanted to put multimodal models to the test and see how good they <em>actually</em> are. In particular, I wanted to answer three questions:</p>
<ol>
<li>Which text-to-image model is the best?</li>
<li>Which image-to-text model is the best?</li>
<li>What is more important — image-to-text, or text-to-image?</li>
</ol>
<blockquote>
<p><em>Of course, each model brings its own biases to the table, from training data to model architecture, so there isn’t really ever one BEST model. But we can still put models to the test in a general context!</em></p>
</blockquote>
<p>To answer these questions, I decided to play a game of AI Telephone, inspired by the board game <a href="https://en.wikipedia.org/wiki/Telestrations" rel="noopener ugc nofollow" target="_blank">Telestrations</a>, which my family and I love to play together.</p>
<p>Telestrations is much like the <a href="https://www.wikihow.com/Play-the-Telephone-Game" rel="noopener ugc nofollow" target="_blank">game of telephone</a>: players go around in a circle, taking in a message from the person on one side and passing their interpretation of it to the person on the other side. As the game goes on, the original message is invariably altered, if not lost entirely. Telestrations differs, however, by adding bimodal communication: players alternate between drawing (or <em>illustrating</em>) a description, and describing (in text) a drawing.</p>
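<p>The alternating loop described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the setup used for the experiments: <code>play_ai_telephone</code>, <code>toy_text_to_image</code>, and <code>toy_image_to_text</code> are hypothetical names, and the two toy functions are deterministic stand-ins for real models.</p>

```python
def play_ai_telephone(prompt, text_to_image, image_to_text, rounds):
    """Alternate between illustrating a description and describing an illustration.

    Returns the full chain: the prompt, then an (image, description) pair per round.
    """
    chain = [prompt]
    description = prompt
    for _ in range(rounds):
        image = text_to_image(description)   # "draw" the current description
        chain.append(image)
        description = image_to_text(image)   # "describe" the drawing
        chain.append(description)
    return chain


# Hypothetical stand-ins for real models (assumptions for illustration only).
def toy_text_to_image(text):
    # Pretend the "artist" only captures the first four words of the description.
    return {"pixels": text.split()[:4]}


def toy_image_to_text(image):
    # Pretend the "captioner" reads back exactly what was drawn.
    return " ".join(image["pixels"])


chain = play_ai_telephone(
    "a red fox jumping over a frozen lake",
    toy_text_to_image,
    toy_image_to_text,
    rounds=2,
)
final_description = chain[-1]
```

<p>Swapping the toy functions for calls to real text-to-image and image-to-text models leaves the loop itself unchanged; comparing the first and last entries of <code>chain</code> shows how much of the original message survived.</p>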
<p><a href="https://towardsdatascience.com/ai-telephone-a-battle-of-multimodal-models-282b01daf044">Website</a></p>