Four mistakes when introducing embeddings and vector search
<p>Representing unstructured data as embedding vectors, and searching over those vectors with embedding-based retrieval (EBR), is more popular than ever. What are embeddings anyway? Roy Keyes explains it well in <a href="https://roycoding.com/blog/2022/embeddings.html" rel="noopener ugc nofollow" target="_blank">The shortest definition of embeddings?</a></p>
<blockquote>
<p>Embeddings are learned transformations to make data more useful</p>
</blockquote>
<p>In academia, this process is known as representation learning and has been a field of research for decades. By transforming the data into vectors, a language native to computers, we can make the data more useful. Take <strong>BERT</strong> (<strong>Bidirectional Encoder Representations from Transformers</strong>) for text as an example.</p>
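<p>To make this concrete, here is a minimal sketch of turning text into embedding vectors with a BERT-style encoder. The <code>sentence-transformers</code> library and the <code>all-MiniLM-L6-v2</code> model are illustrative choices, one of many ways to do this:</p>
<pre><code>
# A minimal sketch: encode text into dense vectors with a BERT-derived model.
# The library (sentence-transformers) and model name are illustrative choices.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # a small BERT-style encoder

texts = [
    "Embeddings are learned transformations to make data more useful",
    "Vector search retrieves the nearest neighbors of a query embedding",
]

# encode() returns one dense vector per input text
embeddings = model.encode(texts)
print(embeddings.shape)  # (2, 384): two texts, 384 dimensions each
</code></pre>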
<p>How <strong>useful the representation is</strong> depends on how we learn the transformation and how well the learned way of representing data generalizes to new data. This is how we do Machine Learning. Take some data, learn something from it, then apply that learning to new data. Simple.</p>
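<p>Applied to retrieval, this means encoding new queries and documents with the learned transformation and comparing the resulting vectors. Below is a minimal sketch of embedding-based retrieval using cosine similarity; the random vectors are placeholders for real model outputs:</p>
<pre><code>
# A minimal sketch of embedding-based retrieval (EBR): score documents by the
# cosine similarity between their embeddings and the query embedding.
import numpy as np

def cosine_scores(query: np.ndarray, documents: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and a matrix of document vectors."""
    query = query / np.linalg.norm(query)
    documents = documents / np.linalg.norm(documents, axis=1, keepdims=True)
    return documents @ query

query_embedding = np.random.rand(384)            # stand-in for an encoded query
document_embeddings = np.random.rand(1000, 384)  # stand-in for an encoded corpus

scores = cosine_scores(query_embedding, document_embeddings)
top_k = np.argsort(-scores)[:10]  # indices of the 10 highest-scoring documents
print(top_k)
</code></pre>
<p>In production, a vector search engine replaces this brute-force scan with an approximate nearest neighbor index, but the scoring idea is the same.</p>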
<p>So what is new? Why the surge in interest? The answer is better model architectures (e.g., the Transformer architecture) and self-supervised representation learning. Add a touch of confusion around Large Language Models (LLMs) such as ChatGPT to the mix, and here we are.</p>
<p>A word about self-supervised learning: using a clever training objective, we can train a model on piles of data without human supervision (labeling). Once that is done, we can fine-tune the model for downstream tasks, requiring far less labeled data than if we started from scratch.</p>
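<p>A minimal sketch of that fine-tuning step, assuming the Hugging Face <code>transformers</code> and <code>datasets</code> libraries; the model name, dataset, and hyperparameters below are illustrative placeholders:</p>
<pre><code>
# A minimal sketch: fine-tune a self-supervised pretrained encoder on a small
# labeled dataset. Model, dataset, and hyperparameters are illustrative.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# The classification head is new; the encoder weights come from self-supervised pretraining.
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# A small slice of labeled data; starting from pretrained weights means we
# need far fewer labels than training from scratch would.
dataset = load_dataset("imdb", split="train[:2000]")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned-model",
                           num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=dataset,
)
trainer.train()
</code></pre>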
<p>Read the full article: <a href="https://bergum.medium.com/four-mistakes-when-introducing-embeddings-and-vector-search-d39478a568c5">Four mistakes when introducing embeddings and vector search</a></p>