Large Language Models, Part 1: BERT

2017 was a landmark year in machine learning: the Transformer model made its first appearance on the scene. It has since performed remarkably well on many benchmarks and has proven suitable for a wide range of problems in data science. Thanks to its efficient architecture, many other Transformer-based models have since been developed, each specialising in particular tasks.

One such model is BERT. It is primarily known for its ability to construct embeddings that accurately represent text and capture the semantic meaning of long text sequences. As a result, BERT embeddings have become widely used in machine learning. Understanding how BERT builds text representations is crucial because it opens the door to tackling a large range of tasks in NLP.

In this article, we will refer to the original BERT paper (https://arxiv.org/pdf/1810.04805.pdf), look at the BERT architecture and understand the core mechanisms behind it. In the first sections, we will give a high-level overview of BERT. After that, we will gradually dive into its internal workflow and how information is passed through the model. Finally, we will learn how BERT can be fine-tuned to solve particular problems in NLP.

High-level overview

The Transformer architecture consists of two primary parts: encoders and decoders. The goal of the stacked encoders is to construct a meaningful embedding of the input that preserves its main context. The output of the last encoder is passed to the inputs of all decoders, which try to generate new information.
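To make the idea of BERT embeddings more concrete, here is a minimal sketch of extracting them from a pre-trained model. It assumes the Hugging Face transformers library and the bert-base-uncased checkpoint, neither of which is prescribed by the article; other checkpoints and pooling strategies work the same way.

```python
# A minimal sketch of extracting BERT embeddings with the Hugging Face
# "transformers" library (an assumption; the article does not prescribe a toolkit).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

sentence = "BERT builds contextual embeddings for text."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Token-level embeddings: shape (batch_size, sequence_length, 768)
token_embeddings = outputs.last_hidden_state

# Two common ways to get a single vector for the whole sentence:
cls_embedding = token_embeddings[:, 0, :]        # the [CLS] token vector
mean_embedding = token_embeddings.mean(dim=1)    # mean pooling over all tokens

print(cls_embedding.shape)  # torch.Size([1, 768])
```

The [CLS] vector is what the original paper uses as the aggregate sequence representation for classification, while mean pooling over tokens is a popular alternative in practice.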
Tags: BERT LLM NLP