In recent years, large language models (LLMs) have become ubiquitous. Perhaps the most famous LLM is ChatGPT, which was released by OpenAI in November 2022. ChatGPT is able to generate ideas, give personalized recommendations, understand complicated topics, act as a writing assistant, or help you build a model to predict the Academy Awards. Meta has announced its own LLM called LLaMA, Google has LaMDA, and there is even an open-source alternative, BLOOM.
LLMs have excelled in natural language processing (NLP) tasks like the ones listed above because they have historically focused on unstructured data: data that does not have a predefined structure and is usually text-heavy. I asked ChatGPT, “Why have LLMs historically focused on unstructured data?” The reply was: