Topic Modeling with Llama 2
<p>With the advent of <strong>Llama 2</strong>, running capable LLMs locally is increasingly feasible. Its accuracy approaches that of OpenAI’s GPT-3.5, which is sufficient for many use cases.</p>
<p>In this article, we will explore how we can use Llama 2 for topic modeling without the need to pass every single document to the model. Instead, we are going to leverage <a href="https://github.com/MaartenGr/BERTopic" rel="noopener ugc nofollow" target="_blank"><strong>BERTopic</strong></a>, a modular topic modeling technique that can use any LLM for fine-tuning topic representations.</p>
<p>BERTopic’s workflow is rather straightforward. It consists of five sequential steps:</p>
<ol>
<li>Embedding documents</li>
<li>Reducing the dimensionality of the embeddings</li>
<li>Clustering the reduced embeddings</li>
<li>Tokenizing documents per cluster</li>
<li>Extracting the best-representing words per cluster</li>
</ol>
<p><img alt="" src="https://miro.medium.com/v2/resize:fit:700/1*BY9n2IWgoFJ3uNnE4wN7cw.png" style="height:405px; width:700px" /></p>
<p>The 5 main steps of BERTopic.</p>
<p>However, with the rise of LLMs like <strong>Llama 2</strong>, we can do much better than a handful of independent words per topic. At the same time, it is computationally infeasible to pass every document to Llama 2 directly and have it analyze them, and while we could search a vector database instead, we would not know in advance which topics to search for. The trick is to have BERTopic find the topics first and then pass only each topic’s keywords and a few representative documents to the LLM.</p>
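<p>This is roughly how a topic is handed to the LLM: a prompt template is filled with the topic’s keywords and a handful of representative documents. The <code>[DOCUMENTS]</code> and <code>[KEYWORDS]</code> placeholders follow the convention used by BERTopic’s representation models; the topic data below is made up for illustration, and the assembled prompt would then be sent to Llama 2.</p>

```python
# Sketch: build an LLM prompt from one topic's keywords and a few
# representative documents, instead of passing every document.
prompt_template = """I have a topic that contains the following documents:
[DOCUMENTS]
The topic is described by the following keywords: [KEYWORDS]

Based on the information above, give a short label for this topic."""

# Hypothetical topic output from BERTopic, for illustration only.
keywords = ["meat", "beef", "eat", "eating", "chicken"]
representative_docs = [
    "I do not eat red meat but do eat chicken.",
    "Eating beef every day seems unhealthy.",
]

# Fill the placeholders with this topic's data.
prompt = prompt_template.replace(
    "[DOCUMENTS]", "\n".join(f"- {d}" for d in representative_docs)
).replace("[KEYWORDS]", ", ".join(keywords))

print(prompt)
```

<p>Because each topic contributes only a short prompt like this, the LLM is called once per topic rather than once per document, which is what makes local models like Llama 2 practical here.</p>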
<p><a href="https://towardsdatascience.com/topic-modeling-with-llama-2-85177d01e174"><strong>Read the full article on Towards Data Science</strong></a></p>