With the advent of Llama 2, running strong LLMs locally has become increasingly feasible. Its accuracy approaches that of OpenAI's GPT-3.5, which is sufficient for many use cases.
In this article, we will explore how to use Llama 2 for topic modeling without passing every single document to the model. Instead, we will leverage BERTopic, a modular topic modeling technique that can use any LLM to fine-tune topic representations.
BERTopic is rather straightforward: it consists of five sequential steps:
- Embedding documents
- Reducing the dimensionality of the embeddings
- Clustering the reduced embeddings
- Tokenizing the documents per cluster
- Extracting the best-representing words per cluster