Running Llama 2 on CPU Inference Locally for Document Q&A
<p>Third-party commercial large language model (LLM) providers like OpenAI’s GPT-4 have democratized LLM use via simple API calls. However, teams may still require self-managed or private deployment for model inference within the enterprise perimeter, for reasons such as data privacy and regulatory compliance.</p>
<p>Fortunately, the proliferation of open-source LLMs has opened up a vast range of options, reducing our reliance on these third-party providers.</p>
<p>When we host open-source models locally, whether on-premises or in the cloud, dedicated compute capacity becomes a key consideration. While GPU instances may seem the most convenient choice, their costs can easily spiral out of control.</p>
<p>In this easy-to-follow guide, we will discover how to run quantized versions of open-source LLMs locally on CPU for retrieval-augmented generation (aka document Q&A) in Python. In particular, we will leverage the highly performant <a href="https://ai.meta.com/llama/" rel="noopener ugc nofollow" target="_blank">Llama 2</a> chat model in this project.</p>
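<p>To make the workflow concrete, here is a minimal sketch of such a pipeline, assuming a LangChain setup with a GGML-quantized Llama 2 chat model served through CTransformers, sentence-transformers embeddings, and a FAISS vector store. The file paths, folder names, and example query are illustrative placeholders, not fixed parts of the project.</p>

```python
# Minimal sketch: CPU-only document Q&A with a quantized Llama 2 chat model.
# Assumes the GGML model binary and the PDF documents have been downloaded
# into the local "models/" and "data/" folders (placeholder paths).

from langchain.document_loaders import DirectoryLoader, PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.llms import CTransformers
from langchain.chains import RetrievalQA

# 1. Load the source PDFs and split them into overlapping text chunks
loader = DirectoryLoader("data/", glob="*.pdf", loader_cls=PyPDFLoader)
documents = loader.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(documents)

# 2. Embed the chunks on CPU and index them in a FAISS vector store
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={"device": "cpu"},
)
vectorstore = FAISS.from_documents(chunks, embeddings)

# 3. Load a quantized Llama 2 chat model (GGML binary) for CPU inference
llm = CTransformers(
    model="models/llama-2-7b-chat.ggmlv3.q8_0.bin",  # placeholder path
    model_type="llama",
    config={"max_new_tokens": 256, "temperature": 0.01},
)

# 4. Wire up retrieval-augmented generation: retrieve top-k chunks, then answer
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 2}),
    return_source_documents=True,
)

result = qa_chain({"query": "Summarise the key points of the document."})
print(result["result"])
```

<p>Note that everything in this sketch runs on CPU: the embeddings are computed with a small sentence-transformers model, and the quantized GGML weights keep the Llama 2 memory footprint low enough for commodity hardware.</p>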