Serve Large Language Models from Your Computer with Text Generation Inference

Running very large language models (LLMs) locally, on consumer hardware, is now possible thanks to quantization methods such as [QLoRa](https://towardsdatascience.com/qlora-fine-tune-a-large-language-model-on-your-gpu-27bed5a03e2b) and [GPTQ](https://github.com/IST-DASLab/gptq).

Given how long an LLM takes to load, we may also want to keep it in memory so we can query it and get results instantly. With a standard inference pipeline, you must reload the model for every run; if the model is very large, you may have to wait several minutes before it generates an output.

Various frameworks can host LLMs on a server, locally or remotely. On my blog, I have already presented the [Triton Inference Server](https://medium.com/towards-data-science/deploy-your-local-gpt-server-with-triton-a825d528aa5d), a highly optimized framework developed by NVIDIA to serve multiple LLMs and balance the load across GPUs. But if you have only one GPU and want to host the model on your own computer, Triton Inference Server may be unsuitable.

[**Read More**](https://medium.com/towards-data-science/serve-large-language-models-from-your-computer-with-text-generation-inference-54f4dd8783a7)
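To illustrate the "keep the model loaded and query it" workflow the article describes, here is a minimal sketch of querying a locally running Text Generation Inference server over its REST `/generate` endpoint. It assumes a TGI server has already been started (for example via the official Docker image) and is listening on `localhost:8080`; the port, prompt, and generation parameters below are placeholder values, not settings taken from the article.

```python
import requests

# Assumes a Text Generation Inference server is already running locally,
# e.g. started from the official Docker image and listening on port 8080.
# The model stays loaded in GPU memory between requests, so each query
# only pays the cost of generation, not of reloading the weights.
TGI_URL = "http://localhost:8080/generate"  # placeholder endpoint

payload = {
    "inputs": "Explain quantization of large language models in one sentence.",
    "parameters": {
        "max_new_tokens": 100,  # cap the length of the generated answer
    },
}

response = requests.post(TGI_URL, json=payload, timeout=120)
response.raise_for_status()

# TGI returns a JSON object containing the generated continuation.
print(response.json()["generated_text"])
```

Because the server process keeps the weights resident in memory, repeated requests like this return quickly instead of waiting through a full model load each time.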
Tags: Generation