Comparing LLMs with MLflow

Comparing models is just as important in Large Language Model Ops (LLMOps) as it is in MLOps, but the process for doing so is a little less clear. In “classical” machine learning, it usually suffices to compare models on a set of clear numerical metrics; the model with the better score wins. This is not typically the case with LLMs (though there are plenty of [performance benchmarks](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) that capture and quantify various aspects of model performance).

Selecting the best LLM can depend on less-tangible (or at least less-quantifiable) model characteristics. Some practitioners refer to this as taking a “vibe check” of a model. Models might differ in the tone or detail of their responses, or in their correctness in various niche areas.

A fairly simple requirement of any LLMOps platform, then, is the ability to straightforwardly compare the outputs of different models on the same prompts. This is one of the features enabled by using [MLflow for LLMOps](https://www.databricks.com/blog/announcing-mlflow-24-llmops-tools-robust-model-evaluation).

In this post, we’ll walk through the process of comparing a few small (<1B parameter) open-source text-generation models with one of MLflow’s core LLMOps capabilities, the `mlflow.evaluate()` function. Using small models makes it easier to try out the comparisons without worrying about provisioning sufficient cloud resources. Note, however, that the outputs from these small models aren’t always very coherent or relevant.

[**Learn More**](https://medium.com/@dliden/comparing-llms-with-mlflow-1c69553718df)
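As a preview of what this looks like, here is a minimal sketch of evaluating two small Hugging Face text-generation models on the same prompts with `mlflow.evaluate()`. The model names, prompts, and the `"text"` column name are illustrative assumptions, and the exact metrics logged depend on your MLflow version.

```python
import mlflow
import pandas as pd
from transformers import pipeline

# A handful of shared evaluation prompts; the "text" column name is an assumption.
eval_data = pd.DataFrame(
    {"text": ["What is MLflow?", "Explain the difference between MLOps and LLMOps."]}
)

# Two small (<1B parameter) models; these names are just examples.
for model_name in ["distilgpt2", "EleutherAI/pythia-160m"]:
    generator = pipeline("text-generation", model=model_name)

    with mlflow.start_run(run_name=model_name):
        # Log the pipeline so mlflow.evaluate() can load it as a pyfunc model.
        model_info = mlflow.transformers.log_model(
            transformers_model=generator,
            artifact_path="model",
        )

        # Evaluate on the shared prompts; with model_type="text", MLflow logs
        # text-oriented metrics and an evaluation table of model outputs.
        mlflow.evaluate(
            model=model_info.model_uri,
            data=eval_data,
            model_type="text",
        )
```

Each model then appears as its own run in the MLflow UI, where the logged evaluation table lets you compare the different models’ outputs for the same prompts side by side.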
Tags: LLMs MLflow