Speech and Natural Language Input for Your Mobile App Using LLMs

A Large Language Model (LLM) is a machine learning system that can effectively process natural language. The most advanced LLM available at the moment is GPT-4, which powers the paid version of ChatGPT. In this article you will learn how to give your app highly flexible speech interpretation using GPT-4 function calling, in full synergy with your app's Graphical User Interface (GUI). It is intended for product owners, UX designers, and mobile developers.

## Background

Digital assistants on mobile phones (Android and iOS) have failed to catch on for a number of reasons, among them that they are faulty, limited, and often tedious to use. LLMs, and now especially OpenAI's GPT-4, hold the potential to make a difference here, with their ability to grasp the user's intention more deeply instead of coarsely pattern matching a spoken expression.

Android has Google Assistant's 'app actions' and iOS has SiriKit intents. These provide simple templates you can use to register speech requests that your app can handle. Google Assistant and Siri have already improved quite a bit over the past few years, even more than you probably realize. Their coverage, however, is largely determined by which apps implement support for them. Nevertheless, you can, for instance, play your favorite song on Spotify using speech. The natural language interpretation of these OS-provided services, however, predates the huge advances that LLMs have brought to this field, so it is time for the next step: harnessing the power of LLMs to make speech input more reliable and flexible.

Although we can expect the operating system services (like Siri and Google Assistant) to adapt their strategies soon to take advantage of LLMs, we can already enable our apps to use speech without being limited by these services. Once you have adopted the concepts in this article, your app will also be ready to tap into [new assistants](https://www.axios.com/2023/07/31/google-assistant-artificial-intelligence-news) once they become available.

The choice of LLM (GPT, PaLM, Llama 2, MPT, Falcon, etc.) does have an impact on reliability, but the core principles you will learn here can be applied to any of them. We will let the user access the entirety of the app's functionality by saying what they want in a single expression. The LLM maps a natural language expression to a function call on the navigation structure and functionality of our app. And it need not be a sentence spoken like a robot. **The LLM's interpretation powers allow users to speak like a human, in their own words or language, and to hesitate, make mistakes, and correct themselves.** Where users have rejected voice assistants because they often fail to understand what is meant, the flexibility of an LLM can make the interaction feel much more natural and reliable, leading to higher user adoption.
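
To make the mapping from spoken input to app functionality concrete, here is a minimal sketch of how the OpenAI Chat Completions API with function calling could be used for this purpose. It is an illustration under assumptions, not the article's reference implementation: the `navigate_to` function, its screen names, and the system prompt are hypothetical examples; the transcribed text is assumed to come from the platform speech recognizer; and the sketch uses OkHttp and org.json with an API key passed in by the caller.

```kotlin
import okhttp3.MediaType.Companion.toMediaType
import okhttp3.OkHttpClient
import okhttp3.Request
import okhttp3.RequestBody.Companion.toRequestBody
import org.json.JSONArray
import org.json.JSONObject

// Hypothetical function schema: the model is told which "functions" the app exposes
// (here a single navigate_to with example screen names) and decides which one to call,
// and with which arguments, based on the user's spoken request.
private val functionSchema = """
[
  {
    "name": "navigate_to",
    "description": "Navigate to a screen of the app and optionally apply a search query.",
    "parameters": {
      "type": "object",
      "properties": {
        "screen": {
          "type": "string",
          "enum": ["home", "search", "orders", "settings"],
          "description": "The screen to open."
        },
        "query": {
          "type": "string",
          "description": "Free-text search or filter derived from the user's request."
        }
      },
      "required": ["screen"]
    }
  }
]
""".trimIndent()

// Sends the transcribed speech to the Chat Completions endpoint and returns the
// function name plus parsed arguments chosen by the model, or null if it answered
// with plain text instead of a function call.
fun interpretSpeech(transcribedSpeech: String, apiKey: String): Pair<String, JSONObject>? {
    val payload = JSONObject()
        .put("model", "gpt-4-0613")
        .put("messages", JSONArray()
            .put(JSONObject()
                .put("role", "system")
                .put("content", "Map the user's request to one of the available functions."))
            .put(JSONObject()
                .put("role", "user")
                .put("content", transcribedSpeech)))
        .put("functions", JSONArray(functionSchema))
        .put("function_call", "auto")

    val request = Request.Builder()
        .url("https://api.openai.com/v1/chat/completions")
        .header("Authorization", "Bearer $apiKey")
        .post(payload.toString().toRequestBody("application/json".toMediaType()))
        .build()

    OkHttpClient().newCall(request).execute().use { response ->
        val message = JSONObject(response.body!!.string())
            .getJSONArray("choices").getJSONObject(0).getJSONObject("message")
        // The "arguments" field is a JSON-encoded string, so parse it before dispatching
        // to the corresponding screen or action in the app.
        val call = message.optJSONObject("function_call") ?: return null
        return call.getString("name") to JSONObject(call.getString("arguments"))
    }
}
```

In a real app you would run this off the main thread, keep the API key on a backend rather than in the client, and extend the function list to cover the app's full navigation structure. The point of the sketch is only to show how a single free-form utterance such as "show me my orders from last week" can come back as a structured, dispatchable function call.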