“It worked when I prompted it” or the challenges of building an LLM Product
How do we serve a response with confidence, if we don’t know how confident we should be about the response?
Over the last few weeks, we have been building an LLM product of our own, a Chrome extension aimed at improving English writing skills for non-native speakers, to see first-hand the challenges of taking an LLM to production. Our extension is free to use and open-source. All you need is an OpenAI API key.
Here are the main takeaways:
LLMs, specifically LLM APIs, make it easy to build complex applications.
However, to take those applications to production, you need to make them reliable, scalable and trustworthy. Therein lies the biggest challenge with building LLM products: making them production-ready.
The lack of reliability is mainly due to the type of problem you are solving, your LLM API provider and the vague nature of prompt engineering.
There is also a lack of good practices and patterns for building LLM applications.
Challenges of LLM Products
In no particular order, here are the major challenges we have faced when building this product.
One of the significant challenges with using LLM APIs is the lack of SLAs or commitments from the API provider on endpoint uptime and latency. While building our application, we saw real inconsistency in how long the API took to return a result. Creative workflows need quick responses so users can capitalise on their flow state; having to wait even a few extra seconds for the model's result can break that flow.
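One small mitigation is to put a hard client-side deadline on every call, so a slow response fails fast instead of stalling the user. A minimal sketch, assuming the 0.x `openai` Python client and a hypothetical `complete_with_deadline` helper:

```python
import concurrent.futures

import openai  # assumes the 0.x openai client and that openai.api_key is already set


def complete_with_deadline(prompt: str, deadline_s: float = 5.0) -> str:
    """Make one completion call, but give up after `deadline_s` seconds."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(
        openai.Completion.create,
        model="text-davinci-003",
        prompt=prompt,
        max_tokens=256,
    )
    try:
        response = future.result(timeout=deadline_s)
        return response["choices"][0]["text"]
    except concurrent.futures.TimeoutError:
        raise TimeoutError("LLM API did not respond within the deadline")
    finally:
        # Stop waiting for the (possibly still running) request instead of blocking on it.
        pool.shutdown(wait=False)
```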
Prompt engineering, the craft of writing prompts for the model, is another challenge, as results from the same prompt can vary unpredictably. Best practices for prompts may not carry over to future models, making it hard to predict how effective they will remain. Additionally, the model's output is natural language, which can be ambiguous and inconsistent, especially when you need to parse specific information out of it. This makes your product unreliable. You can make the output more predictable by providing examples of expected outcomes in the prompt, but these also tend to fail, especially for complex problems.
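To make the parsing problem concrete, here is a hedged sketch (not our actual extension code) of what it looks like when you ask the model for JSON: it usually complies, but you still need a fallback for the times it does not.

```python
import json

# Illustrative prompt: we ask for JSON so the output is machine-readable.
PROMPT_TEMPLATE = """Rewrite the sentence in fluent English.
Respond ONLY with JSON of the form {{"rewrite": "...", "explanation": "..."}}.

Sentence: {sentence}
"""


def parse_model_output(raw: str) -> dict:
    """Try to pull the JSON object out of whatever the model actually returned."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # The model sometimes wraps the JSON in prose ("Sure! Here is the JSON: ...").
        start, end = raw.find("{"), raw.rfind("}")
        if start != -1 and end > start:
            try:
                return json.loads(raw[start : end + 1])
            except json.JSONDecodeError:
                pass
    # Last resort: surface the failure instead of showing the user garbage.
    return {"rewrite": None, "explanation": None, "parse_error": raw}
```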
Complex products with chains of prompts can further increase inconsistencies, leading to incorrect and irrelevant outputs, often called hallucinations. Chains also suffer from a lack of reproducibility: the same prompt with the same settings may produce different results, making it difficult to ensure consistency in the product. LLMs also hallucinate with extreme confidence, which makes the hallucinations difficult to spot.
Another significant challenge is the lack of adequate evaluation metrics for the language model's output. It is hard to serve a result with confidence when you don't know how confident you should be about it.
An incorrect result in the middle of the chain can cause the remaining chain to go wildly off track. In many cases, getting the chain back on track with prompting is very difficult. But how can you even tell if your chain is off track? And how can you check if the correcting prompt has successfully brought the chain back on track?
Our biggest problem, the one that caused the most delays? API endpoint deprecation. When we started building our demo, we used OpenAI's text-davinci-002 model and created a whole set of carefully crafted, finetuned, few-shot prompts for that API. It worked really well. However, a few weeks into the project, OpenAI deprecated that endpoint and suggested devs move to text-davinci-003. Unfortunately, this wasn't an easy transition.
Trust and security issues also pose a challenge for deploying language models. There are concerns about how API providers use your data, especially in light of recent news about proprietary code leaks. There were also concerns that OpenAI would use the data to train their next generation of models; recently, though, OpenAI announced that you can opt out of that.
The next trust issue is not knowing what data was used to train these models, since that data can adversely affect the output the model produces. One of the speakers at yesterday's conference (I believe it was Hanlin) mentioned that if you are building a financial model, you do not want it trained on data from r/wallstreetbets. More importantly, if you are building a model that will make medical diagnoses and suggest treatments, you definitely want to avoid training data full of misinformation.
Finally, attacks on language models pose another challenge, as malicious actors can trick them into outputting harmful or inaccurate results. Tools like Guardrails (built by Shreya Rajpal, another speaker at yesterday's conference) are being developed to avert such attacks, but better and more reliable solutions are needed.
Best Practices for Building LLM Products
With these problems in mind, what are the solutions, and what are the best practices for deploying LLMs to production?
Finetuning and Training
Many problems can be fixed by finetuning an LLM or training your own language model from scratch instead of using an API. In our previous newsletter, we compared the two approaches and talked about the economics of choosing one over the other. Using an API has a low barrier to entry and is a good way to build an MVP without investing in a team of engineers and data scientists. However, as your product attracts more users and grows in complexity, it becomes better to finetune or train your own model.
Prompt Engineering
Another challenge of deploying language models in production is developing effective prompts. Here are some guidelines you should follow:
Since language is ambiguous, write your prompts so there is as little room for interpretation as possible.
You should move away from basic prompts quickly and instead use few-shot prompts containing some context about the problem and examples of inputs and expected outputs.
The next step in prompt engineering is probably Chain-of-Thought, where you ask the model to explain, step by step, how it arrived at its answer (a rough sketch of both styles appears a little further down).
However, prompt engineering can get complex quickly, and you get diminishing returns from prompt tuning after Chain-of-Thought. The prompts become large and complex, so either there are not enough tokens left for your data or the costs get too high. With Chain-of-Thought, your output size also increases, which adds latency and cost.
This blog post by Lilian Weng covers prompt engineering techniques and references papers if you want to learn more; it is the resource I used when writing our prompts. The best option is to move on to finetuning a model as soon as your prompts get complex and your outputs get less reliable.
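As promised above, here is a rough sketch of the basic few-shot and Chain-of-Thought styles for a grammar-correction task. These are illustrative prompts, not the ones we shipped.

```python
# Few-shot: give the model a couple of input/output pairs so it copies the pattern.
FEW_SHOT_PROMPT = """You fix grammar mistakes made by non-native English speakers.

Input: He go to school yesterday.
Output: He went to school yesterday.

Input: She have many informations about it.
Output: She has a lot of information about it.

Input: {sentence}
Output:"""

# Chain-of-Thought: additionally ask the model to reason step by step before answering.
COT_PROMPT = """You fix grammar mistakes made by non-native English speakers.
First list the errors you see, one per line, then give the corrected sentence
on a final line starting with "Corrected:".

Input: {sentence}
"""

prompt = FEW_SHOT_PROMPT.format(sentence="They was happy for see you.")
```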
When you find a prompt that works well, save it and treat it as being every bit as valuable as your API keys; this prompt is what separates you from the competition. LangChain and other tools provide excellent APIs for managing prompts. In this era of LLMs, a good prompt is like an in-house trained model from the “old days”.
On the MLOps side, these prompts need to be versioned, and we need tests that check the integrity of the prompt results.
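A minimal sketch of that idea (the file layout and names are illustrative, not our actual setup): keep the prompt in version control as data, pin the model and settings alongside it, and run a smoke test that checks the contract your parser relies on rather than the exact wording.

```python
# prompts/rewrite_v3.json lives in version control next to the code, e.g.:
# {"model": "text-davinci-003", "temperature": 0.0, "template": "... {sentence} ..."}
import json

import openai  # assumes the 0.x openai client and that openai.api_key is already set


def load_prompt(path: str) -> dict:
    with open(path) as f:
        return json.load(f)


def test_rewrite_prompt_integrity():
    """Smoke test: the pinned prompt + model still produce output in the shape we parse."""
    spec = load_prompt("prompts/rewrite_v3.json")
    response = openai.Completion.create(
        model=spec["model"],
        prompt=spec["template"].format(sentence="He go to school yesterday."),
        temperature=spec["temperature"],
        max_tokens=128,
    )
    text = response["choices"][0]["text"]
    # We can't assert exact wording (outputs vary), only the contract we rely on.
    assert "Corrected:" in text
```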
Vector Databases
If you need to process a large amount of data, and your users might ask multiple questions about that data, you should use a vector database. The pattern is to embed the data once and store the embeddings in a vector database, which can be queried quickly; at question time you retrieve only the most relevant chunks and pass those to the LLM, or answer straight from the retrieved results when you can. There are many vector databases; Chroma is one that integrates well with LangChain. Every question you can answer without hitting the LLM API reduces your costs and your latency.
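A minimal sketch of that pattern using Chroma's Python client (the collection name and documents are illustrative, and Chroma's default embedding function does the embedding):

```python
import chromadb  # Chroma's Python client

client = chromadb.Client()
collection = client.create_collection(name="style_guide")  # illustrative name

# Embed the reference material once, up front (Chroma's default embedder is used here).
collection.add(
    documents=[
        "Use 'a lot of' with uncountable nouns such as 'information'.",
        "Past events usually take the simple past tense, e.g. 'went', not 'go'.",
    ],
    ids=["rule-1", "rule-2"],
)

# At question time, retrieve only the most relevant chunks...
results = collection.query(query_texts=["Why is 'many informations' wrong?"], n_results=1)
relevant_chunks = results["documents"][0]

# ...and pass just those chunks (not the whole corpus) to the LLM, or skip the
# LLM call entirely when the retrieved text already answers the question.
```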
Chains, Agents, Watchers
Using really long chains and complex agents is not something I would advise; they don't work reliably enough yet to deploy to production. One way to address errors and malicious outputs is to use a watcher language model that checks the output of another language model. The watcher can also fix and parse the first model's responses. Be aware, though, that this setup increases your costs. Also, who watches the watcher?
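A sketch of the watcher pattern, with a hypothetical `call_llm` helper standing in for whichever client you use:

```python
# A sketch of the watcher pattern; `call_llm` stands in for your API client of choice.
def call_llm(prompt: str) -> str:
    raise NotImplementedError  # e.g. a thin wrapper around your completion endpoint


WATCHER_PROMPT = """You review another model's answer before it is shown to a user.
Answer to review:
{answer}

Reply with exactly one line:
OK
or
REJECT: <one-sentence reason>
"""


def watched_completion(task_prompt: str) -> str:
    answer = call_llm(task_prompt)
    verdict = call_llm(WATCHER_PROMPT.format(answer=answer)).strip()
    if verdict.startswith("OK"):
        return answer
    # Note: this doubles the number of LLM calls, and the watcher itself can be wrong.
    raise ValueError(f"Watcher rejected the answer: {verdict}")
```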
While searching for best practices, I didn't find much out there. Many of the current best practices released by OpenAI, AI21 Labs and others are from last year (they may as well be from the previous decade at the pace this field is moving, lol) and don't say much about production architectures or design patterns for deployment. I hope the community comes up with more of these soon.