Welcome back to Quick Bites! This month, we're keeping it short and sweet, ensuring our busy readers get their dose of insightful content. As January unfolds, the buzz around AI at the World Economic Forum in Davos is hard to miss. The conference put a spotlight on a conscious approach to AI, emphasizing its application across various sectors and its intersection with other technologies, all while prioritizing people-first strategies.
Among business leaders, there's a growing concern about the 'impending doom' of AI overreach. But the real head-scratcher is the evaluation of these evolving models. For instance, the challenge in RAG (Retrieval-Augmented Generation) applications is significant. These systems need to be assessed for not just the accuracy and relevance of their responses, but also for their ability to retrieve and apply pertinent context.
Typically, human annotation is the go-to method for such evaluations. However, its drawbacks include being time-consuming, error-prone, and unable to handle real-time systems. While metrics like perplexity can assess the language model, they fall short for the complete RAG system.
Enter the world of self-evaluating systems, like RAGAs, which use LLMs (Large Language Models) for reference-free evaluations. But this raises an intriguing dilemma: how objective can a system be in evaluating its own output?
Evaluating the quality of RAG applications in production is a considerable challenge. The evaluation needs to account for not only the quality and faithfulness of the generation but also the ability to identify and retrieve relevant context.
Human annotation is the most accurate evaluation method. However, it is slow and prone to errors and biases. Moreover, you cannot use human evaluators for real-time systems. Metrics like perplexity can be used to evaluate the performance of the language model itself but not the performance of the entire RAG system.
The holy grail for RAG evaluations is self-contained and reference-free, meaning you do not need human-annotated reference answers. RAGAs is a popular framework for doing exactly this. However, to make the system reference-free, LLMs are used to evaluate the generated answers. Herein lies the problem.
A Typical RAGAs Evaluation
RAGAs uses OpenAI’s API by default to calculate four main metrics: answer relevancy, faithfulness, context recall, and context precision. The default model is GPT-3.5-turbo, though you can swap in your own LLM. By taking the harmonic mean of the four metrics, you get the ragas score, which “is a single measure of the performance of your QA system across all the important aspects.”
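Since the ragas score is a harmonic mean, one weak metric drags the aggregate down much harder than an arithmetic mean would. A quick sketch, using made-up per-metric scores (not real RAGAs output):

```python
from statistics import harmonic_mean

# Hypothetical per-metric scores in [0, 1] -- illustrative only.
scores = {
    "answer_relevancy": 0.92,
    "faithfulness": 0.95,
    "context_recall": 0.61,   # the weak link
    "context_precision": 0.88,
}

# The ragas score is the harmonic mean of the four metrics, so the low
# context recall pulls the aggregate well below the arithmetic mean (~0.84).
ragas_score = harmonic_mean(scores.values())
print(f"ragas score: {ragas_score:.3f}")  # -> ragas score: 0.814
```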
To run your evaluation, you provide RAGAs with the metrics you want to calculate, the query, the answer, and the context used to arrive at the answer.
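In code, that boils down to assembling a record per query and handing it to RAGAs. The sketch below follows the RAGAs docs at the time of writing, but the field names and imports have changed across releases, so check your installed version; the query/answer content and the `RUN_RAGAS_EVAL` guard flag are my own placeholders:

```python
import os

# One evaluation record: the query, the retrieved context chunks, and the
# generated answer. (context_recall may additionally require a ground-truth
# answer column, depending on the RAGAs version.)
record = {
    "question": "What is the capital of France?",  # hypothetical query
    "contexts": ["Paris is the capital and largest city of France."],
    "answer": "The capital of France is Paris.",
}

# The actual evaluation calls OpenAI (GPT-3.5-turbo by default), so it is
# guarded behind an opt-in flag here rather than run unconditionally.
if os.environ.get("RUN_RAGAS_EVAL"):
    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import (
        answer_relevancy, faithfulness, context_recall, context_precision,
    )

    dataset = Dataset.from_dict({k: [v] for k, v in record.items()})
    result = evaluate(
        dataset,
        metrics=[answer_relevancy, faithfulness, context_recall, context_precision],
    )
    print(result)  # per-metric scores plus the aggregate ragas score
```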
The results show that my RAG response was faithful, and the retrieved context was relevant to the question. However, I can improve my context recall, which measures “the ability of the retriever to retrieve all the necessary information needed to answer the question.” Overall, since my answer was faithful, i.e., factually consistent with the provided context, I can serve this answer to my user with high confidence!
But what is the cost of running this eval?
I ran RAGAs evaluation on our RAG application data. Plotting the number of tokens sent to OpenAI shows that, on average, ~90% of the tokens are used for running the evaluation; just ~10% of my tokens were used to generate the response!
But at least it was fast, right? Nope, for my application, each evaluation (4 metrics) takes, on average, 15-20 seconds to run and involves five requests to OpenAI.
Finally, each evaluation costs somewhere between $0.10 to $0.15 in API costs.
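Those per-eval numbers compound quickly at production volume. The $0.10–$0.15 range comes from my measurements above, but the traffic figure below is a hypothetical I picked purely for illustration:

```python
cost_per_eval = (0.10, 0.15)  # USD per evaluation, from the measurements above
queries_per_day = 5_000       # hypothetical traffic, for illustration only

low = cost_per_eval[0] * queries_per_day
high = cost_per_eval[1] * queries_per_day
print(f"daily eval spend: ${low:,.0f} to ${high:,.0f}")
# At this (made-up) volume: $500 to $750 per day just to score answers.
```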
And so, while our evaluation and monitoring system is like buying a Ferrari to watch over a bicycle, OpenAI is not just baking the cake but gleefully devouring it as well.
Death by RAG Evals
My first concern is the token overhead of using LLMs for evaluation. The problem is that the eval query includes the original RAG query, context, and answer, and the generated eval outputs are quite long as well. As a result, the eval requires about 9x the number of tokens needed for the original query-and-response pair.
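The 9x multiplier and the ~90% share are two views of the same arithmetic, which is easy to sanity-check; the absolute token count below is a hypothetical placeholder:

```python
# Back-of-the-envelope check: if the eval requires ~9x the tokens of the
# original RAG call, the eval's share of total spend is 9 / (1 + 9) = 90%.
rag_tokens = 1_500        # hypothetical tokens for query + context + answer
eval_overhead = 9         # eval tokens as a multiple of the RAG tokens
eval_tokens = rag_tokens * eval_overhead

total = rag_tokens + eval_tokens
print(f"eval share of tokens: {eval_tokens / total:.0%}")  # -> 90%
```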
This can be reduced by running fewer eval metrics rather than all four, but I would argue that each metric in RAGAs is essential and gives a good overview of the RAG system's performance.
Secondly, it takes at least 10 seconds to run all the evaluation metrics per RAG query. For larger query-answer pairs, it can take more than 20 seconds. If you use evals to verify that your responses are truthful and accurate before serving them, this adds directly to your latency.
Finally, I am not sure about using LLMs to evaluate the output of other LLMs. Different LLMs will assign different scores to the same response; the creators also allude to this in their docs. So do we just choose an LLM that gives us the best scores? Do we fix a scoring LLM and then try to improve our RAG output against it? Or do we finetune a RAG-scoring LLM for our domain? Doesn't that kind of defeat the purpose?
RAG evaluations, while essential for AI accuracy and reliability, often resemble a 'death by a thousand cuts' with their excessive token use, time delays, and financial costs.