OpenAI vs Self Hosted LLMs: A Cost Analysis
How do you know if a self-hosted LLM will be cheaper than OpenAI's API?
Over the past few weeks, we have been experimenting with deploying an open-source LLM for our product, speakerscribe.com. The world of AI is vast, and while many are familiar with GPT, understanding the intricacies of its pricing and utilization can be daunting for those who aren’t neck-deep in the technicalities. That’s where our initiative began.
While our OpenAI API costs are pretty low, we wondered if a self-hosted LLM could be more economical. We were even toying with a tiered offering, with OpenAI being the premium paid feature and a self-hosted or local LLM being the free tier. But all these considerations hinge on the local deployment being cost-effective.
Here lies the challenge: reconciling OpenAI's token-based pricing with the compute-time-based pricing of most GPU services. After extensive exploration, we learned that while OpenAI's token-based pricing might initially seem high, it can be cheaper and more efficient than a locally deployed LLM for a wide range of applications and load profiles.
However, this isn't a straightforward decision for everyone. Thus, we've developed the GPT Subscription Advisor app. This simple tool lets you input your estimated usage details, and in return, you get a recommendation on which option might be best for you financially. It's designed for both techies and non-techies alike. So whether you're a developer, a business owner, or someone just dipping your toes in the world of AI, give it a whirl and find out which option suits your needs best!
Understanding the Pricing Structure of OpenAI and Local LLM
OpenAI employs a token-based pricing approach. Tokens are categorized into two types: input tokens and output tokens. Input tokens are the ones you include in your prompt, essentially the question or command you're asking the model. Output tokens, on the other hand, are the generated results that the model returns.
Output tokens come with a higher price tag compared to input tokens. This isn't arbitrary; there's a computational reason behind it. GPT models are autoregressive, meaning that output tokens are generated one at a time, and at each step the model must attend to the entire sequence of past text (both the input and the previously generated output tokens). This iterative process is computationally expensive, justifying the higher output token cost. However, this pricing approach differs from that of other LLM API providers like Cohere, which charges the same for input and output tokens.
The pricing model subtly encourages developers and product builders to be more verbose in their input prompts. Since input tokens are cheaper, a more detailed prompt could potentially lead to more accurate or nuanced outputs without significantly affecting the cost.
Another advantage of this pricing strategy is that it becomes cost-effective when the tasks require short outputs. For example, if you're using the model for classification tasks where the output is a simple JSON object containing only the predicted class, you'll pay less.
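To make the arithmetic concrete, here is a minimal sketch of token-based cost estimation. The per-1k-token prices below are hypothetical placeholders, not OpenAI's actual rates; check the official pricing page for current figures.

```python
# Hypothetical per-1k-token prices (placeholders, not official rates).
INPUT_PRICE_PER_1K = 0.0015   # USD per 1k input tokens
OUTPUT_PRICE_PER_1K = 0.002   # USD per 1k output tokens

def api_call_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of a single API call under token-based pricing."""
    return (input_tokens / 1000) * INPUT_PRICE_PER_1K + \
           (output_tokens / 1000) * OUTPUT_PRICE_PER_1K

# A classification task: verbose prompt, tiny JSON output.
print(api_call_cost(input_tokens=800, output_tokens=20))
```

Note how the long prompt dominates the token count but barely moves the bill, since input tokens are the cheaper kind; the short JSON output keeps the expensive side small.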
With OpenAI, you pay per token, both for input and output. However, when you deploy a large language model (LLM) on a cloud instance yourself, you rent computational resources. Your costs are tied to your chosen instance, not the number of tokens you generate.
The billing cycle for these cloud instances can vary depending on the provider. Some may charge you by the hour, while others might have more granular billing cycles, such as by the minute. This starkly contrasts OpenAI's token-based pricing, where you're billed based on usage rather than time.
As you scale your LLM to handle more requests or to speed up processing, you'll likely need to add more computational resources. Whether it's an extra GPU or additional memory, these upgrades come at a cost. This could be a downside, especially for LLMs that require significant computational power and memory to operate efficiently.
Despite the potential for higher costs, deploying an LLM locally has its own advantages. One of the most compelling is the absence of API limits. You're also not billed for the number of prompts or output tokens. This means you can submit as many requests as your model can handle and generate as many tokens as you need without worrying about additional costs.
Comparing Prices and Introducing CATS
OpenAI's API and self-hosted LLMs have different pricing structures, making it challenging to compare the two directly. The first hurdle in comparing costs is the difference in billing models. OpenAI charges based on the number of tokens generated (usage-based), while self-hosted LLMs incur costs based on the time the cloud instance runs (time-based). To make a meaningful comparison, we need to standardize these disparate models.
One approach to standardize the comparison is calculating how many tokens you can generate in an hour with each model and the associated costs. This gives us a theoretical upper limit on the total volume of tokens that can be generated within a given time frame and the costs involved in doing so. We call this theoretical maximum 100% utilization.
However, this calculation provides only the theoretical maximum cost and the best-case scenario for a self-hosted LLM. In reality, your self-hosted LLM will likely not operate at maximum utilization. The actual utilization will depend on various factors, like the number of clients you have or how frequently your LLM is called. There's a tipping point where low utilization makes it more economical to use OpenAI's API instead of a self-hosted solution.
Interestingly, the cost per token for a self-hosted LLM increases as the utilization of your product decreases. This is because the fixed costs of running the cloud instance remain constant, but the number of tokens generated decreases, effectively raising the cost per token.
Let's introduce a new metric, Cost-Adjusted Tokens (CATS), to quantify this relationship. The formula for CATS is:

CATS = Instance Cost per Hour / (Maximum Tokens per Hour × Utilization)
This metric helps you understand how the cost per token increases as utilization decreases, providing a more nuanced view of the economics involved.
By calculating CATS for your specific use case, you can better understand the tipping point where switching from a self-hosted LLM to OpenAI's API becomes more economical. This empowers you to make a more informed decision, ensuring you get the most value for your investment.
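The CATS calculation, and the tipping point it implies, can be sketched as follows. The instance price, throughput, and OpenAI rate used in the example are assumptions for illustration only.

```python
def cats(instance_cost_per_hour: float,
         max_tokens_per_hour: float,
         utilization: float) -> float:
    """Cost-Adjusted Tokens: effective cost per token at a utilization in (0, 1]."""
    return instance_cost_per_hour / (max_tokens_per_hour * utilization)

def breakeven_utilization(instance_cost_per_hour: float,
                          max_tokens_per_hour: float,
                          api_cost_per_token: float) -> float:
    """Utilization below which the API becomes cheaper than self-hosting."""
    return instance_cost_per_hour / (max_tokens_per_hour * api_cost_per_token)

# Hypothetical: $0.738/hr instance, 1.44M max tokens/hr, API at $0.002 per 1k tokens.
print(cats(0.738, 1_440_000, 1.0))                      # cost per token, fully loaded
print(breakeven_utilization(0.738, 1_440_000, 2e-6))    # tipping-point utilization
```

Halving utilization doubles the effective cost per token, which is exactly the relationship CATS is meant to expose.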
Analyzing Costs for a 13B Llama 2 Model
We ran the unquantized 13B Llama 2 model at varying batch sizes using vanilla HuggingFace code on two 4090 GPUs. The GPUs were from valdi.ai and cost $0.738 per hour. Our findings are in the table below:
Cells highlighted in red are where the costs of self-hosting are far greater than the cost of OpenAI.
Batch Size: The number of sequences processed in parallel in a single batch.
Tokens/sec: The rate at which tokens are processed every second.
Total Tokens/hour: Extrapolating the tokens/sec to an hourly rate.
Utilization Percentages (100% to 1%): The cost of generating a token at varying utilization rates. A 100% utilization means the LLM is running at full capacity, while 1% indicates minimal usage.
OpenAI: The cost per 1k output tokens when using OpenAI's GPT-3.5 API.
So, what can we tell from the data?
Batch Size Efficiency: As the batch size increases, the tokens processed per second also increase, leading to a higher total of tokens per hour. This implies that larger batch sizes are more efficient in terms of throughput.
Cost Dynamics with Utilization: At 100% utilization, the cost per token for self-hosted LLMs decreases as the batch size grows. However, as the utilization drops, the cost per token increases. This highlights the importance of maintaining high utilization for cost efficiency in self-hosted models.
The OpenAI Comparison: OpenAI's cost remains constant at $0.002 per 1k output tokens. When comparing this to the self-hosted LLMs:
Even at 100% utilization, only batch sizes of more than 6 offer a cheaper cost per token than OpenAI.
As utilization decreases, the cost advantage of self-hosted LLMs diminishes. For instance, at 25% utilization or less, your costs will exceed OpenAI's, while higher utilization at large batch sizes still offers some cost benefits.
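The comparison the table makes can be replayed with a short sweep over utilization rates. The instance price matches the one quoted above, but the maximum throughput (1.44M tokens/hr) is a hypothetical stand-in for one row of the table, so the exact crossover point will differ for your batch size.

```python
# Hypothetical numbers: $0.738/hr instance, 1.44M max tokens/hr,
# OpenAI at $0.002 per 1k tokens ($0.000002 per token).
INSTANCE_COST = 0.738
MAX_TOKENS_PER_HOUR = 1_440_000
OPENAI_PER_TOKEN = 0.002 / 1000

def cheaper_option(utilization: float) -> str:
    """Which option wins at a given utilization of the self-hosted instance."""
    self_hosted = INSTANCE_COST / (MAX_TOKENS_PER_HOUR * utilization)
    return "self-hosted" if self_hosted < OPENAI_PER_TOKEN else "OpenAI"

for util in (1.0, 0.5, 0.25, 0.10, 0.01):
    print(f"{util:>5.0%}: {cheaper_option(util)} is cheaper")
```

With these assumed numbers, self-hosting wins above roughly 26% utilization and OpenAI wins below it, mirroring the pattern in the table.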
The decision between using a self-hosted LLM and OpenAI's API largely depends on the expected utilization. If you anticipate high utilization, a self-hosted LLM is more cost-effective, especially with larger batch sizes. However, OpenAI's consistent pricing might be more economical for sporadic or low utilization.
Finally, batch processing will be more economical than a real-time system as long as the OpenAI-equivalent cost of the tokens you generate exceeds the cost of running the instance.
We hope to run more models at different specs and share the results. We will update the website with that data!