The current pricing of LLM APIs is less a reflection of their true operational cost and more a weapon in a fierce battle for market share, developer loyalty, and enterprise integration.
The most fun part about working at a place like OpenAI is seeing how wrong random substackers are when making claims about training and serving frontier models. The worst part about working at OpenAI is not being able to correct people without divulging inside knowledge.
I’ll say this: frontier LLM providers are running fantastic businesses. Your math is very wrong.
This analysis doesn't pass the smell test. They claim 90% subsidization based on a "true cost" of $6.37 per million tokens, but inference providers like Together AI and Fireworks profitably serve 70B models at $0.90-$2.00 per million.
The math assumes pathetically low throughput (1,848 tokens/sec on 8x H200s) and uses cloud pricing instead of actual hardware costs. Modern serving stacks with continuous batching and quantization achieve much higher utilization.
The timing is suspicious - OpenAI just dropped o3 prices 80% last week. If they were already subsidizing 90%, they'd now be at 98% losses, which is absurd.
Note the author is selling a "Token Optimization" workshop at AgentCon. Classic FUD marketing: create panic about future price hikes, position yourself as the expert, sell the solution.
Unlike Uber or AWS, LLMs are basically interchangeable - there's no lock-in. Why would providers heavily subsidize a commodity service where customers can switch with a one-line code change? The margins are probably thin but positive, especially at scale.
If the economics were truly this dire, we'd see inference providers shutting down, not OpenAI aggressively cutting prices further.
Hey Christopher. Thank you for your comments.
You raise some very important points.
You can get as much as 400 tokens/sec throughput on an H200 for Llama 3.1 405B [1]. If you work that through against Together AI's price for that model ($3.50/M tokens), you can see that they do make a profit, albeit a very thin one (rough arithmetic in the sketch below). The economics for smaller LLMs definitely make sense (at scale). Not so much for the larger LLMs that the major API players are running.
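Here is that back-of-the-envelope sketch. The $3.80/hour H200 rate is my assumption, roughly in line with the on-demand prices discussed in [2]; the throughput and the Together price are the figures above:

```python
# Rough margin check for Llama 3.1 405B served on H200s.
# Assumption: an H200 rents on demand for roughly $3.80/hour (ballpark of [2]);
# the per-GPU throughput and the $3.50/M price are the figures cited above.

throughput_tok_s = 400        # per-GPU throughput, tokens/sec [1]
gpu_cost_per_hr = 3.80        # USD/hour per H200 (assumed on-demand rate)
price_per_m_tok = 3.50        # Together AI's price for 405B, USD per 1M tokens

tokens_per_hr = throughput_tok_s * 3600               # ~1.44M tokens/hour
cost_per_m_tok = gpu_cost_per_hr / (tokens_per_hr / 1e6)
margin_per_m_tok = price_per_m_tok - cost_per_m_tok

print(f"cost   ~ ${cost_per_m_tok:.2f} per 1M tokens")
print(f"margin ~ ${margin_per_m_tok:.2f} per 1M tokens")
# cost   ~ $2.64 per 1M tokens
# margin ~ $0.86 per 1M tokens -> profitable, but thin
```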
The throughput is based on benchmarks run by other folks in the community [2]. You are correct about modern serving stacks; however, you should assume a batching efficiency of around 60% (and even at higher efficiencies, the costs are heavily subsidized; see the sketch after this paragraph). If you are serving a quantized model, you have to tell people that and charge less. For instance, Together would (or at least should) tell people that the model they are using is quantized and charge them accordingly. You cannot serve a quantized model and charge full-precision prices.
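The same sketch with the 60% batching efficiency applied; again, the efficiency figure and the hourly rate are assumptions, not measurements:

```python
# Same arithmetic, but assuming production traffic only sustains ~60%
# of the peak benchmark throughput (the batching-efficiency assumption above).

peak_tok_s = 400              # benchmark (peak) per-GPU throughput [1]
batching_efficiency = 0.60    # assumed fraction of peak sustained in production
gpu_cost_per_hr = 3.80        # USD/hour per H200 (assumed on-demand rate, [2])
price_per_m_tok = 3.50        # Together AI's price for 405B

effective_tok_s = peak_tok_s * batching_efficiency    # 240 tokens/sec
cost_per_m_tok = gpu_cost_per_hr / (effective_tok_s * 3600 / 1e6)

print(f"effective cost ~ ${cost_per_m_tok:.2f} per 1M tokens")
# ~ $4.40 per 1M tokens -- above the $3.50/M price, i.e. served at a loss
```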
I am not selling a workshop. It is free to attend, and I went through the normal CFP process to get my talk accepted. I have been talking about unsustainable AI costs for a while now, since before I had anything to sell.
There is a certain amount of lock-in with models: prompts and evaluations end up tuned to a specific model's behavior, so shifting from one model to another is rarely just a one-line code change.
[1] https://developer.nvidia.com/blog/supercharging-llama-3-1-across-nvidia-platforms
[2] https://docs.jarvislabs.ai/blog/h200-price