Managing Costs of Your LLM Application — Part 1 of 2

I interviewed the CTO of a legal-tech company that recently had a very successful exit. Founded in 2013, they had been working on using AI to make lawyers more productive and were way ahead of the market. When GenAI exploded, they were in the catbird seat. He said GPT-4 literally transformed his business, yet he was paying "well over $100k a month" to OpenAI. For the sake of the conversation, let's say the number was $200k a month. That comes out to $2.4M a year. That's a lot of cash for a 120-person company!
You don't have to be a big-time player to get upside down. For example, if you charge ahead building a chatbot on GPT-4 without employing any cost management techniques, you can quickly see per-user costs grow to $8, $15, even $20 per visit and find your business model upside down!
For any of us building AI applications with large language models, cost management is an essential skill. As you dig into the topic, you're going to see that LLM cost management is a discipline in and of itself!
Here is my take on the topic, along with some of my favorite supporting articles and a few startups working on various areas of the problem.
I see cost reduction in these four areas:
- Understand where your costs are coming from
- Use the right model
- Reduce tokens
- Don’t use an LLM where you don’t have to!
Understand Where Your Costs Are Coming From
Cost Analytics and Monitoring
As your application grows, you're going to need a tool to keep an eye on where your costs are coming from. As part of your overall monitoring strategy, it should also catch incidents you'll want to jump on right away, such as application errors or overactive users abusing your product.
Here are a few startups working in this area: (Note: There are numerous startups doing interesting things in the monitoring and observability area. These three made it easy to find cost analytics features in their materials.)
- Monitor and monetize your AI applications. Powerful, low-latency proxy to monitor, track, and utilize your OpenAI usage.
https://www.aporia.com/ai-guardrails/cost-tracking/
- Cost Tracking. Track every penny, query, and token to control your LLM costs and streamline budget planning.
- One platform, everything you need. Everything you need to build, deploy, and scale your application.
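Before reaching for a vendor, it helps to see how simple the core bookkeeping is. Here is a minimal sketch of per-user cost tracking; the model names and per-token prices are hypothetical placeholders, so substitute the current rates from your provider's pricing page.

```python
# Minimal per-user LLM cost tracker. Prices below are hypothetical
# placeholders (USD per 1K tokens, as input/output pairs), not real rates.
from collections import defaultdict

PRICES = {
    "premium-model": (0.03, 0.06),
    "budget-model": (0.0005, 0.0015),
}

usage_by_user = defaultdict(float)  # user_id -> accumulated USD

def record_call(user_id, model, input_tokens, output_tokens):
    """Accumulate the cost of one LLM call against a user and return it."""
    in_price, out_price = PRICES[model]
    cost = (input_tokens / 1000) * in_price + (output_tokens / 1000) * out_price
    usage_by_user[user_id] += cost
    return cost

record_call("alice", "premium-model", 1200, 400)
record_call("alice", "premium-model", 900, 300)
print(f"alice: ${usage_by_user['alice']:.4f}")
```

A tracker like this, fed from your API middleware, is also where you would hang alerts for the "overactive user" incidents mentioned above.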
Price Calculators
Price calculators are handy tools that seek to keep up-to-date pricing and allow you to quickly calculate costs based on tokens, words, and/or characters used in your context windows.
Here are two tools you can play with:
https://yourgpt.ai/tools/openai-and-other-llm-api-pricing-calculator
- OpenAI & other LLM API Pricing Calculator. Calculate the cost of using OpenAI and other Large Language Models (LLMs) APIs.
https://docsbot.ai/tools/gpt-openai-api-pricing-calculator
- Calculate and compare the cost of using OpenAI, Azure, Anthropic Claude, Llama 2, Google Gemini, Mistral, and Cohere LLM APIs for your AI project with our simple and powerful free calculator. Latest numbers as of March 2024.
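The math behind these calculators is straightforward, and it is worth running your own back-of-the-envelope numbers. The sketch below compares monthly cost across two models for a given traffic profile; the model names and per-token prices are made up for illustration.

```python
# Back-of-the-envelope monthly cost comparison. Prices are hypothetical
# (USD per 1K tokens, input/output); use your provider's current rates.
PRICES_PER_1K = {
    "premium-model": (0.01, 0.03),
    "budget-model": (0.0005, 0.0015),
}

def monthly_cost(model, calls_per_day, in_tokens, out_tokens, days=30):
    """Project the monthly spend for one model at a given traffic level."""
    in_price, out_price = PRICES_PER_1K[model]
    per_call = (in_tokens / 1000) * in_price + (out_tokens / 1000) * out_price
    return per_call * calls_per_day * days

# 10,000 calls/day, 1,500 input tokens and 500 output tokens per call.
for model in PRICES_PER_1K:
    print(f"{model}: ${monthly_cost(model, 10_000, 1500, 500):,.2f}/month")
```

Even with made-up prices, the exercise makes the stakes concrete: at this traffic level the two tiers differ by an order of magnitude, which is exactly the gap the techniques below go after.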
Use the Right Model
Execution Level Optimization
This is about picking the least expensive model for a given task. A simple approach is to test the various executions across your application against lower-cost models. Using less powerful, lower-priced models for the right executions can cut the cost of those executions by 35% or more. This one is pretty obvious.
A helpful tool and a startup:
https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard
- The Hugging Face leaderboard is a good place to quickly find candidate models, with a set of metrics to help rank them across different benchmarks and measurement techniques. As of this writing, the leaderboard covers 73 models.
- With the rapid release of new pre-trained Large Language Models, selecting the right one for specific applications can be challenging, as available benchmarks don’t tell you how well the model is going to work for your application. Airtrain enables the design of tailored evaluation procedures and metrics for your application.
Cascading / Compounding Models
Cascading (or compounding) models differ from execution-level optimization in that the task is broken into a sequence of models, where each stage refines the output or filters inputs, balancing accuracy against cost.
https://www.ai-jason.com/learning-ai/how-to-reduce-llm-cost
A startup working on this use case:
- Roli provides Generative AI Gateway services for compound AI solutions, such as: LLM abstraction, routing and chaining, OAuth/RBAC-enabled API gateway and automated client SDK generation.
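The simplest cascade is two stages: try the cheap model first and escalate to the expensive one only when the first answer looks weak. The sketch below shows the shape of the idea; the two model functions are stand-ins for real API calls, and the confidence score is an assumed signal (in practice you might use log-probabilities, a verifier model, or a heuristic).

```python
# Two-stage model cascade sketch. cheap_model and expensive_model are
# stand-ins for real API calls; confidence is an assumed quality signal.

def cheap_model(prompt):
    """Stand-in for a low-cost model. Returns (answer, confidence in [0, 1])."""
    return ("draft answer", 0.4 if "tricky" in prompt else 0.9)

def expensive_model(prompt):
    """Stand-in for a high-cost, high-accuracy model."""
    return ("careful answer", 0.99)

def cascade(prompt, threshold=0.7):
    """Answer with the cheap model; escalate when confidence is below threshold."""
    answer, confidence = cheap_model(prompt)
    if confidence >= threshold:
        return answer, "cheap"
    answer, _ = expensive_model(prompt)
    return answer, "expensive"

print(cascade("simple question"))  # handled by the cheap model
print(cascade("tricky question"))  # escalated to the expensive model
```

The economics follow from the escalation rate: if the cheap model confidently handles most of your traffic, you pay premium rates only on the hard residual.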
Model Routing
Model Routing directs tasks to the most suitable model based on the nature of the task, optimizing both efficiency and accuracy.
Two startups working on this use case:
- We invented the first LLM router. By dynamically routing between multiple models, Martian can beat GPT-4 on performance, reduce costs by 20%-97%, and simplify the process of using AI.
- Outperform any single LLM with a model garden. GPT-4 performance at a fraction of the cost. Dynamically routes prompts to the best-suited LLM, maximizing performance while optimizing costs.
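At its core, a router is a classifier in front of a dispatch table. The toy version below hand-writes the rules to show the shape of the idea; products like the ones above learn this mapping instead. The task categories, keywords, and model tier names here are all invented for illustration.

```python
# Toy model router: classify the request, then dispatch to a model tier.
# Categories, keywords, and tier names are illustrative inventions; a
# production router would learn this mapping rather than hard-code it.

ROUTES = {
    "summarize": "small-model",
    "classify": "small-model",
    "reason": "large-model",
    "code": "large-model",
}

def classify_task(prompt):
    """Naive keyword heuristic standing in for a learned task classifier."""
    p = prompt.lower()
    if "summarize" in p:
        return "summarize"
    if "write a function" in p or "code" in p:
        return "code"
    if "yes or no" in p:
        return "classify"
    return "reason"

def route(prompt):
    """Return the model tier best suited to this prompt."""
    return ROUTES[classify_task(prompt)]

print(route("Summarize this contract"))        # cheap tier is enough
print(route("Write a function to parse CSV"))  # needs the strong tier
```

The design choice worth noticing: routing decides *before* spending tokens, while the cascade above decides *after* a cheap attempt. Many production systems combine both.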
Fine-tuning Pre-trained Models
Fine-tuning is about providing additional training on top of pre-trained models with domain-specific data to enhance task-specific performance. I’m not going to tackle fine-tuning here as it’s a big topic. I will say that I am seeing a resurgence in the topic now that so many builders have maxed out the results they are getting from prompting techniques and are looking for better performance and lower costs.
Here is an inspiring case study:
https://jellypod.ai/blog/how-I-reduced-llm-costs
- I dropped costs by 85% fine-tuning Mistral — March 7, 2024.
Open Source
"We're going to go open source eventually." Have you said that yourself? Going open source is likely more expensive than you think. Think hard about the business opportunity cost of your team's time to research, decide, set up, operate, and support an open-source solution, and build that into your analysis. We are also going to see price pressure on the commercial APIs, both from direct competitors and from open source, that will change the current math.
https://medium.com/emalpha/the-economics-of-large-language-models-2671985b621c
Related Articles
https://lajavaness.medium.com/llm-large-language-model-cost-analysis-d5022bb43e9e
- Open-source vs. commercial deep dive.
https://medium.com/emalpha/the-economics-of-large-language-models-2671985b621c
- The Economics of Large Language Models. A deep dive into considerations for using and hosting large language models
https://www.theinformation.com/articles/metas-free-ai-isnt-cheap-to-use-companies-say
- Meta’s free AI isn’t cheap to use (Requires subscription). ;-(
Next Article
In the next article, I will cover:
- Token Reduction
- Don’t use an LLM where you don’t have to!
Here is a link to the second article in this two-part series.
Conclusion
Managing the costs of LLM applications is not just about reducing expenses; it's about optimizing both performance and budget. By understanding where costs originate, selecting the appropriate models, minimizing token usage (Part 2 coming soon), and judiciously deciding when to use an LLM at all, developers can significantly reduce their financial outlay without compromising the quality of their applications.
References
https://www.youtube.com/watch?v=lHxl5SchjPA&t=1178s
Jason's video gives real examples of cost-cutting techniques you can deploy right now. A fabulous watch and contribution to the topic. Thanks, "AI Jason"!
https://www.linkedin.com/pulse/managing-cost-ai-paul-walsh-9w3of/
A solid, comprehensive overview that expands on foundational prompt engineering techniques to optimize cost and efficiency, covering model selection, prompt and query efficiency, caching, and a cost-effective LLM cascade approach.
https://arxiv.org/pdf/2305.05176.pdf
FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance. Lingjiao Chen, Matei Zaharia, James Zou, Stanford University.