How We Cut Our AI Bill by 60% (Real Numbers)
Our monthly LLM bill: $2,340
Three months later: $890
That's not a typo. We cut our AI costs by 62% without reducing functionality.
Here's exactly how we did it: real numbers, actual tools, and strategies you can implement today.
Strategy 1: Semantic Caching (35% Savings)
The biggest win was the simplest concept: don't pay to generate the same response twice.
We were calling GPT-4 for user queries that were 90% similar. "How do I reset my password?" shouldn't cost $0.03 every single time.
What We Built
We implemented semantic caching using embeddings:
- Store every LLM response with its embedding vector
- For new queries, find similar past queries (cosine similarity > 0.92)
- If match found, return cached response instead of calling API
- Cache invalidates after 24 hours for dynamic content
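The four steps above can be sketched in a few dozen lines. This is a minimal illustration, not our production code: `embed()` is a toy bag-of-words stand-in for a real embedding call (e.g. OpenAI's embeddings API), and the in-memory list stands in for Redis.

```python
import math
import time

def embed(text):
    # Toy embedding: bag-of-words over a tiny vocabulary, so the
    # example runs without an API key. Swap in real embeddings.
    vocab = "how do i reset my password change email billing".split()
    words = text.lower().replace("?", "").split()
    return [float(words.count(w)) for w in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold=0.92, ttl_seconds=24 * 3600):
        self.threshold = threshold        # similarity cutoff for a "hit"
        self.ttl = ttl_seconds            # 24h invalidation window
        self.entries = []                 # (embedding, response, stored_at)

    def get(self, query):
        now = time.time()
        # Drop expired entries, then look for the closest past query.
        self.entries = [e for e in self.entries if now - e[2] < self.ttl]
        qv = embed(query)
        best = max(self.entries, key=lambda e: cosine(qv, e[0]), default=None)
        if best and cosine(qv, best[0]) >= self.threshold:
            return best[1]                # cache hit: no API call
        return None                       # cache miss: call the LLM, then put()

    def put(self, query, response):
        self.entries.append((embed(query), response, time.time()))
```

With Redis you would get the TTL expiry for free (`SETEX`), and a vector index (or GPTCache itself) replaces the linear scan once the cache grows.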
💰 The Numbers
Cache hit rate: ~40% of queries
Monthly savings: 35% ($819/month)
Tool we used: Custom implementation with Redis + OpenAI embeddings. You can also use GPTCache (open source) as a drop-in replacement.
Strategy 2: Model Routing (16% Savings)
Not every task needs GPT-4. Not even close.
We built a simple classifier that routes queries to the right model:
- GPT-4: Complex reasoning, coding, analysis (10% of traffic)
- GPT-3.5: Simple Q&A, summaries, formatting (75% of traffic)
- Local models: Embeddings, classification tasks (15% of traffic)
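In sketch form, the router is just a function from (query, task) to a model name, plus a fallback chain. The keyword heuristic and model names below are illustrative stand-ins for a real classifier:

```python
# Illustrative routing rules; a trained classifier replaces the keywords.
COMPLEX_HINTS = ("analyze", "debug", "refactor", "prove", "architecture")
LOCAL_TASKS = ("embed", "classify")
FALLBACKS = {"gpt-4": "gpt-3.5-turbo", "gpt-3.5-turbo": None}

def route(query, task="chat"):
    if task in LOCAL_TASKS:
        return "local-model"      # embeddings, classification: stay off the API
    if any(hint in query.lower() for hint in COMPLEX_HINTS):
        return "gpt-4"            # complex reasoning, coding, analysis
    return "gpt-3.5-turbo"        # simple Q&A, summaries, formatting

def call_with_fallback(model, call):
    # Walk down the fallback chain if a model errors or times out.
    while model:
        try:
            return call(model)
        except Exception:
            model = FALLBACKS.get(model)
    raise RuntimeError("all models failed")
```

The key design point is that routing happens before the request, so the 75% of traffic that never needed GPT-4 never pays GPT-4 prices.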
💰 The Numbers
Downgraded queries: 75% of total volume
Monthly savings: 16% ($374/month)
Tool we used: Llmswap, which handles model routing with fallback logic.
Strategy 3: Prompt Compression (9% Savings)
We were sending way too much context. Every prompt included full conversation history, system instructions, and examples.
Our compression strategy:
- Summarize long conversations: After 10 turns, summarize the thread and replace older messages
- Dynamic examples: Instead of 5 fixed examples, retrieve only the most relevant 2 based on embedding similarity
- Trim unnecessary context: Strip timestamps, metadata, and formatting that doesn't affect the output
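The first two rules can be sketched as two small helpers. `summarize()` is a stand-in for a cheap LLM summarization call, and `similarity()` is a crude word-overlap stand-in for embedding similarity:

```python
def summarize(messages):
    # Stand-in for a cheap LLM call that condenses old turns.
    return "Summary of %d earlier messages." % len(messages)

def similarity(a, b):
    # Stand-in for embedding similarity: Jaccard overlap of words.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def compress_history(messages, max_turns=10):
    # After 10 turns, collapse everything older into one summary line.
    if len(messages) <= max_turns:
        return messages
    head, tail = messages[:-max_turns], messages[-max_turns:]
    return [summarize(head)] + tail

def pick_examples(query, examples, k=2):
    # Keep only the k few-shot examples most relevant to this query.
    return sorted(examples, key=lambda e: similarity(query, e), reverse=True)[:k]
```

The savings compound: a shorter history and fewer examples shrink every single request, not just the occasional one.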
💰 The Numbers
Average prompt reduction: 45% fewer tokens
Monthly savings: 9% ($211/month)
Tool we used: InferShrink, which does prompt compression with semantic awareness.
The Bonus: Monitoring
Here's the thing: we didn't know we were overspending until we measured it.
Set up per-endpoint, per-user tracking of:
- Tokens in/out
- Model usage breakdown
- Cache hit rates
- Cost per user session
We built a simple dashboard that shows daily costs and sends alerts if we exceed thresholds. This alone caught $200/month in forgotten test calls to GPT-4.
The Tools We Use
- GPTCache – Semantic caching layer
- Llmswap – Model routing and fallbacks
- InferShrink – Prompt compression
- Grafana – Cost monitoring dashboard
- Redis – Cache storage
What's Next
We're not done optimizing. Next on our list:
- Fine-tuned models: For repetitive tasks, a fine-tuned 3.5 model might replace GPT-4 entirely
- Batched processing: Queue non-urgent requests and process in batches for lower per-token costs
- Local LLMs: Running smaller models (7B-13B parameters) on our own infrastructure for specific tasks
What's your biggest AI cost optimization win? I'd love to hear your strategies.
#costoptimization #llm #ai #saas #bootstrapped #buildinpublic