How We Cut Our AI Bill by 60% (Real Numbers)
Our monthly LLM bill: $2,340
Three months later: $890
That's not a typo. We cut our AI costs by 62% without reducing functionality.
Here's exactly how we did it: real numbers, actual tools, and strategies you can implement today.
Strategy 1: Semantic Caching (35% Savings)
The biggest win was the simplest concept: don't pay to generate the same response twice.
We were calling GPT-4 for user queries that were 90% similar. "How do I reset my password?" shouldn't cost $0.03 every single time.
What We Built
We implemented semantic caching using embeddings:
- Store every LLM response with its embedding vector
- For new queries, find similar past queries (cosine similarity > 0.92)
- If match found, return cached response instead of calling API
- Cache invalidates after 24 hours for dynamic content
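The four steps above can be sketched in a few dozen lines. This is a minimal illustration, not our production code: `embed()` is a toy bag-of-words stand-in for a real embedding call (e.g. OpenAI's embeddings API), and the in-memory list stands in for Redis.

```python
import math
import time

def embed(text):
    # Toy embedding: bag-of-words over a tiny vocabulary, so the
    # example runs without an API key. Swap in real embeddings.
    vocab = "how do i reset my password change email billing".split()
    words = text.lower().replace("?", "").split()
    return [float(words.count(w)) for w in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold=0.92, ttl_seconds=24 * 3600):
        self.threshold = threshold        # similarity cutoff for a "hit"
        self.ttl = ttl_seconds            # 24h invalidation window
        self.entries = []                 # (embedding, response, stored_at)

    def get(self, query):
        now = time.time()
        # Drop expired entries, then look for the closest past query.
        self.entries = [e for e in self.entries if now - e[2] < self.ttl]
        qv = embed(query)
        best = max(self.entries, key=lambda e: cosine(qv, e[0]), default=None)
        if best and cosine(qv, best[0]) >= self.threshold:
            return best[1]                # cache hit: no API call
        return None                       # cache miss: call the LLM, then put()

    def put(self, query, response):
        self.entries.append((embed(query), response, time.time()))
```

With Redis you would get the TTL expiry for free (`SETEX`), and a vector index (or GPTCache itself) replaces the linear scan once the cache grows.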
💰 The Numbers
Cache hit rate: ~40% of queries
Monthly savings: 35% ($819/month)
Tool we used: Custom implementation with Redis + OpenAI embeddings. You can also use GPTCache (open source) as a drop-in replacement.
Strategy 2: Model Routing (16% Savings)
Not every task needs GPT-4. Not even close.
We built a simple classifier that routes queries to the right model:
- GPT-4: Complex reasoning, coding, analysis (10% of traffic)
- GPT-3.5: Simple Q&A, summaries, formatting (75% of traffic)
- Local models: Embeddings, classification tasks (15% of traffic)
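In sketch form, the router is just a function from (query, task) to a model name, plus a fallback chain. The keyword heuristic and model names below are illustrative stand-ins for a real classifier:

```python
# Illustrative routing rules; a trained classifier replaces the keywords.
COMPLEX_HINTS = ("analyze", "debug", "refactor", "prove", "architecture")
LOCAL_TASKS = ("embed", "classify")
FALLBACKS = {"gpt-4": "gpt-3.5-turbo", "gpt-3.5-turbo": None}

def route(query, task="chat"):
    if task in LOCAL_TASKS:
        return "local-model"      # embeddings, classification: stay off the API
    if any(hint in query.lower() for hint in COMPLEX_HINTS):
        return "gpt-4"            # complex reasoning, coding, analysis
    return "gpt-3.5-turbo"        # simple Q&A, summaries, formatting

def call_with_fallback(model, call):
    # Walk down the fallback chain if a model errors or times out.
    while model:
        try:
            return call(model)
        except Exception:
            model = FALLBACKS.get(model)
    raise RuntimeError("all models failed")
```

The key design point is that routing happens before the request, so the 75% of traffic that never needed GPT-4 never pays GPT-4 prices.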
💰 The Numbers
Downgraded queries: 75% of total volume
Monthly savings: 16% ($374/month)
Tool we used: Llmswap, which handles model routing with fallback logic.
Strategy 3: Prompt Compression (9% Savings)
We were sending way too much context. Every prompt included full conversation history, system instructions, and examples.
Our compression strategy:
- Summarize long conversations: After 10 turns, summarize the thread and replace older messages
- Dynamic examples: Instead of 5 fixed examples, retrieve only the most relevant 2 based on embedding similarity
- Trim unnecessary context: Strip timestamps, metadata, and formatting that doesn't affect the output
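The first two rules can be sketched as two small helpers. `summarize()` is a stand-in for a cheap LLM summarization call, and `similarity()` is a crude word-overlap stand-in for embedding similarity:

```python
def summarize(messages):
    # Stand-in for a cheap LLM call that condenses old turns.
    return "Summary of %d earlier messages." % len(messages)

def similarity(a, b):
    # Stand-in for embedding similarity: Jaccard overlap of words.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def compress_history(messages, max_turns=10):
    # After 10 turns, collapse everything older into one summary line.
    if len(messages) <= max_turns:
        return messages
    head, tail = messages[:-max_turns], messages[-max_turns:]
    return [summarize(head)] + tail

def pick_examples(query, examples, k=2):
    # Keep only the k few-shot examples most relevant to this query.
    return sorted(examples, key=lambda e: similarity(query, e), reverse=True)[:k]
```

The savings compound: a shorter history and fewer examples shrink every single request, not just the occasional one.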
💰 The Numbers
Average prompt reduction: 45% fewer tokens
Monthly savings: 9% ($211/month)
Tool we used: InferShrink, which does prompt compression with semantic awareness.
The Bonus: Monitoring
Here's the thing: we didn't know we were overspending until we measured it.
Set up per-endpoint, per-user tracking of:
- Tokens in/out
- Model usage breakdown
- Cache hit rates
- Cost per user session
We built a simple dashboard that shows daily costs and sends alerts if we exceed thresholds. This alone caught $200/month in forgotten test calls to GPT-4.
The Tools We Use
- GPTCache – Semantic caching layer
- Llmswap – Model routing and fallbacks
- InferShrink – Prompt compression
- Grafana – Cost monitoring dashboard
- Redis – Cache storage
What's Next
We're not done optimizing. Next on our list:
- Fine-tuned models: For repetitive tasks, a fine-tuned 3.5 model might replace GPT-4 entirely
- Batched processing: Queue non-urgent requests and process in batches for lower per-token costs
- Local LLMs: Running smaller models (7B-13B parameters) on our own infrastructure for specific tasks
What's your biggest AI cost optimization win? I'd love to hear your strategies.
#costoptimization #llm #ai #saas #bootstrapped #buildinpublic