Pricing Guide

Last updated: March 2026 · Pricing verified March 2026

AI Token Pricing Comparison 2026: Claude vs GPT vs Gemini — Complete Cost Guide

In 2026, AI model pricing varies by over 100x between the cheapest and most expensive options. A single agent session can cost anywhere from $0.02 to $45 depending on which model you choose and how efficiently your prompts are written. This guide breaks down every major model's pricing, compares real-world costs per task, and shows how token optimization shifts the economics in your favor.

Table of Contents

  1. Understanding AI Token Pricing
  2. Complete Pricing Table (All Major Models)
  3. Cost Per Task Breakdown
  4. Hidden Costs Most Developers Miss
  5. How Token Optimization Changes the Math
  6. Which Model Should You Use?
  7. How Terse Helps Regardless of Model

Understanding AI Token Pricing

Before comparing specific models, it is worth understanding how AI token pricing actually works. Every commercial LLM charges based on tokens: small units of text that each correspond to roughly three-quarters of a word in English. Exact counts depend on the provider's tokenizer, but as a rough guide the word "understanding" is about two tokens, the phrase "Please help me write a function that" is about nine, and a typical paragraph of natural language runs 60 to 100 tokens.
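The three-quarters-of-a-word rule can be turned into a quick estimator. The sketch below is a rough heuristic in Python, not a tokenizer; `estimate_tokens` and `WORDS_PER_TOKEN` are illustrative names, and real counts will differ from provider to provider.

```python
WORDS_PER_TOKEN = 0.75  # rule of thumb from the text; real tokenizers vary


def estimate_tokens(text: str) -> int:
    """Rough token estimate based on a whitespace word count."""
    words = len(text.split())
    return round(words / WORDS_PER_TOKEN)


# 7 words / 0.75 words per token, rounded
print(estimate_tokens("Please help me write a function that"))  # 9
```

For billing-grade numbers, use the token counts the provider reports in each API response rather than an estimate like this one.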

Pricing is quoted per million tokens (MTok), and there are three distinct price points to understand for each model:

Input Tokens (Prompt Pricing)

Input tokens are what you send to the model: your instructions, context, conversation history, system prompts, and any documents or code you include. This is the price you pay for the model to read your prompt. Input pricing is almost always cheaper than output pricing because reading is computationally less expensive than generating. For most workflows, input tokens vastly outnumber output tokens — a typical prompt might be 2,000 tokens with a 500-token response. In agent sessions with tools like Claude Code or Cursor, the ratio skews even further: the model reads thousands of lines of code context for every short tool invocation it generates.

Output Tokens (Completion Pricing)

Output tokens are what the model generates in response to your prompt. This includes the actual answer, code it writes, reasoning it produces, and any tool call arguments. Output tokens cost more (4x to 5x the input price for every model in this guide) because generation requires sequential computation. Each output token depends on all previous tokens, making generation inherently more expensive than the parallel processing used for inputs. For coding tasks, output costs can dominate the bill because the model generates verbose code blocks, explanations, and multi-step tool call sequences.

Cache Read Pricing

Cache pricing is the newest and most impactful dimension. When you send the same prefix (system prompt, conversation history, or document context) across multiple requests, providers can cache the KV (key-value) representations from the first request and reuse them on subsequent ones. Cached input tokens are substantially cheaper: in this guide's pricing table, cache reads cost 50% less than standard input on the OpenAI models, 75% less on the Gemini models, and 90% less on the Claude models. Anthropic, OpenAI, and Google all now offer some form of prompt caching, though the implementations differ. For agent sessions where the same context is resent on every turn, cache pricing is what actually determines your cost. A 200K-token agent session where 90% of each turn's input is cached costs a fraction of what the sticker input price would suggest.
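To see how a cached prefix changes a bill, the sketch below blends the three price points into one cost. It assumes the Claude Sonnet 4 rates quoted in this guide's pricing table ($3.00 input, $15.00 output, $0.30 cache read per MTok) and ignores any surcharge a provider may apply when a cache entry is first written. `Price` and `request_cost` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """USD per million tokens."""
    input: float
    output: float
    cache_read: float


def request_cost(price: Price, input_tokens: int, output_tokens: int,
                 cached_fraction: float = 0.0) -> float:
    """Cost in USD when `cached_fraction` of the input is served from cache."""
    cached = input_tokens * cached_fraction
    fresh = input_tokens - cached
    total = (fresh * price.input
             + cached * price.cache_read
             + output_tokens * price.output)
    return total / 1_000_000


sonnet = Price(input=3.00, output=15.00, cache_read=0.30)
no_cache = request_cost(sonnet, 200_000, 0)
with_cache = request_cost(sonnet, 200_000, 0, cached_fraction=0.9)
print(f"uncached: ${no_cache:.3f}  90% cached: ${with_cache:.3f}")
```

Under these assumptions, 200K input tokens drop from $0.60 to about $0.11 when 90% of them are cache reads.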

Complete Pricing Table: All Major Models (March 2026)

The following table lists current per-million-token pricing for every major model available through commercial APIs. Prices are in USD.

Model | Input / MTok | Output / MTok | Cache Read / MTok | Context Window
Anthropic (Claude)
Claude Opus 4 | $15.00 | $75.00 | $1.50 | 200K
Claude Sonnet 4 | $3.00 | $15.00 | $0.30 | 200K
Claude Haiku 3.5 | $0.80 | $4.00 | $0.08 | 200K
OpenAI
GPT-4o | $2.50 | $10.00 | $1.25 | 128K
GPT-4o-mini | $0.15 | $0.60 | $0.075 | 128K
o1 | $15.00 | $60.00 | $7.50 | 200K
Google (Gemini)
Gemini 2.0 Flash | $0.10 | $0.40 | $0.025 | 1M
Gemini 1.5 Pro | $1.25 | $5.00 | $0.3125 | 2M

The spread is enormous. Sending one million input tokens costs $0.10 on Gemini 2.0 Flash and $15.00 on Claude Opus 4 or o1 — a 150x difference. Output pricing shows a similar range: $0.40 per MTok on Gemini Flash versus $75.00 on Claude Opus 4, a 187x spread. These are not marginal differences. Choosing the wrong model for a task, or sending unnecessarily verbose prompts to an expensive model, can turn a $0.50 workflow into a $50 one.

Cache pricing narrows the gap for multi-turn workflows but does not eliminate it. Claude Opus 4's cached input at $1.50/MTok is still 60x more expensive than Gemini Flash's cached rate of $0.025/MTok. The economics of model choice compound across every turn of an agent session.

Cost Per Task Breakdown

Raw per-million-token prices are hard to reason about. What matters is what a real task actually costs. Below are three common scenarios with concrete costs for each model, assuming typical input-to-output ratios.

Simple Q&A

~500 input / ~200 output tokens

Opus 4: $0.0225
Sonnet 4: $0.0045
Haiku 3.5: $0.0012
GPT-4o: $0.0033
4o-mini: $0.0002
o1: $0.0195
Gemini Flash: $0.0001
Gemini Pro: $0.0016

Code Review

~5K input / ~1K output tokens

Opus 4: $0.150
Sonnet 4: $0.030
Haiku 3.5: $0.008
GPT-4o: $0.023
4o-mini: $0.001
o1: $0.135
Gemini Flash: $0.001
Gemini Pro: $0.011

Agent Session

~200K input / ~50K output (40 turns)

Opus 4: $6.75
Sonnet 4: $1.35
Haiku 3.5: $0.36
GPT-4o: $1.00
4o-mini: $0.060
o1: $6.00
Gemini Flash: $0.040
Gemini Pro: $0.50
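The per-task figures above follow directly from the pricing table: input tokens times the input rate plus output tokens times the output rate, with no caching applied. A minimal sketch that recomputes them is shown here; `PRICES`, `SCENARIOS`, and `task_cost` are illustrative names, and the guide's figures are rounded versions of what this prints.

```python
# (input, output) price in USD per million tokens, from the pricing table
PRICES = {
    "Opus 4": (15.00, 75.00),
    "Sonnet 4": (3.00, 15.00),
    "Haiku 3.5": (0.80, 4.00),
    "GPT-4o": (2.50, 10.00),
    "4o-mini": (0.15, 0.60),
    "o1": (15.00, 60.00),
    "Gemini Flash": (0.10, 0.40),
    "Gemini Pro": (1.25, 5.00),
}

# (input tokens, output tokens) per scenario
SCENARIOS = {
    "Simple Q&A": (500, 200),
    "Code Review": (5_000, 1_000),
    "Agent Session": (200_000, 50_000),
}


def task_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Uncached cost in USD for one task on the given model."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


for name, (tokens_in, tokens_out) in SCENARIOS.items():
    print(name)
    for model in PRICES:
        print(f"  {model:<13} ${task_cost(model, tokens_in, tokens_out):.5f}")
```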

The agent session figures are where the numbers get real. A developer running 10 Claude Opus 4 agent sessions per day is spending $67.50 daily, or roughly $1,350 per month at 20 working days, on API tokens alone. The same sessions on Sonnet 4 cost $270 per month. On Gemini 2.0 Flash, the monthly bill drops to $8. These are not hypothetical projections: they reflect real usage patterns from developers working with Claude Code, Cursor, and similar agentic tools throughout 2025 and into 2026.
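The monthly figures are a straight multiplication. The sketch below reproduces them under the assumption of 20 working days per month, which is the value that makes the guide's numbers line up; `monthly_cost` is an illustrative helper.

```python
WORKING_DAYS_PER_MONTH = 20  # assumption: implied by the guide's monthly totals


def monthly_cost(cost_per_session: float, sessions_per_day: int = 10,
                 days: int = WORKING_DAYS_PER_MONTH) -> float:
    """Monthly API spend in USD for a fixed number of daily sessions."""
    return cost_per_session * sessions_per_day * days


# Per-session costs taken from the agent session figures in this guide
for model, session_cost in {"Opus 4": 6.75, "Sonnet 4": 1.35,
                            "Gemini 2.0 Flash": 0.04}.items():
    print(f"{model}: ${monthly_cost(session_cost):,.2f} per month")
```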

The cost gap between a simple question and an agent session illustrates why per-token efficiency matters more and more as context grows. When your prompts are short, model choice barely matters: even Opus 4 costs about $0.02 for a quick question. But in a 40-turn coding session that bills 200,000 input tokens in total, anything you add to the context early on is re-sent on every later turn, so a wasted token in turn 1 can be billed up to 40 times, and the price difference between models becomes a difference between a coffee and a car payment.

Hidden Costs Most Developers Miss

The pricing table tells you what each token costs. What it does not tell you is how many tokens you are wasting. In practice, most developers spend 30-50% more on tokens than they need to, and the waste comes from sources that are not obvious.

Context Window Waste

Agent frameworks send the full conversation history on every turn. By turn 20, the model is re-reading 19 previous turns of context it has already processed. If your early prompts were verbose, that verbosity is now being billed on every subsequent request. A 100-word prompt in turn 1 that could have been 60 words costs you those extra 40 words multiplied by every remaining turn in the session. On a 40-turn Opus 4 session, those 40 words (about 53 tokens) add roughly 2,100 billed input tokens, or about $0.03 at the uncached rate. The overhead reaches dollars when the redundant material runs to thousands of tokens, such as a pasted log or an unneeded file, because 5,000 extra tokens in turn 1 re-sent over 40 turns cost $3.00.
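The re-sending effect can be sized with two small functions. This is a simplified model that assumes every turn adds the same number of tokens and nothing is cached or truncated; the function names are illustrative.

```python
def cumulative_input_tokens(tokens_per_turn: int, turns: int) -> int:
    """Total input tokens billed when each turn re-sends all earlier turns."""
    # Turn t carries t turns' worth of context: 1x, 2x, ..., turns x
    return sum(tokens_per_turn * t for t in range(1, turns + 1))


def resend_overhead(extra_tokens: int, turns: int,
                    input_price_per_mtok: float) -> float:
    """USD cost of `extra_tokens` added in turn 1 and re-sent on every turn."""
    return extra_tokens * turns * input_price_per_mtok / 1_000_000


# About 250 new tokens per turn over 40 turns lands near the 200K-token scenario
print(cumulative_input_tokens(250, 40))  # 205000

# 40 surplus words (about 53 tokens) in turn 1, Opus 4 input at $15/MTok
print(f"${resend_overhead(53, 40, 15.00):.4f}")
```

Billed input grows with the square of the turn count in this model, which is why trimming early turns matters more than trimming late ones.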

Duplicate API Calls

Agent sessions frequently make redundant API calls. The model reads the same file twice, runs the same search again, or calls a tool with identical arguments because it lost track of what it already retrieved. Each duplicate call sends the full context window plus the tool result back through the model. Terse's agent monitor routinely detects 5-15% of tool calls in a session as duplicates, and each one costs input tokens for the full context re-read plus output tokens for the repeated response.
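One way to spot duplicate calls is to fingerprint each tool invocation by its name and arguments and count repeats. The sketch below is a generic illustration, not a description of how Terse's monitor works; the tool names and the `duplicate_rate` helper are hypothetical.

```python
import hashlib
import json
from collections import Counter


def call_fingerprint(tool: str, args: dict) -> str:
    """Stable hash of a tool call; key order in `args` does not matter."""
    payload = json.dumps({"tool": tool, "args": args}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def duplicate_rate(calls: list[tuple[str, dict]]) -> float:
    """Fraction of calls that repeat an earlier call with identical arguments."""
    if not calls:
        return 0.0
    counts = Counter(call_fingerprint(tool, args) for tool, args in calls)
    duplicates = sum(n - 1 for n in counts.values())
    return duplicates / len(calls)


session = [
    ("read_file", {"path": "a.py"}),
    ("search", {"query": "foo"}),
    ("read_file", {"path": "a.py"}),  # repeat of the first call
    ("read_file", {"path": "b.py"}),
]
print(duplicate_rate(session))  # 0.25
```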

Verbose Prompts

Human-written prompts are full of patterns that cost tokens without adding information. Phrases like "I would really appreciate it if you could please" carry the same instruction as "Please", which itself carries the same instruction as nothing at all, since LLMs do not require politeness to produce good output. Research from LLMLingua (EMNLP 2023) quantified this: typical prompts contain 40-70% redundant tokens. Hedging ("I think maybe"), filler words ("basically", "actually", "just"), meta-language ("what I'm trying to do is"), and excessive qualification all consume tokens while adding little or nothing the model can use.
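A toy version of filler removal can be written with a few regular expressions. The sketch below only covers the example phrases from this section and is not Terse's pipeline; the pattern list and `strip_filler` are illustrative, and a real tool would need to guard against removing words that carry meaning in context.

```python
import re

# Patterns taken from the examples in the text; deliberately incomplete
FILLER_PATTERNS = [
    r"\bI would really appreciate it if you could\b",
    r"\bwhat I'm trying to do is\b",
    r"\bI think maybe\b",
    r"\b(?:basically|actually|just|please)\b",
]
_FILLER = re.compile("|".join(FILLER_PATTERNS), re.IGNORECASE)


def strip_filler(prompt: str) -> str:
    """Remove known filler phrases, then collapse leftover whitespace."""
    stripped = _FILLER.sub("", prompt)
    return re.sub(r"\s+", " ", stripped).strip()


before = ("I would really appreciate it if you could please "
          "just refactor this function")
print(strip_filler(before))  # refactor this function
```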

Retry Loops from Typos

A misspelled variable name, an incorrect function signature, or a garbled instruction can cause the model to produce incorrect output, leading to a follow-up correction turn that re-sends the entire context window. On Opus 4, a single retry caused by a typo in a 100K-token context costs $1.50 in input tokens alone. The typo itself might have been two characters — but the cost of the retry is the entire context window being re-processed. Spell correction at the point of entry, before the prompt is sent, eliminates this entire category of waste.

Markdown and Formatting Overhead

Many prompts include markdown headers, bullet-point formatting, bold markers, and other formatting characters that are meaningful to humans reading the prompt but irrelevant to the model processing it. A prompt peppered with **bold**, ### headers, and - bullet points can spend 5-10% of its token budget on formatting characters. The model produces equally good output without them.
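Formatting overhead can be estimated by stripping common markers and comparing lengths. The sketch below measures characters rather than tokens, so it is only a proxy, and it handles just the three markers named above (bold, headers, bullets); `strip_markdown` and `formatting_overhead` are illustrative names.

```python
import re

# Bold markers anywhere; header and bullet markers only at the start of a line
_MARKERS = re.compile(r"\*\*|^#{1,6}[ \t]+|^[ \t]*[-*][ \t]+", re.MULTILINE)


def strip_markdown(prompt: str) -> str:
    """Remove bold markers, header prefixes, and bullet prefixes."""
    return _MARKERS.sub("", prompt)


def formatting_overhead(prompt: str) -> float:
    """Share of characters spent on the markers removed by strip_markdown."""
    if not prompt:
        return 0.0
    return 1 - len(strip_markdown(prompt)) / len(prompt)


sample = "### Task\n- **Fix** the bug\n- Add tests\n"
print(repr(strip_markdown(sample)))
print(f"{formatting_overhead(sample):.1%}")
```

A short, heavily formatted prompt like the sample shows a much higher share than the 5-10% quoted above; longer prompts with more body text dilute it.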

How Token Optimization Changes the Math

The hidden costs described above are not fixed. They can be reduced systematically through token optimization — compressing prompts before they are sent to the model, removing redundancy while preserving the instruction that the model actually needs.

Terse applies a 7-stage optimization pipeline that runs on-device in under 5 milliseconds. The effect on costs is significant and compounds across agent sessions. Here is what the numbers look like before and after optimization for each task type, using Terse in Normal mode (25-40% reduction). The table applies a flat 30% reduction to the total cost of each task:

Scenario | Model | Before Terse | After Terse (30% reduction) | Saved
Simple Q&A (500 tokens) | Opus 4 | $0.0225 | $0.0158 | $0.007
Simple Q&A (500 tokens) | GPT-4o | $0.0033 | $0.0023 | $0.001
Code review (5K tokens) | Opus 4 | $0.150 | $0.105 | $0.045
Code review (5K tokens) | Sonnet 4 | $0.030 | $0.021 | $0.009
Agent session (200K tokens) | Opus 4 | $6.75 | $4.73 | $2.02
Agent session (200K tokens) | Sonnet 4 | $1.35 | $0.95 | $0.40
Agent session (200K tokens) | GPT-4o | $1.00 | $0.70 | $0.30
Agent session (200K tokens) | o1 | $6.00 | $4.20 | $1.80

On a per-request basis, the savings on a simple question look modest. But the compounding effect in agent sessions is where the numbers become meaningful. A developer running 10 Opus 4 agent sessions per day saves $20.20 daily, or $404 per month at 20 working days. For a team of five developers, that is $2,020 per month, enough to cover additional tooling or infrastructure, or simply to cut the AI budget line item by roughly 30%.
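The savings arithmetic can be checked with a short sketch. It assumes, as the table does, that the reduction applies to the whole cost of a task, and it assumes 20 working days per month. The helper names are illustrative. Because the guide rounds the per-session saving to $2.02 before multiplying, its monthly totals come out a dollar or so below the unrounded values computed here.

```python
def after_optimization(cost_before: float, reduction: float = 0.30) -> float:
    """Cost in USD after a flat fractional reduction."""
    return cost_before * (1 - reduction)


def monthly_saving(cost_before: float, reduction: float = 0.30,
                   sessions_per_day: int = 10, days_per_month: int = 20,
                   developers: int = 1) -> float:
    """Monthly USD saved across all sessions and developers."""
    saved_per_session = cost_before - after_optimization(cost_before, reduction)
    return saved_per_session * sessions_per_day * days_per_month * developers


opus_session = 6.75  # Opus 4 agent session cost from this guide
print(f"after: ${after_optimization(opus_session):.3f}")
print(f"one developer: ${monthly_saving(opus_session):,.2f} per month")
print(f"team of five: ${monthly_saving(opus_session, developers=5):,.2f} per month")
```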

The savings compound further because token optimization reduces context window growth. When your prompts in turns 1 through 10 are 30% shorter, the cumulative context re-sent in turns 11 through 40 is also shorter. The actual savings in a 40-turn session are typically 5-10% higher than the linear 30% reduction would suggest, because the compression cascades through the conversation history.

In Aggressive mode (40-60% reduction), the savings rise to as much as double. A developer who is comfortable with telegraph-style prompts (shorter, more direct, stripped of all filler) can cut their Opus 4 agent session costs from $6.75 to between $4.05 and $2.70, assuming the reduction applies to the whole bill. At the top of that range, the cost difference between Opus 4 and Sonnet 4 narrows enough that choosing the more capable model becomes justifiable for tasks where it makes a qualitative difference.

Which Model Should You Use?

Model choice depends on two factors: what you are doing and what you are willing to spend. Here is a decision framework based on task type and budget sensitivity.

Quick Questions & Chat

Simple queries, brainstorming, explanations, and conversational tasks. Low token counts, minimal context needed.

Best value: GPT-4o-mini ($0.0002/query) or Gemini 2.0 Flash ($0.0001/query)

Code Generation

Writing functions, classes, and modules from specifications. Moderate context, high output. Quality matters for correctness.

Best balance: Claude Sonnet 4 ($0.030/review) or GPT-4o ($0.023/review)

Complex Reasoning

Multi-step logic, mathematical proofs, architecture decisions, research analysis. Accuracy is paramount.

Best quality: Claude Opus 4 ($0.15/task) or o1 ($0.135/task)

Agent Sessions (Coding)

Extended multi-turn sessions with tool use: Claude Code, Cursor, Aider. High token volume, context grows per turn.

Best balance: Claude Sonnet 4 ($1.35/session). Budget: Haiku 3.5 ($0.36/session)

Batch Processing

Classifying documents, extracting data, summarizing at scale. Thousands of requests, cost per unit matters most.

Best value: Gemini 2.0 Flash ($0.10/MTok) or GPT-4o-mini ($0.15/MTok)

Long Document Analysis

Processing large PDFs, codebases, or research papers. Needs large context window and strong comprehension.

Best fit: Gemini 1.5 Pro (2M context, $1.25/MTok) or Claude Sonnet 4 (200K, $3/MTok)

The general principle is straightforward: use the cheapest model that produces acceptable output for your task. For most coding workflows, Claude Sonnet 4 or GPT-4o offer the best balance of capability and cost. Reserve Opus 4 and o1 for tasks where their superior reasoning produces measurably better results — complex architecture decisions, subtle bug analysis, or multi-file refactors where a cheaper model would require multiple retry attempts (which themselves cost tokens).

For high-volume work like data processing, classification, and extraction, Gemini 2.0 Flash and GPT-4o-mini are in a class of their own. At $0.10 and $0.15 per million input tokens respectively, they make tasks economical that would be prohibitively expensive on frontier models. The quality gap is real — these models will not match Opus 4 on complex reasoning — but for structured, well-defined tasks, they perform comparably at a fraction of the cost.

One important consideration: the cheapest model is not always the cheapest choice. If a budget model needs extra attempts to get a correct answer while a more expensive model gets it right the first time, the budget model may actually cost more. Each retry re-sends the full context window. On a 100K-token agent session, each attempt on Sonnet 4 costs $0.30 in input tokens, against $1.50 for a single Opus 4 pass over the same context. Opus 4's $1.20 input premium is offset only once it spares you four Sonnet 4 retries; between closer-priced models, such as GPT-4o at $2.50/MTok and Sonnet 4 at $3.00/MTok, one avoided retry is enough. The true cost metric is not price per token but price per successful outcome.
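Price per successful outcome can be estimated as the cost of one attempt divided by the chance that an attempt succeeds. The sketch below uses input prices from this guide's pricing table; the success rates are made-up inputs for illustration, and attempts are assumed to be independent.

```python
def input_cost(context_tokens: int, price_per_mtok: float) -> float:
    """USD input cost of sending the context once."""
    return context_tokens * price_per_mtok / 1_000_000


def cost_per_success(cost_per_attempt: float, success_rate: float) -> float:
    """Expected USD spent per correct result (mean of 1/success_rate attempts)."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_attempt / success_rate


context = 100_000
sonnet = input_cost(context, 3.00)   # $0.30 per attempt
opus = input_cost(context, 15.00)    # $1.50 per attempt

# Hypothetical success rates, chosen only to show the comparison
print(f"Sonnet 4 at 50%:  ${cost_per_success(sonnet, 0.5):.2f}")
print(f"Sonnet 4 at 20%:  ${cost_per_success(sonnet, 0.2):.2f}")
print(f"Opus 4 at 100%:   ${cost_per_success(opus, 1.0):.2f}")
```

With these inputs, Sonnet 4 stays cheaper per success until its success rate falls to one in five, the point at which it matches a single Opus 4 pass.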

How Terse Helps Regardless of Model

Token optimization is model-agnostic. The redundancy in your prompts exists whether you are sending them to Opus 4, GPT-4o-mini, or Gemini Flash. Terse works across all of them because it operates at the prompt level, before the text reaches any API.

Universal Compression

Terse's optimization pipeline processes plain text. It does not depend on any model's tokenizer, API format, or pricing structure. Whether your prompt goes to Anthropic, OpenAI, or Google, the compression is applied identically. Filler words are filler words regardless of the destination model. Hedging phrases cost tokens on every provider. The reduction Terse achieves in Normal mode translates to a comparable cut in billed prompt tokens on any model, at any price point, with small variations from each provider's tokenizer.

Savings Scale with Price

The dollar value of optimization is proportional to the model's price. A 30% reduction on Gemini Flash saves $0.03 per million input tokens. The same 30% reduction on Opus 4 saves $4.50 per million input tokens, a 150x larger dollar saving from the same optimization. This means Terse pays for itself fastest on the most expensive models, which are also the models where developers are least willing to compromise on quality. Instead of downgrading from Opus 4 to a cheaper model, you can keep using the best model and reduce its effective cost by compressing your prompts.

Agent Session Monitoring

Beyond prompt compression, Terse's agent monitoring capabilities provide real-time visibility into where your tokens are going. The monitoring dashboard tracks input tokens, output tokens, cache hits, tool calls, and cumulative cost for every agent session. It flags duplicate tool calls, detects redundant file reads, and alerts you when context fill exceeds 85%. These insights help you optimize not just your prompts, but your entire workflow — identifying patterns where the agent is wasting tokens so you can adjust your approach or update your CLAUDE.md rules to prevent the waste from recurring.
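The kind of bookkeeping described here can be sketched as a small per-session tracker. This is a generic illustration with hypothetical names, not Terse's implementation; it tracks tokens and cost per turn and flags a turn whose input exceeds 85% of the context window.

```python
from dataclasses import dataclass


@dataclass
class SessionMonitor:
    context_window: int          # tokens, e.g. 200_000
    input_price: float           # USD per million input tokens
    output_price: float          # USD per million output tokens
    alert_threshold: float = 0.85
    input_tokens: int = 0
    output_tokens: int = 0
    cost: float = 0.0

    def record_turn(self, input_tokens: int, output_tokens: int) -> bool:
        """Add one turn to the totals; return True if context fill is too high."""
        self.input_tokens += input_tokens
        self.output_tokens += output_tokens
        self.cost += (input_tokens * self.input_price
                      + output_tokens * self.output_price) / 1_000_000
        return input_tokens / self.context_window > self.alert_threshold


monitor = SessionMonitor(context_window=200_000, input_price=3.00,
                         output_price=15.00)
print(monitor.record_turn(100_000, 1_000))  # False: context is 50% full
print(monitor.record_turn(180_000, 1_000))  # True: context is 90% full
print(f"${monitor.cost:.2f}")
```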

Compounding Across Sessions

Most developers do not run one agent session per day. They run five, ten, or twenty. The savings from token optimization scale linearly across sessions and faster than linearly within each session (because earlier turns are re-sent with every later one). A developer who optimizes their prompts with Terse across 10 daily Sonnet 4 agent sessions saves approximately $4.00 per day, $80 per month, or $960 per year. On Opus 4, the same usage pattern saves about $20 per day, $400 per month, or $4,800 per year. For teams, multiply accordingly.

The arithmetic is simple but the implication is significant: token optimization is not a nice-to-have for cost-conscious developers. At current 2026 pricing levels, it is a direct input to your AI budget that compounds with every session you run.

Stop Overpaying for Tokens

Terse compresses your prompts by 30-60% before they reach any API. On-device, under 5ms, works with Claude, GPT, Gemini, and every other model. See your savings in real time with the agent monitoring dashboard.

