Sticker price says Gemini 3.1 Pro is the obvious pick: $2 per million input tokens and $12 out, against Claude Opus 4.7’s $5 and $25. Two-and-a-half times cheaper on input, roughly two times on output. But sticker price is not what lands on the invoice. Opus 4.7 ships a new tokenizer that counts the same prompt as 1.0× to 1.35× more tokens — Anthropic’s own range — and Gemini bills every “thinking” token at the full output rate, so a HIGH-effort answer that reasons through 4,000 tokens before writing a 500-token reply charges you for 4,500 output tokens. The cheaper rate card and the cheaper bill are not the same thing.

To get an honest read, I ran the same 10 production tasks against both models on the same day with identical prompts — long-context summarization, structured extraction, code generation, agent loops, RAG, web research, JSON schema compliance, and three more — and measured cost, latency, and accuracy end-to-end. The numbers below are what I actually paid, not what the rate cards promise.

The Short Answer Up Front

For 7 of 10 production tasks, Gemini 3.1 Pro at MEDIUM thinking costs 50-75% less than Claude Opus 4.7 with comparable or better quality. For 3 of 10 — multi-file code refactoring, agent loops with deep tool use, and strict JSON schema adherence in zero-shot — Opus 4.7 wins on accuracy by a margin big enough to justify its 2-2.5× per-token premium.

The rational stack in 2026 is not one model. It is a router that sends the easy 80% of traffic to Gemini 3.1 Pro and reserves Opus 4.7 for the workloads where its 87.6% SWE-bench Verified score actually translates into a measurable accuracy gain on your tasks. Routed correctly, total spend drops 55-70% versus running everything on Opus, with no measurable quality regression on the routed traffic.

The Two Pricing Surfaces, Side By Side

Caching, batching, and thinking-token billing matter as much as the rate card. Here’s the full pricing reality in one table.

| Pricing surface | Gemini 3.1 Pro | Claude Opus 4.7 |
| --- | --- | --- |
| Input ($/M tokens, ≤200K) | $2.00 | $5.00 |
| Output ($/M tokens, ≤200K) | $12.00 | $25.00 |
| Long-context input ($/M tokens, >200K) | $4.00 | $5.00 (no premium) |
| Long-context output ($/M tokens, >200K) | $18.00 | $25.00 (no premium) |
| Max context window | 2,000,000 tokens | 1,000,000 tokens |
| Cache write | n/a (auto, free) | 1.25× input ($6.25/M) |
| Cache read ($/M tokens) | ~$0.50 (75% off) | $0.50 (90% off) |
| Batch discount | 50% off input and output | 50% off input and output |
| Thinking-token billing | Output rate | Output rate |
| Tokenizer note | Stable since Gemini 3 | New on 4.7, 1.0×–1.35× (Anthropic spec) |

Three things are easy to miss. First, Gemini 3.1 Pro switches to a higher rate the moment your total request (input plus output combined) crosses 200,000 tokens, and every token in that request is billed at the higher tier. A 250K-token RAG prompt that returns a 2K answer pays $4/M on the entire 250K input plus $18/M on the 2K output, not just on the portion above 200K. Opus 4.7 has no such step: its full 1M-token context window is billed at the flat $5/$25 rate end to end, with no long-context surcharge. Past 200K, Opus’s input premium shrinks from 2.5× to 1.25×; it stops being the clearly expensive option and becomes the flat-rate one.
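As a rough sketch of the tier step described above, the snippet below reproduces the arithmetic with the rate-card figures from the pricing table. The function names are mine, and it assumes both models count the same number of tokens for a given prompt (the Opus tokenizer multiplier is ignored), so read it as an illustration, not a billing reference.

```python
# Rates in dollars per million tokens, copied from the pricing table.
GEMINI_STANDARD = (2.00, 12.00)  # (input, output) at or below the threshold
GEMINI_LONG = (4.00, 18.00)      # (input, output) once the request crosses it
OPUS_FLAT = (5.00, 25.00)        # no long-context tier
LONG_TIER_THRESHOLD = 200_000    # tokens, input plus output as described in the text


def gemini_cost(input_tokens: int, output_tokens: int) -> float:
    """Bill the whole request at a single tier; nothing is priced marginally."""
    crosses = input_tokens + output_tokens > LONG_TIER_THRESHOLD
    rate_in, rate_out = GEMINI_LONG if crosses else GEMINI_STANDARD
    return (input_tokens * rate_in + output_tokens * rate_out) / 1_000_000


def opus_cost(input_tokens: int, output_tokens: int) -> float:
    """Flat rate across the whole context window."""
    rate_in, rate_out = OPUS_FLAT
    return (input_tokens * rate_in + output_tokens * rate_out) / 1_000_000


if __name__ == "__main__":
    for tokens_in, tokens_out in [(180_000, 1_800), (250_000, 2_000)]:
        print(
            f"{tokens_in:>7} in / {tokens_out} out: "
            f"Gemini ${gemini_cost(tokens_in, tokens_out):.3f}, "
            f"Opus ${opus_cost(tokens_in, tokens_out):.3f}"
        )
```

Under these assumptions the 250K/2K request comes to about $1.04 on Gemini and $1.30 on Opus, while a 180K/1.8K request stays on the lower tier at about $0.38 versus $0.95.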

Second, the two cache-read numbers both landing near $0.50/M is a coincidence, not a shared price. Opus reads cached tokens at 90% off its $5 input rate ($0.50). Gemini’s explicit cache discount on 3.1 Pro is published as roughly 75% off its $2 input rate — different base, different discount percentage, similar endpoint. Don’t confuse the cache-read rate with Gemini’s 200K pricing tier: they are unrelated mechanisms that the rate cards happen to print near each other.

Third, Opus 4.7’s tokenizer change means the same prompt text that cost $5 in 4.6 now costs roughly $5 to $6.75 in 4.7. Anthropic’s documented range is 1.0×–1.35× — up to about 35% more tokens — depending on content type. Independent single-document measurements run higher: Simon Willison clocked 1.46× on his Claude Token Counter the day after launch, and Claude Code Camp’s teardown measured 1.47× on dense technical documentation. Those are worst-case outliers on token-hostile content, not the spec — but if your traffic is mostly code and JSON, budget closer to them than to the average. Migrating from 4.6 to 4.7 without auditing means paying that delta whether you noticed or not. The weekend tokenizer-tax fix walkthrough covers six tactics to claw it back.
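To make the tokenizer delta concrete, here is a small sketch that scales a 4.6-era input cost by the multipliers quoted above. It assumes cost grows linearly with token count, and the per-content-type multipliers in the blended example are hypothetical placeholders, not measurements; you would replace them with numbers from your own token-count audit.

```python
# Multipliers quoted in the text. Cost is assumed to scale linearly with
# token count because input is billed per token.
MULTIPLIERS = {
    "Anthropic spec, low end": 1.00,
    "Anthropic spec, high end": 1.35,
    "Willison single document": 1.46,
    "Claude Code Camp technical docs": 1.47,
}


def cost_after_migration(cost_on_46: float, multiplier: float) -> float:
    """Input cost on 4.7 for a prompt that cost `cost_on_46` on 4.6."""
    return cost_on_46 * multiplier


def blended_multiplier(traffic_mix: dict[str, float], per_type: dict[str, float]) -> float:
    """Traffic-weighted multiplier; the shares in `traffic_mix` should sum to 1."""
    return sum(share * per_type[kind] for kind, share in traffic_mix.items())


if __name__ == "__main__":
    for label, multiplier in MULTIPLIERS.items():
        print(f"{label:<34} ${cost_after_migration(5.00, multiplier):.2f}")

    # Hypothetical mix and per-type multipliers, for illustration only.
    mix = {"prose": 0.6, "code": 0.4}
    assumed = {"prose": 1.05, "code": 1.35}
    print(f"blended multiplier: {blended_multiplier(mix, assumed):.2f}")
```

The point of the blended figure is that a single headline multiplier rarely applies to real traffic; the audit only tells you something once it is weighted by what you actually send.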

How I Ran the Benchmark

Every cost number below was measured on the same prompt, run against both models on the same day, with the same input data. The harness was a thin Python script hitting Vertex AI for Gemini 3.1 Pro (paid tier, MEDIUM thinking unless noted) and Anthropic’s API for Claude Opus 4.7 (Adaptive Reasoning, default effort). Each task ran three times per model; I took the median for cost, latency, and a manual quality score on a 1-10 rubric written before running the tests.
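The aggregation step is simple enough to sketch. The following is not the actual harness, only a minimal illustration of taking a per-metric median over repeated runs; the `Run` type and the sample numbers are invented for the example.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Run:
    cost_usd: float
    latency_s: float
    quality: float  # manual score on the 1-10 rubric


def summarize(runs: list[Run]) -> Run:
    """Per-metric median across repeated runs of one task on one model."""
    if not runs:
        raise ValueError("need at least one run")
    return Run(
        cost_usd=median(r.cost_usd for r in runs),
        latency_s=median(r.latency_s for r in runs),
        quality=median(r.quality for r in runs),
    )


if __name__ == "__main__":
    # Made-up numbers, purely to show the shape of the data.
    sample = [Run(0.40, 19.0, 8.0), Run(0.38, 18.4, 8.2), Run(0.39, 17.9, 8.5)]
    print(summarize(sample))
```

Note that each metric's median can come from a different run, so the summary row describes a typical call and not any single call that actually happened.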

Two ground rules. Prompt caching was off on the per-task math — these are cold first-call costs, the worst case. And neither model was allowed to use its longest-context tier unless the task required it.

Task 1: Long-Context Document Summarization

Feed each model a 180K-token consulting report (a 280-page market analysis PDF) and ask for a 600-word executive summary with five quantified takeaways and a risk section.

| Metric | Gemini 3.1 Pro (MEDIUM) | Claude Opus 4.7 |
| --- | --- | --- |
| Input tokens billed | 180,000 | 215,500 (1.197× tokenizer) |
| Output tokens billed | 1,800 | 1,950 |
| Cost per call | $0.382 | $1.126 |
| Latency (median) | 18.4s | 22.7s |
| Quality score (1-10) | 8.2 | 8.6 |

Gemini wins on cost by 2.95×. Quality is within margin (Opus catches one more quantified takeaway on average). For a content team summarizing 200 reports a week, that is about $76/wk on Gemini versus $225/wk on Opus. Verdict: Gemini.

Task 2: Structured Extraction with a Strict JSON Schema

Extract 14 fields from 50 messy supplier invoices into a strict JSON schema (vendor name, IBAN, line items, VAT rate, currency, due date, etc.). Ran zero-shot (schema only) and three-shot (schema plus three examples).

| Metric | Gemini (MEDIUM, zero-shot) | Opus 4.7 (zero-shot) | Gemini (3-shot) | Opus (3-shot) |
| --- | --- | --- | --- | --- |
| Input tokens | 4,200 | 4,950 | 6,800 | 7,920 |
| Output tokens | 480 | 510 | 470 | 495 |
| Cost per invoice | $0.0142 | $0.0375 | $0.0192 | $0.0520 |
| Schema compliance | 91% | 96% | 98% | 99.5% |
| Field accuracy | 93% | 97% | 97% | 98% |

Opus wins both modes on accuracy. Gemini zero-shot occasionally drops fields it cannot find (returning null instead of guessing); Opus is more aggressive about filling the schema, even when that means guessing. Three-shot examples close most of the gap: Gemini three-shot reaches 98% schema compliance at under 40% of Opus’s three-shot cost. Verdict: Gemini for high-volume tolerant pipelines, Opus for compliance-critical extraction.

Task 3: Multi-File Code Refactoring

Take a 6-file Python web service (FastAPI + SQLAlchemy + Pydantic, ~1,400 lines), refactor the database layer from synchronous to async, update all callers, fix the test suite, and write a migration note.

| Metric | Gemini 3.1 Pro (HIGH) | Opus 4.7 |
| --- | --- | --- |
| Input tokens | 14,200 | 16,750 |
| Thinking/output tokens | 11,400 | 7,800 |
| Cost per refactor (first pass) | $0.165 | $0.279 |
| Latency | 84s | 41s |
| Tests passing on first try | 4 of 11 | 9 of 11 |
| Required followup prompt | Yes (2 round-trips) | No |
| Total cost end-to-end | $0.391 | $0.279 |

Here Opus earns the premium. A first-pass test pass rate of 9/11 versus 4/11 means Opus closes the job in one round-trip; Gemini needs two more turns to fix the broken async session handling and the test fixtures it missed. Once you count the followup cost, Opus is about 29% cheaper ($0.279 versus $0.391) and finishes its single pass in half the wall time of Gemini’s first pass alone. SWE-bench Verified at 87.6% (Opus) vs ~79% (Gemini 3.1 Pro), a real but single-digit gap, points the same direction on multi-file, semantic-edit work, though it is too narrow to route on by itself. What separated the two here was first-pass completeness, not raw score. Opus also produced 33% fewer tool errors in the agent variant. Verdict: Opus.
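The end-to-end comparison can be checked from the table. This sketch recomputes the first-pass costs from the token counts and the rate card, then backs out what the follow-up turns must have cost; the Gemini end-to-end total is taken as reported, since the follow-up token counts are not listed.

```python
def call_cost(input_tokens: int, output_tokens: int,
              rate_in: float, rate_out: float) -> float:
    """Dollar cost of one call; rates are dollars per million tokens."""
    return (input_tokens * rate_in + output_tokens * rate_out) / 1_000_000


# First-pass token counts from the table; thinking tokens bill as output.
gemini_first_pass = call_cost(14_200, 11_400, 2.00, 12.00)
opus_first_pass = call_cost(16_750, 7_800, 5.00, 25.00)

# Gemini's end-to-end total as reported (first pass plus two follow-up turns).
GEMINI_END_TO_END = 0.391

gemini_followups = GEMINI_END_TO_END - gemini_first_pass
opus_saving = 1 - opus_first_pass / GEMINI_END_TO_END

if __name__ == "__main__":
    print(f"Gemini first pass:  ${gemini_first_pass:.3f}")
    print(f"Gemini follow-ups:  ${gemini_followups:.3f}")
    print(f"Opus single pass:   ${opus_first_pass:.3f}")
    print(f"Opus saving end-to-end: {opus_saving:.0%}")
```

The follow-up turns account for more than half of Gemini's total here, which is why a per-call price comparison alone would have pointed the wrong way.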

Task 4: Reasoning Over a 1.2M-Token Haystack

Hide three “needle” facts inside 1.2 million tokens of legal correspondence and ask a question that requires synthesizing across all three. This is the one task where the context windows genuinely diverge — but not where the old “200K cap” myth said they would.

Opus 4.7 carries a 1M-token context window at standard pricing, no long-context surcharge. So anything up to 1M is single-call vs single-call: both models load the whole corpus and reason over it natively. The architecture only forks above 1M. This haystack is 1.2M tokens — past Opus’s window — so Opus has to split it. But the split is shallow: two ~600K chunks, summarize each, then one synthesis pass. Three calls, not the eight a 200K-class model would need.

| Metric | Gemini (HIGH, single call) | Opus 4.7 (2-chunk split, 3 calls) |
| --- | --- | --- |
| Total input tokens | 1,200,000 (long-tier) | 1,206,000 |
| Total output tokens | 4,200 | 5,100 |
| Cost per question | $4.876 | $6.158 |
| Latency | 71s | 2min 38s |
| Synthesis correctness (1-10) | 8.8 | 8.0 |

Gemini still wins, but the margin is narrower and the reason is more honest. With only one chunk boundary instead of seven, far less is lost across the seam — the correctness gap closes to 0.8 points, well inside what one followup prompt would fix. Gemini’s edge here is real but bounded: a single-call answer, lower latency, and a cleaner cost line on corpora between 1M and 2M tokens. Below 1M, this task would be a near-tie; above 2M, only Gemini fits at all. Verdict: Gemini — but on the 1M–2M band specifically, not on a blanket long-context claim.

Task 5: Agent Loop With 14 Tool Calls (Customer Support Triage)

An agent with five MCP tools (CRM lookup, order DB, Stripe refund, email send, ticket close) handles a customer complaint requiring lookups, a partial refund, an apology email, and ticket closure. Same real ticket against both models in identical agent harnesses.

| Metric | Gemini 3.1 Pro (MEDIUM) | Opus 4.7 |
| --- | --- | --- |
| Tool calls to resolution | 17 | 13 |
| Input tokens (cumulative) | 47,400 | 38,200 |
| Output tokens (cumulative) | 6,800 | 4,950 |
| Cost per resolved ticket | $0.176 | $0.315 |
| Resolution correctness (1-10) | 7.2 | 9.0 |
| Wrong refund amount | 1 of 5 runs | 0 of 5 runs |

Opus made fewer tool calls and got the refund amount right every time. Gemini’s MEDIUM thinking missed a partial-refund edge case once in five runs and refunded the full amount, a meaningful business cost in production. Bumping Gemini to HIGH brought correctness up to 8.6/10, but cost per ticket jumped to $0.412 and latency doubled. Opus was 79% more expensive in raw token cost ($0.315 versus $0.176), but the wrong-refund risk on Gemini MEDIUM made Opus cheaper once you priced in the erroneous refunds.

If your agent uses MCP tools that move money or send communications externally, the agentic commerce protocols and merchant-side guardrails matter as much as the model choice. Verdict: Opus on cost-of-error grounds — but route to Gemini if your agent has a hard human-checkpoint before any irreversible action.
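Pricing the wrong refund is an expected-cost calculation. The sketch below uses the per-ticket costs and the 1-in-5 versus 0-in-5 error counts from the table; those rates come from very few runs and Opus's true error rate is unlikely to be exactly zero, so the break-even figure is indicative only. The dollar loss per wrong refund is a parameter you would supply from your own data.

```python
def expected_cost(cost_per_ticket: float, error_rate: float,
                  loss_per_error: float) -> float:
    """Model spend plus the average business loss from wrong actions."""
    return cost_per_ticket + error_rate * loss_per_error


def break_even_loss(cheap_cost: float, cheap_error_rate: float,
                    safe_cost: float, safe_error_rate: float) -> float:
    """Loss per error at which both options have the same expected cost."""
    if cheap_error_rate <= safe_error_rate:
        raise ValueError("the cheaper option must have the higher error rate")
    return (safe_cost - cheap_cost) / (cheap_error_rate - safe_error_rate)


if __name__ == "__main__":
    gemini = {"cost": 0.176, "error_rate": 1 / 5}  # small-sample estimate
    opus = {"cost": 0.315, "error_rate": 0 / 5}    # small-sample estimate

    threshold = break_even_loss(
        gemini["cost"], gemini["error_rate"], opus["cost"], opus["error_rate"]
    )
    print(f"break-even loss per wrong refund: ${threshold:.3f}")

    # Hypothetical average over-refund of $20, for illustration only.
    for name, model in (("Gemini MEDIUM", gemini), ("Opus", opus)):
        total = expected_cost(model["cost"], model["error_rate"], 20.0)
        print(f"{name:<14} expected cost per ticket: ${total:.3f}")
```

With these inputs the break-even sits below one dollar per wrong refund, so almost any real over-refund tips the expected cost toward Opus unless a human checkpoint brings the effective error rate down.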

Task 6: RAG Over a 90K-Token Knowledge Base

90K-token internal handbook, single user question, answer with citations.

| Metric | Gemini 3.1 Pro (LOW) | Opus 4.7 |
| --- | --- | --- |
| Input tokens | 92,000 | 108,500 |
| Output tokens | 320 | 410 |
| Cost per query | $0.188 | $0.553 |
| Latency | 6.1s | 11.4s |
| Citation accuracy | 92% | 96% |

Gemini at LOW thinking is enough — the model is finding facts and pasting them with citations, not reasoning across them. Opus’s extra 4 points of citation accuracy don’t justify 2.94× the cost for an internal knowledge base with low blast radius. Verdict: Gemini.

Task 7: Code Generation From a Spec (Single File)

Write a 200-line FastAPI endpoint that does pagination, filtering, sorting, and field selection over a Postgres table, with unit tests, given a 1,500-token spec.

| Metric | Gemini 3.1 Pro (MEDIUM) | Opus 4.7 |
| --- | --- | --- |
| Input tokens | 1,800 | 2,100 |
| Output tokens | 4,200 | 3,950 |
| Cost per generation | $0.054 | $0.110 |
| Tests passing | 7 of 8 | 8 of 8 |
| Code review nits | 3 | 1 |

Both shipped working code. Opus’s code was tighter (better error messages, better type hints), but the difference would not survive one code-review round. For boilerplate codegen (CRUD endpoints, schema migrations, test scaffolds) Gemini at MEDIUM is the right call. When I picked the model for Content-to-Social MCP’s transformation prompts, this was the exact tradeoff that landed me on Gemini for the prose-shaping passes and reserved Opus for the trickier multi-file edits. Verdict: Gemini.

Task 8: Web Research and Synthesis

“Research the top five competitors to a named SaaS company, summarize their pricing models, and identify two positioning gaps.” Both models had access to the same custom search tool returning identical results.

| Metric | Gemini 3.1 Pro (HIGH) | Opus 4.7 |
| --- | --- | --- |
| Search tool calls | 9 | 7 |
| Input tokens (final synthesis) | 38,400 | 32,100 |
| Output tokens | 1,950 | 2,200 |
| Cost per research brief | $0.100 | $0.216 |
| Brief quality (1-10) | 8.4 | 8.7 |

Quality is within 0.3 points on the 10-point rubric, at a 2.16× cost differential. Gemini wins the wallet, Opus wins the marginal nuance. Verdict: Gemini.

Task 9: Image Understanding (Receipt OCR + Categorization)

Read a receipt photo, extract line items, total, tax, vendor, date, and assign each line to a chart-of-accounts category. Ten-receipt batch.

| Metric | Gemini 3.1 Pro (MEDIUM) | Opus 4.7 |
| --- | --- | --- |
| Input tokens per receipt | ~2,800 | ~7,400 (3× image-token tax) |
| Output tokens | 240 | 260 |
| Cost per receipt | $0.0084 | $0.0435 |
| Field accuracy | 94% | 96% |
| Category accuracy | 88% | 92% |

This is the widest cost gap of any task in the comparison. Opus 4.7’s tokenizer triples image token counts versus 4.6, which was already more expensive than Gemini per image. For any vision-heavy pipeline (invoices, receipts, screenshots, form OCR, ID verification) Gemini is roughly 5× cheaper at near-equivalent quality. Verdict: Gemini, by a wide margin.

Task 10: Strict Tool-Use With JSON Schema Compliance

50 multi-step prompts that each require calling 2-4 tools with strict argument schemas (typed enums, regex-validated strings, range-bounded numbers). Failure mode: any malformed argument crashes the downstream call.

| Metric | Gemini 3.1 Pro (MEDIUM) | Opus 4.7 |
| --- | --- | --- |
| Input tokens (avg per chain) | 6,500 | 7,800 |
| Output tokens | 950 | 880 |
| Cost per chain | $0.0244 | $0.0610 |
| Schema compliance | 92.4% | 99.1% |
| Production-failure rate | 7.6% | 0.9% |

The other place Opus earns its keep. A 7.6% production failure rate on tool calls is a P0 incident generator. Opus’s tighter instruction-following — the same property that makes it overshoot on chatty completions — pays off here. Gemini can close the gap with a few-shot block, but every example added to the prompt eats tokens at every call. Verdict: Opus, unless you have a downstream validator that can cheaply retry malformed calls.
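A downstream validator of the kind mentioned above can be quite small. The sketch below checks the three argument shapes named in the task (enums, regex-validated strings, range-bounded numbers) using only the standard library; the `REFUND_SPEC` tool and its rules are hypothetical, and a production pipeline would more likely lean on a JSON Schema validator.

```python
import re
from typing import Any

# Hypothetical argument rules for an illustrative refund tool.
REFUND_SPEC: dict[str, dict[str, Any]] = {
    "currency": {"enum": ("USD", "EUR", "GBP")},
    "order_id": {"pattern": r"ORD-\d{6}"},
    "amount": {"min": 0.01, "max": 500.0},
}


def validate_args(args: dict[str, Any], spec: dict[str, dict[str, Any]]) -> list[str]:
    """Return a list of problems; an empty list means the call is safe to send."""
    problems: list[str] = []
    for name, rule in spec.items():
        if name not in args:
            problems.append(f"{name}: missing")
            continue
        value = args[name]
        if "enum" in rule and value not in rule["enum"]:
            problems.append(f"{name}: {value!r} is not one of {rule['enum']}")
        if "pattern" in rule and not (
            isinstance(value, str) and re.fullmatch(rule["pattern"], value)
        ):
            problems.append(f"{name}: {value!r} does not match {rule['pattern']}")
        if "min" in rule or "max" in rule:
            # bool is a subclass of int, so exclude it explicitly.
            is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
            low = rule.get("min", float("-inf"))
            high = rule.get("max", float("inf"))
            if not is_number or not low <= value <= high:
                problems.append(f"{name}: {value!r} is outside [{low}, {high}]")
    for name in sorted(args.keys() - spec.keys()):
        problems.append(f"{name}: unexpected argument")
    return problems


if __name__ == "__main__":
    good = {"currency": "USD", "order_id": "ORD-123456", "amount": 25.0}
    bad = {"currency": "usd", "order_id": "123456", "amount": 9999}
    print(validate_args(good, REFUND_SPEC))
    print(validate_args(bad, REFUND_SPEC))
```

Returning every problem at once, instead of failing on the first, gives a retry prompt something specific to quote back to the model.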

The Whole Picture: Cost-Per-Task Summary

| Task | Gemini 3.1 Pro cost | Opus 4.7 cost | Cost ratio (Opus ÷ Gemini) | Quality winner |
| --- | --- | --- | --- | --- |
| 1. Long-context summarization | $0.382 | $1.126 | 2.95× | Tie |
| 2. JSON extraction (3-shot) | $0.0192 | $0.0520 | 2.71× | Opus (+1.5%) |
| 3. Multi-file refactor | $0.391 (after retry) | $0.279 | 0.71× | Opus |
| 4. 1.2M-token reasoning | $4.876 | $6.158 | 1.26× | Gemini |
| 5. Agent loop (support triage) | $0.176 | $0.315 | 1.79× | Opus |
| 6. RAG over 90K-token KB | $0.188 | $0.553 | 2.94× | Tie |
| 7. Single-file codegen | $0.054 | $0.110 | 2.04× | Tie |
| 8. Web research + synthesis | $0.100 | $0.216 | 2.16× | Tie |
| 9. Image OCR + categorize | $0.0084 | $0.0435 | 5.18× | Tie |
| 10. Strict tool-use (50 chains, cost per chain) | $0.0244 | $0.0610 | 2.50× | Opus |

Across all 10 tasks, Gemini 3.1 Pro is cheaper on 9 (Opus is cheaper on Task 3 only after counting Gemini’s retry round). Opus is the quality winner on 4 tasks, Gemini on 1, tied on 5. The pattern is consistent: Opus wins where multi-step reasoning, strict instruction-following, or agent-loop accuracy matter; Gemini wins everywhere else, often by 2-5×.

Thinking Levels, Caching, and the Hidden Output Bill

Gemini 3.1 Pro’s thinking-token billing is the most expensive footgun on the platform. Every reasoning token is billed at the full $12/M output rate. Setting thinking_level="high" on a debugging task can produce 5,000 to 20,000 thinking tokens, and response times can exceed 60 seconds.

Match thinking level to task complexity:

  • LOW for autocomplete-style completions, classification, translation, simple summarization, RAG fact retrieval. Negligible thinking-token cost.
  • MEDIUM as the default for code review, bug fixes, test generation, JSON extraction, and most agent loops. Roughly 500-2,000 thinking tokens per call.
  • HIGH only for genuinely hard problems: novel algorithm design, complex debugging across 5+ files, architectural planning. Can hit 20K+ thinking tokens per call.

The cost difference between LOW and HIGH on a hard prompt can be 10× or more. At $12/M, a Gemini HIGH call that spends 5,000 to 20,000 thinking tokens runs $0.06-$0.24 in thinking-token output alone, which is meaningful at scale. Use the AlphaEvolve trick: start MEDIUM, escalate to HIGH only if MEDIUM fails the quality check on the first three runs of a new prompt. Most teams will find MEDIUM is enough for 70-80% of their workload.
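The thinking-token bill is plain multiplication, and a few lines make the spread visible. This sketch uses the $12/M output rate and the thinking-token ranges quoted above; the 500-token visible answer is an assumption borrowed from the intro example, and the scenario labels are mine.

```python
GEMINI_OUTPUT_RATE = 12.00  # dollars per million output tokens, from the rate card


def output_bill(thinking_tokens: int, answer_tokens: int,
                rate_per_million: float = GEMINI_OUTPUT_RATE) -> float:
    """Thinking and visible answer tokens are billed together at the output rate."""
    return (thinking_tokens + answer_tokens) * rate_per_million / 1_000_000


# Illustrative thinking budgets, using the ranges quoted in the text.
SCENARIOS = {
    "LOW (no thinking)": 0,
    "MEDIUM (upper range)": 2_000,
    "HIGH (intro example)": 4_000,
    "HIGH (hard debugging)": 20_000,
}

if __name__ == "__main__":
    ANSWER_TOKENS = 500  # assumed visible reply length
    for label, thinking in SCENARIOS.items():
        print(f"{label:<24} ${output_bill(thinking, ANSWER_TOKENS):.3f}")
```

For the same 500-token reply, the output bill runs from under a cent with no thinking to roughly a quarter of a dollar at 20,000 thinking tokens, which is where the order-of-magnitude spread between levels comes from.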

Caching is the single biggest cost lever, bigger than picking the cheaper model. Anthropic charges 1.25× the base input price to write a 5-minute cache and 0.1× the base to read, so cache reads cost $0.50/M on Opus 4.7, a 90% discount on the cached portion. The write premium pays for itself on the first read: one write plus one read costs 1.35× the base input price, against 2× for sending the same tokens twice uncached. Google’s explicit context caching for Gemini 3.1 Pro applies roughly a 75% discount to cached input reads, landing near $0.50/M as well. Both endpoints sitting close to $0.50 is a coincidence of different base rates and different discount percentages, not a shared price, and neither has anything to do with Gemini’s separate 200K pricing tier. Combined with the 50% Batch API discount (both vendors), cached batch reads on Opus 4.7 can land at roughly 5% of the standard input rate.
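The cache arithmetic is easy to verify. This sketch works per million cached input tokens using the multipliers and discounts quoted above; it assumes every read lands inside the cache lifetime and that the batch discount stacks multiplicatively with the cache-read rate, as the text describes. Function names are illustrative.

```python
def cached_total(n_calls: int, base_rate: float,
                 write_mult: float = 1.25, read_mult: float = 0.10) -> float:
    """Cost per million cached tokens: the first call writes, the rest read."""
    if n_calls < 1:
        return 0.0
    return base_rate * (write_mult + read_mult * (n_calls - 1))


def uncached_total(n_calls: int, base_rate: float) -> float:
    """Cost per million tokens when the same context is resent every call."""
    return base_rate * n_calls


def discounted_rate(base_rate: float, discount: float) -> float:
    """Rate after a fractional discount, e.g. 0.90 for 90% off."""
    return base_rate * (1 - discount)


if __name__ == "__main__":
    OPUS_INPUT, GEMINI_INPUT = 5.00, 2.00
    for calls in (1, 2, 10):
        print(
            f"{calls:>2} calls: cached ${cached_total(calls, OPUS_INPUT):.2f} "
            f"vs uncached ${uncached_total(calls, OPUS_INPUT):.2f}"
        )
    print(f"Opus cache read:   ${discounted_rate(OPUS_INPUT, 0.90):.2f}/M")
    print(f"Gemini cache read: ${discounted_rate(GEMINI_INPUT, 0.75):.2f}/M")
    # Cache read combined with the 50% batch discount.
    print(f"Opus cached batch: ${discounted_rate(OPUS_INPUT, 0.90) * 0.5:.2f}/M")
```

A single call is the only case where caching loses ($6.25 versus $5.00 per million on Opus); from the second call onward the cached path is already ahead.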

A workload that hammers the same system prompt and document context all day is 60-95% cheaper after caching, regardless of model. The cost-per-task numbers above are cold first-call costs — real production economics with caching enabled cut both columns by 50-90%. The cost ratio between Gemini and Opus stays roughly constant.

The Routing Recipe That Cuts Total Spend 55-70%

Running everything on Opus 4.7 is the most expensive way to ship in 2026. Running everything on Gemini is cheaper but loses you accuracy on 4 of the 10 task types. Routing fixes both. Here is the static rule-based router I use; it captures most of the savings without building a dynamic classifier.

Send to Gemini 3.1 Pro (LOW or MEDIUM thinking):

  • Long-context summarization or reasoning over >1M tokens (forced — past Opus’s 1M window). Below 1M, route on cost and task type like any other workload; both models fit natively.
  • RAG over knowledge bases, internal docs, or product catalogs
  • Image understanding: OCR, receipt parsing, screenshot analysis, vision QA
  • Web research and synthesis
  • Single-file code generation from a clear spec
  • Boilerplate codegen: CRUD endpoints, test scaffolds, schema migrations
  • High-volume content tasks: classification, sentiment, entity extraction with tolerant downstream validation — the kind of work covered in the AI business automation playbook
  • Translation, paraphrasing, summarization of standard prose

Send to Claude Opus 4.7:

  • Multi-file code refactoring, semantic edits across an unfamiliar codebase
  • Agent loops where any tool call has irreversible business consequences (refunds, sends, deletes) without a human checkpoint
  • Strict JSON schema or strict tool-use chains where downstream callers cannot tolerate malformed payloads
  • Compliance-critical extraction where missing a field has audit consequences
  • Long-running planning tasks where the model orchestrates 10+ tool calls without retry budget
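The two lists above translate into a few lines of code. This is a minimal sketch and not the router itself: the task-type labels, the model identifier strings, and the `human_checkpoint` flag are hypothetical names chosen for the example, and anything not explicitly listed for Opus falls through to Gemini.

```python
from enum import Enum


class Model(Enum):
    # Placeholder labels, not verified API model identifiers.
    GEMINI = "gemini-3.1-pro"
    OPUS = "claude-opus-4.7"


# Hypothetical task labels mirroring the "Send to Claude Opus 4.7" list.
OPUS_TASKS = frozenset({
    "multi_file_refactor",
    "irreversible_agent",
    "strict_tool_use",
    "compliance_extraction",
    "long_planning",
})
OPUS_CONTEXT_LIMIT = 1_000_000  # tokens


def route(task_type: str, context_tokens: int = 0,
          human_checkpoint: bool = False) -> Model:
    """Static rule-based routing; unknown task types default to Gemini."""
    if context_tokens > OPUS_CONTEXT_LIMIT:
        return Model.GEMINI  # forced: the prompt does not fit Opus's window
    if task_type == "irreversible_agent" and human_checkpoint:
        return Model.GEMINI  # a human gate absorbs the cost-of-error risk
    if task_type in OPUS_TASKS:
        return Model.OPUS
    return Model.GEMINI


if __name__ == "__main__":
    print(route("rag", context_tokens=90_000))
    print(route("multi_file_refactor"))
    print(route("irreversible_agent", human_checkpoint=True))
    print(route("multi_file_refactor", context_tokens=1_200_000))
```

Keeping the rules in a flat function like this makes the routing decision easy to log and audit, which matters when a team is still building trust in the cheaper model.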

Default to Gemini, escalate to Opus on retry: for tasks that don’t clearly fit either bucket, run them on Gemini first and only retry on Opus if a downstream validator (schema check, test pass, lint, tool-call success) flags the Gemini output as invalid. Cheapest pattern for medium-stakes work where most calls succeed. For workflow orchestrators that already do conditional routing (Zapier, n8n, Make), the same pattern translates directly — see the automation platform comparison for which one fits your stack.
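The escalate-on-retry pattern can be sketched with the model calls passed in as plain functions, so nothing here depends on a particular SDK. The `cheap` and `strong` callables stand in for your own Gemini and Opus wrappers, and the expected-cost helper assumes the validator is free and catches every bad output, which is optimistic.

```python
import json
from typing import Callable


def run_with_escalation(
    prompt: str,
    cheap: Callable[[str], str],
    strong: Callable[[str], str],
    is_valid: Callable[[str], bool],
) -> tuple[str, str]:
    """Try the cheap model first; fall back to the strong one on invalid output."""
    first = cheap(prompt)
    if is_valid(first):
        return first, "cheap"
    return strong(prompt), "strong"


def is_json_object(text: str) -> bool:
    """Example validator: the output must parse as a JSON object."""
    try:
        return isinstance(json.loads(text), dict)
    except json.JSONDecodeError:
        return False


def expected_cost_per_call(cheap_cost: float, strong_cost: float,
                           cheap_failure_rate: float) -> float:
    """Every call pays the cheap model; failed calls also pay the strong one."""
    return cheap_cost + cheap_failure_rate * strong_cost


if __name__ == "__main__":
    # Stub models for demonstration; real wrappers would call the vendor APIs.
    def flaky(prompt: str) -> str:
        return "sorry, here is the data: {"

    def reliable(prompt: str) -> str:
        return '{"status": "ok"}'

    print(run_with_escalation("extract", flaky, reliable, is_json_object))
    # Task 10 figures: $0.0244 and $0.0610 per chain, 7.6% failure rate.
    print(f"${expected_cost_per_call(0.0244, 0.0610, 0.076):.4f}")
```

Plugging in the Task 10 numbers gives roughly $0.029 per chain against $0.061 for sending everything to Opus, under the assumptions stated above.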

A team running a typical product mix — agents, RAG, content, codegen, support triage — that switches from “everything on Opus” to this router cuts total monthly model spend 55-70%. One afternoon to set up. The hard part is convincing the team to trust the cheaper model on traffic that has been running on the expensive one for six months.

Migration Notes for Teams Coming From Opus 4.6 or GPT-5.4

From Opus 4.6 to Opus 4.7: audit your token usage with Anthropic’s Token Counter on a representative sample. Mostly English prose? The tokenizer change costs 1-10% — not worth panic. Mostly code, JSON, or technical docs? You are paying toward the top of Anthropic’s documented 1.0×–1.35× range, and independent measurements on the densest content (Willison’s 1.46×, Claude Code Camp’s 1.47×) suggest a worst case a little past the spec. Run the tokenizer-tax weekend fix before deciding whether to migrate traffic to Gemini.

From GPT-5.4 to either: Gemini 3.1 Pro is roughly cost-equivalent to GPT-5.4 ($2/$12 vs $2/$10) but offers a 10× larger context window and slightly stronger reasoning at MEDIUM. Opus 4.7 is ~2.5× more expensive than GPT-5.4 ($5/$25 vs $2/$10) but wins on multi-file coding and strict instruction-following. The migration is a function of which workload dominates your traffic.

For agent-heavy stacks: the gap between Opus and Gemini on agent loops widens with the number of tool calls. Under 5 tool calls, both are usable; above 10, Opus’s lower error rate compounds. Build your router to pick by chain length if you can predict it from the user prompt. If your tools are MCP servers exposed to third-party agents, the MCP security threats and mitigation patterns matter independent of which model is calling them.

What This Means for Your Stack

The market consolidated to two practical answers: Gemini 3.1 Pro for cost-sensitive, context-heavy, vision-heavy, or RAG-dominated workloads; Opus 4.7 for code, agents, and strict structured output. GPT-5.4 sits in between on price and quality but doesn’t dominate either dimension. Open-source alternatives (DeepSeek, Llama 4.x derivatives, Qwen 3) are 5-15× cheaper still and are the right call for the truly high-volume, low-stakes layer of a routed stack — the layer that handles 80% of calls and 20% of the value.

If your AI bill is north of $5,000/month and you haven’t audited it in the last 30 days, this week is the right week. The tokenizer tax on Opus 4.7, the thinking-token bill on Gemini HIGH, and the cache-rate discounts you’re not using are quietly costing you 40-70% of your spend. Pull last month’s invoice, sort calls by cost, route the top three endpoints first — that’s where the money is.