Same prompt, same model family, bigger bill. Claude Opus 4.7 shipped on April 16, 2026 with the sticker price of Opus 4.6 untouched — $5 per million input tokens, $25 per million output tokens — but Anthropic quietly swapped the tokenizer. Anthropic’s official range puts the same input text at 1.0× to 1.35× as many tokens; single-document measurements on technical content have landed harder, at 1.45× to 1.47×. One coding agent running 100M tokens a day jumped from $500/day to $675/day overnight with zero change in usage. Claude Pro subscribers started hitting their weekly cap after three questions. The rate card did not move. Your bill did.

I felt this one personally. Claude Opus 4.7 is the heaviest line item on my own stack — I run it daily for Apify actor development, blog drafting, and the planning loops behind Godberry Studios. My Content-to-Social MCP server also calls the Claude API per transformation at $0.07 a pop, so the tokenizer-tax math feeds straight into the actor’s gross margin. This post walks through what actually changed, how bad the damage is on each prompt type, and six reduction tactics — ranked by ROI — you can run through in a weekend to claw back the 35% and then some. If you use Claude in any production workload — agents running on MCP servers, scrapers, coding tools, customer support bots, content pipelines — the work here is concrete.

What Actually Changed on April 16

Anthropic announced Opus 4.7 as a capability upgrade — better coding, stronger agentic behavior, improved instruction following. The migration notes mention a new tokenizer in passing and note that it “may use roughly 1.0× to 1.35× as many tokens when processing text compared to previous models.” That sentence is the whole story. Opus 4.7 is the first Claude model to ship a new tokenizer since Opus 3, and the new one is smaller on some common subwords but larger on code patterns, structured data, and technical prose. Every cost you pay is a function of token count, so a tokenizer swap is a direct cost event.

Anthropic’s official range is 1.0× to 1.35× — up to about 35% more tokens, depending on content. That is the number to plan against. A few independent single-document measurements have run hotter: Simon Willison ran the Opus 4.7 system prompt through his Claude Token Counter the day after launch and got 1.46× the token count of Opus 4.6 on the same text; the independent Claude Code Camp teardown measured 1.47× on technical documentation and 1.45× on a real CLAUDE.md file. Those are worst-end single-document readings on dense technical content, not a production average — treat them as the ceiling, not the expectation. For images, the new tokenizer uses up to 3× the tokens of Opus 4.6. In words: at the official top end, 1M Opus 4.7 tokens covers roughly 555,000 words of text where 1M Opus 4.6 tokens covered 750,000 — about a quarter of your effective context window gone on the same prompt.

The Real-World Cost Impact, by Prompt Type

The 0-35% official range is an average across prompt types, and the damage clusters sharply once you split it. Five representative prompt types and what the tokenizer change does to each:

  • Standard English prose — 1.00× to 1.10×. Marketing copy, FAQs, customer support messages. Largely unchanged. If this is 100% of your workload, you can almost ignore the tokenizer change.
  • Technical documentation and mixed prose + code — around 1.35×, with dense single documents measured as high as 1.47×. API docs, engineering runbooks, CLAUDE.md files, long agent scratchpads. The most common production workload, and the hardest-hit.
  • Pure code — 1.30× to 1.40×. Function writing, refactoring, codegen. Also the category where Opus 4.7 is most accurate, so the cost pressure shows up in the exact workload where you most want to use it.
  • JSON, XML, structured data — 1.25× to 1.35×. Tool call payloads, structured extraction outputs, RAG context with metadata. Bad on anything that passes rich JSON through prompts.
  • Images — up to 3.00×. If your pipeline sends screenshots, diagrams, or receipts for vision analysis, this is the single biggest line-item increase in your bill.

If you run a coding agent, a vision pipeline, or anything that passes JSON or XML through prompts frequently, you are paying the full tokenizer tax. If you run a pure English-prose workload, the tokenizer is a footnote and you do not need to rearchitect. Audit which camp you are in before you change anything.

Why Anthropic Changed the Tokenizer (And Why the Defense Is Half-True)

Anthropic’s argument is that the new tokenizer contributes to the model’s accuracy gains — better instruction following, more literal tool use, fewer silent generalizations. SWE-bench Verified for Opus 4.7 is 87.6%, SWE-bench Pro 64.3%, and Opus 4.7 averages 13% more accurate on coding tasks with 8-12% fewer tool calls. Those numbers are real.

The defense is half-true because the benchmarks that shifted upward are not the only ones that moved. On MRCR v2 8-needle at 1M context — a long-context retrieval test — Opus 4.7 scores 32.2% where Opus 4.6 scored 78.3%, a collapse on a workload a lot of production agents depend on. Adaptive reasoning drew loud complaints that it under-thinks on hard questions even at “high” effort. Coding and tool-use workloads net out positive even with the tokenizer tax; long-context retrieval and certain reasoning workloads net out negative. If you run the latter, the migration decision isn’t about cost — it’s about whether the model still does your job at all.

The quality debate is its own thread. The cost question is tractable in a weekend. Six tactics, ranked by ROI.

Six Reduction Tactics, Ranked by Weekend ROI

Sequenced by how much they move your bill per hour of work — fastest payback first.

1. Audit and trim your system prompts (30-60 min, 15-30% reduction)

Most production prompts have 30-50% fat. Prompts grow organically — someone adds an instruction to handle an edge case, a rule to fix a regression, a few-shot example that helped a demo, and nothing ever gets deleted. Over 18 months, a 400-token system prompt becomes a 2,400-token one.

The audit is simple. Pick your top three highest-volume endpoints. For each instruction, ask: does removing this change the output on the last 20 real requests? If no, remove it. For few-shot examples, ask: does removing this worsen the output? If no, remove it. For formatting rules, ask: can structured output (tool schemas or response_format) replace this? If yes, move it.

A first-pass trim on a typical 2,000-token prompt lands around 800-1,200 tokens — a 40-60% input reduction on every single call. On Opus 4.7 at $5/M input, a service doing 1M requests/month at 1,500 tokens of prompt drops from $7,500/month to around $3,500/month just from the trim. Trim takes priority over every other tactic because the tokenizer tax compounds against bloated prompts — a 2,000-token 4.6 prompt becomes a 2,700-token 4.7 prompt, and the trim recovers both at once. When I ran this on my own Content-to-Social system prompt, about 35% of it was scar tissue from edge cases I’d patched months earlier — removing it changed nothing in the diff on 20 sample transformations.

2. Turn on prompt caching (15 min, 60-90% reduction on cached sections)

Prompt caching is the single highest-ROI lever Anthropic ships, and most teams still don’t use it because the rollout was quiet and the math looks confusing at first. It is neither.

A 5-minute cache write costs 1.25× the base input price — $6.25 per million tokens on Opus 4.7. Cache reads cost 0.1× the base — $0.50 per million tokens. Every cached read saves 90% on the cached portion. The break-even math is more forgiving than most people assume: a 5-minute cache write is 1.25× base and a read is 0.1× base, so write-plus-one-read (1.35×) already beats two uncached reads (2.0×) — the 5-minute cache pays for itself on the very first read. A one-hour cache write costs 2× base, so it breaks even on the second read. In practice, if a prompt is reused at all within the window, caching wins.

Cache first: your system prompt, tool schema definitions, any large document or CLAUDE.md constant across requests, and the few-shot examples you survived the trim step with. If your workload is 500 requests a day hitting the same 10,000-token system prompt, cached reads cut that portion of your bill by 90% for the whole day.

One gotcha: Anthropic quietly dropped the default cache TTL from 60 minutes to 5 minutes earlier in 2026, which raised effective costs 30-60% for teams that never adjusted. The 1-hour TTL is still available but you have to ask for it.

3. Route the right tasks to Sonnet 4.6 (1-2 hours, 40-70% reduction on routed traffic)

The boring, obvious, underused lever. Sonnet 4.6 costs $3/M input and $15/M output — 40% cheaper than Opus across both directions, and Sonnet 4.6 doesn’t use the new tokenizer, so you also skip the token inflation.

Opus 4.7 is genuinely better on coding (13% accuracy, 8-12% fewer tool calls), certain agentic tasks (74.9 vs 65.3 benchmark average), and vision. It is not meaningfully better on classification, content generation, sentiment analysis, basic RAG, boilerplate codegen, test generation, customer support triage, or structured extraction from clean inputs. Those are the workloads to route.

Rule of thumb: if a task has a well-defined output shape, predictable inputs, and doesn’t require multi-step tool use or deep reasoning, run it on Sonnet 4.6 unless you can measure a quality drop. Agent workloads are the exception — the more autonomous an agent is, the more the reasoning quality gap between Opus and Sonnet shows up in the tail. Classifier-based routing is worth building if your workload exceeds $5,000/month in model spend — published benchmarks show 20-60% cost savings at comparable accuracy. Below that, a static rule-based router (“these five endpoints always route to Sonnet; these three always to Opus”) gets you 80% of the benefit in an afternoon. For my own MCP server the math is unforgiving: every transformation bills at $0.07, and if the Claude call inside it costs $0.05 instead of $0.02, the unit economics flip from healthy to break-even. The default “always use Opus” posture quietly kills indie-developer margins faster than any pricing change Anthropic could announce.

4. Tune effort levels instead of chasing max (1 hour, 20-40% reduction on thinking tokens)

Opus 4.7 removed budget_tokens for extended thinking — requests that still use it return a 400. The replacement is thinking: {type: 'adaptive'} plus a five-tier effort parameter that lives in output_config: output_config: {effort: "high"}, with the levels low, medium, high, xhigh, and max. high is the API default; Anthropic recommends xhigh as the starting point for coding and agentic work, with high as the floor for intelligence-sensitive workloads where quality matters more than speed or cost.

The sensible defaults per task type:

  • Simple classification, summarization, structured extraction: medium is usually the sweet spot, often low is enough
  • Coding, code review, refactoring: xhigh is the recommended starting point; max is rarely worth it
  • Agentic loops with repeated tool calling, deep research, knowledge-base search: xhigh — the exploration budget pays off
  • Frontier problems where the model genuinely needs to reason: max, but cap the run with a task budget (see next section)

Most teams running Opus at max on every request are spending 2-3× what they need to. For coding and agentic endpoints, xhigh is the recommended starting point; for intelligence-sensitive work hold the floor at high; downgrade further per endpoint only where you can measure no quality drop. Max is a surgical tool, not a toggle.

Three Opus 4.7 changes silently break naive migrations — verify them before you trust a side-by-side test. First, temperature, top_p, and top_k return a 400 error if set to any non-default value; strip them from request bodies that carried them on 4.6. Second, thinking content is omitted from responses by default — if your product reads or streams reasoning, set display: "summarized" to get it back. Third, adaptive thinking is off by default; enable it explicitly with thinking: {type: 'adaptive'} or the model does not think at all.

5. Use task budgets to cap runaway agent loops (30 min, 10-30% reduction on agent workloads)

Task budgets are an Opus 4.7 beta feature and probably the most practical change for anyone running long agentic loops. You pass a rough total token target for the full loop — thinking tokens, tool calls, tool results, and final output — and the model sees a running countdown, using it to prioritize work and finish gracefully instead of spiraling.

Before task budgets, a common failure mode was an agent that spent 60% of its budget on a single tool call loop because adaptive reasoning decided to explore, then ran out of headroom before producing a final answer. The typical cost saving is 10-30% on the tail distribution, and the quality difference is usually imperceptible because the long-running cases were rarely the accurate ones.

6. Move async work to batch processing (2 hours, 50% reduction on batched traffic)

Batch processing — submitting requests in batches that complete asynchronously within 24 hours — is a flat 50% discount on both input and output. It’s the right home for anything that doesn’t need to respond in real time: content pipelines, bulk classification, nightly summaries, training data labeling, scraping post-processing, RAG reindexing. If your agents pay for external services inside those batches, agent-to-API payment protocols like x402 handle the other half of the cost equation.

Any team spending $1,000+/month where even 20% of the workload is async should already be running it in batch. The discount compounds with caching and model routing, so the three tactics stacked can cut batched-async-Sonnet-routed workloads to around 5-10% of their Opus-real-time cost.

A Weekend Audit Template

Two hours on Saturday, two hours on Sunday, and most of the tokenizer tax is gone.

Saturday morning — inventory. Pull 30 days of Claude API usage from the Anthropic console and bucket spend by endpoint (endpoint, model, input tokens, output tokens). Sort descending by total spend. The top 5 endpoints are usually 80% of the bill; ignore everything below this weekend.

Saturday afternoon — prompt trim. For each of the top 5, run the trim audit (tactic 1). Ship the shorter prompts to staging, run 20 real requests, deploy if output quality is indistinguishable. You’ll typically ship 3-4 of 5 on Saturday and flag one or two for a deeper look.

Sunday morning — caching and routing. Turn on prompt caching for the top 5 endpoints (tactic 2) — usually a one-line change wrapping the static portion in a cache control block. Then route classification, triage, and structured-extraction endpoints to Sonnet 4.6 (tactic 3). Keep Opus 4.7 on coding, vision, and the genuinely hard agentic loops.

Sunday afternoon — effort levels and budgets. Downgrade any endpoint running at max without a measured reason (tactic 4). Add a task budget to any multi-step agentic loop (tactic 5). Move genuinely async endpoints to batch processing (tactic 6).

A typical weekend audit on a team spending $10-20K/month on Claude lands the monthly bill 35-65% below where it started, with most of the saving from tactics 1, 2, and 3 stacked.

When to Stay on Opus 4.7 Anyway

Opus 4.7 is a real capability upgrade on the workloads it was tuned for. If you run a coding assistant, an agentic loop with heavy tool use, or a vision pipeline where the accuracy-to-cost ratio is favorable, staying on Opus 4.7 is the right call — pay the tokenizer tax and offset it through the six tactics above.

The argument for migration away applies in two cases. One, you run a long-context retrieval workload — if you were relying on the 78% MRCR score, the 32% score on Opus 4.7 may not clear your quality bar, and Opus 4.6 remains available on the API through at least October 2026. Two, you run a pure English-prose workload that Sonnet 4.6 handles at comparable accuracy for 40% less, in which case you should have migrated months ago regardless of the tokenizer news. Most workloads fall outside those two cases. Stay on Opus 4.7 and do the audit. 35% of a bill is not a line you ignore.

What Happens Next

The bigger pattern: every frontier vendor is now raising effective prices without moving sticker prices. OpenAI raised GPT-5 output token prices last October. Google adjusted Gemini 2.5 Pro thinking-level pricing in February. Anthropic changed the tokenizer at unchanged sticker price in April. Expect at least two more effective price increases disguised as capability upgrades across vendors in 2026 — tokenizer swaps, thinking budget changes, cache TTL changes, per-tier access restrictions. Apify’s October 2026 migration to pay-per-event pricing is a non-LLM version of the same pattern: the unit of billing changes, and operators who didn’t pre-audit eat the difference. Kimi K2.6, Qwen 3.2, and DeepSeek R2 are meaningfully competitive with Opus 4.7 on a subset of workloads at 5-15% of the effective cost — worth scoping as a secondary route if your bill has a budget-alert threshold attached. For agents that also need to browse the open web, the ChatGPT Atlas-vs-scraping-stack piece covers what that layer looks like in practice.

Opus 4.7 runs daily across my actor development, blog work, and the Content-to-Social MCP transformations that ship paid output — and the tokenizer tax is real on every one of those surfaces. The single highest-leverage move the weekend after launch was trimming the MCP server’s system prompt and routing the simpler transformations to Sonnet 4.6; that one change pulled the actor’s gross margin out of the danger zone faster than any other lever Anthropic offered. Four hours of audit work protects the economics of every solo operator who built a margin around the pre-4.7 token count and would otherwise watch it quietly evaporate.