How to Reduce Token Usage in Claude Code and Opencode

TLDR: I cut my Claude Code and Opencode token usage by 60-80% using five strategies: replacing MCP servers with a custom CLI, compressing command outputs with RTK, running Headroom as a context compression proxy, forcing terse output with the Caveman skill, and aggressively managing context. These stack — combine all five and your API bill drops to a fraction.

Why Replace MCP Servers with a Custom CLI?

What is the single biggest token waste in AI coding tools? MCP servers. Each one injects 500-1,500 tokens of tool definitions into every API call. Connect 10 servers and you're loading 5,000-15,000 tokens before typing a single word. I wrote about this in detail in my CLI tools post — I built a single CLI called mycli that handles web browsing, Reddit search, YouTube transcripts, Gmail, and more. One skill definition (~500 tokens) replaces 10 MCP servers (~10,000+ tokens).
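
To make that arithmetic concrete, here is a rough sketch. The per-definition figures are the estimates quoted above; the function name and the 50-call session length are illustrative assumptions of mine, not measurements.

```python
def definition_overhead(tokens_per_definition: int, definitions: int, api_calls: int) -> int:
    """Tokens spent re-sending tool definitions across a whole session."""
    return tokens_per_definition * definitions * api_calls


# Assumed session length, for illustration only.
CALLS = 50

mcp_low = definition_overhead(500, 10, CALLS)     # 10 servers at the low estimate
mcp_high = definition_overhead(1_500, 10, CALLS)  # 10 servers at the high estimate
cli_skill = definition_overhead(500, 1, CALLS)    # one ~500-token skill definition

print(f"10 MCP servers: {mcp_low:,} to {mcp_high:,} tokens")
print(f"One CLI skill:  {cli_skill:,} tokens")
```

The gap grows linearly with the number of calls, which is why long agentic sessions feel the overhead most.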

Beyond token savings, third-party MCP servers are a supply chain nightmare: random npm packages running as child processes with full system access. Build your own CLI and you control what runs on your machine. Good CLI documentation (--help, consistent flags, JSON output) is all an AI needs. In my experience, no model has failed to use mycli.
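
As a minimal sketch of what "good CLI documentation" can look like, here is a tiny argparse-based tool. The search subcommand, the --limit and --json flags, and the placeholder results are all hypothetical; they are not mycli's actual interface.

```python
import argparse
import json


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="mycli",
        description="Hypothetical sketch: one CLI, many tools.",
    )
    sub = parser.add_subparsers(dest="command", required=True)

    # Every subcommand gets its own --help text for free.
    search = sub.add_parser("search", help="Search a source and print results.")
    search.add_argument("query", help="Search terms.")
    search.add_argument("--limit", type=int, default=5, help="Maximum results (default: 5).")
    search.add_argument("--json", action="store_true", help="Emit machine-readable JSON.")
    return parser


def run(argv: list[str]) -> str:
    args = build_parser().parse_args(argv)
    # Placeholder data; a real tool would call an API here.
    results = [{"rank": i + 1, "title": f"{args.query} result {i + 1}"} for i in range(args.limit)]
    if args.json:
        return json.dumps({"query": args.query, "results": results})
    return "\n".join(f"{r['rank']}. {r['title']}" for r in results)


print(run(["search", "token usage", "--limit", "2", "--json"]))
```

Because argparse generates the --help text from the same definitions, the agent can discover the interface on demand instead of carrying it in every request.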

Can a Compression Tool Cut Tokens Without Changing My Workflow?

Yes. Install RTK, a Rust CLI proxy that intercepts shell commands in Claude Code and compresses their output before it reaches the LLM. A git status drops from ~2,000 tokens to ~200. Test output from cargo test or npm test shrinks by 90%. git push goes from 15 lines of noise to one line: ok main.

RTK installs with rtk init -g and registers a PreToolUse hook in Claude Code. After that it's transparent: every bash command gets compressed automatically. It supports 100+ commands across git, test runners, linters, Docker, AWS, and package managers. In a 30-minute session it saves roughly 80% of command output tokens. It also works with Cursor, Gemini CLI, Codex, Windsurf, and Opencode.
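
To illustrate the general idea of output compression, here is a small sketch that collapses git status --porcelain output into a one-line summary. This is not RTK's implementation (RTK is written in Rust and I have not reproduced its logic); the function name and the bucketing rules are my own simplifications.

```python
from collections import Counter


def summarize_porcelain(output: str) -> str:
    """Collapse `git status --porcelain` output into a one-line summary."""
    counts = Counter()
    for line in output.splitlines():
        if len(line) < 3:
            continue  # skip blank or malformed lines
        code = line[:2]  # two-character status code, e.g. " M", "??", "A "
        if code == "??":
            counts["untracked"] += 1
        elif "D" in code:
            counts["deleted"] += 1
        elif "A" in code:
            counts["added"] += 1
        else:
            counts["modified"] += 1  # simplification: renames etc. land here
    if not counts:
        return "clean"
    return ", ".join(f"{n} {kind}" for kind, n in sorted(counts.items()))


sample = " M src/app.py\n M src/auth.py\n?? notes.txt\nA  src/new.py\n"
print(summarize_porcelain(sample))  # prints "1 added, 2 modified, 1 untracked"
```

The model rarely needs every path to decide its next step, so a summary like this keeps the signal and drops most of the tokens.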

For heavier compression across everything — not just command output — Headroom takes a broader approach. It compresses tool outputs, logs, files, RAG chunks, and conversation history before they reach the LLM. It runs as a proxy, a library, or an MCP server, and claims 60-95% token reduction. Headroom uses AST-aware compression for code, a trained ML model (Kompress) for prose, and smart JSON crushing for structured data. It also caches originals so the LLM can retrieve them if needed — reversible compression.
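
The reversible part is the interesting design choice, so here is a minimal sketch of the idea: truncate a long JSON array, keep the original in a cache, and hand back a reference the model can use to ask for the full data. This is not Headroom's API; the names crush_json and retrieve, the in-memory cache, and the keep=3 default are all hypothetical.

```python
import hashlib
import json

_originals: dict[str, str] = {}  # cache key -> original text


def crush_json(text: str, keep: int = 3) -> str:
    """Truncate a long top-level array, caching the original for later retrieval."""
    data = json.loads(text)
    if not isinstance(data, list) or len(data) <= keep:
        return text  # nothing worth compressing
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
    _originals[key] = text
    return json.dumps({"items": data[:keep], "omitted": len(data) - keep, "ref": key})


def retrieve(key: str) -> str:
    """Return the uncompressed original for a cache key."""
    return _originals[key]


raw = json.dumps([{"id": i} for i in range(100)])
small = crush_json(raw)
ref = json.loads(small)["ref"]
print(len(raw), "->", len(small), "characters")
print("round trip ok:", retrieve(ref) == raw)
```

Keeping the original behind a reference means an overly aggressive truncation costs one extra lookup rather than a wrong answer.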

I use RTK for the shell layer and it handles the bulk of savings.

How Do I Reduce Output Tokens?

Output tokens are usually priced higher than input tokens because generating them takes more compute. Caveman is a Claude Code skill that forces the model to respond in compressed, telegraphic English, or "caveman speak." Instead of "Sure! I'd be happy to help you with that. The issue you're experiencing is most likely caused by your authentication middleware not properly validating the token expiry," you get: "Bug in auth middleware. Token expiry check uses < not <=. Fix:"

The benchmarks show ~65% output token reduction on average, with technical accuracy preserved at 100%. The same code gets generated — just without the pleasantries, hedging, and redundant explanations. Install with one command: curl -fsSL https://raw.githubusercontent.com/JuliusBrussee/caveman/main/install.sh | bash. It also works with Codex, Gemini, Cursor, Windsurf, and 30+ other agents.

I use the caveman skill in my Opencode setup — my INSTRUCTIONS.md tells the agent to respond like a smart caveman. It works. The AI drops filler words, skips pleasantries, keeps technical terms exact, and gives me fragments instead of paragraphs. Speed goes up because shorter responses generate faster. Cost goes down because output tokens are typically the most expensive part of an API call.
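
Here is a quick back-of-the-envelope sketch of what a ~65% output reduction means for cost. The price and the monthly token volume are placeholder assumptions; substitute your provider's actual rates.

```python
def output_cost(tokens: int, price_per_million: float) -> float:
    """Cost in dollars for a given number of output tokens."""
    return tokens / 1_000_000 * price_per_million


OUTPUT_PRICE = 15.00       # assumed $ per million output tokens (placeholder)
baseline_tokens = 200_000  # assumed monthly output volume (placeholder)
reduced_tokens = round(baseline_tokens * (1 - 0.65))  # ~65% reduction from the benchmarks

print(f"Before: ${output_cost(baseline_tokens, OUTPUT_PRICE):.2f}")
print(f"After:  ${output_cost(reduced_tokens, OUTPUT_PRICE):.2f}")
```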

What Else Makes a Difference?

Three more things I do that compound the savings:

Compress early and often. Both Claude Code and Opencode have auto-compress features (Claude Code calls it auto-compact), but don't wait for them. Manually compress context as soon as a task is done; in Claude Code that's the /compact command. Old context corrupts new feature development: stale assumptions bleed into fresh code. I use /clear between features and compress mid-session when the context window gets heavy. I wrote about this in Claude Code best practices.

Write a tight CLAUDE.md. Your instruction file loads into every API call. A bloated CLAUDE.md with paragraphs of explanation burns tokens on every single request. Keep it to bullet points — specific rules, naming conventions, architecture notes. Cut everything the AI can infer from reading your code.
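
If you want a rough feel for what your instruction file costs, the sketch below uses the common rule of thumb of about four characters per token for English text. That ratio is only a heuristic, not a real tokenizer, and the two sample texts and the 100-call session are made up for illustration.

```python
def estimate_tokens(text: str) -> int:
    """Very rough estimate: about 4 characters per token for English text."""
    return len(text) // 4


def session_overhead(text: str, api_calls: int) -> int:
    """Estimated tokens spent re-sending the same instructions on every call."""
    return estimate_tokens(text) * api_calls


# Made-up examples of a wordy and a tight instruction file.
bloated = "Please always remember to carefully follow our naming conventions. " * 40
tight = "- snake_case for functions\n- tests live in tests/\n- no new dependencies\n"

for name, text in [("bloated", bloated), ("tight", tight)]:
    print(f"{name}: ~{session_overhead(text, 100):,} tokens over 100 calls")
```

To check your own file, pass its contents in, for example pathlib.Path("CLAUDE.md").read_text().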

Use subagents for narrow tasks. Instead of dumping everything into one context window, delegate specific tasks to subagents with scoped instructions. A subagent that only handles tests gets a tiny context and burns few tokens. The main agent stays clean. This pattern works especially well in Claude Code and Opencode where you can define agent personas.

Key Takeaways

MCP servers are the biggest token waste — replace them with a custom CLI to save 5,000-15,000 tokens per API call
RTK transparently compresses shell command output by roughly 80% (up to 90% on noisy commands) with zero workflow changes
Headroom compresses everything before it reaches the LLM — tool outputs, logs, files, conversation history — and it's reversible
Caveman skill cuts output tokens by ~65% by forcing telegraphic responses without losing technical accuracy
Manual context compression between tasks prevents stale context from corrupting new work and keeps token usage lean
A tight CLAUDE.md with bullet points instead of paragraphs saves tokens on every API call
Subagents with scoped instructions handle narrow tasks with minimal token cost