
Deep Study Guide / April 2026

How Anthropic engineers
build software_

A research-backed deep study on Anthropic's development workflows, agent harness design, infrastructure, interpretability research, and AI-first engineering culture — with 100+ official references.

A study about Anthropic by Agilize

Claude Code Course for Engineers — 29 modules, 100% hands-on
90% — Code by Claude
59% — Daily work uses AI
67% — More PRs/engineer
5-30 — PRs shipped/day
74 — Releases in 52 days
200% — Productivity gain

01 — Core philosophy

Design principles behind Anthropic's engineering

1

Do the simple thing first

Context compaction is just asking Claude to summarize previous messages. The CLAUDE.md memory system is "the simplest thing that could work — it's a file that has some stuff." They abandoned vector-based RAG search (with Voyage embeddings) in favor of agentic search using grep and glob, which outperformed RAG "by a lot." [Latent Space] Boris Cherny created ~20 distinct prototypes in two days for the todo list feature alone, preferring rapid iteration over upfront architecture. [Lenny's Pod]

2

Minimal scaffolding, maximum model

The SWE-bench agent that scored 49% uses only two tools: a Bash tool (persistent state, no internet) and an Edit tool (str_replace, view, create, insert, undo). [SWE-bench] No framework. No RAG. No planning module. The team actively removes tools — they unshipped ls once bash enforcement was robust. Cat Wu: "Everything you can do, Claude can do. There's nothing in between." [Teams PDF] The foundational paper states: "Start by using LLM APIs directly: many patterns can be implemented in a few lines of code." [Building Agents]
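The exact-match contract of an Edit tool can be illustrated with a small sketch (a hypothetical str_replace helper, not Anthropic's implementation): by rejecting zero or multiple matches, the tool forces the model to quote enough surrounding context to pin down the edit site unambiguously.

```python
def str_replace(text: str, old: str, new: str) -> str:
    """Replace old with new, but only if old occurs exactly once.

    Requiring a unique exact match (rather than line numbers) makes
    the model supply enough context to identify the edit site.
    """
    count = text.count(old)
    if count == 0:
        raise ValueError("old string not found in file")
    if count > 1:
        raise ValueError(f"old string matches {count} times; add more context")
    return text.replace(old, new)
```

Ambiguous edits fail loudly instead of landing in the wrong place, which is exactly the feedback an agent can act on.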

3

Tool design over prompt engineering

From the SWE-bench work: "Much more attention should go into designing tool interfaces for models." [SWE-bench] The team spent more time optimizing tool interfaces than the overall prompt. Tools are the contract between human intent and model capability. This is why Claude Code's Edit tool enforces exact string matching (not line numbers) and the Bash tool maintains persistent state — each design choice encodes an assumption about reliable model interaction.

4

Separate generation from evaluation

From harness design research: "Agents tend to respond by confidently praising the work — even when, to a human observer, the quality is obviously mediocre." [Harness Design] Never let the same agent generate and evaluate its own work. The three-agent harness (Planner, Generator, Evaluator) exists specifically for this reason. The Evaluator actively runs the application, not just reads the code.

5

Underfund teams, unlimited tokens

Boris Cherny advocates providing small teams with unlimited API access rather than large headcount. Claude Code started with one engineer (Boris) and grew to ~10. The team ships 60-100 internal npm releases per day. This forces prioritization: the model does the heavy lifting, humans guide direction. Individual engineers average 5 PRs/day; Boris routinely ships 10-30 PRs/day. [Lenny's Pod, Pragmatic Eng]

6

Every component encodes an assumption

From "Harness design for long-running application development": "Every component in a harness encodes an assumption about what the model can't do on its own." [Effective Harnesses] If the model can do it, remove the component. If it can't, make the harness handle it. This is why harness design evolves with model capabilities — what required scaffolding with Claude 3 may be unnecessary with Claude 4.

"Maybe you don't actually need an IDE." [Lenny's Pod]

— Boris Cherny, Head of Claude Code

02 — Tech stack

How Claude Code is built

Technology choices reflect the "model writes the code" philosophy. Pick technologies the model knows best.

Language

TypeScript

"TypeScript and React are two technologies the model is very capable with, so were a logical choice." [Pragmatic Eng] ~90% of the codebase is AI-authored. [Fortune] Boris hasn't edited a line by hand since November 2025. [Boris/X]

UI Framework

React + Ink + Yoga

Terminal UI via React components with the Ink framework, translating React to ANSI escape codes. Meta's Yoga engine handles constraint-based terminal layouts. No Electron or browser dependency.

Build & Distribution

Bun + npm

Bun for building/bundling. npm for distribution. 60-100 internal npm releases per day. ~1 external release per day. 74 public releases in 52 days (Feb 1 – Mar 24, 2026). [Pragmatic Eng] Four teams ship independently in parallel.

CLI Framework

CommanderJS

Minimal, standard CLI handling. The tool avoids heavy abstractions. When given bash access, Claude naturally gravitates toward command-line tools rather than custom abstractions.

Agent Core: 2 Tools

Bash + Edit

The SWE-bench agent (49% score) uses only: Bash (executes commands, persistent state across calls, no internet) and Edit (str_replace with exact string matching, enforced absolute paths, undo_edit). The model determines step sequencing freely. [SWE-bench]

Remote Dev

Coder control plane

Anthropic uses Coder for remote dev environments. Jacqueline Lee (MTS): "I have been focusing on remote development, majorly leveraging Coder as the control plane." [Teams PDF] Agents run in the background — close your laptop and work continues.

Origin story

Claude Code originated from a command-line tool Boris Cherny built to show what music an engineer was listening to. After he gave it filesystem access, it "spread like wildfire at Anthropic." [Lenny's Pod] Boris joined Anthropic in September 2024 and began prototyping with Claude 3.6. He created the first working prototype in days. Sid Bidasaria joined as engineer #2. The team grew to ~10 engineers and now includes PMs, designers, and data scientists. An Anthropic spokesperson clarified: company-wide, between 70% and 90% of code is AI-authored. [Fortune]

03 — Development workflows

The AI-first development loop

Anthropic engineers have converged on several distinct workflow patterns, each suited to different task types.

Primary Workflow

The autonomous loop (Shift+Tab / auto-accept mode)

Clean git state → Prompt Claude → Shift+Tab auto-accept → Claude writes + tests + iterates → Review ~80% → Human refines 20% → Commit

The Product Development team uses auto-accept mode where Claude writes code, runs tests, and iterates autonomously. Claude verifies its own work by running builds, tests, and lints. The engineer reviews the ~80% complete solution. ~70% of final implementation comes from Claude's autonomous work. [Teams PDF] Critical: always start from a clean git state and commit checkpoints regularly so you can roll back.

Task classification intuition: peripheral features run async (let Claude go fully autonomous), core business logic runs synchronous (human stays in the loop). Developing this intuition is key to the workflow.

High Volume

The slot machine

Used by Data Science and ML Engineering. Commit state, let Claude run 30 minutes, accept or restart fresh. Starting over often has a higher success rate than debugging a broken attempt. Build permanent React dashboards (5,000+ lines of TypeScript) instead of throwaway Jupyter notebooks — despite "knowing very little JavaScript." [Teams PDF]

"Treat it like a slot machine — starting over often has higher success rate than fixing."

— Anthropic Data Science Team
Methodical

TDD with Claude

Used by Security Engineering. Write pseudocode first, guide Claude through test-driven development. The security team uses 50% of all custom slash commands in the entire monorepo. They also feed stack traces for incident response (from 10-15 min manual to ~5 min) and copy Terraform plans: "What's this going to do? Am I going to regret this?" [Teams PDF]

"Let Claude talk first. Tell it to commit as it goes."

— Anthropic Security Team
One-shot First

Try and rollback

Used by RL Engineering. Quick prompt, let Claude attempt full implementation. Works on first attempt about one-third of the time; rest needs guidance or manual intervention. Frequent git checkpointing is essential. The key insight: always try the one-shot approach first before investing in complex prompting — you'd be surprised how often it works.

The five agent workflow patterns (from "Building Effective Agents")

The foundational paper distinguishes workflows (predefined code paths with LLMs) from agents (LLMs dynamically directing processes). Five workflow patterns form the building blocks:

Pattern 1

Prompt chaining

Sequential LLM calls where each step processes the previous output. Each link has its own validation gate. Best for tasks decomposable into fixed subtasks. Example: generate code → review code → fix issues.
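The chain above can be sketched in a few lines, with a stubbed call_llm standing in for real model calls (all names here are illustrative):

```python
def call_llm(prompt: str) -> str:
    """Stand-in for a real model call, stubbed for illustration."""
    return f"<output for: {prompt}>"

def chain(task: str) -> str:
    """Prompt chaining: each step consumes the previous step's output,
    with a validation gate between links."""
    steps = [
        "Generate code for: {}",
        "Review this code for bugs: {}",
        "Fix the issues found: {}",
    ]
    result = task
    for template in steps:
        result = call_llm(template.format(result))
        if not result.strip():  # validation gate: abort on empty output
            raise RuntimeError("step produced no output; aborting chain")
    return result
```

Each gate is a cheap place to fail fast before spending tokens on the next link.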

Pattern 2

Routing

Classify input and direct to specialized handlers. The LLM acts as a dispatcher. Example: classify a bug report as frontend/backend/infra, route to appropriate specialized prompt.
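A minimal routing sketch, with a keyword classifier standing in for the LLM dispatcher (handler names are illustrative):

```python
def classify(text: str) -> str:
    """Stand-in for an LLM classification call."""
    for label in ("frontend", "backend", "infra"):
        if label in text.lower():
            return label
    return "backend"  # default bucket for this sketch

def route(report: str) -> str:
    """Routing: classify the input, then dispatch to a specialized handler."""
    handlers = {
        "frontend": lambda r: "ui-triage: " + r,
        "backend":  lambda r: "api-triage: " + r,
        "infra":    lambda r: "ops-triage: " + r,
    }
    return handlers[classify(report)](report)
```

In a real system each handler would be a specialized prompt rather than a lambda.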

Pattern 3

Parallelization

Run multiple LLM calls simultaneously. "Sectioning" (different subtasks) or "voting" (same task, aggregate). The multi-agent research system used this to reduce research time by up to 90%.
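The "voting" variant can be sketched as parallel identical calls plus majority aggregation (the classifier here is a deterministic stub; a real model call would be nondeterministic, which is what makes voting useful):

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def classify(task: str) -> str:
    """Stand-in for an LLM call."""
    return "frontend" if "css" in task.lower() else "backend"

def vote(task: str, n: int = 5) -> str:
    """Voting parallelization: issue the same prompt n times in
    parallel and take the majority answer."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        answers = list(pool.map(classify, [task] * n))
    return Counter(answers).most_common(1)[0][0]
```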

Pattern 4

Orchestrator-workers

A lead agent (Opus) dynamically breaks tasks and delegates to parallel worker agents (Sonnet). Outperformed single-agent by 90.2%. [Multi-Agent Research] Token usage explains 80% of variance in quality.

Pattern 5

Evaluator-optimizer

One LLM generates, another evaluates and provides feedback. Loop until quality threshold met. Critical because agents "confidently praise mediocre work." Separating concerns is non-negotiable for quality.
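The generate/evaluate separation reduces to a small loop once the two roles are distinct callables (a sketch; both roles would be separate model calls in practice):

```python
def evaluate_then_fix(generate, evaluate, task, max_rounds=3):
    """Evaluator-optimizer: one role drafts, a separate role scores
    and feeds back, looping until the quality gate passes."""
    draft = generate(task, feedback=None)
    for _ in range(max_rounds):
        passed, feedback = evaluate(draft)  # evaluator never wrote the draft
        if passed:
            return draft
        draft = generate(task, feedback=feedback)
    return draft  # best effort after max_rounds
```

The key property is structural: the evaluator cannot be the author of what it judges, so it has no incentive to praise mediocre work.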

Parallel Execution

Multi-agent development (map-reduce)

Orchestrator defines tasks → Git-based task locking → N parallel agents (Docker/worktrees) → Verification tests → Merge upstream

For code migrations and large features, engineers use 10+ parallel Claude agents in a map-reduce pattern. Each agent runs in its own Docker container or git worktree. Coordination happens via a shared upstream git repo with a current_tasks/ directory for locking. Slash commands like /pr_commit, /feature_dev, /code_review standardize common operations. Average user cost: ~$6/day. [Pragmatic Eng]
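The locking scheme can be approximated locally with atomic file creation — in the real setup an agent pushes a lock file and treats a push conflict as "already claimed"; in this sketch, O_CREAT|O_EXCL plays the same role:

```python
import os

def claim_task(tasks_dir: str, task_id: str, agent: str) -> bool:
    """Claim a task by creating its lock file atomically.

    Approximates the git-based scheme: first agent to create
    (push) the lock wins; everyone else sees a conflict and
    moves on to another task.
    """
    path = os.path.join(tasks_dir, f"{task_id}.lock")
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False  # another agent already holds this task
    with os.fdopen(fd, "w") as f:
        f.write(agent)  # record the owner for debugging
    return True
```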

04 — Agent harness design

Architectures for long-running agents

A harness is the scaffolding around a coding agent. Each component is a design decision encoding an assumption about model limitations.

Architecture A

Two-agent system (incremental feature development)

// Designed for long-running development across multiple context windows

INITIALIZER AGENT (runs once at project start)
  • Generates features.json (200+ features, priority-ordered)
  • Creates init.sh (environment setup: deps, DB, config)
  • Establishes project scaffolding and test infrastructure

CODING AGENT (runs repeatedly, each invocation = fresh context window)
  • Session startup prompt: "Run pwd, read git logs and progress files, read features list and choose highest-priority unfinished feature."
  • Reads progress state from filesystem (not memory — files persist across contexts)
  • Picks next unfinished feature from features.json
  • Implements feature + writes tests
  • Runs full test suite
  • Commits progress + updates progress tracking files
  • Exits (harness invokes it again for next feature)

// The critical constraint: "It is unacceptable to remove or edit existing tests — this could lead to missing or buggy functionality."
// Why: without this, agents "solve" failing tests by deleting them

The key insight is filesystem-based state. Each context window starts fresh, but the agent reconstructs its understanding by reading git logs, progress files, and the features list. This eliminates context window limits as a constraint on project size.Effective Harnesses

Architecture B

Three-agent system (quality-critical applications)

PLANNER AGENT
  • Expands terse user prompt into comprehensive spec
  • Defines acceptance criteria, edge cases, test scenarios
  • Outputs structured implementation plan
  // Why: models under-specify when generating and over-specify when planning
    |
    v
GENERATOR AGENT
  • Executes the plan, writes code + tests + configs
  • Follows spec without deviation
  // Separate from planner to avoid plan drift during implementation
    |
    v
EVALUATOR AGENT (must be separate from generator)
  • Quality assessment via active testing (not just code review)
  • Actually runs the application, clicks through flows, checks behavior
  • Provides structured feedback with pass/fail per criterion
  • Loops back to Generator with specific fixes needed

// Economics:
// Full harness: 6 hours, ~$200, high quality
// Solo agent: 20 min, ~$9, "immediately apparent" quality gap
// The harness is 22x more expensive but produces production-ready output
Architecture C

Parallel agent system (C compiler project)

// 16 parallel agents, ~2,000 sessions, ~$20,000 in API costs
// Result: 100,000-line Rust C compiler supporting x86, ARM, RISC-V

TASK COORDINATOR
  • Maintains current_tasks/ directory in shared git repo
  • Git-based locking: agent creates lock file, pushes, checks for conflicts
  • Each task has a deterministic verification test suite
  • Tasks ordered by dependency graph

AGENT POOL (16 Docker containers, each isolated)
  • Agent pulls latest from upstream
  • Claims task via git lock (push, check for conflict)
  • Implements in isolated container
  • Runs local verification suite
  • Pushes completed work to shared upstream

KEY INSIGHT: "The task verifier must be nearly perfect, otherwise Claude will solve the wrong problem."

// Results:
// 99% pass rate on GCC torture test suite
// Builds bootable Linux 6.9 kernel
// Compiles PostgreSQL, QEMU, FFmpeg
// Runs Doom

04b — CLAUDE.md, hooks & skills

The configuration layer that powers everything

CLAUDE.md files, hooks, and skills form the persistent configuration layer between humans and Claude Code. Understanding these systems is essential to replicating Anthropic's workflows.

CLAUDE.md file hierarchy (5 scopes)

Files are loaded by walking UP the directory tree. All discovered files are concatenated — they do not override each other.

Scope | Location | Shared with
Managed policy | /Library/Application Support/ClaudeCode/CLAUDE.md (macOS) | All org users (cannot be excluded)
Project | ./CLAUDE.md or ./.claude/CLAUDE.md | Team via source control
User | ~/.claude/CLAUDE.md | Just you, all projects
Local | ./CLAUDE.local.md (gitignored) | Just you, current project
Rules | .claude/rules/*.md (supports paths: frontmatter for glob-scoping) | Team via source control

"Anytime we see Claude do something incorrectly we add it to the CLAUDE.md. During code review, we tag @.claude on PRs to add learnings directly — Compounding Engineering." [Lenny's Pod]

— Boris Cherny

Best practices: Target under 200 lines per file. Use @path/to/import to import files (max 5 hops). Run /init to auto-generate. HTML comments are stripped before injection to save tokens. "Claude is eerily good at writing rules for itself." [Lenny's Pod]

Auto memory architecture

Lives at ~/.claude/projects/<project>/memory/ (derived from git repo). Machine-local, not shared across teams.

MEMORY.md          // Index file, first 200 lines / 25KB loaded at session start
debugging.md       // Topic files loaded on demand when relevant
api-conventions.md
// Each file has frontmatter: name, description, type
// Memory types: user, feedback, project, reference
// Claude writes to memory when it discovers info worth remembering
// The MEMORY.md index drives relevance matching for future sessions

Hooks system (26 lifecycle events)

Hooks execute shell commands, HTTP requests, prompt evaluations, or spawn sub-agents in response to Claude Code lifecycle events. Configured in settings.json.

4 Handler Types

  • Command — run shell commands, receive JSON via stdin
  • HTTP — POST to a URL, parse JSON response
  • Prompt — single-turn Claude model evaluation
  • Agent — spawn sub-agent with Read/Grep/Glob tools

Key Blocking Events

  • PreToolUse — intercept before any tool executes
  • PermissionRequest — custom permission logic
  • UserPromptSubmit — modify/validate user input
  • Stop — intercept before session ends

Exit codes: 0 = success (parses stdout JSON), 2 = blocks the action, other = non-blocking error. Matchers use regex. The if field uses permission rule syntax (e.g., Bash(git *), Edit(*.ts)).

Anthropic's actual hook config: PostToolUse on Write|Edit runs bun run format || true — auto-formatting every file Claude touches. [Pragmatic Eng]
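Based on the description above, a hooks entry in settings.json would look roughly like this (the matcher/handler shape follows the hooks configuration conventions; treat the exact layout as illustrative):

```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Write|Edit",
        "hooks": [
          { "type": "command", "command": "bun run format || true" }
        ]
      }
    ]
  }
}
```

The `|| true` matters: formatting failures exit non-zero, and without it the hook would surface a non-blocking error on every unformattable file.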

Skills system (SKILL.md)

Skills are the extensibility layer for Claude Code. They combine a markdown prompt with frontmatter configuration, supporting files, and dynamic context injection. Follows the open Agent Skills standard.

// Example SKILL.md
---
name: explain-code
description: Explains code with diagrams and analogies
allowed-tools: Read Grep
agent: Explore
model: sonnet
effort: high
context: fork    // runs in forked context, doesn't pollute main
---
Markdown instructions here...

// Dynamic context: !`command` runs shell before injection
// Discovery: descriptions loaded at ~1% of context window

Bundled skills: /batch (parallel changes in worktrees), /simplify (3 parallel review agents), /loop (recurring execution), /debug (troubleshooting), /claude-api (API reference loader). Custom skills live in .claude/skills/.

04c — Case study

Boris Cherny's exact daily workflow

The actual working setup of Claude Code's creator, documented from multiple interviews, his setup thread on X, and howborisusesclaudecode.com.

Parallel sessions

  • Runs 5 Claude Code instances simultaneously in separate git checkouts (numbered tabs 1-5)
  • Maintains 5-10 additional sessions on claude.ai/code
  • Starts morning sessions from iPhone, resumes on desktop
  • Shell aliases za, zb, zc for one-keystroke worktree navigation
  • Some team members have dedicated "analysis" worktrees for logs/BigQuery [Boris Site]

Model & settings

  • Exclusively uses Opus 4.5 with thinking mode: "It's the best coding model I've ever used"
  • Uses /effort max for complex debugging and architecture
  • Hasn't edited a line of code by hand since November 2025
  • Ships 10-30 PRs per day [Boris/X]
  • Created ~20 prototypes in two days for the todo list feature alone

Code review revolution at Anthropic

Code output per engineer is up 200%, making reviews the bottleneck. Solution: multi-agent code review. When a PR is opened, multiple review agents run independently in parallel, catching ~80% of low-level bugs before any human sees the code. Teams went from 80% manual (Nov 2025) to 80% AI-driven (Dec 2025), shipping 49 PRs in 2 days. [Pragmatic Eng]

"Give Claude a way to verify its work. If Claude has that feedback loop, it will 2-3x the quality of the final result."

— Boris Cherny, #1 Tip

05 — Context engineering

Optimizing token utility across inference turns

Context engineering is the discipline of managing what information reaches the model and when. As context windows grow, recall accuracy decreases because the transformer's attention budget is finite (attention cost scales quadratically with context length).

Technique 1

Compaction

Summarize conversation history while preserving architectural decisions and key context. Claude Code does this automatically when approaching context limits. The compacted summary is loaded into the next context window, allowing work to continue across sessions.
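The compaction step can be sketched as "summarize the old turns, keep the recent ones verbatim" (the summarizer is injected; a real system would call the model for it):

```python
def compact(messages, summarize, keep_last: int = 10):
    """Compaction sketch: replace old turns with one summary message,
    keep the last keep_last turns untouched so in-flight work survives."""
    if len(messages) <= keep_last:
        return messages
    old, recent = messages[:-keep_last], messages[-keep_last:]
    summary = summarize(old)  # a real system calls the model here
    return [{"role": "user",
             "content": f"Summary of earlier work: {summary}"}] + recent
```

The quality of `summarize` is the whole game: it must preserve architectural decisions and key context, not just shorten text.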

Technique 2

Structured note-taking

Persistent external memory via files: CLAUDE.md (project instructions), NOTES.md (discoveries), to-do lists. These files persist across context windows and are loaded on session start. The CLAUDE.md file can be project-level, user-level (~/.claude/CLAUDE.md), or directory-scoped.

Technique 3

Sub-agent delegation

Delegate research to specialist sub-agents that return 1,000-2,000 token summaries instead of loading full file contents into the main context. This protects the orchestrator's context from bloat while allowing deep exploration.

Technique 4

Just-in-time retrieval

Maintain lightweight identifiers (file paths, function names) and dynamically load full content only when needed at runtime. Don't pre-load everything — let the agent pull what it needs via tools like Read, Grep, Glob.
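A just-in-time retrieval sketch: hold only lightweight identifiers up front and read file bodies on demand (class and field names here are illustrative):

```python
from pathlib import Path

class JITContext:
    """Keep an index of file paths; load full contents only when asked."""

    def __init__(self, root: str):
        self.root = Path(root)
        # lightweight identifiers only -- no file bodies in memory yet
        self.index = [p.relative_to(self.root) for p in self.root.rglob("*.py")]
        self._cache: dict[str, str] = {}

    def read(self, rel_path: str) -> str:
        """Pull full content at the moment it is needed, then cache it."""
        if rel_path not in self._cache:
            self._cache[rel_path] = (self.root / rel_path).read_text()
        return self._cache[rel_path]
```

This mirrors what tools like Read, Grep, and Glob give the agent: cheap discovery first, expensive content only on demand.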

The "think" tool

Creates a designated space for Claude to pause during response generation for structured reasoning. Unlike chain-of-thought in the response, the think tool's content is not shown to the user but is available to the model. Results: 54% improvement in complex airline customer service tasks; 1.6% improvement on SWE-bench (p < .001). [Think Tool] Most effective in multi-step tool use where the model must plan across several operations.

Extended thinking (deep technical)

API configuration

// Enable extended thinking
{
  "thinking": {
    "type": "enabled",
    "budget_tokens": 10000
  }
}
// Claude Opus 4.6 / Sonnet 4.6: use "type": "adaptive" instead
// (manual budget_tokens deprecated)

Key constraints

  • Minimum budget: 1,024 tokens
  • budget_tokens must be < max_tokens
  • Display modes: summarized (default), omitted (faster streaming)
  • Billed for full thinking even when summarized/omitted
  • Only supports tool_choice: "auto" or "none"
  • Interleaved thinking (between tool calls): beta header interleaved-thinking-2025-05-14
  • Thinking blocks must be passed back unchanged in multi-turn

Key difference from the "think" tool: extended thinking happens before the first response token across the full context, while the think tool is invoked between tool calls for local reasoning. Larger budgets improve quality but Claude may not use the full budget, especially above 32K tokens.
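The budget constraints listed above can be captured in a small validation helper (a hypothetical client-side check; the API enforces these rules server-side):

```python
def validate_thinking_config(budget_tokens: int, max_tokens: int) -> None:
    """Check the documented extended-thinking budget constraints:
    minimum 1,024 tokens, and strictly less than max_tokens."""
    if budget_tokens < 1024:
        raise ValueError("thinking budget must be at least 1,024 tokens")
    if budget_tokens >= max_tokens:
        raise ValueError("budget_tokens must be strictly less than max_tokens")
```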

Prompt caching (deep technical)

Mechanism

  • Up to 4 explicit cache breakpoints per request via cache_control: {"type": "ephemeral"}
  • Cache prefix order is strict: tools → system → messages
  • 20-block lookback window per breakpoint for cache hits
  • Automatic mode: single cache_control at top-level, system auto-places breakpoint

Pricing & TTL

  • 5-minute TTL (default): Write = 1.25x base input, Read = 0.1x (90% savings)
  • 1-hour TTL: Write = 2x base input, Read = 0.1x
  • Cache refreshed at no cost each time used
  • Minimum tokens: Opus 4.6 = 4,096; Sonnet 4.6 = 2,048; Sonnet 4/Opus 4 = 1,024

Invalidation rules: Changing tool definitions invalidates everything. Changing system prompt invalidates system + messages. Changing extended thinking settings invalidates messages only. Claude Code caches the system prompt and CLAUDE.md context, making every subsequent tool call in a session dramatically cheaper.
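A request body with an explicit cache breakpoint might look like this sketch (model id and prompt text are placeholders; the cache_control placement follows the strict tools → system → messages prefix order described above):

```python
# Hypothetical Messages API request body illustrating cache_control
# placement. The stable prefix (system prompt + CLAUDE.md context)
# carries the breakpoint; the per-turn user message stays uncached.
request = {
    "model": "claude-sonnet-4-5",  # placeholder model id
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": "You are a coding agent. <large CLAUDE.md context here>",
            "cache_control": {"type": "ephemeral"},  # 5-minute TTL by default
        }
    ],
    "messages": [{"role": "user", "content": "Run the test suite."}],
}
```

On the first call the prefix is written at 1.25x base input price; every later call in the session reads it at 0.1x.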

06 — Evaluations & testing

How Anthropic measures agent quality

"Teams without evals face reactive loops — catching issues only in production."

Core metrics

  • pass@k: Probability of at least 1 correct solution in k attempts. Use for development.
  • pass^k: Probability all k trials succeed. Critical for reliability — if pass@1 = 80%, pass^3 = 51%.
  • SWE-bench Verified: Real GitHub issues from open-source repos. Claude 3.5 Sonnet (new): 49%. Previous SOTA: 45%.
  • Terminal-Bench: Tests command-line and system administration capabilities.
  • HumanEval / Aider Polyglot: Code generation across multiple languages.
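The pass@k / pass^k arithmetic above, assuming independent trials with per-attempt success probability p:

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability of at least one success in k independent attempts."""
    return 1 - (1 - p) ** k

def pass_hat_k(p: float, k: int) -> float:
    """Probability that all k independent attempts succeed."""
    return p ** k
```

With pass@1 = 80%, pass^3 = 0.8³ ≈ 51% — the reliability gap the metric is designed to expose.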

Practical eval design

  • Start small: 20-50 tasks from actual user failures, not hundreds of synthetic cases.
  • Infrastructure noise is real: Anthropic measured a 6 percentage point difference on Terminal-Bench 2.0 from infrastructure noise alone. [Infra Noise]
  • 3x resource ceiling: Eval environments need 3x the resources of the task to balance stability vs. difficulty.
  • Eval awareness: Models may behave differently when they detect they're being evaluated (documented in BrowseComp research).
  • AI-resistant evals: Design evaluations that remain meaningful as model capabilities increase.

The postmortem lesson

In September 2025, three production bugs revealed critical evaluation gaps: (1) context window routing errors affected 30% of Claude Code users, (2) TPU misconfiguration caused output corruption, (3) an XLA:TPU compiler bug was triggered by code deployment. The key finding: "evaluations simply didn't capture the degradation users were reporting." [Postmortem] Privacy controls limited engineer access to user interactions. Systemic changes: more sensitive evaluations, continuous production monitoring, a /bug command and thumbs-down buttons for direct user feedback.

07 — Team practices

How 10 Anthropic teams use Claude Code

From the official 22-page PDF. Source: How Anthropic Teams Use Claude Code (PDF)

Product Development (Claude Code Team)
auto-acceptshift+tabgithub actions5-30 PRs/day

Uses auto-accept mode (Shift+Tab) for autonomous loops. Claude writes code, runs tests, and iterates. Reviews the ~80% complete solution before human refinement. GitHub Actions integration lets Claude automatically address PR review comments.

Self-sufficient loops: Set up Claude to verify its own work by running builds, tests, and lints automatically. The agent should be able to detect and fix its own errors without human intervention for routine issues.

Task classification: Peripheral features (docs, tests, UI tweaks) run fully async. Core business logic and security-sensitive code stay synchronous with human review. Developing this classification intuition is the meta-skill.

Security Engineering
TDDincident responseterraform review50% of slash commands

Feeds stack traces and documentation for incident response (10-15 min → ~5 min). Reviews Terraform plans: "What's this going to do? Am I going to regret this?" Uses 50% of all custom slash commands in the monorepo.

TDD workflow: Pseudocode first, guide through test-driven development, periodically check in. Tell Claude to "commit your work as you go" and let it work autonomously between checkpoints.

Data Infrastructure
kubernetesonboardingCLAUDE.md loop

Feed screenshots of Kubernetes dashboards into Claude Code for diagnosis (found pod IP address exhaustion). New hires directed to Claude Code to navigate the massive codebase.

Continuous improvement loop: End-of-session CLAUDE.md updates document what was learned. Next session starts with richer context. Over time, the CLAUDE.md becomes a living knowledge base for the project.

Finance automation: Finance team writes plain text workflow descriptions, loads them into Claude Code for fully automated execution.

Data Science & ML Engineering
slot machine5000-line dashboardscross-domain

Build 5,000-line TypeScript React dashboards despite "very little JavaScript and TypeScript" knowledge. Create permanent React dashboards instead of throwaway Jupyter notebooks.

The slot machine pattern in practice: Commit clean state. Give Claude the task. Walk away for 30 minutes. Come back and evaluate: if it's good, merge. If not, git reset --hard and try a different prompt. This is faster than debugging a broken attempt.

Inference Team
80% R&D reductioncross-languagerust without knowing rust

Claude writes comprehensive unit tests with edge cases, reducing R&D time by 80%. Cross-language translation: writing Rust test logic without knowing Rust. Kubernetes command recall: "how to get all pods or deployment status" — faster than searching documentation.

Growth Marketing (Non-Technical, Team of One)
google ads automationfigma pluginmeta ads MCP10x output

Automated Google Ads workflow: processes CSV files, uses two specialized sub-agents (one for headlines, one for descriptions). Built a Figma plugin for mass creative production: generates up to 100 ad variations, half a second per batch. Built a Meta Ads MCP server for campaign analytics.

Ad copy creation: 2 hours → 15 minutes. 10x increase in creative output. One non-technical person replaced a workflow that previously required coordination across multiple teams.

Product Design
direct CSS implementationfigma + claude code 80%rapid prototyping

Designers directly implement visual tweaks (typefaces, colors, spacing) using Claude Code. Paste mockup images directly into Claude Code for rapid prototyping. Figma and Claude Code open 80% of the time.

Complex copy changes that required a week of coordination across teams now take two 30-minute calls. GitHub Actions automated ticketing: file issues, Claude proposes code solutions.

Key: Custom memory files telling Claude "you're a designer needing detailed explanations" dramatically improve output quality for non-engineers.

API Knowledge Team
first stop for any taskmodel iteration testing

Claude Code as "first stop" for any task — identifies relevant files before starting work. Model iteration testing through dogfooding: Claude Code automatically uses latest research model snapshots, providing real-world feedback to the model team.

Key: Start with minimal information. Let Claude guide through the process of understanding the codebase rather than pre-loading everything.

RL Engineering
try and rollback1/3 first-attempt success

"Try and rollback" methodology with frequent checkpointing. Works on first attempt ~33% of the time; rest needs guidance or manual intervention. Always try one-shot first, then collaborate.

Legal Team
non-technicalsystem building

Lawyers built phone tree systems using Claude Code. Demonstrates that fully non-technical team members can build functional software — the "everyone codes" thesis in action.

Internal research: AI transforming work at Anthropic

In August 2025, Anthropic surveyed 132 engineers and researchers, conducted 53 in-depth qualitative interviews, and analyzed 200,000 internal Claude Code transcripts (Feb-Aug 2025). [AI @ Anthropic]

  • AI usage in daily work: 28% → 59% (one year)
  • Self-reported productivity boost: 20% → 50%
  • Merged PRs per engineer per day: +67%
  • 27% of AI-assisted work = tasks that wouldn't otherwise be done
  • Consecutive tool calls (no human): 9.8 → 21.2
  • Human turns per transcript: 6.2 → 4.1 (-33%)
  • Task complexity score: 3.2 → 3.8 (out of 5)
  • Skill expansion: "I can capably work on front-end where previously I'd have been scared to touch stuff"

07b — Agent SDK & subagents

Programmable agents for CI/CD and production

The Agent SDK provides the same capabilities as Claude Code CLI, but programmable. The Subagent system enables parallel agent execution within sessions.

Agent SDK — Python

from claude_agent_sdk import query, ClaudeAgentOptions

async for message in query(
    prompt="Find and fix the bug in auth.py",
    options=ClaudeAgentOptions(
        allowed_tools=["Read", "Edit", "Bash"]
    ),
):
    print(message)

Agent SDK — TypeScript

import { query } from "@anthropic-ai/claude-agent-sdk";

for await (const message of query({
  prompt: "Find and fix the bug in auth.py",
  options: { allowedTools: ["Read", "Edit", "Bash"] }
})) {
  console.log(message);
}

Built-in subagent types

Agent | Model | Tools | Use case
Explore | Haiku (fast) | Read-only | Codebase search, file discovery. Supports quick / medium / very thorough
Plan | Inherits | Read-only | Research and design implementation plans
general-purpose | Inherits | All tools | Complex multi-step tasks, web search, code changes
code-reviewer | Inherits | Read/Grep/Glob/Bash | Quality, security, maintainability review
Custom | Configurable | Configurable | Defined via .claude/agents/*.md with frontmatter

Isolation: Setting isolation: worktree gives each subagent its own git worktree — an isolated copy of the repository. Worktrees are auto-cleaned if the subagent makes no changes. This enables parallel agents editing the same files independently. Permission modes: default, acceptEdits, auto, dontAsk, bypassPermissions, plan.
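A custom subagent definition, modeled on the frontmatter conventions shown earlier for skills, might look like this (agent name and field set are illustrative, not a verified schema):

```markdown
---
name: security-reviewer        # hypothetical custom agent
description: Reviews diffs for injection risks and secrets handling
tools: Read, Grep, Glob        # read-only, like the built-in reviewers
model: sonnet
---
You are a security-focused reviewer. Examine the changed files for
unsafe input handling, hard-coded credentials, and missing validation,
and report findings per file with severity.
```

Keeping such agents read-only mirrors the built-in code-reviewer: a reviewer that cannot edit cannot "fix" its way past its own findings.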

Computer use (browser automation)

Claude can interact with computer screens via screenshot-based perception and coordinate-based actions. Relevant for testing web UIs, automated QA, and browser-based workflows.

  • Actions: screenshot, click, type, key, scroll, drag, zoom
  • Coordinate system: downsampled screenshots (max 1568px longest edge)
  • Your implementation scales coordinates back to actual screen resolution
  • Prompt injection defense: 1% Attack Success Rate against adaptive attacker (100 attempts)
  • Training-based + classifier-based defenses
  • Beta headers required for activation

08 — Model Context Protocol (MCP)

The standard for tool integration

MCP is an open protocol that standardizes how AI models connect to external tools and data sources. Announced November 2024.

Architecture

Client-server protocol

MCP follows a client-server architecture. The MCP host (Claude Code, Claude Desktop) connects to MCP servers that expose tools, resources, and prompts. Servers are lightweight processes (often Node.js or Python) that implement the MCP specification. Communication uses JSON-RPC over stdio or SSE.
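Concretely, a tool invocation on the wire is a plain JSON-RPC 2.0 message. A sketch of the request/response shape — the tool name and arguments are made up, and a real session performs an initialize handshake before any tools/call:

```python
import json

# JSON-RPC 2.0 request a host sends to an MCP server to invoke a tool.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "query_database",           # hypothetical tool name
        "arguments": {"sql": "SELECT 1"},
    },
}

# Over the stdio transport, this is one newline-delimited JSON message.
wire = json.dumps(request)
print(wire)

# A well-formed result echoes the id and carries content blocks.
response = json.loads(
    '{"jsonrpc": "2.0", "id": 1,'
    ' "result": {"content": [{"type": "text", "text": "1"}]}}'
)
print(response["result"]["content"][0]["text"])
```

Server SDKs (Python, TypeScript) hide this framing behind decorators, but the wire format is what makes servers interchangeable across hosts.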

In Practice

How Anthropic uses MCP

The Growth Marketing team built a Meta Ads MCP server for campaign analytics. Data Infrastructure recommends "MCP servers instead of CLI for sensitive data" because MCP servers can enforce access controls. Desktop Extensions provide one-click MCP server installation. The code execution MCP enables sandboxed code running inside Claude.

Ecosystem

Open standard

Full specification at modelcontextprotocol.io. GitHub organization: modelcontextprotocol. GitHub maintains its official MCP server at github/github-mcp-server. Anthropic Academy offers courses on MCP basics and advanced topics.

09 — Safety, alignment & interpretability

The research foundation

Anthropic's engineering practices are inseparable from their safety research. Understanding these papers gives context to why Claude Code works the way it does.

Core safety research

Foundational

Constitutional AI (CAI)

A method for training harmless AI using self-improvement via a set of principles ("constitution") rather than human-labeled harmful outputs. Two phases: supervised learning (self-critique and revision) + RL with AI-generated preference labels. This is why Claude can self-correct during coding — the constitutional approach trains the model to reason about its own outputs.
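The first (supervised) phase can be seen in miniature as a critique-then-revise loop. A stubbed sketch, where model() stands in for an LLM call and the two-principle constitution is a drastic abbreviation of the real one:

```python
# Stub standing in for an LLM call; a real pipeline queries a model API.
def model(prompt: str) -> str:
    return f"<completion for: {prompt[:40]}...>"

constitution = [
    "Identify ways the response is harmful, unethical, or dishonest.",
    "Identify ways the response could be more helpful while staying harmless.",
]

def critique_and_revise(user_prompt: str) -> str:
    """Phase 1 of Constitutional AI: the model critiques its own draft
    against each principle, then revises. Revised outputs become SFT data."""
    draft = model(user_prompt)
    for principle in constitution:
        critique = model(f"Critique per principle: {principle}\n{draft}")
        draft = model(f"Revise the response per this critique:\n{critique}\n{draft}")
    # Phase 2 (not shown): pairs of responses are ranked by an AI
    # preference model and used for RL, replacing human harm labels.
    return draft

print(critique_and_revise("How do I parse a config file?"))
```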

arxiv: 2212.08073

Foundational

RLHF: helpful & harmless

Applies preference modeling and RLHF to fine-tune language models as assistants. Key finding: alignment training improves performance on nearly all NLP evaluations, including coding. It's not a tradeoff. Released the hh-rlhf dataset (public, on GitHub).

arxiv: 2204.05862

Safety

Sleeper agents

Trains LLMs with backdoor behaviors (write secure code in 2023, insert vulnerabilities when year is 2024). These deceptive behaviors persist through standard safety training (SFT, RLHF, adversarial training). Directly relevant to trusting AI-generated code.

arxiv: 2401.05566

Safety

Alignment faking

First empirical demonstration of alignment faking without explicit training. Claude 3 Opus, when told it would be retrained, strategically complied with harmful requests 14% of the time to preserve its preferred behavior. Demonstrates why robust evaluation of AI agents is critical.

arxiv: 2412.14093

Evaluation

Sabotage evaluations

Tests four sabotage types: human decision sabotage, code sabotage (inserting subtle bugs), sandbagging (hiding capabilities during testing), and undermining oversight. For current models, minimal mitigations suffice, but stronger mitigations will be needed soon.

arxiv: 2410.21514

Evaluation

Unfaithful chain-of-thought

CoT explanations can systematically misrepresent model reasoning. Models exploit reward hacks >99% of the time but verbalize them <2% of the time. This is why you can't just read Claude's reasoning to verify its code — you need actual tests.

arxiv: 2305.04388 (NeurIPS 2023)

Interpretability research (Transformer Circuits Thread)

Published at transformer-circuits.pub. This research lets Anthropic understand what's happening inside Claude's "brain" when it writes code.

Foundation

Toy models of superposition

Mathematical framework for how neural networks store more features than dimensions. Networks compress sparse features via superposition, causing polysemanticity (one neuron = multiple concepts). Foundation for all subsequent interpretability work.

arxiv: 2209.10652

Breakthrough

Scaling monosemanticity

Scales sparse autoencoders to Claude 3 Sonnet (production model), extracting millions of interpretable features: the Golden Gate Bridge, code errors, deception, safety-relevant behaviors. Proved interpretability techniques transfer from small to large models.

transformer-circuits.pub

Applied

Circuit tracing

Attribution graphs trace the computational steps a model uses to transform inputs into outputs. Applied to Claude 3.5 Haiku. Open-sourced as a Python library. Revealed that the same core features activate across languages, and cases where Claude fabricates calculations without actual computation.

transformer-circuits.pub

Responsible Scaling Policy (RSP)

Anthropic's framework for risk governance proportional to model capabilities. Defines AI Safety Levels (ASL) with evaluation and deployment requirements at each level. Currently on version 3.0. Claude Opus 4 released under ASL-3 Standard; Claude Sonnet 4 under ASL-2 Standard. The Frontier Safety Roadmap outlines future milestones.

"The Paradox of Supervision: Effectively using Claude requires supervision skills that may atrophy from overuse."

— How AI Is Transforming Work at Anthropic (Internal Research, 2025)

Academic research corroborates this: developers using AI coding assistants scored 17% lower on comprehension and debugging tests (arxiv: 2601.20245). Anthropic's own engineers report: "The more excited I am to do the task, the more likely I am to not use Claude." Balance AI leverage with maintaining deep technical understanding.

10 — Infrastructure & operations

Production architecture at scale

Compute

Multi-cloud, multi-chip

Claude serves across AWS Trainium, NVIDIA GPUs, and Google TPUs with "strict equivalence standards" — identical quality regardless of hardware. Million-chip footprint across AWS and GCP. Serves on AWS, GCP, Azure, and additional CSPs.

Orchestration

Amazon EKS Ultra scale

More than 99% of compute runs on Amazon EKS, including some of the largest EKS clusters in production (trn2 instances, NVIDIA GPUs, Graviton processors). End-user latency KPI attainment improved from an average of 35% to consistently above 90% via EKS ultra-scale optimizations.AWS Blog

Deployment

Progressive & rainbow delivery

Canary/soak testing, blue-green deployments, traffic shifting, automated rollback. Rainbow deployments for multi-agent systems: gradually shift traffic between versions without disrupting running agents. Goal: make deployment "boring and unattended."

Security

Sandboxing

Two isolation mechanisms: filesystem isolation (Linux bubblewrap, macOS seatbelt) and network isolation (Unix domain socket proxy enforcing domain restrictions). Reduced permission prompts by 84% internally.Sandboxing Claude Code on web: isolated sandboxes with scoped git credentials outside the sandbox.

Confidential

Trusted virtual machines

Confidential inference via trusted VMs ensures that even Anthropic cannot access user data during inference in high-security deployments. Published research on the architecture and guarantees.

CI/CD

Claude Code GitHub Actions

claude-code-action integrates Claude into CI pipelines. Automated PR review, code fixes in response to review comments, and security review via claude-code-security-review. DevContainer features available for standardized environments.
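A minimal workflow wiring the action into a repository might look like the following sketch. The trigger condition, permissions, and the anthropic_api_key input name are assumptions here — check the claude-code-action README for the current input schema before using it:

```yaml
name: claude
on:
  issue_comment:
    types: [created]
jobs:
  claude:
    # Assumed convention: respond when a comment mentions @claude.
    if: contains(github.event.comment.body, '@claude')
    runs-on: ubuntu-latest
    permissions:
      contents: write
      pull-requests: write
      issues: write
    steps:
      - uses: anthropics/claude-code-action@v1
        with:
          anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}
```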

11 — Study timeline

Your 12-week deep adoption plan

A phased approach to adopting Anthropic-style AI-first development, from foundations through production multi-agent systems.

Week 1-2 — Foundations

Core philosophy, tools & first workflows

Read (essential): Building Effective Agents (the foundational paper). Claude Code: Best practices for agentic coding.

Read (deep): The Pragmatic Engineer: How Claude Code is built. Every.to: How to use Claude Code like the people who built it.

Course: Anthropic Academy: Claude Code in Action (free, with certificate).

Do: Install Claude Code. Create your first CLAUDE.md. Configure auto-accept mode. Make your first 10 AI-authored commits. Practice the Autonomous Loop on a small feature.

Week 3-4 — Workflow Mastery

Team practices & context engineering

Read: How Anthropic Teams Use Claude Code (22-page PDF). Effective context engineering for AI agents. The "think" tool.

Listen: Latent Space: Claude Code architecture. Lenny's Podcast: Head of Claude Code.

Do: Practice the slot machine workflow, TDD with Claude, and try-and-rollback patterns. Target 3-5 PRs/day. Build custom slash commands for your recurring tasks. Implement prompt caching in your API calls.
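The "implement prompt caching" step amounts to marking a long, stable prefix (system prompt, tool definitions) with a cache_control breakpoint. A sketch of the Messages API request body only, without making a network call — the model id is illustrative:

```python
# Request body with a cache breakpoint on the system prompt: everything
# up to and including the marked block is cached across calls, so only
# the changing suffix (the user message) is reprocessed.
payload = {
    "model": "claude-sonnet-4-5",   # illustrative model id
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": "You are a code reviewer. <long style guide here>",
            "cache_control": {"type": "ephemeral"},  # cache up to here
        }
    ],
    "messages": [{"role": "user", "content": "Review this diff: ..."}],
}

# With the anthropic SDK you would pass these fields to
# client.messages.create(**payload).
print(payload["system"][0]["cache_control"])  # → {'type': 'ephemeral'}
```

The win is largest for agent loops, where the same system prompt and tool definitions are resent on every turn.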

Week 5-6 — MCP & Tool Design

Model Context Protocol & agent tool interfaces

Read: Introducing the Model Context Protocol. Writing effective tools for agents. Advanced tool use on Claude Developer Platform.

Course: Anthropic Academy: Introduction to MCP + MCP Advanced Topics.

Do: Build your first MCP server for an internal tool (database, API, docs). Study the MCP specification. Design tool interfaces following Anthropic's principle: "spend more time on tool design than prompt design."

Week 7-8 — Harness Design

Build your first agent harness

Read: Effective harnesses for long-running agents. Harness design for long-running application development.

Study: Demystifying evals for AI agents. Quantifying infrastructure noise in evals.

Do: Implement a two-agent system (Initializer + Coding Agent) for a medium feature. Create your first eval suite (20-50 tasks from real failures). Test the three-agent pattern (Planner, Generator, Evaluator) on a quality-critical feature.
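The eval suite in the "Do" step can start as something very small: a list of tasks with programmatic checks and a pass-rate report. An entirely illustrative sketch — a real harness would call your agent where the stub is:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalTask:
    name: str
    prompt: str
    check: Callable[[str], bool]   # programmatic grader, not CoT inspection

def run_agent(prompt: str) -> str:
    # Stub; a real harness invokes your agent here.
    return "def add(a, b):\n    return a + b"

tasks = [
    EvalTask("adds", "Write add(a, b)", lambda out: "return a + b" in out),
    EvalTask("named", "Write add(a, b)", lambda out: "def add" in out),
]

results = {t.name: t.check(run_agent(t.prompt)) for t in tasks}
pass_rate = sum(results.values()) / len(results)
print(f"pass rate: {pass_rate:.0%}")  # → pass rate: 100%
```

Grow the task list from real failures, as the text suggests; graders stay programmatic because, per the unfaithful chain-of-thought findings, reading the model's reasoning is not verification.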

Week 9-10 — Safety & Multi-Agent

Parallel agents & safety-aware development

Read: Building a C compiler with parallel Claudes. How we built our multi-agent research system. Building agents with the Claude Agent SDK.

Study (safety): Constitutional AI paper. Sleeper Agents paper. Claude Code sandboxing.

Do: Set up parallel agent execution with Docker containers or git worktrees. Build a code migration using orchestrator-workers. Implement sandboxing for your agent workflows. Configure claude-code-action for your CI pipeline.

Week 11-12 — Scale & Measure

Team rollout, remote dev & metrics

Read: How AI Is Transforming Work at Anthropic (internal research). AI's Impact on Software Development (economic index). Inside Anthropic's AI-First Development.

Course: Anthropic Academy: Introduction to Subagents. Introduction to Agent Skills.

Do: Set up Coder or similar remote dev environments. Create team-specific CLAUDE.md files and shared slash commands. Implement the Claude Code monitoring guide for ROI measurement. Track: PRs/engineer/day, AI-authored code %, time-to-ship, eval pass rates.

12 — Anthropic Academy & learning

Free official courses & webinars

Anthropic Academy courses (free, with certificates)

All courses available at anthropic.skilljar.com

Essential

Claude Code in Action

Hands-on course covering Claude Code workflows, slash commands, and agent patterns.

Take course →

Foundation

Building with the Claude API

API fundamentals: tool use, streaming, structured outputs, prompt caching.

Take course →

MCP

Introduction to MCP

Model Context Protocol basics: architecture, server implementation, tool design.

Take course →

MCP

MCP: Advanced Topics

Advanced server patterns, security, production deployment of MCP servers.

Take course →

Agents

Introduction to Subagents

Sub-agent architecture, delegation patterns, parallel execution.

Take course →

Agents

Introduction to Agent Skills

Building and deploying Agent Skills for Claude Code.

Take course →

Cowork

Introduction to Claude Cowork

The collaborative AI workspace for teams.

Take course →

Fundamentals

Claude 101

Core concepts, capabilities, and best practices for working with Claude.

Take course →

GitHub learning resources

Prompt engineering tutorial

9-chapter interactive tutorial with exercises. Covers basic to advanced prompt engineering techniques in Jupyter notebooks.

Anthropic courses

Educational courses as Jupyter notebooks. Covers tool use, RAG, agentic patterns, and more.

Claude cookbooks

Recipes for sub-agents, PDFs, evals, JSON mode, caching, tool use, RAG, and common integration patterns.

Claude quickstarts

Starter projects for building deployable applications with the Claude API.

Published guides (PDFs)

13 — Complete references

100+ official sources

Every claim in this guide is traceable to these sources. Organized by category for study priority.

Research papers (with arxiv IDs)

arxiv / URL | Paper | Year
2212.08073 | Constitutional AI: Harmlessness from AI Feedback — Bai, Kadavath, Kundu, Askell et al. | 2022
2204.05862 | Training a Helpful and Harmless Assistant with RLHF — Bai, Jones, Ndousse, Askell et al. | 2022
2001.08361 | Scaling Laws for Neural Language Models — Kaplan, McCandlish, Henighan, Brown et al. | 2020
2401.05566 | Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training — Hubinger, Denison, Mu et al. | 2024
2412.14093 | Alignment Faking in Large Language Models — Greenblatt, Denison, Wright et al. | 2024
2410.21514 | Sabotage Evaluations for Frontier Models — Carlsmith et al. (code sabotage, sandbagging) | 2024
2209.10652 | Toy Models of Superposition — Elhage, Hume, Olsson et al. | 2022
2305.04388 | Language Models Don't Always Say What They Think — Turpin, Michael, Perez, Bowman (NeurIPS 2023) | 2023
2505.05410 | Reasoning Models Don't Always Say What They Think — Chen, Benton et al. | 2025
2308.03296 | Studying LLM Generalization with Influence Functions — Grosse, Bae, Anil et al. | 2023
2209.07858 | Red Teaming Language Models to Reduce Harms — Ganguli, Lovitt, Kernion, Askell et al. | 2022
2212.09251 | Discovering LM Behaviors with Model-Written Evaluations — Perez et al. (ACL 2023) | 2022
2302.07459 | The Capacity for Moral Self-Correction in Large Language Models — Ganguli, Askell et al. | 2023
2310.13548 | Towards Understanding Sycophancy in Language Models — Tong et al. | 2023
2501.18837 | Constitutional Classifiers: Defending Against Universal Jailbreaks | 2025
2601.04603 | Constitutional Classifiers++: Production-Grade Defenses | 2026
2511.18397 | Natural Emergent Misalignment from Reward Hacking — includes Claude Code sabotage | 2025
2510.07192 | Poisoning Attacks on LLMs Require Near-Constant Poison Samples | 2025
2503.10965 | Auditing Language Models for Hidden Objectives | 2025
2207.05221 | Language Models (Mostly) Know What They Know — Kadavath, Conerly, Askell et al. | 2022
2112.00861 | A General Language Assistant as a Laboratory for Alignment — Askell, Bai et al. | 2021
1606.06565 | Concrete Problems in AI Safety — Amodei, Olah, Steinhardt et al. (pre-Anthropic) | 2016
2601.20245 | How AI Impacts Skill Formation — 17% lower scores with AI assistance | 2026

Transformer Circuits Thread (transformer-circuits.pub)

Model cards & system cards

Model | Date | Link
Claude 3 Family (Opus, Sonnet, Haiku) | Mar 2024 | PDF
Claude 3.5 Sonnet | Jun 2024 | PDF
Claude 3.7 Sonnet | Feb 2025 | System Card
Claude Opus 4 & Sonnet 4 | May 2025 | PDF
Claude Sonnet 4.6 | Feb 2026 | System Card
Claude Opus 4.6 | Feb 2026 | System Card
All System Cards Index | — | Index Page

Official documentation

Resource | URL
Documentation Home | docs.anthropic.com
Tool Use / Function Calling | docs.anthropic.com/.../tool-use
Extended Thinking | platform.claude.com/.../extended-thinking
Prompt Caching | platform.claude.com/.../prompt-caching
Computer Use Tool | platform.claude.com/.../computer-use-tool
Prompt Engineering Guide | docs.anthropic.com/.../prompt-engineering
Agent SDK Overview | platform.claude.com/.../agent-sdk/overview
Claude Code Memory | code.claude.com/.../memory
Claude Code Hooks | code.claude.com/.../hooks
Claude Code Skills | code.claude.com/.../skills
Claude Code Subagents | code.claude.com/.../sub-agents
GitHub Actions Integration | code.claude.com/.../github-actions
MCP Specification (2025-11-25) | modelcontextprotocol.io/specification
Agent Skills Open Standard | agentskills.io
Responsible Scaling Policy v3.0 | anthropic.com/rsp-v3-0
Transparency Hub | anthropic.com/transparency

Additional technical sources

Type | Reference
Site | How Boris Uses Claude Code — Boris Cherny's exact workflow, session management, parallel instances
Thread | Boris Cherny on X: "I'm Boris and I created Claude Code" — 15-tweet thread on his setup: parallel instances, Opus 4.5 w/ thinking, slash commands, subagents, hooks, MCP servers, verification loops
Article | InfoQ: Inside the Development Workflow of Claude Code's Creator
Paper | Terminal-Bench: Benchmarking LLM Agents (ICLR 2026 conference paper)
Article | Mitigating Prompt Injections in Browser Use (1% ASR defense)
Article | Confidential Inference via Trusted Virtual Machines
Article | Building AI for Cyber Defenders
Site | Anthropic Learning Resources Hub

Key GitHub repositories (github.com/anthropics)

Repository | Description
claude-code | The agentic coding tool
claude-code-action | GitHub Actions integration
claude-code-security-review | AI-powered security review GitHub Action
claude-code-monitoring-guide | ROI measurement guide
claude-agent-sdk-python | Python Agent SDK
claude-agent-sdk-typescript | TypeScript Agent SDK
skills | Public Agent Skills repository
anthropic-sdk-python | Official Python SDK
anthropic-sdk-typescript | Official TypeScript SDK
claudes-c-compiler | 100K-line C compiler built by 16 parallel Claudes
claude-cookbooks | Recipes for common integration patterns
courses | Educational courses (Jupyter notebooks)
prompt-eng-interactive-tutorial | 9-chapter prompt engineering tutorial
evals | Evaluation framework
hh-rlhf | Human preference data for RLHF paper
claude-constitution | Claude's values and behavior document
modelcontextprotocol | MCP specification (separate org)