Table of Contents
- Introduction — What Is an Agent?
- I. The Claude Agent SDK — Why We Built It
- II. The Harness — More Than Just a Model
- III. Bash Is All You Need
- IV. Workflows vs. Agents
- V. Designing the Agent Loop
- VI. Tools vs. Bash vs. Code Generation
- VII. Skills — Progressive Context Disclosure
- VIII. Security & Permissions — The Swiss Cheese Defense
- IX. Live Demo — The Pokémon Agent
- X. The Spreadsheet Agent Exercise
- XI. Sub-Agents and Context Management
- XII. Hooks — Deterministic Verification
- XIII. Q&A — Scaling, Reproducibility, and Large Codebases
Introduction — What Is an Agent?
"Agents build their own context, decide their own trajectories, and work very autonomously." — "The bash tool is the most powerful agent tool." — "Building an agent loop is kind of an art or intuition."
THARIQ: Thanks for joining me. I'm still on West Coast time, so this feels like 7:00 a.m. to me — but I'm glad to talk to you about the Claude Agent SDK. Here's the rough agenda: we're going to cover what the SDK is and why you'd use it, what an agent actually is and how it differs from a workflow, how to design an agent, and then I'm going to do some live coding to prototype one from scratch. The whole goal is to be collaborative — ask questions. This is also not a canned demo; we're going to think through things live. Building an agent loop is very much an art or intuition, so I want you to see that process in real time.
Before we get started, let me ask: how many of you have heard of the Claude Agent SDK? And how many have actually used it or tried it out? Great — decent show of hands. Let's dive in.
I think how AI features are evolving is something people have seen before, but it's still taking time to fully sink in. When GPT-3 came out, it was really about single LLM features — "can you categorize this and return one of these categories?" Then we got more workflow-like things: "take this email and label it," or "here's my codebase indexed via RAG, give me the next file to edit." That's a workflow — very structured, very constrained. Now we're getting to agents.
The canonical agent to point to is Claude Code. Claude Code is a tool where you don't really restrict what it can do. You talk to it in natural language, and it takes a wide variety of actions. Agents build their own context, decide their own trajectories, and work very autonomously. As the future unfolds, agents will become more and more capable. We're at a genuine breakpoint right now where we can start building them — they're not perfect, but this is absolutely the right time to get started.
Claude Code was the first true agent I saw working for 10, 20, 30 minutes straight. It's a coding agent, and the Claude Agent SDK is actually built on top of it.
I. The Claude Agent SDK — Why We Built It
THARIQ: The reason we built the agent SDK on top of Claude Code is simple: when we were building agents internally at Anthropic, we kept rebuilding the same parts over and over again. Once we put Claude Code out into the world, the engineers started using it — and then the finance people started using it, and the data science people, and the marketing people. We realized people were using Claude Code for non-coding tasks. And as we built our own non-coding agents internally, we kept coming back to it.
The lessons we learned from deploying Claude Code at scale — tool use errors, context compaction, best practices for long-running tasks — we've baked all of that into the SDK. As a result, we have strong opinions on the best way to build agents. The Claude Agent SDK is quite opinionated. One of the biggest opinions: the bash tool is the most powerful agent tool.
What does the "Anthropic way" to build agents look like? Roughly: Unix primitives — bash and the file system — as the foundation. Agents that build their own context. Code generation for non-coding tasks. And every agent runs in a container with a file system and bash access, because that is what enables it to operate generally. It's a very different architecture from spinning up a tool-calling loop against the completions API.
Who Is Already Building on the SDK?
Tons of teams are already using it. Software reliability, security triaging, bug finding, site and dashboard builders — those are extremely popular. Office agents for any sort of document work. Legal, finance, healthcare use cases. If you're building any of those, you should absolutely consider the SDK.
II. The Harness — More Than Just a Model
THARIQ: A robust agent requires more than just a model. We call the surrounding infrastructure the "Harness." In the harness, you've got tools — and tools are the obvious first step. But tools aren't just your own custom tools; they include things for interacting with the file system, like Claude Code uses. Then you've got the core agent prompts. Then the file system itself, which is a way of doing context engineering — and that's something I want to come back to, because it's one of the key insights we had through Claude Code: context is not just a prompt. It's the tools, the files, the scripts the agent can use.
Then there are skills, which we've rolled out recently. Sub-agents. Web search. Compacting. Hooks. Memory. There are a lot of things around the harness. The Claude Agent SDK packages all of these up so you can focus on the hard domain-specific problems, not the infrastructure.
III. Bash Is All You Need
THARIQ: This is my schtick, and I'm going to keep talking about it until everyone agrees with me. Bash is what makes Claude Code so good. Think about what the bash tool actually lets you do: store results of tool calls to files, store memory dynamically, generate scripts and call them, compose functionality using tools like tail, grep, ffmpeg, LibreOffice. It lets the agent use existing software — all the software that already exists on a computer.
If you were designing an agent harness without bash, you'd probably build a search tool, a lint tool, an execute tool. Every time you thought of a new use case, you'd say, "I need another tool." Instead, Claude Code just uses grep. No special package manager. It can run npm run test, it can lint, it can figure out how you lint and run npm run lint — and if you don't have a linter, it can ask, "What if I install ESLint for you?" That's the composability of bash.
Bash for Non-Coding Agents
Let me give a concrete non-coding example. Say you have an email agent and the user asks: "How much did I spend on ride-sharing this week?" Without bash, the agent searches your inbox for "Uber" or "Lyft," gets a hundred emails, and now has to reason about all of them at once. Imagine if someone handed you a stack of papers and said, "Can you just read through all my emails and add up the prices?" That's really hard — you need very high precision and recall.
With bash, you can pipe that search through grep for prices, save them to a file with line numbers, add them up, and then verify each result afterward — checking what each price actually corresponds to. There's a lot more dynamic information processing and work-checking you can do. That's the composability of bash in a non-coding context.
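The pipeline described above can be sketched in a few lines of shell. Everything here is hypothetical: emails.txt stands in for an exported inbox, and the receipt format is invented for illustration; this is the kind of throwaway pipeline the agent itself would compose with its bash tool, not an SDK API.

```shell
# Hypothetical data: emails.txt stands in for exported email bodies.
cat > emails.txt <<'EOF'
Your Tuesday trip with Uber - total: $14.50
Lyft receipt: total: $9.25
Weekly newsletter: 10 tips for inbox zero
Your Friday trip with Uber - total: $22.00
EOF

# Step 1: narrow to ride-sharing receipts, keeping line numbers so each
# amount can be traced back and verified afterward.
grep -n -E 'Uber|Lyft' emails.txt > receipts.txt

# Step 2: pull out the dollar amounts and sum them.
total=$(grep -o '\$[0-9.]*' receipts.txt | tr -d '$' | awk '{s += $1} END {printf "%.2f", s}')
echo "ride-share total: \$$total"
```

The intermediate receipts.txt file is the point: instead of reasoning over a hundred emails at once, the agent can grep that file later to check what each price actually corresponds to.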
More examples: if you have an inbox API and a contacts API, you can compose them with bash. If you have a video meeting agent and you want to find all the moments in an earnings call where the speaker says "quarterly results," you can use ffmpeg to slice the video and jq to analyze the data afterward. The key insight is that bash is essentially the first code mode — you can take a very wide variety of actions very generically.
AUDIENCE: Do you have stats on how many people use "yolo mode"?
THARIQ: Internally we don't use it much — we just have a higher security posture. I could probably pull that data. Any other questions on bash?
IV. Workflows vs. Agents
THARIQ: You can build both workflows and agents on the agent SDK. Agents are like Claude Code — you want to talk to it in natural language and have it take actions flexibly. Maybe you have an agent that talks to your business data and you want to get insights, build dashboards, answer questions, or write code. That's an agent.
A workflow is more constrained. We do a lot of GitHub Actions internally, for example — you define the inputs and outputs very closely. "Take a PR, give me a code review." With the SDK, you can use structured outputs for that. But when I say "workflow," don't assume it's simpler. We have a bot that triages issues when they come in — a pretty workflow-like thing. But in order to triage issues properly, it needs to clone the codebase, sometimes spin up a Docker container, test things. There are a lot of free-flowing steps in the middle, and then structured output at the end. So the line is blurrier than it looks.
V. Designing the Agent Loop
THARIQ: Here's the meta-learning for designing an agent loop: read the transcripts over and over again. Every time you see the agent running, read it and figure out — what is it doing? Why is it doing this? Can I help it out somehow? We'll do some of that during the demo.
The agent loop has three parts. First: gather context. For Claude Code, that's grepping and finding the files it needs. For an email agent, it's finding the relevant emails. This step gets underestimated. A lot of people skip it or underthink it, but how the agent finds context is critically important.
Second: take action. Does it have the right tools to do its work? Code generation and bash are more flexible ways of taking action than predefined tool calls.
Third: verify the work. This is another really important step that people overlook. If you're thinking of building an agent, ask yourself: can you verify its work? If you can, it's a great candidate for an agent. Code is highly verifiable — you can lint it, at minimum make sure it compiles, and then execute it and see what it does. Deep research is harder to verify; you can require source citations, but that's less rigorous than a compile step. The agents closest to being truly general are the ones with the strongest verification steps.
AUDIENCE: When do you generate a plan?
THARIQ: You'd insert it between the gather context and take action steps. Plans help the agent think through things step by step, but they add latency — so there's a trade-off. In Claude Code, you don't always generate a plan; it depends on the task. The agent SDK has some planning support as well.
AUDIENCE: Can you make the agent create a to-do list and guarantee it will follow it?
THARIQ: Yes — the agent SDK ships with to-do tools. It will maintain and check off to-dos as it goes, and you can display that progress in real time.
VI. Tools vs. Bash vs. Code Generation
THARIQ: You have three things to reach for: predefined tools, bash, and code generation. Most people are only thinking about tools. Let me compare all three.
Tools are extremely structured and very reliable. If you want the fastest output with minimal errors and minimal retries, tools are great. The cons: they consume a lot of context. If you've ever built an agent with 50 or 100 tools, the definitions alone take up a lot of context and the model can get confused. There's no discoverability: the model can't explore what the tools can do. And they're not composable.
Bash is very composable: static scripts, low context usage. It does take some discovery time; if you have the Playwright CLI, for instance, the agent needs to run playwright --help to figure out what it can do, every time. That trade is usually worth it: you give up a little latency in exchange for much lower context usage. Success rates might be slightly lower than with predefined tools, but this improves as models get better.
Code generation is highly composable and enables truly dynamic scripts. Generated scripts take the longest to execute, since they may need linting and possibly compilation. API design becomes a very interesting step in this context: the better the interfaces you expose, the better the code the agent writes against them.
The practical synthesis: keep predefined tools for atomic actions you need a lot of control over. In Claude Code, we don't use bash to write a file — we have a write-file tool, because we want the user to see the output and approve it. We're not composing write-file with other things. Sending an email is another example: any irreversible or externally visible action is a good candidate for a tool. Use bash for composable actions like searching folders, linting code, or managing memory — you can write to files and that becomes your memory system. Use code generation for highly dynamic, flexible logic where you're composing APIs, doing data analysis, or reusing patterns.
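The "bash for memory" part of this synthesis is simple enough to sketch directly. The directory layout and file names below are invented for illustration, not an SDK convention; the idea is just that notes written to files survive a cleared context window.

```shell
# Hypothetical file-system memory: the agent appends notes as it works...
mkdir -p memory
echo "user prefers TypeScript over Python" >> memory/preferences.md
echo "lint command: npm run lint" >> memory/project-notes.md

# ...and later, in a fresh context, recovers just what it needs:
grep -ri "lint" memory/
```

Because the memory is plain files, every bash tool (grep, tail, cat) doubles as a retrieval interface for it.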
AUDIENCE: Will there be ready-made tools for offloading long tool-call results to the file system, to prevent context explosion?
THARIQ: That's a good common practice. I've seen some recent work on this in Claude Code around handling very long outputs. My general approach now is: whenever I have a tool call, I save the results to the file system and have the tool call return the path of the result. That lets the agent recheck its work later by searching across those files.
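That practice can be sketched as a small shell wrapper. The run_tool function and the results/ layout are hypothetical, not part of the SDK; the pattern is simply "redirect the long output to a file and return the path."

```shell
# Hypothetical wrapper: run any command, save its full output to a file,
# and return only the path as the "tool result" the model sees.
run_tool() {
  out="results/$(date +%s)-$1.log"
  mkdir -p results
  shift
  "$@" > "$out" 2>&1
  echo "$out"   # the agent sees only this path, not the full output
}

path=$(run_tool list-files ls -la /usr)
echo "saved to: $path"
# Later, the agent can grep or tail the file on demand instead of
# carrying the whole output in its context window.
```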
VII. Skills — Progressive Context Disclosure
THARIQ: Skills are basically a way of allowing an agent to take on longer, more complex tasks by loading in context progressively. For example, we have a bunch of DOCX skills — these tell the agent how to use code generation to produce those files. Skills are, at their core, a collection of files. They're a great example of being very "file system first" because they're really just folders that your agent can cd into and read.
What we've found skills are really good for is repeatable instructions that need a lot of expertise baked in. We released a front-end design skill recently that I really like. It's essentially a very detailed, expert-level prompt on how to do front-end design — written by one of our best AI front-end engineers, with a lot of thought and iteration behind it. That's the value: skills let you capture expert knowledge in a retrievable, composable form.
AUDIENCE: What's the priority order between CLAUDE.md files and skill files?
THARIQ: Honestly, these concepts are so new — Claude Code was released only eight or nine months ago, and skills were released literally two weeks ago. I won't pretend I know all the best practices. The way I think about skills is as progressive context disclosure. The agent is like: "I need to do this. Let me find out how." It reads in the skill, figures out the steps, then executes. You ask it to make a DOCX file, it CDs into the skills directory, reads how to do it, writes some scripts, and keeps going. There's still intuition to build around what exactly belongs in a skill versus CLAUDE.md.
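The "folders the agent can cd into" model looks roughly like this. This is a hedged sketch: real skills use a SKILL.md with richer frontmatter, and the layout here is simplified for illustration.

```shell
# A skill is, at its core, just a folder with instructions in it.
mkdir -p skills/make-docx
cat > skills/make-docx/SKILL.md <<'EOF'
name: make-docx
description: How to generate .docx files via scripts
EOF

# Progressive disclosure: the agent lists available skills cheaply...
ls skills/
# ...and only reads the full instructions when the task calls for it.
cat skills/make-docx/SKILL.md
```

The listing step costs almost no context; the full instructions enter the context window only for the one skill the task actually needs.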
AUDIENCE: Are skills ultimately going to become part of the model itself?
THARIQ: Broadly, yes — the model will get better at a wider variety of tasks, and skills are the best way to handle out-of-distribution tasks right now. My general rule of thumb: I try to rethink and rewrite my agent code every six months, because capabilities have likely changed enough that I've baked in outdated assumptions. The agent SDK is built to advance with capabilities — the bash tool gets better, and since we build on Claude Code, you get those improvements automatically. But things have changed so dramatically in the last year that a general best practice is: we can write code ten times faster now, so we should be willing to throw it out ten times faster as well. If you're a startup, that's arguably your biggest advantage over larger competitors who are stuck in six-month incubation cycles.
AUDIENCE: Why use a skill versus an API?
THARIQ: Both are forms of progressive disclosure to the agent. The choice is use-case dependent. Read the transcript of your agent running and see what it naturally reaches for. If it always wants to think about the API as an api.ts file, do that. Skills are a great introduction to thinking about the file system as a way of storing context, but there are many ways to use the system. One thing to note: skills require the bash tool and a virtual file system, so the agent SDK is basically the only way to use them to their full extent right now.
AUDIENCE: Can we expect a marketplace for skills?
THARIQ: Claude Code has a plugin marketplace you can use with the agent SDK, and we're evolving it over time. It's more of a discovery system than a commercial marketplace, but it does exist right now — you can do /plugins in Claude Code to find some.
VIII. Security & Permissions — The Swiss Cheese Defense
AUDIENCE: If you're using bash as an all-powerful generic tool, is the onus on the agent builder to guard against common attack vectors, or is the model doing that itself?
THARIQ: We call it the Swiss cheese defense. On every layer there are some defenses, and together they block everything — we hope.
At the model layer, we do a lot of alignment work. We just put out a really good paper on reward hacking — strongly recommend checking it out. We try to make the Claude models very aligned. Then at the harness level, we have a lot of permissioning and prompting built in. We actually run a parser on the bash tool so we know fairly reliably what the bash tool is actually doing — and that is definitely not something you want to build yourself. Finally, there's sandboxing. If someone has maliciously taken over your agent, what can it actually do? We've included a sandbox where you can restrict network requests and file system operations outside the designated directory.
The "lethal trifecta," as it's sometimes called, is: execute code in the environment, change the file system, and exfiltrate data out. If you sandbox the network, that last step becomes very hard. If you're hosting on a sandboxed container — Cloudflare, Modal, E2B, Daytona, any of these sandbox providers — they've also done some level of security. You're not hosting it on your personal computer with your production secrets sitting around.
Role-Based Access and Scoped API Keys
AUDIENCE: How do you ensure role-based access controls?
THARIQ: Generally that's in how you provision your API key or your backend service. I typically create temporary API keys scoped in certain ways. Some people create proxies in between to handle API key injection. On the backend you can check what the agent is trying to do and give it appropriate feedback. The model listens to feedback — if you throw an error, it will read the error output and iterate. That's a powerful property to design around.
IX. Live Demo — The Pokémon Agent
THARIQ: Let's make a Pokémon agent. Pokémon is a game with a lot of information — thousands of Pokémon, each with a ton of moves. There is a public PokéAPI. I chose it because I know you all have your own complex APIs, and I wanted to choose something with a fairly complex API that I haven't tried building against before.
One thing a user might want to do is build a competitive Pokémon team. I love Pokémon but know very little about competitive play. Could an agent help me with that? That would be cool. My goal is to build an agent that can chat about Pokémon, and we'll see how far we get.
The first step was: I gave Claude Code the prompt "go search the PokéAPI for its API and create a TypeScript library." Here's what it generated — a PokemonAPI interface with methods like getByName, listPokemon, getAllPokemon, getSpecies, getAbilities, and the same for moves. It also created a CLAUDE.md describing this TypeScript SDK and instructing the agent to write scripts in an examples directory and execute those scripts to handle queries. That's my agent, really — a prompt to generate a TypeScript library plus a CLAUDE.md.
Comparing Tools-Only vs. Bash + Code Generation
I also built a version using the messages/completions API directly with predefined tools: getPokemon, getPokemonSpecies, getPokemonAbility, getPokemonType, getMove. You can see how, with the tools-only approach, I had to define all of these upfront. And there are only so many parameters I could anticipate.
When I ask the Claude Code version "what are the Generation 2 water Pokémon?", it writes a script, fetches the PokéAPI directly, iterates over 200+ Pokémon, checks their types, and returns the list. It's using code generation to handle a query I didn't explicitly anticipate when I built the harness. That's the key difference — the bash-and-codegen version can handle queries I never pre-designed tools for.
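The shape of that generated script can be sketched against mock data. The pokemon.csv below is invented, standing in for live PokéAPI responses the real demo fetched over HTTP; the AWK one-liner is the kind of throwaway filter the agent writes for a query no predefined tool anticipated.

```shell
# Mock data standing in for PokeAPI responses: name,generation,type
cat > pokemon.csv <<'EOF'
totodile,2,water
chikorita,2,grass
quagsire,2,water
mudkip,3,water
EOF

# "What are the Generation 2 water Pokemon?" as a generated filter:
awk -F, '$2 == 2 && $3 == "water" {print $1}' pokemon.csv
```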
Competitive Analysis with Smogon Data
For competitive play, there's a text file from Smogon — an online library that stores all Pokémon, their moves, who they work well with, who counters them. I put this in the data folder and asked: "I want to make a team around Venusaur. Can you give me some suggestions based on the Smogon data?"
What it did was start grepping through that text file — finding Venusaur's profile, then finding other Pokémon that mentioned Venusaur as a teammate or counter. It ran searches, found interesting synergies, wrote a script to analyze the most common teammates, and returned a full team suggestion. All based on a raw text file. That's the power of the approach — I didn't build a structured Smogon API. I gave it a text file and bash access and it figured out the rest.
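A miniature version of that workflow, with an invented file format standing in for the real Smogon dump (the actual file is much larger and structured differently):

```shell
# Hypothetical stand-in for the Smogon text dump.
cat > smogon.txt <<'EOF'
Venusaur | teammates: Heatran, Rotom-Wash | checks: Alakazam
Heatran | teammates: Venusaur, Rotom-Wash | checks: Garchomp
Rotom-Wash | teammates: Venusaur, Heatran | checks: Serperior
Alakazam | teammates: Tapu Lele | checks: Heatran
EOF

# Find Venusaur's own profile...
grep '^Venusaur' smogon.txt

# ...then every other Pokemon that lists Venusaur as a teammate,
# which is roughly the search the agent composed on its own.
grep -v '^Venusaur' smogon.txt | grep 'teammates:.*Venusaur' | cut -d'|' -f1
```

No structured API was needed; a raw text file plus grep gives the agent a searchable synergy graph.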
AUDIENCE: Is that code going to be on GitHub somewhere?
THARIQ: Yeah, it's on my personal GitHub. I'll push the latest changes and also tweet about it — I'm @TRQ212 on Twitter and post a lot about agent SDK stuff.
X. The Spreadsheet Agent Exercise
THARIQ: Let's think through a spreadsheet agent. You want it to be able to search, take action, and verify its work. How would you approach it? Take a minute to think about it.
What's the best way for an agent to search a spreadsheet? You've got a CSV. The agent wants to search it. What does it do?
AUDIENCE: Convert it to CSV and grep. Look at all the headers.
THARIQ: Good. Headers give you a starting point, but a spreadsheet is a multi-dimensional problem. If you ask "what's the revenue in 2026?" you need to find a revenue column and then filter by 2026. Pulling headers alone gives you column names but not row values. Other ideas?
AUDIENCE: AWK. SQLite on the CSV directly. Use the Google Sheets API.
THARIQ: All great. I love the SQLite one — querying a CSV via SQLite is a creative way to think about API design. If you can translate your data source into a SQL interface, the agent knows SQL extremely well. That transformation step is one of the most powerful design moves you can make for an agentic search interface.
Something else worth noting: XLSX files are XML under the hood. You can do XML path queries against the raw file, and there are libraries that support that. So you've got at least four approaches: grep on headers, AWK, SQLite wrapper, XML query. The key insight is that gathering context is really creative work. If you've only tried one iteration, that's probably not enough. Try SQL, try grep, try AWK — run tests across approaches, see what the agent naturally does well, and iterate.
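The grep-on-headers and AWK approaches from the discussion can be sketched on a toy CSV. The finances.csv file is invented for illustration; the query is the "revenue in 2026" example from above.

```shell
# Toy spreadsheet export (invented data).
cat > finances.csv <<'EOF'
year,quarter,revenue,costs
2025,Q4,120000,80000
2026,Q1,150000,90000
2026,Q2,170000,95000
EOF

# Headers first, to orient the agent on the columns available:
head -n 1 finances.csv

# Then a targeted AWK query: total revenue in 2026.
awk -F, '$1 == 2026 {s += $3} END {print s}' finances.csv
```

If sqlite3 is installed, the same data could be loaded with its CSV import mode and queried as SQL instead, which plays to the model's deep familiarity with SQL.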
Taking Action and Verifying Work
Taking action in a spreadsheet looks a lot like gathering context — inserting a 2D array, running a SQL update, editing XML. The APIs tend to be similar going both directions. Then verification: check for null pointers. Look for anomalies — did a value change that shouldn't have? One pattern is to use a sub-agent for adversarial verification. Give a fresh agent the output with no context about who made it and ask it to critique the work. As models get better at reasoning, that adversarial sub-agent gets better too.
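Those verification ideas can be made deterministic with a couple of shell checks. The file names are hypothetical, and these two checks are stand-ins for whatever domain-specific invariants your agent needs to enforce after an edit.

```shell
# Hypothetical before/after pair: edited.csv is the agent's output.
cat > original.csv <<'EOF'
year,revenue
2025,120000
2026,150000
EOF
cp original.csv edited.csv

# Check 1: the edit should not have dropped rows.
[ $(wc -l < original.csv) -eq $(wc -l < edited.csv) ] && echo "row count ok"

# Check 2: no cell went empty (a crude "null pointer" check for spreadsheets).
awk -F, '{for (i = 1; i <= NF; i++) if ($i == "") {print "empty cell at row " NR; exit 1}}' edited.csv && echo "no empty cells"
```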
AUDIENCE: What about undoing mistakes? If the agent deletes a whole spreadsheet, what then?
THARIQ: This is a really important consideration when choosing what domains agents are good at. How reversible is the work? Code is very reversible — you can undo via git history. Claude Code does atomic operations and I use git constantly through it; I don't type git commands anymore. Computer use is a bad example of reversibility because every action compounds the state. If you're thinking about building a spreadsheet agent, try to turn it into a reversible state machine: store state at checkpoints so the user can say, "my spreadsheet is messed up, go back to where we were." Someone actually built a "time travel" tool for a coding agent that let it revert to a point before something went wrong. Those tools are still early but the idea is sound.
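The checkpoint idea can be sketched with plain git, as a stand-in for the "time travel" tool mentioned above. This assumes git is available in the agent's container; the file and directory names are invented.

```shell
# Checkpoint-based reversibility with plain git.
mkdir -p sheet-workdir && cd sheet-workdir
git init -q
echo "revenue,120000" > sheet.csv
git add sheet.csv
git -c user.email=agent@example.com -c user.name=agent commit -qm "checkpoint: before agent edits"

# The agent makes a change that turns out to be wrong...
echo "revenue,999999" > sheet.csv

# ...and the user asks to go back to the checkpoint:
git checkout -q -- sheet.csv
cat sheet.csv
```

Anything the agent touches that lives in files gets this reversibility for free, which is a large part of why code is such a good agent domain.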
XI. Sub-Agents and Context Management
AUDIENCE: Do all agents share the same context window?
THARIQ: Sub-agents are one of the most important primitives for managing context, and I haven't talked about them enough. Sub-agents are great for when you need to do a lot of work and return just the answer to a main agent. For a spreadsheet search, the main agent might spin up three sub-agents in parallel — one to summarize sheet one, one for sheet two, one for sheet three — and then reason over all three results. The sub-agents do the expensive exploratory work; the main agent just sees the distilled output.
Sub-agents with bash is something Claude Code handles really well — better than I've seen in any other framework, honestly. Running parallel sub-agents with bash becomes very complex (race conditions and file system conflicts are real), and there's a lot of engineering we've done to solve that. In the agent SDK you can just ask it to spin up three sub-agents to do a task and it will do that.
AUDIENCE: What fraction of the context window can you use before hitting diminishing returns?
THARIQ: When I talk to people using Claude Code, they'll mention being on their fifth compact, and I'm genuinely surprised — I've almost never done a compact. I tend to clear the context window very often. In code, the state is in the files of the codebase, not in the chat history. If I've made some changes, Claude Code can just look at the git diff and continue a new task without needing to know my entire conversation history. So for code, I clear context very aggressively.
For your own agents — a spreadsheet agent, say — context management is harder because your users don't know what a context window is. The UX design challenge is: can you reset conversation state gracefully? In a spreadsheet, a lot of the state is in the spreadsheet itself, not in the chat history, so the agent probably doesn't need to carry as much forward. Can you store user preferences as it goes, so you remember some things without needing the full transcript? There's a lot of design space here, but you are trying to minimize context usage without the user noticing.
XII. Hooks — Deterministic Verification
AUDIENCE: I haven't heard you talk about hooks yet. What's your take?
THARIQ: Hooks are great, and we do ship with them. Hooks are a way of doing deterministic verification or inserting context at key moments. We fire these as events and you can register handlers in the agent SDK.
Some examples: you can run a verification check on the spreadsheet after every tool call. Or imagine you're working with an agent on a spreadsheet and the user is also editing the spreadsheet in parallel. A hook can detect that the user has changed something and inject those changes into the agent's context — giving it live updates between tool calls. That's a really interesting way to use hooks.
For determinism: if an agent hallucinates or skips a step — say, it guesses a Pokémon stat instead of actually running a script to check — a hook can intercept the response and inject feedback: "Please make sure you write a script and read the actual data." In Claude Code, we have a rule that if the agent tries to write to a file it hasn't read yet, we throw an error and say "you haven't read this file yet, try reading it first." That's deterministic verification at the tool level, and hooks let you build the same kind of thing for your own agents.
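A toy version of that guardrail: check_response is a hypothetical function standing in for a real hook handler, and the "ran script:" marker and file names are invented. The actual SDK registers hook handlers programmatically rather than via shell, but the logic is the same: inspect the response, and inject corrective feedback if the required step is missing.

```shell
# Hypothetical deterministic check: did the response actually run a script?
check_response() {
  if grep -q 'ran script:' "$1"; then
    echo "ok"
  else
    echo "feedback: please write a script and read the actual data before answering"
  fi
}

# An answer from memory triggers the guardrail...
echo "Pikachu's base speed is 90 (from memory)" > response.txt
check_response response.txt

# ...while an answer backed by an actual script execution passes.
echo "ran script: get_stats.ts -> base speed 90" > response2.txt
check_response response2.txt
```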
More documentation is in the agent SDK docs, and I'm happy to talk through specific use cases afterward.
XIII. Q&A — Scaling, Reproducibility, and Large Codebases
Reproducibility and Helper Scripts
AUDIENCE: Let's say I've done this prototyping and found something that works. How do I convert that into a reproducible, productionizable thing?
THARIQ: I'd start in CLAUDE.md — for example, one of my early runs of the Pokémon agent ignored my pre-built TypeScript API and wrote JavaScript. I should have been more explicit in CLAUDE.md: "you should use this library." The second thing is: summarize what worked in terms of helper scripts and write something like an agent.ts script to run the tests again. Good helper scripts mean the agent can discover and verify what tools it has available.
When the Agent Ignores the Script and Just Answers from Memory
AUDIENCE: My Pokémon agent tries once or twice to use the scripts, fails, and then just returns a comparison table from its training data. Any advice?
THARIQ: This is a good problem and there is some messiness here. Pokémon is in-distribution for the model, so it will sometimes just answer from memory rather than actually checking. One of the best solutions is hooks: you can check whether the response includes a script execution, and if not, inject feedback — "please make sure you write a script, please make sure you read the actual data." It adds a deterministic guardrail without retraining the model. That said, it is an art — sometimes you just need to push harder on it through the prompt or iterate on the verification logic.
Scaling to Massive Codebases (50M+ Lines)
AUDIENCE: I'm working with a 50-million-plus line codebase. Standard grep doesn't work at that scale. I'm building my own semantic indexing. Is Anthropic thinking about making that more native to the product?
THARIQ: Is what you're building going to be obsolete in a couple of months? Generally, with AI — yes. [laughter] But let me give a real answer. Semantic search has trade-offs: it's more brittle, and the model isn't trained on your specific semantic index, so the queries can be unpredictable. What I've seen working well for very large codebases is good CLAUDE.md files — structured context that tells the agent where things live — and starting the agent in the specific subdirectory you want it to work in rather than trying to index the entire 50 million lines at once. Good verification steps, hooks, and linting matter a lot here. We don't use custom semantic indexing for Claude Code itself, but we do use good context files and scoped starting points. It's not a perfect answer, but it's what works today.
Monetization and Pricing Models
AUDIENCE: Agents are pricey. How are you thinking about monetization when the margins all flow back to Anthropic?
THARIQ: Agents are pricey right now — models have just started to become truly agentic and we focus on the most intelligent models we can build. The general software answer applies here: you'd rather charge fewer people more money for solving a genuinely hard problem than charge many people a little for something they could do themselves. Find the use cases where people have a hard problem and will pay meaningfully for a solution. On pricing structure — subscription or usage-based — it really depends on your user base. Claude Code does a mix: rate limits with overage usage-based pricing above the limit. Design your monetization model up front, because it's hard to walk back pricing promises once users expect them.
[Applause. End of session.]