Author: Claude AI, under the supervision, prompting, and editing of HocTro
Based on a full-day developer workshop hosted by Anthropic. Presenter: Thariq Shihipar, Anthropic. The original transcript has been restructured and every technical concept explained in plain language.
Introduction — What Is an Agent?
To understand what Thariq is teaching, you first need to understand the three generations of AI tools he lays out, because almost everything he says builds on that foundation.
The first generation was the single AI call. You had an AI model — think of a model as the AI "brain," a program trained on enormous amounts of text that can understand and generate language — and you would send it one question and get one answer. "Categorize this email as spam or not spam." That's it. One prompt in, one answer out. It was useful but limited. The AI had no memory of what it just did, no ability to look anything up, and no way to take action in the world.
The second generation was the workflow. A workflow is a structured, pre-planned sequence of AI calls stitched together by a programmer. Think of it as a recipe: "Step one, take the user's email. Step two, ask the AI to label it. Step three, save the label to the database." Each step is defined ahead of time by the developer. The AI has no choice about what to do next — it just fills in its part of the assembly line. Workflows were more powerful, but they were rigid. If a new situation arose that the programmer hadn't anticipated, the workflow broke.
The third generation — and the one this workshop is all about — is the agent. An agent is an AI that decides for itself what steps to take, in what order, for how long. Thariq puts it plainly: "Agents build their own context, decide their own trajectories, and work very autonomously." Context here is a word you will see constantly in this field. The AI model can only ever see a limited amount of text at once — its working memory, essentially. This is called the context window. A single-turn AI call sees just your one question. An agent actively goes out and collects the information it needs, loads it into that context window, works through the problem, takes actions, and checks its own results. It is not following a recipe; it is figuring out the recipe as it goes.
The example Thariq uses is Claude Code — Anthropic's own coding assistant. Claude Code is the first agent he watched run for ten, twenty, thirty minutes straight without a human steering it. You tell it in plain English what you want — "add a user login feature to my app" — and it reads your codebase, figures out what needs changing, writes the code, runs the tests, and fixes errors, all on its own. That is what makes it an agent rather than a workflow: it is making decisions at every step, not executing a predefined script.
The Claude Agent SDK — the topic of this entire workshop — is a toolkit Anthropic built specifically to make it easier for other developers to build their own agents. Not coding agents like Claude Code, but any kind of agent: a customer service agent, a finance agent, a legal research agent, a healthcare agent. The SDK packages up all the hard-won lessons from building Claude Code so that you do not have to solve those problems yourself.
I. The Claude Agent SDK — Why We Built It
SDK stands for Software Development Kit. Think of it as a toolbox that someone has already assembled for you. Instead of going to the hardware store, buying every individual tool, figuring out which brand works best, and learning how to use each one from scratch, you get a pre-packed toolbox with everything organized and ready to go. The Claude Agent SDK is Anthropic's pre-packed toolbox for building AI agents.
Thariq explains the reason it exists in a refreshingly honest way: Anthropic engineers kept rebuilding the same pieces over and over again. Every time they wanted to build a new agent for an internal project — tracking bugs, reviewing code, helping the finance team — they found themselves solving the same plumbing problems. How do you handle it when the AI makes an error calling a tool? How do you stop the conversation from running out of memory? What do you do when the agent runs for an hour and the context window fills up? Every team was solving these problems independently, writing duplicate code, making the same mistakes.
Once Claude Code was released to the public, something unexpected happened. Engineers started using it for their work, which made sense — it was a coding tool. But then the finance team started using it. The marketing team. The data scientists. People were using a coding agent to do tasks that had nothing to do with code, and it worked because the underlying approach was general enough. That observation crystallized the opportunity: if the architecture of Claude Code could be extracted and packaged, anyone could build a capable agent on top of it without re-solving the foundational problems.
The SDK bakes in Anthropic's hardest-won lessons. Tool use errors — meaning situations where the AI tries to call one of its available functions and something goes wrong — are handled gracefully. Context compaction is built in; this is the process of summarizing older parts of a conversation when the context window starts to fill up, so the agent can keep working on long tasks without losing its mind. Best practices for long-running tasks — agents that might run for an hour or more — are encoded into the SDK's defaults. When you build on the SDK, you inherit all of this for free.
Thariq is direct about one thing: the SDK is opinionated. Opinionated means it makes choices for you and steers you toward a particular way of doing things. This is the opposite of a neutral, do-whatever-you-want toolkit. The SDK has a strong point of view about how agents should be built, because Anthropic believes that point of view leads to better results. The biggest opinion it holds — one that Thariq returns to again and again — is that the command line, specifically the bash shell, is the most powerful tool you can give an agent.
II. The Harness — More Than Just a Model
One of the most important ideas in this workshop is that a capable AI agent is not just an AI model sitting alone. The model is the brain, but a brain without a body, a workspace, and a set of tools cannot accomplish much. The surrounding infrastructure — everything that supports and empowers the model — is what Thariq calls the Harness.
Imagine a new employee at a company. The employee (the AI model) might be brilliant, but on their first day they cannot do much without a desk, a computer, access to the company's files, a list of their responsibilities, and colleagues they can ask for help. The harness is all of that — the desk, the computer, the access, the instructions. Without the harness, the model sits in a room with a question and no way to investigate, act, or verify anything.
The harness has several components, and Thariq lists them explicitly. Tools come first — these are specific actions the AI is permitted to take, like reading a file, searching the web, or sending a message. Think of tools as buttons on a control panel: the AI knows the buttons exist, knows what each one does, and can press them when needed. Then there is the file system, which is simply the computer's storage — folders and files the agent can read, write to, and organize. Thariq makes the point that the file system is not just storage; it is a form of context engineering. Instead of trying to hold everything in the AI's working memory at once, you can offload information to files and have the agent read back what it needs when it needs it.
There are also core agent prompts — the standing instructions that tell the agent who it is and how it should behave. A special file called CLAUDE.md serves this function in Anthropic's system. Every time the agent starts or restarts, it reads CLAUDE.md first. You can think of it as the employee handbook: it tells the agent the rules of this particular job, what tools are available, what it should and should not do.
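To make the "employee handbook" idea concrete, here is what a minimal CLAUDE.md might look like for a hypothetical email-triage agent. Every folder name and rule below is invented for illustration, not taken from Anthropic's own files:

```markdown
# Project instructions

## What this agent does
Triage incoming support emails and draft replies for human review.

## Rules
- Always search the `inbox/` folder before answering a question about email.
- Write intermediate results to `scratch/` instead of holding them in memory.
- Never send an email without asking the user to confirm first.
```

The agent reads this file at startup, so these instructions shape everything it does afterward.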
Beyond those fundamentals, the harness includes skills — reusable instruction packs the agent can load on demand when it encounters a specialized task. It includes sub-agents — mini agents the main agent can spawn to work on sub-problems in parallel. It includes web search, compacting (the process of summarizing older parts of the conversation to free up memory), hooks (automatic checkpoints that run verification checks), and memory (persistent storage of information across sessions). Each of these will be explained in its own chapter. The point for now is simply this: a powerful agent is a system, not just a model. The Claude Agent SDK packages that entire system so you can deploy it without building it piece by piece yourself.
III. Bash Is All You Need
If there is one idea in this workshop that Thariq is most passionate about, it is this one. Bash is the command-line shell — the text-based interface where you type commands directly to the operating system. It is the black window on a programmer's screen where you might see lines like ls, grep, or cd. If you have never used it, you may think of it as the "programmer's control panel" for the computer. It predates graphical interfaces, it is extremely powerful, and almost every computer in the world has it.
Thariq's argument is that giving an AI agent access to bash is fundamentally more powerful than giving it a list of custom-built tools. Here is why. When you give an agent a list of tools — say, a searchEmail tool, a calculateSum tool, a downloadFile tool — you are, as the developer, deciding in advance what the agent is allowed to do. Every new capability requires you to go write another tool. You are always one step behind the agent's needs. But when you give the agent bash access, you give it access to every piece of software already installed on the computer. The agent can use grep to search for a pattern in a thousand files. It can use ffmpeg — a program for processing video and audio — to extract clips. It can use jq to parse structured data. It can install new software if needed. It can compose multiple programs together using the pipe operator, which feeds the output of one command directly into the input of another. That composability is the key.
Thariq gives a concrete example to make this tangible. Suppose you have an email agent and a user asks: "How much did I spend on ride-sharing this month?" Without bash, the agent retrieves every email mentioning Uber or Lyft — perhaps a hundred emails — and then has to reason through all of them at once, trying to extract dollar amounts and add them up. That is cognitively expensive and error-prone, like handing a person a stack of a hundred papers and saying "read all of these and give me the total." With bash, the agent can use grep to extract just the lines in those emails containing dollar amounts, save them to a temporary file, use another command to add them up, and then spot-check a few results to confirm they make sense. The task that was vague and overwhelming becomes a precise, verifiable computation.
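The ride-sharing pipeline can be sketched in a few lines of shell. Everything here is illustrative: the email files, the receipt wording, and the assumption that totals appear as plain dollar amounts in the text:

```shell
# Invented sample "emails" standing in for a real inbox export.
mkdir -p /tmp/emails
printf 'Your Tuesday trip with Uber\nTotal: $14.50\n' > /tmp/emails/uber1.txt
printf 'Lyft receipt\nTotal: $9.25\n' > /tmp/emails/lyft1.txt

# Step 1: extract only the dollar amounts and save them to a scratch file
# the agent can re-read and spot-check later.
grep -rhoE '\$[0-9]+\.[0-9]{2}' /tmp/emails > /tmp/amounts.txt

# Step 2: strip the $ sign and sum with awk.
tr -d '$' < /tmp/amounts.txt | awk '{total += $1} END {printf "%.2f\n", total}'
```

Because the extracted amounts live in `/tmp/amounts.txt` rather than in the context window, double-checking the total later means re-reading a small file, not re-reading a hundred emails.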
This approach works just as well for non-technical tasks. A video processing agent can use ffmpeg to slice an earnings call recording every time the word "revenue" is detected. A data agent can pipe API results through jq to extract just the fields it needs. An email agent can combine your inbox API with your contacts API using bash scripting. The underlying principle is that bash treats every existing piece of software as a potential building block. Rather than rebuilding the wheel every time you need a new capability, the agent reaches for whatever tool already exists on the computer and composes it with everything else.
Thariq also makes a practical point about memory. When the agent computes something — say, a list of ride-sharing expenses — it can write that result to a file on the disk. Later, if it wants to double-check its work or add more data, it does not have to redo the computation. It just reads the file back. The file system, combined with bash, becomes a flexible, persistent memory that lives outside the context window. This is a central design philosophy of the Claude Agent SDK: the file system is not just storage, it is an extension of the agent's thinking.
IV. Workflows vs. Agents
Thariq draws a line between two kinds of things you can build with the SDK: workflows and agents. Understanding this distinction will save you from misapplying the technology and from being confused when you see the word "agent" used to mean very different things.
An agent, as described in the introduction, is autonomous. It decides its own steps. You give it a goal in natural language — "help me analyze my business data and create a dashboard" — and it figures out how to get there. It might grep through your files, write some code, run the code, look at the output, notice something unexpected, ask you a clarifying question, and then continue. The path is not known in advance. The agent constructs it as it goes, responding to what it discovers along the way.
A workflow is more like a vending machine. You define the inputs clearly, you define the outputs clearly, and the steps in between are largely fixed. Thariq gives the example of a GitHub Actions workflow — a common automation tool used by software teams. You define: "When a pull request comes in, run these steps in this order and produce this output." It is structured, constrained, and predictable. Structured outputs — meaning responses formatted in a specific, machine-readable way — fit naturally here. If you want the AI to review a piece of code and always return a JSON object with fields like rating, summary, and suggested_changes, that is workflow thinking.
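For the code-review example, workflow thinking might pin the response to a fixed JSON shape like the one below. The field names come from the text; the values and the file names inside them are invented for illustration:

```json
{
  "rating": 7,
  "summary": "Solid change overall; two naming issues and one missing null check.",
  "suggested_changes": [
    "Rename the variable `tmp` to something descriptive",
    "Guard against a missing user record before reading its email field"
  ]
}
```

Because every response has the same fields, downstream software can consume it without any interpretation — the defining trait of workflow-style output.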
But here is the nuance that Thariq wants you to hold onto: the line between workflows and agents is blurry in practice. He gives the example of a bug-triage bot. On the surface it sounds like a simple workflow: a new issue comes in on GitHub, the bot reads it, labels it, and assigns it to the right team. Input, process, output — clean and structured. But actually doing the job well requires the bot to clone the source code repository, spin up a Docker container (a self-contained virtual environment where code can be run safely), run the code to reproduce the bug, check whether similar bugs have appeared before, and then make a judgment call about severity. There are a lot of free-flowing, open-ended steps in the middle, even though the final output is a structured label. The workflow-versus-agent distinction is less about the output format and more about how much freedom the AI has in deciding how to get there.
Thariq's practical guidance is to default toward agents when the problem is messy and underspecified, and to use workflows — with structured outputs — when the problem is well-defined and the steps are known. Most real-world business problems, he suggests, are messier than they appear, and a fully rigid workflow will eventually break when reality fails to match the plan. The SDK supports both patterns, and the same underlying infrastructure serves either approach.
V. Designing the Agent Loop
Every agent, regardless of what it does, operates in a loop. It does not receive one question and return one answer. It cycles through a process — gathering information, taking action, checking results — over and over until the job is done or it gets stuck. Thariq calls this the agent loop, and understanding its three phases is the most practical framework he offers for anyone who wants to build their own agent.
The first phase is gathering context. Context, in AI terms, means all the information the model currently has available to reason with. Before the agent can do anything useful, it needs to understand the situation. For a coding agent like Claude Code, gathering context means finding and reading the relevant source files. For an email agent, it means searching the inbox for relevant messages. For a customer support agent, it means pulling up the customer's account history. Thariq says this step gets underestimated more than any other — most developers rush past it to get to the exciting part (taking action), but how well the agent gathers its context almost entirely determines the quality of what it does next. A doctor who takes a full patient history gives better diagnoses than one who guesses. An agent that thoroughly gathers relevant context gives better results than one that jumps to action with incomplete information.
The second phase is taking action. This is where the agent actually does something: writes code, sends a message, modifies a file, calls an API, generates a report. The quality of this phase depends heavily on what tools and capabilities the agent has been given — which is why the choice between predefined tools, bash, and code generation (discussed in the next chapter) matters so much.
The third phase is verification. After taking an action, the agent should check whether it actually worked. This is the phase Thariq says most beginners skip entirely, and it is what separates a toy agent from a reliable one. For a coding agent, verification might mean running the code to see if it compiles, then running the automated tests to see if it passes them. Compilation is binary — the code either compiles or it does not — which makes it an extremely trustworthy verification step. For a research agent, verification might mean checking that every claim in a generated report is supported by a source citation. For a spreadsheet agent, it might mean checking that totals add up correctly before returning the result to the user.
Thariq makes an important observation about which tasks are good candidates for agents: the ones where verification is easy and reliable. Code is great because you can lint it (a linter is a program that automatically checks code for common errors and style violations), compile it, and run it. The verification steps are objective and cheap. Deep research is harder — you can require citations, but a citation does not prove the text accurately represents the source. The closer you can get to objective, automated verification, the more reliably your agent will perform over time.
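The "compilation is binary" point can be made concrete with exit codes, which the shell gives an agent for free. This is a sketch with an invented script; `bash -n` performs a syntax check without executing anything:

```shell
# An invented script standing in for agent-generated code.
cat > /tmp/report.sh <<'EOF'
#!/bin/sh
echo "total: 23.75"
EOF

# Step 1: syntax-check without running ("does it compile?").
if ! bash -n /tmp/report.sh; then
    echo "syntax check failed" >&2
    exit 1
fi

# Step 2: run it and assert something objective about the output.
output=$(sh /tmp/report.sh)
case "$output" in
    total:*) echo "verified: $output" ;;
    *)       echo "unexpected output: $output" >&2; exit 1 ;;
esac
```

Both checks are cheap, objective, and automatic — exactly the kind of verification that makes a task a good candidate for an agent.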
An audience member asked about planning — when should the agent write out a plan before acting? Thariq places planning between gathering context and taking action: the agent collects information, formulates a plan, then executes it. Plans help the AI think step by step, which improves accuracy on complex tasks. But they add latency, meaning extra time before anything visible happens. The SDK supports planning, and the decision of whether to include a planning step comes down to whether the task is complex enough to benefit from it. Thariq also notes that the SDK ships with to-do list tools — the agent can maintain a checklist of sub-tasks and check them off as it goes, giving the user progress updates in real time.
VI. Tools vs. Bash vs. Code Generation
When you build an agent, you have three fundamentally different ways to give it the ability to act in the world: predefined tools, bash commands, and code generation. Most developers only know about the first option. Thariq walks through all three, explaining when each one is the right choice.
Predefined tools are functions you define in advance as a developer. You write a function called searchDatabase, give it a description, and tell the AI it can call that function when it needs to search a database. The AI calls it, gets a result, and continues reasoning. Tools are the most structured and reliable of the three options — the AI always gets a predictable, well-defined result. They are ideal when you need fast, error-free responses and know exactly what actions your agent will ever need to take. The downside is that tools are expensive in terms of context. Every tool you define consumes some of the AI's working memory just by existing, because the AI needs to hold the entire list of available tools in its head. If you have fifty or a hundred tools, they crowd out space that could be used for actual reasoning. There is also a deeper problem: tools are not composable. You cannot combine a search tool and a calculation tool into a single fluid operation the way you can chain commands in bash.
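For reference, a predefined tool definition in Anthropic's Messages API takes roughly this JSON shape; the searchDatabase tool itself and its parameters are hypothetical:

```json
{
  "name": "searchDatabase",
  "description": "Search the customer database and return matching records.",
  "input_schema": {
    "type": "object",
    "properties": {
      "query": { "type": "string", "description": "Free-text search query" },
      "limit": { "type": "integer", "description": "Maximum records to return" }
    },
    "required": ["query"]
  }
}
```

Every definition like this sits permanently in the context window whether or not the tool is ever called — which is exactly the context cost described above.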
Bash is the opposite end of the spectrum. As established in the previous chapter, bash gives the agent access to the entire software ecosystem of the machine it is running on. It is composable — you can chain commands together. It is memory-efficient — the AI does not need to hold a list of possible commands in its head, because it can run --help on any program to discover what it can do on the fly. This means there is a small latency cost: the agent spends a moment figuring out what tools are available rather than knowing in advance. But in exchange, the agent can handle situations its designer never anticipated. It can search, transform, combine, and verify data in ways that no fixed list of tools could support.
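The on-the-fly discovery described above is literally this cheap — a minimal sketch:

```shell
# Instead of memorizing a tool list, the agent asks any installed
# program to describe itself when the need arises.
grep --help | head -n 5    # what can grep do?
tar --help | head -n 5     # what about tar?

# The same pattern answers "does this tool exist at all?"
command -v awk >/dev/null && echo "awk is available"
```

The moment spent reading `--help` output is the latency cost mentioned above; the payoff is that no capability list ever has to be written, maintained, or held in the context window.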
Code generation is the most powerful and most expensive option. Instead of calling a predefined function or running a one-line command, the agent writes an entire program — in Python, TypeScript, or whatever language fits — and then executes it. This enables extremely dynamic, flexible logic: looping over thousands of records, combining data from multiple APIs, performing statistical calculations, generating charts. The trade-off is time: writing and executing code takes longer than a tool call or a bash command. It may also require linting the generated code first to catch errors before running it, and in compiled languages it requires a compilation step. But when the task is genuinely complex — building a custom data pipeline, transforming a messy dataset, generating a formatted document — code generation is the only tool powerful enough.
Thariq's synthesis is practical and clear. Use predefined tools for atomic actions where you want maximum control and visibility. In Claude Code, the write-file operation is a predefined tool — not a bash command — specifically because Anthropic wants the user to see what is about to be written and approve it before it happens. Any action that is irreversible (you cannot undo it) or externally visible (it sends an email, posts a message, charges a credit card) should be a predefined tool, because that gives you a natural checkpoint where a human can review before proceeding. Use bash for composable, exploratory operations like searching folders, grepping through logs, managing temporary files, or testing whether something works. Use code generation for tasks that require genuine flexibility and intelligence: composing data from multiple APIs, doing analysis, or generating rich output like formatted documents or dashboards.
An audience member asked a smart follow-up: what about preventing context explosion when a tool call returns a huge amount of data? Thariq's answer was simple and practical: save the tool output to a file immediately, and have the tool return only the file path. This way the large result lives on the disk rather than in the context window. The agent can read specific parts of it later when needed, rather than holding all of it in working memory at once.
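A sketch of that pattern in shell, with a `seq` command standing in for the hypothetical tool call that would otherwise dump its full payload into the conversation:

```shell
# Simulate a tool that returns 10,000 records.
out=/tmp/tool-output.txt
seq 1 10000 | sed 's/^/record /' > "$out"

# The tool's reply to the model is just the path and a size hint:
echo "saved $(wc -l < "$out") lines to $out"

# Later, the agent reads only the slice it actually needs:
sed -n '1,3p' "$out"
```

The 10,000 records live on disk; the context window holds one short sentence and, when needed, a three-line excerpt.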
VII. Skills — Progressive Context Disclosure
Skills are one of the newest features of the Claude Agent SDK — released just two weeks before this workshop — and they represent a clever solution to a problem that only becomes visible when you start building more complex agents.
The problem is this: you cannot give the agent all the instructions it might ever need right at the start. If you tried, those instructions would consume enormous amounts of context window space, crowding out room for the actual work. But if you do not give the agent enough instructions, it will not know how to handle specialized tasks. How do you give the agent detailed expertise when it needs it, without loading it down with everything upfront?
The answer is progressive context disclosure — a phrase that simply means showing the agent more information progressively, as it becomes relevant. Skills are how the SDK implements this idea. A skill is essentially a folder of files sitting on the file system. Inside that folder are detailed, expert-level instructions for how to do something specific — how to produce a properly formatted DOCX file, how to do front-end web design, how to call a particular API correctly. When the agent encounters a task that requires that expertise, it reads the skill — it "opens the folder" and loads in those instructions — and then proceeds with full knowledge of how to do the job.
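A skill folder might look something like the sketch below. Anthropic's convention centers on a SKILL.md entry file that tells the agent when and how to use the skill; the docx-export name and the other files here are invented for illustration:

```
skills/
  docx-export/
    SKILL.md        # when to use this skill; read first, on demand
    reference.md    # detailed formatting instructions
    templates/      # example files the instructions point to
```

Only the short description of the skill is visible upfront; the detailed contents enter the context window only when the agent decides it needs them.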
Thariq gives a vivid example of why this is valuable. Anthropic recently released a front-end design skill. It was written by one of Anthropic's best AI engineers who specializes in front-end design, and it encodes everything that engineer knows about producing good interfaces: layout principles, color choices, responsive design patterns, accessibility considerations, how to structure HTML and CSS for AI generation. That knowledge now lives in a skill file, and any agent that loads it instantly has access to the same expertise. Skills are a way of capturing human expertise in a retrievable, composable form.
An audience member asked a good structural question: what is the priority order when a skill file and the main CLAUDE.md file both contain instructions? Thariq's honest answer was that there is not yet a clear best practice — the technology is too new. His working mental model is that CLAUDE.md contains standing rules that always apply, while skills contain situational expertise that the agent loads when it needs it. The right analogy might be a company policy manual (CLAUDE.md, always in force) versus a specialist handbook that a team member opens when they take on a specific type of project (the skill, loaded on demand).
He also fielded a question about whether skills will eventually be absorbed into the model itself — meaning, will future AI models just know how to do all these things without needing separate instruction files? The answer is broadly yes: as models improve, they will handle more tasks without explicit guidance. But skills are the best current solution for tasks that fall outside what the model naturally handles well. Thariq offers a candid and important piece of advice here: revisit and rewrite your agent code every six months, because the underlying model will have changed enough that many of the workarounds and explicit instructions you baked in may no longer be needed, or may even interfere with the model's improved native capabilities. The best agents are lean ones that trust the model to do what it is good at and only add explicit instructions where genuine gaps exist.
VIII. Security and Permissions — The Swiss Cheese Defense
An audience member raises what is probably the most obvious concern about the whole bash-first philosophy: if the agent can run any command on the computer, what stops a malicious actor from hijacking it? This is a legitimate and serious question, and Thariq's answer introduces a framework called the Swiss cheese defense.
The Swiss cheese defense is a classic concept from safety engineering. Imagine a stack of slices of Swiss cheese, each one full of holes. No single slice is solid — each has vulnerabilities. But when you stack enough slices together, the holes rarely line up, and together they block almost everything from passing through. Security in a complex system works the same way: no single layer is perfect, but multiple imperfect layers together create effective protection.
For the Claude Agent SDK, the layers work as follows. At the model layer — inside the AI itself — Anthropic does extensive alignment work. Alignment is the process of training the AI to behave in ways that are safe and beneficial, refusing harmful requests and flagging suspicious instructions. The model has been taught to recognize when it is being manipulated into doing something dangerous. At the harness layer — the software infrastructure around the model — the SDK includes its own permissioning system and safety prompts. Critically, it runs a parser on every bash command the agent tries to execute. A parser is a program that reads and interprets code or commands to understand what they actually do. Because the SDK parses bash commands, it can detect suspicious patterns — like an attempt to delete system files or send data to an unknown server — before they happen. Thariq is emphatic that this parser is not something you want to try to build yourself; it is one of the SDK's most valuable safety features.
The third layer is sandboxing. A sandbox is a contained, isolated environment where code can run without being able to affect anything outside its walls. If a compromised agent runs inside a sandbox with restricted network access and no access to files outside its designated directory, the damage it can do is severely limited. Thariq describes what he calls the "lethal trifecta" of agent security risks: the ability to execute code, the ability to modify the file system, and the ability to exfiltrate data — meaning to secretly copy it and send it to an outside server. If you sandbox the network connection so that the agent cannot make outbound requests to arbitrary servers, the third leg of that trifecta is cut off, and a successful attack becomes far less dangerous.
Thariq also addresses role-based access control — the practice of giving different users or services different levels of permission. His recommendation is to handle this at the infrastructure level, not inside the agent itself. You provision API keys (access credentials) that are scoped to only the operations needed. You can create proxy servers that intercept API requests and enforce policies before they reach the underlying service. And you can take advantage of the model's receptiveness to feedback: if your backend throws an error message — "permission denied to access this record" — the agent will read that error, understand it, and stop trying. The model's built-in ability to read and respond to error messages is itself a security mechanism that can be deliberately designed around.
IX. Live Demo — The Pokémon Agent
To make everything concrete, Thariq builds a live agent from scratch during the workshop. He chooses Pokémon as the domain deliberately — it has a large, publicly available API (a web interface for requesting data programmatically) with thousands of records and complex relationships, simulating the kind of messy real-world data sources his audience deals with in their own businesses. And because he is not himself an expert in competitive Pokémon, the agent has to genuinely figure things out rather than just confirming what Thariq already knows.
His first move is to let Claude Code — the AI coding assistant — build the library for him. He gives it a simple prompt: "Search the PokéAPI, understand how it works, and create a TypeScript library that wraps it." TypeScript is a programming language commonly used for web applications. A library, in programming, is a collection of pre-built functions that make a complex task easier — in this case, instead of the agent having to figure out the raw API every time, it can call a clean function like getByName("pikachu") and get a structured result back. Claude Code explores the API, reads its documentation, and generates a library with functions for fetching Pokémon by name, listing all Pokémon, getting species details, looking up abilities, and so on.
Then comes the critical piece: Claude Code also generates a CLAUDE.md file for this specific project. This CLAUDE.md tells the Pokémon agent the rules of its environment — specifically, that it should use this TypeScript library, that it should write executable scripts in an examples folder to answer queries rather than just answering from memory, and that it should run those scripts and use the actual data. That CLAUDE.md file, combined with the TypeScript library, is essentially the entire agent design. There is no complex orchestration code. There is just a clear instruction file and a clean set of tools. This is what Thariq means when he talks about context engineering: the agent's behavior is shaped largely by what is in its instruction file and what tools are available to it, not by elaborate logic in the surrounding code.
He then demonstrates the difference between the bash-and-code-generation approach and a traditional tools-only approach. For the tools-only version, he pre-defined five functions: getPokemon, getPokemonSpecies, getPokemonAbility, getPokemonType, and getMove. When a user asks "what are the Generation 2 water Pokémon?" the tools-only version struggles — that particular query requires iterating over hundreds of Pokémon and filtering by both generation and type, which is not something any of the five predefined tools was built for. The bash-and-codegen version, by contrast, writes a script, loops through the PokéAPI's list of Pokémon, checks each one's type and generation, and returns the filtered result. It handles a query that was never anticipated in the original design.
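The filtering step of that generated script can be sketched as follows. In the live demo the agent pulls this data from the PokéAPI; here a small in-memory list stands in for it, so the record shape and names are assumptions made for illustration.

```typescript
// Sketch of the kind of throwaway filter script the bash-and-codegen
// agent writes on the fly: iterate over all Pokemon, keep those matching
// both criteria. A small in-memory list stands in for live API results.

interface PokemonRecord {
  name: string;
  generation: number; // e.g. 2 for Gold/Silver-era species
  types: string[];
}

function gen2WaterPokemon(all: PokemonRecord[]): string[] {
  return all
    .filter((p) => p.generation === 2 && p.types.includes("water"))
    .map((p) => p.name);
}

const sample: PokemonRecord[] = [
  { name: "totodile", generation: 2, types: ["water"] },
  { name: "chikorita", generation: 2, types: ["grass"] },
  { name: "squirtle", generation: 1, types: ["water"] },
  { name: "quagsire", generation: 2, types: ["water", "ground"] },
];

console.log(gen2WaterPokemon(sample)); // ["totodile", "quagsire"]
```

The predefined-tools version has no function shaped like this; the codegen version simply writes it when the question demands it.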
The most impressive part of the demo involves competitive Pokémon analysis using data from Smogon, a large online database of competitive Pokémon strategy. This data is not available through the PokéAPI — Thariq simply downloads a Smogon text file and puts it in the project's data folder. No special parsing, no structured database, just a raw text file. When asked "I want to build a team around Venusaur — give me suggestions based on the Smogon data," the agent begins grepping through the text file. It finds Venusaur's profile, then searches for other Pokémon that mention Venusaur as a synergy pick, runs a script to count how often different partners appear together, and assembles a full team recommendation. The agent never needed a Smogon API. It needed bash and a text file. That is the power of the approach: the agent figured out how to extract structured insight from unstructured data, using tools that have existed for decades.
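The partner-counting step can be sketched in a few lines. The text format below is invented for illustration (real Smogon exports look different), but the idea follows the approach described: do a grep-style pass for lines mentioning Venusaur, then tally how often each candidate partner appears on them.

```typescript
// Sketch of the partner-counting script. The input format is invented
// for illustration; the logic mirrors the demo: grep for Venusaur lines,
// then count candidate partner mentions on those lines.

function countPartners(smogonText: string, candidates: string[]): Map<string, number> {
  const counts = new Map<string, number>();
  // Keep only lines mentioning Venusaur, mimicking a `grep -i venusaur` pass.
  const relevant = smogonText
    .split("\n")
    .filter((line) => line.toLowerCase().includes("venusaur"));
  for (const name of candidates) {
    const hits = relevant.filter((line) => line.toLowerCase().includes(name)).length;
    counts.set(name, hits);
  }
  return counts;
}

const text = [
  "Venusaur pairs well with Heatran to cover ice moves.",
  "Heatran appreciates Venusaur absorbing water attacks.",
  "Rotom-Wash is a common pivot.",
].join("\n");

console.log(countPartners(text, ["heatran", "rotom-wash"]));
// heatran appears on 2 Venusaur lines, rotom-wash on 0
```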
X. The Spreadsheet Agent Exercise
Rather than just demonstrating his own agent, Thariq turns the workshop interactive by asking the audience to think through a spreadsheet agent together. A spreadsheet agent is a practical, relatable use case — most businesses run on spreadsheets — and it illustrates the three-phase agent loop (gather context, take action, verify) in a domain everyone understands.
The first question is: how does the agent search a spreadsheet? A spreadsheet is a two-dimensional structure — rows and columns — which makes it trickier than searching a flat text file. The audience brainstorms. Someone suggests converting the file to CSV format (a plain text format where values are separated by commas) and using grep. Good start — the agent can quickly scan for keywords. But Thariq points out the limitation: if a user asks "what is the total revenue in 2026?" the agent needs to find a revenue column and then filter rows by year. Column headers alone, discovered by grep, do not tell you which rows correspond to 2026.
Another suggestion: use AWK, an old and powerful Unix command for processing structured text files. AWK can match patterns in specific columns and perform arithmetic on the fly, making it considerably more capable than grep for tabular data. Then someone offers the most elegant solution: load the CSV into SQLite and query it with SQL. SQLite is a lightweight database engine that runs entirely in-process and stores a whole database in a single file — no server required. SQL is the standard language for querying structured data, and AI models are exceptionally well-trained on SQL because there is so much of it on the internet. Thariq lights up at this suggestion: if you can translate your data source into a SQL interface, the agent can query it with great precision and handle almost any question a user might ask. That transformation step — converting a messy data source into something queryable — is one of the most powerful architectural moves you can make when designing an agentic data interface.
A fourth approach appears when Thariq reminds the audience that XLSX files — the standard Excel format — are actually ZIP archives containing XML files. XML is a structured text format, and you can run path-based queries against it directly, or use libraries that understand the Excel format natively. So you have at least four approaches for a spreadsheet agent: grep on headers, AWK for row-level filtering, a SQLite wrapper, or XML path queries. Thariq's advice: do not stop at the first approach that works. Try multiple strategies, run tests, watch the transcripts of your agent running, and see which one it handles best.
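The "total revenue in 2026" query from above can be sketched as the kind of plain-TypeScript helper an agent might generate. The column names (`year`, `revenue`) are assumptions about the spreadsheet; the point is the two-step logic that grep alone misses: locate the right column by header, then filter rows by a different column.

```typescript
// Sketch of the two-step query grep cannot do alone: find the revenue
// column by header, then filter rows by the year column and sum.
// Column names are assumptions for illustration.

function totalRevenueForYear(csv: string, year: number): number {
  const [headerLine, ...rows] = csv.trim().split("\n");
  const headers = headerLine.split(",").map((h) => h.trim().toLowerCase());
  const yearCol = headers.indexOf("year");
  const revenueCol = headers.indexOf("revenue");
  if (yearCol === -1 || revenueCol === -1) throw new Error("missing column");
  return rows
    .map((r) => r.split(","))
    .filter((cells) => Number(cells[yearCol]) === year)
    .reduce((sum, cells) => sum + Number(cells[revenueCol]), 0);
}

const sheet = `
year,region,revenue
2025,EU,1000
2026,EU,1500
2026,US,2500
`;

console.log(totalRevenueForYear(sheet, 2026)); // 4000
```

The SQLite approach expresses the same logic as `SELECT SUM(revenue) FROM sheet WHERE year = 2026`, which is exactly why handing the agent a SQL interface is so powerful.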
For the action phase — making changes to a spreadsheet — the interfaces look remarkably similar to the search interfaces. Inserting a row looks like writing a 2D array. Running a SQL UPDATE modifies records. Editing the underlying XML works for both reading and writing. The symmetry between reading and writing is a useful design principle: if you built a good search interface, you likely already have most of what you need for the action interface.
Verification for a spreadsheet agent requires specific attention. The agent should check for null values — empty cells where data is expected. It should look for anomalies, like a cell that changed unexpectedly or a total that does not add up. One powerful pattern Thariq describes is the adversarial sub-agent: after the main agent finishes its work, spin up a second agent with no knowledge of what the first agent did, show it the output, and ask it to critique the results. Because the second agent has no attachment to the work, it will notice inconsistencies and errors that the first agent might have rationalized away. As AI reasoning improves, this adversarial verification step becomes increasingly reliable.
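The deterministic checks described above (empty cells, totals that must add up) can be sketched as small pure functions. The cell layout here is an assumption for illustration, not a format from the workshop.

```typescript
// Sketch of the spreadsheet verification checks: flag empty cells and
// confirm a stated total matches the sum of its rows.

type Cell = string | number | null;

function findNullCells(rows: Cell[][]): Array<[number, number]> {
  const nulls: Array<[number, number]> = [];
  rows.forEach((row, r) =>
    row.forEach((cell, c) => {
      if (cell === null || cell === "") nulls.push([r, c]);
    }),
  );
  return nulls;
}

function totalMatches(values: number[], statedTotal: number, epsilon = 1e-9): boolean {
  const sum = values.reduce((a, b) => a + b, 0);
  return Math.abs(sum - statedTotal) <= epsilon;
}

const rows: Cell[][] = [
  ["Q1", 100],
  ["Q2", null], // missing value the verifier should flag
  ["Q3", 250],
];

console.log(findNullCells(rows)); // [[1, 1]]
console.log(totalMatches([100, 250], 350)); // true
```

Checks like these are exactly the kind of thing an adversarial sub-agent, or a hook, can run mechanically after every change.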
An audience member raises the critical issue of reversibility — what if the agent deletes an entire spreadsheet? Thariq treats this as a fundamental design question, not just a technical one. How reversible is the domain you are building in? Code is highly reversible because every change is tracked by git, a version control system that records the complete history of every modification. Claude Code uses git constantly — Thariq mentions that he no longer types git commands himself; the agent handles them. Computer use agents that click around a graphical interface are nearly impossible to reverse, because every click compounds the state. A spreadsheet agent sits somewhere in between. His recommendation: design for reversibility from the start. Store state checkpoints so that if something goes wrong, the user can say "go back to before that change." He mentions that someone built a "time travel" tool for a coding agent that allows exactly this kind of rollback. The underlying principle — that agents should be able to undo their actions, at least for critical operations — applies to any domain where mistakes are possible, which is every domain.
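The "design for reversibility" advice can be sketched as a minimal checkpoint store: snapshot state before each risky operation so the user can say "go back to before that change." This is a generic pattern, not code from the workshop.

```typescript
// Minimal checkpoint store: snapshot state before risky operations,
// restore it on request. A generic sketch of the reversibility advice.

class CheckpointStore<T> {
  private snapshots: Array<{ label: string; state: T }> = [];

  // Capture state *before* a risky operation. structuredClone guards
  // against later mutations leaking into the saved snapshot.
  save(label: string, state: T): void {
    this.snapshots.push({ label, state: structuredClone(state) });
  }

  // Roll back to the most recent checkpoint with the given label.
  restore(label: string): T {
    for (let i = this.snapshots.length - 1; i >= 0; i--) {
      if (this.snapshots[i].label === label) {
        return structuredClone(this.snapshots[i].state);
      }
    }
    throw new Error(`No checkpoint named "${label}"`);
  }
}

const store = new CheckpointStore<{ cells: number[] }>();
let sheet = { cells: [1, 2, 3] };
store.save("before-delete", sheet);
sheet = { cells: [] }; // the agent makes a destructive change
sheet = store.restore("before-delete"); // "time travel" back
console.log(sheet.cells); // [1, 2, 3]
```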
XI. Sub-Agents and Context Management
An audience member asks a question that gets at one of the most important practical constraints of AI agents: "Do all agents share the same context window?" The answer leads Thariq into one of the most technically interesting parts of the workshop.
To restate what the context window is: it is the AI's working memory. Everything the agent is currently "thinking about" — its instructions, the conversation history, the results of tool calls, the files it has read, the code it has written — must fit within this window. Most current AI models have context windows measured in hundreds of thousands of tokens (a token is roughly a word or a few characters). That sounds large, but a long-running agent can burn through it quickly: read a hundred files, run a dozen scripts, review the output of each one, and you have consumed an enormous amount of memory. When the context window fills up, the agent loses its earlier thinking unless something is done to preserve the important parts.
Sub-agents are one of the most important tools for managing context. A sub-agent is simply a fresh AI instance that the main agent spawns to handle a specific sub-task. The sub-agent starts with a clean context window, does its work, and returns just the answer — a distilled summary of what it found — back to the main agent. The main agent never has to hold all the exploratory, in-progress thinking of the sub-task in its own memory. It just gets the result.
Thariq gives the spreadsheet example: if you have a workbook with three sheets and need to analyze all three, the main agent can spawn three sub-agents in parallel — one for each sheet. Each sub-agent reads its sheet in full, reasons over it, and returns a summary. The main agent then reasons over three summaries rather than three full sheets. The expensive exploratory work is distributed; the main agent stays lean. Running sub-agents in parallel means all three can work simultaneously, which also reduces total time.
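The fan-out pattern can be sketched as follows. Here `summarizeSheet` is a stand-in for a real sub-agent call (a fresh model invocation with its own clean context); it is stubbed out so the orchestration logic is self-contained.

```typescript
// Sketch of the fan-out pattern: one sub-agent per sheet, run in
// parallel; the main agent reasons only over the returned summaries.
// `summarizeSheet` is a stub standing in for a real sub-agent call.

interface Sheet {
  name: string;
  rows: number[][];
}

async function summarizeSheet(sheet: Sheet): Promise<string> {
  // A real sub-agent would load the full sheet into its own fresh
  // context window and return only a distilled answer.
  const total = sheet.rows.flat().reduce((a, b) => a + b, 0);
  return `${sheet.name}: ${sheet.rows.length} rows, total ${total}`;
}

async function analyzeWorkbook(sheets: Sheet[]): Promise<string[]> {
  // Promise.all runs all sub-agents concurrently; the main agent never
  // holds any sheet's full contents, only the short summaries.
  return Promise.all(sheets.map(summarizeSheet));
}

const workbook: Sheet[] = [
  { name: "revenue", rows: [[100, 200]] },
  { name: "costs", rows: [[50], [25]] },
  { name: "forecast", rows: [[300]] },
];

analyzeWorkbook(workbook).then((summaries) => console.log(summaries));
```

The main agent's context cost here is three short strings instead of three full sheets, which is the entire point of the pattern.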
This sounds straightforward, but Thariq points out that parallel sub-agents with bash access are actually one of the harder engineering problems in agent design. Race conditions — situations where two processes try to read or write the same file at the same moment and produce corrupted results — are a real risk. File system conflicts require careful coordination. The Claude Agent SDK handles this complexity internally, using architecture borrowed directly from Claude Code's multi-agent implementation. Thariq says he has not seen another framework handle parallel bash sub-agents as robustly.
Context management also has a direct user experience dimension. Thariq observes that when he talks to advanced Claude Code users, some of them mention being on their "fifth compact." Compaction, also called context compression, is the process of summarizing the older parts of the conversation to free up room for new work. The summary replaces the full history, preserving the key facts while discarding the detailed back-and-forth. Thariq himself says he compacts rarely — for coding work, the state lives in the files of the codebase, not in the chat history. If he finishes a task and starts a new one, Claude Code can look at the git diff (a summary of what changed in the code) and understand the current state without needing the full conversation history.
For agents deployed to non-technical users — a customer service agent, a spreadsheet assistant — the context management challenge is more complex, because users do not understand the concept of a context window and should not need to. The UX design challenge is: can you reset or compress the conversation state gracefully without the user noticing? In a spreadsheet agent, most of the state lives in the spreadsheet itself, not in the conversation history, so the agent probably does not need to carry much forward between turns. Storing user preferences to a file as they are expressed — so the agent remembers them in future sessions without needing the full transcript — is one practical approach. The fundamental goal is to minimize context usage without degrading the user's experience.
XII. Hooks — Deterministic Verification
An audience member asks about hooks, and Thariq says he has not talked about them enough. Hooks are one of the most underappreciated features in the SDK, and they solve a problem that becomes very apparent once your agent is running in the real world: the agent is probabilistic, meaning it does not always behave the same way, but sometimes you need certain things to happen with absolute certainty.
A hook is a piece of code that executes automatically at a specific moment in the agent's operation — before a tool call, after a tool call, when a response is generated. You register a hook handler in the SDK, and every time that event fires, your handler runs. Unlike the AI's own reasoning — which is flexible and sometimes unpredictable — hooks are deterministic. They always run. They always do exactly what you programmed them to do.
Thariq gives several examples that make the concept immediately concrete. Suppose you have a spreadsheet agent and you want to run a validation check after every change the agent makes — confirming that totals still add up, that no required cells are empty, that values are within expected ranges. You register a hook that fires after every tool call and runs that validation. The agent cannot skip it. It runs every time, no matter what the agent just did. Another example: the user is editing a spreadsheet in real time while the agent is also working on it. A hook can detect that the file has changed externally, pause the agent, and inject a message into the context saying "the user has modified column C — please re-read the current state before proceeding." The agent gets live updates about its environment without needing to poll for them constantly.
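The mechanism can be sketched as a tiny hook registry. The event names and handler shape below are illustrative, not the Claude Agent SDK's real hook API, but the guarantee is the same: registered handlers always run, and the model cannot skip them.

```typescript
// Sketch of a deterministic hook registry. Event names and handler
// shapes are illustrative, not the real SDK API.

type HookEvent = "postToolCall" | "beforeResponse";

interface HookContext {
  toolName?: string;
  feedback: string[]; // messages injected back into the agent's context
}

class HookRegistry {
  private handlers = new Map<HookEvent, Array<(ctx: HookContext) => void>>();

  on(event: HookEvent, handler: (ctx: HookContext) => void): void {
    const list = this.handlers.get(event) ?? [];
    list.push(handler);
    this.handlers.set(event, list);
  }

  fire(event: HookEvent, ctx: HookContext): void {
    // Deterministic: every registered handler runs, every time.
    for (const handler of this.handlers.get(event) ?? []) handler(ctx);
  }
}

const hooks = new HookRegistry();

// Validation hook: after any spreadsheet edit, re-check the totals.
hooks.on("postToolCall", (ctx) => {
  if (ctx.toolName === "editSpreadsheet") {
    ctx.feedback.push("Validation ran: totals re-checked after edit.");
  }
});

const ctx: HookContext = { toolName: "editSpreadsheet", feedback: [] };
hooks.fire("postToolCall", ctx);
console.log(ctx.feedback); // ["Validation ran: totals re-checked after edit."]
```

Because `fire` is plain code, not a model decision, the validation message lands in the agent's context on every edit regardless of what the model "wanted" to do.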
Hooks also address one of the specific failure modes Thariq observed during his Pokémon demo. The Pokémon domain is what engineers call "in-distribution" for the model — meaning the model was trained on a lot of Pokémon data and has many facts memorized. Sometimes the agent would simply answer from memory instead of running a script to check the actual data. This is a problem because the agent might be answering based on stale training data, not the up-to-date Smogon file on disk. A hook can intercept the response before it reaches the user, check whether the response is based on an actual script execution, and if not, inject feedback: "Please make sure you wrote a script and read from the actual data file." The agent gets that feedback, revises its approach, and tries again. This is deterministic behavior imposed on top of probabilistic reasoning — and it is a much more reliable solution than trying to prevent the behavior through prompt engineering alone.
Thariq points to Claude Code's own use of hooks as an illustration. Claude Code fires a hook whenever the agent tries to write to a file it has not yet read. The hook throws an error: "You have not read this file yet. Try reading it first." This prevents a common and costly mistake — the agent overwriting a file based on assumptions rather than actual knowledge of its current contents. That one deterministic rule saves countless hours of debugging incorrect overwrites. Hooks, in this sense, are the mechanism by which you can take the best practices you discover through observation and hard-code them into the system so they are always enforced, regardless of how the model is feeling in any given session.
XIII. Q&A — Scaling, Reproducibility, and the Real World
The final part of the workshop opens up into a Q&A that covers the practical questions any developer or business person will face when they move from the workshop to actually shipping something. The questions reveal the gap between the theoretical elegance of the agent architecture and the messy reality of production systems.
From Prototype to Production
The first question is the most fundamental: you have built a prototype that works. Now how do you make it reproducible and reliable? Thariq's answer starts with CLAUDE.md. The instruction file is the single most important place to encode what you learned during prototyping. During his own Pokémon demo, the agent ignored his pre-built TypeScript library on one run and wrote plain JavaScript instead. That happened because CLAUDE.md did not explicitly say "you must use this library." The lesson: anything the agent did wrong in your prototype that was caused by ambiguous instructions should be fixed by adding a clear, explicit rule to CLAUDE.md. The instruction file should grow as you discover edge cases, not just serve as a vague introduction.
The second part of the answer is helper scripts. When you find an approach that works — say, loading the CSV into SQLite and querying it — write a script that does that setup, and note in CLAUDE.md that the script exists and when to use it. Good helper scripts serve two purposes: they give the agent a fast, reliable path to the right approach, and they document what you learned during prototyping so future agents (and future versions of yourself) can benefit from it.
When the Agent Cheats
An audience member describes a maddening experience: their Pokémon agent tries to run a script, fails a couple of times, and then simply returns a table of Pokémon statistics from memory. The agent stopped trying to use the actual data and started making things up from its training data. Thariq acknowledges this as a genuine challenge. The model's training data is its background knowledge, and for popular domains like Pokémon — or common business topics — that background knowledge is often close enough to correct that the model takes the shortcut. This is called being "in-distribution": the task is so similar to what the model has seen before that it does not feel the need to go look things up.
The solution Thariq recommends is hooks, as discussed in the previous chapter. You write a hook that intercepts every response before it reaches the user and checks for evidence that the agent actually ran a script. If no script was run, the hook injects feedback: "You must write a script and use the actual data. Do not answer from memory." This is a deterministic guardrail that makes cheating impossible rather than just unlikely. No amount of prompting alone can prevent this behavior reliably — the model will sometimes ignore instructions when its confidence in its own knowledge is high. A hook cannot be ignored.
Massive Codebases
A developer raises a sobering practical problem: they are working with a codebase of over fifty million lines of code. Standard tools like grep become painfully slow at that scale. They are building their own semantic indexing system — a search system that finds code based on meaning rather than just exact text matches — and they want to know if Anthropic is planning to make something like that native to the SDK.
Thariq gives a candid, two-part answer. First, the honest part: yes, anything you build for this problem today may well be made obsolete by improvements in the tools over the next six to twelve months. The field is moving fast enough that custom infrastructure built to compensate for current limitations often becomes unnecessary. Second, the practical part: Anthropic does not use semantic indexing for Claude Code itself. Instead, they rely on well-structured CLAUDE.md files that tell the agent where things live in the codebase — which directories contain which kinds of code, what the naming conventions are — so the agent can navigate efficiently without needing to search everything. And they typically start the agent in a specific subdirectory relevant to the task, rather than dropping it at the root of a fifty-million-line repository and expecting it to find its way. Good context engineering — telling the agent where to look — often outperforms sophisticated search infrastructure. That said, Thariq is careful not to claim this fully solves the problem. For truly massive codebases, this remains open territory.
Monetization
The final question is about money. Agents are expensive to run — they make many API calls, they run for a long time, and the cost per task can be significant. With all the cost flowing back to Anthropic, how are developers supposed to build a viable business?
Thariq's answer is the classic software business advice applied to a new context: find problems that are genuinely hard and genuinely valuable, and charge accordingly. The economics of agents work when the problem they solve is expensive enough that the per-task cost is a bargain. If an agent saves a lawyer four hours of document review, charging twenty dollars for that task is not just viable — it is a steal. The failures come from trying to charge tiny amounts for tasks that are not valuable enough to justify the cost. On pricing structure, Thariq notes that Claude Code uses a hybrid model: a flat monthly subscription for standard usage, with usage-based pricing for heavy users who exceed the limit. This balances predictability (users know roughly what they will pay) with fairness (heavy users pay more). He advises thinking about pricing structure early, because it is much harder to change once users have come to expect a particular model.
The workshop ends with applause. Thariq's central message — that the right architecture for AI agents is the Unix philosophy of composable, file-first, bash-powered tools — may feel counterintuitive at first. It seems too simple, too old-fashioned, too reliant on technology that predates AI by decades. But that is precisely the point. The most durable tools are the ones that have survived the longest. The file system and the command line have survived because they are genuinely general-purpose. Giving an AI access to them does not limit what the AI can do — it amplifies it.