Claude Code: Anthropic's CLI Agent
Table of Contents
- Introduction
- I. Origins of Claude Code
- II. Anthropic's Product Philosophy
- III. What Should Go into Claude Code?
- IV. CLAUDE.md and Memory Simplification
- V. Claude Code vs Aider
- VI. Parallel Workflows and Unix Utility Philosophy
- VII. Cost Considerations and Pricing Model
- VIII. Key Features Shipped Since Launch
- IX. Claude Code Writes 80% of Claude Code
- X. Custom Slash Commands and MCP Integration
- XI. Terminal UX and Technical Stack
- XII. Code Review and Semantic Linting
- XIII. Non-Interactive Mode and Automation
- XIV. Engineering Productivity Metrics
- XV. Balancing Feature Creation and Maintenance
- XVI. Memory and the Future of Context
- XVII. Sandboxing, Branching, and Agent Planning
- XVIII. Future Roadmap
- XIX. Why Anthropic Excels at Developer Tools
Introduction
ALESSIO: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO at Decibel, and I'm joined by my co-host Swyx, founder of Small AI.
SWYX: Hey, and today we're in the studio with Cat Woo and Boris Cherny. Welcome.
BORIS: Thanks for having us.
CAT: Thank you.
SWYX: Cat, you and I know each other from before: Dagster, then Index Ventures, and now Anthropic.
CAT: Exactly. It's so cool to see a friend you know from before now working at Anthropic and shipping really cool stuff.
SWYX: And Boris, you're a celebrity now: we were just outside getting coffee and people recognized you from your video.
BORIS: Oh wow. That's new. I definitely had that experience like once or twice in the last few weeks. It was surprising.
SWYX: Well, thank you for making the time. We're here to talk about Claude Code. Most people have probably heard of it, and we think quite a few have tried it, but let's get a crisp definition up front: what is Claude Code?
BORIS: So Claude Code is Claude in the terminal. Claude has a bunch of different interfaces. There's desktop, there's web, and yeah, Claude Code runs in your terminal. Because it runs in the terminal, it has access to a bunch of stuff that you just don't get if you're running on the web or on desktop or whatever. So it can run bash commands, it can see all of the files in the current directory, and it does all that agentically. I guess it maybe comes back to the question under the question — where did this idea come from? Part of it was we just want to learn how people use agents. We're doing this with the CLI form factor because coding is kind of a natural place where people use agents today, and there's kind of product market fit for this thing. But yeah, it's just sort of this crazy research project. Obviously it's kind of bare bones and simple, but yeah, it's an agent in your terminal.
SWYX: That's how the best stuff starts.
I. Origins of Claude Code
ALESSIO: How did it start? Did you have a master plan to build Claude Code?
BORIS: There's no master plan. When I joined Anthropic, I was experimenting with different ways to use the model kind of in different places. And the way I was doing that was through the public API, the same API that everyone else has access to. And one of the really weird experiments was this Claude that runs in a terminal. And I was using it for kind of weird stuff — I was using it to look at what music I was listening to and react to that, and then screenshot my video player and explain what's happening there and things like this. And this was a pretty quick thing to build and it was pretty fun to play around with.
And then at some point I gave it access to the terminal and the ability to code, and suddenly it just felt very useful — I was using this thing every day. It kind of expanded from there. We gave the core team access and they all started using it every day, which was pretty surprising. And then we gave all the engineers and researchers at Anthropic access and pretty soon everyone was using it every day. And I remember we had this DAU chart for internal users and I was just watching it and it was vertical — like for days — and we're like, "All right, there's something here. We got to give this to external people so everyone else can try this too."
ALESSIO: And Cat, were you already working with Boris, or did this come out and start growing before you decided, "Okay, we need to make this a team, so to speak"?
CAT: Yeah, the original team was Boris, Sid, and Ben. And over time, as more people were adopting the tool, we felt like, okay, we really have to invest in supporting it because all our researchers are using it and this is like our one lever to make them really productive. And so at that point I was using Claude Code to build some visualizations. I was analyzing a bunch of data and sometimes it's super useful to spin up a Streamlit and see all the aggregate stats at once. Claude Code made it really really easy to do. So I think I sent Boris a bunch of feedback and at some point Boris was like, "Do you want to just work on this?"
BORIS: It was actually more than that on my side. You were sending all this feedback and at the same time we were looking for a PM and we were looking at a few people. And then I remember telling the manager, "Hey, I want Cat."
SWYX: I'm sure people are curious — what's the process within Anthropic to graduate one of these projects? So you have kind of a lot of growth, then you get a PM. When did you decide, okay, it's ready to be opened up?
II. Anthropic's Product Philosophy
CAT: Generally at Anthropic we have this product principle of "do the simple thing first" and I think the way we build product is really based on that principle. So you kind of staff things as little as you can and keep things as scrappy as you can because the constraints are actually pretty helpful. And for this case, we wanted to see some signs of product market fit before we scaled it.
SWYX: I imagine MCP also now has a team around it in much the same way. It is now very much officially an Anthropic product. So I'm curious, Cat — how do you view PM-ing something like this?
CAT: With a pretty light touch, I think. Boris and the team are extremely strong product thinkers, and for the vast majority of the features on our roadmap, it's actually just people building the thing they wish the product had. So very little is top-down. I feel like I'm mainly there to clear the path if anything gets in the way, and to make sure we're all good to go from a legal, marketing, etc. perspective.
And then for the very broad, long-term roadmap, the whole team comes together and thinks about, "Okay, what do we think models will be really good at in 3 months? Let's make sure that what we're building is compatible with where model capabilities are heading."
SWYX: I'd be interested to double click on this. What will models be good at in 3 months? Because I think that's something that people always say to think about when building AI products, but nobody knows how to think about it. Everyone's just like, "It's generically getting better all the time. We're getting AGI soon, so don't bother." How do you calibrate 3 months of progress?
CAT: I think if you look back historically, we tend to ship models every couple of months or so. So 3 months is just an arbitrary number that I picked. I think the direction that we want our models to go in is being able to accomplish more and more complex tasks with as much autonomy as possible. And so this includes things like making sure that the models are able to explore and find the right information that they need to accomplish a task. Making sure that models are thorough in accomplishing every aspect of a task. Making sure the models can compose different tools together effectively. These are directions we care about.
BORIS: Coming back to Code, this kind of approach affected the way that we built Code also, because we know that if we want some product that has very broad product market fit today, we would build a Cursor or a Windsurf or something like this. These are awesome products that so many people use every day. I use them. That's not the product that we want to build. We want to build something that's kind of much earlier on that curve and something that will maybe be a big product a year from now or however much time from now as the model improves. And that's why Code runs in a terminal. It's a lot more bare bones. You have raw access to the model because we didn't spend time building all this kind of nice UI and scaffolding on top of it.
III. What Should Go into Claude Code?
SWYX: When it comes to the harness, so to speak, and the things you want to put around it: there's prompt optimization, for example. I use Cursor every day, and there's a lot going on in Cursor beyond my prompt, optimization and whatnot. But I know you recently released the context-compacting feature and all that. How do you decide how thick the layer on top of the CLI needs to be? And at what point do you decide, "Okay, this should be part of Claude Code" versus "this is for the IDE people to figure out"?
BORIS: There's kind of three layers at which we can build something. Being an AI company, the most natural way to build anything is to just build it into the model and have the model do the behavior. The next layer is probably scaffolding on top — Claude Code itself. And then the layer after that is using Claude Code as a tool in a broader workflow. So for example, a lot of people use Code with tmux to manage a bunch of windows and sessions happening in parallel. We don't need to build all of that in.
Compact — it's sort of this thing that has to live in the middle because it's something that we want to work when you use Code. You shouldn't have to pull in extra tools on top of it. And rewriting memory in this way isn't something the model can do today. So you have to use a tool for it. We tried a bunch of different options for compacting — rewriting old tool calls, truncating old messages and not new messages. And then in the end we actually just did the simplest thing: ask Claude to summarize the previous messages and just return that. And that's it. It's funny — when the model is so good, the simple thing usually works. You don't have to over-engineer it.
IV. CLAUDE.md and Memory Simplification
SWYX: And then you have the CLAUDE.md file for the more user-driven memories so to speak — kind of like the equivalent of maybe Cursor rules.
BORIS: CLAUDE.md is another example of this idea of "do the simple thing first." We had all these crazy ideas about memory architectures, and there's so much literature about this. There's so many different external products about this, and we wanted to be inspired by all the stuff. But in the end the thing we did is ship the simplest thing — it's a file that has some stuff and it's auto-read into context. And there's now a few versions of this file. You can put it in the root or you can put it in child directories or you can put it in your home directory, and we'll read all of these in kind of different ways. But yeah, simplest thing that could work.
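For concreteness, here is what one of these files might look like; the contents below are illustrative, not an Anthropic default:

```markdown
<!-- CLAUDE.md at the repo root; child directories and your home
     directory can hold their own copies, which get read in too -->
- Build with `npm run build`; run tests with `npm test`.
- Use our internal fetch wrapper, not the built-in network library.
- Keep commits small; commit after every logical change.
```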
V. Claude Code vs Aider
SWYX: I'm sure you're familiar with Aider, which is another tool that people in our Discord loved. And when Claude Code came out, the same people loved Claude Code. Any thoughts on inspiration you took from it, things you did differently, design principles where you went a different way?
BORIS: This is actually the moment I got AGI-pilled, and it's related to this. So Aider inspired this internal tool that we used to have at Anthropic called Clyde. So Clyde is like CLI Claude. And that's the predecessor to Claude Code. It's kind of this research tool that's written using Python. It takes like a minute to start up. It's very much written by researchers — it's not a polished product.
And when I first joined Anthropic, I was putting up my first pull request. I hand-wrote this pull request because I didn't know any better. And my boot camp buddy at the time, Adam Wolf, was like, "You know, actually, maybe instead of handwriting it, just ask Clyde to write it." And I was like, "Okay, I guess so. It's an AI lab. Maybe there's some capability I didn't know about." And so I start up this terminal tool. It took like a minute to start up. And I asked Claude, "Hey, here's the description. Can you make a PR for me?" And after a few minutes of chugging along, it made a PR and it worked.
And I was just blown away because I had no idea. I had just no clue that there were tools that could do this kind of thing. I thought that single-line autocomplete was the state of the art before I joined. And then that's the moment where I got AGI-pilled, and that's where Code came from. So yeah, Aider inspired Clyde which inspired Claude Code. Very much big fan of Aider. It's an awesome product.
VI. Parallel Workflows and Unix Utility Philosophy
SWYX: I think people are interested in comparing and contrasting, obviously, and in figuring out how to choose between tools: there are the Cursors of the world, the Devins of the world, Aider, and Claude Code. Where do you place it in the universe of options?
BORIS: We use all these tools in house too. We're big fans of all this stuff. Claude Code is obviously a little different from some of these other tools in that it's a lot more raw. Like I said, there isn't a big, beautiful UI on top of it. It's raw access to the model, as raw as it gets. So if you want a power tool that gives you the model directly and lets you automate big workloads, for example, if you have a thousand lint violations and you want to start a thousand instances of Claude, have each fix one violation, and then make a PR, then Claude Code is a pretty good tool.
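A hedged sketch of what that fan-out can look like; violations.txt and the prompt are hypothetical, and claude -p is the non-interactive mode discussed later in the conversation:

```bash
# One Claude instance per offending file, run in parallel (sketch).
# violations.txt is a hypothetical list with one file path per line.
while read -r file; do
  claude -p "Fix the lint violations in $file, then re-run the linter to verify." &
done < violations.txt
wait   # in practice, batch these and give each instance its own checkout
```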
ALESSIO: It's a tool for power workloads, for power users. On the idea of parallel versus single path: the IDE is really focused on the one thing you're doing, whereas with Claude Code you see it more as needing less supervision, and you can spin up a lot of them. Is that the right mental model?
BORIS: Yeah. And there are some people at Anthropic who have been racking up thousands of dollars a day with this kind of automation. Most people don't do anything like that, but you totally could. We think of it as a Unix utility: the same way you would compose grep or cat or other tools, you can compose Code into workflows.
VII. Cost Considerations and Pricing Model
ALESSIO: The cost thing is interesting. Do people pay internally or do you get it for free?
BORIS: If you work at Anthropic, you can just run this thing as much as you want every day. It's for free internally.
SWYX: I think if everybody had it for free, it would be huge. Because if I think about it, I pay Cursor 20 bucks a month. I use millions and millions of tokens in Cursor that would cost me a lot more in Claude Code. And so a lot of people that I've talked to don't actually understand how much it costs to do these things. They'll do a task and they're like, "Oh, that cost 20 cents. I can't believe I paid that much." How do you think about that, going back to the product side? How much do you think of that being your responsibility to try and make it more efficient versus that's not really what you're trying to do with the tool?
CAT: We really see Claude Code as the tool that gives you the smartest abilities of the model. We do care about cost insofar as it's very correlated with latency, and we want to make sure this tool is extremely snappy to use and extremely thorough in its work. We want to be very intentional about every token it produces. I think we can do more to communicate costs to users. Currently we're seeing costs around $6 per day per active user. So over the course of a month it does come out a bit higher than Cursor, but I don't think it's out of band, and that's roughly how we're thinking about it.
BORIS: I would add that the way I think about it is it's an ROI question, not a cost question. If you think about an average engineer salary — engineers are very expensive. And if you can make an engineer 50–70% more productive, that's worth a lot. I think that's the way to think about it.
SWYX: So if you're targeting the most powerful end of the spectrum with Claude Code, as opposed to the less powerful but faster, cheaper side, then there's typically a waterfall: you try the fast, simple one, that doesn't work, you upgrade and upgrade, and finally you hit Claude Code. At least for people who are token-constrained and don't work at Anthropic.
VIII. Key Features Shipped Since Launch
SWYX: How about we recap the brief history of Claude Code between when you launched and now? There have been quite a few ships. How would you highlight the major ones?
CAT: I think a big one that we've gotten a lot of requests for is web fetch. We worked really closely with our legal team to make sure we shipped as secure an implementation as possible. So we'll fetch a URL if the user directly provides it, whether in their CLAUDE.md or in their message, or if a URL is mentioned in content that was previously fetched. That way enterprises can feel pretty secure about letting their developers continue to use it.
We shipped a bunch of auto features. Autocomplete, where you can press tab to complete a file name or file path. Auto-compact, so users feel like they have infinite context, since we compact behind the scenes. And auto-accept, because we noticed a lot of users were saying, "Hey, Claude Code can figure it out. I've developed a lot of trust in it. I want it to autonomously edit my files, run tests, and come back to me later." Those are some of the big ones. Also Vim mode and custom slash commands; people love Vim mode, that was a top request. And memory: the hashtag to remember things.
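The hashtag feature is as lightweight as it sounds: prefix a message with # in an interactive session and Claude Code saves it as a memory. An illustrative example:

```
> # always use pnpm, not npm, in this repo
```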
IX. Claude Code Writes 80% of Claude Code
SWYX: On the technical side, Paul from Aider always shares how much of Aider was coded by Aider. So the question is: how much of Claude Code was coded by Claude Code?
BORIS: Pretty high. Probably near 80%, I'd say.
SWYX: That's very high.
BORIS: A lot of human code review though. A lot of human code review. Some of the stuff has to be handwritten and some of the code can be written by Claude. And there's sort of a wisdom in knowing which one to pick and what percent for each kind of task. Usually where we start is Claude writes the code and then if it's not good, then maybe a human will dive in. There's also some stuff where I actually prefer to do it by hand — like intricate data model refactoring or something. I won't leave it to Claude because I have really strong opinions and it's easier to just do it and experiment than it is to explain it to Claude. So yeah, I think that nets out to maybe 80–90% Claude-written code overall.
X. Custom Slash Commands and MCP Integration
ALESSIO: I had a question about custom slash commands. How do you think about custom slash commands and MCP, and how does it all tie together? Is the slash command in Claude Code kind of an extension of MCP? Are people building things that shouldn't be MCP servers and are just self-contained? How should people think about it?
BORIS: Obviously we're big fans of MCP. You can use MCP to do a lot of different things: custom tools, custom commands, all this stuff. But at the same time, you shouldn't have to use it. If you just want something really simple and local, essentially a prompt that's been saved, just use local commands for that.
Over time, something that we've been thinking a lot about is how to re-expose things in convenient ways. So for example, let's say you had this local command. Could you re-expose that as an MCP prompt? Because Claude Code is an MCP client and an MCP server. Or let's say you pass in a custom bash tool. Is there a way to re-expose that as an MCP tool? We think generally you shouldn't have to be tied to a particular technology. You should use whatever works for you.
ALESSIO: Because there's Puppeteer, for example, which is a great thing to use with Claude Code for testing. There's a Puppeteer MCP server, but people can also write their own slash commands. I'm curious where MCP is going to end up: maybe each slash command leverages MCP servers, but no command is itself an MCP server, because each one ends up being customized.
BORIS: I think for something like Puppeteer, that probably belongs in MCP because there's a few tool calls that go into that. So it's probably nice to encapsulate that in the MCP server. Whereas slash commands are actually just prompts — they're not actually tools. We're thinking about how to expose more customizability options so that people can bring their own tools or turn off some of the tools that Claude Code comes with. But there's also some trickiness because we want to make sure the tools people bring are things that Claude is able to understand and that people don't accidentally inhibit their experience by bringing a tool that is confusing to Claude.
I'll give an example of how this stuff connects for Claude Code. Internally in the GitHub repo, we have this GitHub action that runs. The GitHub action invokes Claude Code with a local slash command. The slash command is "lint." So it just runs a linter using Claude. It checks for a bunch of things that are pretty tricky to do with a traditional linter based on static analysis — for example, it'll check for spelling mistakes, but also checks that code matches comments. It also checks that we use a particular library for network fetches instead of the built-in library. There's a bunch of specific things that are pretty difficult to express just with lint.
In theory, you can go and write a bunch of lint rules for this. Some of it you could cover, some of it you probably couldn't. But honestly, it's much easier to just write one bullet in markdown in a local command and commit that. So what we do is Claude runs through the GitHub action. We invoke it with /project:lint. It'll run the linter, identify any mistakes, make the code changes, and then use the GitHub MCP server to commit the changes back to the PR. And so you can compose these tools together. That's a lot of the way we think about Code — just one tool in an ecosystem that composes nicely without being opinionated about any particular piece.
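Since project slash commands are just saved prompts, the lint command can be a short markdown file. A sketch: the .claude/commands/ location is our assumption, and the bullets simply mirror the checks Boris describes above.

```markdown
<!-- .claude/commands/lint.md, invoked as /project:lint -->
Review the files changed in this PR and fix:
- spelling mistakes in comments and strings
- comments that no longer describe the code next to them
- network fetches that bypass our internal fetch library
Apply the fixes, then commit them back to the PR.
```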
XI. Terminal UX and Technical Stack
SWYX: There's a decompilation of Claude Code out there. It seems like you use Commander.js and React Ink. At some point you're not even building Claude Code; you're building a general-purpose CLI framework. Do you ever think about this level of configurability becoming more of a CLI framework, or some new form factor?
BORIS: It's definitely been fun to hack on a really awesome CLI, because there aren't that many of them. We're big fans of Ink, Vadim Demedes' library; we've actually used React Ink for a lot of our projects.
Ink is amazing. It's sort of hacky and janky in a lot of ways: you have React, and the renderer just translates the React code into ANSI escape codes. And there's all sorts of stuff that just doesn't work at all, because ANSI escape codes date back to the 1970s and there's no single great spec for them. Every terminal is a little different. So building this way feels a bit like building for the browser back in the day, when you had to think about Internet Explorer 6 versus Opera versus Firefox. You have to think about these cross-terminal differences a lot. But yeah, big fans of Ink, because it helps abstract over all that.
We also use Bun. We don't use it as the runtime yet; we use Bun to bundle the code together, and it makes writing and running our tests much faster.
XII. Code Review and Semantic Linting
SWYX: On the review side, the linter part you mentioned, I think maybe people skipped over it. Going from rule-based linting to semantic linting is, I think, great and super important. And a lot of companies are trying to figure out autonomous PR review, and I've yet to see one I actually use. I'm curious how you think about closing the loop or making that better, especially: what are you supposed to review? Because these PRs get pretty big when you vibe code.
BORIS: We have some experiments where Claude is doing code review internally. We're not super happy with the results yet. So it's not something that we want to open up quite yet. The way we're thinking about it is Claude Code is, like I said before, a primitive. So if you want to use it to build a code review tool, you can do this. If you want to build a security scanning or vulnerability scanning tool, you can do that. If you want to build a semantic linter, you can do that. And hopefully with Code it makes it so that if you want to do this, it's just a few lines of code — and you can just have Claude write that code also, because Claude is really great at writing GitHub actions.
SWYX: Sometimes you let the model decide, sometimes you're like, "This is a destructive action, always ask me." I'm curious if you have any internal heuristics around when to auto-accept and where all this is going.
BORIS: We're spending a lot of time building out the permission system. Robert on our team is leading this work. We think it's really important to give developers the control to say, "Hey, these are the allowed permissions." Generally, the model is always allowed to read files, to read anything. Then it's up to the user to approve actions like editing files or running tests; those are probably among the safest actions. And then there's a long list of other actions that users can allow-list or deny-list based on regex matches against the action.
SWYX: Can writing a file ever be unsafe if you have version control?
BORIS: I think there are a few different aspects of safety to think about. For file editing, it's actually less about safety, although there is still a safety risk. What might happen is, say, the model fetches a URL, there's a prompt injection attack in the fetched content, and then the model writes malicious code to disk without you realizing it. Although there is code review as a separate layer of protection.
But generally, for file writes, the biggest thing is the model might just do the wrong thing. What we find is that if the model is doing something wrong, it's better to identify that earlier and correct it earlier. If you wait for the model to just go down this totally wrong path and then correct it 10 minutes later, you're going to have a bad time. So it's better to identify failures early.
But at the same time, there's some cases where you just want to let the model go. For example, if Claude Code is writing tests for me, I'll just hit shift-tab, enter auto-accept mode, and just let it run the tests and iterate on the tests until they pass. Because I know that's a pretty safe thing to do. And then for some other tools like the bash tool, it's pretty different — because Claude could run rm -rf / and that would suck. So we definitely want people in the loop. The model is trained and aligned not to do that, but these are non-deterministic systems. You still want a human in the loop.
CAT: I think generally the way things are trending is toward needing human input less and less often in between steps.
XIII. Non-Interactive Mode and Automation
CAT: One thing to mention is we do have a non-interactive mode, which is how we automate Claude Code in these situations. A lot of the companies using Claude Code actually use this non-interactive mode. They'll say, for example, "Hey, I have hundreds of thousands of tests in my repo. Some of them are out of date, some of them are flaky," and they'll send Claude Code to look at each of these tests and decide: should this one be updated, should it be deprecated, how do we increase our code coverage? That's been a really cool way people are using Claude Code non-interactively.
SWYX: What are the best practices here? Because when it's non-interactive, it could run forever and you're not necessarily reviewing the output of everything.
BORIS: For folks that haven't used it — non-interactive mode is just claude -p and then you pass in the prompt in quotes. That's all it is — it's just the -p flag. Generally, it's best for tasks that are read-only. That's the place where it works really well and you don't have to think about permissions and running forever and things like that.
For example, a linter that runs and reports issues without fixing anything. Or, for example, we're working on a thing where we use Claude with -p to generate the changelog for Claude Code: for each PR, it just looks over the commit history and decides, "Okay, this makes it into the changelog, this doesn't."
For tasks where you want to write things, we usually recommend passing a very specific set of permissions on the command line. You can pass --allowed-tools to allow specific tools: not bash wholesale, but specific commands like git status or git diff, or the edit tool. Since there's no permission prompt in non-interactive mode, --allowed-tools is how you pre-accept tool uses.
We'd also definitely recommend that you start small — test it on one test, make sure that it has reasonable behavior, iterate on your prompt, then scale it up to 10, make sure that it succeeds, or if it fails, analyze what the patterns of failures are, and gradually scale up from there. So definitely don't kick off a run to fix 100,000 tests.
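Putting that advice together, a hedged sketch: the prompts and file names are made up, and the exact tool-name syntax passed to --allowed-tools is illustrative; only the -p and --allowed-tools flags themselves come from the discussion above.

```bash
# Read-only task: nothing to pre-approve.
claude -p "Look over the last 20 commits and draft changelog entries for user-facing changes."

# Write task: pre-accept a narrow set of tools instead of the interactive prompt.
claude -p "Fix the flaky date assertion in tests/session.test.ts" \
  --allowed-tools "Bash(git status)" "Bash(git diff)" "Edit"
```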
XIV. Engineering Productivity Metrics
BORIS: Claude Code also makes a lot of quality work a lot easier. For example, I have not manually written a unit test in months, and we have a lot of unit tests; Claude writes all of them. Before, I felt like a jerk if I commented on someone's PR, "Hey, can you write a test?" because they know they should probably write a test, and somewhere in their head they made the trade-off to ship faster. So you always felt like a jerk for asking. But now I always ask, because Claude can just write the test. There's no human work; you just ask Claude to do it. And with writing tests becoming easier and writing lint rules becoming easier, it's actually much easier to have high-quality code than it was before.
ALESSIO: What are the metrics that you believe in? A lot of people don't believe in 100% code coverage because sometimes that is optimizing for the wrong thing. What still makes sense?
BORIS: I think it's very dependent on the engineering team. I wish there were a one-size-fits-all answer. For some teams, test coverage is extremely important. For other teams, type coverage is very important, especially if you're working in a strictly typed language and avoiding nulls in JavaScript and Python. Complexity gets a lot of flak, but it's honestly still a pretty good metric, just because there isn't anything better for measuring code quality.
ALESSIO: And productivity — obviously not lines of code, but do you care about measuring productivity?
BORIS: Lines of code honestly isn't terrible. It has downsides, but it's really hard to find anything better; it's the least terrible. The two that we're really trying to nail down are: one, decrease in cycle time, meaning how much faster your features ship because you're using these tools. That might be something like the time between the first commit and when the PR is merged. It's very tricky to get right, but it's one of the ones we're targeting.
The other one that we want to measure more rigorously is the number of features that you wouldn't have otherwise built. We have a lot of channels where we get customer feedback, and one of the patterns that we've seen with Claude Code is that sometimes customer support or customer success will post, "Hey, this app has this bug," and then sometimes 10 minutes later one of the engineers on that team will be like, "Claude Code made a fix for it." And a lot of those situations — when you ping them and you're like, "Hey, that was really cool" — they were like, "Yeah, without Claude Code I probably wouldn't have done that because it would have been too much of a divergence from what I was otherwise going to do. It would have just ended up in this long backlog."
That was the other AGI-pilled moment for me. There was a really early version of Claude Code many, many months ago, and one engineer at Anthropic, Jeremy, built a bot that watched a particular feedback channel on Slack and hooked it up to Code to automatically put up PRs with fixes. It couldn't fix every issue, but it fixed a lot of them. This was early on, so I don't remember the exact number, but it was surprisingly high, to the point where I became a believer in this kind of workflow. And I wasn't before.
XV. Balancing Feature Creation and Maintenance
SWYX: So isn't PM-ing this a little scary, in that you can build too many things? It's almost like maybe you shouldn't build that many things. I think that's what I'm struggling with the most: it gives you the ability to create, create, create, but at some point you've got to support it all. This is the Jurassic Park line: "Your scientists were so preoccupied with whether they could..." But how do you make decisions now that the cost of actually implementing the thing is going down? As a PM, how do you decide what is actually worth doing?
CAT: We definitely still hold a very high bar for net new features. Most of the fixes were like, "Hey, this functionality is broken," or "There's a weird edge case that we hadn't addressed yet." So it was very much smoothing out the rough edges as opposed to building something completely net new. For net new features, we hold a pretty high bar that it's very intuitive to use. The new user experience is minimal. It's just obvious that it works.
We sometimes actually use Claude Code to prototype instead of writing docs. You end up with prototypes you can play around with, which often gives us a faster feel for, "Hey, is this feature ready yet? Is this the right abstraction? Is this the right interaction pattern?" It gets us to confidence about a feature faster, but it doesn't circumvent the process of making sure the feature fits the product vision.
BORIS: It's interesting how as it gets easier to build stuff, it changes the way that I write software. Before I would write a big design doc and think about a problem for a long time before building it. Now I'll just ask Claude Code to prototype three versions of it and I'll try the feature and see which one I like better. And that informs me much better and much faster than a doc would have.
XVI. Memory and the Future of Context
SWYX: I'm interested in memory. We talked about auto-compact and memory using hashtags and stuff. My impression is you like to say the simplest approach works, but I'm curious if you've seen any other requests that are interesting or internal hacks of memory that people have explored.
BORIS: There's a bunch of different approaches to memory. Most of them use external stores: there's Chroma, and there are key-value stores and graph-shaped stores.
SWYX: Are you a believer in knowledge graphs for this stuff?
BORIS: If you talked to me before I joined Anthropic and this team, I would have said yeah, definitely. But now actually I feel everything is the model — that's the thing that wins in the end. As the model gets better, it subsumes everything else. At some point the model will encode its own knowledge graph, its own KV store, if you just give it the right tools.
SWYX: In some ways, are we just coping for lack of context length? Are we doing things for memory now that if we had a 100-million-token context window we wouldn't care about?
BORIS: I would love to have 100-million-token context for sure. But here's a question — if you took all the world's knowledge and put it in your brain, is that something that you would want to do, or would you still want to record knowledge externally?
CAT: We've been seeing people play around with memory in quite interesting ways — like having Claude write a logbook of all the actions that it's done so that over time Claude develops this understanding of what your team does, what you do within your team, what your goals are, how you like to approach work. We would love to figure out what the most generalized version of this is so that we can share broadly. I think with things like Claude Code, it's actually less work to implement the feature and a lot of work to tune these features to make sure that they work well for general audiences across a broad range of use cases.
BORIS: A problem related to memory is how you get stuff into context. Very early versions of Claude Code used RAG; we were using Voyage, just off-the-shelf RAG, and that worked pretty well. We tried a few different versions of it. There was RAG, then a few different kinds of search tools, and eventually we landed on agentic search as the way to do it.
There were two big reasons, maybe three. One is that it outperformed everything by a lot, which was surprising, using just regular code search: glob, grep. The second is that RAG requires a whole indexing step, and a lot of complexity comes with that: the index drifts out of sync with the code, and there are security issues because the index has to live somewhere. It's a lot of liability. Agentic search sidesteps all of that. Essentially, at the cost of latency and tokens, you get really awesome search without the security downsides.
XVII. Sandboxing, Branching, and Agent Planning
SWYX: Your takes on sandboxing environments, branching, rewindability?
BORIS: I could talk for hours about this. Starting with sandboxing: ideally, the thing that we want is to always run code in a Docker container and then it has freedom. You can snapshot, rewind, do all this stuff. Unfortunately, working with a Docker container for everything is just a lot of work and most people aren't going to do it. And so we want some way to simulate some of these things without having to go full container.
There's some stuff you can do today. For example, something I'll do sometimes is if I have a planning question or a research type question, I'll ask Claude to investigate a few paths in parallel. You can do this today if you just ask it. Say, "I want to refactor X to do Y. Can you research three separate ideas for how to do it? Do it in parallel. Use three agents to do it." In the UI, when you see a "task" — that's actually a sub-Claude, a sub-agent that does this. Usually when I do something hairy, I'll ask it to investigate three times or five times or however many times in parallel, and then Claude will pick the best option and summarize that for you.
SWYX: But how does Claude pick the best option? Don't you want to choose?
BORIS: I think it depends on the problem. You can also ask Claude to present the options to you.
SWYX: How do you observe Claude Code failing?
CAT: There's definitely a lot of room for improvement in the models, which I think is very exciting. Most of our research team actually uses Claude Code day-to-day, and so it's been a great way for them to be very hands-on and experience the model failures, which makes it a lot easier for us to target these in model training and actually provide better models not just for Claude Code but for all of our coding customers.
One of the things about the latest model, Claude 3.7 Sonnet, is that it's very persistent. It's very motivated to accomplish the user's goal, but it sometimes takes the goal very literally and doesn't always fulfill the implied parts of the request, because it's so narrowed in on "I must get X done." So we're trying to figure out how to give it a bit more common sense, so it knows the line between trying very hard and "no, the user definitely doesn't want that."
BORIS: The classic example is like, "Hey, get this test to pass," and then five minutes later it's like, "All right, well, I hardcoded everything. The test passes." And I'm like, "No, that's not what I wanted." But that's the thing — it only gets better from here. These use cases work sometimes today, not every time. The model sometimes tries too hard, but it only gets better.
Context is a big one where if you have a very long conversation and you compact a few times, maybe some of your original intent isn't as strongly present as it was when you first started. So maybe the model forgets some of what you originally told it to do. We're really excited about things like larger effective context windows so that you can have these gnarly, really long, hundreds-of-thousands-of-tokens-long tasks and make sure that Claude Code is on track the whole way through. That would be a huge lift not just for Claude Code but for every coding company.
BORIS: What we find is that Claude Code deliberately doesn't keep much between-session memory or caching. It rebuilds the whole state from scratch every single time, so as to make minimal assumptions about what changed in between. Our best advice right now for people who want to resume across sessions is to tell Claude, "Hey, write down the state of this session into a text doc" (probably not the CLAUDE.md, but a different doc), and in your new session, tell Claude to read from that doc. But we plan to build more native ways to handle this specific workflow.
There's a lot of different cases. Sometimes you don't want Claude to have the context — it's sort of like Git. Sometimes I just want a fresh branch that doesn't have any history. But sometimes I've been working on a PR for a while and I need all that historical context. So we kind of want to support all these cases.
One thing other people have done is ask Claude to commit after every change. You can just put that in the CLAUDE.md. Some people are asking Claude to create a worktree every time so that they could have a few Claudes running in parallel in the same repo. From our point of view, we want to support all of this. Claude Code is a primitive and it doesn't matter what your workflow is — it should just fit in.
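Both of those patterns are a couple of lines each. A sketch with hypothetical names (NOTES.md, feature-a); for the write step you may need to pre-accept the edit tool as discussed earlier:

```bash
# Resume across sessions: persist state to a scratch doc, then reload it later.
claude -p "Write the state of this session (goal, progress, next steps) to NOTES.md."
claude -p "Read NOTES.md and pick up where the previous session left off."

# Parallel Claudes in one repo: one git worktree per instance.
git worktree add ../myrepo-feature-a -b feature-a
(cd ../myrepo-feature-a && claude)
```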
XVIII. Future Roadmap
SWYX: You obviously do not have a separate Claude Code subscription. What's the road map? Is this just going to be a research preview for much longer? Are you going to turn it into an actual product? Is there going to be Claude Code Enterprise?
CAT: We have a permanent team on Claude Code. We're growing the team. We're really excited to support Claude Code in the long run. In terms of subscription, it's something that we've talked about. It depends a lot on whether or not most users would prefer that over pay-as-you-go. So far, pay-as-you-go has made it really easy for people to start experiencing the product because there's no upfront commitment. And it also makes a lot more sense with a more autonomous world in which people are scripting Claude Code a lot more. But we also hear the concern around, "Hey, I want more price predictability if this is going to be my go-to tool." So we're very much still figuring that out.
For enterprises, given that Claude Code is very much a productivity multiplier for ICs and most ICs can adopt it directly, we've been supporting enterprises as they have questions around security and productivity monitoring.
ALESSIO: Do you have a credible number for the productivity improvement? Like for people not at Anthropic — are we talking 30%? Some number would help justify things.
BORIS: We're working on getting this. Anecdotally, for me it's probably 2x my productivity. I'm an engineer that codes all day every day. For me it's probably 2x. I think there's some engineers at Anthropic where it's probably 10x their productivity. And then there's some people that haven't really figured out how to use it yet and they just use it to generate commit messages or something — that's maybe 10%. So I think there's probably a big range and we need to study more.
CAT: For reference, sometimes we're in meetings together and sales or compliance or someone is like, "Hey, we really need X feature." And then Boris will ask a few questions to understand the specs, and then 10 minutes later he's like, "All right, well, it's built. I'm going to merge it later. Anything else?" So it definitely feels far different than any other PM role I've had.
BORIS: Megan, the designer on our team, is not a coder, but she's putting up pull requests. She uses Code to do it. She designs the UI, and she's landing PRs to our console product. It's not even just building on Claude Code; it's building across our product suite in our monorepo. Similarly, our data scientist uses Claude Code to write BigQuery queries. And a finance person came up to me the other day and said, "Hey, I've been using Claude Code." And I was like, "What? How did you even get it installed? You know how to use Git?" And they said, "Yeah, I figured it out."
They take their data, put it in a CSV, and then they cat the CSV, pipe it into Code, and then ask it questions about the CSV. They've been using it for that.
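That workflow is literally one line; the filename and question here are invented:

```bash
cat q3_vendor_spend.csv | claude -p "Which vendor grew fastest month over month?"
```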
SWYX: I know that there's a broad interest in people forking or customizing Claude Code. We have to ask — why is it not open source?
BORIS: We're investigating it, so not yet. There are a lot of trade-offs. On one hand, our team is really small, and we'd be really excited about open source contributions if it were open source. But it's a lot of work to maintain everything. I maintain a lot of open source projects, and so do a lot of others on the team; managing contributions is a full-time job in itself.
Generally our approach is that all the secret sauce is in the model, and this is the thinnest possible wrapper over the model. We literally could not build anything more minimal. So there's just not that much in it.
XIX. Why Anthropic Excels at Developer Tools
SWYX: Why is Anthropic doing so well with developers? It seems like there's no centralized strategy — every time I talk to Anthropic people they're like, "Oh yeah, we just had this idea and we pushed it and it did well."
CAT: Everyone just wants to build awesome stuff. I think a lot of this trickles down from the model itself being very good at code generation. We're very much building off the backs of an incredible model — that's the only reason why Claude Code is possible. So much of the world is run via software and there's immense demand for great software engineers. And it's also something that you can do almost entirely with just a laptop. So it's an environment that's very suitable for LLMs. It's an area where we feel like you can unlock a lot of economic value by being very good at it.
BORIS: One anecdote that might be interesting: the night before the Code launch, we were burning down the last few issues, and the team was up pretty late. One thing that had been bugging me for a while was the markdown rendering we were using. The markdown rendering in Claude Code today is beautiful: really nice rendering in the terminal, with bold, headings, and spacing. But back then we had tried a bunch of off-the-shelf libraries, two or three or four different ones, and nothing was quite perfect. Sometimes the spacing was off between a paragraph and a list, or the text wrapping wasn't quite correct, or the colors weren't perfect.
So the night before the release, at like 10 p.m., I'm like, "All right, I'm going to do this." I just asked Claude to write a markdown parser for me and it wrote it. It wasn't quite zero-shot, but after maybe one or two prompts, it got it. And that's the markdown parser that's in Code today. And that's the reason that markdown looks so beautiful.
CAT: It's interesting what the new bar is for implementing features. In exactly this kind of example, there are libraries out there that you'd normally reach for, you find some dissatisfaction with them, for whatever reason, and you can just spin up an alternative and go with that.
BORIS: AI has changed so much in the last year. A feature you might not have built before, or you might have used a library — now you can just do it yourself. The cost of writing code is going down and productivity is going up, and we just have not internalized what that really means yet. But I expect that a lot more people are going to start doing things like this — writing your own libraries or just shipping every feature.
SWYX: Has Claude Code been rewritten many times?
BORIS: We've rewritten it like five times, probably every 3 or 4 weeks. It's a ship of Theseus: every piece keeps getting swapped out, because Claude is so good at writing its own code. Most of the changes are to make things simpler, to share interfaces across different components. We just want to make sure the context given to the model is in the purest form and that the harness doesn't interfere with the user's intent. A lot of that is just removing things that could get in the way or confuse the model.
On the UX side, something that's been pretty tricky, and the reason we have a designer working on a terminal app, is that it's actually really hard to design for a terminal. There's not a lot of literature on it; the terminal as a modern design surface is sort of new territory. There are a lot of really old terminal UIs that use curses and very sophisticated UI systems, but they all feel antiquated by today's standards. So it's taken a lot of work to figure out how to make an app feel fresh, modern, and intuitive in a terminal, and we've had to come up with a lot of that design language ourselves.
ALESSIO: Who do you want to hire?
BORIS: We don't have a particular profile. If you feel really passionate about coding and about the space, if you're interested in learning how models work and how terminals work and how all these technologies are involved — hit us up. Always happy to chat.
SWYX: Awesome. Well, thank you for coming on. This was fun.
BORIS: Thank you. Thanks for having us. This was fun.
Transcript source: "Latent Space" podcast — Claude Code: Anthropic's CLI Agent. Formatted for readability.