Table of Contents
- Introduction
- I. Feeling Behind as a Coder
- II. Software 3.0 Explained
- III. Agents as the Installer
- IV. MenuGen vs. Raw Prompts
- V. What's Obvious by 2026
- VI. Verifiability and Jagged Skills
- VII. Founder Advice and Automation
- VIII. From Vibe Coding to Agentic Engineering
- IX. Agents Everywhere and Learning
Introduction
STEPHANIE ZHAN: We are so excited for our very first special guest. He has helped build modern A.I., then explain modern A.I., and then occasionally rename modern A.I. He actually helped co-found OpenAI, right inside of this office. He was the one who got Autopilot working at Tesla back in the day, and he has a rare gift of making the most complex technical shifts feel both accessible and inevitable.
You all know him for having coined the term "vibe coding" last year, but just in the last few months he said something even more startling: that he's never felt more behind as a programmer. That's where we're starting today. Thank you, Andrej, for joining us.
I. Feeling Behind as a Coder
ANDREJ KARPATHY: Yeah. Hello. Excited to be here and to kick us off.
STEPHANIE: Okay. So just a couple of months ago, you said that you've never felt more behind as a programmer. That's startling to hear from you of all people. Can you help us unpack that? Was that feeling exhilarating or unsettling?
ANDREJ: Yeah, a mixture of both for sure. Well, first of all, I guess, like many of you, I've been using agentic tools — things like Claude Code and adjacent tools — for a while, maybe over the last year as they came out. And they were very good at, you know, producing chunks of code. Sometimes they would mess up and you'd have to edit them, and it was kind of helpful. Then I would say December was this clear inflection point. I was on a break, so I had a bit more time. I think many other people experienced something similar. I started to notice that with the latest models, the chunks just came out fine. And then I kept asking for more and it just came out fine. And then I couldn't remember the last time I'd corrected it. And then I just trusted the system more and more, and then I was vibe coding.
[laughter]
ANDREJ: And so it was kind of a — I do think it was a very stark transition. I think a lot of people... I tried to stress this on Twitter, or X, because I think a lot of people experienced A.I. last year as a ChatGPT-adjacent thing. But you really had to look again, as of December, because things changed fundamentally — especially on this agentic, coherent workflow, which really started to actually work. And so yeah, it was just that realization that sent me down this whole rabbit hole of, you know, an infinity of side projects. My side projects folder is extremely full of lots of random things, and I'm just vibe coding all the time. So yeah, that kind of happened in December, I would say, and I've been looking at the repercussions of that ever since.
II. Software 3.0 Explained
STEPHANIE: You've talked a lot about this idea of LLMs as a new computer — that it isn't just better software, it's a whole new computing paradigm. Software 1.0 was explicit rules, software 2.0 was learned weights, software 3.0 is this. If that's actually true, what does a team build differently the day they actually believe this?
ANDREJ: Right. So, yeah, exactly. Software 1.0, I'm writing code. Software 2.0, I'm actually programming by creating data sets and training neural networks — so the programming is kind of like arranging data sets and maybe some objectives and neural network architectures. And then what happened is that basically, if you train one of these GPT models or LLMs on a sufficiently large set of tasks — implicitly, because by training on the internet you have to multitask all the things that are in the data set — these actually become kind of like a programmable computer in a certain sense. So software 3.0 is kind of about, you know, your programming now turns to prompting, and what's in the context window is your lever over the interpreter that is the LLM, which is kind of interpreting your context and performing computation in the digital information space. So I guess, yeah, that's kind of the transition. And I think there are a few examples that really drove it home for me.
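As a rough illustration of the contrast (not from the talk itself): in the sketch below, the `llm` callable is a stand-in for any chat-completion API, and the task and prompts are invented for illustration.

```python
# Software 1.0: the programmer writes the rules explicitly.
def sentiment_v1(text: str) -> str:
    positive = {"great", "love", "excellent"}
    negative = {"awful", "hate", "terrible"}
    words = set(text.lower().split())
    score = len(words & positive) - len(words & negative)
    return "positive" if score >= 0 else "negative"

# Software 2.0: the "program" is a dataset plus learned weights;
# the logic lives in the weights, not in hand-written code:
#   model = train(architecture, dataset, objective)
#   label = model(text)

# Software 3.0: the program is the prompt in the context window,
# and the LLM is the interpreter that executes it.
def sentiment_v3(text: str, llm) -> str:
    prompt = ("Classify the sentiment of this review with exactly one "
              f"word, positive or negative:\n\n{text}")
    return llm(prompt).strip().lower()
```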
III. Agents as the Installer
ANDREJ: For example, when Claude Code came out — when you want to install it, you would expect that normally this would be a bash script, a shell script. So you'd run the shell script to install Claude Code. But the thing is, in order to target lots of different platforms and lots of different types of computers, these shell scripts usually balloon up and become extremely complex. And you're still stuck in a software 1.0 universe of wanting to write the code. Whereas the Claude Code installation is actually a copy-paste of a bunch of text that you're supposed to give to your agent. So basically it's a little skill — you copy-paste this and give it to your agent, and it will install Claude Code. And the reason this is a lot more powerful is that you're now working in the software 3.0 paradigm, where you don't have to precisely spell out all the individual details of that setup. The agent brings its own intelligence: it follows the instructions, looks at your environment and your computer, performs intelligent actions to make things work, and debugs things in the loop — and it's just so much more powerful. So I think that's a very different way of thinking about it: what is the piece of text to copy-paste to your agent? That's the programming paradigm now.
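Mechanically, "give this text to your agent" can be as simple as a tool-use loop. The sketch below uses the Anthropic Messages API as one concrete option; the skill text, the `run_shell` tool, and the model name are illustrative assumptions, and letting an agent run arbitrary shell commands should of course be sandboxed.

```python
import subprocess
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set

# The "installer" is no longer a shell script -- it's instructions for
# an agent. This skill text is made up for illustration.
SKILL = """Install the `sometool` CLI on this machine. Detect the OS and
package manager, install any missing dependencies, verify the binary
runs, and report what you did."""

tools = [{
    "name": "run_shell",
    "description": "Run a shell command and return its combined output.",
    "input_schema": {
        "type": "object",
        "properties": {"command": {"type": "string"}},
        "required": ["command"],
    },
}]

messages = [{"role": "user", "content": SKILL}]
while True:
    response = client.messages.create(
        model="claude-sonnet-4-5",  # model name: an assumption
        max_tokens=2048, tools=tools, messages=messages,
    )
    messages.append({"role": "assistant", "content": response.content})
    if response.stop_reason != "tool_use":
        break  # the agent has finished and reported back in plain text
    results = []
    for block in response.content:
        if block.type == "tool_use":  # the agent chose a command to run
            proc = subprocess.run(block.input["command"], shell=True,
                                  capture_output=True, text=True)
            results.append({"type": "tool_result", "tool_use_id": block.id,
                            "content": proc.stdout + proc.stderr})
    messages.append({"role": "user", "content": results})
```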
IV. MenuGen vs. Raw Prompts
ANDREJ: One more example that comes to mind, that is even more extreme, is when I was building MenuGen. So MenuGen is this idea where you come to a restaurant, they give you a menu, and there are usually no pictures. So I don't know what a lot of these things are — usually around thirty percent of the items I have no idea what they are, sometimes fifty percent. So I wanted to take a photo of the restaurant menu and get pictures of what those dishes might look like in a generic sense. And so I built — I vibe-coded — this app that basically lets you upload a photo and does all this stuff. It runs on Vercel, re-renders the menu, gives you all the items, basically OCRs all the different dish titles, and uses an image generator to get pictures of them and show them to you. And then I saw the software 3.0 version of this, which blew my mind. It's literally just: take your photo, give it to Gemini, and say, "Use Nano Banana to overlay the dishes onto the menu." And Nano Banana returned an image that is exactly the picture of the menu that I took, but it actually put into the pixels rendered images of the different dishes on the menu. And this blew my mind, because all of my MenuGen is spurious. It's working in the old paradigm — that app shouldn't exist. The software 3.0 paradigm is a lot more raw: your neural network is doing more and more of the work, your prompt or context is just the image, the output is an image, and there's no need to have any of the app in between.
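The "raw" version really is close to a single call. A sketch with the google-genai SDK — the model name and prompt are assumptions, not the exact incantation from the talk:

```python
# One multimodal call replaces the whole MenuGen pipeline (sketch only).
from google import genai
from PIL import Image

client = genai.Client()  # assumes GEMINI_API_KEY is set

response = client.models.generate_content(
    model="gemini-2.5-flash-image",  # "Nano Banana"
    contents=[
        "Re-render this exact menu photo, overlaying a small generated "
        "picture of each dish next to its name.",
        Image.open("menu.jpg"),
    ],
)

# The model returns the edited menu as an inline image part.
for part in response.candidates[0].content.parts:
    if part.inline_data is not None:
        with open("menu_with_pictures.png", "wb") as f:
            f.write(part.inline_data.data)
```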
ANDREJ: So I think people have to kind of reframe things — not just work within the paradigm of what already existed and think of this as a speed-up of what already exists. It's actually that new things are available now. And going back to your programming question — that's also an example of the old mindset, because it's not just about programming becoming faster. This is more general information processing that is now automatable. So it's not even just about code. Previous code worked over structured data, right? But for example, with my LLM knowledge bases project — basically, you get LLMs to create wikis for your organization or for yourself personally — this is not even a program. This is not something that could have existed before, because there was no code that would create a knowledge base from a bunch of facts. But now you can just take these documents and basically recompile them in a different way, reorder them, and create something that is new and interesting as a reframing of the data. And so these are new things that weren't possible. I keep trying to come back to that question: not just what we can do faster that already existed, but what the new opportunities are — things that weren't possible before. And I almost think that's even more exciting.
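A minimal sketch of the knowledge-base idea — recompiling a fixed set of documents into a differently organized projection. The `llm` callable, prompts, and file layout are all invented for illustration; this is not Karpathy's actual project.

```python
from pathlib import Path

def build_wiki(docs_dir: str, wiki_dir: str, llm) -> None:
    """Recompile a folder of documents into topic-organized wiki pages.
    `llm` is a stand-in callable: prompt string in, completion out."""
    corpus = "\n\n---\n\n".join(
        p.read_text() for p in sorted(Path(docs_dir).glob("*.md")))

    # First pass: ask the model to propose an organization of the material.
    topics = llm("List the main topics covered in these documents, "
                 "one per line:\n\n" + corpus).splitlines()

    # Second pass: re-project the same fixed data once per topic.
    out = Path(wiki_dir)
    out.mkdir(exist_ok=True)
    for topic in (t.strip() for t in topics):
        if not topic:
            continue
        page = llm(f"Using only the documents below, write a wiki page "
                   f"about '{topic}', cross-referencing related topics."
                   f"\n\n{corpus}")
        (out / (topic.replace(" ", "_") + ".md")).write_text(page)
```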
V. What's Obvious by 2026
STEPHANIE: I love the MenuGen progression and dichotomy that you laid out. I'm sure many folks here followed your own progression in programming from last October to early January and February of this year. If you extrapolate further — what is the 2026 equivalent of building websites in the '90s, building mobile apps in the 2010s, building SaaS in the last cloud era? What will look completely obvious in hindsight that is still mostly unbuilt today?
ANDREJ: Well, going with the example of MenuGen, a lot of this code shouldn't exist — it's just a neural network doing most of the work. I do think the extrapolation looks very weird because you could basically imagine... you could imagine completely neural computers in a certain sense. You feed raw video or audio into basically what's a neural net, and it uses diffusion to render a UI that is kind of unique for that moment. And I kind of feel like in the early days of computing, people were a little bit confused as to whether computers would look like calculators or whether computers would look like neural nets — in the '50s and '60s it was not really obvious which way it would go. And of course we went down the calculator path and ended up building classical computing. And then neural nets are currently running virtualized on existing computers. But you could imagine — I think — that a lot of this will flip, and that the neural net becomes kind of the host process and the CPUs become kind of the co-processor. So we saw the diagram of how intelligence compute for neural networks is going to take over and become the dominant spend of flops. You could imagine something really weird and foreign, where neural nets are doing most of the heavy lifting, using tool use as a kind of historical appendage for some deterministic tasks, but what's really running the show is these neural nets. You can imagine something extremely foreign as the extrapolation, but I think we're going to get there sort of piece by piece. And that progression is TBD, I would say.
[laughter]
VI. Verifiability and Jagged Skills
STEPHANIE: I'd like to talk a little bit about this concept of verifiability — the fact that A.I. will automate faster and more easily in domains where the output can be verified. If that framework is right, what work is about to move much faster than people realize? And what professions do people think are safe but are actually highly verifiable?
ANDREJ: Yes. So I spent some time writing about verifiability. Basically, traditional computers can easily automate what you can specify in code. And this latest round of LLMs can easily automate what you can verify — in a certain sense — because the way this works is that when the frontier labs train these LLMs, they train them in giant reinforcement learning environments. The models are given verification rewards, and because of the way they are trained, they end up progressing into these jagged entities that really peak in capability in verifiable domains like math and code and adjacent areas — and kind of stagnate, and are a little rough around the edges, when things are not in that space.
So I think the reason I wrote about verifiability is that I'm trying to understand why these things are so jagged. Some of it has to do with how the labs train the models, but I think some of it also has to do with the focus of the labs and what they happen to put into the data distribution. Some things are significantly more valuable in the economy, so more environments get created for them, because the labs want to work in those settings. Code is a good example of that. There are probably lots of verifiable environments they could think of that didn't make it into the mix, because they're just not that useful to have capability around.
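Concretely, a "verification reward" can be as blunt as: run the model's output against tests, pay 1 if they pass. A minimal sketch with invented names — not any lab's actual training code:

```python
import subprocess
import tempfile

def verification_reward(candidate_code: str, test_code: str) -> float:
    """Return 1.0 if the model-written code passes the tests, else 0.0.
    No human grader in the loop -- that's what makes the domain
    'verifiable' and therefore cheap to scale with RL."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run(["python", path],
                                capture_output=True, timeout=10)
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0

# Hypothetical use inside a training loop:
#   for prompt, tests in environments:
#       code = policy.sample(prompt)
#       update(policy, prompt, code, verification_reward(code, tests))
```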
But I think the big mystery for me is — the favorite example for a while was: how many R's are in "strawberry"? And the models would famously get this wrong. That's an example of jaggedness. The labs have patched that by now, I think. But the new one is: I want to go to a car wash to wash my car and it's 50 meters away — should I drive or should I walk? And state-of-the-art models today will tell you to walk because it's so close. How is it possible that state-of-the-art Opus 4.7 will simultaneously refactor a 100,000-line codebase or find zero-day vulnerabilities, and yet tell me to walk to the car wash? This is insane.
[laughter]
And to whatever extent these models remain jagged, it's an indication that, number one, maybe something is slightly off, or number two, you need to actually be in the loop a little bit — you need to treat them as tools and stay in touch with what they're doing. So all of my writing about verifiability, long story short, is just trying to understand why these things are jagged. Is there any pattern to it? And I think it's some kind of combination of "verifiable" plus "labs care."
The Chess Data Anecdote
Maybe one more anecdote that is instructive: from GPT-3.5 to GPT-4, people noticed that chess improved a lot. A lot of people thought, "Oh well, it's just a progression of capabilities." But actually, I think — this is public information, I saw it on the internet — a huge amount of chess data made it into the pre-training set. And just because it's in the data distribution, the model improved a lot more than it would by default. So someone at OpenAI decided to add this data, and now you have a capability that just peaked a lot more. And so that's why I'm stressing this dimension of it: we are slightly at the mercy of whatever the labs are doing, whatever they happen to put into the mix. And you have to explore this thing they give you that has no manual. It works in certain settings, but maybe not in others. And you have to kind of explore it a little bit. If you're in the circuits that were part of the reinforcement learning, you fly. If you're in the circuits that are out of the data distribution, you're going to struggle, and you have to figure out which circuits you're in for your application. And if you're not in those circuits, then you really have to look at fine-tuning and doing some of your own work, because it's not necessarily going to come out of the LLM out of the box.
VII. Founder Advice and Automation
STEPHANIE: I'd love to come back to the concept of jagged intelligence in a little bit. If you are a founder today thinking about building a company — you're trying to solve a problem that you think is tractable, something in a domain that is verifiable — but you look around and think, "Oh my gosh, the labs have really started getting to escape velocity in the ones that seem most obvious: math, coding, and others." What would your advice be to the founders in the audience?
ANDREJ: So I think maybe that comes back to the previous question. Verifiability makes something tractable in the current paradigm because you can throw a huge amount of reinforcement learning at it. So maybe one way to see it is that this remains true even if the labs are not focusing on it directly. If you are in a verifiable setting where you could create these RL environments or examples, that actually sets you up to potentially do your own fine-tuning, and you might benefit from that. But that is fundamentally technology that just works — you can pull a lever if you have a huge amount of diverse data sets and RL environments. You can use your favorite fine-tuning framework and pull the lever and get something that actually works pretty well. So I don't know exactly what the examples of this might be. But I do think there are some very valuable reinforcement learning environments that people could think of that are not part of the mix. Yeah, I don't want to give away the answer, but there is one domain that I think is very — oh, okay. Sorry, I don't mean to vague-post on stage, but there are some examples of this.
STEPHANIE: On the flip side, what do you think still feels automatable only from a distance?
ANDREJ: I do think that ultimately almost everything can be made verifiable to some extent — some things easier than others. Because even for things like writing, you can imagine having a council of LLM judges and probably get something reasonable out of that kind of approach. So it's more about what's easy or hard. I do think that ultimately...
Everything.
[laughter]
Everything is automatable.
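A sketch of the "council of LLM judges" idea mentioned above, for a domain like writing — the rubric, scoring scale, and mean-of-judges aggregation are all assumptions about how one might set this up:

```python
def council_score(draft: str, judges) -> float:
    """Approximate verification for writing: average the scores of
    several independent LLM judges. `judges` is a list of stand-in
    callables (prompt in, completion out)."""
    rubric = ("Score the following draft from 1 to 10 for clarity, "
              "accuracy, and style. Reply with only the number.\n\n")
    scores = []
    for judge in judges:
        reply = judge(rubric + draft)
        try:
            scores.append(float(reply.strip()))
        except ValueError:
            pass  # skip judges that fail to follow the output format
    # The mean over judges is the (noisy) verification signal.
    return sum(scores) / len(scores) if scores else 0.0
```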
VIII. From Vibe Coding to Agentic Engineering
STEPHANIE: Amazing. Okay. So last year you coined the term "vibe coding," and today we're in a world that feels a little bit more serious — more agentic engineering. What do you think is the difference between the two, and what would you actually call what we're in today?
ANDREJ: Yeah. So I would say vibe coding is about raising the floor for everyone in terms of what they can do in software. The floor rises, everyone can vibe-code anything — and that's amazing, incredible. But then I would say agentic engineering is about preserving the quality bar of what existed before in professional software. You're not allowed to introduce vulnerabilities due to vibe coding. You're still responsible for your software, just as before — but can you go faster? And the spoiler is: you can. But how do you do that properly? So to me, agentic engineering — I call it that because I do think it's kind of an engineering discipline. You have these agents, which are these spiky entities. They're a bit fallible, a little stochastic, but they are extremely powerful. The question is: how do you coordinate them to go faster without sacrificing your quality bar? And doing that well and correctly is the realm of agentic engineering. So I kind of see them as different — one is about raising the floor, and the other is about raising the ceiling. And what I'm seeing is that there is a very high ceiling on agentic engineering capability. People used to talk about the 10x engineer. I think this is magnified a lot more — 10x understates the speed-up you can gain. It does seem to me like people who are very good at this peak at a lot more than 10x, from my perspective right now.
What A.I.-Native Coding Actually Looks Like
STEPHANIE: I really like that framing. One thing — when Sam Altman came to AI Ascent last year, one memorable thing he said was that people of different generations use ChatGPT differently. So if you're in your 30s, you use it as a Google search replacement. But if you're in your teens, ChatGPT is your gateway to the internet. What is the parallel here in coding today? If we were to watch two people code using Claude Code or Codex — one you'd consider mediocre at it, and one you would consider fully A.I.-native — how would you describe the difference?
ANDREJ: [clears throat] I mean, I think it's just trying to get the most out of the tools that are available, utilizing all of their features, investing in your own setup. Just like previously, engineers were used to basically getting the most out of the tools they use — whether it's Vim or VS Code, or now it's Claude Code or Codex or so on. So just investing in your setup and utilizing a lot of the tools that are available to you. And I think it kind of looks like that.
I do think a related thought is that a lot of people are maybe hiring for this right now, because they want to hire strong agentic engineers. What I'm seeing is that most people have still not refactored their hiring process for agentic engineering capability. Like, if you're giving out puzzles to solve, that's still the old paradigm. I would say that hiring for this has to look like: give someone a really big project and watch them implement it. Let's say: write a Twitter clone — for agents — and make it really good, make it really secure. Then have some agents simulate activity on this Twitter clone. And then I'm going to use 10 Codex agents to try to break the website that you deployed — and they should not be able to break it. Maybe it looks like that, right? Watching people in that setting, building bigger projects, utilizing the tooling — that's maybe what I would look at, for the most part.
The Intern Problem — Taste, Judgment, and Oversight
STEPHANIE: And as agents do more, what human skill do you think becomes more valuable, not less?
ANDREJ: So, yeah, it's a good question. Right now, the answer is that the agents are kind of like these intern entities. It's remarkable — you basically still have to be in charge of the aesthetics, the judgment, the taste, and a little bit of oversight. Maybe one of my favorite examples of the weirdness of agents is with MenuGen: you sign up with a Google account, but you purchase credits using Stripe. Both of them have email addresses. And when you purchased credits, my agent assigned them to the Google account by matching the email address from Stripe — there wasn't a persistent user ID that it was matching on; it was matching up email addresses. But you could use a different email address for your Stripe and your Google accounts, and then it simply would not associate the funds. And so this is the kind of mistake agents still make — like, why would you use email addresses to cross-correlate the funds? They can be arbitrary. You can use different emails. This is such a weird thing to do.
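The bug in miniature (all names hypothetical): keying the credit grant on an email address instead of a persistent user ID.

```python
# What the agent did (names hypothetical): match on email addresses,
# which are arbitrary and need not agree between services.
def grant_credits_buggy(users, stripe_event):
    for user in users:
        if user["google_email"] == stripe_event["stripe_email"]:
            user["credits"] += stripe_event["credits"]
    # If the Stripe email differs from the Google email, the paid
    # credits are silently never associated with anyone.

# What the spec should have demanded: one persistent user ID, attached
# to the Stripe checkout session when it is created.
def grant_credits(users_by_id, stripe_event):
    users_by_id[stripe_event["user_id"]]["credits"] += stripe_event["credits"]
```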
So I think people have to be in charge of the spec, the plan. And I actually don't even like "plan mode" — I mean, obviously it's very useful, but I think there's something more general here, where you have to work with your agent to design a spec that is very detailed. Maybe it's basically the docs — get the agents to write them, and you're in charge of the oversight and the top-level categories, but the agents are doing a lot of the work under the hood. And so you're not caring about some of the details. As an example, with arrays or tensors in neural networks — there's a ton of detail between PyTorch and NumPy and all the different little API details with pandas and so on. And I've already forgotten things like keepdims versus keepdim, or whether it's dim or axis, or reshape or permute or transpose — I don't remember this stuff anymore, because you don't have to. This is the kind of detail that's handled by the intern, because they have very good recall. But you still have to know, for example, that there's an underlying tensor and there's a view of it, and you can manipulate a view of the same storage or you can have different storage — which would be less efficient — so you still have to understand what this stuff is doing at a fundamental level, so that you're not copying memory around unnecessarily. But the details of the APIs are now handed off. So you're in charge of the taste, the engineering, the design — making sure it makes sense, asking for the right things, saying "these have to be unique user IDs that we're going to tie everything to." You're doing the design and direction, and the agents are doing the fill-in-the-blanks. That's currently kind of where we are, and I think that's what everyone is seeing.
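The division of labor in miniature — the intern-level API trivia on top, the fundamentals you still own below. A short sketch using standard NumPy and PyTorch behavior:

```python
import numpy as np
import torch

# Intern-level trivia: NumPy says axis/keepdims, PyTorch says dim/keepdim.
a = np.ones((2, 3)).sum(axis=0, keepdims=True)  # shape (1, 3)
t = torch.ones(2, 3).sum(dim=0, keepdim=True)   # shape (1, 3)

# The fundamentals you still own: a view shares the underlying storage,
# so no memory is copied...
x = torch.arange(6)
v = x.view(2, 3)
assert v.data_ptr() == x.data_ptr()  # same storage, different view
v[0, 0] = 100
assert x[0] == 100                   # mutating the view mutates x

# ...while forcing a non-contiguous view to be contiguous allocates and
# copies -- the kind of cost you are still responsible for noticing.
w = v.t()             # transpose: still a view of the same storage
c = w.contiguous()    # materializes a fresh copy
assert c.data_ptr() != x.data_ptr()
```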
Animals vs. Ghosts — What Kind of Thing Is an LLM?
STEPHANIE: Do you think there's a chance that this taste and judgment matters less over time, or will the ceiling just keep rising?
ANDREJ: Yeah, it's a good question. I mean, I'm hoping that it improves. I think probably the reason it doesn't improve right now is, again, it's not part of the RL. There's probably no aesthetics cost or reward, or it's not good enough, or something like that. And I do think that when you actually look at the code, sometimes I get a little bit of a heart attack because it's not like super amazing code necessarily all the time — it's very bloated, there's a lot of copy-paste, there are awkward abstractions that are brittle. It works, but it's just really gross. And I do hope that this can improve in future models.
A good example also is this microGPT project, where I was trying to simplify LLM training to be as simple as possible. The models hate this. They can't do it. I kept trying to prompt an LLM to "simplify more, simplify more," and it just can't. You feel like you're outside of the RL circuits. It feels like you're pulling teeth — it's not light-speed. So I do think that people still remain in charge of this. But I do think that there's nothing fundamental preventing it from improving. It's just that the labs haven't done it yet, almost.
STEPHANIE: Yeah. So I'd love to come back to this idea of jagged forms of intelligence. You wrote a little bit about this in a very thought-provoking piece around animals versus ghosts — the idea that we're not building animals, we are summoning ghosts. These are jagged forms of intelligence shaped by data and reward functions, but not by intrinsic motivation, fun, curiosity, or empowerment — things that came about via evolution. Why does that framing matter, and what does it actually change about how you build, deploy, evaluate, or even trust them?
ANDREJ: Yeah. So I think the reason I wrote about this is that I'm trying to wrap my head around what these things are. Because if you have a good model of what they are — or are not — then you're going to be more competent at using them. And I do think that... I'm not sure if it actually has real practical power.
[laughter]
I think it's a little bit of philosophizing. But I do think it's just coming to terms with the fact that these things are not animal intelligences. Like, if you yell at them, they're not going to work better or worse — it doesn't have any impact. It's all just statistical simulation circuits, where the substrate is pre-training — so statistics — and then RL is bolted on top, which kind of grows out the appendages. And maybe it's just a mindset — for what I'm walking into, what's likely to work or not likely to work, how to modify it. But I don't actually have "here are the five obvious ways to make your system better." It's more just being suspicious of it, and figuring things out over time.
IX. Agents Everywhere and Learning
STEPHANIE: That's where it starts. Okay, so you are so deep in working with agents that don't just chat — they have real permissions, they have local context, they actually take action on your behalf. What does the world look like when we all start to live in that world?
ANDREJ: Yeah. I think a lot of people here are excited about what this agent-native environment looks like — and everything has to be rewritten. Everything is still fundamentally written for humans and has to be reworked. Most of the time, when I use different frameworks or libraries or things like that, I'm still reading docs that are fundamentally written for humans. This is my favorite pet peeve. Why are people still telling me what to do? Like, I don't want to do anything. What is the thing I should copy-paste to my agent?
[laughter]
So it's just — every time I'm told, you know, "go to this URL" or something like that, it's just like — ugh.
[laughter]
So everyone is excited about how we decompose the workloads that need to happen into, fundamentally, sensors and actuators over the world. How do we make it agent-native? Basically, describe things to agents first, and then have a lot of automation around data structures that are very legible to LLMs. So I'm hoping that there's a lot of agent-first infrastructure out there. And for MenuGen — famously, when I wrote the blog post about MenuGen — a lot of the trouble was not even writing the code for MenuGen. It was deploying it on Vercel, because I had to work with all these different services, string them up, go into their settings and menus, configure my DNS — and it was just so annoying. That's a good example of something I would hope could change: I give a prompt to an LLM, "build MenuGen," and then I don't have to touch anything, and it's deployed on the internet. I think that would be a good test of whether our infrastructure is becoming more and more agent-native.
And then ultimately, I do think we're going towards a world where there's agent representation for people and for organizations — I'll have my agent talk to your agent to figure out some of the details of our meetings or things like that.
[laughter]
I do think that's roughly where things are going. But yeah, I think everyone here is excited about that.
What Still Remains Worth Learning Deeply
STEPHANIE: I really like the visual analogy of sensors and actuators. I actually hadn't thought of that. That's super interesting.
ANDREJ: Right?
STEPHANIE: Okay, I think we have to end on a question about education. Because you are probably one of the very best in the world at making complex technical concepts simple, and you are deeply thoughtful about how we design education around it. What still remains worth learning deeply when intelligence gets cheap, as we move into the next era of A.I.?
ANDREJ: Yeah. There was a tweet that blew my mind recently, and I keep thinking about it every other day. It was something along the lines of: "You can outsource your thinking, but you can't outsource your understanding." And I think that's really nicely put. Because I'm still part of the system, and information still has to make it into my brain. And I feel like I'm becoming a bottleneck — just in terms of knowing what we're trying to build, why it's worth doing, how to direct my agents, and so on. So I do still think that something has to direct the thinking and the processing, and that's still fundamentally constrained by understanding. And this is one reason I was also very excited about LLM knowledge bases — because I feel like that's a way for me to process information. Any time I see a different projection of the information, I always feel like I gain insight. It's really just a lot of prompts for me to do synthetic data generation over some fixed data. I really enjoy it — whenever I read an article, I have my wiki being built up from these articles, and I love asking questions about things. And I think that ultimately these are tools to enhance understanding. And that is still the bottleneck, because you can't be a good director without it — the LLMs certainly don't excel at understanding, so you are still uniquely in charge of that. So yeah, I think tools to that effect are incredibly interesting and exciting.
STEPHANIE: I'm excited to be back here in a couple of years and to see if we've been fully automated out of the loop and they actually take care of understanding as well. Thank you so much for joining us, Andrej. We really appreciate it.
[applause]
Transcript source: AI Ascent 2026, hosted by Sequoia Capital. Edited for readability from the original speech-to-text recording.