3.21.2026

"Inside AI — The Untold Story" Series - Article 3: How Does AI Generate a Reply?

Author: Claude AI, with supervision, prompting, and editing by HocTro

Opening

When you ask AI "What is the capital of France?" and it answers "Paris" — your instinct is probably: of course, it knows that. It learned it.

But here's the surprising part: AI doesn't "know" in the human sense. It doesn't look up an answer in some internal knowledge vault, find the correct fact, and read it back to you. It does something much simpler — and much more interesting: it guesses the next word.

Read that again: it guesses the next word. Then the one after that. Then the one after that. All the way until the reply is done.

That sounds simple, maybe even a little naive. But this same mechanism of "continuous guessing" — when trained on hundreds of billions of pages of text — produces answers that sometimes make readers pause and read twice. This article explains why.


Your Phone Keyboard, But a Million Times Smarter

You've probably used the word suggestion bar on your phone's keyboard. You type "Happy" and the keyboard suggests "birthday" — because through millions of people typing those words, the system learned that "Happy" is often followed by "birthday."

AI does the same thing — just at a completely different scale and level of sophistication. Instead of learning from a few million phone users, it learned from hundreds of billions of pages of text: books, news articles, websites, code, conversations, academic papers, novels, forums. Instead of suggesting one simple word, it computes a probability distribution across tens of thousands of possible next tokens.

This mechanism is called autoregressive generation — each new token is produced based on all the tokens that came before it. "Auto" because the model feeds its own output back in as the next input; "regressive" because each step is predicted from everything already generated. The reply grows one token at a time, and each new token shapes the ones that follow.
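The loop above can be sketched in a few lines of Python. Everything here is a toy: the `next_token_probs` lookup table stands in for the neural network a real model uses, and the vocabulary is just a handful of words, but the shape of the loop — look back at everything so far, pick one token, append it, repeat — is the real autoregressive mechanism.

```python
import random

# Toy "model": given the tokens so far, return a probability table
# for the next token. A real model computes this with a neural
# network; this hand-written lookup is purely for illustration.
def next_token_probs(tokens):
    if tokens[-1] == "capital":
        return {"of": 0.9, "city": 0.1}
    if tokens[-1] == "of":
        return {"France": 0.8, "Italy": 0.2}
    if tokens[-1] in ("France", "Italy"):
        return {"is": 0.95, "<end>": 0.05}
    if tokens[-1] == "is":
        return {"Paris": 0.94, "Lyon": 0.02, "Marseille": 0.01, "<end>": 0.03}
    return {"<end>": 1.0}

def generate(prompt_tokens, max_tokens=10):
    tokens = list(prompt_tokens)
    for _ in range(max_tokens):
        probs = next_token_probs(tokens)             # look back at everything so far
        choices, weights = zip(*probs.items())
        token = random.choices(choices, weights)[0]  # pick one next token
        if token == "<end>":
            break
        tokens.append(token)                         # the reply grows one token at a time
    return tokens

print(generate(["The", "capital"]))
```

Run it a few times and you will see slightly different continuations — the same prompt, different replies, exactly because each step is a weighted guess rather than a lookup.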


Probability — What Does AI See Before It Speaks?

Before generating the next token, AI doesn't see a clear answer. It sees a probability table — a ranked list of every possible next token, each with an assigned likelihood.

Simple example: you ask "The capital of France is..." and AI computes something like:

  • "Paris" → 94% probability
  • "Lyon" → 2% probability
  • "Marseille" → 1% probability
  • Thousands of other tokens → 3% combined

"Paris" wins by a landslide. But AI doesn't always pick the highest-probability option — and this is where temperature comes in.
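Where does that probability table come from? The model first assigns every token a raw score (a "logit"), then converts the scores into probabilities with a function called softmax. The sketch below invents some logits chosen to roughly reproduce the percentages above — the specific numbers are made up for illustration.

```python
import math

# Hypothetical raw scores (logits) a model might assign after
# "The capital of France is..." — invented numbers, chosen only
# to roughly match the percentages in the article.
logits = {"Paris": 8.0, "Lyon": 4.2, "Marseille": 3.5, "Nice": 3.0}

def softmax(scores):
    # Subtract the max score for numerical stability, exponentiate,
    # then normalize so the probabilities sum to 1.
    m = max(scores.values())
    exps = {tok: math.exp(s - m) for tok, s in scores.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

probs = softmax(logits)
for tok, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(f"{tok:>10}: {p:.1%}")
```

Notice that softmax never outputs a zero: even a badly scored token keeps a tiny sliver of probability, which is exactly why sampling can occasionally surface an unusual word.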


Temperature — The Creativity Dial

Temperature is a parameter, typically from 0 to 2, that controls how much randomness enters the token selection process.

Low temperature (near 0): AI almost always picks the highest-probability token. Answers are predictable, consistent, less creative. Ask the same question ten times, get ten nearly identical answers. Best for tasks that need precision: translation, summarization, factual questions.

High temperature (near 1 or above): AI becomes more willing to pick less probable tokens — more randomness, more variety, sometimes more creative — but also sometimes more off-topic. Best for creative writing, brainstorming, open-ended exploration.

Picture a jazz musician. Low temperature: they play the sheet music exactly as written, every note in order. High temperature: they improvise freely, try unexpected chords — sometimes it's better than the original, sometimes it goes off the rails.

When you use Claude or ChatGPT through a regular interface, the temperature is already set to a balanced level — consistent enough to be useful, flexible enough not to feel robotic. When using the API, you can adjust it yourself to suit the task.
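Mechanically, temperature is just a division: the raw scores are divided by the temperature before softmax turns them into probabilities. A small divisor stretches the gaps between scores (the top token dominates); a large divisor shrinks them (the distribution flattens). A minimal sketch, with invented logits:

```python
import math

def softmax_with_temperature(logits, temperature):
    # Divide each raw score by the temperature before normalizing.
    # Temperature near 0 sharpens the distribution; above 1 flattens it.
    scaled = {tok: s / temperature for tok, s in logits.items()}
    m = max(scaled.values())
    exps = {tok: math.exp(s - m) for tok, s in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Invented scores for illustration
logits = {"Paris": 8.0, "Lyon": 4.2, "Marseille": 3.5}

for t in (0.2, 1.0, 1.8):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: " + ", ".join(f"{tok} {p:.1%}" for tok, p in probs.items()))
```

At T=0.2 "Paris" takes essentially all the probability mass; at T=1.8 the runners-up get a real chance — the same scores, a different dial setting.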


Greedy Decoding and Sampling — Two Ways to Pick a Token

Beyond temperature, there are two fundamental approaches to selecting the next token:

Greedy decoding: always pick the token with the highest probability. Fast, simple, predictable. But it can lead to repetitive or bland responses because AI always takes the safest path.

Sampling: choose randomly with weights — high-probability tokens have a better chance of being selected, but low-probability tokens still get a shot. Temperature controls the degree of weighting.

Most modern AI uses a combination: sampling with moderate temperature, plus techniques like top-p sampling (consider only the smallest set of top tokens whose combined probability reaches a threshold p, such as 0.9) to avoid wildly improbable choices.
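The three selection strategies fit in a few lines each. The probability table below is invented, and the threshold p = 0.95 is an arbitrary choice for the demo; real systems apply these rules to distributions over tens of thousands of tokens.

```python
import random

# Invented probability table for illustration
probs = {"Paris": 0.94, "Lyon": 0.02, "Marseille": 0.01, "Nice": 0.03}

def greedy(probs):
    # Always take the single most likely token.
    return max(probs, key=probs.get)

def sample(probs):
    # Weighted random choice: likelier tokens win more often,
    # but unlikely ones still get a shot.
    tokens, weights = zip(*probs.items())
    return random.choices(tokens, weights)[0]

def top_p_sample(probs, p=0.95):
    # Keep only the smallest set of top tokens whose cumulative
    # probability reaches p, then sample within that set.
    ranked = sorted(probs.items(), key=lambda kv: -kv[1])
    kept, cumulative = [], 0.0
    for tok, pr in ranked:
        kept.append((tok, pr))
        cumulative += pr
        if cumulative >= p:
            break
    tokens, weights = zip(*kept)
    return random.choices(tokens, weights)[0]

print(greedy(probs))  # always "Paris"
```

With p = 0.95, top-p keeps only "Paris" and "Nice" (0.94 + 0.03 crosses the threshold), so the two least likely tokens can never be chosen — randomness, but with a safety rail.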

The result: AI replies feel natural and flexible, not mechanical — but still consistent enough to be genuinely useful.


Hallucination — Why Does AI Make Things Up?

This is the part most people find hardest to accept. AI sometimes produces wrong information — not slightly wrong, but completely fabricated — presented with the calm confidence of someone stating an obvious fact. In the technical world, this phenomenon is called hallucination.

Why does it happen? The answer sits right inside the mechanism we just described: AI doesn't search for truth — it generates text with the highest probability of what should come next.

When you ask "What did scientist Nguyen Van X invent?", if AI doesn't have reliable information about that person from its training, it can't just naturally say "I don't know" — that's not how token generation works. Instead, it sees the pattern: question about a scientist + what did they invent → there must be something. So it generates an answer that "sounds right" — a project name, a year, even a journal name — all the product of guessing token by token according to probability, with nothing verified.

Here's the key point: AI doesn't know it's making things up. There's no mechanism inside the token generation process that says "this information is unverified." Each token is generated purely by probability — fabricated or factual, the output looks the same from the outside.

The most common hallucinations:

  • Fake quotes: AI invents a sentence and attributes it to a famous person — it sounds exactly like something they'd say, but they never said it
  • Non-existent references: AI lists books or academic papers with full author names, publication years, and page numbers — but none of it can be found on Google
  • Wrong dates: AI remembers that an event happened but places it in the wrong year or attributes it to the wrong person
  • Invented statistics: A number that looks very specific and convincing — "37.4% of users..." — but has no source

Practical defense: for anything that matters — historical facts, numbers, quotes, names, document titles — always verify from another source. AI is excellent at many things, but it's not a perfectly reliable encyclopedia.


So Does AI Actually "Understand"?

This is the million-dollar question — and the honest answer is: it depends on how you define "understand."

In the human sense — conscious, experiencing, feeling the meaning of things — AI doesn't understand. It doesn't "know" that Paris is the capital of France the way you know it: from a memory of seeing the Eiffel Tower in a photo, from a geography class in eighth grade, from actually standing there. It only knows that the token "Paris" has a very high probability of appearing after the phrase "the capital of France is."

But in the behavioral sense — reading, summarizing, reasoning, translating, explaining, writing — AI performs all of those tasks with outputs that closely resemble understanding. And at sufficient scale, the line between "genuinely understands" and "guesses extraordinarily well" becomes blurry enough that researchers are still debating it today.

The practical takeaway: treat AI like an incredibly smart collaborator who is occasionally overconfident and prone to making things up. It is most useful when you verify what matters and don't take anything, however confidently delivered, entirely on faith.


Summing Up

AI doesn't "think" and then "answer." It generates a reply one token at a time, each step choosing based on the probability of what should come next given everything that has come before. Temperature adjusts how creative or consistent that process is. And because the mechanism is probabilistic guessing — not truth retrieval — hallucination is a natural consequence, not a bug that can be fully patched away.

Next article: given this seemingly simple guess-the-next-token mechanism, how does AI become so remarkably capable? What is a neural network, what is the attention mechanism, and why does a bigger AI model suddenly know how to do things nobody explicitly taught it?


Quick Reference Table

Concept                    | Vietnamese Term         | Short Definition
Autoregressive generation  | Tạo văn bản tự hồi quy  | Generating each token based on all previous tokens
Probability distribution   | Phân phối xác suất      | The ranked likelihood of every possible next token
Temperature                | Nhiệt độ                | The dial that controls randomness vs. consistency
Greedy decoding            | Chọn tham lam           | Always picking the highest-probability token
Sampling                   | Lấy mẫu                 | Choosing randomly, weighted by probability
Hallucination              | Ảo giác AI              | AI generating confident but incorrect information

Key Things to Remember

  • AI guesses one token at a time. It doesn't look things up or "know" in the human sense — it continuously computes the probability of the next token.
  • Temperature controls creativity. Low = consistent and predictable. High = more creative, more surprising, sometimes more off-track.
  • Hallucination can't be fully eliminated. It's a natural consequence of probabilistic token generation, not a fixable bug.
  • AI doesn't know it's making things up. There's no internal alarm that flags unverified information during generation.
  • Always verify what matters. Quotes, numbers, document titles, historical events — cross-check from another source.
  • Next article: From this simple guessing mechanism, how does AI become genuinely intelligent?