3.21.2026

"Inside AI — The Untold Story" Series - Article 5: How Does AI Learn?

Author: Claude AI, with supervision, prompting, and editing by HocTro


Opening

You know AI is capable because billions of parameters were adjusted through trillions of learning steps. But learned from what? How? And why does AI answer politely, admit when it doesn't know something, and try to be helpful — rather than just spewing raw text at random?

Training a modern language AI involves three distinct stages, each with a different goal. Understanding these three stages helps you understand why AI has the personality it has — and why it still has significant limitations.


Stage 1: Pretraining — Reading Everything Humans Have Ever Written

Pretraining is the first stage and the most expensive one. The goal is simple: expose AI to as much human-written text as possible so it can learn how language works.

Training data comes from many sources:

  • Internet content: A large portion of the public web — websites, blogs, forums, Wikipedia, news — is crawled (automatically collected) and fed in. Datasets like Common Crawl contain petabytes of text from billions of web pages.
  • Books: Millions of digitized books, from novels to scientific textbooks, medical literature, law, and history.
  • Code: Billions of lines of code from GitHub and open-source repositories — this is why AI can write programs.
  • Academic papers: Research articles, dissertations, journals — helping AI learn structured reasoning.
  • Conversations and forums: Reddit, Stack Overflow, and other platforms — helping AI learn natural human communication.

But it's not just "grab everything and throw it in." The data is carefully filtered and cleaned: spam, low-quality content, duplicate text, and a significant portion of harmful material are removed. The ratio of sources is also adjusted — books and academic papers, for instance, are weighted more heavily because they tend to have more precise language and clearer reasoning than social media posts.

How pretraining works: Remarkably simple in principle. The system takes a passage of text, hides the end of it, asks the model to guess the next token, compares the guess to the correct answer, and adjusts the weights accordingly. Repeat — trillions of times. No complex labeling, no elaborate right/wrong — just "guess the next token, then adjust."
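The loop described above can be sketched in a few lines of Python. This is a toy, not a real transformer: instead of adjusting billions of weights by gradient descent, it nudges a tiny table of next-word probabilities. But the shape of the loop — guess the next token, compare to the truth, adjust, repeat — is the same.

```python
# Toy sketch of the pretraining loop: "guess the next token, then adjust."
# A real model updates billions of weights via gradient descent; here a
# tiny bigram probability table is nudged instead (illustrative only).
import math

text = "the cat sat on the mat . the cat ran .".split()
vocab = sorted(set(text))

# Uniform starting "weights": P(next word | current word) for every pair.
probs = {w: {v: 1.0 / len(vocab) for v in vocab} for w in vocab}

def train_step(current, actual_next, lr=0.1):
    """One learning step: score the prediction, then nudge toward the truth."""
    row = probs[current]
    loss = -math.log(row[actual_next])      # how surprised were we?
    for v in row:                           # move toward the correct answer
        target = 1.0 if v == actual_next else 0.0
        row[v] += lr * (target - row[v])
    total = sum(row.values())               # keep probabilities normalized
    for v in row:
        row[v] /= total
    return loss

# "Repeat — trillions of times" (here: a few passes over a tiny corpus).
for epoch in range(50):
    for cur, nxt in zip(text, text[1:]):
        train_step(cur, nxt)

best = max(probs["the"], key=probs["the"].get)
print(best)  # after training, "the" is most often followed by "cat"
```

No labels were ever written by hand: the text itself supplies the "correct answer" at every step, which is what makes this stage scalable to the entire internet.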

What you get after pretraining: a model that can produce coherent text, understands context, and knows a great deal about the world. But it's not yet useful in a practical sense. Ask it a question and it might respond by continuing the text in the style of the internet — which could be helpful, or could be garbage, depending on which pattern it matched.


The Cost of Pretraining

Pretraining a large model is staggeringly expensive.

GPT-4 (estimated, as OpenAI doesn't publish exact figures): roughly $50–100 million in compute costs alone. Meta's LLaMA 3 (open-source model): estimated $30–60 million. Google's Gemini Ultra: estimated in the hundreds of millions.

Where does that money go? Running thousands of specialized GPU/TPU chips continuously for weeks to months, consuming enough electricity to power a small city. This is why large AI models can only be built by companies with enormous capital.


Stage 2: Fine-Tuning — Teaching AI How to Behave

After pretraining, the model needs to be shaped into something that behaves correctly as an assistant. This is fine-tuning.

In this stage, the training data is no longer raw internet text. Instead, people create high-quality datasets of example question-and-answer pairs:

  • Question: "Explain compound interest to a 10-year-old."
  • Model answer: [a clear, simple explanation with a relatable example]

Tens of thousands — sometimes hundreds of thousands — of these pairs are created by professional writers. The model is trained further on this dataset to learn how to respond in a helpful question-answer format, how to explain things clearly, and how to decline harmful requests.
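A fine-tuning dataset record might look like the sketch below. The field names and the "User:/Assistant:" chat format are illustrative assumptions, not any lab's actual schema — but the core idea holds: each record pairs a prompt with a model answer, and training is still next-token prediction, just on curated conversations instead of raw web text.

```python
# Hypothetical sketch of a supervised fine-tuning dataset record.
# Field names and the chat format are illustrative, not a real lab's schema.
import json

examples = [
    {"prompt": "Explain compound interest to a 10-year-old.",
     "response": "Imagine a piggy bank that pays you a little extra every "
                 "year, and then pays extra on that extra the year after."},
    {"prompt": "What should I do if I forget my password?",
     "response": "Use the 'forgot password' link on the login page to "
                 "request a reset email, then choose a new password."},
]

def to_training_text(record):
    # The pair is flattened into one token sequence; the model is still
    # trained with next-token prediction, just on question-answer text.
    return f"User: {record['prompt']}\nAssistant: {record['response']}"

# Datasets like this are commonly stored as JSON Lines: one record per line.
jsonl = "\n".join(json.dumps(r) for r in examples)
print(len(jsonl.splitlines()))  # one line per training example
```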

Fine-tuning is also costly, but far less so than pretraining — the dataset is smaller and the model already has a strong foundation. Typically measured in the millions rather than tens of millions of dollars.


Stage 3: RLHF — Real People Teaching AI to Be Better

This is the most interesting stage and the least widely known.

RLHF — Reinforcement Learning from Human Feedback — is the technique that created the leap from "a model that can generate text" to "a model that is genuinely helpful and reasonably safe."

The RLHF process works like this:

Step 1 — Collect comparisons: The model generates several different versions of an answer to the same question. Human raters read them and choose which one is better. For example: the model produces Answer A and Answer B to "What is compound interest?" — the rater picks A because it's clearer, has a real-world example, and contains no errors.

Step 2 — Train a reward model: From hundreds of thousands of these comparison pairs, a secondary model called the reward model is trained — it learns to predict "is this response good from the user's perspective?"
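The core of Step 2 can be written as one formula. Under the commonly used Bradley-Terry setup (an assumption here — labs don't always publish their exact loss), the reward model is pushed to score the chosen answer above the rejected one:

```python
# Sketch of the reward model's training signal under a Bradley-Terry setup:
# loss = -log(sigmoid(score_chosen - score_rejected)).
# The loss is small when the chosen answer already outscores the rejected one.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def preference_loss(score_chosen, score_rejected):
    """Penalty for ranking the rater's preferred answer too low."""
    return -math.log(sigmoid(score_chosen - score_rejected))

# A rater picked Answer A (scored 2.0) over Answer B (scored 0.5).
print(round(preference_loss(2.0, 0.5), 3))  # → 0.201 (ranking is right)
print(round(preference_loss(0.5, 2.0), 3))  # → 1.701 (ranking is wrong)
```

Minimizing this loss over hundreds of thousands of comparison pairs is what teaches the reward model to predict human preferences.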

Step 3 — Optimize using RL: The main model is adjusted to generate responses that the reward model scores highly — and to avoid responses the reward model scores poorly. This is the "reinforcement learning" part: the model learns to "play the game" in a way that earns the most points from the reward model.
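Step 3 can be illustrated with a bandit-style toy — not the real algorithm (production systems use PPO-style methods over full token sequences), and the reward values below are made up for illustration. The principle survives the simplification: responses the reward model scores above average get reinforced, the rest get discouraged.

```python
# Toy sketch of the RL step: nudge the model's preferences toward responses
# the reward model scores highly. Reward values here are assumed, and a
# simple REINFORCE-with-baseline update stands in for PPO.
import math
import random

random.seed(0)

responses = ["helpful answer", "rude answer", "off-topic answer"]
# Stand-in for the reward model's scores (illustrative values).
reward = {"helpful answer": 1.0, "rude answer": -1.0, "off-topic answer": -0.5}

prefs = {r: 0.0 for r in responses}  # the "policy": one score per response

def sample_response():
    # Sample in proportion to softmax of the preference scores.
    weights = [math.exp(prefs[r]) for r in responses]
    return random.choices(responses, weights=weights)[0]

baseline = sum(reward.values()) / len(reward)  # average reward as baseline
for _ in range(2000):
    picked = sample_response()
    # Reinforce: raise preferences for above-average responses, lower the rest.
    prefs[picked] += 0.05 * (reward[picked] - baseline)

print(max(prefs, key=prefs.get))  # → helpful answer
```

Note how the main model never sees a human directly in this step — it only "plays the game" against the reward model, which is why a flawed reward model can be gamed.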

The result of RLHF: AI becomes more helpful, safer, and less prone to generating harmful content. It even learns subtle things like "when a user seems upset, don't open with a bullet-point list."

RLHF isn't perfect — reward models can be "gamed" (the main model learns to sound good without being actually correct), and rater preferences reflect their cultural background. But compared to no RLHF, the difference is very significant.


The People Behind the RLHF Data

A detail that doesn't get talked about enough: most of the rating work in RLHF is not done by engineers at Anthropic or OpenAI — it's done by human raters working through digital labor platforms, often in developing countries like Kenya, the Philippines, or India, at relatively low wages.

This is worth sitting with. The intelligence and safety of AI models used by billions of people was built in part on the labor of unsung workers who spent hours reading and evaluating disturbing content so AI could learn to avoid it.


After Training: The Model Doesn't Update Automatically

One important point: once training is complete, the model "freezes." It does not keep learning from your daily conversations. If you tell Claude about something that happened today, Claude doesn't remember that for other users — or even for you in a future conversation.

This is why AI has a training cutoff — a date at which its data stops. Anything that happened after that date, AI simply doesn't know, unless you tell it directly in the current conversation.

Anthropic and other AI companies release new model versions periodically — each new version is trained again from (near) scratch with more recent data. It's not a "software update" in the usual sense — it's a full retraining.


Summing Up: Three Stages That Make an AI

Stage | Name        | Goal                         | Data
1     | Pretraining | Learn language and knowledge | Hundreds of billions of pages of raw text
2     | Fine-tuning | Learn how to be helpful      | Tens of thousands of high-quality Q&A pairs
3     | RLHF        | Learn user preferences       | Hundreds of thousands of human comparison ratings

Next article: where does all of this actually happen? The buildings full of thousands of specialized GPU chips, consuming electricity like a city, scattered across the world — that's the data center, and it's where AI lives.


Quick Reference Table

Concept         | Vietnamese Term       | Short Definition
Pretraining     | Tiền huấn luyện       | Learning from hundreds of billions of pages of raw text
Fine-tuning     | Tinh chỉnh            | Learning to respond helpfully from curated examples
RLHF            | Học từ phản hồi người | Adjusting based on human comparison ratings
Reward model    | Model phần thưởng     | A secondary model that predicts "is this response good?"
Human rater     | Người chấm điểm       | A real person who evaluates and compares AI responses
Training cutoff | Ngày cắt hạn          | The date after which AI has no knowledge of world events
Web crawling    | Thu thập dữ liệu web  | Automatically collecting text from across the internet

Key Things to Remember

  • AI training has three stages: pretraining (learning language), fine-tuning (learning to be helpful), and RLHF (learning user preferences).
  • Pretraining costs tens to hundreds of millions of dollars. This is why only large, well-funded companies can build large models.
  • RLHF is why AI is polite and safer — not naturally, but because real people rated responses and the model learned from those ratings.
  • AI does not learn from your conversations. After training is complete, the model is frozen.
  • Training cutoff is real. AI knows nothing about events after that date unless you provide the information yourself.
  • Next article: Where does all this training actually happen — the data centers where AI "lives."