Jimin Lee
Posted on Sep 13

(2/3) LLM: Data, Transformers, and Relentless Compute
#data #ai #machinelearning #llm

LLM (3 Part Series)
1. (1/3) LLM: How LLMs Became the Bedrock of Modern AI
2. (2/3) LLM: Data, Transformers, and Relentless Compute
3. (3/3) LLM: In-Context Learning, Hype, and the Road Ahead
This post was written in April 2023, so some parts may now be a bit outdated. However, most of the key ideas about LLMs remain just as relevant today.

Large Language Models

So, what happens when a regular Language Model gets bigger? You get a Large Language Model (LLM).

Perfect alignment:

- The job of an LM is “predict the next word.”
- The mechanism of the Transformer decoder is auto-regressive next-word prediction.

If we want a plain LM that predicts the next word in the same language, we make a small but important tweak…
Encoder-Only, Decoder-Only

The full Encoder–Decoder Transformer is powerful, but not everyone needs both halves. Researchers asked: What if we only used the encoder? What if we only used the decoder?

Encoder-Only

The most famous encoder-only model? BERT.

BERT keeps just the encoder stack. Sometimes all you need is a good representation of text (context vectors), not generation.

Great for classification tasks:

- Is this review positive or negative?
- Does this sentence contain a person’s name?

Classification works on embeddings. Better embeddings → better classifiers. BERT looks at text bidirectionally, encodes whole sentences, and produces rich representations. Plug them into a classifier and accuracy jumps.
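As a rough illustration, here is a minimal sketch of that workflow using the Hugging Face transformers library (the model name bert-base-uncased and the toy sentences are just placeholder choices): pull the encoder's sentence representation and hand it to any ordinary classifier.

```python
# Sketch: use BERT's encoder output as features for a downstream classifier.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["The food was amazing!", "Terrible service, never again."]
batch = tokenizer(sentences, padding=True, return_tensors="pt")

with torch.no_grad():
    outputs = encoder(**batch)

# One vector per sentence: the hidden state of the [CLS] token.
features = outputs.last_hidden_state[:, 0, :]   # shape: (2, 768)

# Any simple classifier (logistic regression, a small MLP, ...) can now be
# trained on these vectors for sentiment, name detection, and so on.
print(features.shape)
```

In practice BERT is usually fine-tuned end to end rather than frozen, but the idea is the same: the encoder's job is to produce a good representation.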
Is BERT a language model? Strictly, no — it doesn’t do auto-regressive next-word prediction. It’s trained as a masked language model (predict the missing word), which is different from traditional LMs.
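To make “predict the missing word” concrete, here is a hedged sketch using the fill-mask pipeline from Hugging Face transformers (again assuming bert-base-uncased):

```python
# Sketch: BERT's masked-language-model objective in action.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT sees the whole sentence (both sides of the blank) and guesses the gap,
# unlike an auto-regressive LM, which only sees the words to the left.
for guess in fill_mask("The flowers by the roadside are [MASK] beautifully."):
    print(guess["token_str"], round(guess["score"], 3))
```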
Decoder-Only

On the other side: GPT.

GPT (GPT-2/3, ChatGPT, GPT-4…) keeps only the decoder stack.

Why drop the encoder? If your goal is just next-word prediction — the pure LM task — you can feed the decoder with the text so far and let it continue auto-regressively.

Input: “The flowers by the roadside are blooming”

Decoder predicts: “beautifully.”

That prediction feeds back in, and generation continues.
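That feedback loop is easy to see in code. Here is a minimal greedy-decoding sketch with the Hugging Face transformers library (gpt2 is just a small stand-in model; real systems use sampling and far larger decoders):

```python
# Sketch: auto-regressive generation: predict one token, append it, repeat.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tokenizer("The flowers by the roadside are blooming", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(8):                                           # generate 8 more tokens
        logits = model(ids).logits                               # (batch, seq_len, vocab)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy pick of the next token
        ids = torch.cat([ids, next_id], dim=-1)                  # feed the prediction back in

print(tokenizer.decode(ids[0]))
```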
This is why GPT and its cousins (LaMDA, PaLM, LLaMA, Claude, etc.) follow the decoder-only recipe. It’s the simplest and most direct way to scale LMs into generative engines.

Encoder + Decoder

Models like T5 and BART keep the full structure and shine at clear input → output transformations (translation, summarization, etc.).
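For contrast, a minimal encoder–decoder sketch (t5-small and the translation task are illustrative choices, mirroring standard Hugging Face pipeline usage):

```python
# Sketch: an encoder-decoder model mapping a clear input to a clear output.
from transformers import pipeline

translator = pipeline("translation_en_to_de", model="t5-small")
print(translator("The flowers by the roadside are blooming beautifully."))
# The encoder reads the English sentence; the decoder generates the German one.
```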
Encoder vs. Decoder

Historically, encoder-only exploded first (BERT) because many NLP tasks were classification-heavy. Decoder-only models initially looked like “nonsense generators.”

Key difference:

- Encoder-only models can’t generate text.
- Decoder-only models can — and with scale, their potential is enormous. Even classification can be reframed as generation (“The review is … [positive/negative]”), as the sketch below shows.
That’s why decoder-only LMs became the dominant LLMs.
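To illustrate that reframing, here is a hedged sketch: ask a small causal LM which continuation it prefers after a review. The gpt2 model and the exact prompt wording are placeholder choices, not a recommended recipe.

```python
# Sketch: classification reframed as next-word prediction.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Review: The food was cold and the waiter ignored us. The review is"
ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    next_token_logits = model(ids).logits[0, -1, :]   # scores for the next word

for label in [" positive", " negative"]:
    label_id = tokenizer(label).input_ids[0]          # first token of each label
    print(label.strip(), next_token_logits[label_id].item())
# Whichever label scores higher is the model's "classification."
```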
A Long Tradition

Transformers didn’t invent encoder–decoder. Before 2017, RNNs/LSTMs/GRUs were the standard way to build it. Transformers replaced RNNs.

Biggest reason people cite: Self-Attention.

Why Do Transformers Work So Well? Self-Attention

Two concepts are central:

- The Encoder–Decoder structure
- Self-Attention

Let’s start with Attention itself.

Attention

Attention first showed up in RNN-based seq2seq models. Recall the pipeline:

Input → Encoder → Context → Decoder → Output

The decoder generates tokens one by one. Early models used a fixed Context for every step, but different output words need to “look back” at different parts of the input.

Example:

“나는 어제 학교에 갔습니다.” → “I went to school yesterday.”

If the model could focus on 갔습니다 (went) and 어제 (yesterday) at the right time, it would more reliably pick “went” (past tense) over “go.”

That’s Attention: at each step, re-weight which parts of the input matter most.

Self-Attention

Seq2seq Attention asks: Which parts of the source should I attend to while generating the target?

Self-Attention asks: Within a single sentence, which words should each word attend to?

Example:

“The animal didn’t cross the street because it was too tired.”

Here, “it” should link strongly to “animal”, but also relates to “tired.”

Why is this powerful for LMs?

- To predict “bloomed” in “The flowers by the roadside … bloomed,” “flowers” should get the highest weight.
- To pick tense, “yesterday” matters more than “school.”

Self-Attention lets the model discover this automatically.
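Mechanically, self-attention is just a weighted average, where the weights come from how well each word's query matches every other word's key. A minimal NumPy sketch of scaled dot-product self-attention (random toy embeddings; a real model learns the Q/K/V projection matrices):

```python
# Sketch: scaled dot-product self-attention over one toy "sentence".
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 5, 16          # 5 words, 16-dimensional embeddings
x = rng.normal(size=(seq_len, d_model))

# Learned projections in a real model; random here for illustration.
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

scores = Q @ K.T / np.sqrt(d_model)             # how well each word matches every other word
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
output = weights @ V                            # each word becomes a weighted mix of all words

print(weights.round(2))   # row i shows which words word i attends to
```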
Multi-Head Self-Attention

Language has multiple relationship types:

- Grammatical (subject ↔ verb)
- Semantic (animal ↔ it)
- Attributes (it ↔ tired)

One attention map can’t capture every view. The fix: run multiple attention heads in parallel, each with a different “view.”

Under the hood, word embeddings are split into subspaces (chunks of numbers). Each head attends within a different subspace, encouraging different aspects (grammar, meaning, style) to emerge.

Instead of one spotlight, give the model a dozen flashlights, each shining on a different relationship.
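Continuing the NumPy sketch above, "splitting into subspaces" is literally a reshape: chop each word's vector into head-sized chunks, run the same attention per chunk, then concatenate the results (the head count and sizes are toy values):

```python
# Sketch: multi-head attention = the same attention, run once per subspace.
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 5, 16, 4
d_head = d_model // n_heads                     # 4 dimensions per head

x = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

def split_heads(m):
    # (seq_len, d_model) -> (n_heads, seq_len, d_head)
    return m.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

Q, K, V = (split_heads(x @ W) for W in (W_q, W_k, W_v))

heads = []
for h in range(n_heads):                        # each head attends in its own subspace
    scores = Q[h] @ K[h].T / np.sqrt(d_head)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    heads.append(w @ V[h])

output = np.concatenate(heads, axis=-1)         # glue the per-head views back together
print(output.shape)                             # (5, 16)
```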
That’s the magic of Multi-Head Self-Attention — one of the key reasons Transformers dethroned RNNs.

175B? 540B? What Do Parameter Counts Actually Mean?

You’ll often hear sizes like 175B (GPT-3) or 540B (PaLM). These are the number of parameters — the weights in the Transformer.
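Where does a number like 175B come from? A back-of-envelope sketch, using the common approximation of roughly 12 × layers × d_model² weights per Transformer stack (ignoring embeddings and biases) and GPT-3's published shape of 96 layers with a hidden size of 12,288:

```python
# Sketch: rough parameter count for a GPT-3-sized decoder stack.
n_layers = 96       # published GPT-3 depth
d_model  = 12288    # published GPT-3 hidden size

# Per layer: ~4*d_model^2 for attention (Q, K, V, output projections)
#          + ~8*d_model^2 for the feed-forward block (4x expansion, up + down).
params_per_layer = 12 * d_model ** 2
total = n_layers * params_per_layer

print(f"{total / 1e9:.0f}B parameters")   # about 174B, close to the quoted 175B
```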
More parameters → more capacity. Hence the popular (but flawed) shortcut:

Bigger model → better performance.

In reality, performance depends on more than size:

- How much data was used?
- How high-quality was that data?
- Were the hyperparameters tuned well?
- How long (and how thoroughly) was the model trained?

So why do parameter counts dominate? They’re easy to understand.

If someone asks, “Which model is better, A or B?” you could unpack data quality, training steps, and optimizers… or say:

“Model A is 70B. Model B is 200B. Model B is better.”

It’s not necessarily true — but it’s simple.

⚠️ Pro tip: If someone talks about model quality only in terms of parameter count, be cautious. They either don’t fully understand, or they’re trying to sell you something.

Transformer in a Nutshell

Transformers were designed for Sequence-to-Sequence tasks.

The most common form is the Encoder–Decoder structure.

To capture different perspectives (grammar, semantics, style), Transformers use Multi-Head Self-Attention.
Compute Power

The last ingredient: compute.

LLMs wouldn’t exist without massive progress in hardware and infrastructure:

- GPUs (and TPUs) unlocked massively parallel training. GPUs were the rocket fuel of the deep learning boom, and today Nvidia still dominates with CUDA, optimized libraries, and cutting-edge hardware.
- Parallel training techniques allow hundreds (or thousands) of GPUs to train a single model in sync.
- Cloud infrastructure made it practical. Buying racks of GPUs is brutally expensive — and they start depreciating the moment you unbox them. Renting from AWS, Azure, or GCP lets teams scale without opening a hardware graveyard in the office.

In short: faster chips + smarter software + elastic cloud = the horsepower that makes LLMs possible.
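To get a feel for the scale, here is a hedged back-of-envelope calculation using the common "training compute ≈ 6 × parameters × tokens" rule of thumb. The token count is the roughly 300B figure reported for GPT-3; the sustained GPU throughput is an illustrative assumption.

```python
# Sketch: why one GPU is nowhere near enough for an LLM.
params = 175e9           # a GPT-3-sized model
tokens = 300e9           # roughly the reported GPT-3 training tokens
flops = 6 * params * tokens                  # ~3e23 floating-point operations

gpu_flops_per_sec = 100e12   # assumed ~100 TFLOP/s of sustained mixed-precision throughput
seconds = flops / gpu_flops_per_sec
print(f"{flops:.1e} FLOPs ≈ {seconds / 86400 / 365:.0f} GPU-years on one such GPU")
```

Hence the parallel-training and cloud points above: this work has to be spread across thousands of accelerators to finish in weeks rather than a century.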
Why LLMs Happened Now

We’ve walked through the three big ingredients:

- Data: Web-scale text + self-supervised learning → oceans of training material.
- Algorithms: The Transformer, with Self-Attention → architectures that scale.
- Compute: GPUs/TPUs + cloud infrastructure → the horsepower to train them.

Each piece alone would’ve been impressive. Put together, they sparked a step-change.
A decade ago, we had:

- Limited datasets (a few gigabytes at most).
- Algorithms (RNNs, LSTMs) that struggled with long sequences.
- GPUs that couldn’t realistically handle 100B+ parameter models.

Today, we have:

- Tens of terabytes of training data at our fingertips.
- Transformer architectures that scale beautifully.
- GPU/TPU clusters that can train trillion-parameter models.

No single breakthrough “invented” LLMs. It was the intersection of trends — data, algorithms, compute — that finally clicked into place.

That’s why LLMs feel like they appeared “all of a sudden.” The truth is, researchers were laying the groundwork for years. The moment the three factors aligned, the field exploded.

And that’s where we are now: riding the wave of models that are bigger, smarter, and more capable than anyone thought possible five years ago.

In the next post, I’ll dive into zero-shot, few-shot, prompting, and the rest of the story.