Commit 38fa125

chore: Update crawled data
1 parent b22534f commit 38fa125

66 files changed, +1091 -725 lines changed

crawled_output/.txt

Lines changed: 1 addition & 1 deletion
@@ -96,7 +96,7 @@ Latest in Transportation
In Brief
Tesla board chair calls debate over Elon Musk’s $1T pay package ‘a little bit weird’
Anthony Ha
-2 hours ago
+3 hours ago
Transportation
Ram ends EV pickup truck plans
Kirsten Korosec

crawled_output/24-llm-data-transformers-and-relentless-compute-c4a#comments.txt

Lines changed: 119 additions & 22 deletions
@@ -1,4 +1,4 @@
-(2/4) LLM: Data, Transformers, and Relentless Compute - DEV Community
+(2/3) LLM: Data, Transformers, and Relentless Compute - DEV Community
Forem Feed
Follow new Subforems to improve your feed
DEV Community
@@ -72,20 +72,18 @@ Share to Mastodon
Report Abuse
Jimin Lee
Posted on Sep 13
-(2/4) LLM: Data, Transformers, and Relentless Compute
+(2/3) LLM: Data, Transformers, and Relentless Compute
#data
#ai
#machinelearning
#llm
-LLM (4 Part Series)
+LLM (3 Part Series)
1
-(1/4) LLM: How LLMs Became the Bedrock of Modern AI
+(1/3) LLM: How LLMs Became the Bedrock of Modern AI
2
-(2/4) LLM: Data, Transformers, and Relentless Compute
+(2/3) LLM: Data, Transformers, and Relentless Compute
3
-(3/4) LLM: Inside the Transformer
-4
-(4/4) LLM: In-Context Learning, Hype, and the Road Ahead
+(3/3) LLM: In-Context Learning, Hype, and the Road Ahead
This post was written in April 2023, so some parts may now be a bit outdated. However, most of the key ideas about LLMs remain just as relevant today.
Large Language Models
So, what happens when a regular Language Model gets bigger? You get a Large Language Model (LLM).
@@ -218,16 +216,120 @@ Perfect alignment:
The job of an LM is “predict the next word.”
The mechanism of the Transformer decoder is auto-regressive next-word prediction.
If we want a plain LM that predicts the next word in the same language, we make a small but important tweak…
-In the next post, I’ll cover the background that made LLMs possible, including a closer look at Transformers that I couldn’t fully explore this time.
-LLM (4 Part Series)
+Encoder-Only, Decoder-Only
+The full Encoder–Decoder Transformer is powerful, but not everyone needs both halves. Researchers asked: What if we only used the encoder? What if we only used the decoder?
+Encoder-Only
+The most famous encoder-only model? BERT.
+BERT keeps just the encoder stack. Sometimes all you need is a good representation of text (context vectors), not generation.
+Great for classification tasks:
+Is this review positive or negative?
+Does this sentence contain a person’s name?
+Classification works on embeddings. Better embeddings → better classifiers. BERT looks at text bidirectionally, encodes whole sentences, and produces rich representations. Plug them into a classifier and accuracy jumps.
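
A minimal sketch of this recipe, assuming the Hugging Face transformers and scikit-learn libraries (the checkpoint name, toy labels, and the [CLS] pooling choice are illustrative assumptions, not from the post):

# Sketch: encode sentences with BERT, then train a simple classifier on the embeddings.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

texts = ["I loved this movie!", "Terrible, a waste of time."]
labels = [1, 0]  # toy sentiment labels: 1 = positive, 0 = negative

with torch.no_grad():
    batch = tokenizer(texts, padding=True, return_tensors="pt")
    # Take the final hidden state of the [CLS] token as a sentence embedding.
    embeddings = encoder(**batch).last_hidden_state[:, 0, :].numpy()

clf = LogisticRegression().fit(embeddings, labels)  # classifier on top of BERT's representations
print(clf.predict(embeddings))
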
+Is BERT a language model? Strictly, no — it doesn’t do auto-regressive next-word prediction. It’s trained as a masked language model (predict the missing word), which is different from traditional LMs.
+Decoder-Only
+On the other side: GPT.
+GPT (GPT-2/3, ChatGPT, GPT-4…) keeps only the decoder stack.
+Why drop the encoder? If your goal is just next-word prediction — the pure LM task — you can feed the decoder with the text so far and let it continue auto-regressively.
+Input: “The flowers by the roadside are blooming”
+Decoder predicts: “beautifully.”
+That prediction feeds back in, and generation continues.
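
A minimal sketch of that feedback loop, assuming the Hugging Face transformers library (gpt2, the prompt, and greedy decoding are illustrative choices, not from the post):

# Sketch: auto-regressive generation, where each predicted token is appended and fed back in.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tokenizer("The flowers by the roadside are blooming", return_tensors="pt").input_ids
for _ in range(5):                                   # generate five tokens, one at a time
    with torch.no_grad():
        logits = model(ids).logits                   # a score for every vocabulary token
    next_id = logits[:, -1, :].argmax(dim=-1)        # greedy pick of the most likely next token
    ids = torch.cat([ids, next_id.unsqueeze(0)], dim=-1)  # feed the prediction back in

print(tokenizer.decode(ids[0]))
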
+This is why GPT and its cousins (LaMDA, PaLM, LLaMA, Claude, etc.) follow the decoder-only recipe. It’s the simplest and most direct way to scale LMs into generative engines.
+Encoder + Decoder
+Models like T5 and BART keep the full structure and shine at clear input → output transformations (translation, summarization, etc.).
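
As an illustration of such an input → output transformation, a short summarization sketch assuming the Hugging Face pipeline API (the t5-small checkpoint and the sample text are arbitrary):

# Sketch: a full encoder-decoder model doing sequence-to-sequence summarization.
from transformers import pipeline

summarizer = pipeline("summarization", model="t5-small")
article = ("Large Language Models combine web-scale data, the Transformer "
           "architecture, and massive compute to predict the next word, "
           "which turns out to be enough to generate fluent text.")
print(summarizer(article, max_length=25, min_length=5)[0]["summary_text"])
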
+Encoder vs. Decoder
+Historically, encoder-only exploded first (BERT) because many NLP tasks were classification-heavy. Decoder-only models initially looked like “nonsense generators.”
+Key difference:
+Encoder-only models can’t generate text.
+Decoder-only models can — and with scale, their potential is enormous. Even classification can be reframed as generation (“The review is … [positive/negative]”).
+That’s why decoder-only LMs became the dominant LLMs.
+A Long Tradition
+Transformers didn’t invent encoder–decoder. Before 2017, RNNs/LSTMs/GRUs were the standard way to build it. Transformers replaced RNNs.
+Biggest reason people cite: Self-Attention.
+Why Do Transformers Work So Well? Self-Attention
+Two concepts are central:
+The Encoder–Decoder structure
+Self-Attention
+Let’s start with Attention itself.
+Attention
+Attention first showed up in RNN-based seq2seq models. Recall the pipeline:
+Input → Encoder → Context → Decoder → Output
+The decoder generates tokens one by one. Early models used a fixed Context for every step, but different output words need to “look back” at different parts of the input.
+Example:
+“나는 어제 학교에 갔습니다.” → “I went to school yesterday.”
+If the model could focus on 갔습니다 (went) and 어제 (yesterday) at the right time, it would more reliably pick “went” (past tense) over “go.”
+That’s Attention: at each step, re-weight which parts of the input matter most.
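
A tiny numeric sketch of that re-weighting, with made-up vectors standing in for learned encoder and decoder states:

# Sketch: one decoder step attending over the encoder states (toy numbers, nothing learned).
import numpy as np

encoder_states = np.array([      # one row per source word: 나는, 어제, 학교에, 갔습니다
    [0.1, 0.3],
    [0.9, 0.2],
    [0.2, 0.1],
    [0.8, 0.7],
])
decoder_state = np.array([0.7, 0.6])             # decoder state while producing "went"

scores = encoder_states @ decoder_state          # similarity between decoder state and each source word
weights = np.exp(scores) / np.exp(scores).sum()  # softmax turns scores into attention weights
context = weights @ encoder_states               # weighted mix of encoder states for this step

print(weights.round(2))  # 갔습니다 and 어제 end up with the largest weights
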
+Self-Attention
+Seq2seq Attention asks: Which parts of the source should I attend to while generating the target?
+Self-Attention asks: Within a single sentence, which words should each word attend to?
+Example:
+“The animal didn’t cross the street because it was too tired.”
+Here, “it” should link strongly to “animal”, but also relates to “tired.”
+Why is this powerful for LMs?
+To predict “bloomed” in “The flowers by the roadside … bloomed,” “flowers” should get the highest weight.
+To pick tense, “yesterday” matters more than “school.”
+Self-Attention lets the model discover this automatically.
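
For concreteness, a minimal scaled dot-product self-attention sketch in NumPy; the embeddings and projection matrices are random, so it only shows the mechanics, not learned behaviour:

# Sketch: scaled dot-product self-attention over one toy "sentence".
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8                     # five tokens, eight-dimensional embeddings
x = rng.normal(size=(seq_len, d_model))     # toy token embeddings

W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v         # queries, keys, values all come from the SAME sentence

scores = Q @ K.T / np.sqrt(d_model)         # every token scores every other token
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
output = weights @ V                        # each token becomes a weighted mix of all tokens

print(weights.shape, output.shape)          # (5, 5) attention map, (5, 8) new representations
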
+Multi-Head Self-Attention
+Language has multiple relationship types:
+Grammatical (subject ↔ verb)
+Semantic (animal ↔ it)
+Attributes (it ↔ tired)
+One attention map can’t capture every view. The fix: run multiple attention heads in parallel, each with a different “view.”
+Under the hood, word embeddings are split into subspaces (chunks of numbers). Each head attends within a different subspace, encouraging different aspects (grammar, meaning, style) to emerge.
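
A rough sketch of that per-head splitting, again with random NumPy arrays (the head count and sizes are arbitrary):

# Sketch: split the model dimension into heads, attend inside each subspace, then concatenate.
import numpy as np

rng = np.random.default_rng(1)
seq_len, d_model, n_heads = 5, 8, 2
d_head = d_model // n_heads                      # each head works in a smaller subspace

Q = rng.normal(size=(seq_len, d_model))
K = rng.normal(size=(seq_len, d_model))
V = rng.normal(size=(seq_len, d_model))

heads = []
for h in range(n_heads):
    sl = slice(h * d_head, (h + 1) * d_head)     # this head's chunk of the embedding
    scores = Q[:, sl] @ K[:, sl].T / np.sqrt(d_head)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)           # each head gets its own attention map
    heads.append(w @ V[:, sl])

output = np.concatenate(heads, axis=-1)          # back to shape (seq_len, d_model)
print(output.shape)
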
+Instead of one spotlight, give the model a dozen flashlights, each shining on a different relationship.
+That’s the magic of Multi-Head Self-Attention — one of the key reasons Transformers dethroned RNNs.
+175B? 540B? What Do Parameter Counts Actually Mean?
+You’ll often hear sizes like 175B (GPT-3) or 540B (PaLM). These are the number of parameters — the weights in the Transformer.
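
A back-of-the-envelope sketch of where a number like 175B comes from, assuming GPT-3's published configuration (96 layers, 12,288-dimensional model) and the usual 12·d² per-layer rule of thumb, which ignores embeddings and biases:

# Sketch: rough parameter count for a decoder-only Transformer.
n_layers, d_model = 96, 12288          # GPT-3's published configuration
params_per_layer = 12 * d_model ** 2   # ~4*d^2 for attention projections + ~8*d^2 for the feed-forward block
total = n_layers * params_per_layer
print(f"{total / 1e9:.0f}B parameters")  # ~174B, close to the advertised 175B
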
+More parameters → more capacity. Hence the popular (but flawed) shortcut:
+Bigger model → better performance.
+In reality, performance depends on more than size:
+How much data was used?
+How high-quality was that data?
+Were the hyperparameters tuned well?
+How long (and how thoroughly) was the model trained?
+So why do parameter counts dominate? They’re easy to understand.
+If someone asks, “Which model is better, A or B?” you could unpack data quality, training steps, and optimizers… or say:
+“Model A is 70B. Model B is 200B. Model B is better.”
+It’s not necessarily true — but it’s simple.
+⚠️ Pro tip: If someone talks about model quality only in terms of parameter count, be cautious. They either don’t fully understand, or they’re trying to sell you something.
+Transformer in a Nutshell
+Transformers were designed for Sequence-to-Sequence tasks.
+The most common form is the Encoder–Decoder structure.
+Variants exist: Encoder-only (BERT), Decoder-only (GPT), Encoder+Decoder (T5, BART).
+To generate language, you need a Decoder.
+A core innovation is Self-Attention.
+To capture different perspectives (grammar, semantics, style), Transformers use Multi-Head Self-Attention.
+Compute Power
+The last ingredient: compute.
+LLMs wouldn’t exist without massive progress in hardware and infrastructure:
+GPUs (and TPUs) unlocked massively parallel training. GPUs were the rocket fuel of the deep learning boom, and today Nvidia still dominates with CUDA, optimized libraries, and cutting-edge hardware.
+Parallel training techniques allow hundreds (or thousands) of GPUs to train a single model in sync.
+Cloud infrastructure made it practical. Buying racks of GPUs is brutally expensive — and they start depreciating the moment you unbox them. Renting from AWS, Azure, or GCP lets teams scale without opening a hardware graveyard in the office.
+In short: faster chips + smarter software + elastic cloud = the horsepower that makes LLMs possible.
+Why LLMs Happened Now
+We’ve walked through the three big ingredients:
+Data: Web-scale text + self-supervised learning → oceans of training material.
+Algorithms: Transformers (self-attention, scalable stacks) replaced RNNs.
+Compute: GPUs/TPUs + cloud infrastructure → enough horsepower to train monster models.
+Each piece alone would’ve been impressive. Put together, they sparked a step-change.
+A decade ago, we had:
+Limited datasets (a few gigabytes at most).
+Algorithms (RNNs, LSTMs) that struggled with long sequences.
+GPUs that couldn’t realistically handle 100B+ parameter models.
+Today, we have:
+Tens of terabytes of training data at our fingertips.
+Transformer architectures that scale beautifully.
+GPU/TPU clusters that can train trillion-parameter models.
+No single breakthrough “invented” LLMs. It was the intersection of trends — data, algorithms, compute — that finally clicked into place.
+That’s why LLMs feel like they appeared “all of a sudden.” The truth is, researchers were laying the groundwork for years. The moment the three factors aligned, the field exploded.
+And that’s where we are now: riding the wave of models that are bigger, smarter, and more capable than anyone thought possible five years ago.
+In the next post, I’ll dive into zero-shot, few-shot, prompting, and the rest of the story.
+LLM (3 Part Series)
1
-(1/4) LLM: How LLMs Became the Bedrock of Modern AI
+(1/3) LLM: How LLMs Became the Bedrock of Modern AI
2
-(2/4) LLM: Data, Transformers, and Relentless Compute
+(2/3) LLM: Data, Transformers, and Relentless Compute
3
-(3/4) LLM: Inside the Transformer
-4
-(4/4) LLM: In-Context Learning, Hype, and the Road Ahead
+(3/3) LLM: In-Context Learning, Hype, and the Road Ahead
Top comments (0)
Subscribe
Personal
@@ -250,15 +352,10 @@ My name is Jimin.
Joined
Sep 13, 2025
More from Jimin Lee
-(4/4) LLM: In-Context Learning, Hype, and the Road Ahead
+(3/3) LLM: In-Context Learning, Hype, and the Road Ahead
#llm
#machinelearning
-(3/4) LLM: Inside the Transformer
-#deeplearning
-#architecture
-#ai
-#llm
-(1/4) LLM: How LLMs Became the Bedrock of Modern AI
+(1/3) LLM: How LLMs Became the Bedrock of Modern AI
#llm
#machinelearning
💎 DEV Diamond Sponsors
