Jimin Lee
Posted on Sep 13

(2/3) LLM: Data, Transformers, and Relentless Compute
#data #ai #machinelearning #llm

LLM (3 Part Series)
1. (1/3) LLM: How LLMs Became the Bedrock of Modern AI
2. (2/3) LLM: Data, Transformers, and Relentless Compute
3. (3/3) LLM: In-Context Learning, Hype, and the Road Ahead
This post was written in April 2023, so some parts may now be a bit outdated. However, most of the key ideas about LLMs remain just as relevant today.

Large Language Models

So, what happens when a regular Language Model gets bigger? You get a Large Language Model (LLM).

Perfect alignment:

- The job of an LM is “predict the next word.”
- The mechanism of the Transformer decoder is auto-regressive next-word prediction.

If we want a plain LM that predicts the next word in the same language, we make a small but important tweak…
Encoder-Only, Decoder-Only

The full Encoder–Decoder Transformer is powerful, but not everyone needs both halves. Researchers asked: What if we only used the encoder? What if we only used the decoder?

Encoder-Only

The most famous encoder-only model? BERT.

BERT keeps just the encoder stack. Sometimes all you need is a good representation of text (context vectors), not generation.

Great for classification tasks:

- Is this review positive or negative?
- Does this sentence contain a person’s name?

Classification works on embeddings. Better embeddings → better classifiers. BERT looks at text bidirectionally, encodes whole sentences, and produces rich representations. Plug them into a classifier and accuracy jumps.
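As a rough illustration, here is a minimal sketch of that workflow using the Hugging Face transformers library (the model name bert-base-uncased and the toy sentences are just placeholder choices): pull the encoder's sentence representation and hand it to any ordinary classifier.

```python
# Sketch: use BERT's encoder output as features for a downstream classifier.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["The food was amazing!", "Terrible service, never again."]
batch = tokenizer(sentences, padding=True, return_tensors="pt")

with torch.no_grad():
    outputs = encoder(**batch)

# One vector per sentence: the hidden state of the [CLS] token.
features = outputs.last_hidden_state[:, 0, :]   # shape: (2, 768)

# Any simple classifier (logistic regression, a small MLP, ...) can now be
# trained on these vectors for sentiment, name detection, and so on.
print(features.shape)
```

In practice BERT is usually fine-tuned end to end rather than frozen, but the idea is the same: the encoder's job is to produce a good representation.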
Is BERT a language model? Strictly, no — it doesn’t do auto-regressive next-word prediction. It’s trained as a masked language model (predict the missing word), which is different from traditional LMs.
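To make “predict the missing word” concrete, here is a hedged sketch using the fill-mask pipeline from Hugging Face transformers (again assuming bert-base-uncased):

```python
# Sketch: BERT's masked-language-model objective in action.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT sees the whole sentence (both sides of the blank) and guesses the gap,
# unlike an auto-regressive LM, which only sees the words to the left.
for guess in fill_mask("The flowers by the roadside are [MASK] beautifully."):
    print(guess["token_str"], round(guess["score"], 3))
```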
Decoder-Only

On the other side: GPT.

GPT (GPT-2/3, ChatGPT, GPT-4…) keeps only the decoder stack.

Why drop the encoder? If your goal is just next-word prediction — the pure LM task — you can feed the decoder with the text so far and let it continue auto-regressively.

Input: “The flowers by the roadside are blooming”

Decoder predicts: “beautifully.”

That prediction feeds back in, and generation continues.
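That feedback loop is easy to see in code. Here is a minimal greedy-decoding sketch with the Hugging Face transformers library (gpt2 is just a small stand-in model; real systems use sampling and far larger decoders):

```python
# Sketch: auto-regressive generation: predict one token, append it, repeat.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tokenizer("The flowers by the roadside are blooming", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(8):                                           # generate 8 more tokens
        logits = model(ids).logits                               # (batch, seq_len, vocab)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy pick of the next token
        ids = torch.cat([ids, next_id], dim=-1)                  # feed the prediction back in

print(tokenizer.decode(ids[0]))
```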
This is why GPT and its cousins (LaMDA, PaLM, LLaMA, Claude, etc.) follow the decoder-only recipe. It’s the simplest and most direct way to scale LMs into generative engines.

Encoder + Decoder

Models like T5 and BART keep the full structure and shine at clear input → output transformations (translation, summarization, etc.).
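For contrast, a minimal encoder–decoder sketch (t5-small and the translation task are illustrative choices, mirroring standard Hugging Face pipeline usage):

```python
# Sketch: an encoder-decoder model mapping a clear input to a clear output.
from transformers import pipeline

translator = pipeline("translation_en_to_de", model="t5-small")
print(translator("The flowers by the roadside are blooming beautifully."))
# The encoder reads the English sentence; the decoder generates the German one.
```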
Encoder vs. Decoder

Historically, encoder-only exploded first (BERT) because many NLP tasks were classification-heavy. Decoder-only models initially looked like “nonsense generators.”

Key difference:

- Encoder-only models can’t generate text.
- Decoder-only models can — and with scale, their potential is enormous. Even classification can be reframed as generation (“The review is … [positive/negative]”), as the sketch below shows.
That’s why decoder-only LMs became the dominant LLMs.
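To illustrate that reframing, here is a hedged sketch: ask a small causal LM which continuation it prefers after a review. The gpt2 model and the exact prompt wording are placeholder choices, not a recommended recipe.

```python
# Sketch: classification reframed as next-word prediction.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Review: The food was cold and the waiter ignored us. The review is"
ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    next_token_logits = model(ids).logits[0, -1, :]   # scores for the next word

for label in [" positive", " negative"]:
    label_id = tokenizer(label).input_ids[0]          # first token of each label
    print(label.strip(), next_token_logits[label_id].item())
# Whichever label scores higher is the model's "classification."
```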
A Long Tradition

Transformers didn’t invent encoder–decoder. Before 2017, RNNs/LSTMs/GRUs were the standard way to build it. Transformers replaced RNNs.

Biggest reason people cite: Self-Attention.

Why Do Transformers Work So Well? Self-Attention

Two concepts are central:

- The Encoder–Decoder structure
- Self-Attention

Let’s start with Attention itself.

Attention

Attention first showed up in RNN-based seq2seq models. Recall the pipeline:

Input → Encoder → Context → Decoder → Output

The decoder generates tokens one by one. Early models used a fixed Context for every step, but different output words need to “look back” at different parts of the input.

Example:

“나는 어제 학교에 갔습니다.” → “I went to school yesterday.”

If the model could focus on 갔습니다 (went) and 어제 (yesterday) at the right time, it would more reliably pick “went” (past tense) over “go.”

That’s Attention: at each step, re-weight which parts of the input matter most.

Self-Attention

Seq2seq Attention asks: Which parts of the source should I attend to while generating the target?

Self-Attention asks: Within a single sentence, which words should each word attend to?

Example:

“The animal didn’t cross the street because it was too tired.”

Here, “it” should link strongly to “animal”, but also relates to “tired.”

Why is this powerful for LMs?

- To predict “bloomed” in “The flowers by the roadside … bloomed,” “flowers” should get the highest weight.
- To pick tense, “yesterday” matters more than “school.”

Self-Attention lets the model discover this automatically.
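Mechanically, self-attention is just a weighted average, where the weights come from how well each word's query matches every other word's key. A minimal NumPy sketch of scaled dot-product self-attention (random toy embeddings; a real model learns the Q/K/V projection matrices):

```python
# Sketch: scaled dot-product self-attention over one toy "sentence".
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 5, 16          # 5 words, 16-dimensional embeddings
x = rng.normal(size=(seq_len, d_model))

# Learned projections in a real model; random here for illustration.
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

scores = Q @ K.T / np.sqrt(d_model)             # how well each word matches every other word
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
output = weights @ V                            # each word becomes a weighted mix of all words

print(weights.round(2))   # row i shows which words word i attends to
```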
Multi-Head Self-Attention

Language has multiple relationship types:

- Grammatical (subject ↔ verb)
- Semantic (animal ↔ it)
- Attributes (it ↔ tired)

One attention map can’t capture every view. The fix: run multiple attention heads in parallel, each with a different “view.”

Under the hood, word embeddings are split into subspaces (chunks of numbers). Each head attends within a different subspace, encouraging different aspects (grammar, meaning, style) to emerge.

Instead of one spotlight, give the model a dozen flashlights, each shining on a different relationship.
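Continuing the NumPy sketch above, "splitting into subspaces" is literally a reshape: chop each word's vector into head-sized chunks, run the same attention per chunk, then concatenate the results (the head count and sizes are toy values):

```python
# Sketch: multi-head attention = the same attention, run once per subspace.
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 5, 16, 4
d_head = d_model // n_heads                     # 4 dimensions per head

x = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

def split_heads(m):
    # (seq_len, d_model) -> (n_heads, seq_len, d_head)
    return m.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

Q, K, V = (split_heads(x @ W) for W in (W_q, W_k, W_v))

heads = []
for h in range(n_heads):                        # each head attends in its own subspace
    scores = Q[h] @ K[h].T / np.sqrt(d_head)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    heads.append(w @ V[h])

output = np.concatenate(heads, axis=-1)         # glue the per-head views back together
print(output.shape)                             # (5, 16)
```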
That’s the magic of Multi-Head Self-Attention — one of the key reasons Transformers dethroned RNNs.

175B? 540B? What Do Parameter Counts Actually Mean?

You’ll often hear sizes like 175B (GPT-3) or 540B (PaLM). These are the number of parameters — the weights in the Transformer.
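Where does a number like 175B come from? A back-of-envelope sketch, using the common approximation of roughly 12 × layers × d_model² weights per Transformer stack (ignoring embeddings and biases) and GPT-3's published shape of 96 layers with a hidden size of 12,288:

```python
# Sketch: rough parameter count for a GPT-3-sized decoder stack.
n_layers = 96       # published GPT-3 depth
d_model  = 12288    # published GPT-3 hidden size

# Per layer: ~4*d_model^2 for attention (Q, K, V, output projections)
#          + ~8*d_model^2 for the feed-forward block (4x expansion, up + down).
params_per_layer = 12 * d_model ** 2
total = n_layers * params_per_layer

print(f"{total / 1e9:.0f}B parameters")   # about 174B, close to the quoted 175B
```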
More parameters → more capacity. Hence the popular (but flawed) shortcut:

Bigger model → better performance.

In reality, performance depends on more than size:

- How much data was used?
- How high-quality was that data?
- Were the hyperparameters tuned well?
- How long (and how thoroughly) was the model trained?

So why do parameter counts dominate? They’re easy to understand.

If someone asks, “Which model is better, A or B?” you could unpack data quality, training steps, and optimizers… or say:

“Model A is 70B. Model B is 200B. Model B is better.”

It’s not necessarily true — but it’s simple.

⚠️ Pro tip: If someone talks about model quality only in terms of parameter count, be cautious. They either don’t fully understand, or they’re trying to sell you something.

Transformer in a Nutshell

Transformers were designed for Sequence-to-Sequence tasks.

The most common form is the Encoder–Decoder structure.

To capture different perspectives (grammar, semantics, style), Transformers use Multi-Head Self-Attention.
Compute Power

The last ingredient: compute.

LLMs wouldn’t exist without massive progress in hardware and infrastructure:

- GPUs (and TPUs) unlocked massively parallel training. GPUs were the rocket fuel of the deep learning boom, and today Nvidia still dominates with CUDA, optimized libraries, and cutting-edge hardware.
- Parallel training techniques allow hundreds (or thousands) of GPUs to train a single model in sync.
- Cloud infrastructure made it practical. Buying racks of GPUs is brutally expensive — and they start depreciating the moment you unbox them. Renting from AWS, Azure, or GCP lets teams scale without opening a hardware graveyard in the office.

In short: faster chips + smarter software + elastic cloud = the horsepower that makes LLMs possible.
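To get a feel for the scale, here is a hedged back-of-envelope calculation using the common "training compute ≈ 6 × parameters × tokens" rule of thumb. The token count is the roughly 300B figure reported for GPT-3; the sustained GPU throughput is an illustrative assumption.

```python
# Sketch: why one GPU is nowhere near enough for an LLM.
params = 175e9           # a GPT-3-sized model
tokens = 300e9           # roughly the reported GPT-3 training tokens
flops = 6 * params * tokens                  # ~3e23 floating-point operations

gpu_flops_per_sec = 100e12   # assumed ~100 TFLOP/s of sustained mixed-precision throughput
seconds = flops / gpu_flops_per_sec
print(f"{flops:.1e} FLOPs ≈ {seconds / 86400 / 365:.0f} GPU-years on one such GPU")
```

Hence the parallel-training and cloud points above: this work has to be spread across thousands of accelerators to finish in weeks rather than a century.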
Why LLMs Happened Now

We’ve walked through the three big ingredients:

- Data: Web-scale text + self-supervised learning → oceans of training material.
- Algorithms: The Transformer, with Self-Attention → architectures that scale.
- Compute: GPUs/TPUs + cloud infrastructure → the horsepower to train them.

Each piece alone would’ve been impressive. Put together, they sparked a step-change.
A decade ago, we had:

- Limited datasets (a few gigabytes at most).
- Algorithms (RNNs, LSTMs) that struggled with long sequences.
- GPUs that couldn’t realistically handle 100B+ parameter models.

Today, we have:

- Tens of terabytes of training data at our fingertips.
- Transformer architectures that scale beautifully.
- GPU/TPU clusters that can train trillion-parameter models.

No single breakthrough “invented” LLMs. It was the intersection of trends — data, algorithms, compute — that finally clicked into place.

That’s why LLMs feel like they appeared “all of a sudden.” The truth is, researchers were laying the groundwork for years. The moment the three factors aligned, the field exploded.

And that’s where we are now: riding the wave of models that are bigger, smarter, and more capable than anyone thought possible five years ago.

In the next post, I’ll dive into zero-shot, few-shot, prompting, and the rest of the story.