“If I can’t grep the source, how can I trust the program?”
— Every veteran coder, at least once
You’ve lived through assembler, C, Perl one-liners, Java app servers, and maybe even the JavaScript awakening. Now you open ChatGPT, ask a question, and a wall of well-formed prose appears. Where are the ifs, fors, and seg-faults you can chase with a debugger? This two-part essay is a guided tour from the deterministic world you know to the probabilistic engine under an LLM’s hood.
Part 1 – The Trees: What an LLM Is
1. Tokens & Tokenization
Token = the smallest unit the model “sees.”
Old-school analogy: bytes in a compiled binary.
In our toy English tokenizer, each word is a token.
Real tokenizers split on sub-words, so “beginning” becomes “begin” + “##ning”:
“In the beginning” → [“In”, “the”, “begin”, “##ning”] → [42, 17, 991, 2049]
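A toy sketch of that split in Python; the vocabulary and the IDs below are invented for illustration, and real tokenizers (BPE, WordPiece) are far more sophisticated:

# Toy sub-word tokenizer; vocabulary and IDs are made up.
vocab = {"In": 42, "the": 17, "begin": 991, "##ning": 2049}

def tokenize(text):
    ids = []
    for word in text.split():
        if word in vocab:                       # whole word is one token
            ids.append(vocab[word])
        else:                                   # crude split: longest known prefix + "##" remainder
            for cut in range(len(word), 0, -1):
                head, tail = word[:cut], "##" + word[cut:]
                if head in vocab and tail in vocab:
                    ids += [vocab[head], vocab[tail]]
                    break
    return ids

print(tokenize("In the beginning"))             # [42, 17, 991, 2049]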
2. Integer IDs → Embedding Vectors
The integer 42 itself is meaningless. The first layer of a transformer is an embedding table – a giant array that maps each token ID to a dense vector.
E[id] → [ 0.12, -0.07, …, 3.14 ] # a learned point in space
During training, vector values shift so that:
- Tokens used in similar contexts drift closer together (synonyms).
- Opposing words may point in opposite directions (antonyms).
- Rare/domain-specific tokens carve their own neighborhoods.
Think of the embedding space as a starfield – except with hundreds or thousands of dimensions instead of three. Training nudges the stars until textual gravity matches linguistic reality.
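A minimal NumPy sketch of the lookup and of “closeness” in that space. The table here is random, so the similarity score is meaningless until training has shaped it:

import numpy as np

np.random.seed(0)
vocab_size, dim = 2050, 8               # toy sizes; real models use 50k+ tokens and 1000+ dimensions
E = np.random.randn(vocab_size, dim)    # the embedding table: one learned row per token ID

ids = [42, 17, 991, 2049]               # output of the tokenizer above
vectors = E[ids]                        # plain array indexing: ID -> dense vector

def cosine(a, b):                       # 1.0 = same direction, -1.0 = opposite
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(E[991], E[2049]))          # after training, related tokens score high here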
3. Positional Embeddings
Unlike source code with line numbers, raw token IDs carry no order info. So the model adds a second vector for “I’m token #7 in the sentence.” Combine both vectors and the model now knows what the word is and where it sits.
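The mechanics are just addition; a sketch with a learned positional table (all sizes invented):

import numpy as np

np.random.seed(0)
vocab_size, seq_len, dim = 2050, 16, 8
tok_emb = np.random.randn(vocab_size, dim)   # "what the word is"
pos_emb = np.random.randn(seq_len, dim)      # "where it sits": one learned vector per position

ids = np.array([42, 17, 991, 2049])
x = tok_emb[ids] + pos_emb[:len(ids)]        # combined meaning + position, fed to the first block
print(x.shape)                               # (4, 8): one vector per token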
4. Self-Attention: Dynamic Pointer Arithmetic
Classic code uses fixed pointers: arr[i].
Self-attention invents pointers on the fly. For each token the model asks:
“Given my current meaning, which other tokens in this sentence should I look at, and how much?”
It builds Query, Key, and Value vectors and computes weighted sums. The weights (attention scores) are recalculated for every sentence.
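Here is the whole trick as a single-head NumPy sketch; the projection matrices are random stand-ins for learned weights:

import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

np.random.seed(0)
n_tokens, dim = 4, 8
x = np.random.randn(n_tokens, dim)             # token + position vectors from the previous step

Wq, Wk, Wv = (np.random.randn(dim, dim) for _ in range(3))
Q, K, V = x @ Wq, x @ Wk, x @ Wv               # every token builds its own Query, Key, Value

scores = softmax(Q @ K.T / np.sqrt(dim))       # "how much should token i look at token j?"
out = scores @ V                               # weighted sum of Values: pointers invented on the fly
print(scores.round(2))                         # each row sums to 1 and is rebuilt for every sentence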
5. Stack ’em High
A modern GPT stacks dozens of identical blocks:
[Attention → Feed-Forward → LayerNorm → Residual] × N
During training, gradients flow backward through this stack, nudging the weights so that next-token predictions become more accurate over time.
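In code, the stack really is just a loop. A pseudocode-level sketch (real GPTs interleave layer norms, use many attention heads, and vary the exact block order):

def transformer(x, blocks):
    # x: one vector per token; blocks: N identical (attention, feed-forward) pairs
    for block in blocks:
        x = x + block.attention(x)       # residual connection: add the update, don't replace x
        x = x + block.feed_forward(x)
    return x                             # final vectors are projected to scores over the vocabulary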
6. What the Model Learns
The model internalizes:
- Synonymy – tokens that can swap with minimal disruption.
- Antonymy – tokens that shift the sentence’s meaning or direction.
- Semantic similarity – clustering of topics or contexts.
- Multi-token logic – idioms, patterns, syntactic dependencies.
- Probability – all outcomes are ranked, not chosen absolutely.
In short, the embedding space encodes language “intuition” and the attention layers provide flexible, context-sensitive logic.
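That last bullet, probability, deserves one concrete sketch: the model outputs a score for every token in its vocabulary, softmax turns the scores into a distribution, and the next token is sampled from it. The vocabulary and logits below are invented:

import numpy as np

np.random.seed(0)
vocab = ["the", "cat", "sat", "mat"]            # toy vocabulary
logits = np.array([2.0, 0.5, 1.0, -1.0])        # invented raw scores from the final layer

probs = np.exp(logits) / np.exp(logits).sum()   # softmax: every outcome gets ranked
next_token = np.random.choice(vocab, p=probs)   # sampled, not chosen absolutely
print(dict(zip(vocab, probs.round(2))), "->", next_token)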
Part 2 – The Forest: Bridging the Gap for Our Veteran Coder
Meet Bill, who could optimize a linked-list cache while you were still mastering printf. Bill’s puzzle:
“I know programming. Why can’t I map any of it to this ‘AI’ thing?”
1. Paradigm Shift Table
Classical Programming | Large Language Model
Explicit rules (if, while) | Implicit patterns in weights
Deterministic runs | Stochastic sampling
Debugger = inspect stack | Debugger = inspect gradients or saliency maps
Compile-link-run | Train → sample
State in variables & structs | State in vectors & tensors
Control flow graph | Attention graph (rebuilt every prompt)
2. Relatable Analogies
- Embeddings = Symbol Tables – both map names to internal representations.
- Gradient Descent = Auto-Refactoring – the model rewrites its own weights, over and over, to make its predictions a little less wrong (a toy loop follows this list).
- Attention = Runtime Pointer Patch – like inserting a new jump instruction in hot loops based on context.
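The auto-refactoring analogy fits in a few lines: a toy single-weight gradient descent with a made-up target and learning rate:

w, target, lr = 0.0, 3.0, 0.1          # one weight, a made-up target, a small step size
for step in range(20):
    loss = (w - target) ** 2           # how wrong the current "code" is
    grad = 2 * (w - target)            # which direction reduces the error
    w -= lr * grad                     # the model rewrites itself, a little at a time
print(round(w, 3))                     # ~2.965: close to the target after 20 tiny rewrites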
3. Why Bill Feels Lost
- Opacity – No source to grep, no functions to trace.
- Non-determinism – Same input, different outputs.
- Data > Code – Performance comes from corpus size + GPU hours, not algorithm elegance.
- Scale – A single forward pass churns through billions of multiply-adds, the computational effort of many traditional systems combined.
4. Grounding Tips for Old-School Minds
- Try running a 2-layer transformer on a toy dataset in Colab.
- Visualize embedding space with PCA to see how similar words cluster (see the sketch after this list).
- Plot attention heads – see verbs attending to subjects, adjectives to nouns.
- Dust off your linear algebra and probability: they’re your new debugging tools.
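For the PCA tip, a sketch using scikit-learn and matplotlib. Random vectors stand in for real embeddings so the snippet runs anywhere; swap in vectors pulled from an actual model to see genuine clusters:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

words = ["cat", "dog", "car", "truck", "happy", "sad"]
np.random.seed(0)
vectors = np.random.randn(len(words), 300)        # stand-ins for 300-dim embeddings

xy = PCA(n_components=2).fit_transform(vectors)   # squash to 2D so we can plot it
plt.scatter(xy[:, 0], xy[:, 1])
for word, (px, py) in zip(words, xy):
    plt.annotate(word, (px, py))
plt.show()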
5. A Closing Thought for Bill
You haven’t fallen behind — the terrain has just changed. In the ’70s, we taught computers to follow rules. In the 2020s, we teach them to find their own. But deep down, it’s still math, still logic, still the thrill of taming complexity with structure.
“In 1970 we taught computers to follow rules.
In 2020 we taught them to invent the rules.
The syntax changed; the spirit of hacking remains.”