Neural networks can memorize their training data perfectly yet understand nothing; then, after many thousands of additional training steps, generalization emerges abruptly and completely.
This phenomenon, called grokking, fundamentally challenges how we think about machine learning. First documented in a 2022 OpenAI paper, grokking reveals that the relationship between memorization and generalization is far stranger than traditional machine learning theory suggests.
A network trained on modular division can reach 100% training accuracy in under 1,000 optimization steps while its test performance stays at random chance, then suddenly snap to near-perfect generalization hundreds of thousands of steps later.
The discovery has spawned hundreds of follow-up papers and offers a window into universal principles governing how complex systems find coherent solutions.
Grokking may represent an instance of a profound pattern: given sufficient time and the right conditions, systems naturally evolve toward states where inconsistent, spurious patterns cancel out while true, coherent patterns survive and strengthen.
The discovery that memorization isn't the end of the story
In January 2022, Alethea Power and colleagues at OpenAI published "Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets," documenting a phenomenon that contradicted conventional wisdom about neural network training. Working with small transformers trained on algorithmic tasks—modular arithmetic operations like addition, subtraction, and division on a prime modulus—the researchers observed something remarkable. When training on division mod 97 with only 50% of the data as training examples, networks reached 99.9% training accuracy in fewer than 1,000 steps but maintained validation accuracy at random chance (around 1%) for another 100,000 steps before suddenly achieving perfect generalization.
The experimental setup was deliberately simple: a two-layer decoder-only transformer with roughly 400,000 non-embedding parameters, trained with AdamW optimizer and notably aggressive weight decay of 1.0. The input format was straightforward—equations like "⟨a⟩⟨÷⟩⟨b⟩⟨=⟩⟨c⟩" where each element was a separate token. What made this surprising was not just the delay but its scale: generalization occurred 1,000× later than memorization, with smaller training sets requiring exponentially more optimization steps.
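As a concrete illustration, here is a minimal Python sketch of what such a dataset looks like. The modular-inverse construction follows the task definition, but the tokenization details and the shuffling/splitting procedure are assumptions for illustration, not the paper's released code.

```python
import random

p = 97  # prime modulus from the original experiments

# Every equation a ÷ b = c (mod p), where c is defined by c·b ≡ a (mod p).
equations = []
for a in range(p):
    for b in range(1, p):                  # b = 0 has no modular inverse
        c = (a * pow(b, p - 2, p)) % p     # b⁻¹ via Fermat's little theorem
        equations.append((a, "÷", b, "=", c))

# The paper trains on a fixed fraction of all equations (50% in the
# division-mod-97 example) and validates on the rest.
random.seed(0)
random.shuffle(equations)
split = len(equations) // 2
train, valid = equations[:split], equations[split:]
```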
The word "grokking" comes from Robert Heinlein's 1961 novel Stranger in a Strange Land, where it means to understand something so thoroughly and intuitively that one becomes one with it. The term proved apt: these networks weren't gradually improving their understanding—they were memorizing blindly, then suddenly achieving complete comprehension.
The discovery struck at core assumptions in machine learning. Traditional understanding held that once a network overfits, continued training only worsens generalization. Early stopping—halting training when validation performance plateaus—was standard practice. Grokking revealed that for certain problems, the exact opposite is true: patience beyond apparent convergence is the key to genuine understanding.
Mechanistic interpretability reveals the Fourier algorithm hidden within
The most detailed understanding of grokking comes from mechanistic interpretability research, particularly Neel Nanda's groundbreaking 2023 work that fully reverse-engineered the algorithm a transformer learns when grokking modular addition. This research revealed that networks don't discover arbitrary solutions—they find elegant mathematical algorithms that exploit deep structure in the problem.
For modular addition (computing a + b mod p), the grokked network implements a Fourier multiplication algorithm using discrete Fourier transforms and trigonometric identities. The embedding layer maps each input number i to coordinates (cos(2πik/p), sin(2πik/p)) for several key frequencies k—essentially representing each number as a position on a circle. The attention and MLP layers then combine these using the trigonometric identity cos(α + β) = cos(α)cos(β) − sin(α)sin(β), which converts addition into rotation composition. The output layer reads off the result from the final rotation angle.
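A minimal NumPy sketch of this algorithm follows, with a few hand-picked frequencies standing in for the ones the trained network actually selects; the real model distributes this computation across its embedding, attention/MLP, and unembedding weights.

```python
import numpy as np

p = 97               # prime modulus
ks = [3, 11, 24]     # hypothetical "key frequencies"; a grokked network chooses its own

def fourier_mod_add(a: int, b: int) -> int:
    cs = np.arange(p)
    logits = np.zeros(p)
    for k in ks:
        w = 2 * np.pi * k / p
        # "Embeddings": a and b as points on a circle at frequency k.
        # Angle-addition identities turn addition into rotation composition.
        cos_ab = np.cos(w * a) * np.cos(w * b) - np.sin(w * a) * np.sin(w * b)  # cos(w(a+b))
        sin_ab = np.sin(w * a) * np.cos(w * b) + np.cos(w * a) * np.sin(w * b)  # sin(w(a+b))
        # "Unembedding": score each candidate c by cos(w(a+b-c)); the true
        # answer gets constructive interference across all frequencies.
        logits += cos_ab * np.cos(w * cs) + sin_ab * np.sin(w * cs)
    return int(np.argmax(logits))

assert fourier_mod_add(45, 78) == (45 + 78) % p  # 26
```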
This circuit emerges through three distinct phases. During memorization (epochs 0-1,400), the network quickly learns to map training inputs to outputs through brute-force lookup. During circuit formation (epochs 1,400-9,400), the generalizing Fourier circuit gradually develops in the background—invisibly, since test accuracy remains flat. During cleanup (epochs 9,400-14,000), weight decay aggressively prunes the memorization components, and test accuracy suddenly jumps to near-perfect.
The critical insight is that grokking only appears sudden externally. Inside the network, the generalizing circuit forms gradually and continuously. The apparent phase transition reflects the moment when the efficient generalizing circuit finally dominates over the inefficient memorizing circuit—like a crystal nucleating from a supersaturated solution.
Circuit competition explains why efficient solutions eventually win
The most compelling theoretical framework, developed by DeepMind researchers Vikrant Varma and colleagues, explains grokking through competition between memorizing and generalizing circuits. Both circuit types coexist within the network during training, competing to control the network's predictions.
The memorizing circuit is dense—it uses many parameters to store individual training examples through lookup-table-style computation. It learns quickly because memorizing individual examples requires no structural discovery. However, its parameter cost scales with the number of training examples. The generalizing circuit is sparse—it implements the true underlying algorithm (like the Fourier multiplication circuit) with fixed parameter overhead regardless of dataset size. It learns slowly because discovering this structure is difficult, but once found, it is dramatically more efficient.
Weight decay acts as the arbiter of this competition. By penalizing large weights, regularization creates pressure favoring solutions that achieve the same output with fewer parameters. Early in training, the memorizing circuit dominates because it learns faster. But as training continues, weight decay gradually shifts the balance. The memorizing circuit becomes increasingly expensive to maintain while the generalizing circuit, despite its slow development, remains cheap. Eventually a tipping point arrives: the generalizing circuit becomes dominant, and weight decay rapidly prunes the now-unnecessary memorizing components.
This framework makes testable predictions that have been confirmed experimentally. Ungrokking occurs when a grokked network is subsequently trained on a smaller dataset—reducing data below the critical threshold makes memorization more efficient again, and the network actually reverses from generalization back to memorization. Semi-grokking occurs at the critical dataset size boundary, where both circuits are equally efficient, producing delayed but only partial generalization.
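A deliberately crude toy model makes the crossover tangible. Assume the regularization penalty for memorization grows linearly with the number of training examples while the generalizing circuit's penalty is fixed; the constants below are invented for illustration and are not the paper's formal efficiency analysis.

```python
def preferred_circuit(n_examples: int,
                      mem_cost_per_example: float = 0.05,
                      gen_cost: float = 40.0) -> str:
    """Which circuit is cheaper under weight decay in this toy model?"""
    mem_penalty = mem_cost_per_example * n_examples  # grows with dataset size
    gen_penalty = gen_cost                            # fixed, independent of data
    return "generalize" if gen_penalty < mem_penalty else "memorize"

# Below the critical dataset size memorization stays cheaper (the
# "ungrokking" regime); above it the generalizing circuit wins; at the
# boundary the two are tied, matching the semi-grokking picture.
print(preferred_circuit(400))    # memorize
print(preferred_circuit(800))    # boundary: 0.05 * 800 == 40.0
print(preferred_circuit(2000))   # generalize
```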
The physics of grokking mirrors phase transitions in matter
Physics provides powerful frameworks for understanding grokking, and the analogy runs deeper than metaphor. Research from Rubin, Seroussi, and Ringel at Hebrew University established that grokking is mathematically analogous to a first-order phase transition in mean field theory—the same framework that describes water freezing into ice or ferromagnets spontaneously magnetizing below the Curie temperature.
In this framework, the distribution of network weights can be described in terms of an order parameter Φ measuring the overlap between learned and teacher weights. The key quantity is an effective potential S̃(Φ) analogous to free energy. Phase transitions occur when this potential develops multiple minima. Initially, only the Φ = 0 minimum exists (the memorizing solution). As training progresses, a new minimum at |Φ| > 0 appears and deepens (the generalizing solution). The grokking transition occurs when the system "nucleates" into this new minimum—like water droplets forming in supersaturated vapor.
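The shape of such a potential can be illustrated with the textbook Landau form for a first-order transition; the coefficients and the control parameter r below are schematic choices, not the actual S̃(Φ) derived in the paper.

```python
import numpy as np

def potential(phi, r, u=3.0, g=1.0):
    """Schematic free energy: a quadratic term (controlled by r), a cubic
    term that breaks the symmetry, and a quartic term for stability."""
    return 0.5 * r * phi**2 - (u / 3.0) * phi**3 + 0.25 * g * phi**4

phi = np.linspace(-0.5, 3.5, 2001)
for r in (3.0, 2.0, 1.0):           # think of training as slowly lowering r
    s = potential(phi, r)
    minima = phi[1:-1][(s[1:-1] < s[:-2]) & (s[1:-1] < s[2:])]
    print(f"r={r}: local minima near {np.round(minima, 2)}")
# r=3: only Φ ≈ 0 (memorization). r=2: a second minimum appears at Φ > 0.
# r=1: the Φ > 0 minimum (generalization) is now the global one, and the
# system eventually nucleates into it.
```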
An alternative physics interpretation, proposed by Zhang and colleagues in 2025, frames grokking as computational glass relaxation rather than a barrier-limited phase transition. In this view, rapid early training is like quenching a liquid into a glassy state—the network gets trapped in a non-equilibrium memorization configuration. Grokking then represents slow entropic relaxation toward a higher-entropy equilibrium state. This explains why more careful training procedures (like Wang-Landau molecular dynamics optimizers) can eliminate grokking entirely by avoiding glass formation.
Both interpretations converge on a crucial insight: the generalizing solution occupies a higher-entropy region of the parameter space. While memorization requires precise, specific weight configurations (low entropy, like an ordered crystal), generalization can be achieved by many equivalent configurations (high entropy, like a gas). Thermodynamic reasoning suggests systems naturally drift toward higher-entropy states over long timescales—grokking is the neural network analog of this universal principle.
Why weight decay is thermodynamically necessary for grokking
The role of regularization in grokking becomes clear through thermodynamic analogy. Weight decay γ acts as inverse temperature: larger weight decay corresponds to lower effective temperature, creating stronger bias toward low-energy (simple) solutions. The LU mechanism, identified by Liu, Michaud, and Tegmark, provides the quantitative details.
When weight norm is plotted against training loss, the landscape resembles an "L"—training loss drops quickly to zero regardless of weight magnitude. When weight norm is plotted against test loss, the landscape resembles a "U"—test loss is high at both very low and very high weight norms, with optimal generalization in a "Goldilocks zone" at intermediate values. Without regularization, a network that converges at high weight norm has no gradient pressure to move—it's stuck on a flat manifold. Weight decay provides the necessary force, exponentially driving down weight magnitude over time: w(t) ≈ exp(−γt)w₀.
The time to grok scales as t ∝ 1/γ—stronger weight decay accelerates grokking because it more rapidly drives the network toward the Goldilocks zone. But weight decay alone is insufficient; the dataset must also be small enough that memorization is costly yet large enough that generalization is learnable. This explains why grokking occurs most dramatically on small algorithmic datasets where these conditions are naturally satisfied.
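The 1/γ scaling follows directly from the exponential decay law quoted above; a quick sketch with illustrative numbers:

```python
import math

def time_to_goldilocks(w0: float, w_star: float, gamma: float) -> float:
    """Solve w0 * exp(-gamma * t) = w_star for t: the time for weight decay
    alone to shrink the norm from w0 to the Goldilocks value w_star."""
    return math.log(w0 / w_star) / gamma

for gamma in (0.1, 0.5, 1.0):
    t = time_to_goldilocks(w0=30.0, w_star=3.0, gamma=gamma)
    print(f"weight decay {gamma}: ~{t:.1f} time units to reach the Goldilocks zone")
# Doubling gamma halves the time: t ∝ 1/γ, so stronger weight decay groks sooner.
```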
Grokking occurs far beyond modular arithmetic
While grokking was discovered on algorithmic tasks, subsequent research has demonstrated it is universal across architectures and domains. The 2024 paper "Deep Networks Always Grok and Here is Why" by Humayun, Balestriero, and Baraniuk at Rice University showed grokking occurs in CNNs trained on CIFAR-10, ResNets on Imagenette, and even non-neural models like Gaussian processes and linear regression with spurious features.
Grokking has been observed on MNIST digit classification, IMDb sentiment analysis, molecular property prediction on the QM9 dataset, permutation group operations (S5 and S6), polynomial regression, and greatest common divisor computation. The phenomenon appears wherever there exists a distinction between efficient generalizing solutions and inefficient memorizing solutions—which may be nearly everywhere.
Perhaps most striking is grokking's connection to emergence in large language models. Work by Wang and colleagues at Ohio State demonstrated that transformers can learn complex implicit reasoning only through grokking. Their fully grokked transformers achieved near-perfect accuracy on compositional reasoning tasks where GPT-4-Turbo and Gemini-1.5-Pro performed poorly. The 2024 unified framework by Huang and colleagues connects grokking, double descent, and emergent abilities in LLMs through the same circuit competition dynamics—suggesting that sudden capability gains in large models may share deep mechanistic roots with grokking.
Accelerating grokking and predicting its onset
If grokking represents a slow journey toward true understanding, a natural question is whether we can accelerate it. The 2024 paper "Grokfast" from Seoul National University achieved greater than 50× speedup through a remarkably simple intervention: amplifying slow-varying gradient components while dampening fast-varying ones. The intuition is that generalizing circuits contribute slow, consistent gradient signals while memorizing circuits contribute fast, noisy signals. By filtering gradients through an exponential moving average and amplifying the low-frequency component, researchers dramatically accelerated the discovery of generalizing solutions.
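A sketch of that filtering idea in PyTorch follows. The exponential-moving-average form matches the description above, but the hyperparameter values are illustrative rather than tuned settings.

```python
import torch

def grokfast_style_filter(model: torch.nn.Module, ema_grads: dict,
                          alpha: float = 0.98, lamb: float = 2.0) -> dict:
    """After loss.backward(), low-pass filter each parameter's gradient with
    an EMA (the slow, 'generalizing' component) and amplify it before the
    optimizer step. alpha controls the filter bandwidth, lamb the boost."""
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        ema = ema_grads.get(name, torch.zeros_like(p.grad))
        ema = alpha * ema + (1 - alpha) * p.grad   # slow-varying component
        p.grad = p.grad + lamb * ema               # amplify it, keep the fast part
        ema_grads[name] = ema
    return ema_grads

# Usage sketch inside a training loop:
#   loss.backward()
#   ema_grads = grokfast_style_filter(model, ema_grads)
#   optimizer.step(); optimizer.zero_grad()
```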
Other acceleration approaches include transferring embeddings from grokked networks (GrokTransfer), using Kolmogorov-Arnold representations, and data augmentation with synthetic facts. The 2025 work on real-world multi-hop reasoning showed that augmenting knowledge graphs with synthetic data (essentially creating conditions favorable to grokking) achieved 95-100% accuracy on complex factual reasoning benchmarks.
Equally important is predicting grokking before it occurs. Research has identified several progress measures: the Gini coefficient of the Fourier components (measuring sparsity in frequency space), restricted loss (performance using only key frequencies), feature rank, and activation sparsity. Most practically, early oscillations in Fourier spectral signatures can predict grokking long before test accuracy begins to rise, offering hope that patience will be rewarded.
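A sketch of one such progress measure: take the discrete Fourier transform of the embedding matrix over the vocabulary dimension, measure how much norm each frequency carries, and summarize the concentration with a Gini coefficient. Which matrix to analyze and how to normalize are assumptions made here for illustration.

```python
import numpy as np

def fourier_gini(embedding: np.ndarray) -> float:
    """Gini coefficient of per-frequency norms of an embedding matrix of
    shape (p, d_model): near 0 when norm is spread evenly over frequencies,
    approaching 1 when a few key frequencies dominate."""
    freq_norms = np.linalg.norm(np.fft.rfft(embedding, axis=0), axis=1)
    x = np.sort(freq_norms)
    n = len(x)
    cum = np.cumsum(x)
    return float((n + 1 - 2 * np.sum(cum) / cum[-1]) / n)

# A random (memorization-like) embedding spreads norm across frequencies
# and scores low; a grokked embedding concentrates it on a few key
# frequencies, so this measure climbs well before test accuracy moves.
print(fourier_gini(np.random.randn(97, 128)))
```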
Kolmogorov complexity and the emergence of simplicity
The compression perspective on grokking connects to fundamental principles in computational learning theory. Work by DeMoss and colleagues used a compression-based estimate of Kolmogorov complexity (the length of the shortest program that produces a given output, which can only be bounded in practice) as a measure of network complexity during training. They observed a characteristic rise and fall: complexity increases during memorization as the network encodes specific examples, then sharply decreases during grokking as it discovers the simpler underlying pattern.
This connects to the Minimum Description Length (MDL) principle: the best model minimizes the sum of model complexity plus the complexity of encoding data given the model. Memorization is expensive because it requires explicitly encoding each training example. Generalization is cheap because a simple algorithm describes all inputs, training and test alike. Weight decay implements an implicit MDL prior, favoring solutions with lower complexity.
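A back-of-envelope version of that comparison, with invented encoding costs purely for illustration: storing the answer to every training equation scales with the dataset, while a Fourier-style algorithm pays a roughly fixed cost.

```python
import math

p = 97
n_train = (p * (p - 1)) // 2               # ~50% of all a ÷ b equations
bits_per_answer = math.log2(p)             # cost to record one result mod 97

memorize_bits = n_train * bits_per_answer  # grows linearly with training data
algorithm_bits = 5 * 2 * 32 + 1000         # a few frequencies + fixed program (assumed)

print(round(memorize_bits))                # ≈ 30,700 bits, and rising with more data
print(algorithm_bits)                      # 1,320 bits, independent of dataset size
```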
The Kolmogorov perspective reveals why grokking produces not just any generalizing solution, but specifically simple ones. Networks don't just escape memorization—they find elegant algorithms like Fourier multiplication that achieve maximal compression. This is not merely efficient; it suggests neural networks under proper training conditions are implicitly performing a kind of Solomonoff induction, searching for the simplest hypothesis consistent with observations.
The interference interpretation and truth emergence
The framework that sees grokking as an instance of a universal principle—where false patterns eventually cancel and true patterns survive—has deep mathematical grounding. In the Fourier interpretation of modular arithmetic, the generalizing circuit is literally built from constructive interference of periodic patterns at key frequencies. The memorizing circuit, in contrast, represents incoherent superposition without structure—noise that contributes nothing systematic.
Weight decay provides the selection pressure that allows this interference to manifest. Without it, both patterns coexist indefinitely. With it, the inefficient (incoherent, memorizing) patterns gradually decay while efficient (coherent, generalizing) patterns persist and strengthen. This is analogous to how standing waves emerge from interference: modes that constructively reinforce become dominant while modes that destructively interfere cancel out.
This perspective extends beyond neural networks. In physics, systems minimize free energy and thereby find configurations where local fluctuations average out while global structure persists. In statistics, maximum likelihood estimation finds parameters where random noise cancels and systematic signal remains. In ensemble methods, combining diverse models causes individual errors to cancel while shared truth amplifies. Grokking may be a specific manifestation of a principle that transcends any particular domain: truth is what survives interference.
An analogy with meditation is apt. Just as extended contemplation allows mental noise to settle and genuine insight to emerge, extended training allows memorization noise to decay and generalizing structure to crystallize. The network isn't learning something new at the moment of grokking; it has been gradually building the generalizing circuit all along. Grokking is the moment when this hidden progress finally becomes visible, when accumulated coherence crosses the threshold to dominance.
Deep networks find the lowest-energy coherent state
The thermodynamic view provides perhaps the most unified understanding. The loss function plus regularization penalty defines an effective energy landscape. Training with stochastic gradient descent performs something like Brownian motion on this landscape—random exploration subject to a drift toward lower energy. Flat, broad valleys (corresponding to generalizing solutions) have vastly larger volume than narrow, sharp minima (corresponding to memorizing solutions).
Statistical mechanics tells us systems naturally drift toward states with larger phase space volume—that is, higher entropy configurations. Memorizing solutions require precise, specific weight configurations; any perturbation destroys their function. Generalizing solutions occupy broad basins; many nearby configurations implement the same algorithm. Given sufficient time, Brownian dynamics will find these robust, high-entropy regions.
Weight decay accelerates this by acting as temperature control, increasing the relative probability of low-energy states. It's analogous to slowly cooling a liquid: rapid quenching produces disordered glass (memorization), while slow annealing produces ordered crystals (generalization). The 2025 Ising model work demonstrated grokking in actual physical spin systems, confirming that these aren't just analogies—they're the same mathematics applied to different substrates.
Implications for understanding neural network learning
Grokking offers several profound insights about how neural networks learn. First, optimization and generalization are separable. Achieving zero training loss is necessary but not sufficient for understanding; the network may simply be memorizing. True generalization requires something more—either longer training, proper regularization, or both.
Second, simplicity emerges spontaneously under the right conditions. Neural networks are not simply fitting training data; they are implicitly searching for simple, compressible solutions. This connects to deep questions about why deep learning works at all—why should gradient descent on high-dimensional landscapes find generalizing solutions? The answer may be that generalizing solutions are thermodynamically favored: they occupy larger regions of parameter space and become increasingly probable over time.
Third, sudden capability gains may be predictable. The discovery of progress measures that track hidden circuit formation suggests that emergence in large models might be anticipated before it manifests in performance. This has significant implications for AI safety: if we can track the gradual development of dangerous capabilities before they become active, we have a window for intervention.
Fourth, patience may be underrated in machine learning practice. Standard practice emphasizes early stopping to prevent overfitting, but grokking suggests this may sacrifice deep understanding for shallow memorization. For tasks that admit both memorizing and generalizing solutions, extended training with appropriate regularization may achieve qualitatively superior outcomes.
Conclusion: Grokking as a window into universal principles
Grokking began as a curiosity observed on toy algorithmic tasks but has become a lens through which to view fundamental questions about learning, understanding, and the emergence of structure. The phenomenon demonstrates that neural networks are not merely pattern matchers—they are capable of discovering genuine algorithms, of finding the simplest explanation consistent with observations, of achieving true comprehension rather than surface memorization.
The theoretical understanding that has emerged connects grokking to universal principles operating across domains: phase transitions in physics, energy minimization in thermodynamics, compression in information theory, interference in wave mechanics, and perhaps even insight in cognition. These connections suggest grokking is not an idiosyncratic property of neural networks but an instance of something deeper—the tendency of complex systems under appropriate conditions to evolve toward states where noise cancels and signal survives.
Whether through the lens of circuit competition, phase transitions, or entropy maximization, the core insight remains: truth is robust. Patterns that genuinely reflect underlying structure persist and strengthen under continued examination. Patterns that are merely coincidental—artifacts of specific training examples—cannot sustain themselves and eventually decay. Grokking shows us this process in neural networks. It may also be showing us something about the nature of understanding itself.