Your model is burning GPU cycles trying to remember that "The White House" is the U.S. President's residence. It shouldn't have to.
Picture your senior backend engineer spending an entire sprint refactoring a hot-path API endpoint — not because the business logic was wrong, but because the system kept re-deriving the same database schema on every request instead of caching it. Nobody writes that code on purpose. It just accumulates. And yet, this is exactly what every modern LLM does, at massive scale, on every single forward pass.
Standard Transformer models are forced to reconstruct static facts from scratch every time they are invoked. "Paris is the capital of France." "Diana, Princess of Wales." "The Big Apple is a nickname for New York City." Every one of these trivially-rote completions burns the same expensive attention and feed-forward computation as a nuanced multi-step reasoning task. The model has no O(1) cache. It has no lookup table. It just... computes.
DeepSeek's Engram module fixes this architectural blind spot by introducing something embarrassingly simple in concept: a massive, offloaded n-gram embedding table that the model can query in a single hash-indexed lookup, combined with a context-aware gate that decides whether to actually use what was retrieved. The result is a model that separates Memorization from Reasoning at the architecture level — and pays for the new lookup hardware by removing an equivalent amount of brute-force computation elsewhere.
This article dissects Engram from first principles: the intuition, the math, the full architecture, a numerical walkthrough, a PyTorch implementation, and a frank look at when this is brilliant and when it's a headache.
Table of Contents
- The Core Problem: LLMs Are Terrible at Remembering
- The Mental Model: Open-Book vs. Closed-Book
- The Math, Built From Scratch
- The Full Architecture
- Manual Walkthrough: Tracing "The White House"
- PyTorch Implementation
- Where Engram Lives Inside the Transformer
- The Speed Paradox: Why Isn't This Slow?
- The Numbers: Does It Actually Work?
- Trade-offs and When Not to Use It
- Conclusion
The Core Problem
LLMs are built on the Transformer architecture. At its core, a Transformer is a deep reasoning machine: it runs an input sequence through dozens of self-attention layers, each of which learns to relate tokens to each other and build richer contextual representations.
The problem is that we also ask this reasoning machine to act as a static fact warehouse.
When a language model is trained, it absorbs both:
- Dynamic knowledge — how to reason, infer, plan, code, translate.
- Static facts — "Mount Everest is 8,849 meters tall," "HTTP 404 means Not Found," "Diana, Princess of... Wales."
Both types of knowledge are compressed into the same billions of floating-point weights. And when the model needs to retrieve a static fact, it doesn't do a dictionary lookup — it reconstructs the fact by running the input through the full forward pass, layer by layer, across dozens of attention heads.
This is like hiring a chess grandmaster to tell you what time zone New York is in. The grandmaster can do it. But it wastes their capacity, and it forces you to pay for a grandmaster when a wall clock would suffice.
The consequences compound at scale:
- To store more facts, you need more parameters — more memory, more compute.
- More parameters mean slower inference, higher costs.
- The model still hallucinates, because facts are entangled with reasoning in the same weights.
The U-shaped Scaling Law identified in the Engram paper makes this concrete: as models scale, an increasing fraction of their capacity is devoted to rote memorization of static associations — a fundamentally inefficient use of matrix multiplications.
The Mental Model: Open-Book vs. Closed-Book
The cleanest analogy comes from exam design.
The Standard LLM is a closed-book student. Before the exam, this student memorized every textbook, every encyclopedia, every Wikipedia page. When asked "Who was the first Roman Emperor?", they close their eyes, mentally simulate the collapse of the Roman Republic, reconstruct the political timeline, and arrive at "Augustus." It works. But it is exhausting and expensive — the student burned the same cognitive effort reconstructing a static fact as they would solving a logic proof.
Engram gives the model a cheat sheet. Alongside the exam, this student has a magic index card system. When they read the phrase "first Roman Emperor," they don't simulate history — they glance at the card labeled "first Roman Emperor," read "Augustus," and write it down. Their reasoning brainpower is now available for the hard essay questions.
In Engram's language:
- The cheat sheet is a massive embedding table, indexed by hashed n-grams.
- The "is this card relevant?" check is a context-aware gate that compares the retrieved memory against the model's current hidden state.
- The free brainpower goes back into deeper attention layers for actual reasoning.
The critical insight is that this isn't just a retrieval system bolted onto the outside of a model — it's an architectural component that replaces a portion of the brute-force parameters used for memorization, keeping the total compute budget (FLOPs) roughly constant.
The Math, Built From Scratch
Phase 1 — The Base Case (1-gram Lookup)
Start with the simplest possible case. The model is at position $t$, looking at a single token — say, "Apple" with token ID 105.
We want a vector that represents the static concept of "Apple." The most naive approach: build a lookup table of size vocab_size and index directly into it.
$$ e_t = E[x_t] $$
Where $E$ is an embedding matrix and $x_t$ is the token ID. This is just a standard embedding lookup — $O(1)$, no math beyond a table read.
The problem: If our vocabulary is 100,000 tokens, the table is manageable. But what if we want to remember phrases — "Big Apple," "Apple Watch," "Apple Cider Vinegar"? Each combination is a distinct concept. We can't enumerate all possible phrases.
The solution: hashing. We use a hash function to map a phrase to a fixed-size index space of $M$ slots.
$$ \text{index} = \text{Hash}(x_{t-n+1}, \ldots, x_t) \bmod M $$
The table size $M$ is fixed (e.g., 10 million slots). The hash function is deterministic and extremely fast — just integer arithmetic.
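As a minimal sketch of the hashed index above, the snippet below maps a tuple of token IDs to one of $M$ slots with a polynomial rolling hash. The function name `ngram_slot`, the base 10007, and the token IDs are illustrative assumptions, not the hash used in the paper.

```python
def ngram_slot(token_ids, num_slots, base=10007):
    """Map an n-gram (tuple of token IDs) to a slot index in [0, num_slots)."""
    h = 0
    for tok in token_ids:
        # Horner's rule keeps the hash order-sensitive and bounded by num_slots.
        h = (h * base + tok) % num_slots
    return h


M = 10_000_000               # fixed table size
big_apple = (4521, 105)      # hypothetical IDs for "Big", "Apple"
apple_big = (105, 4521)      # same tokens, reversed order

slot = ngram_slot(big_apple, M)
print(slot, ngram_slot(apple_big, M))
```

The same phrase always lands in the same slot, and the cost is a handful of integer operations regardless of how large the table is.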
Phase 2 — Handling Phrases and Collisions (N-gram + Multi-Head Hashing)
N-grams generalize the single-token lookup to phrases of length $n$.
At position $t$, instead of looking at just $x_t$, we look at the last $n$ tokens: $(x_{t-n+1}, \ldots, x_t)$. For a trigram ($n=3$), at the word "Fox" in the sequence [The, Quick, Brown, Fox], the context window is [Quick, Brown, Fox].
The phrase gets hashed to a single integer, which indexes into the embedding table:
$$\text{index} = \varphi_n(x_{t-n+1}, \ldots, x_t) \bmod M$$
The collision problem emerges immediately. We are mapping an effectively infinite space (all possible phrases) into a fixed table of $M$ slots. "Quick Brown Fox" might hash to the same slot as "Lazy Brown Dog." The model would retrieve a vector that conflates both phrases — noise, not signal.
Multi-head hashing solves this elegantly. Instead of one hash function, we use $K$ different hash functions — each one a different linear combination of token IDs with different prime multipliers. It is statistically very unlikely that two different phrases will collide under all $K$ hash functions simultaneously.
- Head 1: $h_1 = \varphi_{n,1}(\text{phrase}) \bmod M$ → retrieve vector $e_{1}$
- Head 2: $h_2 = \varphi_{n,2}(\text{phrase}) \bmod M$ → retrieve vector $e_{2}$
The retrieved vectors are concatenated: raw_memory = [e_1 || e_2]. Multi-gram orders ($n=2$ and $n=3$) are similarly concatenated, giving the model both bigram and trigram context simultaneously.
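The collision argument can be checked empirically. The sketch below counts pairs of distinct bigrams that share a slot under every hash head, once with one head and once with two. The prime bases, the deliberately tiny table (1,000 slots, so that collisions are easy to see), and the random token IDs are all assumptions made for illustration.

```python
import random

PRIMES = (10007, 10009, 10037, 10039)  # one illustrative base per hash head


def head_slots(ngram, num_slots, num_heads):
    """Return one slot index per hash head for the same n-gram."""
    slots = []
    for k in range(num_heads):
        h = 0
        for tok in ngram:
            h = (h * PRIMES[k] + tok) % num_slots
        slots.append(h)
    return tuple(slots)


def count_collisions(ngrams, num_slots, num_heads):
    """Count pairs of distinct n-grams that share a slot under ALL heads."""
    seen = {}
    collisions = 0
    for g in ngrams:
        key = head_slots(g, num_slots, num_heads)
        collisions += seen.get(key, 0)   # every earlier n-gram with this key
        seen[key] = seen.get(key, 0) + 1
    return collisions


rng = random.Random(0)
ngrams = list({(rng.randrange(32000), rng.randrange(32000)) for _ in range(2000)})
one_head = count_collisions(ngrams, num_slots=1000, num_heads=1)
two_heads = count_collisions(ngrams, num_slots=1000, num_heads=2)
print("full collisions with 1 head:", one_head)
print("full collisions with 2 heads:", two_heads)
```

A pair that collides under both heads must also collide under the first, so the two-head count can never exceed the one-head count. In practice it is far smaller.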
Phase 3 — Formal Equations
With the components in place, here is the full formal pipeline for position $t$.
Step 1 — Retrieval:
For n-gram order $n$ and hash head $k$:
$$ z_{t,n,k} = \varphi_{n,k}(x_{t-n+1}, \dots, x_t) $$
$$ \mathbf{e}_{t,n,k} = \mathbf{E}_{n,k}[z_{t,n,k}] $$
All retrieved vectors across all orders and heads are concatenated to form $\mathbf{e}_t$.
Step 2 — Projection:
The raw memory vector is projected into the model's Key and Value spaces via learned weight matrices:
$$ k_t = W_k \cdot e_t \qquad v_t = W_v \cdot e_t $$
Step 3 — Context-Aware Gating:
This is the critical "is this memory actually useful right now?" check. The model's current hidden state $h_t$ (a running representation of everything it has processed so far) acts as the Query. The projected key $k_t$ from the retrieved memory acts as the Key. Their similarity determines the gate value:
$$ \alpha_t = \sigma\!\left(\frac{\text{RMSNorm}(h_t)^\top\, \text{RMSNorm}(k_t)}{\sqrt{d}}\right) $$
- $\alpha_t$ is a scalar between 0 and 1.
- $\sigma$ is the Sigmoid function.
- $\sqrt{d}$ is a scaling factor to prevent the dot product from exploding in large-dimension spaces.
- RMSNorm normalizes both vectors to unit scale before comparison for numerical stability.
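To make the gate equation concrete, here is a small plain-Python sketch. It assumes an RMSNorm without a learned gain and uses made-up 2-D vectors; `rms_norm` and `gate` are hypothetical helper names.

```python
import math


def rms_norm(x, eps=1e-6):
    """Scale x so its root-mean-square is about 1 (no learned gain, for simplicity)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def gate(h, k):
    """alpha = sigmoid( RMSNorm(h) . RMSNorm(k) / sqrt(d) )."""
    d = len(h)
    score = sum(a * b for a, b in zip(rms_norm(h), rms_norm(k))) / math.sqrt(d)
    return 1.0 / (1.0 + math.exp(-score))


aligned = gate([0.8, 0.2], [0.9, 0.1])      # context and key point the same way
opposed = gate([0.8, 0.2], [-0.9, -0.1])    # key points the opposite way
print(round(aligned, 3), round(opposed, 3))
```

An aligned key pushes the gate above 0.5 and an opposing key pushes it below. After normalization the score is bounded by $\pm\sqrt{d}$, so the gate saturates more strongly as the model dimension grows.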
Step 4 — Gated Output:
$$ \tilde{v}_t = \alpha_t \cdot v_t $$
The memory value is scaled down (or eliminated entirely) based on contextual relevance.
Step 5 — Convolutional Refinement and Residual Addition:
$$ Y = \text{SiLU}(\text{Conv1D}(\text{RMSNorm}(\tilde{v}_t))) + \tilde{v}_t $$
A short depthwise 1D convolution with kernel size 4 (as specified in the paper) smooths the gated memories across neighboring positions in the sequence, capturing local temporal patterns. The SiLU activation introduces non-linearity. The result is added back to the residual stream:
$$ H_t \leftarrow H_t + Y $$
The convolution matters more than it appears. Without it, each position's memory is retrieved and gated entirely independently. The Conv1D allows the module to leverage the continuity of the sequence — if the surrounding positions also retrieved strong memories, that signal bleeds in constructively.
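The effect of the convolution is easiest to see on a single channel. The following sketch applies a causal kernel-size-4 convolution, SiLU, and the residual add to a toy sequence. The uniform kernel weights are an assumption, and the RMSNorm from the equation is left out because it acts across channels.

```python
import math


def silu(x):
    return x / (1.0 + math.exp(-x))


def refine(gated, kernel):
    """Y_t = SiLU(causal_conv(gated)_t) + gated_t for one channel.

    `gated` is the sequence of gated memory values for a single channel.
    `kernel` holds the conv weights; the last entry multiplies the current position.
    """
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(gated)   # left padding keeps the conv causal
    out = []
    for t in range(len(gated)):
        window = padded[t:t + k]             # positions t-k+1 .. t
        conv = sum(w * v for w, v in zip(kernel, window))
        out.append(silu(conv) + gated[t])
    return out


gated = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]       # one strong memory at t=2
y = refine(gated, kernel=[0.25, 0.25, 0.25, 0.25])
print([round(v, 3) for v in y])
```

Positions before the strong memory stay at zero, while the three positions after it each receive a small share of its signal. That is the "bleed" described above, and it only flows forward in time.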
Key variable glossary:
| Symbol | Meaning |
|---|---|
| $t$ | Current position in the token sequence |
| $n$ | N-gram order (e.g., $n=3$ = trigram) |
| $x_t$ | Token ID at position $t$ |
| $e_t$ | Retrieved static memory embedding |
| $h_t$ | Model's current hidden state (dynamic context) |
| $W_k, W_v$ | Learnable projection matrices |
| $\alpha_t$ | Scalar gate value (0 to 1) |
| $\tilde{v}_t$ | Dynamically filtered memory vector |
| $d$ | Embedding dimension (for dot-product scaling) |
Phase 4 — The Loss Function and Training Signal
Engram does not introduce a new loss function. It plugs into the standard Cross-Entropy Loss used for next-token prediction.
At every training step, the model predicts the probability distribution over the vocabulary for the next token. Cross-Entropy measures how wrong that prediction is:
$$ \mathcal{L} = -\log P(\text{correct next token}) $$
If the true next word is "could" and the model predicted it with probability $0.10$, the loss is $-\log(0.10) \approx 2.3$ — a large penalty. If it predicted with probability $0.99$, the loss is $-\log(0.99) \approx 0.01$ — nearly zero.
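The arithmetic above is quick to verify. This snippet uses the natural logarithm, which is the convention in most deep learning frameworks; the helper name `cross_entropy` is only for illustration.

```python
import math


def cross_entropy(p_correct):
    """Loss for one token, given the probability assigned to the correct token."""
    return -math.log(p_correct)


for p in (0.10, 0.50, 0.99):
    print(f"P(correct) = {p:.2f} -> loss = {cross_entropy(p):.3f}")
# P(correct) = 0.10 -> loss = 2.303
# P(correct) = 0.50 -> loss = 0.693
# P(correct) = 0.99 -> loss = 0.010
```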
This loss flows backwards via backpropagation and updates:
- $W_k$ and $W_v$ — so the gate learns what constitutes "relevant memory."
- The embedding table $\mathbf{E}$ — so each table slot learns to encode a useful concept for the phrases that hash to it.
The training data is straightforward: a standard text corpus (Wikipedia, books, code) broken into overlapping sequences. The model receives tokens as input features and the same sequence shifted one step forward as labels. No special annotation is needed.
The Full Architecture
Here is how data flows through the Engram module. Two diagrams: macro view and the gating zoom-in.
Manual Walkthrough: Tracing "The White House"
Let's run an exact numerical trace through the gating block. We are processing position $t=3$, the token "House," in the sequence ["The" (ID:1), "White" (ID:2), "House" (ID:3)].
Setup:
- N-gram order: $n=2$ (bigrams)
- Hash heads: $K=2$
- Backbone hidden state: $h_3 = [0.8, 0.2]$ (represents "Political building context")
- Table size: 100 slots
Step 1 — N-Gram Extraction:
At "House", the bigram window is ["White", "House"].
Step 2 — Multi-Head Hashing:
Hash Head 1: Hash_1("White House") % 100 → Row 42
Hash Head 2: Hash_2("White House") % 100 → Row 15
Step 3 — Table Lookup:
Row 42: [0.1, 0.9] ← "Presidential" concept
Row 15: [0.5, 0.5] ← Noise / partial collision
Raw Memory (e_t): [0.1, 0.9, 0.5, 0.5] (concatenated)
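Step 4 — Projection (simplified: assume learned projections $W_k$ and $W_v$ that map the 4-dimensional $e_t$ down to 2 dimensions; the outputs below are illustrative, not computed):
Key (k_t) = W_k · e_t = [0.9, 0.1] ← "Government" direction
Value (v_t) = W_v · e_t = [1.0, 0.0] ← "Useful government facts"
Step 5 — Gating Calculation (RMSNorm and the $\sqrt{d}$ scaling are skipped to keep the arithmetic readable):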
Dot product of $h_3 = [0.8, 0.2]$ and $k_t = [0.9, 0.1]$:
$$ 0.8 \times 0.9 + 0.2 \times 0.1 = 0.72 + 0.02 = 0.74 $$
$$ \alpha_t = \sigma(0.74) \approx 0.68 $$
The gate is 68% open. High confidence. The model's "political building" context aligns with the "government" memory key.
Step 6 — Fusion:
$$ \tilde{v}_t = 0.68 \times [1.0, 0.0] = [0.68, 0.0] $$
Step 7 — Residual Addition:
$$ H_{\text{new}} = h_3 + \tilde{v}_t = [0.8, 0.2] + [0.68, 0.0] = [1.48, 0.2] $$
The model has injected factual knowledge about "The White House" into its hidden state — without any attention computation, just arithmetic and a table read.
What if the context was "The farmer grew a very Big Apple"?
Here, $h_t$ would be something like $[0.1, 0.9]$ — oriented toward "Fruit/Agriculture." The same "Big Apple" lookup might return a "New York City" vector $[1.0, 0.0]$.
Dot product: $0.1 \times 1.0 + 0.9 \times 0.0 = 0.1$
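$$ \alpha_t = \sigma(0.1) \approx 0.52 $$
The gate drops from 0.68 to 0.52: the mismatched memory is down-weighted, but far from suppressed. That is a limitation of this toy setup rather than of the mechanism. With non-negative 2-D vectors and no normalization, the score cannot go below zero, so the sigmoid cannot fall below 0.5. In a trained model, the normalized hidden state and key can point in opposing directions, which makes the score negative and pushes $\alpha_t$ toward 0, so irrelevant retrieved content is largely ignored.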
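Both traces can be reproduced in a few lines of plain Python. The sketch follows the walkthrough's simplifications (no RMSNorm, no $\sqrt{d}$ scaling, no convolution) and assumes that the "New York City" vector serves as both key and value in the second case. `engram_step` is a hypothetical helper name.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def engram_step(h, k, v):
    """Simplified gate plus residual add, as in the manual walkthrough."""
    alpha = sigmoid(sum(a * b for a, b in zip(h, k)))
    gated = [alpha * x for x in v]
    h_new = [a + b for a, b in zip(h, gated)]
    return alpha, h_new


# "The White House": political context, government-oriented key
alpha_wh, h_wh = engram_step(h=[0.8, 0.2], k=[0.9, 0.1], v=[1.0, 0.0])
# "Big Apple" in a farming sentence: fruit context, city-oriented key
alpha_ba, h_ba = engram_step(h=[0.1, 0.9], k=[1.0, 0.0], v=[1.0, 0.0])

print(f"White House: alpha={alpha_wh:.3f}, H_new={[round(x, 3) for x in h_wh]}")
print(f"Big Apple:   alpha={alpha_ba:.3f}, H_new={[round(x, 3) for x in h_ba]}")
```

Note how close the two gate values are in this unnormalized toy setting. The separation between relevant and irrelevant memories comes from normalization and from learned keys that can produce negative scores.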
PyTorch Implementation
Below is a pedagogically clear implementation of the Engram module. Every architectural decision maps directly to the equations and diagrams above. Shape annotations are included throughout.
```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class EngramModule(nn.Module):
    """
    [ARCHITECTURE: The full Engram Module as described in the macro view diagram]
    A conditional memory module that:
    1. Extracts N-gram windows from input tokens
    2. Hashes each window to table indices (multi-head)
    3. Retrieves static memory vectors from the embedding table
    4. Gates the memory based on contextual relevance (hidden state)
    5. Injects the filtered memory into the backbone residual stream
    """

    # Distinct prime bases, one per hash head (illustrative choice).
    HASH_PRIMES = (10007, 10009, 10037, 10039, 10061, 10067, 10069, 10079)

    def __init__(
        self,
        vocab_size: int = 32000,
        dim_model: int = 512,
        dim_memory: int = 128,
        num_heads: int = 2,
        num_slots: int = 10_000_000,   # Size of the "cheat sheet" table
        ngram_orders: tuple = (2, 3),  # Bigrams and trigrams
    ):
        super().__init__()
        if num_heads > len(self.HASH_PRIMES):
            raise ValueError(f"num_heads must be <= {len(self.HASH_PRIMES)}")
        self.vocab_size = vocab_size   # Not used by the hash; kept for reference
        self.dim_model = dim_model
        self.num_heads = num_heads
        self.ngram_orders = tuple(ngram_orders)
        self.num_slots = num_slots

        # [ARCHITECTURE: "Offloaded Engram Memory Hierarchy"]
        # In production, this table is stored in CPU RAM (can be 10-100GB).
        # The CPU prefetches required rows while the GPU handles earlier layers.
        # Here we represent it as a standard nn.Embedding for demonstration.
        # Simplification: all orders and heads share this one table, whereas
        # the equations use a separate table E_{n,k} per (order, head).
        self.memory_table = nn.Embedding(num_slots, dim_memory)

        # Total raw memory dimension after concatenating all heads and n-gram orders
        # Example: 2 orders * 2 heads * 128 dim_memory = 512
        self.total_memory_dim = len(self.ngram_orders) * num_heads * dim_memory

        # [ARCHITECTURE: Zoom-in "Linear W_k" and "Linear W_v"]
        # W_k projects memory into Key space (used for gating / relevance check)
        # W_v projects memory into Value space (used for actual injection)
        self.w_k = nn.Linear(self.total_memory_dim, dim_model, bias=False)
        self.w_v = nn.Linear(self.total_memory_dim, dim_model, bias=False)

        # Normalization layers for numerical stability before dot product
        # (Paper Section 2.3, Eq 4). nn.RMSNorm requires PyTorch >= 2.4.
        self.norm_h = nn.RMSNorm(dim_model)
        self.norm_k = nn.RMSNorm(dim_model)
        # RMSNorm applied to the gated memory before the convolution (Eq 5)
        self.norm_v = nn.RMSNorm(dim_model)

        # [ARCHITECTURE: "Depthwise Conv1D & SiLU Activation"]
        # Kernel size 4 per the paper, depthwise (groups=dim_model) to keep cost low.
        # Smooths gated memories across neighboring sequence positions.
        self.conv = nn.Conv1d(
            in_channels=dim_model,
            out_channels=dim_model,
            kernel_size=4,
            padding=3,
            groups=dim_model,  # Depthwise: no cross-channel mixing
        )

    def hash_ngrams(self, input_ids: torch.Tensor) -> torch.Tensor:
        """
        [ARCHITECTURE: "N-Gram Hash Logic" + "Multi-Head Hashing"]
        For each position t, slides a window of size n over the token sequence,
        then applies K different hash functions to get K table indices.
        This is a simplified polynomial rolling hash for clarity.
        In production (C++/CUDA kernel), this uses multiplicative-XOR hashing
        and runs orders of magnitude faster.
        Returns: [SHAPE: (Batch, Seq_Len, Total_Heads)], integer indices
        """
        indices_list = []
        for n in self.ngram_orders:
            # Pad left with zeros so every position t has a full n-gram window
            # [SHAPE: (Batch, Seq_Len + n - 1)]
            padded = F.pad(input_ids, (n - 1, 0), value=0)
            # Create sliding windows: at each position, grab the last n tokens
            # [SHAPE: (Batch, Seq_Len, n)]
            windows = padded.unfold(dimension=1, size=n, step=1)
            for k in range(self.num_heads):
                # A different prime base per head gives each head its own
                # collision set.
                base = self.HASH_PRIMES[k]
                # Polynomial hash via Horner's rule. It is order-sensitive, and
                # seeding with n keeps bigram and trigram hashes apart.
                # Taking the modulo at every step avoids int64 overflow.
                # [SHAPE: (Batch, Seq_Len)]
                hash_val = torch.full_like(windows[..., 0], n)
                for j in range(n):
                    hash_val = (hash_val * base + windows[..., j]) % self.num_slots
                indices_list.append(hash_val)
        # Stack into a single tensor of indices
        # [SHAPE: (Batch, Seq_Len, num_orders * num_heads)]
        return torch.stack(indices_list, dim=-1)

    def forward(
        self,
        input_ids: torch.Tensor,     # [SHAPE: (Batch, Seq_Len)]
        hidden_state: torch.Tensor,  # [SHAPE: (Batch, Seq_Len, Dim_Model)]
    ) -> torch.Tensor:
        """
        Runs the full Engram pipeline and returns the memory update vector Y.
        The caller adds Y to the residual stream: H_new = H + Y
        """
        B, T, D = hidden_state.shape

        # ── Step 1: Retrieve Static Memory ──────────────────────────────────
        # [ARCHITECTURE: "N-Gram Hash Logic" -> "Table Lookup" -> "Concatenation"]
        # Get integer table indices for every position
        # [SHAPE: (Batch, Seq_Len, Total_Heads)]
        lookup_indices = self.hash_ngrams(input_ids)
        # Look up embedding vectors for every index
        # [SHAPE: (Batch, Seq_Len, Total_Heads, Dim_Memory)]
        raw_memory = self.memory_table(lookup_indices)
        # Flatten heads into one long vector per token (concatenation)
        # [SHAPE: (Batch, Seq_Len, Total_Memory_Dim)]
        e_t = raw_memory.reshape(B, T, -1)

        # ── Step 2: Context-Aware Gating (Eq 3 & 4) ─────────────────────────
        # [ARCHITECTURE: "Context-Aware Gating" block in Zoom-In diagram]
        # Project raw memory to Key (for relevance check) and Value (for injection)
        k_t = self.w_k(e_t)  # [SHAPE: (Batch, Seq_Len, Dim_Model)]
        v_t = self.w_v(e_t)  # [SHAPE: (Batch, Seq_Len, Dim_Model)]
        # Normalize h and k before the dot product (keeps the score scale stable)
        h_norm = self.norm_h(hidden_state)
        k_norm = self.norm_k(k_t)
        # Dot product similarity between the model's context and the memory key.
        # keepdim=True gives shape (Batch, Seq_Len, 1): one gate scalar per position.
        scores = (
            torch.sum(h_norm * k_norm, dim=-1, keepdim=True)
            / math.sqrt(self.dim_model)
        )
        # Squeeze score into [0, 1] gate value via Sigmoid
        # [SHAPE: (Batch, Seq_Len, 1)]
        alpha_t = torch.sigmoid(scores)

        # ── Step 3: Apply Gate and Fuse ─────────────────────────────────────
        # Scale the value vector by the gate (irrelevant memories → near-zero)
        # [SHAPE: (Batch, Seq_Len, Dim_Model)]
        gated_memory = alpha_t * v_t

        # ── Step 4: Depthwise Convolutional Refinement (Eq 5) ───────────────
        # [ARCHITECTURE: "Depthwise Conv1D & SiLU Activation"]
        # Conv1D expects (Batch, Channels, Length), so transpose in and out
        conv_input = self.norm_v(gated_memory).transpose(1, 2)  # (B, Dim, T)
        conv_out = self.conv(conv_input)                        # (B, Dim, T + 3)
        # padding=3 pads both ends; dropping the last 3 outputs makes the conv
        # causal (position t only sees positions t-3 .. t)
        conv_out = conv_out[:, :, :-3]
        # Transpose back and apply SiLU non-linearity, add residual
        # [SHAPE: Y -> (Batch, Seq_Len, Dim_Model)]
        Y = F.silu(conv_out.transpose(1, 2)) + gated_memory
        # Y is returned. The backbone adds it to its residual stream: H_new = H + Y
        return Y


# ── Usage Example ────────────────────────────────────────────────────────────
batch_size = 2
seq_len = 10
vocab_size = 32000
dim_model = 512

# Raw token IDs from tokenizer
input_ids = torch.randint(0, vocab_size, (batch_size, seq_len))
# Hidden state from the previous Transformer layer (Layer 1 output)
hidden_state = torch.randn(batch_size, seq_len, dim_model)

# Instantiate Engram (1M slots * 128 dims in float32 is about 512 MB)
engram = EngramModule(
    vocab_size=vocab_size,
    dim_model=dim_model,
    dim_memory=128,
    num_heads=2,
    num_slots=1_000_000,
    ngram_orders=[2, 3],  # Use both bigrams and trigrams
)

# Forward pass → returns the memory update vector
memory_update = engram(input_ids, hidden_state)
# Residual connection simulating the Transformer backbone
final_state = hidden_state + memory_update

print("Input token IDs shape:   ", input_ids.shape)      # (2, 10)
print("Memory update shape:     ", memory_update.shape)  # (2, 10, 512)
print("Final hidden state shape:", final_state.shape)    # (2, 10, 512)
print("Engram module executed successfully.")
```
Key Takeaway: The shape annotations are the map. Every transformation has a defined input and output shape. The hash_ngrams function is the only conceptually tricky part — everything else is standard PyTorch operations (linear layers, sigmoid, conv1d, residual addition). In production, hash_ngrams is replaced by a CUDA kernel for speed.
How the Embedding Table is Trained
A question that naturally arises: "Where do those embedding table values actually come from?"
At initialization, nn.Embedding fills every row with random noise — meaningless floating-point numbers. The table is blank.
During training, the Cross-Entropy loss flows backward through the residual connection, through the SiLU/Conv1D, through the gate, and arrives at the projection matrices $W_k$, $W_v$, and the embedding table itself. The optimizer sees: "Row 30 (which 'Big Apple' hashes to) contributed to predicting 'Tree' instead of 'City.' Nudge the numbers at Row 30 to be more 'City-like.'"
After billions of training steps and trillions of tokens, Row 30 has organically evolved from random noise into a vector that encodes "New York City" in the model's mathematical language. The cheat sheet writes itself.
You can inspect the learned values directly:
```python
# Access the raw weight matrix
# [SHAPE: (num_slots, dim_memory)]
table_weights = engram.memory_table.weight

# Inspect what Row 30 has learned after training
print(table_weights[30])
# Initially: tensor([-0.4231, 0.1829, ...]) ← random noise
# After training: something semantically meaningful for the phrase hashing to slot 30
```
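One property worth seeing in isolation is that gradient updates are sparse: only the rows that were actually looked up change. The toy loop below uses a squared-error objective as a stand-in for the real cross-entropy signal, and the slot index, target vector, and learning rate are all made up for illustration.

```python
import random

rng = random.Random(0)
num_slots, dim = 8, 2
table = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(num_slots)]
before = [row[:] for row in table]

slot = 3               # hypothetical slot that some phrase hashes to
target = [1.0, 0.0]    # stand-in for the direction the loss pushes this row toward
lr = 0.1

for _ in range(200):
    row = table[slot]
    # Gradient of the squared error sum((w - t)^2) with respect to the row
    grad = [2.0 * (w - t) for w, t in zip(row, target)]
    table[slot] = [w - lr * g for w, g in zip(row, grad)]

untouched = all(table[i] == before[i] for i in range(num_slots) if i != slot)
print([round(w, 4) for w in table[slot]], "other rows unchanged:", untouched)
```

The looked-up row converges to the target while every other row keeps its initial noise. This is why rarely seen n-grams end up with weakly trained slots.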
Where Engram Lives Inside the Transformer
Knowing how the module works is only half the story. Where you put it inside the Transformer stack is equally deliberate.
The Standard vs. Engram Transformer Block
A standard Transformer layer processes: Input → [Attention] → [Feed-Forward / MoE] → Output
When a layer is designated an Engram layer, it becomes: Input → [Engram Lookup + Gating] → [Residual Add] → [Attention] → [MoE / FFN] → Output
```text
[ Previous Layer Output (H) ]
              |
              v
+---------------------------+
|       ENGRAM MODULE       |  ← 1. Hash "The White House", retrieve vector
|     (Lookup + Gating)     |  ← 2. Gate with current context
+---------------------------+
              |
              v
+---------------------------+
|     RESIDUAL ADDITION     |  ← 3. H_new = H + Engram_output
+---------------------------+       (H now "knows" White House = Govt)
              |
              v
+---------------------------+
|      ATTENTION BLOCK      |  ← 4. Attention runs with enriched H
|     (Self-Attention)      |       Can link "veto" to "Executive Branch"
+---------------------------+
              |
              v
+---------------------------+
|         MoE / FFN         |
+---------------------------+
              |
              v
      [ To Next Layer ]
```
Why Before Attention?
This placement is a deliberate architectural decision justified in Sections 2.3 and 6.2 of the paper.
By injecting factual memory before the Attention block, you enrich the context that Attention operates on. If the sentence is "The White House issued a veto," and Engram pre-loads the concept "Executive Branch" before attention runs, the attention heads can immediately form the association between "veto" and "Executive Branch" — they don't have to derive it from weights alone.
Placing Engram after Attention would mean the heavy neural network already did the hard work, then you correct it with retrieved memory. That's backwards — you burned the compute budget before the cheat sheet arrived.
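The ordering argument can be written down as a sketch. The two functions below differ only in where the Engram update is added. The residual wiring and the scalar stand-ins for the sublayers are assumptions chosen so the example runs without a real model.

```python
def engram_first(h, token_ids, engram, attention, ffn):
    """Engram layer as described: memory is added before attention runs."""
    h = h + engram(token_ids, h)   # lookup + gating, then residual add
    h = h + attention(h)           # attention sees the enriched state
    h = h + ffn(h)                 # MoE / FFN
    return h


def attention_first(h, token_ids, engram, attention, ffn):
    """Alternative ordering: attention runs on the un-enriched state."""
    h = h + attention(h)
    h = h + engram(token_ids, h)
    h = h + ffn(h)
    return h


# Scalar stand-ins for the real sublayers (purely illustrative)
def toy_engram(token_ids, h):
    return 1.0          # "retrieved memory"


def toy_attention(h):
    return 2.0 * h      # output depends on whatever state it is given


def toy_ffn(h):
    return 0.0


print(engram_first(1.0, [1, 2, 3], toy_engram, toy_attention, toy_ffn))
print(attention_first(1.0, [1, 2, 3], toy_engram, toy_attention, toy_ffn))
```

With Engram first, the attention stand-in operates on a state that already contains the memory, so the memory influences its output. With attention first, the memory is only appended afterwards.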
Sparse Placement in Depth
Critically, you do not put Engram in every Transformer layer. The paper finds that sparse, strategic placement works best — typically 2 injection points in a 30-layer model:
| Layer Range | Type |
|---|---|
| Layers 0–1 | Standard |
| Layer 2 | Engram + Standard (Injection Point 1) |
| Layers 3–14 | Standard |
| Layer 15 | Engram + Standard (Injection Point 2) |
| Layers 16–29 | Standard |
Why Layer 2 specifically?
- Layer 0 is too early. The model has only embedded the input tokens. Its hidden state carries no contextual information yet, so the gating mechanism has nothing meaningful to compare against. The gate would open and close almost randomly.
- Layer 2 is the Goldilocks zone. After 1–2 layers of attention, the hidden state has already established basic contextual understanding — it knows "Big Apple" is likely about cities, not fruit, based on surrounding tokens. The gate can make an informed decision. Critically, the model is still early enough in the stack that saving the deeper layers from rote memorization is maximally valuable.
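As a configuration sketch, the sparse placement can be expressed as a per-layer plan. It assumes zero-based layer indices and the two injection points from the table; `layer_plan` and the label strings are hypothetical names.

```python
def layer_plan(num_layers=30, engram_layers=(2, 15)):
    """Return a label per layer; Engram is attached only at the given indices."""
    return [
        "engram+standard" if i in engram_layers else "standard"
        for i in range(num_layers)
    ]


plan = layer_plan()
for i, kind in enumerate(plan):
    if kind != "standard":
        print(f"layer {i}: {kind}")
print("engram layers:", plan.count("engram+standard"), "of", len(plan))
```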
The Speed Paradox: Why Isn't This Slow?
When engineers first see "external memory table," their instinct is: latency. Fetching data from CPU RAM into GPU memory sounds like a bottleneck.
The short answer: it adds storage, not depth. And it prefetches asynchronously.
The Deterministic Prefetch Advantage
Standard MoE routing is dynamic — which expert to activate depends on the output of the previous layer. You can't know in advance which Expert Layer 2 needs until Layer 1 finishes computing.
Engram routing is deterministic — the lookup index depends only on the raw input tokens, which are available from Step 0.
Here is the actual execution timeline:
- Step 0: User sends "The White House issued a veto."
- Step 0.1 (CPU, instant): The CPU computes all n-gram hashes for the entire sequence. It now knows: Layer 2 will need Row 42, Layer 15 will need Row 918.
- Step 1 (GPU): GPU begins computing Layer 0, Layer 1...
- Simultaneously (CPU→GPU): CPU fetches Row 42 and Row 918 from the embedding table and begins transferring them to GPU VRAM.
- Step 2 (GPU, Layer 2): GPU arrives at the Engram layer. The data is already in VRAM, waiting. Zero stall.
The paper's benchmarks confirm this: even with a 100-billion-parameter embedding table offloaded to CPU RAM, inference slowdown is just 2.8%.
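The overlap in the timeline can be mimicked with a background thread. This is only a sketch of the scheduling idea: the sleeps stand in for the transfer and for GPU compute, the tiny dictionary stands in for the CPU-resident table, and all function names are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor

TABLE = {42: [0.1, 0.9], 918: [0.5, 0.5]}   # stand-in for the CPU-resident table


def fetch_rows(slots):
    """Pretend CPU-to-GPU transfer: slow, but it only needs the slot indices."""
    time.sleep(0.05)
    return {s: TABLE[s] for s in slots}


def run_early_layers():
    """Pretend GPU work for the layers before the first Engram layer."""
    time.sleep(0.05)
    return "hidden_state_after_layer_1"


start = time.perf_counter()
with ThreadPoolExecutor(max_workers=1) as pool:
    # Slots depend only on the input tokens, so the fetch can start at step 0.
    pending = pool.submit(fetch_rows, [42, 918])
    hidden = run_early_layers()     # runs while the fetch is in flight
    rows = pending.result()         # blocks only if the fetch is still running
elapsed = time.perf_counter() - start

print(f"elapsed: {elapsed:.3f}s, rows ready: {sorted(rows)}")
```

Run serially, the two steps would take about 0.10 s. Overlapped, the total is close to the longer of the two. A dynamically routed lookup could not be started this early, because its indices would not be known until the preceding layer finished.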
The Iso-FLOPs Budget
The other half of the answer is that Engram doesn't add computation to the model — it redistributes it.
The authors removed a proportional number of active MoE experts from the standard layers to compensate for the $W_k$ and $W_v$ projection cost in the Engram module. The total FLOPs per token stays roughly constant. Engram is not free, but it pays for itself by replacing the most inefficient uses of those FLOPs (deep reconstruction of static facts) with a more efficient mechanism (O(1) table reads + lightweight gating).
The Numbers: Does It Actually Work?
The paper benchmarks a standard 27B Mixture-of-Experts model against a 27B Engram model with identical total parameters and identical training compute. Iso-parameters, iso-FLOPs.
| Benchmark | Baseline MoE | Engram | Delta |
|---|---|---|---|
| MMLU (General Intelligence) | Baseline | +3.4 points | ✅ |
| ARC-Challenge (Reasoning) | Baseline | +3.7 points | ✅ |
| Long Context Retrieval (Needle-in-Haystack) | 84.2% | 97.0% | ✅ |
| Inference Latency (100B param table) | 1.00x | 1.028x | ~neutral |
The long-context result is particularly striking. Standard models struggle with "needle in a haystack" tasks because they must use attention over the entire context to locate a specific fact. Engram can retrieve an exact n-gram match directly from the table, bypassing positional attention entirely for static associations.
The U-shaped Scaling Law from Figure 3 of the paper explains why: standard models allocate parameters to both memorization and reasoning in a fixed ratio determined by the training corpus. Moving 20–25% of parameters into the Engram table — where they specialize exclusively in fast static retrieval — allows the remaining parameters to specialize exclusively in reasoning. Both halves get better at their respective jobs.
Trade-offs and When Not to Use It
Engram is a compelling architectural advancement, but it is not a universal upgrade. Here is where the costs accumulate.
Memory Consumption Is Severe
The embedding table is not small. The paper scales it to 100 billion parameters, approximately 200 GB at float16. Even a modest Engram deployment requires tens of gigabytes of CPU RAM just for the table, on top of the standard GPU VRAM for model weights.
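The memory figure follows directly from parameter count times bytes per parameter. The helper below is a back-of-the-envelope sketch using decimal gigabytes. The second call sizes the default table from the PyTorch example in this article (10 million slots of 128 float32 values).

```python
def table_size_gb(num_params, bytes_per_param):
    """Storage for an embedding table in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


print(table_size_gb(100_000_000_000, 2))    # 100B params at float16 -> 200.0
print(table_size_gb(10_000_000 * 128, 4))   # 10M slots x 128 dims at float32
```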
For teams running inference on consumer hardware or cost-sensitive cloud deployments, this is the primary blocker. The prefetch mechanism solves the latency problem, not the memory problem.
It Cannot Be Retrofitted
You cannot take a trained GPT-4-class model and bolt Engram onto it. The gating mechanism, the projection matrices $W_k$ and $W_v$, and the embedding table itself all need to be trained jointly from scratch. The model must learn simultaneously: what concept to store in each table slot, how to project keys for gating, and when to trust the retrieved memory. These behaviors are entangled and co-dependent.
This means Engram is a pre-training architectural choice. If you adopt it, you adopt it from day zero.
The Table Is Static After Training
The embedding table encodes knowledge frozen at training time. It cannot be updated post-deployment without retraining. For domains where facts change frequently — live financial data, current events — the table will drift out of sync with reality. A Retrieval-Augmented Generation (RAG) system, by contrast, can access a real-time database.
Conclusion
Engram attacks a real inefficiency that the field has largely accepted as inevitable: LLMs burning deep computational resources on shallow memorization tasks. The architectural response is clean — a hash-indexed embedding table that handles rote recall in $O(1)$, combined with a learned gate that ensures retrieved memories are only injected when contextually appropriate.
What the Engram paper fundamentally demonstrates is that the strict separation between storage and computation — a principle that senior engineers apply to every well-designed backend system — has been systematically violated in LLM architecture by necessity, not by design. As hardware, training infrastructure, and memory management improve, expect this separation to become a standard building block rather than a research novelty.
Your LLM is a brilliant reasoner. Stop making it be its own encyclopedia.
Reference: DeepSeek Engram Paper
