LLM inference is memory-bandwidth-bound, not compute-bound, and current DRAM technology cannot keep pace with model scaling

During the autoregressive token-generation phase of LLM inference, producing each token requires reading the entire model's weights from memory, so performance is limited by DRAM bandwidth (GB/s) rather than compute throughput (FLOPS). AI chip compute has grown roughly 80x over the past decade while memory bandwidth has grown only about 17x, creating a widening 'memory wall.'

So what? GPU utilization during inference sits at 10-30% because the arithmetic units sit idle while waiting for data to arrive from HBM; companies are paying for $30K-$200K GPUs but using only a fraction of their computational capability (see the sketch after this chain).

So what? The cost per token for serving models like GPT-4 or Claude remains high ($0.01-$0.06 per 1K tokens for large models), making many potential AI applications economically unviable, particularly real-time agent systems that consume thousands of tokens per interaction.

So what? Startups building AI-native products face unit economics in which inference costs consume 40-70% of revenue, making profitability structurally difficult without massive scale.

So what? The bottleneck pushes model builders toward aggressive quantization, pruning, and distillation, which degrade output quality and create a direct tradeoff between cost and intelligence.

So what? The most capable AI systems remain locked behind API providers who can amortize the memory-bandwidth cost across millions of users, concentrating AI capability in a handful of companies.

This persists because DRAM physics limit how fast data can be read from memory cells, HBM stacking improves bandwidth only at extreme cost ($100+ per GB vs. $3-5/GB for standard DDR5), and the transformer architecture requires a full weight read per token with no known algorithmic workaround.
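To make the bandwidth argument concrete, here is a minimal back-of-envelope sketch in Python. The model size, precision, bandwidth, and FLOPS figures are assumptions chosen to roughly match a 70B-parameter FP16 model on a single H100-class accelerator; the point is the ratio between memory time and compute time, not the exact values.

```python
# Back-of-envelope roofline check for autoregressive decoding at batch size 1.
# Every number below is an illustrative assumption, not a measurement:
#   - 70e9 parameters stored in FP16 (2 bytes each)
#   - 3.35e12 B/s of HBM bandwidth and 1e15 FLOP/s of dense FP16 compute,
#     roughly the published specs of a single H100-class accelerator

params = 70e9                       # model parameters (assumed)
bytes_per_param = 2                 # FP16 weights
hbm_bandwidth = 3.35e12             # bytes/s (assumed)
peak_flops = 1e15                   # FLOP/s, dense FP16 (assumed)

weight_bytes = params * bytes_per_param    # bytes streamed per generated token
flops_per_token = 2 * params               # ~1 multiply + 1 add per weight

t_memory = weight_bytes / hbm_bandwidth    # time to read the weights once
t_compute = flops_per_token / peak_flops   # time the matmuls actually need

print(f"memory-bound floor per token : {t_memory * 1e3:6.1f} ms")
print(f"compute time per token       : {t_compute * 1e3:6.2f} ms")
print(f"ALU utilization at batch 1   : {t_compute / t_memory:.1%}")
# ~42 ms of weight streaming vs ~0.14 ms of compute: the arithmetic units sit
# idle well over 99% of the time, which is the memory-bound behavior described above.
```

Under these assumptions a single decode stream is pinned at the ~42 ms weight-streaming floor per token; production servers batch many requests so one weight read serves many tokens at once, which is how utilization climbs back toward the 10-30% range cited above.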

Evidence

Google engineers have stated publicly that 'network latency and memory trump compute' for inference (SDxCentral).
The arXiv paper 'Mind the Memory Gap' (2503.08311) argues that LLM inference remains memory-bound even at large batch sizes.
WinBuzzer (Jan 2026) reported 'Memory Bottleneck Emerges as Main LLM Inference Challenge.'
TrendForce estimates HBM demand grew more than 130% year over year in 2025, with 70%+ growth expected in 2026.
Current systems sustain on the order of 1,000-2,500 tokens/sec; reaching 10,000 tokens/sec would require fundamental hardware changes (a rough bandwidth estimate follows below).
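A similarly hedged estimate shows why the 10,000 tokens/sec target is a bandwidth problem rather than a compute problem: per decode stream, tokens/sec can never exceed memory bandwidth divided by the bytes that must be read per token. The model size, quantization level, and bandwidth figures below are illustrative assumptions.

```python
# Per-stream decode ceiling: tokens/sec <= memory_bandwidth / bytes_read_per_token.
# Illustrative assumptions: 70e9 parameters quantized to 4-bit weights (0.5 bytes each).
weight_bytes = 70e9 * 0.5            # bytes streamed per generated token

bandwidths = {                       # rough published figures, bytes/s
    "DDR5 server (~0.3 TB/s)": 0.3e12,
    "A100 HBM2e  (~2.0 TB/s)": 2.0e12,
    "H100 HBM3   (~3.35 TB/s)": 3.35e12,
}
for name, bw in bandwidths.items():
    print(f"{name:26s} ceiling: {bw / weight_bytes:5.0f} tokens/sec per stream")

# Bandwidth needed to push a single stream to 10,000 tokens/sec:
print(f"10,000 tok/s per stream needs ~{10_000 * weight_bytes / 1e12:.0f} TB/s")
```

The 1,000-2,500 tokens/sec figures quoted above are typically batched, aggregate throughput; under these assumptions a single stream tops out near 100 tokens/sec on today's HBM, and 10,000 tokens/sec per stream would need roughly 100x more bandwidth, which is why fundamental hardware changes are called for.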
