arXiv: 2505.04075

LLM-e Guess: Can LLMs Capabilities Advance Without Hardware Progress?


This paper introduces a framework that classifies algorithmic innovations in LLMs as compute-dependent or compute-independent. Through small-scale GPT-2 experiments, it demonstrates that compute-independent advancements such as FlashAttention can yield up to 3.5× compute-equivalent gains even under hardware constraints, challenging the efficacy of hardware-focused AI regulation.

Large Language Model, Transformer, Efficiency, Scaling Laws, Pre-training

Teddy Foley, Spencer Guo, Henry Josephson, Anqi Qu, Jack Sanderson

Existential Risk Laboratory, The University of Chicago

Generated by grok-3

Background Problem

The rapid progress of Large Language Models (LLMs) has been driven by both increased computational resources and algorithmic innovations, yet the relative impact of each remains unclear, especially under potential hardware restrictions imposed by regulatory measures like export controls on advanced chips. This paper investigates whether LLM capabilities can advance without further hardware scaling, addressing the critical question of whether compute constraints can effectively slow AI development or whether algorithmic breakthroughs can sustain progress independently. The answer has significant implications for AI governance, forecasting, and investment strategies in a landscape where hardware access may be limited.

Method

The authors propose a novel classification framework to distinguish between compute-dependent algorithmic innovations (which yield benefits primarily at high compute levels, e.g., the Transformer architecture, Mixture-of-Experts) and compute-independent innovations (which improve efficiency across all compute scales, e.g., Rotary Positional Embedding, FlashAttention, Layer Normalization). They introduce a metric called Compute-Equivalent Gain (CEG), defined as the ratio of the compute cost of a baseline model to that of an equally performant, more efficient model ($\mathrm{CEG} = \frac{C_b}{C_e}$, where $C_b$ is the baseline's compute cost and $C_e$ that of the efficient model), quantifying how much additional compute would be needed to reach the same performance without the algorithmic advancement. The framework is applied through case studies of past innovations and validated via small-scale training experiments on a downsized GPT-2 model to measure performance gains and CEG at low compute levels.
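As a minimal sketch (not from the paper; the function name and FLOP figures below are illustrative), CEG reduces to a single ratio between two training runs that reach the same performance:

```python
def compute_equivalent_gain(baseline_flops: float, efficient_flops: float) -> float:
    """CEG = C_b / C_e: the factor of extra compute the baseline model
    would need to match the more efficient model's performance."""
    return baseline_flops / efficient_flops

# Illustrative numbers only: a baseline run costing 2.1e18 FLOPs vs. a
# variant reaching the same validation loss in 6.0e17 FLOPs gives CEG = 3.5.
print(compute_equivalent_gain(2.1e18, 6.0e17))  # 3.5
```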

Experiment

The experiments used a scaled-down version of GPT-2 (nanoGPT) trained on the OpenWebText dataset for 50,000 iterations due to resource constraints, testing Layer Normalization, Rotary Positional Embedding (RoPE), FlashAttention, and Multi-Query Attention (MQA). The setup measured cross-entropy validation loss and CEG against a baseline model, with model FLOPs utilization (MFU) recorded to estimate total FLOPs. Compute-independent algorithms (LayerNorm, RoPE, FlashAttention) provided significant CEG (up to 3.5× combined), while compute-dependent MQA showed negligible gains (CEG of 0.91), supporting the hypothesis that compute-independent innovations are effective even at low scales. However, the small scale limits generalizability to frontier models, and the exclusion of some algorithms (e.g., Mixture-of-Experts, Sparse Attention) due to implementation challenges limits comprehensiveness. The experimental design is reasonable given the resource constraints, but it tests only a single combination of methods for synergistic effects and does not address dataset or inference-time impacts.
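A hedged sketch of this MFU-based FLOP accounting (a standard estimation approach, e.g. in nanoGPT; the hardware figures and timings below are assumptions, not the paper's measurements):

```python
# Estimate total training compute from measured MFU (model FLOPs utilization),
# then compare runs at equal validation loss to obtain CEG.
def total_flops(mfu: float, peak_flops_per_s: float, train_seconds: float) -> float:
    """Compute actually consumed: fraction of peak throughput (MFU)
    times the GPU's peak FLOP/s times wall-clock training time."""
    return mfu * peak_flops_per_s * train_seconds

# Hypothetical A100 setup: 312e12 FLOP/s bf16 peak, 40% MFU, 24 h baseline run.
baseline = total_flops(0.40, 312e12, 24 * 3600)
# A FlashAttention run reaching the same loss in 16 h at slightly higher MFU:
flash = total_flops(0.45, 312e12, 16 * 3600)
print(f"CEG = {baseline / flash:.2f}")  # baseline compute / efficient compute
```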

Further Thoughts

The distinction between compute-dependent and compute-independent innovations opens a fascinating avenue for rethinking AI development trajectories, especially in light of potential hardware plateaus. I'm particularly intrigued by the policy implications: if compute-independent innovations can be discovered at smaller scales, as suggested, this could democratize AI research to some extent, but the paper's observation that well-resourced actors are better positioned to automate such discoveries (via AI agents or large-scale experimentation) hints at a widening gap between resource-rich and resource-poor entities. This ties into broader discussions in AI ethics and fairness: how do we ensure equitable access to algorithmic advancements if compute access remains a bottleneck?

Additionally, the focus on attention mechanisms as the primary compute consumer (and thus a key area for compute-dependent gains) aligns with recent trends in vision and multimodal models, where attention-based architectures dominate. Future work could extend this framework to those domains, potentially revealing whether compute dependencies vary across AI subfields.

Lastly, the omission of inference-time techniques like chain-of-thought prompting is a missed opportunity, as these could represent a significant compute-independent lever for capability gains, especially in real-world deployment scenarios.


