If we zoom out a little, what is an LLM trying to accomplish mathematically? Or better yet,
Ques. What sort of morphing happens to the input information given to the black box we call an LLM?
This black box interacts with information in two different phases: training and inference. Just know that the morphing it applies to information during training is a superset of what it applies during inference.
Ans. An LLM is trying to minimize conditional entropy, the real goal of all intelligent systems.
Observing a system means observing some property of that system. One very useful such property is entropy. Let's quickly go over Entropy, Conditional Entropy, Encoding, and Optimal Encoding.
Entropy:
In information theory, "Entropy" tells you how unpredictable something is, on average.
Entropy is the expected number of bits required to encode the outcome of a random variable under an optimal encoding scheme.
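As a quick numerical check of this definition, here is a minimal Python sketch of the standard formula $H(X) = -\sum_x p(x) \log_2 p(x)$. The helper name `entropy` and the example distributions are illustrative choices, not something taken from the text above.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution (list of probabilities)."""
    # Terms with p == 0 contribute nothing, so they are skipped.
    return -sum(p * math.log2(p) for p in probs if p > 0)

fair_coin = entropy([0.5, 0.5])      # maximally unpredictable binary outcome: 1 bit
biased_coin = entropy([0.9, 0.1])    # more predictable, so fewer bits on average
fair_die = entropy([1 / 6] * 6)      # equals log2(6)

print(f"fair coin:   {fair_coin:.4f} bits")
print(f"biased coin: {biased_coin:.4f} bits")
print(f"fair die:    {fair_die:.4f} bits")
```

The fair coin needs one full bit per outcome. The biased coin is easier to predict, so an optimal code can spend less than one bit per outcome on average.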
Conditional Entropy:
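Conditional entropy measures the uncertainty remaining in $Y$ after observing $X$:
$$H(Y \mid X) = -\sum_{x, y} p(x, y) \log_2 p(y \mid x)$$
It can also be expressed as:
$$H(Y \mid X) = H(X, Y) - H(X)$$
where $H(X, Y)$ is the joint entropy.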
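The two expressions for conditional entropy can be checked against each other numerically. The sketch below assumes the standard definitions $H(Y \mid X) = -\sum_{x,y} p(x,y) \log_2 p(y \mid x)$ and $H(Y \mid X) = H(X,Y) - H(X)$; the joint distribution `joint` is a made-up toy example.

```python
import math

# Hypothetical joint distribution p(x, y) over binary X and Y.
joint = {
    (0, 0): 0.4, (0, 1): 0.1,
    (1, 0): 0.1, (1, 1): 0.4,
}

def H(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def marginal_x(joint):
    """Sum p(x, y) over y to get p(x)."""
    px = {}
    for (x, _y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
    return px

def conditional_entropy(joint):
    """H(Y|X) computed directly from p(y|x) = p(x, y) / p(x)."""
    px = marginal_x(joint)
    return -sum(p * math.log2(p / px[x]) for (x, _y), p in joint.items() if p > 0)

h_direct = conditional_entropy(joint)
h_chain = H(joint.values()) - H(marginal_x(joint).values())  # H(X,Y) - H(X)

print(f"direct definition: {h_direct:.6f} bits")
print(f"chain rule:        {h_chain:.6f} bits")
```

Both routes give the same number. Here, knowing $X$ leaves roughly 0.72 bits of uncertainty about $Y$, compared with 1 bit when $X$ is not observed.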
Minimizing Conditional Entropy (LLM):
An LLM can be viewed as a communication channel where:
- Input $X$: tokens from the training corpus
- Output $Y$: next token predictions
- The model parameters $\theta$ define the channel
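The training objective minimizes the cross-entropy loss. Because cross-entropy equals the conditional entropy $H(Y \mid T)$ plus a non-negative KL term, minimizing it pushes the model toward the conditional entropy $H(Y \mid T)$, where $T$ represents the model's internal representation of the input.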
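The link between cross-entropy and entropy can be seen on a single next-token distribution. This sketch uses the standard identity cross-entropy = entropy + KL divergence; the distributions `p_true` and `q_model` are invented for illustration, and a real training loss averages this quantity over many contexts.

```python
import math

def entropy(p):
    """H(p) in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Expected code length in bits when data follows p but the code is built for q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """KL(p || q) in bits; always >= 0 and zero only when p == q."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical true next-token distribution and model prediction for one context.
p_true = [0.7, 0.2, 0.1]
q_model = [0.5, 0.3, 0.2]

h = entropy(p_true)
ce = cross_entropy(p_true, q_model)
gap = kl_divergence(p_true, q_model)

print(f"entropy        : {h:.4f} bits")
print(f"cross-entropy  : {ce:.4f} bits")
print(f"KL gap         : {gap:.4f} bits")
```

The cross-entropy never drops below the entropy of the true distribution. Training can only shrink the KL gap, which is why the loss has a floor set by the data itself.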
However, this view is incomplete. The true information-theoretic objective involves finding an optimal balance between compression and prediction - this is where the Information Bottleneck principle becomes essential.
Information Bottleneck Principle
The Information Bottleneck (IB) principle, introduced by Tishby et al. (1999), frames learning as a trade-off between two competing objectives:
- Compression: Minimize the information the retained representation carries about the input
- Prediction: Maximize the information the retained representation carries about the relevant output
Formally, we seek to minimize the IB Lagrangian:
$$\mathcal{L}[p(t \mid x)] = I(X;T) - \beta \, I(T;Y)$$
where:
- $I(X;T)$ is the mutual information between input and representation (to be minimized)
- $I(T;Y)$ is the mutual information between representation and target (to be maximized)
- $\beta$ is a Lagrange multiplier controlling the trade-off
For $\beta \to \infty$, we preserve all information about $Y$ (pure prediction). For $\beta \to 0$, we compress as much as possible (pure compression).
In the context of neural networks, each layer learns a stochastic mapping $p(t \mid x)$ from the input to its representation. Training via SGD on the cross-entropy loss has been argued to implicitly optimize an IB objective, where $\beta$ is related to the inverse temperature in statistical physics formulations.
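To make the trade-off concrete, here is a small sketch that evaluates the IB Lagrangian in its common form, $I(X;T) - \beta I(T;Y)$, for three hand-written encoders. Everything in it is a toy assumption: the input takes four equally likely values, the target is its high bit, and the encoders `identity`, `coarse`, and `constant` are hypothetical.

```python
import math

def mutual_information(joint):
    """I(A;B) in bits from a dict {(a, b): probability}."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return sum(p * math.log2(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)

def ib_terms(p_xy, encoder):
    """Return (I(X;T), I(T;Y)) for an encoder p(t|x), with T depending on Y only via X."""
    p_xt, p_ty = {}, {}
    for (x, y), pxy in p_xy.items():
        for t, ptx in encoder[x].items():
            p_xt[(x, t)] = p_xt.get((x, t), 0.0) + pxy * ptx
            p_ty[(t, y)] = p_ty.get((t, y), 0.0) + pxy * ptx
    return mutual_information(p_xt), mutual_information(p_ty)

def ib_lagrangian(p_xy, encoder, beta):
    i_xt, i_ty = ib_terms(p_xy, encoder)
    return i_xt - beta * i_ty

# Toy task: X uniform on {0, 1, 2, 3}, Y is the high bit of X.
p_xy = {(x, x // 2): 0.25 for x in range(4)}

identity = {x: {x: 1.0} for x in range(4)}       # keeps everything about X
coarse = {x: {x // 2: 1.0} for x in range(4)}    # keeps only the relevant bit
constant = {x: {0: 1.0} for x in range(4)}       # throws everything away

beta = 2.0
scores = {name: ib_lagrangian(p_xy, enc, beta)
          for name, enc in [("identity", identity), ("coarse", coarse), ("constant", constant)]}
for name, score in scores.items():
    print(f"{name:9s} L = {score:+.2f}")
```

With this $\beta$, the `coarse` encoder scores lowest: it keeps the one bit that predicts $Y$ and drops the bit that does not, which is the behavior the IB objective rewards.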
Channel Capacity Interpretation
Now we connect this to channel capacity. Consider the neural network as a communication channel:
- Input: Training samples
- Channel: The stochastic mapping defined by network weights
- Output: Network representation
- Noise: Induced by the stochastic nature of learning (SGD noise, dropout, etc.)
The channel capacity of this stochastic mapping is defined as:
$$C = \max_{p(x)} I(X;T)$$
This represents the maximum rate at which information can be reliably transmitted from input to representation through the network.
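For a discrete channel, the capacity (the maximum of the input-output mutual information over input distributions) can be computed with the Blahut-Arimoto algorithm. The sketch below is a minimal version of that algorithm applied to a binary symmetric channel with a 10% flip rate. Treating a network layer as such a small discrete channel is an assumption made purely for illustration.

```python
import math

def mutual_information(r, W):
    """I(X;T) in bits for input distribution r and channel W[x][t] = p(t|x)."""
    n_in, n_out = len(W), len(W[0])
    p_t = [sum(r[x] * W[x][t] for x in range(n_in)) for t in range(n_out)]
    return sum(r[x] * W[x][t] * math.log2(W[x][t] / p_t[t])
               for x in range(n_in) for t in range(n_out)
               if r[x] > 0 and W[x][t] > 0)

def blahut_arimoto(W, iters=200):
    """Return (capacity in bits, capacity-achieving input distribution)."""
    n_in, n_out = len(W), len(W[0])
    r = [1.0 / n_in] * n_in  # start from the uniform input distribution
    for _ in range(iters):
        p_t = [sum(r[x] * W[x][t] for x in range(n_in)) for t in range(n_out)]
        # d[x] = KL(W(.|x) || p_t) in nats: how informative input x currently is.
        d = [sum(W[x][t] * math.log(W[x][t] / p_t[t])
                 for t in range(n_out) if W[x][t] > 0)
             for x in range(n_in)]
        # Re-weight inputs toward the more informative ones, then normalize.
        unnorm = [r[x] * math.exp(d[x]) for x in range(n_in)]
        z = sum(unnorm)
        r = [u / z for u in unnorm]
    return mutual_information(r, W), r

flip = 0.1
bsc = [[1 - flip, flip],
       [flip, 1 - flip]]

capacity, best_input = blahut_arimoto(bsc)
# Known closed form for the binary symmetric channel: 1 - H_b(flip).
closed_form = 1 + flip * math.log2(flip) + (1 - flip) * math.log2(1 - flip)

print(f"Blahut-Arimoto capacity: {capacity:.6f} bits")
print(f"closed form:             {closed_form:.6f} bits")
```

The iterative estimate matches the textbook closed form for this channel. Noisier channels (a flip rate closer to 0.5) yield a smaller capacity, which is the sense in which noise limits how much the representation can carry.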
[Figure: Channel capacity view of LLM training through the information bottleneck.]
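During training, we're not merely minimizing $H(Y \mid T)$ but rather operating near the channel capacity limit while preserving predictive information. The IB principle tells us that optimal learning occurs when we operate at:
$$\max_{p(t \mid x)} I(T;Y) \quad \text{subject to} \quad I(X;T) \le C$$
where $C$ denotes the channel capacity.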
This creates an interesting dynamic: as training progresses, the network initially increases $I(X;T)$ (fitting phase) and then decreases it (compression phase), all while trying to maximize $I(T;Y)$.
Saturation and the Limits of Learning
The concept of saturation emerges naturally from this framework. When we say a network has "saturated" in its learning capacity, we mean it has approached the channel capacity limit:
$$I(X;T) \to C$$
Beyond this point, additional training cannot increase the mutual information between input and representation. Any further reduction in loss must come from better utilization of the fixed information budget, i.e., increasing $I(T;Y)$ for a fixed $I(X;T)$.
This explains several empirical phenomena:
- Plateaus in training loss: When $I(X;T)$ hits capacity, further optimization focuses on representation quality
- Double descent: As model capacity increases, so does channel capacity $C$, allowing better fits
- Generalization bound connection: The IB framework has been argued to give tighter generalization bounds than VC-dimension or Rademacher complexity approaches
In the next section, we'll connect these abstract information quantities to measurable properties of neural networks and derive practical implications for architecture design and training protocols.
Empirical Connections and Practical Implications
The channel capacity framework provides powerful explanations for several empirically observed phenomena in deep learning:
1. Scaling Laws and Channel Capacity
Recent work has shown that language model performance follows predictable power-law scaling with respect to model size, dataset size, and compute. In our framework:
- Model capacity directly relates to the maximum achievable channel capacity
- Dataset size affects the empirical distribution we're trying to model
- Compute determines how closely we can approach the theoretical channel capacity
The scaling laws can be interpreted as describing how close finite-trained networks come to the information-theoretic limit:
$$\text{Loss} = f\!\left(\frac{I(X;T)}{C}\right)$$
where $f$ is some monotonic function and $I(X;T)/C$ represents the utilization ratio of the available channel capacity $C$.
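The power-law form mentioned above is easy to work with in practice: on a log-log plot it becomes a straight line, so the exponent can be read off with an ordinary least-squares fit. The sketch below assumes the simplest form, loss $= a \cdot N^{-\alpha}$, with no irreducible-loss term. The data points are synthetic, generated from made-up values of `a` and `alpha`, and are not measurements from any real model.

```python
import math

def fit_power_law(sizes, losses):
    """Fit loss = a * size**(-alpha) by linear regression in log-log space."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(l) for l in losses]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, alpha)

# Synthetic example: hypothetical parameter counts and losses from a known power law.
true_a, true_alpha = 10.0, 0.08
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [true_a * n ** (-true_alpha) for n in sizes]

a_hat, alpha_hat = fit_power_law(sizes, losses)
print(f"recovered a = {a_hat:.4f}, alpha = {alpha_hat:.4f}")

# Extrapolate one order of magnitude beyond the fitted range.
predicted = a_hat * (1e10) ** (-alpha_hat)
print(f"predicted loss at 1e10 parameters: {predicted:.4f}")
```

On noiseless synthetic data the fit recovers the generating parameters. With real training runs the points scatter around the line, and the fit is only trustworthy within or near the measured range.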
2. Emergent Abilities as Phase Transitions
The sudden emergence of abilities at certain scale thresholds can be viewed as phase transitions in the information plane. As model size increases:
- Channel capacity increases continuously
- At critical thresholds, new discrete representations become encodable within the channel
- This leads to discontinuous jumps in $I(T;Y)$ for specific tasks
This explains why abilities like multi-step reasoning or tool use appear abruptly rather than gradually.
3. The Compression Phase and Generalization
The observed compression phase during training (where $I(X;T)$ decreases after peaking) aligns with the IB theory's prediction that optimal representations discard input information irrelevant to the task. Networks that fail to compress properly often exhibit:
- Memorization rather than generalization
- Poor out-of-distribution performance
- Sensitivity to input perturbations
Practical Implications for Architecture Design
Understanding neural networks through the channel capacity lens suggests several design principles:
- Capacity Matching: Architecture should be chosen such that its potential channel capacity matches the complexity of the target distribution
- Information Flow Optimization: Skip connections and normalization layers can be viewed as mechanisms to preserve information flow across layers
- Regularization as Capacity Control: Techniques like dropout, weight decay, and early stopping effectively modify the channel's noise characteristics, thereby controlling usable capacity
- Training Protocol Design: Learning rate schedules can be interpreted as annealing protocols that allow the system to explore the information plane optimally
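The idea that regularization acts as capacity control can be illustrated with a toy analogy: a binary erasure channel, where each input bit is either passed through or erased with some drop rate. This is not a model of dropout in a real network, only a sketch of how injected noise lowers the information a channel can carry. The function name `erasure_channel_mi` and the drop rates are illustrative assumptions.

```python
import math

def erasure_channel_mi(drop_rate, p_one=0.5):
    """I(X;T) in bits when binary X is erased with probability drop_rate."""
    px = {0: 1 - p_one, 1: p_one}
    joint = {}
    for x, p in px.items():
        joint[(x, x)] = p * (1 - drop_rate)    # symbol gets through unchanged
        joint[(x, "erased")] = p * drop_rate   # symbol is dropped
    pt = {}
    for (_x, t), p in joint.items():
        pt[t] = pt.get(t, 0.0) + p
    return sum(p * math.log2(p / (px[x] * pt[t]))
               for (x, t), p in joint.items() if p > 0)

for rate in (0.0, 0.1, 0.5, 0.9):
    print(f"drop rate {rate:.1f}: {erasure_channel_mi(rate):.3f} bits per symbol")
```

For a uniform binary input, the transmitted information falls linearly as `1 - drop_rate`. Raising the noise level is therefore a direct knob on usable capacity in this toy setting.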
Conclusion
Reframing neural network training through the lens of channel capacity and the Information Bottleneck principle provides a unified framework for understanding:
- Why deep learning works despite being vastly overparameterized
- How scaling laws emerge from fundamental information limits
- What constitutes optimal learning in finite-sample regimes
- How to diagnose and improve training dynamics
The most powerful insight is that intelligent systems don't merely minimize prediction error—they navigate the complex trade-off between representation sufficiency and minimality, constantly operating near the limits of what their architectural channels can reliably transmit.
This perspective shifts our focus from empirical heuristics to information-theoretic first principles, offering a more robust foundation for advancing both our theoretical understanding and practical capabilities in AI.
Takeaways:
Your ability to think about intelligence will develop (or already has developed) an unhealthy level of dependence on attention-like mechanisms and LLM-like systems. The only way to break free and help humanity progress faster is to let go of the dear knowledge you've grown comfortable with.
Only keep the information-theory-level understanding close to your heart; that won't harm your ability in any way. It lets me stay sharp, creative, and agile.