As of today, nearly every deep learning architecture paper is filled with language retrieval scores. You give the model a long token sequence as input and test whether it can retrieve something that appeared, say, 10K tokens earlier. The mathematical property then treated as the basis of (generalized) intelligence emerging from the architecture is the scaling law of its validation loss. The idea is simple: what if there is a general trend in the asymptotic shape of the curve relating validation loss, training data, training compute, and model size?
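And a remarkably consistent trend has been observed for transformer-based LLMs. Between training loss ($L$) and number of model parameters ($N$), it is a power law: $L(N) \propto N^{-\alpha_N}$, for some small positive exponent $\alpha_N$.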
However, there is one paper that tries to change this. It's called COOM: a benchmark built on the game Doom, with added randomization, that tests continual learning ability in the vision-action modality. It's my favorite architectural benchmark to date. If it hadn't already been published in 2023/4, I would have created and published it in 2025. A few friends and I were training foundation models built on different architectures in RL environments for exactly that problem, although our philosophy was a little different.

Metrics for Vision-Action RL Systems

Overview
This document defines the quantitative metrics used to evaluate vision-action reinforcement learning systems. Each metric is formally defined with explicit notation to eliminate ambiguity.
Core Performance Metrics
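1. Vision-Action RL Benchmarks
DeepLab & Gymnasium Performance Score
$$S_{\text{bench}} = \frac{1}{N} \sum_{i=1}^{N} \left( w_1 \, s_i^{\text{DL}} + w_2 \, s_i^{\text{Gym}} \right)$$
where:
- $s_i^{\text{DL}}$: Normalized success rate (0-1) on DeepLab environment task $i$
- $s_i^{\text{Gym}}$: Normalized score (0-1) on Gymnasium environment task $i$
- $N$: Number of benchmark environments
- $w_1, w_2$: Weight coefficients ($w_1 + w_2 = 1$)
- $\hat{a}_t$: Action prediction of the policy network at time $t$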
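As a minimal sketch, the benchmark score can be computed as a weighted mean of the two normalized per-environment scores. The function name `benchmark_score`, the default weights of 0.5 each, and the requirement that both benchmarks contribute the same number of environments are assumptions made for illustration, not part of the definitions above.

```python
from typing import Sequence


def benchmark_score(deeplab_success: Sequence[float],
                    gym_scores: Sequence[float],
                    w_deeplab: float = 0.5,
                    w_gym: float = 0.5) -> float:
    """Weighted mean of normalized per-environment scores, each in [0, 1]."""
    if not deeplab_success or len(deeplab_success) != len(gym_scores):
        raise ValueError("need the same non-zero number of entries per benchmark")
    if abs(w_deeplab + w_gym - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    n = len(deeplab_success)
    # Sum the weighted pair of scores for each environment, then average.
    total = sum(w_deeplab * s + w_gym * g
                for s, g in zip(deeplab_success, gym_scores))
    return total / n
```

Because both inputs are normalized and the weights sum to 1, the result also stays in the 0-1 range.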
2. Scaling Laws
Compute-Optimal Performance Scaling
$$L(D, N) = L_\infty + \frac{A}{D^{\alpha}} + \frac{B}{N^{\beta}}$$
where:
- $L(D, N)$: Final loss after training
- $D$: Dataset size (number of transitions)
- $N$: Model parameter count
- $L_\infty$: Irreducible loss (Bayes error)
- $A, B$: Scaling coefficients
- $\alpha, \beta$: Scaling exponents (typically ~0.3-0.5)
- $\hat{a}_{t:t+k}$: Multi-step action prediction for scaling analysis
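The sketch below assumes the additive power-law form: irreducible loss plus one term that decays with dataset size and one that decays with parameter count. It evaluates that form, and it recovers a single coefficient and exponent from measurements by a least-squares line fit in log-log space. The function names are hypothetical, and the fit assumes the irreducible loss has already been subtracted from the measured losses.

```python
import math
from typing import Sequence, Tuple


def predicted_loss(d: float, n: float, l_inf: float,
                   a: float, b: float, alpha: float, beta: float) -> float:
    """Irreducible loss plus power-law terms in dataset size d and parameters n."""
    return l_inf + a / d ** alpha + b / n ** beta


def fit_power_law(xs: Sequence[float],
                  excess_losses: Sequence[float]) -> Tuple[float, float]:
    """Fit excess = coef * x**(-exponent) by linear regression on logs.

    Returns (coef, exponent). All inputs must be strictly positive.
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in excess_losses]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    sxx = sum((u - mx) ** 2 for u in lx)
    sxy = sum((u - mx) * (v - my) for u, v in zip(lx, ly))
    slope = sxy / sxx
    intercept = my - slope * mx
    # The slope in log-log space is the negated scaling exponent.
    return math.exp(intercept), -slope
```

Fitting in log-log space is the simplest option, but it weights small losses heavily; a nonlinear fit of all terms at once would be a reasonable alternative.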
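3. Continual Learning Metrics
Forward Transfer (FWT)
$$\text{FWT} = \frac{1}{T-1} \sum_{i=2}^{T} \left( R_{i-1,i} - b_i \right)$$
Backward Transfer (BWT)
$$\text{BWT} = \frac{1}{T-1} \sum_{i=1}^{T-1} \left( R_{T,i} - R_{i,i} \right)$$
where:
- $R_{i,j}$: Performance on task $j$ after learning task $i$
- $T$: Total number of tasks in sequence
- $b_i$: Performance of model trained only on task $i$
- $\hat{a}_t$: Action prediction on previously learned tasks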
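A minimal sketch of both transfer metrics, following the structure used by Lopez-Paz & Ranzato, 2017. It assumes a square performance matrix where `r[i][j]` is the performance on task `j` after training through task `i` (0-indexed), and a per-task `baseline` list. Which baseline to use (a single-task model, as in the definitions here, or an untrained model, as in the cited paper) is left to the caller. The function names are hypothetical.

```python
from typing import Sequence


def backward_transfer(r: Sequence[Sequence[float]]) -> float:
    """Mean change on earlier tasks between first learning them and the end."""
    t = len(r)
    if t < 2:
        raise ValueError("need at least two tasks")
    # Compare final performance on task j with performance right after learning j.
    return sum(r[t - 1][j] - r[j][j] for j in range(t - 1)) / (t - 1)


def forward_transfer(r: Sequence[Sequence[float]],
                     baseline: Sequence[float]) -> float:
    """Mean advantage on each task before training on it, relative to a baseline."""
    t = len(r)
    if t < 2 or len(baseline) != t:
        raise ValueError("need at least two tasks and one baseline per task")
    # r[j - 1][j] is performance on task j after training only on tasks before it.
    return sum(r[j - 1][j] - baseline[j] for j in range(1, t)) / (t - 1)
```

A negative backward transfer indicates forgetting; a positive one indicates that later tasks improved earlier ones.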
4. Context Window Efficiency
Effective Context Utilization
$$\text{ECU} = \frac{\bar{C}_{\text{used}}}{C_{\max}}$$
where:
- $C_{\max}$: Maximum available context length
- $\bar{C}_{\text{used}}$: Average actually utilized context length
- $P(c)$: Expected performance with context length $c$
- $\hat{a}_t$: Action prediction using contextual information
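5. Activation Sparsity
Layer-wise Sparsity Ratio
$$s_\ell = 1 - \frac{\lVert \mathbf{a}_\ell \rVert_0}{d_\ell}$$
Overall Network Sparsity
$$S_{\text{net}} = \frac{1}{L} \sum_{\ell=1}^{L} s_\ell$$
where:
- $\mathbf{a}_\ell$: Activation vector at layer $\ell$
- $\lVert \cdot \rVert_0$: L0 norm (count of non-zero elements)
- $d_\ell$: Dimensionality of layer $\ell$
- $L$: Total number of layers
- $\hat{a}_t$: Action prediction under sparsity constraints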
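The sparsity ratios can be sketched in a few lines, assuming each layer's activations are available as a flat list of floats and that a unit counts as inactive only when it is exactly zero (which holds for ReLU outputs). Averaging the per-layer ratios with equal weight is one possible choice for the overall figure; weighting by layer width would be another. The function names are hypothetical.

```python
from typing import Sequence


def layer_sparsity(activations: Sequence[float]) -> float:
    """Fraction of exactly-zero entries: 1 - (L0 norm / dimensionality)."""
    if not activations:
        raise ValueError("layer has no activations")
    nonzero = sum(1 for a in activations if a != 0.0)
    return 1.0 - nonzero / len(activations)


def network_sparsity(layers: Sequence[Sequence[float]]) -> float:
    """Unweighted mean of the per-layer sparsity ratios."""
    if not layers:
        raise ValueError("network has no layers")
    return sum(layer_sparsity(layer) for layer in layers) / len(layers)
```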
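6. Effective Memory Horizon
Task Success Retention
$$R(\Delta t) = \frac{P_{\text{success}}(\Delta t)}{P_{\text{success}}(0)}$$
Success Rate Decay Model
$$R(\Delta t) \approx R_\infty + (1 - R_\infty) \, e^{-\Delta t / \tau}$$
where:
- $\Delta t$: Time steps (or episodes) since learning
- $P_{\text{success}}(\Delta t)$: Probability of successful task completion at delay $\Delta t$
- $\tau$: Memory decay time constant
- $R_\infty$: Asymptotic retention level
- $\hat{a}_t$: Action prediction using recalled experiences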
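A minimal sketch of an exponential decay model for retention, assuming retention starts at 1 immediately after learning and decays toward an asymptotic level with a single time constant. The `horizon` helper, which returns the delay at which retention falls to a chosen threshold, is a hypothetical way to turn the fitted parameters into a single memory horizon number; it is not defined in this document.

```python
import math


def retention(delay: float, tau: float, r_inf: float) -> float:
    """Exponential decay from 1.0 at delay 0 toward the asymptote r_inf."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    return r_inf + (1.0 - r_inf) * math.exp(-delay / tau)


def horizon(threshold: float, tau: float, r_inf: float) -> float:
    """Delay at which retention drops to `threshold` (assumed horizon definition)."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    if not (r_inf < threshold <= 1.0):
        raise ValueError("threshold must lie in (r_inf, 1]")
    # Invert the decay model: solve retention(delay) == threshold for delay.
    return -tau * math.log((threshold - r_inf) / (1.0 - r_inf))
```

For example, with `tau=10` and `r_inf=0.2`, retention reaches 0.6 (halfway to the asymptote) after `10 * ln(2)` steps.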
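7. Generalization Gap
Train vs. Test Performance Difference
$$G = J_{\mathcal{D}_{\text{train}}}(\pi) - J_{\mathcal{D}_{\text{test}}}(\pi)$$
Normalized Generalization Gap
$$G_{\text{norm}} = \frac{J_{\mathcal{D}_{\text{train}}}(\pi) - J_{\mathcal{D}_{\text{test}}}(\pi)}{\left| J_{\mathcal{D}_{\text{train}}}(\pi) \right|}$$
where:
- $\mathcal{D}_{\text{train}}$: Training environment distribution
- $\mathcal{D}_{\text{test}}$: Test environment distribution (unseen variations)
- $J(\pi; e)$: Expected return under policy $\pi$ in environment $e$
- $J_{\mathcal{D}}(\pi) = \mathbb{E}_{e \sim \mathcal{D}}[J(\pi; e)]$: Expected return across distribution $\mathcal{D}$
- $\hat{a}_t$: Action prediction in generalization settings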
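In practice the expected returns are estimated from sampled episodes. The sketch below assumes two lists of episode returns, one collected on training environments and one on unseen test variations, and estimates the gap as the difference of their means. Normalizing by the absolute training return is an assumption; the function names are hypothetical.

```python
from statistics import mean
from typing import Sequence


def generalization_gap(train_returns: Sequence[float],
                       test_returns: Sequence[float]) -> float:
    """Mean training return minus mean test return."""
    return mean(train_returns) - mean(test_returns)


def normalized_gap(train_returns: Sequence[float],
                   test_returns: Sequence[float]) -> float:
    """Gap expressed as a fraction of the absolute mean training return."""
    j_train = mean(train_returns)
    if j_train == 0:
        raise ValueError("mean training return is zero; gap cannot be normalized")
    return (j_train - mean(test_returns)) / abs(j_train)
```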
8. Success weighted by Path Length (SPL)
Navigation Efficiency Metric
$$\text{SPL} = \frac{1}{N} \sum_{i=1}^{N} S_i \, \frac{\ell_i}{\max(p_i, \ell_i)}$$
where:
- $S_i$: Success indicator (1 if successful, 0 otherwise) for episode $i$
- $p_i$: Actual path length taken by agent
- $\ell_i$: Optimal (shortest) path length
- $N$: Number of evaluation episodes
- $\hat{a}_t$: Navigation action prediction
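A short sketch of SPL as defined by Anderson et al., 2018: each successful episode contributes the ratio of the shortest path length to the larger of the actual and shortest path lengths, failed episodes contribute zero, and the result is averaged over all episodes. The tuple layout `(success, path_length, shortest_length)` and the function name are assumptions made for this example.

```python
from typing import Sequence, Tuple

Episode = Tuple[bool, float, float]  # (success, path_length, shortest_length)


def spl(episodes: Sequence[Episode]) -> float:
    """Success weighted by Path Length, averaged over all episodes."""
    if not episodes:
        raise ValueError("need at least one episode")
    total = 0.0
    for success, path_length, shortest_length in episodes:
        if shortest_length <= 0:
            raise ValueError("shortest path length must be positive")
        if success:
            # max() keeps the ratio at most 1 even if the agent's measured
            # path is shorter than the reference shortest path.
            total += shortest_length / max(path_length, shortest_length)
    return total / len(episodes)
```

For three episodes (an optimal success, a success with a path twice as long as needed, and a failure), the score is `(1 + 0.5 + 0) / 3 = 0.5`.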
Measurement Protocols
Baseline Requirements
- All metrics measured over ≥100 episodes per condition
- 95% confidence intervals reported for all means
- Cross-validation across ≥5 random seeds
- Statistical significance tested (p < 0.05)
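As a rough sketch of the confidence interval requirement, the helper below computes a mean and a two-sided interval from per-seed (or per-episode) results using only the Python standard library. It uses a normal approximation, which is an assumption: with only about five seeds, a Student's t critical value would give a wider and more appropriate interval. The function name is hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev
from typing import Sequence, Tuple


def mean_with_ci(samples: Sequence[float],
                 confidence: float = 0.95) -> Tuple[float, Tuple[float, float]]:
    """Return (mean, (lower, upper)) using a normal-approximation interval."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to estimate spread")
    m = mean(samples)
    # Standard error of the mean from the sample standard deviation.
    sem = stdev(samples) / math.sqrt(len(samples))
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return m, (m - z * sem, m + z * sem)
```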
Reporting Format
metric_name:
  value: float
  confidence_interval: [lower, upper]
  sample_size: int
  computation_time: seconds
  notes: "Any implementation details"
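One way to produce records in this shape is a small dataclass that mirrors the fields and converts to a nested dictionary, which can then be serialized with whatever YAML or JSON writer the project uses. The class name `MetricReport` and the helper `to_record` are hypothetical and only illustrate the layout.

```python
from dataclasses import asdict, dataclass
from typing import Dict, Tuple


@dataclass
class MetricReport:
    value: float
    confidence_interval: Tuple[float, float]  # (lower, upper)
    sample_size: int
    computation_time: float  # seconds
    notes: str = ""

    def __post_init__(self) -> None:
        lower, upper = self.confidence_interval
        if lower > upper:
            raise ValueError("confidence interval bounds are reversed")


def to_record(metric_name: str, report: MetricReport) -> Dict[str, dict]:
    """Nest the report under its metric name, matching the reporting format."""
    body = asdict(report)
    # Emit the interval as a two-element list: [lower, upper].
    body["confidence_interval"] = list(report.confidence_interval)
    return {metric_name: body}
```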
Implementation Notes
- All action predictions are outputs of the policy network
- Success thresholds defined per environment (typically ≥0.95 of maximum score)
- Context window efficiency measured with progressive masking
- Activation sparsity measured at inference time with ReLU activations
- Memory horizon evaluated with increasing delay between learning and testing
References
- DeepLab: Chen et al., 2018
- Gymnasium: Towers et al., 2023
- SPL: Anderson et al., 2018
- Scaling Laws: Kaplan et al., 2020
- Continual Learning: Lopez-Paz & Ranzato, 2017
Last updated: February 2, 2026
Definitions subject to refinement with empirical validation