How do you compare two different architectures? (not models)

2/2/2026

As of today, deep learning architecture papers are filled with language retrieval scores. You feed the model a long token sequence and test whether it can retrieve something that appeared, say, 10K tokens earlier. The mathematical property that is then treated as the basis of (generalized) intelligence emerging from the architecture is the scaling law of its validation loss. The idea is simple: what if there is a more general, Big-O-style trend in the curve relating validation loss, training data, training compute, and model size?

And a consistent trend has indeed been observed for transformer-based LLMs. Between training loss ($L$) and number of model parameters ($N$), it is a power law with exponent $\alpha$: $L \propto N^{-\alpha}$
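To make the power law concrete, here is a minimal sketch of how the exponent $\alpha$ could be estimated: take logs of both sides, so that $\log L = \log c - \alpha \log N$, and fit a straight line by least squares. The function name, the constant `c`, and the data points are all hypothetical; the data is synthetic and generated from a known exponent purely to show that the fit recovers it.

```python
import math


def fit_power_law_exponent(param_counts, losses):
    """Least-squares fit of log L = log c - alpha * log N. Returns (alpha, c)."""
    xs = [math.log(n) for n in param_counts]
    ys = [math.log(loss) for loss in losses]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # The slope of the log-log line is -alpha.
    return -slope, math.exp(intercept)


# Synthetic (made-up) measurements generated from L = 10 * N^-0.25.
param_counts = [1e6, 1e7, 1e8, 1e9]
losses = [10.0 * n ** -0.25 for n in param_counts]

alpha, c = fit_power_law_exponent(param_counts, losses)
print(f"alpha = {alpha:.3f}, c = {c:.3f}")
```

On real training runs the points will not sit exactly on a line, so the fitted exponent is only as trustworthy as the range of model sizes it was fitted over.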

However, there is one paper that tries to change that. It's called COOM. It's a version of Doom with more randomization, built to test continual learning ability in the vision-action modality. It's my favorite architectural benchmark to date. If it hadn't already been published in 2023/4, I would have created and published it in 2025. A few friends and I were training foundation models built on different architectures in RL environments for just that problem, although our philosophy was a little different.

Observation — properties and metrics

Metrics for Vision-Action RL Systems

Metric — a measure for something

Overview

This document defines the quantitative metrics used to evaluate vision-action reinforcement learning systems. Each metric is formally defined with explicit notation to eliminate ambiguity.

Core Performance Metrics

1. Vision-Action RL Benchmarks

DeepLab & Gymnasium Performance Score

$$P = \frac{1}{N} \sum_{i=1}^{N} \left( \alpha \cdot R_i^{\text{DeepLab}} + \beta \cdot R_i^{\text{Gym}} \right)$$

where:

  • $R_i^{\text{DeepLab}}$: Normalized success rate (0-1) on DeepLab environment task $i$
  • $R_i^{\text{Gym}}$: Normalized score (0-1) on Gymnasium environment task $i$
  • $N$: Number of benchmark environments
  • $\alpha, \beta$: Weight coefficients ($\alpha + \beta = 1$)
  • $\hat{a}_t$: Action prediction of the policy network at time $t$
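As a rough illustration of the performance score, the sketch below averages the weighted per-task scores. The function name, the default weights of 0.5 each, and the example scores are assumptions for illustration only; the definition above only requires that the two weights sum to 1 and that each score is already normalized to the 0-1 range.

```python
def performance_score(deeplab_scores, gym_scores, alpha=0.5, beta=0.5):
    """P = (1/N) * sum_i (alpha * R_i^DeepLab + beta * R_i^Gym)."""
    if not deeplab_scores or len(deeplab_scores) != len(gym_scores):
        raise ValueError("need one score per benchmark task from each suite")
    if abs(alpha + beta - 1.0) > 1e-9:
        raise ValueError("alpha and beta must sum to 1")
    for score in list(deeplab_scores) + list(gym_scores):
        if not 0.0 <= score <= 1.0:
            raise ValueError("scores must be normalized to [0, 1]")
    n = len(deeplab_scores)
    weighted = (alpha * d + beta * g for d, g in zip(deeplab_scores, gym_scores))
    return sum(weighted) / n


# Hypothetical normalized scores on N = 2 benchmark tasks.
p = performance_score([0.8, 0.6], [0.4, 1.0])
print(f"P = {p:.2f}")
```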

2. Scaling Laws

Compute-Optimal Performance Scaling

$$L(D, N) = E_0 + \frac{A}{D^\alpha} + \frac{B}{N^\beta}$$

where:

  • $L(D, N)$: Final loss after training
  • $D$: Dataset size (number of transitions)
  • $N$: Model parameter count
  • $E_0$: Irreducible loss (Bayes error)
  • $A, B$: Scaling coefficients
  • $\alpha, \beta$: Scaling exponents (typically ~0.3-0.5)
  • $\hat{a}_{t+\Delta}$: Multi-step action prediction for scaling analysis
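The two-term form is easiest to read by plugging in numbers. The sketch below evaluates it directly; every coefficient is hypothetical and chosen only so the arithmetic is easy to follow, not taken from any fitted model.

```python
def scaling_loss(dataset_size, param_count, e0, a, b, alpha, beta):
    """L(D, N) = E_0 + A / D^alpha + B / N^beta."""
    if dataset_size <= 0 or param_count <= 0:
        raise ValueError("D and N must be positive")
    return e0 + a / dataset_size ** alpha + b / param_count ** beta


# Hypothetical coefficients, not fitted to any real system.
coeffs = dict(e0=1.0, a=10.0, b=10.0, alpha=0.5, beta=0.5)

base = scaling_loss(100, 100, **coeffs)           # 1 + 10/10 + 10/10
more_data = scaling_loss(10_000, 100, **coeffs)   # 1 + 10/100 + 10/10
more_params = scaling_loss(100, 10_000, **coeffs) # 1 + 10/10 + 10/100

print(base, more_data, more_params)
```

With symmetric coefficients, scaling data or parameters alone helps equally, and neither can push the loss below $E_0$. In a real comparison, the fitted exponents and $E_0$ are the quantities that differ between architectures.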

3. Continual Learning Metrics

Forward Transfer (FWT)

$$\text{FWT} = \frac{1}{T-1} \sum_{i=2}^{T} \left( P_i^{(i)} - P_i^{\text{baseline}} \right)$$

Backward Transfer (BWT)

$$\text{BWT} = \frac{1}{T-1} \sum_{i=1}^{T-1} \left( P_i^{(T)} - P_i^{(i)} \right)$$

where:

  • $P_i^{(j)}$: Performance on task $i$ after learning task $j$
  • $T$: Total number of tasks in sequence
  • $P_i^{\text{baseline}}$: Performance of model trained only on task $i$
  • $\hat{a}_{\text{prev}}$: Action prediction on previously learned tasks
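Both transfer metrics can be read off a single performance matrix. The sketch below assumes a hypothetical layout `perf[j][i]`, meaning performance on task `i` after learning task `j`, with tasks indexed from 0 instead of 1; the numbers are invented for illustration.

```python
def forward_transfer(perf, baseline):
    """FWT: mean of P_i^(i) - P_i^baseline over tasks 2..T (indices 1..T-1)."""
    t = len(perf)
    if t < 2:
        raise ValueError("need at least two tasks")
    return sum(perf[i][i] - baseline[i] for i in range(1, t)) / (t - 1)


def backward_transfer(perf):
    """BWT: mean of P_i^(T) - P_i^(i) over tasks 1..T-1 (indices 0..T-2)."""
    t = len(perf)
    if t < 2:
        raise ValueError("need at least two tasks")
    return sum(perf[t - 1][i] - perf[i][i] for i in range(t - 1)) / (t - 1)


# Made-up results for a sequence of T = 3 tasks.
# Row j = state of the model after learning task j; column i = task evaluated.
perf = [
    [0.9, 0.1, 0.0],
    [0.7, 0.8, 0.2],
    [0.6, 0.7, 0.9],
]
baseline = [0.9, 0.7, 0.8]  # each task learned in isolation

fwt = forward_transfer(perf, baseline)
bwt = backward_transfer(perf)
print(f"FWT = {fwt:+.2f}, BWT = {bwt:+.2f}")
```

A negative BWT, as in this example, means that learning later tasks degraded performance on earlier ones, which is the usual signature of forgetting.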

4. Context Window Efficiency

Effective Context Utilization

$$\eta_{\text{ctx}} = \frac{\mathbb{E}[\text{Performance}(L_{\text{used}})]}{\mathbb{E}[\text{Performance}(L_{\text{max}})]} \times \frac{L_{\text{used}}}{L_{\text{max}}}$$

where:

  • $L_{\text{max}}$: Maximum available context length
  • $L_{\text{used}}$: Average actually utilized context length
  • $\mathbb{E}[\text{Performance}(L)]$: Expected performance with context length $L$
  • $\hat{a}_{\text{ctx}}$: Action prediction using contextual information
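A minimal sketch of the ratio, assuming the two expectations are estimated as plain means over per-episode performance measured at each context length. The function name, the sample values, and the context lengths are hypothetical.

```python
from statistics import mean


def context_efficiency(perf_at_used, perf_at_max, l_used, l_max):
    """eta_ctx = (E[Perf(L_used)] / E[Perf(L_max)]) * (L_used / L_max)."""
    if not 0 < l_used <= l_max:
        raise ValueError("require 0 < L_used <= L_max")
    perf_max = mean(perf_at_max)
    if perf_max == 0:
        raise ValueError("performance at L_max is zero; ratio undefined")
    return (mean(perf_at_used) / perf_max) * (l_used / l_max)


# Invented per-episode performance at two context lengths.
eta = context_efficiency(
    perf_at_used=[0.5, 0.7],  # mean 0.6 at L_used
    perf_at_max=[0.7, 0.9],   # mean 0.8 at L_max
    l_used=2048,
    l_max=8192,
)
print(f"eta_ctx = {eta:.4f}")
```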

5. Activation Sparsity

Layer-wise Sparsity Ratio

$$S_l = 1 - \frac{\|\mathbf{h}_l\|_0}{d_l}$$

Overall Network Sparsity

$$S = \sum_{l=1}^{L} S_l \times \frac{d_l}{\sum_{j=1}^{L} d_j}$$

where:

  • $\mathbf{h}_l$: Activation vector at layer $l$
  • $\|\cdot\|_0$: L0 norm (count of non-zero elements)
  • $d_l$: Dimensionality of layer $l$
  • $L$: Total number of layers
  • $\hat{a}_{\text{sparse}}$: Action prediction under sparsity constraints
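The sketch below computes both sparsity figures from raw activation vectors. It makes two assumptions that should be read as such: an activation counts as zero when its magnitude is at most a hypothetical tolerance `tol` (0 by default, which suits exact zeros from ReLU), and the overall figure is taken as the dimension-weighted mean $\sum_l S_l d_l / \sum_j d_j$, which keeps it in the 0-1 range and equals the fraction of zero activations across all units.

```python
def layer_sparsity(activations, tol=0.0):
    """S_l = 1 - ||h_l||_0 / d_l for one layer's activation vector."""
    if not activations:
        raise ValueError("layer has no units")
    nonzero = sum(1 for h in activations if abs(h) > tol)
    return 1.0 - nonzero / len(activations)


def network_sparsity(layers, tol=0.0):
    """Dimension-weighted mean of per-layer sparsity (assumed normalization)."""
    total_units = sum(len(h) for h in layers)
    weighted = sum(layer_sparsity(h, tol) * len(h) for h in layers)
    return weighted / total_units


# Made-up post-ReLU activations for a two-layer network (d_1 = 4, d_2 = 8).
layers = [
    [0.0, 1.2, 0.0, 0.3],
    [0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.5],
]

per_layer = [layer_sparsity(h) for h in layers]
overall = network_sparsity(layers)
print(per_layer, round(overall, 4))
```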

6. Effective Memory Horizon

Task Success Retention

$$H_{\text{eff}} = \max\{T \mid \text{SuccessRate}(T) > 0.95\}$$

Success Rate Decay Model

$$\text{SuccessRate}(T) = (1 - S_{\infty}) \exp\left(-\frac{T}{\tau}\right) + S_{\infty}$$

where:

  • $T$: Time steps (or episodes) since learning
  • $\text{SuccessRate}(T)$: Probability of successful task completion at delay $T$
  • $\tau$: Memory decay time constant
  • $S_{\infty}$: Asymptotic retention level
  • $\hat{a}_{\text{memory}}$: Action prediction using recalled experiences
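The effective horizon can be read directly off measured success rates without fitting the decay model. The sketch below applies the definition literally to a hypothetical mapping from delay to measured success rate; the delays and rates are invented, and the function returns `None` when no tested delay clears the threshold.

```python
def effective_memory_horizon(success_by_delay, threshold=0.95):
    """H_eff = max{T : SuccessRate(T) > threshold} over the tested delays."""
    passing = [delay for delay, rate in success_by_delay.items() if rate > threshold]
    return max(passing) if passing else None


# Invented measurements: delay in time steps -> success rate at that delay.
measured = {0: 1.00, 10: 0.99, 100: 0.97, 1000: 0.90, 10000: 0.60}

h_eff = effective_memory_horizon(measured)
print(h_eff)  # 100
```

Because the result can only be one of the tested delays, the spacing of the delay grid bounds how precisely $H_{\text{eff}}$ can be reported.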

7. Generalization Gap

Train vs. Test Performance Difference

$$\Delta G = \mathbb{E}_{\mathcal{D}_{\text{train}}}[R(\pi)] - \mathbb{E}_{\mathcal{D}_{\text{test}}}[R(\pi)]$$

Normalized Generalization Gap

$$\Delta G_{\text{norm}} = \frac{\Delta G}{\mathbb{E}_{\mathcal{D}_{\text{train}}}[R(\pi)]}$$

where:

  • $\mathcal{D}_{\text{train}}$: Training environment distribution
  • $\mathcal{D}_{\text{test}}$: Test environment distribution (unseen variations)
  • $R(\pi)$: Expected return under policy $\pi$
  • $\mathbb{E}_{\mathcal{D}}[R(\pi)]$: Expected return across distribution $\mathcal{D}$
  • $\hat{a}_{\text{gen}}$: Action prediction in generalization settings
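A small sketch of both gap figures, assuming each expectation is estimated as the mean return over environments sampled from the corresponding distribution. The function name and the return values are hypothetical.

```python
from statistics import mean


def generalization_gap(train_returns, test_returns):
    """Return (delta_G, delta_G_norm) from per-environment returns."""
    train_mean = mean(train_returns)
    gap = train_mean - mean(test_returns)
    if train_mean == 0:
        raise ValueError("normalized gap is undefined when mean train return is 0")
    return gap, gap / train_mean


# Made-up returns of one policy on training and held-out environments.
gap, gap_norm = generalization_gap(
    train_returns=[10.0, 12.0, 14.0],  # mean 12
    test_returns=[8.0, 9.0, 10.0],     # mean 9
)
print(gap, gap_norm)  # 3.0 0.25
```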

8. Success weighted by Path Length (SPL)

Navigation Efficiency Metric

$$\text{SPL} = \frac{1}{N} \sum_{i=1}^{N} S_i \times \frac{l_i^*}{\max(l_i, l_i^*)}$$

where:

  • $S_i$: Success indicator (1 if successful, 0 otherwise) for episode $i$
  • $l_i$: Actual path length taken by agent
  • $l_i^*$: Optimal (shortest) path length
  • $N$: Number of evaluation episodes
  • $\hat{a}_{\text{navigate}}$: Navigation action prediction
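SPL is straightforward to compute once each episode is reduced to a success flag and two path lengths. In the sketch below, the tuple layout `(success, path_length, optimal_length)` and the episode data are assumptions made for illustration.

```python
def spl(episodes):
    """SPL = (1/N) * sum_i S_i * l*_i / max(l_i, l*_i)."""
    if not episodes:
        raise ValueError("need at least one episode")
    total = 0.0
    for success, path_length, optimal_length in episodes:
        if optimal_length <= 0 or path_length < 0:
            raise ValueError("path lengths must be non-negative, optimal > 0")
        total += success * optimal_length / max(path_length, optimal_length)
    return total / len(episodes)


# Invented episodes: (success, actual path length, shortest path length).
episodes = [
    (1, 10.0, 10.0),  # success along the optimal path -> contributes 1.0
    (1, 20.0, 10.0),  # success, but path twice as long -> contributes 0.5
    (0, 5.0, 10.0),   # failure -> contributes 0.0
]

score = spl(episodes)
print(score)  # 0.5
```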

Measurement Protocols

Baseline Requirements

  1. All metrics measured over ≥100 episodes per condition
  2. 95% confidence intervals reported for all means
  3. Cross-validation across ≥5 random seeds
  4. Statistical significance tested (p < 0.05)

Reporting Format

metric_name:
  value: float
  confidence_interval: [lower, upper]
  sample_size: int
  computation_time: seconds
  notes: "Any implementation details"
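One way to fill in this format is sketched below. It assumes a normal-approximation 95% confidence interval (mean ± 1.96 standard errors), which is reasonable at the ≥100 episodes required above but is only one of several valid interval constructions. The function name and the episode outcomes are hypothetical, and `computation_time` is passed in by the caller, since only the caller knows how long the evaluation took.

```python
import math
import statistics


def summarize_metric(name, values, computation_time, notes="", z=1.96):
    """Build one entry in the reporting format with a normal-approx. 95% CI."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples for a confidence interval")
    mean = statistics.mean(values)
    half_width = z * statistics.stdev(values) / math.sqrt(n)
    return {
        name: {
            "value": mean,
            "confidence_interval": [mean - half_width, mean + half_width],
            "sample_size": n,
            "computation_time": computation_time,
            "notes": notes,
        }
    }


# Invented outcomes of 100 evaluation episodes (1.0 = success, 0.0 = failure).
outcomes = [0.0] * 50 + [1.0] * 50

report = summarize_metric(
    "success_rate", outcomes, computation_time=1.5, notes="synthetic example"
)
print(report)
```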

Implementation Notes

  • All action predictions $\hat{a}$ are outputs of the policy network $\pi_\theta(o_t)$
  • Success thresholds defined per environment (typically ≥0.95 of maximum score)
  • Context window efficiency measured with progressive masking
  • Activation sparsity measured at inference time with ReLU activations
  • Memory horizon evaluated with increasing delay between learning and testing

References

  1. DeepLab: Chen et al., 2018
  2. Gymnasium: Towers et al., 2023
  3. SPL: Anderson et al., 2018
  4. Scaling Laws: Kaplan et al., 2020
  5. Continual Learning: Lopez-Paz & Ranzato, 2017

Last updated: February 2, 2026
Definitions subject to refinement with empirical validation
