How do you compare two different architectures? (not models)

2/2/2026

As of today, deep learning architecture papers are filled with language retrieval scores: you give the model a long token sequence as input and test whether it can retrieve something that appeared, say, 10K tokens earlier. The mathematical property that is then treated as the basis of (generalized) intelligence emerging from the architecture is the scaling law of its validation loss. The idea is simple: is there a general trend in the asymptotic (Big-O style) shape of the curve relating validation loss, training data, training compute, and model size?

For transformer-based LLMs, a consistent trend has indeed been observed. Between the loss ( $L$ ) and the number of model parameters ( $N$ ), it is: $L \propto N^{-\alpha}$, where $\alpha > 0$ is the scaling exponent.
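Because a power law is a straight line in log-log space, the exponent can be estimated with ordinary least squares on (log N, log L). The following is a minimal sketch under that assumption; the function name `fit_power_law_exponent` and the synthetic data points are illustrative, not measurements from any paper.

```python
import math

def fit_power_law_exponent(param_counts, losses):
    """Estimate alpha in L ∝ N^(-alpha) via least squares on log-log data."""
    xs = [math.log(n) for n in param_counts]
    ys = [math.log(loss) for loss in losses]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    # The fitted slope is -alpha, so flip the sign.
    return -slope

# Synthetic (hypothetical) points generated from L = 5 * N^(-0.08).
param_counts = [1e6, 1e7, 1e8, 1e9]
losses = [5.0 * n ** -0.08 for n in param_counts]

alpha = fit_power_law_exponent(param_counts, losses)
print(f"estimated alpha = {alpha:.4f}")
```

On real runs the points will not sit exactly on a line, so the fit is only meaningful over the range of model sizes where the log-log plot looks roughly linear.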

However, there is one paper that tries to change this. It's called COOM: a version of Doom with more randomization, built to test continual learning ability in the vision-action modality. It's my favorite architectural benchmark to date. If it hadn't already been published in 2023/4, I would have created and published it in 2025. A few friends and I were training foundation models built on different architectures in RL environments for exactly that problem, although our philosophy was a little different.

Observation — properties and metrics

Metrics for Vision-Action RL Systems

Metric — a measure for something

Overview

This document defines the quantitative metrics used to evaluate vision-action reinforcement learning systems. Each metric is formally defined with explicit notation to eliminate ambiguity.

Core Performance Metrics

1. Vision-Action RL Benchmarks

DeepLab & Gymnasium Performance Score

P = \frac{1}{N} \sum_{i=1}^{N} \left( \alpha \cdot R_i^{\text{DeepLab}} + \beta \cdot R_i^{\text{Gym}} \right)

where:

$R_i^{\text{DeepLab}}$ : Normalized success rate (0-1) on DeepLab environment task $i$
$R_i^{\text{Gym}}$ : Normalized score (0-1) on Gymnasium environment task $i$
$N$ : Number of benchmark environments
$\alpha, \beta$ : Weight coefficients ( $\alpha + \beta = 1$ )
$\hat{a}_t$ : Action prediction of the policy network at time $t$
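As a rough illustration of the score above, the sketch below averages the weighted per-task scores. The function name `benchmark_score` and the example numbers are hypothetical; it assumes both score lists are already normalized to the 0-1 range and aligned by task index.

```python
def benchmark_score(deeplab_scores, gym_scores, alpha=0.5):
    """Weighted mean of per-task scores, with beta fixed to 1 - alpha."""
    if not deeplab_scores or len(deeplab_scores) != len(gym_scores):
        raise ValueError("need two non-empty score lists of equal length")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    beta = 1.0 - alpha  # enforces alpha + beta = 1
    n = len(deeplab_scores)
    return sum(alpha * d + beta * g
               for d, g in zip(deeplab_scores, gym_scores)) / n

# Hypothetical normalized scores for two benchmark tasks.
score = benchmark_score([0.8, 0.6], [0.4, 1.0], alpha=0.5)
print(f"P = {score:.3f}")
```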

2. Scaling Laws

Compute-Optimal Performance Scaling

L(D, N) = E_0 + \frac{A}{D^\alpha} + \frac{B}{N^\beta}

where:

$L(D, N)$ : Final loss after training
$D$ : Dataset size (number of transitions)
$N$ : Model parameter count
$E_0$ : Irreducible loss (Bayes error)
$A, B$ : Scaling coefficients
$\alpha, \beta$ : Scaling exponents (typically ~0.3-0.5)
$\hat{a}_{t+\Delta}$ : Multi-step action prediction for scaling analysis
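To get a feel for how the two terms trade off, the sketch below evaluates the loss formula and compares the gain from doubling the data against the gain from doubling the parameters. All constants are hypothetical placeholders chosen for illustration, not fitted values.

```python
def predicted_loss(d, n, e0, a, b, alpha, beta):
    """L(D, N) = E_0 + A / D^alpha + B / N^beta."""
    return e0 + a / d ** alpha + b / n ** beta

# Hypothetical constants, for illustration only.
constants = dict(e0=1.5, a=400.0, b=400.0, alpha=0.35, beta=0.35)
D, N = 1e9, 1e8  # transitions, parameters

base = predicted_loss(D, N, **constants)
gain_data = base - predicted_loss(2 * D, N, **constants)
gain_params = base - predicted_loss(D, 2 * N, **constants)

print(f"loss at (D, N): {base:.3f}")
print(f"gain from doubling data: {gain_data:.3f}")
print(f"gain from doubling parameters: {gain_params:.3f}")
```

With these particular constants the parameter term dominates, so doubling the model helps more than doubling the data; the balance flips once the data term is the larger one.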

3. Continual Learning Metrics

Forward Transfer (FWT)

\text{FWT} = \frac{1}{T-1} \sum_{i=2}^{T} \left( P_i^{(i)} - P_i^{\text{baseline}} \right)

Backward Transfer (BWT)

\text{BWT} = \frac{1}{T-1} \sum_{i=1}^{T-1} \left( P_i^{(T)} - P_i^{(i)} \right)

where:

$P_i^{(j)}$ : Performance on task $i$ after learning task $j$
$T$ : Total number of tasks in sequence
$P_i^{\text{baseline}}$ : Performance of model trained only on task $i$
$\hat{a}_{\text{prev}}$ : Action prediction on previously learned tasks
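Both quantities can be read off a T x T performance matrix. The sketch below assumes a hypothetical layout where `perf[j][i]` holds the performance on task i after learning task j (zero-indexed), so that `perf[i][i]` corresponds to $P_i^{(i)}$ and the last row to $P_i^{(T)}$. The numbers are made up.

```python
def forward_transfer(perf, baseline):
    """Mean of P_i^(i) - P_i^baseline over tasks 2..T."""
    t = len(perf)
    return sum(perf[i][i] - baseline[i] for i in range(1, t)) / (t - 1)

def backward_transfer(perf):
    """Mean of P_i^(T) - P_i^(i) over tasks 1..T-1."""
    t = len(perf)
    return sum(perf[t - 1][i] - perf[i][i] for i in range(t - 1)) / (t - 1)

# Hypothetical results for a sequence of three tasks.
# Row j: evaluation after learning task j. Column i: task evaluated.
perf = [
    [0.9, 0.1, 0.0],
    [0.7, 0.8, 0.2],
    [0.6, 0.7, 0.9],
]
baseline = [0.9, 0.7, 0.8]  # single-task models, one per task

fwt = forward_transfer(perf, baseline)
bwt = backward_transfer(perf)
print(f"FWT = {fwt:+.2f}, BWT = {bwt:+.2f}")
```

A negative BWT, as in this example, indicates forgetting: performance on earlier tasks dropped by the end of the sequence.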

4. Context Window Efficiency

Effective Context Utilization

\eta_{\text{ctx}} = \frac{\mathbb{E}[\text{Performance}(L_{\text{used}})]}{\mathbb{E}[\text{Performance}(L_{\text{max}})]} \times \frac{L_{\text{used}}}{L_{\text{max}}}

where:

$L_{\text{max}}$ : Maximum available context length
$L_{\text{used}}$ : Average actually utilized context length
$\mathbb{E}[\text{Performance}(L)]$ : Expected performance with context length $L$
$\hat{a}_{\text{ctx}}$ : Action prediction using contextual information
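A direct transcription of the ratio above might look like the following sketch. It assumes the expectations are approximated by the mean over evaluation episodes; the function name `context_utilization` and the sample values are hypothetical.

```python
from statistics import mean

def context_utilization(perf_used, perf_max, len_used, len_max):
    """(E[perf at L_used] / E[perf at L_max]) * (L_used / L_max)."""
    if len_max <= 0 or not 0 <= len_used <= len_max:
        raise ValueError("require 0 <= len_used <= len_max and len_max > 0")
    expected_max = mean(perf_max)
    if expected_max == 0:
        raise ValueError("performance at full context must be non-zero")
    return (mean(perf_used) / expected_max) * (len_used / len_max)

# Hypothetical per-episode performance at the two context lengths.
eta = context_utilization(
    perf_used=[0.70, 0.74],
    perf_max=[0.78, 0.82],
    len_used=2048,
    len_max=8192,
)
print(f"eta_ctx = {eta:.3f}")
```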

5. Activation Sparsity

Layer-wise Sparsity Ratio

S_l = 1 - \frac{\|\mathbf{h}_l\|_0}{d_l}

Overall Network Sparsity

S = \sum_{l=1}^{L} S_l \times \frac{d_l}{\sum_{j=1}^{L} d_j}

where:

$\mathbf{h}_l$ : Activation vector at layer $l$
$\|\cdot\|_0$ : L0 norm (count of non-zero elements)
$d_l$ : Dimensionality of layer $l$
$L$ : Total number of layers
$\hat{a}_{\text{sparse}}$ : Action prediction under sparsity constraints
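The sketch below computes both quantities from plain Python lists standing in for activation vectors. Note that the weights $d_l / \sum_j d_j$ already sum to one, so this sketch treats the overall figure as a plain dimension-weighted average, with no additional division by the layer count. Function names and activations are hypothetical.

```python
def layer_sparsity(activations):
    """S_l = 1 - ||h_l||_0 / d_l for one activation vector."""
    nonzero = sum(1 for h in activations if h != 0.0)
    return 1.0 - nonzero / len(activations)

def network_sparsity(layers):
    """Dimension-weighted average of the per-layer sparsity ratios."""
    total_dim = sum(len(h) for h in layers)
    return sum(layer_sparsity(h) * len(h) for h in layers) / total_dim

# Hypothetical post-ReLU activations for a two-layer network.
layers = [
    [0.0, 1.5, 0.0, 0.0],  # d_1 = 4, three zeros
    [0.2, 0.0],            # d_2 = 2, one zero
]

per_layer = [layer_sparsity(h) for h in layers]
overall = network_sparsity(layers)
print(f"per-layer: {per_layer}, overall: {overall:.3f}")
```

With this weighting, the overall value equals the fraction of zero units across the whole network.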

6. Effective Memory Horizon

Task Success Retention

H_{\text{eff}} = \max\{T \mid \text{SuccessRate}(T) > 0.95\}

Success Rate Decay Model

\text{SuccessRate}(T) = (1 - S_{\infty}) \exp\left(-\frac{T}{\tau}\right) + S_{\infty}

where:

$T$ : Time steps (or episodes) since learning
$\text{SuccessRate}(T)$ : Probability of successful task completion at delay $T$
$\tau$ : Memory decay time constant
$S_{\infty}$ : Asymptotic retention level
$\hat{a}_{\text{memory}}$ : Action prediction using recalled experiences
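Independently of any fitted decay curve, the horizon can be estimated directly from measured success rates. The sketch below assumes a hypothetical mapping from delay to empirical success rate and applies the definition of $H_{\text{eff}}$ literally.

```python
def effective_memory_horizon(success_by_delay, threshold=0.95):
    """Largest delay T whose measured success rate exceeds the threshold.

    Returns None when no measured delay clears the threshold.
    """
    qualifying = [delay for delay, rate in success_by_delay.items()
                  if rate > threshold]
    return max(qualifying) if qualifying else None

# Hypothetical success rates measured at increasing delays (in episodes).
measured = {0: 1.00, 10: 0.98, 50: 0.96, 100: 0.93, 200: 0.80}

horizon = effective_memory_horizon(measured)
print(f"H_eff = {horizon}")
```

The result is only as fine-grained as the set of delays that were actually evaluated, so a denser sweep around the threshold crossing gives a tighter estimate.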

7. Generalization Gap

Train vs. Test Performance Difference

\Delta G = \mathbb{E}_{\mathcal{D}_{\text{train}}}[R(\pi)] - \mathbb{E}_{\mathcal{D}_{\text{test}}}[R(\pi)]

Normalized Generalization Gap

\Delta G_{\text{norm}} = \frac{\Delta G}{\mathbb{E}_{\mathcal{D}_{\text{train}}}[R(\pi)]}

where:

$\mathcal{D}_{\text{train}}$ : Training environment distribution
$\mathcal{D}_{\text{test}}$ : Test environment distribution (unseen variations)
$R(\pi)$ : Expected return under policy $\pi$
$\mathbb{E}_{\mathcal{D}}[R(\pi)]$ : Expected return across distribution $\mathcal{D}$
$\hat{a}_{\text{gen}}$ : Action prediction in generalization settings
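In practice the expectations are replaced by sample means over evaluation episodes. A minimal sketch under that assumption follows, with hypothetical episode returns.

```python
from statistics import mean

def generalization_gap(train_returns, test_returns):
    """Return (gap, normalized gap) from per-episode returns."""
    train_mean = mean(train_returns)
    test_mean = mean(test_returns)
    if train_mean == 0:
        raise ValueError("normalized gap is undefined for zero train return")
    gap = train_mean - test_mean
    return gap, gap / train_mean

# Hypothetical returns on training and held-out environment variations.
train_returns = [100.0, 120.0, 110.0]
test_returns = [80.0, 90.0, 94.0]

gap, gap_norm = generalization_gap(train_returns, test_returns)
print(f"gap = {gap:.1f}, normalized gap = {gap_norm:.2f}")
```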

8. Success weighted by Path Length (SPL)

Navigation Efficiency Metric

\text{SPL} = \frac{1}{N} \sum_{i=1}^{N} S_i \times \frac{l_i^*}{\max(l_i, l_i^*)}

where:

$S_i$ : Success indicator (1 if successful, 0 otherwise) for episode $i$
$l_i$ : Actual path length taken by agent
$l_i^*$ : Optimal (shortest) path length
$N$ : Number of evaluation episodes
$\hat{a}_{\text{navigate}}$ : Navigation action prediction
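The SPL formula translates almost line for line into code. The sketch below assumes each episode is a hypothetical tuple of (success, actual path length, shortest path length); the episode data is made up.

```python
def spl(episodes):
    """Success weighted by Path Length over a list of episodes.

    Each episode is (success: bool, actual_length, shortest_length).
    """
    total = 0.0
    for success, actual, shortest in episodes:
        if success:
            # max() caps the ratio at 1 if the logged path is shorter
            # than the reference shortest path.
            total += shortest / max(actual, shortest)
    return total / len(episodes)

# Hypothetical evaluation episodes.
episodes = [
    (True, 10.0, 10.0),   # optimal path
    (True, 20.0, 10.0),   # success, but twice the optimal length
    (False, 5.0, 10.0),   # failure contributes zero
    (True, 8.0, 10.0),    # shorter than reference, capped at 1
]

score = spl(episodes)
print(f"SPL = {score:.3f}")
```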

Measurement Protocols

Baseline Requirements

All metrics measured over ≥100 episodes per condition
95% confidence intervals reported for all means
Cross-validation across ≥5 random seeds
Statistical significance tested (p < 0.05)
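For the confidence interval requirement, one simple option is the normal approximation, which is reasonable at 100 or more episodes. The sketch below is one way to do it, not a prescribed procedure; the function name `mean_confidence_interval` and the episode outcomes are hypothetical.

```python
import math
from statistics import mean, stdev

def mean_confidence_interval(samples, z=1.96):
    """Mean and normal-approximation CI (z = 1.96 gives roughly 95%)."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    m = mean(samples)
    half_width = z * stdev(samples) / math.sqrt(len(samples))
    return m, (m - half_width, m + half_width)

# Hypothetical success indicators from 100 evaluation episodes.
outcomes = [0.0, 1.0] * 50

m, (lower, upper) = mean_confidence_interval(outcomes)
print(f"mean = {m:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

For small sample counts, such as means taken over only 5 seeds, a Student t quantile or a bootstrap interval would be the more cautious choice.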

Reporting Format

metric_name:
  value: float
  confidence_interval: [lower, upper]
  sample_size: int
  computation_time: seconds
  notes: "Any implementation details"

Implementation Notes

All action predictions $\hat{a}$ are outputs of the policy network $\pi_\theta(o_t)$
Success thresholds defined per environment (typically ≥0.95 of maximum score)
Context window efficiency measured with progressive masking
Activation sparsity measured at inference time with ReLU activations
Memory horizon evaluated with increasing delay between learning and testing

References

DeepLab: Chen et al., 2018
Gymnasium: Towers et al., 2023
SPL: Anderson et al., 2018
Scaling Laws: Kaplan et al., 2020
Continual Learning: Lopez-Paz & Ranzato, 2017

Last updated: February 2, 2026
Definitions subject to refinement with empirical validation
