Whitepaper

Comparative Behavioral Stability of Large Language Models

A 37-model study that compares frontier model stability using deterministic behavioral analysis rather than capability-only evaluation.

18 Jun 2025 · Manuscript · validated · stability · model-comparison · frontier-models
CIJ Labs

Abstract

This paper argues that capability and stability are different properties. Using two evaluation regimes across 37 models, it profiles how consistently models behave rather than how well they score on standard benchmarks.

What it establishes

  • Model comparison should include behavioral consistency.
  • Frontier systems can look strong on capability while remaining unstable in behavior.
  • Stability deserves to be treated as a standalone operational property.
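
The distinction between capability and stability in the points above can be illustrated with a small sketch. It is a toy example, not the paper's method: the function names `single_run_score` and `stability`, the agreement-with-most-common-response metric, and the sample data are all assumptions made here for illustration.

```python
from collections import Counter

def single_run_score(runs, expected):
    """Capability proxy (assumed): share of prompts whose FIRST response is correct."""
    hits = [runs[p][0] == expected[p] for p in runs]
    return sum(hits) / len(hits)

def stability(runs):
    """Stability proxy (assumed): per prompt, the share of repeated responses
    that agree with the most common response, averaged over prompts."""
    scores = []
    for responses in runs.values():
        top_count = Counter(responses).most_common(1)[0][1]
        scores.append(top_count / len(responses))
    return sum(scores) / len(scores)

# Hypothetical data: four repeated responses per prompt for two made-up models.
expected = {"p1": "4", "p2": "Paris"}
model_a = {"p1": ["4", "4", "4", "4"],
           "p2": ["Paris", "Paris", "Paris", "Paris"]}
model_b = {"p1": ["4", "4", "5", "4"],
           "p2": ["Paris", "paris", "Paris", "Lyon"]}

for name, runs in [("model_a", model_a), ("model_b", model_b)]:
    print(name, single_run_score(runs, expected), stability(runs))
```

In this toy data both models get a perfect single-run score, yet `model_b` agrees with itself on only 62.5% of repeated responses. A capability-only comparison would rank the two as equal; a comparison that includes behavioral consistency would not.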

Why it matters

For builders, this is one of the most directly useful papers: it reframes what model evaluation should cover when the cost of silent behavioral change is borne in production.