Comparative Behavioral Stability of Large Language Models

Abstract

This paper argues that capability and stability are different properties. Using two evaluation regimes across 37 models, it profiles how consistently models behave rather than how well they score on standard benchmarks.

What it establishes

Model comparison should include behavioral consistency.
Frontier systems can look strong on capability while remaining unstable in behavior.
Stability deserves to be treated as a standalone operational property.

Why it matters

This is one of the most directly useful papers for builders. It reframes what model evaluation should include when the cost of silent behavioral change is borne in production.