Release context
This study is the public-facing companion to the comparative stability paper. It restates the paper's main idea as a dated release: model comparison should include behavioral consistency, not just task performance.
What changed in emphasis
- Stability is treated as its own property.
- Drift is framed as operationally important even when outputs still appear broadly correct.
- Findings are summarized in a way builders can act on quickly.
Why it matters
When teams evaluate only task accuracy, they can miss the more practical question: whether the model behaves the same way tomorrow as it did yesterday. The findings format exists to make that question visible.
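As a rough illustration of how accuracy can hide drift, the sketch below scores two hypothetical runs of the same model on the same four prompts. The data, the function names, and the use of exact-match agreement as a stand-in for "behavioral consistency" are all assumptions made for this example; they are not the metric or the results of the companion paper.

```python
def accuracy(outputs, gold):
    """Fraction of outputs that exactly match the reference answers."""
    return sum(o == g for o, g in zip(outputs, gold)) / len(gold)


def agreement(run_a, run_b):
    """Fraction of prompts on which two runs produced the same output."""
    return sum(a == b for a, b in zip(run_a, run_b)) / len(run_a)


# Hypothetical data: reference answers and two runs on the same prompts.
gold = ["A", "B", "C", "D"]
yesterday = ["A", "B", "C", "X"]  # wrong on the fourth prompt
today = ["A", "B", "Y", "D"]      # wrong on the third prompt

acc_yesterday = accuracy(yesterday, gold)
acc_today = accuracy(today, gold)
consistency = agreement(yesterday, today)

print(f"accuracy yesterday: {acc_yesterday:.2f}")
print(f"accuracy today:     {acc_today:.2f}")
print(f"run-to-run agreement: {consistency:.2f}")
```

In this toy case both runs score the same accuracy, yet they agree with each other on only half of the prompts. An accuracy-only comparison would report no change between the two days.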