Field note · 011 · Jan 2026 · 11 min

Evaluating tone is harder than evaluating truth

You can write a regex for whether the model got the date right. You cannot write a regex for whether it sounded like you.

The Acorn Labs bench
Southwark · SE1

Factual evals are a solved enough problem that we no longer think about them. You have a golden answer, you have a model answer, you have a fuzzy match. You move on. Tone evals are not like that, and pretending they are is the single most expensive mistake we see teams make in their first six months of shipping AI features.
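For contrast, here is roughly how small the factual side can be. This is a minimal sketch, not our production harness: the function names and the 0.85 threshold are assumptions made up for illustration, and the matcher is Python's standard-library `difflib`.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting noise does not count as a miss."""
    return " ".join(text.lower().split())


def fuzzy_match(golden: str, answer: str, threshold: float = 0.85) -> bool:
    """Return True when the model answer is close enough to the golden answer.

    The threshold is an illustrative assumption, not a recommended value.
    """
    ratio = SequenceMatcher(None, normalize(golden), normalize(answer)).ratio()
    return ratio >= threshold
```

Calling `fuzzy_match("12 March 2024", "12  march 2024")` passes because the two strings normalize to the same text. Nothing comparable exists for "did this sound like us", which is the point of the rest of this note.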

The taste-as-spec problem

When you ask an editor 'is this on brand?', they will answer in about 200 milliseconds and they will be right. When you ask them to write down the rule that produced the answer, they will sit with you for two hours and produce something that is, at best, 60% of what is in their head. That gap is where most tone evals live.

Tone is a 200ms feeling and a two-hour explanation. Both are real. Only one fits in a unit test.

What works for us

  • Pairwise comparisons against a held-out sample of the customer's own writing.
  • A human-in-the-loop rubric of 6–8 axes, each rated 1–5, rather than a single score.
  • An LLM-as-judge ensemble that we have calibrated against the human rater first, not last.
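The rubric and the calibrate-first step can be sketched in a few lines. Treat everything named here as hypothetical: the axis names, the `calibration_report` and `axes_needing_work` helpers, and the 0.5 cutoff are assumptions for illustration, and the judge ratings are assumed to be already collected (one rating per sample, however the ensemble was combined).

```python
from statistics import mean

# Hypothetical axis names; the rubric above has 6-8 axes, each rated 1-5.
AXES = ["warmth", "directness", "formality", "humour", "jargon", "rhythm"]

Rating = dict[str, int]  # axis name -> score from 1 to 5


def validate(rating: Rating) -> None:
    """Reject ratings that skip an axis or leave the 1-5 scale."""
    for axis in AXES:
        score = rating.get(axis)
        if score is None or not 1 <= score <= 5:
            raise ValueError(f"bad score for {axis!r}: {score!r}")


def calibration_report(human: list[Rating], judge: list[Rating]) -> dict[str, float]:
    """Mean absolute gap between judge and human, per axis, over the same samples."""
    if len(human) != len(judge) or not human:
        raise ValueError("need paired, non-empty ratings")
    for rating in human + judge:
        validate(rating)
    return {
        axis: mean(abs(h[axis] - j[axis]) for h, j in zip(human, judge))
        for axis in AXES
    }


def axes_needing_work(report: dict[str, float], max_gap: float = 0.5) -> list[str]:
    """Axes where the judge drifts from the human by more than max_gap (assumed cutoff)."""
    return [axis for axis, gap in report.items() if gap > max_gap]
```

Keeping the report per axis is deliberate. A judge that agrees with the human on formality but not on humour looks fine in a single averaged score, and that is exactly the failure a single score hides.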

None of this is glamorous. All of it is the actual work.
