
An eval harness that survives contact with a real user base

How to build LLM evaluation infrastructure that catches regressions before users do — and stays maintained.

Daniel Kim
Editor at Skill Trek
APR 17, 2026

The dirty secret of LLM evaluation is that most eval harnesses measure the wrong thing. They score the model against a fixed golden dataset, encode only the failure modes someone has already caught, and miss everything that actually happens in production.

What a production eval harness actually needs

Real eval infrastructure runs on live traffic traces, not just curated datasets. It needs three layers: offline evals on a golden set, which catch obvious regressions; shadow evals on sampled production traffic, which catch distribution shift; and online evals that score live outputs against a rubric, which catch the long tail. The sketches below walk through the second and third layers.
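The shadow layer is mostly plumbing: a sampler in front of whatever judge the online layer uses. Here is a minimal sketch, assuming traces are already logged with the retrieved context and the model's response; the Trace shape, the sample_shadow_traffic name, and the 2% default rate are illustrative, not a prescribed API.

shadow_sample.py
import random
from dataclasses import dataclass

@dataclass
class Trace:
    request_id: str
    context: str   # retrieved documents the model saw
    response: str  # what the model actually returned

def sample_shadow_traffic(
    traces: list[Trace], rate: float = 0.02, seed: int | None = None
) -> list[Trace]:
    # Keep each trace independently with probability `rate`; a low
    # single-digit percentage bounds judge-model cost while still
    # surfacing distribution shift within a reasonable window.
    rng = random.Random(seed)
    return [t for t in traces if rng.random() < rate]

The judge that scores those sampled traces is the same one the online layer runs against live outputs. A groundedness scorer is the canonical example: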

eval.py
import anthropic

client = anthropic.Anthropic()

def eval_groundedness(response: str, context: str) -> float:
    """Score 0-1 how well `response` is supported by `context`, via an LLM judge."""
    prompt = (
        "Rate the groundedness of the response against the context, 0 to 1. "
        f"Reply with only the number.\nContext: {context}\nResponse: {response}\nScore:"
    )
    result = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=10,
        messages=[{"role": "user", "content": prompt}],
    )
    # The judge should reply with a bare number; clamp defensively in case it drifts.
    score = float(result.content[0].text.strip())
    return min(max(score, 0.0), 1.0)
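Wiring the layers together means scoring the sampled traces, aggregating, and comparing against a baseline before anything ships. A hedged sketch building on the two snippets above; the baseline value and the 0.05 regression threshold are placeholders you would calibrate from your own score history:

run_shadow_evals.py
import statistics

BASELINE = 0.91          # rolling mean groundedness from the last release (placeholder)
REGRESSION_DELTA = 0.05  # alert if we fall more than this below baseline (placeholder)

def run_shadow_evals(traces: list[Trace]) -> float:
    sampled = sample_shadow_traffic(traces, rate=0.02)
    if not sampled:
        return BASELINE  # nothing sampled this window; skip the comparison
    scores = [eval_groundedness(t.response, t.context) for t in sampled]
    mean_score = statistics.mean(scores)
    if mean_score < BASELINE - REGRESSION_DELTA:
        # In production this would page someone or gate a deploy;
        # printing keeps the sketch self-contained.
        print(f"REGRESSION: groundedness {mean_score:.3f} vs baseline {BASELINE:.3f}")
    return mean_score

The threshold exists so that a single noisy judge call never pages anyone; only a sustained drop across a sampled window should.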

Daniel Kim

Applied ML engineer. Writes about LLMs, RAG, and production AI systems.
