Evaluating agents you cannot fully observe
Early thinking
A note, not a result. How do you evaluate an agent when you cannot see most of what it did?
A single-model prompt is easy to grade. You have an input and an output. An agent is different. It takes many steps, calls tools, and changes state in systems you do not own. The interesting failures happen between the steps.
Where the failures hide
Three kinds we keep seeing.
- The agent does the right things in the wrong order, and a later step depends on an earlier one that never happened.
- The agent succeeds on the task and leaves a mess behind: a half-written record, a duplicate, a lock it never released.
- The agent is confidently wrong about the world and acts before anyone can check.
What we are trying
- Grade the trace, not just the answer. Log every step and tool call, and evaluate the sequence.
- Build small, adversarial suites that target the seams between steps, not the steps themselves.
- Treat "did it leave the system clean?" as a first-class metric.
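To make the first and third ideas concrete, here is a minimal sketch of a trace grader. It is an illustration under stated assumptions, not our actual harness: the `Step` schema, the `requires` dependency map, and the idea of snapshotting resource ids before and after a run are all hypothetical. It checks two things: that each step ran only after the steps it depends on had succeeded, and that the run left behind nothing beyond what the task was meant to create.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One logged agent action. Hypothetical schema, for illustration only."""
    name: str
    ok: bool = True


@dataclass
class TraceReport:
    # (index, step name, missing dependency) for each ordering problem found
    order_violations: list = field(default_factory=list)
    # resource ids present after the run that were neither there before nor expected
    leftovers: set = field(default_factory=set)

    @property
    def passed(self) -> bool:
        return not self.order_violations and not self.leftovers


def grade_trace(steps, requires, state_before, state_after, expected_new=()):
    """Grade a trace on ordering and on what it left behind.

    requires      maps a step name to the step names that must have succeeded earlier.
    state_before  resource ids observed before the run (assumed to come from outside the agent).
    state_after   resource ids observed after the run.
    expected_new  resource ids the task was supposed to create.
    """
    report = TraceReport()
    succeeded = set()
    for index, step in enumerate(steps):
        for dep in requires.get(step.name, ()):
            if dep not in succeeded:
                report.order_violations.append((index, step.name, dep))
        if step.ok:
            succeeded.add(step.name)
    report.leftovers = (set(state_after) - set(state_before)) - set(expected_new)
    return report


# Example: the agent writes before taking the lock, then fails to release it.
steps = [Step("write_record"), Step("acquire_lock"), Step("release_lock", ok=False)]
requires = {"write_record": ["acquire_lock"], "release_lock": ["acquire_lock"]}
report = grade_trace(
    steps,
    requires,
    state_before={"rec:1"},
    state_after={"rec:1", "rec:2", "lock:rec"},
    expected_new={"rec:2"},
)
print(report.order_violations)  # [(0, 'write_record', 'acquire_lock')]
print(report.leftovers)         # {'lock:rec'}
print(report.passed)            # False
```

Note that the final record `rec:2` exists, so an answer-only grader would likely pass this run. The trace grader fails it twice: once for the out-of-order write and once for the lock left behind. The sketch takes the state snapshots as inputs on purpose, so they can come from an observer other than the agent's own logs.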
Open questions
- How much of a trace can you grade automatically before you need a human?
- Can an agent's own logs be trusted as evidence, or do you need an outside observer?
We are early here.
Cite this work
@misc{datafrontier_applied_ai_note,
title = {Evaluating agents you cannot fully observe},
author = {DataFrontier Team},
year = {2024},
url = {https://datafrontier.co/research/applied-ai-note}
}