How do financial institutions maintain effective challenge when decision‑making is delegated to probabilistic, adaptive agents operating at machine speed?
What Engineering Evals Solve
Recent advances in agent evaluation have significantly raised the technical bar. Modern frameworks typically include:
Task suites derived from real‑world failures and representative user journeys
Automated graders to assess outcomes at scale
Full transcripts capturing reasoning, tool calls, and intermediate actions
Quantitative metrics such as pass@k and pass^k to measure capability and consistency
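To make the last point concrete, here is a minimal sketch of how pass@k and pass^k are typically estimated from n repeated trials of the same task. The pass@k form follows the standard unbiased estimator; the pass^k form assumes independent trials, and the example numbers are purely illustrative:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    sampled attempts succeeds, given c successes across n trials."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_pow_k(n: int, c: int, k: int) -> float:
    """pass^k: probability that all k independent attempts succeed,
    estimated from the empirical per-trial success rate c/n."""
    return (c / n) ** k

# Example: 10 trials, 7 successes.
# Capability looks strong (pass@5 ~ 1.0), consistency does not (pass^5 ~ 0.17).
print(pass_at_k(10, 7, 5), pass_pow_k(10, 7, 5))
```

The gap between those two numbers is the gap between capability and consistency that the rest of this piece returns to.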
These components are essential. They establish whether an agent can perform a task and provide the raw evidence needed for review. However, evaluation alone does not answer the governance question regulators and boards care about:
Should this agent be allowed to operate, at this level of autonomy, in this context, today?
That decision requires controls beyond engineering.
The Governance Gap
In high‑stakes domains such as financial services, risk management rests on separation of duties. The builder of a system cannot be its final judge.
Agent eval frameworks are typically owned by the same teams that design, prompt, and deploy the agent. Over time, this creates predictable blind spots.
Incentive Drift
Evaluation suites naturally converge on known strengths. Rare but severe failures are under‑represented, even as operational exposure grows.
Capability Masking
Metrics like pass@k demonstrate that a solution exists, but can hide dangerous side effects that occur during failed attempts before eventual success.
Correlated Judgement
When agents are evaluated by similar models, shared reasoning errors and confidence biases can go undetected, especially in multi‑agent systems.
Point‑in‑Time Assurance
Agents interact with changing environments. A test passed at deployment does not guarantee safe behaviour months later, after model updates or new user behaviours emerge.
These are not engineering failures. They are governance failures.
Primebase’s Research Thesis
Primebase is built on the premise that effective challenge must be continuous, independent, and operationalised at runtime, not inferred from static test results.
Our research focuses on three principles.
1. Independence by Design
Oversight must be structurally independent of agent builders. Validation logic, escalation thresholds, and risk interpretation cannot sit with teams whose incentives are tied to delivery velocity. Primebase treats evaluation outputs as inputs, not decisions.
2. Reliability Over Capability
Capability metrics show what an agent can do. Governance requires understanding how it fails.
Primebase research emphasises:
Severity‑weighted failure classification
Detectability and recoverability of errors
Bounded retries and irreversible action tracking
This reframes assurance from success rates to downside containment.
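As a concrete illustration of that reframing, the sketch below scores failures by severity rather than counting them. The classes, weights, and multipliers are illustrative assumptions, not Primebase's actual taxonomy; the point is that undetected and irreversible failures dominate the score, not the raw failure rate.

```python
from dataclasses import dataclass
from enum import Enum

class Severity(Enum):
    MINOR = 1        # cosmetic or fully recoverable
    MAJOR = 5        # recoverable, but with customer or financial impact
    CRITICAL = 25    # irreversible action or undetected harm

@dataclass
class Failure:
    severity: Severity
    detected: bool       # did monitoring catch it before impact?
    recoverable: bool    # could the action be rolled back?

def downside_score(failures: list[Failure], total_runs: int) -> float:
    """Severity-weighted failure rate: undetected or irreversible
    failures are weighted far more heavily than raw failure counts."""
    weight = 0.0
    for f in failures:
        w = float(f.severity.value)
        if not f.detected:
            w *= 2        # undetected failures are worse
        if not f.recoverable:
            w *= 2        # irreversible actions are worse
        weight += w
    return weight / total_runs
```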
3. Oversight as a State, Not a Gate
For agentic systems, assurance decays over time.
Primebase defines oversight as a continuously maintained state, supported by:
Runtime observation of agent behaviour
Trigger‑based escalation when risk boundaries are approached
Periodic re‑validation driven by model changes, behavioural drift, and incident signals
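A minimal sketch of what trigger-based escalation and re-validation could look like at runtime is shown below. The signals and thresholds are illustrative assumptions; in practice the boundaries would be set and owned independently of the delivery team.

```python
from dataclasses import dataclass

@dataclass
class RuntimeSignal:
    tool_error_rate: float      # rolling error rate on tool calls
    retries_per_task: float     # rolling mean retries before success
    model_version: str          # model version currently serving the agent

@dataclass
class OversightState:
    validated_model_version: str
    autonomy_allowed: bool = True

def review(state: OversightState, signal: RuntimeSignal) -> OversightState:
    """Oversight as a maintained state: autonomy can be withdrawn
    whenever runtime behaviour approaches a risk boundary."""
    # Hypothetical thresholds; real boundaries would be risk-approved.
    if signal.tool_error_rate > 0.05 or signal.retries_per_task > 3:
        return OversightState(state.validated_model_version, autonomy_allowed=False)
    # A model change invalidates point-in-time assurance: force re-validation.
    if signal.model_version != state.validated_model_version:
        return OversightState(state.validated_model_version, autonomy_allowed=False)
    return state
```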
From Evaluation to Decision
A core contribution of Primebase research is the distinction between measurement and judgement.
Engineering frameworks measure performance.
Governance frameworks decide:
Whether autonomy remains appropriate
Whether constraints must tighten or relax
When human intervention is required
What evidence demonstrates effective challenge
Primebase converts raw eval and runtime signals into defensible oversight decisions aligned with regulatory expectations such as effective challenge, proportionality and ongoing assurance.
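As an illustration of that judgement layer, the sketch below maps measurements to one of three oversight decisions. The thresholds and decision labels are assumptions chosen for clarity, not a prescribed policy.

```python
from enum import Enum

class Decision(Enum):
    MAINTAIN = "maintain current autonomy"
    TIGHTEN = "tighten constraints, e.g. human approval on high-impact actions"
    SUSPEND = "suspend autonomy pending second-line review"

def oversight_decision(pass_rate: float, downside: float, drift_detected: bool) -> Decision:
    """Measurements in, a defensible decision out. Thresholds are
    illustrative and would be owned independently of the agent builders."""
    if drift_detected or downside > 1.0:
        return Decision.SUSPEND
    if pass_rate < 0.95 or downside > 0.2:
        return Decision.TIGHTEN
    return Decision.MAINTAIN
```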

Why This Matters Now
As agents gain broader action spaces and operate across customer journeys, risk is no longer concentrated in single model outputs. It emerges from interaction, retries, tool use, and coordination across systems.
Without independent, continuous oversight, strong benchmarks can coexist with fragile real‑world behaviour. Primebase exists to close that gap.
Ongoing Work
Our current research explores:
Governance patterns for multi‑agent systems
Runtime detection of autonomy drift
Mapping agent risk controls to supervisory frameworks
Evidence generation for second‑line and board assurance
We publish selectively and collaborate with practitioners who believe governance must evolve as fast as the systems it oversees.
