In general-purpose computing based on CPUs, execution is deterministic across products and domains, but things change drastically in GPU-based accelerated computing and the AI-agentic world, where execution is parallel and non-deterministic. This gives rise to several challenges, which is why settling on the best evaluation practices remains an open problem in the agentic world. Agents will often face entirely new situations that are not feasible to test in advance in the real world, for example a driverless agent encountering completely new road and traffic conditions day to day.
Before diving further into the eval story, a brief recap of testing in the CPU world. Testing is a critical phase that spans the software development life cycle, from development through integration, testing, deployment, and maintenance. Test requirements are mapped to business rules, and traceability is well defined. A test suite covers different test conditions, including boundary-value, corner-case, positive, negative, load, stress, and performance testing. As a ground rule, test cases should be well defined, with pre- and post-conditions, dependencies, descriptions, run status (pass, fail, blocked, or NA), and observations. Test cases are executed in black-box and white-box testing, where automation makes a big difference in achieving coverage faster. In addition, A/B testing, pilot or user testing, business user acceptance testing (BUAT), and customer acceptance testing play a bigger role before any major release. Several test runs precede the production release to measure quality across builds. After passing all the gates, the product or application is released in a staggered or complete rollout, with support for rollback. The whole exercise is deterministic: whenever a core module or library changes, the test suite is re-run to ensure the system works as expected across various input and environment conditions (a minimal test sketch follows the list below). In short, traditional testing has coverage for:
- Safety envelopes
- Redundancy
- Continuous monitoring
- Risk-based boundaries
- Simulation environments
- Fallback behaviors
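To make the deterministic baseline concrete, here is a minimal sketch of a traditional boundary-value test case written with pytest; the `apply_discount` business rule is a hypothetical example, not something from a real codebase.

```python
# Minimal sketch of deterministic, boundary-value testing.
# `apply_discount` is a hypothetical business rule; pytest is the assumed runner.
import pytest

def apply_discount(price: float, percent: float) -> float:
    """Business rule: discount must be between 0 and 100 percent."""
    if not 0 <= percent <= 100:
        raise ValueError("percent out of range")
    return round(price * (1 - percent / 100), 2)

# Boundary values and positive/negative conditions map directly to test cases;
# identical inputs always yield identical outputs, so pass/fail is deterministic.
@pytest.mark.parametrize("price,percent,expected", [
    (100.0, 0, 100.0),     # lower boundary
    (100.0, 100, 0.0),     # upper boundary
    (100.0, 25, 75.0),     # positive case
])
def test_apply_discount(price, percent, expected):
    assert apply_discount(price, percent) == expected

def test_apply_discount_rejects_invalid_percent():
    with pytest.raises(ValueError):  # negative case
        apply_discount(100.0, 101)
```

Given the same inputs, these assertions pass or fail identically on every run, which is exactly the property agentic systems lack.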
Fast forward to the agentic world, where frontier LLMs are built by companies acting as "Providers," and companies that build AI applications as products on top of them act as "Integrators." The evaluations are similar to the testing analogy above, but the challenge is bigger: the models are not deterministic, and they face novel situations far more frequently.
Agentic workflows are challenging because they:
- Operate in open-ended problem spaces
- Face novel, evolving, messy real-world states
- Interact with APIs, tools, humans, content
- Can coordinate and branch in parallel
Provider vs Integrator Responsibility
In an agentic system, it is essential to understand the split between Provider and Integrator responsibilities (see the sketch after the lists below).
Providers ensure:
- base alignment
- reasoning competency
- general safety
Integrators ensure:
- domain safety
- workflow correctness
- context-specific compliance
- runtime constraints
- production monitoring
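As one illustration of this split, here is a minimal sketch of an Integrator-side gate layered on top of a Provider's model; `ToolCall`, `DomainPolicy`, the tool allow-list, and the order-value limit are all hypothetical, not a real API.

```python
# Minimal sketch of the responsibility split: the Provider's model proposes
# actions; the Integrator enforces domain safety and runtime constraints
# before anything executes. All names here are hypothetical illustrations.
from dataclasses import dataclass

@dataclass
class ToolCall:
    tool: str
    args: dict

class DomainPolicy:
    """Integrator-owned: domain safety, workflow correctness, runtime limits."""
    ALLOWED_TOOLS = {"search_catalog", "create_quote"}   # context-specific compliance
    MAX_ORDER_VALUE = 10_000                             # runtime constraint

    def check(self, call: ToolCall) -> None:
        if call.tool not in self.ALLOWED_TOOLS:
            raise PermissionError(f"tool '{call.tool}' outside workflow boundary")
        if call.tool == "create_quote" and call.args.get("value", 0) > self.MAX_ORDER_VALUE:
            raise PermissionError("order value exceeds runtime constraint")

def run_agent_step(model_proposed: ToolCall, policy: DomainPolicy) -> ToolCall:
    # Base alignment and reasoning quality come from the Provider's model;
    # this gate is the Integrator's job and runs on every step.
    policy.check(model_proposed)
    return model_proposed  # safe to dispatch to the real tool
```

The Provider is responsible for what the model proposes; the Integrator owns the gate that decides whether a proposal may execute.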
Let us take the Claude Sonnet 4.5 release as an example from the Provider side. According to its system card, the following training and evaluation steps were carried out before the model release:
- The model was trained on a proprietary mix of publicly available information from the Internet as of July 2025, non-public data from third parties, and data provided by data-labelling services and paid contractors.
- Throughout the training process, several data cleaning and filtering methods were used, including deduplication and classification.
- After pre-training, the model underwent substantial post-training and fine-tuning, the objective of which is to make it a helpful, honest, and harmless assistant. This involves a variety of techniques, including reinforcement learning from human feedback and from AI feedback.
- Crowd workers: Anthropic partners with data-work platforms to engage workers who help improve the models through preference selection, safety evaluation, and adversarial testing.
AI workloads
Not all AI workloads fall under the same “uncertainty profile,” and not all require complex agent-style eval strategies. There are two fundamentally different classes of AI systems:
A. Deterministic-ish / bounded outputs
- Ranking (search results)
- Scoring models (credit risk)
- Recommendation ranking
- Basic chatbots with constrained flows
- Image generation without tool actions
These operate in a probabilistic but bounded space. Uncertainty exists, but the outputs do not autonomously act on the world.
B. Agentic / tool-using / dynamic systems
- Autonomous driving
- AI that calls APIs
- Multi-step planners
- Procurement agents
- Workflow orchestrators
- Shell-executing agents
- Financial trading agents
These interact with external state, face open-world conditions, and can cause harm. For this second class, uncertainty is intrinsic.
Guidance on evaluations for agentic workflows (dynamic systems):
- Use evals to measure capability bounds, safety tendencies, model regression, bias drift, tool-use correctness, and reasoning quality (an eval-harness sketch follows this list).
- Ensure meta-behaviours are stable and the agent adapts in unseen situations. The framework should test for reasoning coherence, self-check ability, compliance with constraints, uncertainty awareness, red-teaming vulnerability, tool hygiene, and hallucination tendency.
- Don't try to avoid parallel execution; instead, use runtime guardrails, runtime evaluators, and containment boundaries.
- Design for resilience: test the interfaces, enforce contracts, monitor runtime behaviour, add retry/backoff logic, rate-limit, instrument metrics, and alert on anomalies (see the retry/backoff sketch below).
- Monitor continuously and enforce correctness in context through health checks, guardrails, and drift detectors (see the drift-detector sketch below).
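First, a minimal sketch of an offline eval harness for the measurements above: score a model's answers against graded cases and compare the pass rate with the previous release to catch regressions. The case format, the `ask_model` callable, and the thresholds are hypothetical placeholders.

```python
# Minimal eval-harness sketch: grade answers per case, aggregate a pass rate,
# and flag regressions against the previous release. All names are illustrative.
def run_eval(ask_model, cases):
    """Return per-case results and an aggregate pass rate."""
    results = []
    for case in cases:
        answer = ask_model(case["prompt"])
        results.append({
            "id": case["id"],
            "passed": case["grader"](answer),  # e.g., exact match or rubric check
        })
    pass_rate = sum(r["passed"] for r in results) / len(results)
    return results, pass_rate

def check_regression(current_rate, previous_rate, max_drop=0.02):
    """Flag a model regression if the pass rate drops beyond tolerance."""
    return (previous_rate - current_rate) > max_drop
```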
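Next, a minimal sketch of the resilience guidance: retry with jittered exponential backoff around a flaky external call. The wrapped function is a stand-in for any API or tool the agent depends on.

```python
# Minimal retry/backoff sketch: retry transient failures with jittered
# exponential backoff; the sleep between attempts also acts as a crude rate limit.
import random
import time

def call_with_backoff(fn, *args, max_retries=4, base_delay=0.5, **kwargs):
    """Retry transient failures with jittered exponential backoff."""
    for attempt in range(max_retries):
        try:
            return fn(*args, **kwargs)
        except (TimeoutError, ConnectionError):
            if attempt == max_retries - 1:
                raise  # out of retries: surface the failure to the caller
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(delay)
```

In production this wrapper would also emit metrics per attempt so that anomaly alerts can fire when retries spike.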
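Finally, a minimal drift-detector sketch: compare a rolling window of a runtime metric (for example, tool-call failure rate) against a baseline and alert when it drifts past a threshold. The window size and tolerance are illustrative.

```python
# Minimal drift-detector sketch: rolling mean of a runtime metric vs. baseline.
from collections import deque

class DriftDetector:
    def __init__(self, baseline: float, window: int = 100, tolerance: float = 0.05):
        self.baseline = baseline
        self.values = deque(maxlen=window)
        self.tolerance = tolerance

    def observe(self, value: float) -> bool:
        """Record a new observation; return True if drift is detected."""
        self.values.append(value)
        if len(self.values) < self.values.maxlen:
            return False  # not enough data yet
        current = sum(self.values) / len(self.values)
        return abs(current - self.baseline) > self.tolerance
```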
Closing Notes:
Not all AI workloads face these evaluation challenges; for workloads such as the following, mature metrics, A/B test frameworks, production drift monitoring, and human feedback loops are already available:
- Recommendations
- Ads ranking
- Classification
- Search
A heavy-agency evaluation methodology is required for complex AI workloads where the model must plan steps, autonomously choose tools, interact with external systems, and operate under unpredictable environmental states. These agentic evaluations bring a paradigm shift. During review, it is critical to test how systems behave under uncertainty, handle novelty, detect risk, self-critique, and escalate safely. To put this into practice, it is essential to have solid pre-deployment evals, simulation environments, runtime guardrails, real-time observability, self-evaluation loops, and automatic abort conditions.
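As one illustration of a self-evaluation loop with an automatic abort condition, here is a minimal sketch; `agent_step`, `self_critique`, and both thresholds are hypothetical placeholders for whatever the deployed system provides.

```python
# Minimal sketch of a runtime self-evaluation loop with an automatic abort
# condition and a hard step budget as a containment boundary.
MAX_STEPS = 10
MIN_CONFIDENCE = 0.6

def run_with_abort(agent_step, self_critique):
    """Run agent steps; abort and escalate when self-assessed confidence drops."""
    state = {"done": False, "history": []}
    for _ in range(MAX_STEPS):                        # containment boundary
        action = agent_step(state)                    # agent proposes next action
        confidence = self_critique(state, action)     # self-evaluation loop
        if confidence < MIN_CONFIDENCE:
            return {"status": "escalated", "reason": "low self-confidence"}
        state["history"].append(action)
        if state["done"]:                             # agent_step marks completion
            return {"status": "completed", "history": state["history"]}
    return {"status": "aborted", "reason": "step budget exhausted"}
```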
To end, agentic evals are about "metacognition": we evaluate whether the agent can handle unknown states, and in particular whether it can:
- Detect ambiguity
- Ask clarifying questions
- De-escalate
- Avoid hallucinating
- Fail safely
- Stay within tool boundaries