Agent Evaluation Framework: Progressive Implementation Plan
Status: Implementation Roadmap
Last Updated: 2025-10-25
Purpose: Comprehensive evaluation strategy for multi-agent loan processing system
Related Docs: AI Security, AI Security Framework Capabilities
Executive Summary
This document presents a progressive evaluation framework for the Loan Defenders multi-agent system, combining industry best practices from Anthropic/Claude, Microsoft Agent Framework, and academic research on agent failure modes. The framework spans offline testing, online monitoring, error analysis, and continuous improvement.
Key Insight: Agent evaluation differs from traditional software testing. Agents exhibit non-deterministic behavior, complex reasoning chains, and emergent failure modes requiring specialized evaluation techniques.
Implementation Timeline: 4 phases over 12-16 weeks, starting with basic offline testing and progressing to production monitoring and continuous learning.
Table of Contents
- Evaluation Philosophy
- Evaluation Layers & Metrics
- Testing Strategies
- Offline vs. Online Testing
- Error Analysis & Failure Taxonomy
- Microsoft Agent Framework Lab Integration
- Progressive Implementation Plan
- Tooling & Infrastructure
- Continuous Improvement Loop
Evaluation Philosophy
Core Principles (from Anthropic)
1. Evaluation is Part of the Entire Production Lifecycle
Evaluation is not a one-time validation step but a continuous cycle integral to improving agent performance. Every prompt change, model update, or architecture modification requires re-evaluation.
2. Volume Over Perfection
Anthropic recommendation: "More questions with slightly lower signal automated grading is better than fewer questions with high-quality human hand-graded evals."
Key: Prioritize scale and automation over perfect precision.
3. Real-World Task Distribution
Evaluations must mirror actual use cases, including:
- ✅ Typical loan applications (happy path)
- ✅ Edge cases (zero income, perfect credit + high debt)
- ✅ Adversarial inputs (prompt injection attempts)
- ✅ Ambiguous scenarios (missing data, conflicting signals)
- ✅ Poor user input (typos, format errors)
4. Automation First
Structure tests to enable automated grading through:
- Multiple-choice formats
- String matching
- Code-based evaluation
- LLM-based assessment
Use human grading only when absolutely necessary; it is the slowest and most expensive option.
Why Agent Evaluation is Different
| Traditional Software | AI Agents |
|---|---|
| Deterministic outputs | Non-deterministic behavior |
| Unit tests with exact matches | Semantic similarity checks |
| Single execution path | Multiple valid reasoning paths |
| Predictable failure modes | Cascading error propagation |
| Binary pass/fail | Graded quality assessment |
Implication: Agents require probabilistic evaluation (pass@k metrics), trajectory analysis, and a working understanding of failure taxonomies rather than binary pass/fail tests.
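Because agent runs are non-deterministic, a single passing run proves little. pass@k quantifies this: the probability that at least one of k sampled runs succeeds, estimated without bias from n total runs of which c succeeded. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples, drawn from n recorded attempts (c successful), succeeds.
    Computed as 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Fewer failures than samples: every draw of k must include a success
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 10 recorded runs and 4 successes, pass@1 is 0.4 while pass@5 is above 0.97, which is why reporting only single-run accuracy understates a stochastic agent's usable reliability.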
Evaluation Layers & Metrics
Based on industry best practices, we evaluate across 4 layers of our agent system:
Layer 1: Model & LLM Performance
Metrics:
- Accuracy: How often agent outputs match expected results (%)
- Latency: Response time from query to final response (seconds)
- Cost: Token usage and API call expenses ($/request)
- Robustness: Performance consistency across diverse inputs (variance)
Example Test:
```python
# Test credit agent accuracy on known cases
def test_credit_agent_accuracy():
    test_cases = [
        {"credit_score": 750, "debt": 10000, "expected": "LOW_RISK"},
        {"credit_score": 620, "debt": 25000, "expected": "MEDIUM_RISK"},
        {"credit_score": 580, "debt": 50000, "expected": "HIGH_RISK"},
    ]
    correct = 0
    for case in test_cases:
        result = credit_agent.assess(case)
        if result.risk_level == case["expected"]:
            correct += 1
    accuracy = correct / len(test_cases)
    assert accuracy >= 0.90  # 90% accuracy threshold
```
Layer 2: Orchestration & Reasoning
Metrics:
- Agent Trajectory Quality: Logical sequence of actions and reasoning paths
- Tool Selection Accuracy: Correct MCP server invocation with appropriate parameters
- Step Completion: Successful execution of required workflow steps (%)
- Step Utility: Whether each action meaningfully advances task completion
Example Test:
```python
# Test orchestrator workflow completeness
def test_orchestrator_workflow():
    application = create_test_application()
    # Capture agent trajectory
    with capture_agent_trace() as trace:
        decision = orchestrator.process(application)
    # Verify all required agents were consulted
    expected_agents = ["intake", "credit", "income", "risk"]
    invoked_agents = [step.agent_name for step in trace.steps]
    assert set(expected_agents).issubset(set(invoked_agents))
    # Verify logical order
    assert trace.steps[0].agent_name == "intake"  # First step
    assert trace.steps[-1].agent_name == "risk"   # Last assessment
```
Layer 3: Knowledge & Retrieval
Metrics (if using RAG or knowledge bases):
- Context Relevance: Retrieved information addresses user questions (%)
- Context Precision: Information density and signal-to-noise ratio
- Context Recall: Completeness of relevant information retrieval (%)
- Faithfulness: Claims supported by provided sources (%)
Note: Currently not applicable to Loan Defenders (agents use MCP servers, not RAG).
Layer 4: Application & Business Outcomes
Metrics (most important):
- Task Success Rate: Binary or graded task completion (%)
- Decision Quality: Agreement with expert human judgments (%)
- Semantic Similarity: Meaning-based output comparison using embeddings
- Regulatory Compliance: ECOA/FCRA adherence (violations per 100 decisions)
- User Satisfaction: Explicit feedback and implicit signals
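The Semantic Similarity metric above is typically computed as cosine similarity between embedding vectors. A minimal sketch; the vectors would come from an embedding model, and the 0.85 threshold is an illustrative assumption to be tuned against labeled pairs:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def semantically_similar(a: list[float], b: list[float],
                         threshold: float = 0.85) -> bool:
    # Threshold is an assumption; calibrate on expert-labeled output pairs
    return cosine_similarity(a, b) >= threshold
```

This lets a test accept any phrasing of "approved due to strong credit history" instead of demanding an exact string match.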
Example Test:
```python
# Test decision quality against expert-labeled dataset
def test_decision_quality():
    expert_dataset = load_expert_labeled_loans()  # 500 cases
    true_positives = 0
    false_positives = 0
    true_negatives = 0
    false_negatives = 0
    for case in expert_dataset:
        agent_decision = orchestrator.process(case["application"])
        expert_decision = case["expert_judgment"]
        if agent_decision.approved and expert_decision == "APPROVE":
            true_positives += 1
        elif agent_decision.approved and expert_decision == "DENY":
            false_positives += 1  # Risk: bad loan approved
        elif not agent_decision.approved and expert_decision == "DENY":
            true_negatives += 1
        else:
            false_negatives += 1  # Lost business
    accuracy = (true_positives + true_negatives) / len(expert_dataset)
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    # Business-critical thresholds
    assert accuracy >= 0.90       # 90% agreement with experts
    assert false_positives <= 25  # ≤5% bad loan approvals (of 500 cases)
    assert false_negatives <= 75  # ≤15% lost business (of 500 cases)
```
Testing Strategies
1. Automated Evaluation (Primary Strategy)
Statistical Evaluators:
- BLEU, ROUGE for text similarity
- Cosine similarity for semantic comparison
- Exact match for structured outputs
Programmatic Evaluators:
- Rule-based checks (e.g., DTI ratio ≤ 43%)
- Format validation (Pydantic models)
- Constraint satisfaction (loan amount ranges)
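As a concrete example of a programmatic evaluator, the DTI rule above reduces to a few lines of plain Python (the helper name is illustrative):

```python
def check_dti_constraint(monthly_debt: float, monthly_income: float,
                         max_dti: float = 0.43) -> bool:
    """Rule-based check: debt-to-income ratio must not exceed 43%."""
    if monthly_income <= 0:
        # No verifiable income: fail the constraint rather than divide by zero
        return False
    return monthly_debt / monthly_income <= max_dti
```

Checks like this are deterministic and essentially free, which is why they run on every decision rather than on a sample.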
CI/CD Integration:
```yaml
# .github/workflows/agent-evaluation.yml
name: Agent Evaluation
on:
  pull_request:
    branches: [main]
  push:
    branches: [main]
jobs:
  evaluate-agents:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run Agent Evaluation Suite
        run: |
          uv run pytest tests/evals/ -v --cov=loan_defenders
      - name: Generate Evaluation Report
        run: |
          uv run python scripts/generate_eval_report.py
      - name: Check Quality Gates
        run: |
          # Fail build if metrics below thresholds
          uv run python scripts/check_quality_gates.py
```
Benefits: Fast, scalable, consistent, cost-effective.
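The quality-gate script referenced in the workflow could look roughly like this; the metric names and thresholds are illustrative assumptions, and in CI the metrics dict would be loaded from the report produced by generate_eval_report.py:

```python
# Sketch of scripts/check_quality_gates.py (names/thresholds are assumptions)
GATES = {
    "accuracy": 0.90,                  # agreement with expert labels
    "explanation_quality_rate": 0.85,  # share of explanations scoring >= 3/5
}

def check_gates(metrics: dict[str, float]) -> list[str]:
    """Return human-readable gate failures; an empty list means all gates pass."""
    return [
        f"{name}: {metrics.get(name, 0.0):.3f} < {threshold:.3f}"
        for name, threshold in GATES.items()
        if metrics.get(name, 0.0) < threshold
    ]

def main(metrics: dict[str, float]) -> int:
    failures = check_gates(metrics)
    for failure in failures:
        print(f"GATE FAILED - {failure}")
    return 1 if failures else 0  # non-zero exit code fails the CI job
```

A missing metric counts as 0.0 and therefore fails its gate, so a broken report cannot silently pass the build.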
2. LLM-as-Judge Evaluation
Use Cases:
- Subjective qualities (helpfulness, tone, clarity)
- Semantic correctness when exact match is too strict
- Explanation quality assessment
- Fair lending compliance checks
Implementation:
```python
# tests/evals/judges/explanation_judge.py
import json

from azure.identity import DefaultAzureCredential
from agent_framework import ChatAgent
from agent_framework.azure import AzureOpenAIChatClient


class ExplanationQualityJudge:
    """LLM-based judge for decision explanation quality."""

    def __init__(self):
        self.judge = ChatAgent(
            chat_client=AzureOpenAIChatClient(
                credential=DefaultAzureCredential(),
                model="gpt-4o-mini",  # Cheaper model for judging
            ),
            name="Explanation Judge",
            instructions="""
            You are an expert evaluator of loan decision explanations.
            Evaluate explanations on a 1-5 scale for:
            1. Clarity: Is the explanation easy to understand?
            2. Completeness: Does it address all factors?
            3. Compliance: Does it meet FCRA/ECOA requirements?
            4. Actionability: Does it provide clear next steps?
            Respond with JSON:
            {
                "clarity_score": 1-5,
                "completeness_score": 1-5,
                "compliance_score": 1-5,
                "actionability_score": 1-5,
                "overall_score": 1-5,
                "reasoning": "brief explanation"
            }
            """,
        )

    async def evaluate(self, decision: LoanDecision) -> dict:
        """Evaluate decision explanation quality."""
        prompt = f"""
        Evaluate this loan decision explanation:
        Decision: {"Approved" if decision.approved else "Denied"}
        Explanation: {decision.reasoning}
        Denial Reasons: {decision.denial_reasons if not decision.approved else "N/A"}
        Provide scores and reasoning.
        """
        result = await self.judge.run(prompt)
        return json.loads(result.text)


# Usage in tests
async def test_explanation_quality():
    judge = ExplanationQualityJudge()
    test_decisions = load_test_decisions()
    low_quality_count = 0
    for decision in test_decisions:
        scores = await judge.evaluate(decision)
        if scores["overall_score"] < 3:
            low_quality_count += 1
    quality_rate = 1 - (low_quality_count / len(test_decisions))
    assert quality_rate >= 0.85  # 85% of explanations should score ≥3/5
```
Cost Consideration: ~$0.50-$1.00 per 1,000 evaluations (using GPT-4o-mini).
3. Human-in-the-Loop Evaluation
Use Cases:
- Domain-specific correctness validation
- Safety and appropriateness judgments
- Nuanced quality assessment
- Training data collection for automated evaluators
Implementation:
```python
# tests/evals/human/review_queue.py
import random
from datetime import datetime


class HumanReviewQueue:
    """Queue random decisions for human expert review."""

    def __init__(self, sampling_rate: float = 0.01):
        self.sampling_rate = sampling_rate  # 1% of decisions
        self.queue = []

    def maybe_enqueue(self, application: LoanApplication, decision: LoanDecision):
        """Randomly sample decisions for human review."""
        if random.random() < self.sampling_rate:
            self.queue.append({
                "application_id": application.applicant_id,
                "decision": decision,
                "timestamp": datetime.utcnow(),
                "status": "PENDING_REVIEW",
            })

    def get_pending_reviews(self) -> list[dict]:
        """Get decisions waiting for human review."""
        return [r for r in self.queue if r["status"] == "PENDING_REVIEW"]

    def record_human_feedback(
        self,
        application_id: str,
        human_decision: bool,
        human_reasoning: str,
        reviewer_id: str,
    ):
        """Record expert reviewer's assessment."""
        for review in self.queue:
            if review["application_id"] == application_id:
                review["human_decision"] = human_decision
                review["human_reasoning"] = human_reasoning
                review["reviewer_id"] = reviewer_id
                review["status"] = "REVIEWED"
                # Calculate agreement
                agent_decision = review["decision"].approved
                review["agreement"] = (agent_decision == human_decision)
                break


# Usage
review_queue = HumanReviewQueue(sampling_rate=0.01)

# During processing
decision = orchestrator.process(application)
review_queue.maybe_enqueue(application, decision)

# Weekly review session
pending = review_queue.get_pending_reviews()
# Present to human reviewers via UI
```
Frequency: Weekly review sessions, 1% sampling rate (manageable volume).
4. Simulation-Based Testing
Use Cases:
- Test edge cases systematically
- Generate diverse test cases covering user personas
- Consistency testing across multiple runs
- Pre-production validation
Implementation:
```python
# tests/evals/simulation/loan_application_generator.py
import random
import uuid


class LoanApplicationGenerator:
    """Generate synthetic loan applications for testing."""

    PERSONAS = {
        "perfect_borrower": {
            "credit_score": (750, 850),
            "monthly_income": (8000, 15000),
            "monthly_debt": (500, 2000),
            "employment_duration": (24, 120),  # months
        },
        "risky_borrower": {
            "credit_score": (580, 640),
            "monthly_income": (2000, 3500),
            "monthly_debt": (1500, 2500),
            "employment_duration": (3, 12),
        },
        "borderline_borrower": {
            "credit_score": (620, 680),
            "monthly_income": (3500, 5500),
            "monthly_debt": (1200, 2000),
            "employment_duration": (12, 36),
        },
    }

    def generate_application(
        self,
        persona: str,
        loan_amount: float,
        seed: int | None = None,
    ) -> LoanApplication:
        """Generate synthetic application matching persona."""
        if seed is not None:  # "if seed:" would silently skip seeding when seed == 0
            random.seed(seed)
        profile = self.PERSONAS[persona]
        return LoanApplication(
            applicant_id=str(uuid.uuid4()),
            credit_score=random.randint(*profile["credit_score"]),
            monthly_income=random.uniform(*profile["monthly_income"]),
            monthly_debt=random.uniform(*profile["monthly_debt"]),
            employment_duration_months=random.randint(*profile["employment_duration"]),
            requested_loan_amount=loan_amount,
        )

    def generate_test_suite(self, cases_per_persona: int = 100) -> list[LoanApplication]:
        """Generate comprehensive test suite."""
        test_suite = []
        for persona in self.PERSONAS:
            for i in range(cases_per_persona):
                # Vary loan amounts
                loan_amount = random.choice([10000, 25000, 50000, 75000, 100000])
                app = self.generate_application(
                    persona=persona,
                    loan_amount=loan_amount,
                    seed=i,  # Reproducible
                )
                test_suite.append(app)
        return test_suite


# Usage
def test_agent_consistency():
    """Test whether the agent produces consistent decisions for the same input."""
    generator = LoanApplicationGenerator()
    app = generator.generate_application("borderline_borrower", 30000, seed=42)
    # Run 10 times
    decisions = []
    for _ in range(10):
        decision = orchestrator.process(app)
        decisions.append(decision.approved)
    # Check consistency
    unique_decisions = set(decisions)
    assert len(unique_decisions) == 1, "Decision should be consistent for the same input"
```
Benefits: Systematic coverage, reproducibility, scalability.
Offline vs. Online Testing
Offline Testing (Development & Staging)
Definition: Evaluation in controlled environment using curated test datasets before deployment.
When: During development, before releases, in CI/CD pipelines.
Test Types:
1. Unit Tests: Individual agent behavior
2. Integration Tests: Multi-agent workflow
3. Regression Tests: Prevent breaking changes
4. Simulation Tests: Edge cases and adversarial inputs
5. Benchmark Tests: GAIA, TAU2 (if applicable)
Example Workflow:
```python
# tests/evals/offline/regression_suite.py
class RegressionTestSuite:
    """Prevent regressions in agent behavior."""

    def __init__(self):
        self.baseline_metrics = self._load_baseline()
        self.test_dataset = self._load_test_dataset()

    async def run_regression_tests(self) -> RegressionReport:
        """Compare current version to baseline."""
        current_metrics = {}
        # Run all test cases
        for test_case in self.test_dataset:
            decision = await orchestrator.process(test_case["application"])
            # Record metrics
            current_metrics[test_case["id"]] = {
                "decision": decision.approved,
                "confidence": decision.confidence_score,
                "reasoning": decision.reasoning,
            }
        # Calculate deltas
        regressions = []
        improvements = []
        for test_id, current in current_metrics.items():
            baseline = self.baseline_metrics[test_id]
            # Check for decision flips
            if current["decision"] != baseline["decision"]:
                regressions.append({
                    "test_id": test_id,
                    "issue": "DECISION_FLIP",
                    "baseline": baseline["decision"],
                    "current": current["decision"],
                })
            # Check for confidence drops
            confidence_delta = current["confidence"] - baseline["confidence"]
            if confidence_delta < -0.10:  # 10% drop
                regressions.append({
                    "test_id": test_id,
                    "issue": "CONFIDENCE_DROP",
                    "delta": confidence_delta,
                })
        return RegressionReport(
            total_tests=len(self.test_dataset),
            regressions=regressions,
            improvements=improvements,
            passed=len(regressions) == 0,
        )


# CI/CD integration
async def test_no_regressions():
    suite = RegressionTestSuite()
    report = await suite.run_regression_tests()
    if not report.passed:
        print(f"❌ {len(report.regressions)} regressions detected:")
        for reg in report.regressions:
            print(f"  - Test {reg['test_id']}: {reg['issue']}")
        assert False, "Regressions detected - do not merge"
```
Success Criteria:
- All unit tests pass (100%)
- Integration tests pass (≥95%)
- No regressions vs. previous version
- Metrics meet quality gates
Online Testing (Production)
Definition: Evaluation in live, real-world environment during actual usage.
When: Continuously in production.
Monitoring Types:
1. Real-Time Metrics: Latency, error rates, throughput
2. Auto-Evaluation: LLM-judge on production logs (sampled)
3. User Feedback: Explicit ratings, implicit signals
4. Anomaly Detection: Sudden metric changes
5. A/B Testing: Compare agent versions in production
Example Implementation:
```python
# apps/api/monitoring/online_evaluation.py
import asyncio
import random
import time
from datetime import datetime

from agent_framework.observability import get_tracer, get_meter


class OnlineEvaluationService:
    """Continuous evaluation in production."""

    def __init__(self, sampling_rate: float = 0.05):
        self.sampling_rate = sampling_rate  # Evaluate 5% of decisions
        self.tracer = get_tracer()
        self.meter = get_meter()
        # Metrics
        self.decision_counter = self.meter.create_counter("decisions_total")
        self.approval_rate_gauge = self.meter.create_gauge("approval_rate")
        self.latency_histogram = self.meter.create_histogram("decision_latency_seconds")
        self.quality_scores = self.meter.create_histogram("decision_quality_score")

    async def evaluate_decision(
        self,
        application: LoanApplication,
        decision: LoanDecision,
        latency: float,
    ):
        """Evaluate production decision."""
        with self.tracer.start_as_current_span("online_evaluation") as span:
            # Record basic metrics
            self.decision_counter.add(1, {
                "approved": str(decision.approved),
                "loan_amount_bucket": self._bucket_loan_amount(decision.loan_amount),
            })
            self.latency_histogram.record(latency)
            # Sample for detailed evaluation
            if random.random() < self.sampling_rate:
                quality_score = await self._evaluate_quality(decision)
                self.quality_scores.record(quality_score)
                span.set_attribute("quality_score", quality_score)
                # Alert if low quality
                if quality_score < 3.0:
                    await self._alert_low_quality(application, decision, quality_score)

    async def _evaluate_quality(self, decision: LoanDecision) -> float:
        """LLM-based quality evaluation."""
        judge = ExplanationQualityJudge()
        scores = await judge.evaluate(decision)
        return scores["overall_score"]

    async def _alert_low_quality(
        self,
        application: LoanApplication,
        decision: LoanDecision,
        score: float,
    ):
        """Alert on low-quality decisions."""
        alert = {
            "severity": "WARNING",
            "type": "LOW_QUALITY_DECISION",
            "application_id": application.applicant_id,
            "quality_score": score,
            "decision": decision.approved,
            "timestamp": datetime.utcnow(),
        }
        # Send to monitoring system (Azure Monitor, Slack, etc.)
        await send_alert(alert)


# Integration in API
from apps.api.monitoring.online_evaluation import OnlineEvaluationService

evaluator = OnlineEvaluationService(sampling_rate=0.05)


@app.post("/api/loan-application/submit")
async def submit_loan_application(application: LoanApplication):
    start_time = time.time()
    # Process application
    decision = await orchestrator.process(application)
    latency = time.time() - start_time
    # Online evaluation (async, doesn't block response)
    asyncio.create_task(
        evaluator.evaluate_decision(application, decision, latency)
    )
    return decision
```
Success Criteria:
- Latency P95 < 10 seconds
- Error rate < 1%
- Quality score ≥ 3.5/5
- No anomalies detected
Continuous Evaluation Loop
Best Practice: Combine offline and online testing in continuous loop.
```
┌─────────────────────────────────────────────────────────────────┐
│                   Continuous Evaluation Loop                    │
└─────────────────────────────────────────────────────────────────┘
1. Offline Evaluation (Pre-Deployment)
   ├─ Run regression tests
   ├─ Simulate edge cases
   ├─ Check quality gates
   └─ ✅ Pass → Deploy new version
2. Deploy to Production
   ├─ Gradual rollout (canary deployment)
   ├─ Monitor real-time metrics
   └─ Auto-evaluate sampled decisions
3. Online Monitoring (Production)
   ├─ Collect user feedback
   ├─ Detect anomalies
   ├─ Identify failure patterns
   └─ Flag low-quality decisions
4. Collect Failure Examples
   ├─ Production incidents
   ├─ Low-quality decisions
   ├─ User complaints
   └─ Edge cases discovered
5. Add to Offline Test Set
   ├─ Create regression tests for failures
   ├─ Expand edge case coverage
   ├─ Update quality benchmarks
   └─ Loop back to step 1
```
Implementation:
```python
# scripts/sync_prod_failures_to_tests.py
from datetime import datetime


class FailureTestSyncService:
    """Sync production failures to offline test suite."""

    async def sync_failures(self):
        """Weekly job to update test suite with production failures."""
        # Get production failures from last week
        failures = await self.get_production_failures(days=7)
        # Convert to test cases
        new_tests = []
        for failure in failures:
            test_case = {
                "id": f"prod_failure_{failure['id']}",
                "application": failure["application"],
                "expected_behavior": "Should not produce low-quality decision",
                "source": "production_failure",
                "date_added": datetime.utcnow(),
            }
            new_tests.append(test_case)
        # Add to regression suite
        await self.add_to_regression_suite(new_tests)
        print(f"✅ Added {len(new_tests)} production failures to test suite")
```
Error Analysis & Failure Taxonomy
Agent Error Taxonomy (AgentErrorTaxonomy)
Based on research paper "Where LLM Agents Fail and How They can Learn From Failures" (2024).
5 Major Categories:
1. Memory Errors
- Symptom: Agent forgets context, repeats actions, loses track of state
- Example: Re-verifying income after already verifying
- Detection: Check for duplicate tool calls in trajectory
- Mitigation: Improve context management, use state tracking
2. Reflection Errors
- Symptom: Agent doesn't learn from mistakes, repeats failed approaches
- Example: Repeatedly calling same MCP server with same params after error
- Detection: Analyze retry patterns in failed trajectories
- Mitigation: Add reflection prompts, implement backtracking
3. Planning Errors
- Symptom: Illogical action sequence, skips required steps
- Example: Making final decision before consulting all required agents
- Detection: Verify workflow completeness
- Mitigation: Enforce workflow constraints, add planning validation
4. Action Errors
- Symptom: Wrong tool selection, incorrect parameters
- Example: Calling `verify_employment` with SSN instead of applicant_id
- Detection: Tool call validation middleware (see AI Security doc)
- Mitigation: Schema enforcement, parameter validation
5. System-Level Errors
- Symptom: Timeouts, API failures, rate limits
- Example: MCP server unavailable, Azure OpenAI quota exceeded
- Detection: Exception monitoring, retry tracking
- Mitigation: Graceful degradation, retry logic, circuit breakers
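The retry mitigation above can be sketched as a small backoff helper; the parameters are illustrative, and a production version would catch only transient error types and pair this with a circuit breaker:

```python
import random
import time

def call_with_retries(fn, max_attempts: int = 3, base_delay: float = 0.5):
    """Retry a flaky call (e.g. an MCP server request) with exponential
    backoff plus jitter; re-raise once attempts are exhausted."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # exhausted: surface the error to the caller
            # Backoff doubles each attempt; jitter avoids synchronized retries
            time.sleep(base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.05))
```

Wrapping each external call this way keeps a single transient timeout (a system-level error) from becoming the root cause of a cascading failure downstream.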
Cascading Failure Detection
Key Research Finding: Single root-cause error propagates through subsequent decisions, leading to task failure.
Example Cascade:
1. Memory Error: Agent forgets credit score was already retrieved
↓
2. Action Error: Calls credit verification tool again with wrong params
↓
3. Planning Error: Skips income verification due to time spent on duplicate work
↓
4. Final Outcome: Incomplete assessment → incorrect loan decision
Detection Strategy:
```python
# tests/evals/error_analysis/cascade_detector.py
class CascadeFailureDetector:
    """Detect and classify cascading failures."""

    def analyze_trajectory(self, trace: AgentTrace) -> CascadeAnalysis:
        """Identify root cause and cascade pattern."""
        errors = []
        # Step 1: Identify all errors in trajectory
        for step in trace.steps:
            if step.status == "ERROR":
                errors.append({
                    "step_index": step.index,
                    "error_type": self._classify_error(step.error),
                    "agent": step.agent_name,
                    "tool": step.tool_name,
                })
        if len(errors) == 0:
            return CascadeAnalysis(has_cascade=False)
        # Step 2: Identify root cause (first error)
        root_cause = errors[0]
        # Step 3: Analyze propagation
        cascade_path = self._trace_cascade(errors, trace)
        return CascadeAnalysis(
            has_cascade=len(cascade_path) > 1,
            root_cause=root_cause,
            cascade_path=cascade_path,
            severity="CRITICAL" if len(cascade_path) >= 3 else "MODERATE",
        )

    def _classify_error(self, error: Exception) -> str:
        """Classify error into taxonomy category."""
        error_msg = str(error).lower()
        if "duplicate" in error_msg or "already" in error_msg:
            return "MEMORY_ERROR"
        elif "timeout" in error_msg or "rate limit" in error_msg:
            return "SYSTEM_ERROR"
        elif "invalid parameter" in error_msg:
            return "ACTION_ERROR"
        elif "missing step" in error_msg:
            return "PLANNING_ERROR"
        else:
            return "UNKNOWN_ERROR"


# Usage in tests
def test_no_cascading_failures():
    """Ensure failures don't cascade."""
    detector = CascadeFailureDetector()
    # Run all test cases
    failures = []
    for test_case in load_test_suite():
        try:
            with capture_agent_trace() as trace:
                decision = orchestrator.process(test_case)
        except Exception:
            analysis = detector.analyze_trajectory(trace)
            if analysis.has_cascade:
                failures.append(analysis)
    # Report cascading failures
    if failures:
        print(f"❌ {len(failures)} cascading failures detected:")
        for failure in failures:
            print(f"  Root cause: {failure.root_cause['error_type']}")
            print(f"  Cascade length: {len(failure.cascade_path)}")
    assert len(failures) == 0, "No cascading failures allowed"
```
Multi-Agent Failure Taxonomy (MAST)
Based on research "Why Do Multi-Agent LLM Systems Fail?" (2024).
3 Categories, 14 Failure Modes:
Category 1: Specification Issues
- Incomplete Instructions: Agent lacks necessary context
- Ambiguous Goals: Unclear success criteria
- Conflicting Constraints: Impossible requirements
Example: Orchestrator told to "approve quickly" but also "thoroughly assess all risks" (conflicting).
Category 2: Inter-Agent Misalignment
- Ignoring Peer Input: Agent disregards other agents' assessments
- Incorrect Assumptions: Agent operates on wrong premises
- Communication Breakdown: Agents misinterpret each other
Example: Risk agent ignores credit agent's "high risk" assessment and still recommends approval.
Category 3: Task Verification
- Missing Verification Steps: No validation of outputs
- Incorrect Success Criteria: Wrong definition of task completion
- Premature Termination: Agent stops before task is done
Example: Orchestrator returns decision without verifying all agents completed assessments.
Key Insight: Many failures stem from poor system design, not model performance.
Mitigation: Better orchestration strategies, clear role definitions, explicit verification steps.
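An explicit verification step can be as simple as a completeness check the orchestrator runs before emitting a decision. The agent names follow the workflow example earlier in this document; the helper itself is an illustrative sketch:

```python
# Required assessments before any final decision (names from the workflow above)
REQUIRED_AGENTS = ["intake", "credit", "income", "risk"]

def verify_workflow_complete(completed_steps: list[str]) -> list[str]:
    """Explicit verification step: return the agents that have not yet
    reported, preserving the required order. Empty list = safe to decide."""
    return [agent for agent in REQUIRED_AGENTS if agent not in completed_steps]

# The orchestrator should refuse to emit a decision (premature termination)
# while this list is non-empty.
```

A check like this directly targets the "Missing Verification Steps" and "Premature Termination" failure modes without touching the model at all.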
Root Cause Analysis Framework
```python
# tests/evals/error_analysis/root_cause_analyzer.py
class RootCauseAnalyzer:
    """Systematic root cause analysis for failures."""

    # root_causes and fixes are parallel lists: fixes[i] addresses root_causes[i]
    ERROR_TAXONOMY = {
        "MEMORY_ERROR": {
            "indicators": ["duplicate action", "context loss", "forgotten state"],
            "root_causes": ["insufficient context window", "poor state tracking"],
            "fixes": ["increase context", "add explicit state management"],
        },
        "PLANNING_ERROR": {
            "indicators": ["skipped step", "wrong order", "incomplete workflow"],
            "root_causes": ["unclear instructions", "missing constraints"],
            "fixes": ["clarify persona instructions", "add workflow validation"],
        },
        "ACTION_ERROR": {
            "indicators": ["wrong tool", "invalid parameters", "malformed request"],
            "root_causes": ["poor tool descriptions", "weak validation"],
            "fixes": ["improve tool docs", "add parameter validation"],
        },
    }

    def analyze_failure(self, trace: AgentTrace, error: Exception) -> RootCauseReport:
        """Identify root cause and recommend fixes."""
        # Step 1: Classify error
        error_type = self._classify_error(error)
        # Step 2: Extract indicators from trace
        indicators = self._extract_indicators(trace, error_type)
        # Step 3: Match to known root causes
        taxonomy_entry = self.ERROR_TAXONOMY[error_type]
        matched_causes = []
        for indicator in indicators:
            for root_cause in taxonomy_entry["root_causes"]:
                if self._indicator_matches_cause(indicator, root_cause):
                    matched_causes.append(root_cause)
        # Step 4: Recommend the fix paired with each matched root cause
        recommended_fixes = []
        for cause in set(matched_causes):
            idx = taxonomy_entry["root_causes"].index(cause)
            recommended_fixes.append(taxonomy_entry["fixes"][idx])
        return RootCauseReport(
            error_type=error_type,
            indicators=indicators,
            root_causes=matched_causes,
            recommended_fixes=recommended_fixes,
            trace_id=trace.trace_id,
        )


# Usage
analyzer = RootCauseAnalyzer()
for failure in production_failures:
    report = analyzer.analyze_failure(failure.trace, failure.error)
    print(f"Failure: {failure.id}")
    print(f"  Error Type: {report.error_type}")
    print(f"  Root Causes: {report.root_causes}")
    print(f"  Recommended Fixes: {report.recommended_fixes}")
```
Microsoft Agent Framework Lab Integration
GAIA Benchmark
What It Tests: General AI Assistant capabilities requiring reasoning, multi-modality, web browsing, tool-use proficiency.
Dataset: 466 human-annotated tasks with unambiguous answers.
Performance: Humans 92% success, GPT-4 15% success (stark gap).
When to Use: Validate general agent capabilities, benchmark against industry standards.
Integration (Experimental - Lab Module):
```python
# tests/benchmarks/test_gaia.py
from agent_framework.lab.gaia import GAIABenchmark


class GAIAEvaluation:
    """Benchmark agents against GAIA."""

    def __init__(self):
        self.benchmark = GAIABenchmark()
        self.benchmark.load_dataset()  # 466 tasks

    async def run_benchmark(self) -> GAIAResults:
        """Run full GAIA benchmark."""
        results = []
        for task in self.benchmark.tasks:
            try:
                # Adapt task to loan domain (if applicable)
                if self._is_relevant_to_loans(task):
                    result = await self._run_task(task)
                    results.append(result)
            except Exception as e:
                results.append({
                    "task_id": task.id,
                    "success": False,
                    "error": str(e),
                })
        success_rate = sum(1 for r in results if r["success"]) / len(results)
        return GAIAResults(
            total_tasks=len(results),
            success_rate=success_rate,
            results=results,
        )


# Note: GAIA may not be directly applicable to loan processing.
# Consider adapting relevant tasks or using it as a general capability check.
```
Applicability to Loan Defenders: Limited (GAIA is general-purpose, not loan-specific). Use for general capability validation, not primary evaluation.
TAU2 Benchmark (τ²-bench)
What It Tests: Conversational agents in dual-control environments (both agent and user use tools).
Domains: Retail, airline, telecom customer support.
Performance: GPT-4.1 34%, o4-mini 42%, Claude-3.7 49% pass@1 (very challenging).
When to Use: Test conversational flow, tool use in dynamic environments.
Applicability to Loan Defenders: Moderate relevance - loan processing involves conversational elements and dynamic decision-making.
Integration (Experimental - Lab Module):
```python
# tests/benchmarks/test_tau2.py
from agent_framework.lab.tau2 import TAU2Benchmark

# Note: TAU2 is customer-support focused, not loan-specific.
# Consider creating a custom loan-processing benchmark instead.
```
Recommendation: TAU2 is valuable for conversational agent evaluation but not directly applicable. Consider creating custom loan-processing benchmark inspired by TAU2 methodology.
Custom Loan Processing Benchmark
Proposal: Create loan-specific benchmark dataset.
```python
# tests/benchmarks/loan_defenders_benchmark.py
import random


class LoanDefendersBenchmark:
    """Custom benchmark for loan processing agents."""

    BENCHMARK_DATASET = {
        "simple_cases": 100,  # Clear approve/deny
        "edge_cases": 50,     # Borderline decisions
        "adversarial": 30,    # Attack attempts
        "regulatory": 20,     # Compliance scenarios
    }

    def __init__(self):
        self.dataset = self._load_benchmark_dataset()
        self.expert_labels = self._load_expert_judgments()

    def _load_benchmark_dataset(self) -> list[LoanApplication]:
        """Load curated benchmark applications."""
        # Generated using LoanApplicationGenerator + expert curation
        dataset = []
        generator = LoanApplicationGenerator()
        # Simple cases: Clear outcomes
        for i in range(100):
            app = generator.generate_application(
                persona=random.choice(["perfect_borrower", "risky_borrower"]),
                loan_amount=random.choice([10000, 30000, 50000]),
                seed=i,
            )
            dataset.append(app)
        # Edge cases: Borderline scenarios
        for i in range(50):
            app = generator.generate_application(
                persona="borderline_borrower",
                loan_amount=random.choice([25000, 50000, 75000]),
                seed=100 + i,
            )
            dataset.append(app)
        # Adversarial: Injection attempts, malformed data
        dataset.extend(load_adversarial_test_cases())
        # Regulatory: ECOA/FCRA compliance scenarios
        dataset.extend(load_regulatory_test_cases())
        return dataset

    async def run_benchmark(self) -> BenchmarkResults:
        """Run full benchmark suite."""
        results = {
            "accuracy": 0.0,
            "precision": 0.0,
            "recall": 0.0,
            "regulatory_compliance": 0.0,
            "adversarial_resistance": 0.0,
        }
        # Evaluate each category
        for category, count in self.BENCHMARK_DATASET.items():
            category_results = await self._evaluate_category(category)
            results[f"{category}_performance"] = category_results
        # Calculate aggregate metrics
        results["overall_score"] = self._calculate_overall_score(results)
        return BenchmarkResults(**results)


# Usage
benchmark = LoanDefendersBenchmark()
results = await benchmark.run_benchmark()
print("Benchmark Results:")
print(f"  Overall Score: {results.overall_score:.1%}")
print(f"  Accuracy: {results.accuracy:.1%}")
print(f"  Regulatory Compliance: {results.regulatory_compliance:.1%}")
```
Benefit: Loan-specific, actionable, tracks progress over time.
Progressive Implementation Plan
Phase 1: Foundation (Weeks 1-3)
Goal: Establish basic offline testing infrastructure.
Week 1: Test Infrastructure Setup
- [ ] Create tests/evals/ directory structure
- [ ] Set up test dataset generation (LoanApplicationGenerator)
- [ ] Implement basic accuracy tests (Layer 1 metrics)
- [ ] Add to CI/CD pipeline (GitHub Actions)
Week 2: Offline Evaluation Suite
- [ ] Implement regression test suite
- [ ] Create expert-labeled dataset (50-100 cases)
- [ ] Build decision quality tests (accuracy, precision, recall)
- [ ] Add workflow completeness tests (orchestration)
Week 3: Automation & Reporting
- [ ] Automate test execution in CI/CD
- [ ] Generate evaluation reports (HTML/Markdown)
- [ ] Define quality gates (fail build if thresholds not met)
- [ ] Document evaluation process
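As a sketch of the Week 3 quality gates, the CI job can compare evaluation metrics against thresholds and fail the build on any breach. The metric names and threshold values below are illustrative assumptions, not the project's final gates:

```python
from typing import Dict, List

# Assumed gate names/thresholds; tune these to the project's own targets.
QUALITY_GATES = {
    "accuracy": 0.90,              # minimum decision accuracy
    "regression_pass_rate": 1.00,  # zero regression failures allowed
    "benchmark_score": 0.85,       # custom benchmark threshold
}

def check_quality_gates(metrics: Dict[str, float],
                        gates: Dict[str, float] = QUALITY_GATES) -> List[str]:
    """Return names of failed gates; an empty list means the build passes."""
    return [name for name, threshold in gates.items()
            if metrics.get(name, 0.0) < threshold]

failures = check_quality_gates({"accuracy": 0.92,
                                "regression_pass_rate": 1.0,
                                "benchmark_score": 0.81})
# benchmark_score falls below 0.85, so CI should fail the build
```

In CI, a non-empty result would translate to a non-zero exit code so the pipeline stops before deployment.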
Deliverables:
- ✅ 300+ automated test cases
- ✅ CI/CD integration with quality gates
- ✅ Evaluation dashboard
Effort: ~3 weeks, 1-2 engineers
Phase 2: Advanced Evaluation (Weeks 4-7)
Goal: Add LLM-as-judge, simulation, and error analysis.
Week 4: LLM-as-Judge Evaluation
- [ ] Implement ExplanationQualityJudge
- [ ] Build semantic similarity evaluators
- [ ] Add tone/compliance judges
- [ ] Optimize judging costs (use GPT-4o-mini)
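The grading side of LLM-as-judge stays deterministic and testable if prompt construction is separated from reply parsing. A minimal sketch: the rubric wording, JSON reply format, and helper names are assumptions, and the model call itself is omitted:

```python
import json

# Assumed rubric; the real ExplanationQualityJudge would define its own.
JUDGE_RUBRIC = ('Rate the loan decision explanation 1-5 on clarity, '
                'completeness, and regulatory tone. '
                'Reply as JSON: {"score": <int>, "reason": "..."}')

def build_judge_prompt(explanation: str) -> str:
    """Compose the rubric plus the explanation under evaluation."""
    return f"{JUDGE_RUBRIC}\n\nExplanation to grade:\n{explanation}"

def parse_judge_reply(reply: str, default: int = 1) -> int:
    """Extract and clamp the judge's score; fall back to the lowest
    score on malformed output rather than crashing the eval run."""
    try:
        score = int(json.loads(reply)["score"])
    except (ValueError, KeyError, TypeError):
        return default
    return max(1, min(5, score))
```

Clamping and the defensive fallback matter in practice: judge models occasionally return out-of-range scores or non-JSON text, and a sampled evaluation run should degrade gracefully rather than abort.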
Week 5: Simulation & Edge Cases
- [ ] Expand simulation to cover all personas
- [ ] Generate adversarial test cases
- [ ] Test regulatory compliance scenarios
- [ ] Build consistency test suite
Week 6: Error Analysis Framework
- [ ] Implement AgentErrorTaxonomy
- [ ] Build CascadeFailureDetector
- [ ] Create RootCauseAnalyzer
- [ ] Document common failure patterns
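A first-pass failure classifier can route traces into taxonomy buckets by keyword before manual root-cause analysis. The categories and keyword rules below are illustrative stand-ins, loosely inspired by the AgentErrorTaxonomy paper rather than reproducing it:

```python
from enum import Enum

class ErrorCategory(Enum):
    """Illustrative failure categories; the real taxonomy is richer."""
    TOOL_FAILURE = "tool_failure"
    PLANNING_ERROR = "planning_error"
    HALLUCINATION = "hallucination"
    UNKNOWN = "unknown"

# First-pass keyword rules; ambiguous traces fall through to UNKNOWN
# and get routed to manual root-cause analysis instead.
_RULES = [
    (("timeout", "tool returned", "api error"), ErrorCategory.TOOL_FAILURE),
    (("skipped step", "wrong order", "loop detected"), ErrorCategory.PLANNING_ERROR),
    (("fabricated", "not in context", "unsupported claim"), ErrorCategory.HALLUCINATION),
]

def classify_failure(trace: str) -> ErrorCategory:
    """Assign the first matching category, scanning rules in order."""
    text = trace.lower()
    for keywords, category in _RULES:
        if any(k in text for k in keywords):
            return category
    return ErrorCategory.UNKNOWN
```

Keyword rules are crude, but they give the error-distribution dashboard something to plot on day one; an LLM-based classifier can replace them later without changing the category schema.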
Week 7: Benchmarking
- [ ] Create LoanDefendersBenchmark
- [ ] Curate 200-case benchmark dataset
- [ ] Get expert labels for benchmark
- [ ] Baseline current performance
Deliverables:
- ✅ LLM-based quality evaluation
- ✅ Comprehensive simulation coverage
- ✅ Error taxonomy and analysis tools
- ✅ Custom loan benchmark
Effort: ~4 weeks, 2 engineers
Phase 3: Online Monitoring (Weeks 8-11)
Goal: Production monitoring and continuous evaluation.
Week 8: Observability Integration
- [ ] Integrate Agent Framework observability (replace custom)
- [ ] Configure Application Insights
- [ ] Set up real-time metrics dashboard
- [ ] Define alerting thresholds
Week 9: Online Evaluation Service
- [ ] Build OnlineEvaluationService
- [ ] Implement sampling (5% of decisions)
- [ ] Add LLM-judge for production decisions
- [ ] Create low-quality alert system
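Hash-based sampling keeps the 5% selection deterministic, so a retried or replayed decision is either always or never evaluated, and the rate can be tuned without coordination. A minimal sketch (the `should_sample` helper is hypothetical):

```python
import hashlib

def should_sample(decision_id: str, rate: float = 0.05) -> bool:
    """Deterministically select ~`rate` of decisions by hashing the ID,
    so the same decision always gets the same answer."""
    digest = hashlib.sha256(decision_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate

# Across many IDs the selection converges on the configured rate
sampled = sum(should_sample(f"decision-{i}") for i in range(10_000))
```

Deterministic sampling also makes online and offline evaluation comparable: the same 5% slice can be re-judged after a prompt change to measure drift.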
Week 10: User Feedback Loop
- [ ] Build HumanReviewQueue (1% sampling)
- [ ] Create reviewer UI for feedback
- [ ] Implement feedback storage
- [ ] Track agreement metrics
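For the agreement metric, Cohen's kappa corrects raw reviewer-vs-agent agreement for what would be expected by chance, which matters when one decision class dominates. A self-contained sketch:

```python
from collections import Counter

def cohen_kappa(reviewer: list, agent: list) -> float:
    """Chance-corrected agreement between human reviewers and agent
    decisions: 1.0 = perfect, 0.0 = no better than chance."""
    n = len(reviewer)
    po = sum(r == a for r, a in zip(reviewer, agent)) / n  # observed agreement
    rc, ac = Counter(reviewer), Counter(agent)
    # Expected agreement if both sides labeled independently at their
    # observed class frequencies
    pe = sum((rc[label] / n) * (ac[label] / n) for label in set(rc) | set(ac))
    return 1.0 if pe == 1.0 else (po - pe) / (1 - pe)
```

Tracking kappa alongside raw agreement keeps a 90%-approval portfolio from masking poor judgment on the contested 10%.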
Week 11: Anomaly Detection
- [ ] Build anomaly detection (approval rate spikes, etc.)
- [ ] Implement automated alerts
- [ ] Create incident response playbook
- [ ] Test alerting system
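Before more sophisticated detection exists, a z-score check against a recent baseline is enough to catch approval-rate spikes. A sketch; the baseline window and threshold are assumptions to tune against real traffic:

```python
from statistics import mean, stdev

def approval_rate_anomaly(history: list, current: float,
                          z_threshold: float = 3.0) -> bool:
    """Flag the current approval rate if it sits more than `z_threshold`
    standard deviations from the recent baseline (e.g. hourly rates)."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current != mu  # flat baseline: any change is notable
    return abs(current - mu) / sigma > z_threshold
```

The same shape of check applies to denial rates, latency, and escalation volume; each metric just needs its own baseline window and threshold.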
Deliverables:
- ✅ Real-time production monitoring
- ✅ Automated quality evaluation (sampled)
- ✅ Human feedback collection
- ✅ Anomaly detection and alerting
Effort: ~4 weeks, 2 engineers
Phase 4: Continuous Improvement (Weeks 12-16)
Goal: Close the loop - prod failures → test suite → iteration.
Week 12: Failure Sync Pipeline
- [ ] Build FailureTestSyncService
- [ ] Automate prod failure → test case conversion
- [ ] Schedule weekly sync job
- [ ] Track test suite growth
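The prod-failure-to-test-case conversion can start as a straightforward mapping from a trace record onto a regression fixture. A sketch with an assumed trace schema; the field names and `RegressionCase` shape are illustrative, not the real pipeline's types:

```python
from dataclasses import dataclass

@dataclass
class RegressionCase:
    """Minimal regression record derived from a production failure."""
    case_id: str
    application: dict
    expected_decision: str  # corrected label from root-cause analysis
    failure_mode: str       # taxonomy category, for trend analysis

def failure_to_test_case(trace: dict, corrected_decision: str) -> RegressionCase:
    """Map a production failure trace onto a replayable test fixture."""
    return RegressionCase(
        case_id=f"prod-{trace['trace_id']}",
        application=trace["input"],
        expected_decision=corrected_decision,
        failure_mode=trace.get("failure_mode", "unknown"),
    )

case = failure_to_test_case(
    {"trace_id": "abc123", "input": {"loan_amount": 50000},
     "failure_mode": "tool_failure"},
    corrected_decision="deny",
)
```

Keeping the failure-mode label on each fixture lets the weekly sync job report which taxonomy categories are growing the test suite fastest.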
Week 13: Advanced Analytics
- [ ] Build failure pattern analysis dashboard
- [ ] Implement trend detection (metrics over time)
- [ ] Create agent performance leaderboard
- [ ] Generate monthly evaluation reports
Week 14: A/B Testing Infrastructure
- [ ] Implement canary deployments (10% traffic)
- [ ] Build comparison framework (control vs. treatment)
- [ ] Add statistical significance testing
- [ ] Create rollback automation
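For the significance test, a two-proportion z-test on approval (or error) rates between control and treatment is a reasonable starting point. A minimal sketch:

```python
from math import sqrt

def two_proportion_z(success_a: int, n_a: int,
                     success_b: int, n_b: int) -> float:
    """z-statistic comparing the success rates of two variants;
    |z| > 1.96 corresponds to significance at the 5% level."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se
```

With a 10% canary, sample sizes are small, so the rollback automation should require both statistical significance and a minimum observation count before acting.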
Week 15: Fair Lending Evaluations
- [ ] Implement disparate impact testing (4/5ths rule)
- [ ] Build matched-pair test generator
- [ ] Add ECOA/FCRA compliance checks
- [ ] Schedule monthly fair lending audits
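The 4/5ths rule compares the selection rate of a protected group to that of the most favored (reference) group; a ratio below 0.8 is conventionally treated as evidence of disparate impact. A minimal sketch:

```python
def adverse_impact_ratio(protected_rate: float, reference_rate: float) -> float:
    """Selection-rate ratio used in the 4/5ths (80%) rule."""
    return protected_rate / reference_rate

def violates_four_fifths(protected_rate: float, reference_rate: float) -> bool:
    """True when the ratio falls below the conventional 0.8 threshold."""
    return adverse_impact_ratio(protected_rate, reference_rate) < 0.8

# Approval rate of 50% for the protected group vs. 70% for the
# reference group gives a ratio of ~0.714, below the 0.8 threshold
flagged = violates_four_fifths(0.50, 0.70)
```

The ratio is a screening heuristic, not a legal determination; flagged results should feed the matched-pair tests and the monthly fair lending audit rather than trigger automatic conclusions.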
Week 16: Documentation & Training
- [ ] Comprehensive evaluation documentation
- [ ] Team training on evaluation framework
- [ ] Runbook for common issues
- [ ] Establish evaluation cadence
Deliverables:
- ✅ Continuous improvement pipeline
- ✅ Advanced analytics and trends
- ✅ A/B testing capability
- ✅ Fair lending compliance framework
Effort: ~5 weeks, 2-3 engineers
Tooling & Infrastructure
Required Tools
1. Agent Framework Observability (Built-in)
2. Evaluation Libraries
```shell
pip install pytest pytest-asyncio      # Testing
pip install deepeval                   # G-Eval, RAGAS
pip install rouge-score                # Text similarity
pip install sentence-transformers     # Embeddings
```
3. Monitoring & Alerting
- Azure Application Insights (already deployed)
- Azure Monitor Workbooks (dashboards)
- Azure Logic Apps (alerting, optional)
4. Data Management
- Azure Table Storage (test datasets, benchmark results)
- Azure Blob Storage (traces, logs)
Evaluation Dashboard
Components:
1. CI/CD Status: Current build, test pass rates
2. Regression Tracking: Metrics over time (line charts)
3. Error Distribution: Taxonomy breakdown (pie chart)
4. Benchmark Scores: LoanDefenders benchmark trend
5. Production Health: Real-time metrics, alerts
Tech Stack:
- Azure Monitor Workbooks (primary)
- Custom dashboard (React, optional for advanced features)
Continuous Improvement Loop
Quarterly Evaluation Cycle
Q1: Baseline & Foundation
- Run initial benchmarks
- Establish baseline metrics
- Set quality thresholds
Q2: Iterate & Improve
- Analyze failure patterns
- Implement fixes
- Re-run benchmarks
- Measure improvements
Q3: Expand Coverage
- Add new edge cases
- Increase test dataset size
- Improve evaluation depth
Q4: Compliance & Audit
- Fair lending evaluation
- Regulatory compliance review
- External audit (if applicable)
Monthly Reviews
Evaluation Review Meeting:
- Review metrics trends
- Discuss failure patterns
- Prioritize improvements
- Update test suite
Agenda:
1. Metric review (20 min)
2. Failure analysis (20 min)
3. Improvement proposals (15 min)
4. Action items (5 min)
Continuous Learning
Feedback Sources:
1. Production failures → New test cases
2. User feedback → Quality improvements
3. Regulatory changes → Compliance updates
4. Industry benchmarks → Capability gaps
Process:
```
Production Failure Detected
          ↓
Root Cause Analysis
          ↓
Create Regression Test
          ↓
Fix Implementation
          ↓
Re-Run Evaluation
          ↓
Deploy to Production
          ↓
Monitor for Recurrence
```
Summary
Evaluation Framework Pillars
- Offline Testing: Comprehensive test suite, simulation, benchmarking
- Online Monitoring: Real-time metrics, sampling, anomaly detection
- Error Analysis: Taxonomy, cascade detection, root cause analysis
- Continuous Improvement: Failure sync, trend analysis, iterative enhancement
Success Metrics
Offline (Pre-Deployment):
- Test pass rate: ≥95%
- Regression tests: 0 failures
- Benchmark score: ≥85%
- Expert agreement: ≥90%
Online (Production):
- Latency P95: <10 seconds
- Error rate: <1%
- Quality score: ≥3.5/5
- User satisfaction: ≥4.0/5
Improvement Rate:
- Monthly benchmark improvement: +2-5%
- Quarterly failure reduction: -10-20%
Implementation Timeline
- Phase 1 (Weeks 1-3): Foundation - Offline testing
- Phase 2 (Weeks 4-7): Advanced - LLM-judge, simulation, error analysis
- Phase 3 (Weeks 8-11): Online - Production monitoring
- Phase 4 (Weeks 12-16): Continuous - Improvement pipeline
Total: 12-16 weeks, 2-3 engineers
Cost: ~$100-200/month (Azure services + evaluation APIs)
References
Industry Practices
- Anthropic/Claude: Develop Strong Empirical Evaluations
Microsoft
- Agent Framework Lab
Research Papers
- Where LLM Agents Fail (AgentErrorTaxonomy)
- Why Multi-Agent Systems Fail (MAST Taxonomy)
Benchmarks
- GAIA Benchmark - General AI Assistant
- TAU2 Benchmark - Conversational agents
Last Updated: 2025-10-25 Review Frequency: Quarterly or after major changes Next Review: 2026-01-25