Agent Evaluation Framework: Progressive Implementation Plan
Status: Implementation Roadmap
Last Updated: 2025-10-25
Purpose: Comprehensive evaluation strategy for multi-agent loan processing system
Related Docs: AI Security, AI Security Framework Capabilities
Executive Summary
This document presents a progressive evaluation framework for the Loan Defenders multi-agent system, combining industry best practices from Anthropic/Claude, Microsoft Agent Framework, and academic research on agent failure modes. The framework spans offline testing, online monitoring, error analysis, and continuous improvement.
Key Insight: Agent evaluation differs from traditional software testing. Agents exhibit non-deterministic behavior, complex reasoning chains, and emergent failure modes requiring specialized evaluation techniques.
Implementation Timeline: 4 phases over 12-16 weeks, starting with basic offline testing and progressing to production monitoring and continuous learning.
Table of Contents
- Evaluation Philosophy
- Evaluation Layers & Metrics
- Testing Strategies
- Offline vs. Online Testing
- Error Analysis & Failure Taxonomy
- Microsoft Agent Framework Lab Integration
- Progressive Implementation Plan
- Tooling & Infrastructure
- Continuous Improvement Loop
Evaluation Philosophy
Core Principles (from Anthropic)
1. Evaluation is Part of the Entire Production Lifecycle
Evaluation is not a one-time validation step but a continuous cycle integral to improving agent performance. Every prompt change, model update, or architecture modification requires re-evaluation.
2. Volume Over Perfection
Anthropic recommendation: "More questions with slightly lower signal automated grading is better than fewer questions with high-quality human hand-graded evals."
Key: Prioritize scale and automation over perfect precision.
3. Real-World Task Distribution
Evaluations must mirror actual use cases, including:
- ✅ Typical loan applications (happy path)
- ✅ Edge cases (zero income, perfect credit + high debt)
- ✅ Adversarial inputs (prompt injection attempts)
- ✅ Ambiguous scenarios (missing data, conflicting signals)
- ✅ Poor user input (typos, format errors)
4. Automation First
Structure tests to enable automated grading through:
- Multiple-choice formats
- String matching
- Code-based evaluation
- LLM-based assessment
Use human grading only when absolutely necessary; it is the slowest and most expensive option.
Why Agent Evaluation is Different
| Traditional Software | AI Agents |
|---|---|
| Deterministic outputs | Non-deterministic behavior |
| Unit tests with exact matches | Semantic similarity checks |
| Single execution path | Multiple valid reasoning paths |
| Predictable failure modes | Cascading error propagation |
| Binary pass/fail | Graded quality assessment |
Implication: Agents require probabilistic evaluation (pass@k metrics), trajectory analysis, and a working understanding of failure taxonomies rather than binary pass/fail tests.
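Because agent runs are non-deterministic, a single passing run proves little. pass@k quantifies this: the probability that at least one of k sampled runs succeeds, estimated without bias from n total runs of which c succeeded. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples, drawn from n recorded attempts (c successful), succeeds.
    Computed as 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Fewer failures than samples: every draw of k must include a success
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 10 recorded runs and 4 successes, pass@1 is 0.4 while pass@5 is above 0.97, which is why reporting only single-run accuracy understates a stochastic agent's usable reliability.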
Evaluation Layers & Metrics
Based on industry best practices, we evaluate across 4 layers of our agent system:
Layer 1: Model & LLM Performance
Metrics:
- Accuracy: How often agent outputs match expected results (%)
- Latency: Response time from query to final response (seconds)
- Cost: Token usage and API call expenses ($/request)
- Robustness: Performance consistency across diverse inputs (variance)
Example Test:
```python
# Test credit agent accuracy on known cases
def test_credit_agent_accuracy():
    test_cases = [
        {"credit_score": 750, "debt": 10000, "expected": "LOW_RISK"},
        {"credit_score": 620, "debt": 25000, "expected": "MEDIUM_RISK"},
        {"credit_score": 580, "debt": 50000, "expected": "HIGH_RISK"},
    ]
    correct = 0
    for case in test_cases:
        result = credit_agent.assess(case)
        if result.risk_level == case["expected"]:
            correct += 1
    accuracy = correct / len(test_cases)
    assert accuracy >= 0.90  # 90% accuracy threshold
```
Layer 2: Orchestration & Reasoning
Metrics:
- Agent Trajectory Quality: Logical sequence of actions and reasoning paths
- Tool Selection Accuracy: Correct MCP server invocation with appropriate parameters
- Step Completion: Successful execution of required workflow steps (%)
- Step Utility: Whether each action meaningfully advances task completion
Example Test:
```python
# Test orchestrator workflow completeness
def test_orchestrator_workflow():
    application = create_test_application()
    # Capture agent trajectory
    with capture_agent_trace() as trace:
        decision = orchestrator.process(application)
    # Verify all required agents were consulted
    expected_agents = ["intake", "credit", "income", "risk"]
    invoked_agents = [step.agent_name for step in trace.steps]
    assert set(expected_agents).issubset(set(invoked_agents))
    # Verify logical order
    assert trace.steps[0].agent_name == "intake"  # First step
    assert trace.steps[-1].agent_name == "risk"   # Last assessment
```
Layer 3: Knowledge & Retrieval
Metrics (if using RAG or knowledge bases):
- Context Relevance: Retrieved information addresses user questions (%)
- Context Precision: Information density and signal-to-noise ratio
- Context Recall: Completeness of relevant information retrieval (%)
- Faithfulness: Claims supported by provided sources (%)
Note: Currently not applicable to Loan Defenders (agents use MCP servers, not RAG).
Layer 4: Application & Business Outcomes
Metrics (most important):
- Task Success Rate: Binary or graded task completion (%)
- Decision Quality: Agreement with expert human judgments (%)
- Semantic Similarity: Meaning-based output comparison using embeddings
- Regulatory Compliance: ECOA/FCRA adherence (violations per 100 decisions)
- User Satisfaction: Explicit feedback and implicit signals
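The Semantic Similarity metric above is typically computed as cosine similarity between embedding vectors. A minimal sketch; the vectors would come from an embedding model, and the 0.85 threshold is an illustrative assumption to be tuned against labeled pairs:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def semantically_similar(a: list[float], b: list[float],
                         threshold: float = 0.85) -> bool:
    # Threshold is an assumption; calibrate on expert-labeled output pairs
    return cosine_similarity(a, b) >= threshold
```

This lets a test accept any phrasing of "approved due to strong credit history" instead of demanding an exact string match.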
Example Test:
```python
# Test decision quality against expert-labeled dataset
def test_decision_quality():
    expert_dataset = load_expert_labeled_loans()  # 500 cases
    true_positives = 0
    false_positives = 0
    true_negatives = 0
    false_negatives = 0
    for case in expert_dataset:
        agent_decision = orchestrator.process(case["application"])
        expert_decision = case["expert_judgment"]
        if agent_decision.approved and expert_decision == "APPROVE":
            true_positives += 1
        elif agent_decision.approved and expert_decision == "DENY":
            false_positives += 1  # Risk: bad loan approved
        elif not agent_decision.approved and expert_decision == "DENY":
            true_negatives += 1
        else:
            false_negatives += 1  # Lost business
    accuracy = (true_positives + true_negatives) / len(expert_dataset)
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    # Business-critical thresholds
    assert accuracy >= 0.90       # 90% agreement with experts
    assert false_positives <= 25  # ≤5% bad loan approvals (of 500 cases)
    assert false_negatives <= 75  # ≤15% lost business (of 500 cases)
```
Testing Strategies
1. Automated Evaluation (Primary Strategy)
Statistical Evaluators:
- BLEU, ROUGE for text similarity
- Cosine similarity for semantic comparison
- Exact match for structured outputs
Programmatic Evaluators:
- Rule-based checks (e.g., DTI ratio ≤ 43%)
- Format validation (Pydantic models)
- Constraint satisfaction (loan amount ranges)
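As a concrete example of a programmatic evaluator, the DTI rule above reduces to a few lines of plain Python (the helper name is illustrative):

```python
def check_dti_constraint(monthly_debt: float, monthly_income: float,
                         max_dti: float = 0.43) -> bool:
    """Rule-based check: debt-to-income ratio must not exceed 43%."""
    if monthly_income <= 0:
        # No verifiable income: fail the constraint rather than divide by zero
        return False
    return monthly_debt / monthly_income <= max_dti
```

Checks like this are deterministic and essentially free, which is why they run on every decision rather than on a sample.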
CI/CD Integration:
```yaml
# .github/workflows/agent-evaluation.yml
name: Agent Evaluation
on:
  pull_request:
    branches: [main]
  push:
    branches: [main]
jobs:
  evaluate-agents:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run Agent Evaluation Suite
        run: |
          uv run pytest tests/evals/ -v --cov=loan_defenders
      - name: Generate Evaluation Report
        run: |
          uv run python scripts/generate_eval_report.py
      - name: Check Quality Gates
        run: |
          # Fail build if metrics below thresholds
          uv run python scripts/check_quality_gates.py
```
Benefits: Fast, scalable, consistent, cost-effective.
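The quality-gate script referenced in the workflow could look roughly like this; the metric names and thresholds are illustrative assumptions, and in CI the metrics dict would be loaded from the report produced by generate_eval_report.py:

```python
# Sketch of scripts/check_quality_gates.py (names/thresholds are assumptions)
GATES = {
    "accuracy": 0.90,                  # agreement with expert labels
    "explanation_quality_rate": 0.85,  # share of explanations scoring >= 3/5
}

def check_gates(metrics: dict[str, float]) -> list[str]:
    """Return human-readable gate failures; an empty list means all gates pass."""
    return [
        f"{name}: {metrics.get(name, 0.0):.3f} < {threshold:.3f}"
        for name, threshold in GATES.items()
        if metrics.get(name, 0.0) < threshold
    ]

def main(metrics: dict[str, float]) -> int:
    failures = check_gates(metrics)
    for failure in failures:
        print(f"GATE FAILED - {failure}")
    return 1 if failures else 0  # non-zero exit code fails the CI job
```

A missing metric counts as 0.0 and therefore fails its gate, so a broken report cannot silently pass the build.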
2. LLM-as-Judge Evaluation
Use Cases:
- Subjective qualities (helpfulness, tone, clarity)
- Semantic correctness when exact match is too strict
- Explanation quality assessment
- Fair lending compliance checks
Implementation:
```python
# tests/evals/judges/explanation_judge.py
import json

from azure.identity import DefaultAzureCredential
from agent_framework import ChatAgent
from agent_framework.azure import AzureOpenAIChatClient


class ExplanationQualityJudge:
    """LLM-based judge for decision explanation quality."""

    def __init__(self):
        self.judge = ChatAgent(
            chat_client=AzureOpenAIChatClient(
                credential=DefaultAzureCredential(),
                model="gpt-4o-mini",  # Cheaper model for judging
            ),
            name="Explanation Judge",
            instructions="""
            You are an expert evaluator of loan decision explanations.
            Evaluate explanations on a 1-5 scale for:
            1. Clarity: Is the explanation easy to understand?
            2. Completeness: Does it address all factors?
            3. Compliance: Does it meet FCRA/ECOA requirements?
            4. Actionability: Does it provide clear next steps?
            Respond with JSON:
            {
                "clarity_score": 1-5,
                "completeness_score": 1-5,
                "compliance_score": 1-5,
                "actionability_score": 1-5,
                "overall_score": 1-5,
                "reasoning": "brief explanation"
            }
            """,
        )

    async def evaluate(self, decision: LoanDecision) -> dict:
        """Evaluate decision explanation quality."""
        prompt = f"""
        Evaluate this loan decision explanation:
        Decision: {"Approved" if decision.approved else "Denied"}
        Explanation: {decision.reasoning}
        Denial Reasons: {decision.denial_reasons if not decision.approved else "N/A"}
        Provide scores and reasoning.
        """
        result = await self.judge.run(prompt)
        return json.loads(result.text)


# Usage in tests
async def test_explanation_quality():
    judge = ExplanationQualityJudge()
    test_decisions = load_test_decisions()
    low_quality_count = 0
    for decision in test_decisions:
        scores = await judge.evaluate(decision)
        if scores["overall_score"] < 3:
            low_quality_count += 1
    quality_rate = 1 - (low_quality_count / len(test_decisions))
    assert quality_rate >= 0.85  # 85% of explanations should score ≥3/5
```
Cost Consideration: ~$0.50-$1.00 per 1,000 evaluations (using GPT-4o-mini).
3. Human-in-the-Loop Evaluation
Use Cases:
- Domain-specific correctness validation
- Safety and appropriateness judgments
- Nuanced quality assessment
- Training data collection for automated evaluators
Implementation:
```python
# tests/evals/human/review_queue.py
import random
from datetime import datetime


class HumanReviewQueue:
    """Queue random decisions for human expert review."""

    def __init__(self, sampling_rate: float = 0.01):
        self.sampling_rate = sampling_rate  # 1% of decisions
        self.queue = []

    def maybe_enqueue(self, application: LoanApplication, decision: LoanDecision):
        """Randomly sample decisions for human review."""
        if random.random() < self.sampling_rate:
            self.queue.append({
                "application_id": application.applicant_id,
                "decision": decision,
                "timestamp": datetime.utcnow(),
                "status": "PENDING_REVIEW",
            })

    def get_pending_reviews(self) -> list[dict]:
        """Get decisions waiting for human review."""
        return [r for r in self.queue if r["status"] == "PENDING_REVIEW"]

    def record_human_feedback(
        self,
        application_id: str,
        human_decision: bool,
        human_reasoning: str,
        reviewer_id: str,
    ):
        """Record expert reviewer's assessment."""
        for review in self.queue:
            if review["application_id"] == application_id:
                review["human_decision"] = human_decision
                review["human_reasoning"] = human_reasoning
                review["reviewer_id"] = reviewer_id
                review["status"] = "REVIEWED"
                # Calculate agreement
                agent_decision = review["decision"].approved
                review["agreement"] = (agent_decision == human_decision)
                break


# Usage
review_queue = HumanReviewQueue(sampling_rate=0.01)

# During processing
decision = orchestrator.process(application)
review_queue.maybe_enqueue(application, decision)

# Weekly review session
pending = review_queue.get_pending_reviews()
# Present to human reviewers via UI
```
Frequency: Weekly review sessions, 1% sampling rate (manageable volume).
4. Simulation-Based Testing
Use Cases:
- Test edge cases systematically
- Generate diverse test cases covering user personas
- Consistency testing across multiple runs
- Pre-production validation
Implementation:
```python
# tests/evals/simulation/loan_application_generator.py
import random
import uuid


class LoanApplicationGenerator:
    """Generate synthetic loan applications for testing."""

    PERSONAS = {
        "perfect_borrower": {
            "credit_score": (750, 850),
            "monthly_income": (8000, 15000),
            "monthly_debt": (500, 2000),
            "employment_duration": (24, 120),  # months
        },
        "risky_borrower": {
            "credit_score": (580, 640),
            "monthly_income": (2000, 3500),
            "monthly_debt": (1500, 2500),
            "employment_duration": (3, 12),
        },
        "borderline_borrower": {
            "credit_score": (620, 680),
            "monthly_income": (3500, 5500),
            "monthly_debt": (1200, 2000),
            "employment_duration": (12, 36),
        },
    }

    def generate_application(
        self,
        persona: str,
        loan_amount: float,
        seed: int | None = None,
    ) -> LoanApplication:
        """Generate synthetic application matching persona."""
        if seed is not None:  # "if seed:" would silently skip seeding when seed == 0
            random.seed(seed)
        profile = self.PERSONAS[persona]
        return LoanApplication(
            applicant_id=str(uuid.uuid4()),
            credit_score=random.randint(*profile["credit_score"]),
            monthly_income=random.uniform(*profile["monthly_income"]),
            monthly_debt=random.uniform(*profile["monthly_debt"]),
            employment_duration_months=random.randint(*profile["employment_duration"]),
            requested_loan_amount=loan_amount,
        )

    def generate_test_suite(self, cases_per_persona: int = 100) -> list[LoanApplication]:
        """Generate comprehensive test suite."""
        test_suite = []
        for persona in self.PERSONAS:
            for i in range(cases_per_persona):
                # Vary loan amounts
                loan_amount = random.choice([10000, 25000, 50000, 75000, 100000])
                app = self.generate_application(
                    persona=persona,
                    loan_amount=loan_amount,
                    seed=i,  # Reproducible
                )
                test_suite.append(app)
        return test_suite


# Usage
def test_agent_consistency():
    """Test whether the agent produces consistent decisions for the same input."""
    generator = LoanApplicationGenerator()
    app = generator.generate_application("borderline_borrower", 30000, seed=42)
    # Run 10 times
    decisions = []
    for _ in range(10):
        decision = orchestrator.process(app)
        decisions.append(decision.approved)
    # Check consistency
    unique_decisions = set(decisions)
    assert len(unique_decisions) == 1, "Decision should be consistent for the same input"
```
Benefits: Systematic coverage, reproducibility, scalability.
Offline vs. Online Testing
Offline Testing (Development & Staging)
Definition: Evaluation in controlled environment using curated test datasets before deployment.
When: During development, before releases, in CI/CD pipelines.
Test Types:
1. Unit Tests: Individual agent behavior
2. Integration Tests: Multi-agent workflow
3. Regression Tests: Prevent breaking changes
4. Simulation Tests: Edge cases and adversarial inputs
5. Benchmark Tests: GAIA, TAU2 (if applicable)
Example Workflow:
```python
# tests/evals/offline/regression_suite.py
class RegressionTestSuite:
    """Prevent regressions in agent behavior."""

    def __init__(self):
        self.baseline_metrics = self._load_baseline()
        self.test_dataset = self._load_test_dataset()

    async def run_regression_tests(self) -> RegressionReport:
        """Compare current version to baseline."""
        current_metrics = {}
        # Run all test cases
        for test_case in self.test_dataset:
            decision = await orchestrator.process(test_case["application"])
            # Record metrics
            current_metrics[test_case["id"]] = {
                "decision": decision.approved,
                "confidence": decision.confidence_score,
                "reasoning": decision.reasoning,
            }
        # Calculate deltas
        regressions = []
        improvements = []
        for test_id, current in current_metrics.items():
            baseline = self.baseline_metrics[test_id]
            # Check for decision flips
            if current["decision"] != baseline["decision"]:
                regressions.append({
                    "test_id": test_id,
                    "issue": "DECISION_FLIP",
                    "baseline": baseline["decision"],
                    "current": current["decision"],
                })
            # Check for confidence drops
            confidence_delta = current["confidence"] - baseline["confidence"]
            if confidence_delta < -0.10:  # 10% drop
                regressions.append({
                    "test_id": test_id,
                    "issue": "CONFIDENCE_DROP",
                    "delta": confidence_delta,
                })
        return RegressionReport(
            total_tests=len(self.test_dataset),
            regressions=regressions,
            improvements=improvements,
            passed=len(regressions) == 0,
        )


# CI/CD integration
async def test_no_regressions():
    suite = RegressionTestSuite()
    report = await suite.run_regression_tests()
    if not report.passed:
        print(f"❌ {len(report.regressions)} regressions detected:")
        for reg in report.regressions:
            print(f"  - Test {reg['test_id']}: {reg['issue']}")
        assert False, "Regressions detected - do not merge"
```
Success Criteria:
- All unit tests pass (100%)
- Integration tests pass (≥95%)
- No regressions vs. previous version
- Metrics meet quality gates
Online Testing (Production)
Definition: Evaluation in live, real-world environment during actual usage.
When: Continuously in production.
Monitoring Types:
1. Real-Time Metrics: Latency, error rates, throughput
2. Auto-Evaluation: LLM-judge on production logs (sampled)
3. User Feedback: Explicit ratings, implicit signals
4. Anomaly Detection: Sudden metric changes
5. A/B Testing: Compare agent versions in production
Example Implementation:
```python
# apps/api/monitoring/online_evaluation.py
import asyncio
import random
import time
from datetime import datetime

from agent_framework.observability import get_tracer, get_meter


class OnlineEvaluationService:
    """Continuous evaluation in production."""

    def __init__(self, sampling_rate: float = 0.05):
        self.sampling_rate = sampling_rate  # Evaluate 5% of decisions
        self.tracer = get_tracer()
        self.meter = get_meter()
        # Metrics
        self.decision_counter = self.meter.create_counter("decisions_total")
        self.approval_rate_gauge = self.meter.create_gauge("approval_rate")
        self.latency_histogram = self.meter.create_histogram("decision_latency_seconds")
        self.quality_scores = self.meter.create_histogram("decision_quality_score")

    async def evaluate_decision(
        self,
        application: LoanApplication,
        decision: LoanDecision,
        latency: float,
    ):
        """Evaluate production decision."""
        with self.tracer.start_as_current_span("online_evaluation") as span:
            # Record basic metrics
            self.decision_counter.add(1, {
                "approved": str(decision.approved),
                "loan_amount_bucket": self._bucket_loan_amount(decision.loan_amount),
            })
            self.latency_histogram.record(latency)
            # Sample for detailed evaluation
            if random.random() < self.sampling_rate:
                quality_score = await self._evaluate_quality(decision)
                self.quality_scores.record(quality_score)
                span.set_attribute("quality_score", quality_score)
                # Alert if low quality
                if quality_score < 3.0:
                    await self._alert_low_quality(application, decision, quality_score)

    async def _evaluate_quality(self, decision: LoanDecision) -> float:
        """LLM-based quality evaluation."""
        judge = ExplanationQualityJudge()
        scores = await judge.evaluate(decision)
        return scores["overall_score"]

    async def _alert_low_quality(
        self,
        application: LoanApplication,
        decision: LoanDecision,
        score: float,
    ):
        """Alert on low-quality decisions."""
        alert = {
            "severity": "WARNING",
            "type": "LOW_QUALITY_DECISION",
            "application_id": application.applicant_id,
            "quality_score": score,
            "decision": decision.approved,
            "timestamp": datetime.utcnow(),
        }
        # Send to monitoring system (Azure Monitor, Slack, etc.)
        await send_alert(alert)


# Integration in API
from apps.api.monitoring.online_evaluation import OnlineEvaluationService

evaluator = OnlineEvaluationService(sampling_rate=0.05)


@app.post("/api/loan-application/submit")
async def submit_loan_application(application: LoanApplication):
    start_time = time.time()
    # Process application
    decision = await orchestrator.process(application)
    latency = time.time() - start_time
    # Online evaluation (async, doesn't block response)
    asyncio.create_task(
        evaluator.evaluate_decision(application, decision, latency)
    )
    return decision
```
Success Criteria:
- Latency P95 < 10 seconds
- Error rate < 1%
- Quality score ≥ 3.5/5
- No anomalies detected
Continuous Evaluation Loop
Best Practice: Combine offline and online testing in continuous loop.
```
┌─────────────────────────────────────────────────────────────────┐
│                   Continuous Evaluation Loop                    │
└─────────────────────────────────────────────────────────────────┘
1. Offline Evaluation (Pre-Deployment)
   ├─ Run regression tests
   ├─ Simulate edge cases
   ├─ Check quality gates
   └─ ✅ Pass → Deploy new version
2. Deploy to Production
   ├─ Gradual rollout (canary deployment)
   ├─ Monitor real-time metrics
   └─ Auto-evaluate sampled decisions
3. Online Monitoring (Production)
   ├─ Collect user feedback
   ├─ Detect anomalies
   ├─ Identify failure patterns
   └─ Flag low-quality decisions
4. Collect Failure Examples
   ├─ Production incidents
   ├─ Low-quality decisions
   ├─ User complaints
   └─ Edge cases discovered
5. Add to Offline Test Set
   ├─ Create regression tests for failures
   ├─ Expand edge case coverage
   ├─ Update quality benchmarks
   └─ Loop back to step 1
```
Implementation:
```python
# scripts/sync_prod_failures_to_tests.py
from datetime import datetime


class FailureTestSyncService:
    """Sync production failures to offline test suite."""

    async def sync_failures(self):
        """Weekly job to update test suite with production failures."""
        # Get production failures from last week
        failures = await self.get_production_failures(days=7)
        # Convert to test cases
        new_tests = []
        for failure in failures:
            test_case = {
                "id": f"prod_failure_{failure['id']}",
                "application": failure["application"],
                "expected_behavior": "Should not produce low-quality decision",
                "source": "production_failure",
                "date_added": datetime.utcnow(),
            }
            new_tests.append(test_case)
        # Add to regression suite
        await self.add_to_regression_suite(new_tests)
        print(f"✅ Added {len(new_tests)} production failures to test suite")
```
Error Analysis & Failure Taxonomy
Agent Error Taxonomy (AgentErrorTaxonomy)
Based on research paper "Where LLM Agents Fail and How They can Learn From Failures" (2024).
5 Major Categories:
1. Memory Errors
- Symptom: Agent forgets context, repeats actions, loses track of state
- Example: Re-verifying income after already verifying
- Detection: Check for duplicate tool calls in trajectory
- Mitigation: Improve context management, use state tracking
2. Reflection Errors
- Symptom: Agent doesn't learn from mistakes, repeats failed approaches
- Example: Repeatedly calling same MCP server with same params after error
- Detection: Analyze retry patterns in failed trajectories
- Mitigation: Add reflection prompts, implement backtracking
3. Planning Errors
- Symptom: Illogical action sequence, skips required steps
- Example: Making final decision before consulting all required agents
- Detection: Verify workflow completeness
- Mitigation: Enforce workflow constraints, add planning validation
4. Action Errors
- Symptom: Wrong tool selection, incorrect parameters
- Example: Calling `verify_employment` with SSN instead of applicant_id
- Detection: Tool call validation middleware (see AI Security doc)
- Mitigation: Schema enforcement, parameter validation
5. System-Level Errors
- Symptom: Timeouts, API failures, rate limits
- Example: MCP server unavailable, Azure OpenAI quota exceeded
- Detection: Exception monitoring, retry tracking
- Mitigation: Graceful degradation, retry logic, circuit breakers
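The retry mitigation above can be sketched as a small backoff helper; the parameters are illustrative, and a production version would catch only transient error types and pair this with a circuit breaker:

```python
import random
import time

def call_with_retries(fn, max_attempts: int = 3, base_delay: float = 0.5):
    """Retry a flaky call (e.g. an MCP server request) with exponential
    backoff plus jitter; re-raise once attempts are exhausted."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # exhausted: surface the error to the caller
            # Backoff doubles each attempt; jitter avoids synchronized retries
            time.sleep(base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.05))
```

Wrapping each external call this way keeps a single transient timeout (a system-level error) from becoming the root cause of a cascading failure downstream.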
Cascading Failure Detection
Key Research Finding: Single root-cause error propagates through subsequent decisions, leading to task failure.
Example Cascade:
1. Memory Error: Agent forgets credit score was already retrieved
↓
2. Action Error: Calls credit verification tool again with wrong params
↓
3. Planning Error: Skips income verification due to time spent on duplicate work
↓
4. Final Outcome: Incomplete assessment → incorrect loan decision
Detection Strategy:
```python
# tests/evals/error_analysis/cascade_detector.py
class CascadeFailureDetector:
    """Detect and classify cascading failures."""

    def analyze_trajectory(self, trace: AgentTrace) -> CascadeAnalysis:
        """Identify root cause and cascade pattern."""
        errors = []
        # Step 1: Identify all errors in trajectory
        for step in trace.steps:
            if step.status == "ERROR":
                errors.append({
                    "step_index": step.index,
                    "error_type": self._classify_error(step.error),
                    "agent": step.agent_name,
                    "tool": step.tool_name,
                })
        if len(errors) == 0:
            return CascadeAnalysis(has_cascade=False)
        # Step 2: Identify root cause (first error)
        root_cause = errors[0]
        # Step 3: Analyze propagation
        cascade_path = self._trace_cascade(errors, trace)
        return CascadeAnalysis(
            has_cascade=len(cascade_path) > 1,
            root_cause=root_cause,
            cascade_path=cascade_path,
            severity="CRITICAL" if len(cascade_path) >= 3 else "MODERATE",
        )

    def _classify_error(self, error: Exception) -> str:
        """Classify error into taxonomy category."""
        error_msg = str(error).lower()
        if "duplicate" in error_msg or "already" in error_msg:
            return "MEMORY_ERROR"
        elif "timeout" in error_msg or "rate limit" in error_msg:
            return "SYSTEM_ERROR"
        elif "invalid parameter" in error_msg:
            return "ACTION_ERROR"
        elif "missing step" in error_msg:
            return "PLANNING_ERROR"
        else:
            return "UNKNOWN_ERROR"


# Usage in tests
def test_no_cascading_failures():
    """Ensure failures don't cascade."""
    detector = CascadeFailureDetector()
    # Run all test cases
    failures = []
    for test_case in load_test_suite():
        try:
            with capture_agent_trace() as trace:
                decision = orchestrator.process(test_case)
        except Exception:
            analysis = detector.analyze_trajectory(trace)
            if analysis.has_cascade:
                failures.append(analysis)
    # Report cascading failures
    if failures:
        print(f"❌ {len(failures)} cascading failures detected:")
        for failure in failures:
            print(f"  Root cause: {failure.root_cause['error_type']}")
            print(f"  Cascade length: {len(failure.cascade_path)}")
    assert len(failures) == 0, "No cascading failures allowed"
```
Multi-Agent Failure Taxonomy (MAST)
Based on research "Why Do Multi-Agent LLM Systems Fail?" (2024).
3 Categories, 14 Failure Modes:
Category 1: Specification Issues
- Incomplete Instructions: Agent lacks necessary context
- Ambiguous Goals: Unclear success criteria
- Conflicting Constraints: Impossible requirements
Example: Orchestrator told to "approve quickly" but also "thoroughly assess all risks" (conflicting).
Category 2: Inter-Agent Misalignment
- Ignoring Peer Input: Agent disregards other agents' assessments
- Incorrect Assumptions: Agent operates on wrong premises
- Communication Breakdown: Agents misinterpret each other
Example: Risk agent ignores credit agent's "high risk" assessment and still recommends approval.
Category 3: Task Verification
- Missing Verification Steps: No validation of outputs
- Incorrect Success Criteria: Wrong definition of task completion
- Premature Termination: Agent stops before task is done
Example: Orchestrator returns decision without verifying all agents completed assessments.
Key Insight: Many failures stem from poor system design, not model performance.
Mitigation: Better orchestration strategies, clear role definitions, explicit verification steps.
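An explicit verification step can be as simple as a completeness check the orchestrator runs before emitting a decision. The agent names follow the workflow example earlier in this document; the helper itself is an illustrative sketch:

```python
# Required assessments before any final decision (names from the workflow above)
REQUIRED_AGENTS = ["intake", "credit", "income", "risk"]

def verify_workflow_complete(completed_steps: list[str]) -> list[str]:
    """Explicit verification step: return the agents that have not yet
    reported, preserving the required order. Empty list = safe to decide."""
    return [agent for agent in REQUIRED_AGENTS if agent not in completed_steps]

# The orchestrator should refuse to emit a decision (premature termination)
# while this list is non-empty.
```

A check like this directly targets the "Missing Verification Steps" and "Premature Termination" failure modes without touching the model at all.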
Root Cause Analysis Framework
```python
# tests/evals/error_analysis/root_cause_analyzer.py
class RootCauseAnalyzer:
    """Systematic root cause analysis for failures."""

    # root_causes and fixes are parallel lists: fixes[i] addresses root_causes[i]
    ERROR_TAXONOMY = {
        "MEMORY_ERROR": {
            "indicators": ["duplicate action", "context loss", "forgotten state"],
            "root_causes": ["insufficient context window", "poor state tracking"],
            "fixes": ["increase context", "add explicit state management"],
        },
        "PLANNING_ERROR": {
            "indicators": ["skipped step", "wrong order", "incomplete workflow"],
            "root_causes": ["unclear instructions", "missing constraints"],
            "fixes": ["clarify persona instructions", "add workflow validation"],
        },
        "ACTION_ERROR": {
            "indicators": ["wrong tool", "invalid parameters", "malformed request"],
            "root_causes": ["poor tool descriptions", "weak validation"],
            "fixes": ["improve tool docs", "add parameter validation"],
        },
    }

    def analyze_failure(self, trace: AgentTrace, error: Exception) -> RootCauseReport:
        """Identify root cause and recommend fixes."""
        # Step 1: Classify error
        error_type = self._classify_error(error)
        # Step 2: Extract indicators from trace
        indicators = self._extract_indicators(trace, error_type)
        # Step 3: Match to known root causes
        taxonomy_entry = self.ERROR_TAXONOMY[error_type]
        matched_causes = []
        for indicator in indicators:
            for root_cause in taxonomy_entry["root_causes"]:
                if self._indicator_matches_cause(indicator, root_cause):
                    matched_causes.append(root_cause)
        # Step 4: Recommend the fix paired with each matched root cause
        recommended_fixes = []
        for cause in set(matched_causes):
            idx = taxonomy_entry["root_causes"].index(cause)
            recommended_fixes.append(taxonomy_entry["fixes"][idx])
        return RootCauseReport(
            error_type=error_type,
            indicators=indicators,
            root_causes=matched_causes,
            recommended_fixes=recommended_fixes,
            trace_id=trace.trace_id,
        )


# Usage
analyzer = RootCauseAnalyzer()
for failure in production_failures:
    report = analyzer.analyze_failure(failure.trace, failure.error)
    print(f"Failure: {failure.id}")
    print(f"  Error Type: {report.error_type}")
    print(f"  Root Causes: {report.root_causes}")
    print(f"  Recommended Fixes: {report.recommended_fixes}")
```
Microsoft Agent Framework Lab Integration
GAIA Benchmark
What It Tests: General AI Assistant capabilities requiring reasoning, multi-modality, web browsing, tool-use proficiency.
Dataset: 466 human-annotated tasks with unambiguous answers.
Performance: Humans 92% success, GPT-4 15% success (stark gap).
When to Use: Validate general agent capabilities, benchmark against industry standards.
Integration (Experimental - Lab Module):
```python
# tests/benchmarks/test_gaia.py
from agent_framework.lab.gaia import GAIABenchmark


class GAIAEvaluation:
    """Benchmark agents against GAIA."""

    def __init__(self):
        self.benchmark = GAIABenchmark()
        self.benchmark.load_dataset()  # 466 tasks

    async def run_benchmark(self) -> GAIAResults:
        """Run full GAIA benchmark."""
        results = []
        for task in self.benchmark.tasks:
            try:
                # Adapt task to loan domain (if applicable)
                if self._is_relevant_to_loans(task):
                    result = await self._run_task(task)
                    results.append(result)
            except Exception as e:
                results.append({
                    "task_id": task.id,
                    "success": False,
                    "error": str(e),
                })
        success_rate = sum(1 for r in results if r["success"]) / len(results)
        return GAIAResults(
            total_tasks=len(results),
            success_rate=success_rate,
            results=results,
        )


# Note: GAIA may not be directly applicable to loan processing.
# Consider adapting relevant tasks or using it as a general capability check.
```
Applicability to Loan Defenders: Limited (GAIA is general-purpose, not loan-specific). Use for general capability validation, not primary evaluation.
TAU2 Benchmark (τ²-bench)
What It Tests: Conversational agents in dual-control environments (both agent and user use tools).
Domains: Retail, airline, telecom customer support.
Performance: GPT-4.1 34%, o4-mini 42%, Claude-3.7 49% pass@1 (very challenging).
When to Use: Test conversational flow, tool use in dynamic environments.
Applicability to Loan Defenders: Moderate relevance - loan processing involves conversational elements and dynamic decision-making.
Integration (Experimental - Lab Module):
```python
# tests/benchmarks/test_tau2.py
from agent_framework.lab.tau2 import TAU2Benchmark

# Note: TAU2 is customer-support focused, not loan-specific.
# Consider creating a custom loan-processing benchmark instead.
```
Recommendation: TAU2 is valuable for conversational agent evaluation but not directly applicable. Consider creating custom loan-processing benchmark inspired by TAU2 methodology.
Custom Loan Processing Benchmark
Proposal: Create loan-specific benchmark dataset.
```python
# tests/benchmarks/loan_defenders_benchmark.py
import random


class LoanDefendersBenchmark:
    """Custom benchmark for loan processing agents."""

    BENCHMARK_DATASET = {
        "simple_cases": 100,  # Clear approve/deny
        "edge_cases": 50,     # Borderline decisions
        "adversarial": 30,    # Attack attempts
        "regulatory": 20,     # Compliance scenarios
    }

    def __init__(self):
        self.dataset = self._load_benchmark_dataset()
        self.expert_labels = self._load_expert_judgments()

    def _load_benchmark_dataset(self) -> list[LoanApplication]:
        """Load curated benchmark applications."""
        # Generated using LoanApplicationGenerator + expert curation
        dataset = []
        generator = LoanApplicationGenerator()
        # Simple cases: Clear outcomes
        for i in range(100):
            app = generator.generate_application(
                persona=random.choice(["perfect_borrower", "risky_borrower"]),
                loan_amount=random.choice([10000, 30000, 50000]),
                seed=i,
            )
            dataset.append(app)
        # Edge cases: Borderline scenarios
        for i in range(50):
            app = generator.generate_application(
                persona="borderline_borrower",
                loan_amount=random.choice([25000, 50000, 75000]),
                seed=100 + i,
            )
            dataset.append(app)
        # Adversarial: Injection attempts, malformed data
        dataset.extend(load_adversarial_test_cases())
        # Regulatory: ECOA/FCRA compliance scenarios
        dataset.extend(load_regulatory_test_cases())
        return dataset

    async def run_benchmark(self) -> BenchmarkResults:
        """Run full benchmark suite."""
        results = {
            "accuracy": 0.0,
            "precision": 0.0,
            "recall": 0.0,
            "regulatory_compliance": 0.0,
            "adversarial_resistance": 0.0,
        }
        # Evaluate each category
        for category, count in self.BENCHMARK_DATASET.items():
            category_results = await self._evaluate_category(category)
            results[f"{category}_performance"] = category_results
        # Calculate aggregate metrics
        results["overall_score"] = self._calculate_overall_score(results)
        return BenchmarkResults(**results)


# Usage
benchmark = LoanDefendersBenchmark()
results = await benchmark.run_benchmark()
print("Benchmark Results:")
print(f"  Overall Score: {results.overall_score:.1%}")
print(f"  Accuracy: {results.accuracy:.1%}")
print(f"  Regulatory Compliance: {results.regulatory_compliance:.1%}")
```
Benefit: Loan-specific, actionable, tracks progress over time.
Progressive Implementation Plan
Phase 1: Foundation (Weeks 1-3)
Goal: Establish basic offline testing infrastructure.
Week 1: Test Infrastructure Setup
- [ ] Create tests/evals/ directory structure
- [ ] Set up test dataset generation (LoanApplicationGenerator)
- [ ] Implement basic accuracy tests (Layer 1 metrics)
- [ ] Add to CI/CD pipeline (GitHub Actions)
Week 2: Offline Evaluation Suite
- [ ] Implement regression test suite
- [ ] Create expert-labeled dataset (50-100 cases)
- [ ] Build decision quality tests (accuracy, precision, recall)
- [ ] Add workflow completeness tests (orchestration)
Week 3: Automation & Reporting
- [ ] Automate test execution in CI/CD
- [ ] Generate evaluation reports (HTML/Markdown)
- [ ] Define quality gates (fail build if thresholds not met)
- [ ] Document evaluation process
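As a sketch of the Week 3 quality gates, the CI job can compare evaluation metrics against thresholds and fail the build on any breach. The metric names and threshold values below are illustrative assumptions, not the project's final gates:

```python
from typing import Dict, List

# Assumed gate names/thresholds; tune these to the project's own targets.
QUALITY_GATES = {
    "accuracy": 0.90,              # minimum decision accuracy
    "regression_pass_rate": 1.00,  # zero regression failures allowed
    "benchmark_score": 0.85,       # custom benchmark threshold
}

def check_quality_gates(metrics: Dict[str, float],
                        gates: Dict[str, float] = QUALITY_GATES) -> List[str]:
    """Return names of failed gates; an empty list means the build passes."""
    return [name for name, threshold in gates.items()
            if metrics.get(name, 0.0) < threshold]

failures = check_quality_gates({"accuracy": 0.92,
                                "regression_pass_rate": 1.0,
                                "benchmark_score": 0.81})
# benchmark_score falls below 0.85, so CI should fail the build
```

In CI, a non-empty result would translate to a non-zero exit code so the pipeline stops before deployment.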
Deliverables:
- ✅ 300+ automated test cases
- ✅ CI/CD integration with quality gates
- ✅ Evaluation dashboard
Effort: ~3 weeks, 1-2 engineers
Phase 2: Advanced Evaluation (Weeks 4-7)
Goal: Add LLM-as-judge, simulation, and error analysis.
Week 4: LLM-as-Judge Evaluation
- [ ] Implement ExplanationQualityJudge
- [ ] Build semantic similarity evaluators
- [ ] Add tone/compliance judges
- [ ] Optimize judging costs (use GPT-4o-mini)
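The grading side of LLM-as-judge stays deterministic and testable if prompt construction is separated from reply parsing. A minimal sketch: the rubric wording, JSON reply format, and helper names are assumptions, and the model call itself is omitted:

```python
import json

# Assumed rubric; the real ExplanationQualityJudge would define its own.
JUDGE_RUBRIC = ('Rate the loan decision explanation 1-5 on clarity, '
                'completeness, and regulatory tone. '
                'Reply as JSON: {"score": <int>, "reason": "..."}')

def build_judge_prompt(explanation: str) -> str:
    """Compose the rubric plus the explanation under evaluation."""
    return f"{JUDGE_RUBRIC}\n\nExplanation to grade:\n{explanation}"

def parse_judge_reply(reply: str, default: int = 1) -> int:
    """Extract and clamp the judge's score; fall back to the lowest
    score on malformed output rather than crashing the eval run."""
    try:
        score = int(json.loads(reply)["score"])
    except (ValueError, KeyError, TypeError):
        return default
    return max(1, min(5, score))
```

Clamping and the defensive fallback matter in practice: judge models occasionally return out-of-range scores or non-JSON text, and a sampled evaluation run should degrade gracefully rather than abort.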
Week 5: Simulation & Edge Cases
- [ ] Expand simulation to cover all personas
- [ ] Generate adversarial test cases
- [ ] Test regulatory compliance scenarios
- [ ] Build consistency test suite
Week 6: Error Analysis Framework
- [ ] Implement AgentErrorTaxonomy
- [ ] Build CascadeFailureDetector
- [ ] Create RootCauseAnalyzer
- [ ] Document common failure patterns
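A first-pass failure classifier can route traces into taxonomy buckets by keyword before manual root-cause analysis. The categories and keyword rules below are illustrative stand-ins, loosely inspired by the AgentErrorTaxonomy paper rather than reproducing it:

```python
from enum import Enum

class ErrorCategory(Enum):
    """Illustrative failure categories; the real taxonomy is richer."""
    TOOL_FAILURE = "tool_failure"
    PLANNING_ERROR = "planning_error"
    HALLUCINATION = "hallucination"
    UNKNOWN = "unknown"

# First-pass keyword rules; ambiguous traces fall through to UNKNOWN
# and get routed to manual root-cause analysis instead.
_RULES = [
    (("timeout", "tool returned", "api error"), ErrorCategory.TOOL_FAILURE),
    (("skipped step", "wrong order", "loop detected"), ErrorCategory.PLANNING_ERROR),
    (("fabricated", "not in context", "unsupported claim"), ErrorCategory.HALLUCINATION),
]

def classify_failure(trace: str) -> ErrorCategory:
    """Assign the first matching category, scanning rules in order."""
    text = trace.lower()
    for keywords, category in _RULES:
        if any(k in text for k in keywords):
            return category
    return ErrorCategory.UNKNOWN
```

Keyword rules are crude, but they give the error-distribution dashboard something to plot on day one; an LLM-based classifier can replace them later without changing the category schema.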
Week 7: Benchmarking
- [ ] Create LoanDefendersBenchmark
- [ ] Curate 200-case benchmark dataset
- [ ] Get expert labels for benchmark
- [ ] Baseline current performance
Deliverables:
- ✅ LLM-based quality evaluation
- ✅ Comprehensive simulation coverage
- ✅ Error taxonomy and analysis tools
- ✅ Custom loan benchmark
Effort: ~4 weeks, 2 engineers
Phase 3: Online Monitoring (Weeks 8-11)
Goal: Production monitoring and continuous evaluation.
Week 8: Observability Integration
- [ ] Integrate Agent Framework observability (replace custom)
- [ ] Configure Application Insights
- [ ] Set up real-time metrics dashboard
- [ ] Define alerting thresholds
Week 9: Online Evaluation Service
- [ ] Build OnlineEvaluationService
- [ ] Implement sampling (5% of decisions)
- [ ] Add LLM-judge for production decisions
- [ ] Create low-quality alert system
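Hash-based sampling keeps the 5% selection deterministic, so a retried or replayed decision is either always or never evaluated, and the rate can be tuned without coordination. A minimal sketch (the `should_sample` helper is hypothetical):

```python
import hashlib

def should_sample(decision_id: str, rate: float = 0.05) -> bool:
    """Deterministically select ~`rate` of decisions by hashing the ID,
    so the same decision always gets the same answer."""
    digest = hashlib.sha256(decision_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate

# Across many IDs the selection converges on the configured rate
sampled = sum(should_sample(f"decision-{i}") for i in range(10_000))
```

Deterministic sampling also makes online and offline evaluation comparable: the same 5% slice can be re-judged after a prompt change to measure drift.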
Week 10: User Feedback Loop
- [ ] Build HumanReviewQueue (1% sampling)
- [ ] Create reviewer UI for feedback
- [ ] Implement feedback storage
- [ ] Track agreement metrics
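For the agreement metric, Cohen's kappa corrects raw reviewer-vs-agent agreement for what would be expected by chance, which matters when one decision class dominates. A self-contained sketch:

```python
from collections import Counter

def cohen_kappa(reviewer: list, agent: list) -> float:
    """Chance-corrected agreement between human reviewers and agent
    decisions: 1.0 = perfect, 0.0 = no better than chance."""
    n = len(reviewer)
    po = sum(r == a for r, a in zip(reviewer, agent)) / n  # observed agreement
    rc, ac = Counter(reviewer), Counter(agent)
    # Expected agreement if both sides labeled independently at their
    # observed class frequencies
    pe = sum((rc[label] / n) * (ac[label] / n) for label in set(rc) | set(ac))
    return 1.0 if pe == 1.0 else (po - pe) / (1 - pe)
```

Tracking kappa alongside raw agreement keeps a 90%-approval portfolio from masking poor judgment on the contested 10%.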
Week 11: Anomaly Detection
- [ ] Build anomaly detection (approval rate spikes, etc.)
- [ ] Implement automated alerts
- [ ] Create incident response playbook
- [ ] Test alerting system
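Before more sophisticated detection exists, a z-score check against a recent baseline is enough to catch approval-rate spikes. A sketch; the baseline window and threshold are assumptions to tune against real traffic:

```python
from statistics import mean, stdev

def approval_rate_anomaly(history: list, current: float,
                          z_threshold: float = 3.0) -> bool:
    """Flag the current approval rate if it sits more than `z_threshold`
    standard deviations from the recent baseline (e.g. hourly rates)."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current != mu  # flat baseline: any change is notable
    return abs(current - mu) / sigma > z_threshold
```

The same shape of check applies to denial rates, latency, and escalation volume; each metric just needs its own baseline window and threshold.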
Deliverables:
- ✅ Real-time production monitoring
- ✅ Automated quality evaluation (sampled)
- ✅ Human feedback collection
- ✅ Anomaly detection and alerting
Effort: ~4 weeks, 2 engineers
Phase 4: Continuous Improvement (Weeks 12-16)
Goal: Close the loop - prod failures → test suite → iteration.
Week 12: Failure Sync Pipeline
- [ ] Build FailureTestSyncService
- [ ] Automate prod failure → test case conversion
- [ ] Schedule weekly sync job
- [ ] Track test suite growth
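The prod-failure-to-test-case conversion can start as a straightforward mapping from a trace record onto a regression fixture. A sketch with an assumed trace schema; the field names and `RegressionCase` shape are illustrative, not the real pipeline's types:

```python
from dataclasses import dataclass

@dataclass
class RegressionCase:
    """Minimal regression record derived from a production failure."""
    case_id: str
    application: dict
    expected_decision: str  # corrected label from root-cause analysis
    failure_mode: str       # taxonomy category, for trend analysis

def failure_to_test_case(trace: dict, corrected_decision: str) -> RegressionCase:
    """Map a production failure trace onto a replayable test fixture."""
    return RegressionCase(
        case_id=f"prod-{trace['trace_id']}",
        application=trace["input"],
        expected_decision=corrected_decision,
        failure_mode=trace.get("failure_mode", "unknown"),
    )

case = failure_to_test_case(
    {"trace_id": "abc123", "input": {"loan_amount": 50000},
     "failure_mode": "tool_failure"},
    corrected_decision="deny",
)
```

Keeping the failure-mode label on each fixture lets the weekly sync job report which taxonomy categories are growing the test suite fastest.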
Week 13: Advanced Analytics
- [ ] Build failure pattern analysis dashboard
- [ ] Implement trend detection (metrics over time)
- [ ] Create agent performance leaderboard
- [ ] Generate monthly evaluation reports
Week 14: A/B Testing Infrastructure
- [ ] Implement canary deployments (10% traffic)
- [ ] Build comparison framework (control vs. treatment)
- [ ] Add statistical significance testing
- [ ] Create rollback automation
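For the significance test, a two-proportion z-test on approval (or error) rates between control and treatment is a reasonable starting point. A minimal sketch:

```python
from math import sqrt

def two_proportion_z(success_a: int, n_a: int,
                     success_b: int, n_b: int) -> float:
    """z-statistic comparing the success rates of two variants;
    |z| > 1.96 corresponds to significance at the 5% level."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se
```

With a 10% canary, sample sizes are small, so the rollback automation should require both statistical significance and a minimum observation count before acting.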
Week 15: Fair Lending Evaluations
- [ ] Implement disparate impact testing (4/5ths rule)
- [ ] Build matched-pair test generator
- [ ] Add ECOA/FCRA compliance checks
- [ ] Schedule monthly fair lending audits
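The 4/5ths rule compares the selection rate of a protected group to that of the most favored (reference) group; a ratio below 0.8 is conventionally treated as evidence of disparate impact. A minimal sketch:

```python
def adverse_impact_ratio(protected_rate: float, reference_rate: float) -> float:
    """Selection-rate ratio used in the 4/5ths (80%) rule."""
    return protected_rate / reference_rate

def violates_four_fifths(protected_rate: float, reference_rate: float) -> bool:
    """True when the ratio falls below the conventional 0.8 threshold."""
    return adverse_impact_ratio(protected_rate, reference_rate) < 0.8

# Approval rate of 50% for the protected group vs. 70% for the
# reference group gives a ratio of ~0.714, below the 0.8 threshold
flagged = violates_four_fifths(0.50, 0.70)
```

The ratio is a screening heuristic, not a legal determination; flagged results should feed the matched-pair tests and the monthly fair lending audit rather than trigger automatic conclusions.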
Week 16: Documentation & Training
- [ ] Comprehensive evaluation documentation
- [ ] Team training on evaluation framework
- [ ] Runbook for common issues
- [ ] Establish evaluation cadence
Deliverables:
- ✅ Continuous improvement pipeline
- ✅ Advanced analytics and trends
- ✅ A/B testing capability
- ✅ Fair lending compliance framework
Effort: ~5 weeks, 2-3 engineers
Tooling & Infrastructure
Required Tools
1. Agent Framework Observability (Built-in)
2. Evaluation Libraries
```shell
pip install pytest pytest-asyncio      # Testing
pip install deepeval                   # G-Eval, RAGAS
pip install rouge-score                # Text similarity
pip install sentence-transformers     # Embeddings
```
3. Monitoring & Alerting
- Azure Application Insights (already deployed)
- Azure Monitor Workbooks (dashboards)
- Azure Logic Apps (alerting, optional)
4. Data Management
- Azure Table Storage (test datasets, benchmark results)
- Azure Blob Storage (traces, logs)
Evaluation Dashboard
Components:
1. CI/CD Status: Current build, test pass rates
2. Regression Tracking: Metrics over time (line charts)
3. Error Distribution: Taxonomy breakdown (pie chart)
4. Benchmark Scores: LoanDefenders benchmark trend
5. Production Health: Real-time metrics, alerts
Tech Stack:
- Azure Monitor Workbooks (primary)
- Custom dashboard (React, optional for advanced features)
Continuous Improvement Loop
Quarterly Evaluation Cycle
Q1: Baseline & Foundation
- Run initial benchmarks
- Establish baseline metrics
- Set quality thresholds
Q2: Iterate & Improve
- Analyze failure patterns
- Implement fixes
- Re-run benchmarks
- Measure improvements
Q3: Expand Coverage
- Add new edge cases
- Increase test dataset size
- Improve evaluation depth
Q4: Compliance & Audit
- Fair lending evaluation
- Regulatory compliance review
- External audit (if applicable)
Monthly Reviews
Evaluation Review Meeting:
- Review metrics trends
- Discuss failure patterns
- Prioritize improvements
- Update test suite
Agenda:
1. Metric review (20 min)
2. Failure analysis (20 min)
3. Improvement proposals (15 min)
4. Action items (5 min)
Continuous Learning
Feedback Sources:
1. Production failures → New test cases
2. User feedback → Quality improvements
3. Regulatory changes → Compliance updates
4. Industry benchmarks → Capability gaps
Process:
```
Production Failure Detected
          ↓
Root Cause Analysis
          ↓
Create Regression Test
          ↓
Fix Implementation
          ↓
Re-Run Evaluation
          ↓
Deploy to Production
          ↓
Monitor for Recurrence
```
Summary
Evaluation Framework Pillars
- Offline Testing: Comprehensive test suite, simulation, benchmarking
- Online Monitoring: Real-time metrics, sampling, anomaly detection
- Error Analysis: Taxonomy, cascade detection, root cause analysis
- Continuous Improvement: Failure sync, trend analysis, iterative enhancement
Success Metrics
Offline (Pre-Deployment):
- Test pass rate: ≥95%
- Regression tests: 0 failures
- Benchmark score: ≥85%
- Expert agreement: ≥90%
Online (Production):
- Latency P95: <10 seconds
- Error rate: <1%
- Quality score: ≥3.5/5
- User satisfaction: ≥4.0/5
Improvement Rate:
- Monthly benchmark improvement: +2-5%
- Quarterly failure reduction: -10-20%
Implementation Timeline
- Phase 1 (Weeks 1-3): Foundation - Offline testing
- Phase 2 (Weeks 4-7): Advanced - LLM-judge, simulation, error analysis
- Phase 3 (Weeks 8-11): Online - Production monitoring
- Phase 4 (Weeks 12-16): Continuous - Improvement pipeline
Total: 12-16 weeks, 2-3 engineers
Cost: ~$100-200/month (Azure services + evaluation APIs)
References
Industry Practices
- Anthropic/Claude: Develop Strong Empirical Evaluations
Microsoft
- Agent Framework Lab
Research Papers
- Where LLM Agents Fail (AgentErrorTaxonomy)
- Why Multi-Agent Systems Fail (MAST Taxonomy)
Benchmarks
- GAIA Benchmark - General AI Assistant
- TAU2 Benchmark - Conversational agents
Last Updated: 2025-10-25 Review Frequency: Quarterly or after major changes Next Review: 2026-01-25