Observability Guide - Loan Defenders

Overview

The Loan Defenders application implements enterprise-grade observability using out-of-the-box solutions from OpenTelemetry, Azure Monitor, and Microsoft Agent Framework. This guide covers how to use observability features for debugging, performance analysis, and cost management.

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                      Azure Application Insights                  │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐         │
│  │   Traces     │  │    Logs      │  │   Metrics    │         │
│  └──────────────┘  └──────────────┘  └──────────────┘         │
└─────────────────────────────────────────────────────────────────┘
                              ▲
                              │ OpenTelemetry Exporter
                              │
┌─────────────────────────────────────────────────────────────────┐
│                    Loan Defenders API (FastAPI)                   │
│                                                                   │
│  ┌──────────────────────────────────────────────────────┐      │
│  │  OpenTelemetry Auto-Instrumentation                   │      │
│  │  • configure_azure_monitor()                          │      │
│  │  • FastAPIInstrumentor.instrument_app()               │      │
│  └──────────────────────────────────────────────────────┘      │
│                                                                   │
│  ┌──────────────────────────────────────────────────────┐      │
│  │  Observability Utilities                              │      │
│  │  • Correlation ID tracking (ContextVar)               │      │
│  │  • Token usage logging                                │      │
│  │  • Agent Framework observability                      │      │
│  └──────────────────────────────────────────────────────┘      │
│                                                                   │
│  ┌──────────────────────────────────────────────────────┐      │
│  │  Application Components                               │      │
│  │  • API Endpoints (/api/chat, /api/sessions)           │      │
│  │  • Conversation Orchestrator                          │      │
│  │  • Sequential Pipeline (Agent Workflow)               │      │
│  │  • MCP Servers (Tools)                                │      │
│  └──────────────────────────────────────────────────────┘      │
└─────────────────────────────────────────────────────────────────┘

Components

1. OpenTelemetry Auto-Instrumentation

Automatically captures: - HTTP request/response (FastAPI endpoints) - HTTP client calls (httpx, requests to external APIs) - Database calls (if applicable) - Exception stack traces

Configuration (loan_defenders/api/app.py:56-82):

# Automatically configured from environment variables
configure_azure_monitor()
FastAPIInstrumentor.instrument_app(app, excluded_urls="/health,/docs,/redoc")

2. Microsoft Agent Framework Observability

Automatically captures (loan_defenders/utils/observability.py:52-58): - Agent execution traces - Token usage (input/output tokens) - Agent performance metrics - Cost estimation

Configuration:

setup_observability(
    applicationinsights_connection_string=os.getenv("APPLICATIONINSIGHTS_CONNECTION_STRING"),
    enable_sensitive_data=False,  # Never log PII
    enable_live_metrics=True,      # Real-time monitoring
)

3. Correlation ID Tracking

Purpose: Track requests end-to-end across API → Agents → MCP servers

Implementation (loan_defenders/api/app.py:93-136): - Middleware extracts X-Correlation-ID header or generates UUID - Stored in ContextVar (thread-safe for async) - Automatically added to all logs via Observability.get_correlation_id() - Propagated through OpenTelemetry traces - Returned in response headers

Usage in code:

logger.info(
    "Processing request",
    extra={
        "correlation_id": Observability.get_correlation_id(),
        "application_id": app_id,
        # ... other fields
    }
)

4. Token Usage Tracking

Purpose: Cost management and optimization

Implementation (loan_defenders/utils/observability.py:239-294):

Observability.log_token_usage(
    agent_name="Credit_Assessor",
    input_tokens=150,
    output_tokens=75,
    model="gpt-4",
    application_id="LN1234567890"
)

Logged fields: - event_type: "token_usage" - agent_name: Which agent consumed tokens - input_tokens, output_tokens, total_tokens - model: Model deployment name - application_id: Masked for security - correlation_id: For request tracing

Environment Variables

Required for full observability:

# Required: Application Insights connection string
APPLICATIONINSIGHTS_CONNECTION_STRING=InstrumentationKey=...;IngestionEndpoint=...

# Recommended: Service identification
OTEL_SERVICE_NAME=loan-defenders-api
OTEL_SERVICE_VERSION=0.1.0
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production,cloud.provider=azure

# Optional: Logging configuration
LOG_LEVEL=INFO  # DEBUG for development
ENABLE_SENSITIVE_DATA=false  # NEVER enable in production

# Optional: Exclude health checks from tracing
OTEL_PYTHON_FASTAPI_EXCLUDED_URLS=/health,/docs,/redoc,/openapi.json

Azure Application Insights Queries (KQL)

1. Request Tracing

Find request by correlation ID:

union requests, traces, exceptions
| where customDimensions.correlation_id == "your-correlation-id-here"
| project timestamp, itemType, operation_Name, message, customDimensions
| order by timestamp asc

Track full loan application journey:

let correlationId = "your-correlation-id";
union requests, traces
| where customDimensions.correlation_id == correlationId
| extend
    phase = tostring(customDimensions.phase),
    agent = tostring(customDimensions.agent_name),
    app_id = tostring(customDimensions.application_id)
| project timestamp, itemType, operation_Name, phase, agent, message
| order by timestamp asc

2. Performance Analysis

API endpoint latency (p50, p95, p99):

requests
| where name == "POST /api/chat"
| summarize
    count = count(),
    avg_duration = avg(duration),
    p50 = percentile(duration, 50),
    p95 = percentile(duration, 95),
    p99 = percentile(duration, 99)
    by bin(timestamp, 5m)
| order by timestamp desc

Agent execution time:

traces
| where message contains "Agent processing"
| extend
    agent = tostring(customDimensions.agent_name),
    phase = tostring(customDimensions.phase),
    duration_ms = todouble(customDimensions.duration_ms)
| summarize
    count(),
    avg(duration_ms),
    percentile(duration_ms, 50),
    percentile(duration_ms, 95)
    by agent, phase

Slowest requests in last hour:

requests
| where timestamp > ago(1h)
| top 20 by duration desc
| project
    timestamp,
    name,
    duration,
    resultCode,
    operation_Id,
    customDimensions.correlation_id,
    customDimensions.application_id

3. Error Analysis

Error rate by endpoint:

requests
| where timestamp > ago(1h)
| summarize
    total = count(),
    errors = countif(success == false),
    error_rate = (countif(success == false) * 100.0 / count())
    by name
| where error_rate > 0
| order by error_rate desc

Recent exceptions with context:

exceptions
| where timestamp > ago(1h)
| extend
    correlation_id = tostring(customDimensions.correlation_id),
    application_id = tostring(customDimensions.application_id),
    error_type = tostring(customDimensions.error_type)
| project
    timestamp,
    type,
    outerMessage,
    correlation_id,
    application_id,
    error_type,
    operation_Id
| order by timestamp desc

Agent failure analysis:

traces
| where severityLevel >= 3  // Error or Critical
| where message contains "failed"
| extend
    agent = tostring(customDimensions.agent_name),
    phase = tostring(customDimensions.phase),
    error_type = tostring(customDimensions.error_type)
| summarize
    failure_count = count(),
    unique_errors = dcount(error_type)
    by agent, phase
| order by failure_count desc

4. Cost Management

Token usage by agent:

traces
| where customDimensions.event_type == "token_usage"
| extend
    agent = tostring(customDimensions.agent_name),
    total_tokens = toint(customDimensions.total_tokens),
    model = tostring(customDimensions.model)
| summarize
    total_tokens = sum(total_tokens),
    avg_tokens_per_call = avg(total_tokens),
    call_count = count()
    by agent, model
| extend estimated_cost_usd = total_tokens * 0.00001  // Adjust rate for your model
| order by estimated_cost_usd desc

Daily token usage trend:

traces
| where customDimensions.event_type == "token_usage"
| extend total_tokens = toint(customDimensions.total_tokens)
| summarize
    total_tokens = sum(total_tokens),
    call_count = count()
    by bin(timestamp, 1d)
| extend daily_cost_usd = total_tokens * 0.00001
| order by timestamp desc

Expensive applications (top token consumers):

traces
| where customDimensions.event_type == "token_usage"
| extend
    app_id = tostring(customDimensions.application_id),
    total_tokens = toint(customDimensions.total_tokens)
| summarize
    total_tokens = sum(total_tokens),
    agent_calls = count()
    by app_id
| extend estimated_cost = total_tokens * 0.00001
| top 20 by total_tokens desc

5. User Behavior

Loan application funnel:

traces
| where customDimensions.agent_name == "Cap-ital America"
  or customDimensions.phase in ("intake", "credit", "income", "risk")
| extend
    correlation_id = tostring(customDimensions.correlation_id),
    phase = tostring(customDimensions.phase),
    completion = toint(customDimensions.completion_percentage)
| summarize arg_max(timestamp, *) by correlation_id, phase
| summarize
    started = dcountif(correlation_id, phase == "collecting"),
    completed_intake = dcountif(correlation_id, completion >= 100),
    reached_credit = dcountif(correlation_id, phase == "credit"),
    reached_income = dcountif(correlation_id, phase == "income"),
    reached_decision = dcountif(correlation_id, phase == "risk")

Session duration analysis:

requests
| where name == "POST /api/chat"
| extend session_id = tostring(customDimensions.session_id)
| summarize
    session_start = min(timestamp),
    session_end = max(timestamp),
    message_count = count()
    by session_id
| extend session_duration_minutes = datetime_diff('minute', session_end, session_start)
| summarize
    avg(session_duration_minutes),
    percentile(session_duration_minutes, 50),
    percentile(session_duration_minutes, 95)

Debugging Workflows

Scenario 1: User Reports Error

Get correlation ID from user or response headers

Find full request trace:

union requests, traces, exceptions
| where customDimensions.correlation_id == "CORRELATION_ID"
| order by timestamp asc

Identify failure point from trace timeline
Check exception details if present
Review agent logs for that correlation ID

Scenario 2: Slow Performance

Identify slow endpoint:

requests
| where duration > 5000  // >5 seconds
| top 50 by duration desc

Check agent execution times for those requests
Review external dependencies (httpx calls)
Analyze token usage - high tokens = longer processing

Scenario 3: Cost Spike

Identify high token usage:

traces
| where customDimensions.event_type == "token_usage"
| where timestamp > ago(1h)
| summarize total_tokens = sum(toint(customDimensions.total_tokens)) by bin(timestamp, 5m)
| order by total_tokens desc

Find applications causing spike
Review agent prompts - are they too verbose?
Check for loops - is same request retrying?

Dashboards

Create Application Insights Workbook

Navigate to Azure Portal → Application Insights → Workbooks
Create new workbook
Add queries from above (Performance, Errors, Cost)
Set auto-refresh to 5 minutes
Pin to dashboard for monitoring

Recommended Tiles

Request Rate - requests/minute over time
Error Rate - percentage of failed requests
P95 Latency - 95th percentile response time
Token Usage - daily/hourly token consumption
Active Sessions - current session count
Agent Performance - avg duration by agent

Alerts

Configure alerts in Azure Application Insights:

Critical Alerts

High Error Rate
Condition: Error rate > 5% over 5 minutes
Action: Email, Slack notification
Slow Performance
Condition: P95 latency > 5 seconds over 5 minutes
Action: Email notification
Service Down
Condition: Availability < 99% over 5 minutes
Action: PagerDuty/Slack

Warning Alerts

High Token Usage
Condition: Tokens > 100K over 1 hour
Action: Email notification
Agent Failures
Condition: Agent failures > 10 over 15 minutes
Action: Slack notification

Best Practices

DO ✅

Always include correlation_id in logs
Mask PII (use Observability.mask_application_id())
Use structured logging with extra={} dict
Log business events (application submitted, approved, rejected)
Track token usage for cost management
Set appropriate log levels (INFO in prod, DEBUG in dev)

DON'T ❌

Log sensitive data (SSN, full names, addresses)
Log full prompts (token waste in logs)
Log on every line (noise and performance)
Ignore correlation IDs (breaks tracing)
Skip error context (always log error_type and stack trace)
Over-instrument (exclude health checks, static assets)

Troubleshooting

Traces not appearing in Application Insights

Check connection string: Verify APPLICATIONINSIGHTS_CONNECTION_STRING
Check firewall: Ensure outbound HTTPS to dc.services.visualstudio.com
Check sampling: Traces may be sampled (default 100% in dev)
Wait 2-3 minutes: Ingestion delay is normal

High latency from instrumentation

Check batch settings: Default batching should be <1ms overhead
Exclude noisy endpoints: Add to OTEL_PYTHON_FASTAPI_EXCLUDED_URLS
Reduce attribute cardinality: Limit unique values in custom dimensions
Disable sensitive data: Set ENABLE_SENSITIVE_DATA=false

Missing correlation IDs

Verify middleware: Check add_correlation_id_middleware is registered
Check ContextVar: Ensure async context propagation
Clear between requests: Middleware should call clear_correlation_id()

Resources

Support

For observability issues: 1. Check this guide first 2. Review Application Insights logs 3. Contact DevOps team with correlation ID