
Observability Stack Architecture

Last Updated: 2025-10-29
Status: Production
Related ADR: ADR-057: Unified Observability and Logging Strategy


Overview

This document describes how logging, monitoring, and observability work across the entire Loan Defenders multi-agent system stack, from browser to backend services to cloud infrastructure.

┌────────────────────────────────────────────────────────────────┐
│                    Observability Stack                          │
│                                                                 │
│  Browser ──→ UI Logs ──────────────────┐                       │
│                                         │                       │
│  API Server ──→ Application Logs ──────┼──→ Azure Application  │
│                                         │    Insights           │
│  MCP Servers ──→ Tool Logs ─────────────┤    (Centralized)     │
│                                         │                       │
│  Agents ──→ LLM Token Metrics ──────────┘                       │
│                                                                 │
│  Infrastructure ──→ Azure Monitor ──────────→ Metrics/Alerts    │
│                                                                 │
└────────────────────────────────────────────────────────────────┘

Layer 1: Frontend (Browser)

Technology Stack

  • Framework: React 18 + TypeScript
  • Logging: Browser console API (console.log/error/warn)
  • State: No centralized logging (standard for SPAs)

What Gets Logged

// apps/ui/src/components/chat/CoordinatorChat.tsx
console.error('Failed to send message:', error);

// apps/ui/src/pages/results/ResultsPage.tsx
console.log('ResultsPage: Checking for stored decision');
console.warn('ResultsPage: No decision found in sessionStorage');

Security Considerations

  • ✅ No PII logged in browser console
  • ✅ Error messages are generic user-facing strings
  • ✅ No API keys or sensitive tokens in frontend code

Future Enhancements

  • Add frontend error reporting (e.g., Sentry)
  • Send critical errors to backend API for aggregation
  • Add user session analytics

Layer 2: API Server (FastAPI)

Technology Stack

  • Framework: FastAPI (Python 3.12)
  • Logging: loan_defenders_utils.Observability → Python logging module
  • Backend: Azure Application Insights (optional)

Architecture

# apps/api/loan_defenders/api/app.py

from loan_defenders_utils import Observability

# Initialize observability ONCE at startup
Observability.initialize()

# Get logger (once per module)
logger = Observability.get_logger("api")

# Structured logging with extra fields
logger.info(
    "Starting workflow stream",
    extra={
        "session_id": request.session_id[:8] + "***",
        "application_id": application.application_id[:8] + "***",
        "loan_amount": float(application.loan_amount),
    },
)

Log Flow

API Request
FastAPI Middleware (CORS, request logging)
Endpoint Handler
Observability.get_logger("api")
Python logging module (with structured extra fields)
Log Handlers:
    ├─→ Console (stdout) → Docker logs → Azure Container Instances logs
    ├─→ File (rotating, 10MB/file, 5 backups) → /logs/*.log
    └─→ Azure Application Insights (via OpenTelemetry) [optional]

Configuration

# Environment variables control logging behavior
LOG_LEVEL=INFO                    # DEBUG, INFO, WARNING, ERROR, CRITICAL
LOG_OUTPUT=console,file,azure     # Comma-separated outputs
LOG_FORMAT=json                   # json or text
OTEL_TRACES_ENABLED=true          # OpenTelemetry trace context
APPLICATIONINSIGHTS_CONNECTION_STRING=InstrumentationKey=...
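A minimal sketch of how these variables might be consumed at startup (the `build_handlers` helper and the log file path are illustrative assumptions, not the actual `Observability` implementation):

```python
import logging
import logging.handlers
import os
import sys


def build_handlers() -> list[logging.Handler]:
    """Create log handlers from the LOG_OUTPUT env var (illustrative sketch)."""
    outputs = os.environ.get("LOG_OUTPUT", "console").split(",")
    handlers: list[logging.Handler] = []
    if "console" in outputs:
        # stdout → Docker logs → Azure Container Instances logs
        handlers.append(logging.StreamHandler(sys.stdout))
    if "file" in outputs:
        handlers.append(
            logging.handlers.RotatingFileHandler(
                "/logs/loan-defenders.log",   # assumed path
                maxBytes=10 * 1024 * 1024,    # 10 MB per file, as documented
                backupCount=5,                # 5 backups, as documented
            )
        )
    # An "azure" output would attach the OpenTelemetry/App Insights
    # exporter here when APPLICATIONINSIGHTS_CONNECTION_STRING is set.
    return handlers
```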

Log Outputs

Console (stdout)

{
  "timestamp": "2025-10-29T14:30:45Z",
  "level": "INFO",
  "name": "loan_defenders.api",
  "message": "Starting workflow stream",
  "session_id": "sess1234***",
  "application_id": "LN123456***",
  "loan_amount": 250000.00
}
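One way to produce this JSON shape with the standard logging module (a sketch; `JsonFormatter` is illustrative, not the project's actual formatter):

```python
import json
import logging
from datetime import datetime, timezone

# Attributes the logging machinery sets on every record; anything
# outside this set arrived via extra={} and should be emitted as-is.
_RESERVED = set(logging.LogRecord("", 0, "", 0, "", (), None).__dict__) | {"message"}


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log entry, merging structured extra fields."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "name": record.name,
            "message": record.getMessage(),
        }
        # Merge the extra={} fields (e.g. session_id, loan_amount)
        payload.update(
            {k: v for k, v in record.__dict__.items() if k not in _RESERVED}
        )
        return json.dumps(payload)
```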

File Rotation

  • Path: /logs/loan-defenders_YYYYMMDD.log
  • Rotation: 10 MB per file, 5 backups
  • Format: JSON for machine parsing

Azure Application Insights

  • Integration: Via OpenTelemetry SDK
  • Traces: Each log entry becomes a trace
  • Custom Dimensions: All extra={} fields are searchable
  • Retention: 90 days by default

Layer 3: MCP Servers (Tool Servers)

Technology Stack

  • Framework: FastMCP (Python)
  • Logging: loan_defenders_utils.Observability
  • Ports: 8010 (application_verification), 8011 (document_processing), 8012 (financial_calculations)

Architecture

# apps/mcp_servers/application_verification/server.py

from loan_defenders_utils import Observability

logger = Observability.get_logger("application_verification_server")

# Tool function pattern (applied to ALL 15 tools across 3 servers)
@mcp.tool()
async def retrieve_credit_report(applicant_id: str, full_name: str, address: str) -> str:
    import time
    start_time = time.time()

    logger.debug(
        "Credit report request received",
        extra={
            "tool": "retrieve_credit_report",
            "applicant_id": applicant_id[:8] + "***",
        },
    )

    try:
        result = await service.retrieve_credit_report(applicant_id, full_name, address)
        duration = time.time() - start_time

        logger.info(
            "Credit report retrieved successfully",
            extra={
                "tool": "retrieve_credit_report",
                "applicant_id": applicant_id[:8] + "***",
                "duration_seconds": round(duration, 3),
                "credit_score": result.get("credit_score"),  # Business metric
            },
        )
        return json.dumps(result)

    except Exception as e:
        duration = time.time() - start_time
        logger.error(
            "Credit report retrieval failed",
            extra={
                "tool": "retrieve_credit_report",
                "applicant_id": applicant_id[:8] + "***",
                "duration_seconds": round(duration, 3),
                "error": str(e),
                "error_type": type(e).__name__,
            },
            exc_info=True,  # Full stack trace
        )
        raise

Startup Logging

# All 3 MCP servers log on startup
logger.info(
    "Starting Application Verification MCP Server",
    extra={
        "port": 8010,
        "tools_available": [
            "retrieve_credit_report",
            "verify_employment",
            "get_bank_account_data",
            "get_tax_transcript_data",
            "verify_asset_information",
            "validate_basic_parameters",
            "application_verification_health_check",
        ],
    },
)

Performance Tracking

Every tool call logs:

  • Request received: DEBUG level (high frequency)
  • Success: INFO level with duration and business metrics
  • Failure: ERROR level with error type, message, duration, and stack trace

Example Query (Azure Application Insights):

traces
| where customDimensions.tool == "retrieve_credit_report"
| extend duration = todouble(customDimensions.duration_seconds)
| summarize avg(duration), max(duration), count() by bin(timestamp, 5m)
| render timechart
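The timing/success/failure pattern shown above could be factored into a decorator rather than repeated inline in all 15 tools (a sketch of that refactoring, not how the codebase currently does it):

```python
import functools
import logging
import time

logger = logging.getLogger("mcp.tools")


def logged_tool(func):
    """Wrap an async tool: DEBUG on entry, INFO with duration on success,
    ERROR with duration and stack trace on failure."""
    @functools.wraps(func)
    async def wrapper(*args, **kwargs):
        start = time.time()
        logger.debug("Tool request received", extra={"tool": func.__name__})
        try:
            result = await func(*args, **kwargs)
            logger.info(
                "Tool completed",
                extra={
                    "tool": func.__name__,
                    "duration_seconds": round(time.time() - start, 3),
                },
            )
            return result
        except Exception as e:
            logger.error(
                "Tool failed",
                extra={
                    "tool": func.__name__,
                    "duration_seconds": round(time.time() - start, 3),
                    "error": str(e),
                    "error_type": type(e).__name__,
                },
                exc_info=True,  # full stack trace
            )
            raise
    return wrapper
```

The inline version keeps each tool self-describing; the decorator trades that for consistency and less duplication.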


Layer 4: AI Agents (Microsoft Agent Framework)

Technology Stack

  • Framework: Microsoft Agent Framework
  • Observability: Agent Framework built-in observability + custom logging
  • LLM: Azure OpenAI (GPT-4)

Observability Features

# Enable agent framework observability
ENABLE_AGENT_FRAMEWORK_OBSERVABILITY=true

Agent Framework tracks:

  • LLM call metadata (model, deployment, endpoint)
  • Token usage (input, output, total)
  • Agent execution time
  • Tool calls made by agents
  • Conversation history size

Integration with Observability Class

# loan_defenders_utils/observability.py

@staticmethod
def log_token_usage(
    agent_name: str,
    input_tokens: int,
    output_tokens: int,
    model: str | None = None,
    application_id: str | None = None,
) -> None:
    """Log token usage for cost tracking and analysis."""
    logger = logging.getLogger("agent_framework.observability.token_usage")

    total_tokens = input_tokens + output_tokens

    logger.info(
        "Token usage",
        extra={
            "event_type": "token_usage",
            "agent_name": agent_name,
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "total_tokens": total_tokens,
            "model": model or "unknown",
            "application_id": Observability.mask_application_id(application_id),
            "correlation_id": Observability.get_correlation_id(),
        },
    )

Agent Pipeline Logging

# apps/api/loan_defenders/orchestrators/sequential_pipeline.py

# High-frequency events: DEBUG level (filtered in production)
logger.debug("Accumulating text chunk from event.data.text")

# Agent stage completion: INFO level
logger.info(
    "Agent stage completed",
    extra={
        "agent": "Credit_Agent",
        "stage": "credit_assessment",
        "duration_seconds": elapsed,
        "tokens_used": {"input": 450, "output": 230},
    },
)

Layer 5: Infrastructure (Azure Monitor)

Technology Stack

  • Platform: Azure Container Instances (ACI)
  • Monitoring: Azure Monitor
  • Logs: Container logs + Application Insights

Infrastructure Metrics

Automatically collected by Azure Monitor:

  • CPU usage per container
  • Memory usage per container
  • Network traffic (ingress/egress)
  • Container restart count
  • HTTP request metrics (via Application Gateway)

Container Log Collection

Azure Container Instances
Container stdout/stderr → Azure Monitor Logs
Log Analytics Workspace
Query with KQL or view in Azure Portal

Example Query:

ContainerInstanceLog_CL
| where ContainerGroup_s == "loan-defenders-apps"
| where Message contains "ERROR"
| project TimeGenerated, ContainerName_s, Message
| order by TimeGenerated desc

Health Checks

All services expose /health endpoints:

# API Server
@app.get("/health")
async def health_check_root() -> HealthResponse:
    return HealthResponse(
        status="healthy",
        services={
            "conversation_orchestrator": "available",
            "sequential_pipeline": "available",
            "session_manager": "available",
        },
        timestamp=datetime.now(timezone.utc).isoformat(),
    )

# MCP Servers
@mcp.custom_route("/health", methods=["GET"])
async def health_endpoint(request: Request) -> JSONResponse:
    return JSONResponse({
        "status": "healthy",
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "server": "application_verification",
        "port": 8010,
    })

Monitoring:

  • Application Gateway probes /health every 30 seconds
  • Unhealthy containers are restarted automatically
  • Health check failures trigger alerts
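A minimal poller in the same spirit, useful for local smoke tests (the MCP ports are from above; the API port 8000 and the `check_health` helper are assumptions, not part of the codebase):

```python
import json
import urllib.request

# Service → health endpoint; MCP ports are documented, API port is assumed
HEALTH_ENDPOINTS = {
    "api": "http://localhost:8000/health",
    "application_verification": "http://localhost:8010/health",
    "document_processing": "http://localhost:8011/health",
    "financial_calculations": "http://localhost:8012/health",
}


def check_health(url: str, timeout: float = 5.0) -> bool:
    """Return True iff the endpoint answers with {"status": "healthy"}."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = json.loads(resp.read().decode("utf-8"))
        return body.get("status") == "healthy"
    except OSError:
        # Connection refused, DNS failure, timeout → unhealthy
        return False
```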


Cross-Cutting Concerns

Correlation IDs

Enable end-to-end tracing across all layers:

# Set at API entry point
correlation_id = Observability.set_correlation_id()

# Automatically included in all logs
logger.info("Processing", extra={
    "correlation_id": Observability.get_correlation_id()
})

# Query entire request journey
# traces
# | where customDimensions.correlation_id == "abc-123-def"
# | order by timestamp asc

OpenTelemetry Integration

Optional distributed tracing:

OTEL_TRACES_ENABLED=true
  • Injects trace_id and span_id into all logs
  • Enables correlation with distributed traces
  • Works with Jaeger, Zipkin, Azure Monitor

Security and PII Masking

Applied across all layers:

# Application IDs
application_id[:8] + "***"  # LN123456***

# Names
Observability.mask_pii("John Doe", "name")  # J*** D***

# Emails
Observability.mask_pii("john@example.com", "email")  # jo***@example.com

Never logged:

  • Social Security Numbers (SSN)
  • Full account numbers
  • Account balances
  • Income/tax amounts
  • Full credit scores
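A masking helper matching the examples above could look like this (illustrative; the real `Observability.mask_pii` may differ):

```python
def mask_pii(value: str, kind: str) -> str:
    """Mask sensitive values before logging (sketch matching the doc examples)."""
    if kind == "name":
        # "John Doe" -> "J*** D***": keep only the first letter of each word
        return " ".join(part[0] + "***" for part in value.split())
    if kind == "email":
        # "john@example.com" -> "jo***@example.com": keep two chars of the local part
        local, _, domain = value.partition("@")
        return local[:2] + "***@" + domain
    # Default: keep a short prefix, same pattern as application IDs
    return value[:8] + "***"
```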


Operational Dashboards

Azure Application Insights Dashboards

1. System Health Dashboard

// Error rate by service
traces
| where severityLevel >= 3  // WARNING or ERROR
| summarize errors = count() by bin(timestamp, 5m), tostring(customDimensions.service)
| render timechart

2. Performance Dashboard

// Tool performance (P50, P95, P99)
traces
| where message contains "Tool completed"
| extend duration = todouble(customDimensions.duration_seconds)
| summarize
    p50 = percentile(duration, 50),
    p95 = percentile(duration, 95),
    p99 = percentile(duration, 99)
    by tostring(customDimensions.tool)

3. Business Metrics Dashboard

// Loan application outcomes
traces
| where customDimensions.event_type == "business_metric"
| where customDimensions.metric == "loan_decision"
| summarize
    count(),
    avg_processing_time = avg(todouble(customDimensions.processing_time_seconds))
    by tostring(customDimensions.decision), bin(timestamp, 1h)
| render columnchart

4. Cost Tracking Dashboard

// LLM token usage and cost
traces
| where customDimensions.event_type == "token_usage"
| summarize
    total_tokens = sum(toint(customDimensions.total_tokens)),
    estimated_cost = sum(toint(customDimensions.total_tokens)) * 0.00001
    by tostring(customDimensions.agent_name), bin(timestamp, 1h)
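The same estimate in Python, for ad-hoc checks against exported token_usage events (the $0.00001/token blended rate is carried over from the query above and is an assumption, not a published price):

```python
COST_PER_TOKEN = 0.00001  # blended $/token rate from the dashboard query (assumption)


def estimate_cost(token_events: list[dict]) -> float:
    """Sum total_tokens across token_usage events into an estimated spend."""
    total = sum(int(e["total_tokens"]) for e in token_events)
    return total * COST_PER_TOKEN
```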


Alerting Strategy

Critical Alerts (Immediate Response)

  1. Service Down
     • Condition: Health check fails for >2 minutes
     • Action: PagerDuty alert, automatic container restart

  2. High Error Rate
     • Condition: Error rate >5% over 5 minutes
     • Action: Alert engineering team

  3. Token Budget Exceeded
     • Condition: Daily token usage >$100
     • Action: Alert finance + engineering

Warning Alerts (Business Hours)

  1. Slow Performance
     • Condition: P95 latency >10 seconds
     • Action: Slack notification

  2. Increased Retry Rate
     • Condition: Retry rate >10% over 15 minutes
     • Action: Slack notification

  3. Low Resource Availability
     • Condition: CPU >80% or Memory >80%
     • Action: Slack notification, consider scaling

Log Retention Policies

Log Type               Retention  Storage            Cost
Console logs (stdout)  7 days     Azure Monitor      Included
File logs (rotating)   30 days    Container storage  Minimal
Application Insights   90 days    Azure storage      $2.30/GB ingested
Archive (compliance)   7 years    Blob storage       $0.01/GB/month

Performance Characteristics

Log Volume Analysis

Daily Production Estimates (10,000 loan applications/day):

Component       Log Level  Entries/Day  Size/Day
API Server      INFO       ~50,000      ~10 MB
MCP Servers     INFO       ~100,000     ~20 MB
Agents          INFO       ~80,000      ~15 MB
Infrastructure  Auto       ~10,000      ~5 MB
Total                      ~240,000     ~50 MB

Before optimization, log volume was ~500 MB/day; the current ~50 MB/day represents a 90% reduction.

Latency Impact

  • Structured logging overhead: <1ms per log entry
  • Azure Application Insights: Async upload, no request blocking
  • File I/O: Buffered, negligible impact

Debugging Workflows

Common Debugging Scenarios

1. Find why a loan application failed

traces
| where customDimensions.application_id contains "LN123456"
| order by timestamp asc
| project timestamp, severityLevel, message, customDimensions

2. Identify slow MCP tool

traces
| where message contains "Tool completed"
| extend duration = todouble(customDimensions.duration_seconds)
| where duration > 5.0
| summarize count(), avg(duration) by tostring(customDimensions.tool)
| order by avg_duration desc

3. Trace agent-to-agent communication

traces
| where customDimensions.correlation_id == "abc-123-def"
| where customDimensions.agent_name != ""
| order by timestamp asc
| project timestamp, customDimensions.agent_name, customDimensions.stage, message

4. Find startup failures

traces
| where message contains "failed to start" or message contains "Failed to initialize"
| where timestamp > ago(1h)
| project timestamp, customDimensions.service, customDimensions.error, customDimensions.critical_env_vars

Best Practices

For Developers

  1. Never use print() - Always use logger from Observability.get_logger()
  2. Always mask PII - Use masking utilities for sensitive data
  3. Use correct log level - DEBUG for frequent, INFO for business events
  4. Include context - Always use extra={} with relevant metadata
  5. Handle errors - Wrap risky operations in try/except with logging

For Operations

  1. Monitor MTTR - Track time from alert to resolution
  2. Review logs weekly - Identify patterns and optimize log levels
  3. Test failure scenarios - Verify logs are helpful during outages
  4. Update dashboards - Keep KQL queries relevant to current issues
  5. Tune alerts - Minimize false positives while catching real issues

For Security

  1. Never log PII in full - Always mask sensitive data
  2. Audit log queries - Who is accessing which logs?
  3. Review retention - Comply with data protection regulations
  4. Encrypt in transit - Use HTTPS for Application Insights
  5. Regular security reviews - Verify no sensitive data leaking into logs


Maintenance

Review Schedule: Quarterly

Next Review: 2026-01-29

Review Checklist:

  - [ ] MTTR metrics achieved (<15 minutes)
  - [ ] Log volume within budget (<50 MB/day)
  - [ ] Zero PII exposure incidents
  - [ ] Dashboards actively used by team
  - [ ] Alert signal-to-noise ratio acceptable