
Observability Stack Architecture

Last Updated: 2025-10-29
Status: Production
Related ADR: ADR-057: Unified Observability and Logging Strategy


Overview

This document describes how logging, monitoring, and observability work across the entire Loan Defenders multi-agent system stack, from browser to backend services to cloud infrastructure.

┌────────────────────────────────────────────────────────────────┐
│                    Observability Stack                          │
│                                                                 │
│  Browser ──→ UI Logs ──────────────────┐                       │
│                                         │                       │
│  API Server ──→ Application Logs ──────┼──→ Azure Application  │
│                                         │    Insights           │
│  MCP Servers ──→ Tool Logs ─────────────┤    (Centralized)     │
│                                         │                       │
│  Agents ──→ LLM Token Metrics ──────────┘                       │
│                                                                 │
│  Infrastructure ──→ Azure Monitor ──────────→ Metrics/Alerts    │
│                                                                 │
└────────────────────────────────────────────────────────────────┘

Layer 1: Frontend (Browser)

Technology Stack

  • Framework: React 18 + TypeScript
  • Logging: Browser console API (console.log/error/warn)
  • State: No centralized logging (standard for SPAs)

What Gets Logged

// apps/ui/src/components/chat/CoordinatorChat.tsx
console.error('Failed to send message:', error);

// apps/ui/src/pages/results/ResultsPage.tsx
console.log('ResultsPage: Checking for stored decision');
console.warn('ResultsPage: No decision found in sessionStorage');

Security Considerations

  • ✅ No PII logged in browser console
  • ✅ Error messages are generic user-facing strings
  • ✅ No API keys or sensitive tokens in frontend code

Future Enhancements

  • Add frontend error reporting (e.g., Sentry)
  • Send critical errors to backend API for aggregation
  • Add user session analytics

Layer 2: API Server (FastAPI)

Technology Stack

  • Framework: FastAPI (Python 3.12)
  • Logging: loan_defenders_utils.Observability → Python logging module
  • Backend: Azure Application Insights (optional)

Architecture

# apps/api/loan_defenders/api/app.py

from loan_defenders_utils import Observability

# Initialize observability ONCE at startup
Observability.initialize()

# Get logger (once per module)
logger = Observability.get_logger("api")

# Structured logging with extra fields
logger.info(
    "Starting workflow stream",
    extra={
        "session_id": request.session_id[:8] + "***",
        "application_id": application.application_id[:8] + "***",
        "loan_amount": float(application.loan_amount),
    },
)

Log Flow

API Request
FastAPI Middleware (CORS, request logging)
Endpoint Handler
Observability.get_logger("api")
Python logging module (with structured extra fields)
Log Handlers:
    ├─→ Console (stdout) → Docker logs → Azure Container Instances logs
    ├─→ File (rotating, 10MB/file, 5 backups) → /logs/*.log
    └─→ Azure Application Insights (via OpenTelemetry) [optional]

Configuration

# Environment variables control logging behavior
LOG_LEVEL=INFO                    # DEBUG, INFO, WARNING, ERROR, CRITICAL
LOG_OUTPUT=console,file,azure     # Comma-separated outputs
LOG_FORMAT=json                   # json or text
OTEL_TRACES_ENABLED=true          # OpenTelemetry trace context
APPLICATIONINSIGHTS_CONNECTION_STRING=InstrumentationKey=...
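A minimal sketch of how these variables might be consumed at startup (the `build_handlers` helper and the log file path are illustrative assumptions, not the actual `Observability` implementation):

```python
import logging
import logging.handlers
import os
import sys


def build_handlers() -> list[logging.Handler]:
    """Create log handlers from the LOG_OUTPUT env var (illustrative sketch)."""
    outputs = os.environ.get("LOG_OUTPUT", "console").split(",")
    handlers: list[logging.Handler] = []
    if "console" in outputs:
        # stdout → Docker logs → Azure Container Instances logs
        handlers.append(logging.StreamHandler(sys.stdout))
    if "file" in outputs:
        handlers.append(
            logging.handlers.RotatingFileHandler(
                "/logs/loan-defenders.log",   # assumed path
                maxBytes=10 * 1024 * 1024,    # 10 MB per file, as documented
                backupCount=5,                # 5 backups, as documented
            )
        )
    # An "azure" output would attach the OpenTelemetry/App Insights
    # exporter here when APPLICATIONINSIGHTS_CONNECTION_STRING is set.
    return handlers
```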

Log Outputs

Console (stdout)

{
  "timestamp": "2025-10-29T14:30:45Z",
  "level": "INFO",
  "name": "loan_defenders.api",
  "message": "Starting workflow stream",
  "session_id": "sess1234***",
  "application_id": "LN123456***",
  "loan_amount": 250000.00
}
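One way to produce this JSON shape with the standard logging module (a sketch; `JsonFormatter` is illustrative, not the project's actual formatter):

```python
import json
import logging
from datetime import datetime, timezone

# Attributes the logging machinery sets on every record; anything
# outside this set arrived via extra={} and should be emitted as-is.
_RESERVED = set(logging.LogRecord("", 0, "", 0, "", (), None).__dict__) | {"message"}


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log entry, merging structured extra fields."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "name": record.name,
            "message": record.getMessage(),
        }
        # Merge the extra={} fields (e.g. session_id, loan_amount)
        payload.update(
            {k: v for k, v in record.__dict__.items() if k not in _RESERVED}
        )
        return json.dumps(payload)
```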

File Rotation

  • Path: /logs/loan-defenders_YYYYMMDD.log
  • Rotation: 10 MB per file, 5 backups
  • Format: JSON for machine parsing

Azure Application Insights

  • Integration: Via OpenTelemetry SDK
  • Traces: Each log entry becomes a trace
  • Custom Dimensions: All extra={} fields are searchable
  • Retention: 90 days by default

Layer 3: MCP Servers (Tool Servers)

Technology Stack

  • Framework: FastMCP (Python)
  • Logging: loan_defenders_utils.Observability
  • Ports: 8010 (application_verification), 8011 (document_processing), 8012 (financial_calculations)

Architecture

# apps/mcp_servers/application_verification/server.py

from loan_defenders_utils import Observability

logger = Observability.get_logger("application_verification_server")

# Tool function pattern (applied to ALL 15 tools across 3 servers)
@mcp.tool()
async def retrieve_credit_report(applicant_id: str, full_name: str, address: str) -> str:
    import time
    start_time = time.time()

    logger.debug(
        "Credit report request received",
        extra={
            "tool": "retrieve_credit_report",
            "applicant_id": applicant_id[:8] + "***",
        },
    )

    try:
        result = await service.retrieve_credit_report(applicant_id, full_name, address)
        duration = time.time() - start_time

        logger.info(
            "Credit report retrieved successfully",
            extra={
                "tool": "retrieve_credit_report",
                "applicant_id": applicant_id[:8] + "***",
                "duration_seconds": round(duration, 3),
                "credit_score": result.get("credit_score"),  # Business metric
            },
        )
        return json.dumps(result)

    except Exception as e:
        duration = time.time() - start_time
        logger.error(
            "Credit report retrieval failed",
            extra={
                "tool": "retrieve_credit_report",
                "applicant_id": applicant_id[:8] + "***",
                "duration_seconds": round(duration, 3),
                "error": str(e),
                "error_type": type(e).__name__,
            },
            exc_info=True,  # Full stack trace
        )
        raise

Startup Logging

# All 3 MCP servers log on startup
logger.info(
    "Starting Application Verification MCP Server",
    extra={
        "port": 8010,
        "tools_available": [
            "retrieve_credit_report",
            "verify_employment",
            "get_bank_account_data",
            "get_tax_transcript_data",
            "verify_asset_information",
            "validate_basic_parameters",
            "application_verification_health_check",
        ],
    },
)

Performance Tracking

Every tool call logs:

  • Request received: DEBUG level (high frequency)
  • Success: INFO level with duration and business metrics
  • Failure: ERROR level with error type, message, duration, and stack trace

Example Query (Azure Application Insights):

traces
| where customDimensions.tool == "retrieve_credit_report"
| extend duration = todouble(customDimensions.duration_seconds)
| summarize avg(duration), max(duration), count() by bin(timestamp, 5m)
| render timechart
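The timing/success/failure pattern shown above could be factored into a decorator rather than repeated inline in all 15 tools (a sketch of that refactoring, not how the codebase currently does it):

```python
import functools
import logging
import time

logger = logging.getLogger("mcp.tools")


def logged_tool(func):
    """Wrap an async tool: DEBUG on entry, INFO with duration on success,
    ERROR with duration and stack trace on failure."""
    @functools.wraps(func)
    async def wrapper(*args, **kwargs):
        start = time.time()
        logger.debug("Tool request received", extra={"tool": func.__name__})
        try:
            result = await func(*args, **kwargs)
            logger.info(
                "Tool completed",
                extra={
                    "tool": func.__name__,
                    "duration_seconds": round(time.time() - start, 3),
                },
            )
            return result
        except Exception as e:
            logger.error(
                "Tool failed",
                extra={
                    "tool": func.__name__,
                    "duration_seconds": round(time.time() - start, 3),
                    "error": str(e),
                    "error_type": type(e).__name__,
                },
                exc_info=True,  # full stack trace
            )
            raise
    return wrapper
```

The inline version keeps each tool self-describing; the decorator trades that for consistency and less duplication.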


Layer 4: AI Agents (Microsoft Agent Framework)

Technology Stack

  • Framework: Microsoft Agent Framework
  • Observability: Agent Framework built-in observability + custom logging
  • LLM: Azure OpenAI (GPT-4)

Observability Features

# Enable agent framework observability
ENABLE_AGENT_FRAMEWORK_OBSERVABILITY=true

Agent Framework tracks:

  • LLM call metadata (model, deployment, endpoint)
  • Token usage (input, output, total)
  • Agent execution time
  • Tool calls made by agents
  • Conversation history size

Integration with Observability Class

# loan_defenders_utils/observability.py

@staticmethod
def log_token_usage(
    agent_name: str,
    input_tokens: int,
    output_tokens: int,
    model: str | None = None,
    application_id: str | None = None,
) -> None:
    """Log token usage for cost tracking and analysis."""
    logger = logging.getLogger("agent_framework.observability.token_usage")

    total_tokens = input_tokens + output_tokens

    logger.info(
        "Token usage",
        extra={
            "event_type": "token_usage",
            "agent_name": agent_name,
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "total_tokens": total_tokens,
            "model": model or "unknown",
            "application_id": Observability.mask_application_id(application_id),
            "correlation_id": Observability.get_correlation_id(),
        },
    )

Agent Pipeline Logging

# apps/api/loan_defenders/orchestrators/sequential_pipeline.py

# High-frequency events: DEBUG level (filtered in production)
logger.debug("Accumulating text chunk from event.data.text")

# Agent stage completion: INFO level
logger.info(
    "Agent stage completed",
    extra={
        "agent": "Credit_Agent",
        "stage": "credit_assessment",
        "duration_seconds": elapsed,
        "tokens_used": {"input": 450, "output": 230},
    },
)

Layer 5: Infrastructure (Azure Monitor)

Technology Stack

  • Platform: Azure Container Instances (ACI)
  • Monitoring: Azure Monitor
  • Logs: Container logs + Application Insights

Infrastructure Metrics

Automatically collected by Azure Monitor:

  • CPU usage per container
  • Memory usage per container
  • Network traffic (ingress/egress)
  • Container restart count
  • HTTP request metrics (via Application Gateway)

Container Log Collection

Azure Container Instances
Container stdout/stderr → Azure Monitor Logs
Log Analytics Workspace
Query with KQL or view in Azure Portal

Example Query:

ContainerInstanceLog_CL
| where ContainerGroup_s == "loan-defenders-apps"
| where Message contains "ERROR"
| project TimeGenerated, ContainerName_s, Message
| order by TimeGenerated desc

Health Checks

All services expose /health endpoints:

# API Server
@app.get("/health")
async def health_check_root() -> HealthResponse:
    return HealthResponse(
        status="healthy",
        services={
            "conversation_orchestrator": "available",
            "sequential_pipeline": "available",
            "session_manager": "available",
        },
        timestamp=datetime.now(timezone.utc).isoformat(),
    )

# MCP Servers
@mcp.custom_route("/health", methods=["GET"])
async def health_endpoint(request: Request) -> JSONResponse:
    return JSONResponse({
        "status": "healthy",
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "server": "application_verification",
        "port": 8010,
    })

Monitoring:

  • Application Gateway probes /health every 30 seconds
  • Unhealthy containers are restarted automatically
  • Health check failures trigger alerts
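A minimal poller in the same spirit, useful for local smoke tests (the MCP ports are from above; the API port 8000 and the `check_health` helper are assumptions, not part of the codebase):

```python
import json
import urllib.request

# Service → health endpoint; MCP ports are documented, API port is assumed
HEALTH_ENDPOINTS = {
    "api": "http://localhost:8000/health",
    "application_verification": "http://localhost:8010/health",
    "document_processing": "http://localhost:8011/health",
    "financial_calculations": "http://localhost:8012/health",
}


def check_health(url: str, timeout: float = 5.0) -> bool:
    """Return True iff the endpoint answers with {"status": "healthy"}."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = json.loads(resp.read().decode("utf-8"))
        return body.get("status") == "healthy"
    except OSError:
        # Connection refused, DNS failure, timeout → unhealthy
        return False
```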


Cross-Cutting Concerns

Correlation IDs

Enable end-to-end tracing across all layers:

# Set at API entry point
correlation_id = Observability.set_correlation_id()

# Automatically included in all logs
logger.info("Processing", extra={
    "correlation_id": Observability.get_correlation_id()
})

# Query entire request journey
# traces
# | where customDimensions.correlation_id == "abc-123-def"
# | order by timestamp asc

OpenTelemetry Integration

Optional distributed tracing:

OTEL_TRACES_ENABLED=true
  • Injects trace_id and span_id into all logs
  • Enables correlation with distributed traces
  • Works with Jaeger, Zipkin, Azure Monitor

Security and PII Masking

Applied across all layers:

# Application IDs
application_id[:8] + "***"  # LN123456***

# Names
Observability.mask_pii("John Doe", "name")  # J*** D***

# Emails
Observability.mask_pii("john@example.com", "email")  # jo***@example.com

Never logged:

  • Social Security Numbers (SSN)
  • Full account numbers
  • Account balances
  • Income/tax amounts
  • Full credit scores
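A masking helper matching the examples above could look like this (illustrative; the real `Observability.mask_pii` may differ):

```python
def mask_pii(value: str, kind: str) -> str:
    """Mask sensitive values before logging (sketch matching the doc examples)."""
    if kind == "name":
        # "John Doe" -> "J*** D***": keep only the first letter of each word
        return " ".join(part[0] + "***" for part in value.split())
    if kind == "email":
        # "john@example.com" -> "jo***@example.com": keep two chars of the local part
        local, _, domain = value.partition("@")
        return local[:2] + "***@" + domain
    # Default: keep a short prefix, same pattern as application IDs
    return value[:8] + "***"
```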


Operational Dashboards

Azure Application Insights Dashboards

1. System Health Dashboard

// Error rate by service
traces
| where severityLevel >= 3  // WARNING or ERROR
| summarize errors = count() by bin(timestamp, 5m), tostring(customDimensions.service)
| render timechart

2. Performance Dashboard

// Tool performance (P50, P95, P99)
traces
| where message contains "Tool completed"
| extend duration = todouble(customDimensions.duration_seconds)
| summarize
    p50 = percentile(duration, 50),
    p95 = percentile(duration, 95),
    p99 = percentile(duration, 99)
    by tostring(customDimensions.tool)

3. Business Metrics Dashboard

// Loan application outcomes
traces
| where customDimensions.event_type == "business_metric"
| where customDimensions.metric == "loan_decision"
| summarize
    count(),
    avg_processing_time = avg(todouble(customDimensions.processing_time_seconds))
    by tostring(customDimensions.decision), bin(timestamp, 1h)
| render columnchart

4. Cost Tracking Dashboard

// LLM token usage and cost
traces
| where customDimensions.event_type == "token_usage"
| summarize
    total_tokens = sum(toint(customDimensions.total_tokens)),
    estimated_cost = sum(toint(customDimensions.total_tokens)) * 0.00001
    by tostring(customDimensions.agent_name), bin(timestamp, 1h)
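The same estimate in Python, for ad-hoc checks against exported token_usage events (the $0.00001/token blended rate is carried over from the query above and is an assumption, not a published price):

```python
COST_PER_TOKEN = 0.00001  # blended $/token rate from the dashboard query (assumption)


def estimate_cost(token_events: list[dict]) -> float:
    """Sum total_tokens across token_usage events into an estimated spend."""
    total = sum(int(e["total_tokens"]) for e in token_events)
    return total * COST_PER_TOKEN
```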


Alerting Strategy

Critical Alerts (Immediate Response)

  1. Service Down
     • Condition: Health check fails for >2 minutes
     • Action: PagerDuty alert, automatic container restart

  2. High Error Rate
     • Condition: Error rate >5% over 5 minutes
     • Action: Alert engineering team

  3. Token Budget Exceeded
     • Condition: Daily token usage >$100
     • Action: Alert finance + engineering

Warning Alerts (Business Hours)

  1. Slow Performance
     • Condition: P95 latency >10 seconds
     • Action: Slack notification

  2. Increased Retry Rate
     • Condition: Retry rate >10% over 15 minutes
     • Action: Slack notification

  3. Low Resource Availability
     • Condition: CPU >80% or Memory >80%
     • Action: Slack notification, consider scaling

Log Retention Policies

Log Type               Retention  Storage            Cost
Console logs (stdout)  7 days     Azure Monitor      Included
File logs (rotating)   30 days    Container storage  Minimal
Application Insights   90 days    Azure storage      $2.30/GB ingested
Archive (compliance)   7 years    Blob storage       $0.01/GB/month

Performance Characteristics

Log Volume Analysis

Daily Production Estimates (10,000 loan applications/day):

Component       Log Level  Entries/Day  Size/Day
API Server      INFO       ~50,000      ~10 MB
MCP Servers     INFO       ~100,000     ~20 MB
Agents          INFO       ~80,000      ~15 MB
Infrastructure  Auto       ~10,000      ~5 MB
Total                      ~240,000     ~50 MB

Before optimization, log volume was ~500 MB/day; the current ~50 MB/day represents a 90% reduction.

Latency Impact

  • Structured logging overhead: <1ms per log entry
  • Azure Application Insights: Async upload, no request blocking
  • File I/O: Buffered, negligible impact

Debugging Workflows

Common Debugging Scenarios

1. Find why a loan application failed

traces
| where customDimensions.application_id contains "LN123456"
| order by timestamp asc
| project timestamp, severityLevel, message, customDimensions

2. Identify slow MCP tool

traces
| where message contains "Tool completed"
| extend duration = todouble(customDimensions.duration_seconds)
| where duration > 5.0
| summarize count(), avg(duration) by tostring(customDimensions.tool)
| order by avg_duration desc

3. Trace agent-to-agent communication

traces
| where customDimensions.correlation_id == "abc-123-def"
| where customDimensions.agent_name != ""
| order by timestamp asc
| project timestamp, customDimensions.agent_name, customDimensions.stage, message

4. Find startup failures

traces
| where message contains "failed to start" or message contains "Failed to initialize"
| where timestamp > ago(1h)
| project timestamp, customDimensions.service, customDimensions.error, customDimensions.critical_env_vars

Best Practices

For Developers

  1. Never use print() - Always use logger from Observability.get_logger()
  2. Always mask PII - Use masking utilities for sensitive data
  3. Use correct log level - DEBUG for frequent, INFO for business events
  4. Include context - Always use extra={} with relevant metadata
  5. Handle errors - Wrap risky operations in try/except with logging

For Operations

  1. Monitor MTTR - Track time from alert to resolution
  2. Review logs weekly - Identify patterns and optimize log levels
  3. Test failure scenarios - Verify logs are helpful during outages
  4. Update dashboards - Keep KQL queries relevant to current issues
  5. Tune alerts - Minimize false positives while catching real issues

For Security

  1. Never log PII in full - Always mask sensitive data
  2. Audit log queries - Who is accessing which logs?
  3. Review retention - Comply with data protection regulations
  4. Encrypt in transit - Use HTTPS for Application Insights
  5. Regular security reviews - Verify no sensitive data leaking into logs


Maintenance

Review Schedule: Quarterly

Next Review: 2026-01-29

Review Checklist:

  - [ ] MTTR metrics achieved (<15 minutes)
  - [ ] Log volume within budget (<50 MB/day)
  - [ ] Zero PII exposure incidents
  - [ ] Dashboards actively used by team
  - [ ] Alert signal-to-noise ratio acceptable