Observability Stack Architecture
Last Updated: 2025-10-29
Status: Production
Related ADR: ADR-057: Unified Observability and Logging Strategy
Overview
This document describes how logging, monitoring, and observability work across the entire Loan Defenders multi-agent system stack, from browser to backend services to cloud infrastructure.
┌────────────────────────────────────────────────────────────────┐
│ Observability Stack │
│ │
│ Browser ──→ UI Logs ──────────────────┐ │
│ │ │
│ API Server ──→ Application Logs ──────┼──→ Azure Application │
│ │ Insights │
│ MCP Servers ──→ Tool Logs ─────────────┤ (Centralized) │
│ │ │
│ Agents ──→ LLM Token Metrics ──────────┘ │
│ │
│ Infrastructure ──→ Azure Monitor ──────────→ Metrics/Alerts │
│ │
└────────────────────────────────────────────────────────────────┘
Layer 1: Frontend (Browser)
Technology Stack
- Framework: React 18 + TypeScript
- Logging: Browser console API (console.log/error/warn)
- State: No centralized logging (standard for SPAs)
What Gets Logged
// apps/ui/src/components/chat/CoordinatorChat.tsx
console.error('Failed to send message:', error);
// apps/ui/src/pages/results/ResultsPage.tsx
console.log('ResultsPage: Checking for stored decision');
console.warn('ResultsPage: No decision found in sessionStorage');
Security Considerations
- ✅ No PII logged in browser console
- ✅ Error messages are generic user-facing strings
- ✅ No API keys or sensitive tokens in frontend code
Future Enhancements
- Add frontend error reporting (e.g., Sentry)
- Send critical errors to backend API for aggregation
- Add user session analytics
Layer 2: API Server (FastAPI)
Technology Stack
- Framework: FastAPI (Python 3.12)
- Logging: loan_defenders_utils.Observability → Python logging module
- Backend: Azure Application Insights (optional)
Architecture
# apps/api/loan_defenders/api/app.py
from loan_defenders_utils import Observability

# Initialize observability ONCE at startup
Observability.initialize()

# Get logger (once per module)
logger = Observability.get_logger("api")

# Structured logging with extra fields
logger.info(
    "Starting workflow stream",
    extra={
        "session_id": request.session_id[:8] + "***",
        "application_id": application.application_id[:8] + "***",
        "loan_amount": float(application.loan_amount),
    },
)
Log Flow
API Request
↓
FastAPI Middleware (CORS, request logging)
↓
Endpoint Handler
↓
Observability.get_logger("api")
↓
Python logging module (with structured extra fields)
↓
Log Handlers:
├─→ Console (stdout) → Docker logs → Azure Container Instances logs
├─→ File (rotating, 10MB/file, 5 backups) → /logs/*.log
└─→ Azure Application Insights (via OpenTelemetry) [optional]
Configuration
# Environment variables control logging behavior
LOG_LEVEL=INFO # DEBUG, INFO, WARNING, ERROR, CRITICAL
LOG_OUTPUT=console,file,azure # Comma-separated outputs
LOG_FORMAT=json # json or text
OTEL_TRACES_ENABLED=true # OpenTelemetry trace context
APPLICATIONINSIGHTS_CONNECTION_STRING=InstrumentationKey=...
Log Outputs
Console (stdout)
{
  "timestamp": "2025-10-29T14:30:45Z",
  "level": "INFO",
  "name": "loan_defenders.api",
  "message": "Starting workflow stream",
  "session_id": "sess1234***",
  "application_id": "LN123456***",
  "loan_amount": 250000.00
}
File Rotation
- Path: /logs/loan-defenders_YYYYMMDD.log
- Rotation: 10 MB per file, 5 backups
- Format: JSON for machine parsing
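The rotation policy above maps directly onto the stdlib `RotatingFileHandler`. A self-contained sketch (a temp directory stands in for `/logs`, and `JsonFormatter` is an illustrative minimal formatter, not the project's actual one):

```python
import json
import logging
import tempfile
from logging.handlers import RotatingFileHandler
from pathlib import Path

class JsonFormatter(logging.Formatter):
    """Minimal JSON formatter so file logs stay machine-parseable (illustrative)."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "name": record.name,
            "message": record.getMessage(),
        })

log_dir = Path(tempfile.mkdtemp())  # stand-in for /logs in this sketch
log_file = log_dir / "loan-defenders_20251029.log"  # real name is date-stamped

handler = RotatingFileHandler(
    log_file,
    maxBytes=10 * 1024 * 1024,  # rotate at 10 MB per file
    backupCount=5,              # keep 5 rotated backups
)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("rotation_demo")
logger.propagate = False
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("Starting workflow stream")
handler.flush()

first_line = log_file.read_text().splitlines()[0]
print(json.loads(first_line)["message"])  # Starting workflow stream
```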
Azure Application Insights
- Integration: Via OpenTelemetry SDK
- Traces: Each log entry becomes a trace
- Custom Dimensions: All extra={} fields are searchable
- Retention: 90 days by default
Layer 3: MCP Servers (Tool Servers)
Technology Stack
- Framework: FastMCP (Python)
- Logging: loan_defenders_utils.Observability
- Ports: 8010 (application_verification), 8011 (document_processing), 8012 (financial_calculations)
Architecture
# apps/mcp_servers/application_verification/server.py
import json
import time

from loan_defenders_utils import Observability

logger = Observability.get_logger("application_verification_server")

# Tool function pattern (applied to ALL 15 tools across 3 servers)
@mcp.tool()
async def retrieve_credit_report(applicant_id: str, full_name: str, address: str) -> str:
    start_time = time.time()
    logger.debug(
        "Credit report request received",
        extra={
            "tool": "retrieve_credit_report",
            "applicant_id": applicant_id[:8] + "***",
        },
    )
    try:
        result = await service.retrieve_credit_report(applicant_id, full_name, address)
        duration = time.time() - start_time
        logger.info(
            "Credit report retrieved successfully",
            extra={
                "tool": "retrieve_credit_report",
                "applicant_id": applicant_id[:8] + "***",
                "duration_seconds": round(duration, 3),
                "credit_score": result.get("credit_score"),  # Business metric
            },
        )
        return json.dumps(result)
    except Exception as e:
        duration = time.time() - start_time
        logger.error(
            "Credit report retrieval failed",
            extra={
                "tool": "retrieve_credit_report",
                "applicant_id": applicant_id[:8] + "***",
                "duration_seconds": round(duration, 3),
                "error": str(e),
                "error_type": type(e).__name__,
            },
            exc_info=True,  # Full stack trace
        )
        raise
Startup Logging
# All 3 MCP servers log on startup
logger.info(
    "Starting Application Verification MCP Server",
    extra={
        "port": 8010,
        "tools_available": [
            "retrieve_credit_report",
            "verify_employment",
            "get_bank_account_data",
            "get_tax_transcript_data",
            "verify_asset_information",
            "validate_basic_parameters",
            "application_verification_health_check",
        ],
    },
)
Performance Tracking
Every tool call logs:
- Request received: DEBUG level (high frequency)
- Success: INFO level with duration and business metrics
- Failure: ERROR level with error type, message, duration, and stack trace
Example Query (Azure Application Insights):
traces
| where customDimensions.tool == "retrieve_credit_report"
| extend duration = todouble(customDimensions.duration_seconds)
| summarize avg(duration), max(duration), count() by bin(timestamp, 5m)
| render timechart
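Since the same request/success/failure shape repeats across all 15 tools, it can be factored into a decorator. The `instrumented_tool` helper below is a hypothetical sketch, not part of the actual codebase:

```python
import asyncio
import functools
import logging
import time

logger = logging.getLogger("tool_metrics")

def instrumented_tool(tool_name: str):
    """Wrap an async tool so every call logs duration, success, and failure
    in the same shape as the hand-written pattern above."""
    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            start = time.time()
            try:
                result = await func(*args, **kwargs)
                logger.info("Tool completed", extra={
                    "tool": tool_name,
                    "duration_seconds": round(time.time() - start, 3),
                })
                return result
            except Exception as e:
                logger.error("Tool failed", extra={
                    "tool": tool_name,
                    "duration_seconds": round(time.time() - start, 3),
                    "error_type": type(e).__name__,
                }, exc_info=True)
                raise
        return wrapper
    return decorator

# Illustrative stub; the real tool calls the verification service.
@instrumented_tool("retrieve_credit_report")
async def retrieve_credit_report(applicant_id: str) -> dict:
    return {"credit_score": 720}

print(asyncio.run(retrieve_credit_report("app-1234")))  # {'credit_score': 720}
```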
Layer 4: AI Agents (Microsoft Agent Framework)
Technology Stack
- Framework: Microsoft Agent Framework
- Observability: Agent Framework built-in observability + custom logging
- LLM: Azure OpenAI (GPT-4)
Observability Features
Agent Framework tracks:
- LLM call metadata (model, deployment, endpoint)
- Token usage (input, output, total)
- Agent execution time
- Tool calls made by agents
- Conversation history size
Integration with Observability Class
# loan_defenders_utils/observability.py
@staticmethod
def log_token_usage(
    agent_name: str,
    input_tokens: int,
    output_tokens: int,
    model: str | None = None,
    application_id: str | None = None,
) -> None:
    """Log token usage for cost tracking and analysis."""
    logger = logging.getLogger("agent_framework.observability.token_usage")
    total_tokens = input_tokens + output_tokens
    logger.info(
        "Token usage",
        extra={
            "event_type": "token_usage",
            "agent_name": agent_name,
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "total_tokens": total_tokens,
            "model": model or "unknown",
            "application_id": Observability.mask_application_id(application_id),
            "correlation_id": Observability.get_correlation_id(),
        },
    )
Agent Pipeline Logging
# apps/api/loan_defenders/orchestrators/sequential_pipeline.py

# High-frequency events: DEBUG level (filtered in production)
logger.debug("Accumulating text chunk from event.data.text")

# Agent stage completion: INFO level
logger.info(
    "Agent stage completed",
    extra={
        "agent": "Credit_Agent",
        "stage": "credit_assessment",
        "duration_seconds": elapsed,
        "tokens_used": {"input": 450, "output": 230},
    },
)
Layer 5: Infrastructure (Azure Monitor)
Technology Stack
- Platform: Azure Container Instances (ACI)
- Monitoring: Azure Monitor
- Logs: Container logs + Application Insights
Infrastructure Metrics
Automatically collected by Azure Monitor:
- CPU usage per container
- Memory usage per container
- Network traffic (ingress/egress)
- Container restart count
- HTTP request metrics (via Application Gateway)
Container Log Collection
Azure Container Instances
↓
Container stdout/stderr → Azure Monitor Logs
↓
Log Analytics Workspace
↓
Query with KQL or view in Azure Portal
Example Query:
ContainerInstanceLog_CL
| where ContainerGroup_s == "loan-defenders-apps"
| where Message contains "ERROR"
| project TimeGenerated, ContainerName_s, Message
| order by TimeGenerated desc
Health Checks
All services expose /health endpoints:
# API Server
@app.get("/health")
async def health_check_root() -> HealthResponse:
    return HealthResponse(
        status="healthy",
        services={
            "conversation_orchestrator": "available",
            "sequential_pipeline": "available",
            "session_manager": "available",
        },
        timestamp=datetime.now(timezone.utc).isoformat(),
    )

# MCP Servers
@mcp.custom_route("/health", methods=["GET"])
async def health_endpoint(request: Request) -> JSONResponse:
    return JSONResponse({
        "status": "healthy",
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "server": "application_verification",
        "port": 8010,
    })
Monitoring:
- Application Gateway probes /health every 30 seconds
- Unhealthy containers are restarted automatically
- Health check failures trigger alerts
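A minimal stdlib-only poller for these endpoints could look like the following. The MCP ports come from the layer descriptions above; the API server port (8000) is an assumption for illustration:

```python
import json
import urllib.request

# Ports mirror the MCP servers described above; the API port is an assumption.
HEALTH_ENDPOINTS = {
    "api": "http://localhost:8000/health",
    "application_verification": "http://localhost:8010/health",
    "document_processing": "http://localhost:8011/health",
    "financial_calculations": "http://localhost:8012/health",
}

def check_health(url: str, timeout: float = 2.0) -> bool:
    """Return True only when the endpoint answers with status == "healthy"."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.load(resp).get("status") == "healthy"
    except (OSError, ValueError):
        # Connection refused, timeout, or a non-JSON body all count as unhealthy
        return False

statuses = {name: check_health(url) for name, url in HEALTH_ENDPOINTS.items()}
```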
Cross-Cutting Concerns
Correlation IDs
Enable end-to-end tracing across all layers:
# Set at API entry point
correlation_id = Observability.set_correlation_id()

# Automatically included in all logs
logger.info("Processing", extra={
    "correlation_id": Observability.get_correlation_id()
})

# Query entire request journey (KQL):
# traces
# | where customDimensions.correlation_id == "abc-123-def"
# | order by timestamp asc
OpenTelemetry Integration
Optional distributed tracing:
- Injects trace_id and span_id into all logs
- Enables correlation with distributed traces
- Works with Jaeger, Zipkin, Azure Monitor
Security and PII Masking
Applied across all layers:
# Application IDs
application_id[:8] + "***" # LN123456***
# Names
Observability.mask_pii("John Doe", "name") # J*** D***
# Emails
Observability.mask_pii("john@example.com", "email") # jo***@example.com
Never logged:
- Social Security Numbers (SSN)
- Full account numbers
- Account balances
- Income/tax amounts
- Full credit scores
Operational Dashboards
Azure Application Insights Dashboards
1. System Health Dashboard
// Error rate by service
traces
| where severityLevel >= 3 // ERROR or higher
| summarize errors = count() by bin(timestamp, 5m), tostring(customDimensions.service)
| render timechart
2. Performance Dashboard
// Tool performance (P50, P95, P99)
traces
| where message contains "Tool completed"
| extend duration = todouble(customDimensions.duration_seconds)
| summarize
p50 = percentile(duration, 50),
p95 = percentile(duration, 95),
p99 = percentile(duration, 99)
by tostring(customDimensions.tool)
3. Business Metrics Dashboard
// Loan application outcomes
traces
| where customDimensions.event_type == "business_metric"
| where customDimensions.metric == "loan_decision"
| summarize
count(),
avg_processing_time = avg(todouble(customDimensions.processing_time_seconds))
by tostring(customDimensions.decision), bin(timestamp, 1h)
| render columnchart
4. Cost Tracking Dashboard
// LLM token usage and cost
traces
| where customDimensions.event_type == "token_usage"
| summarize
total_tokens = sum(toint(customDimensions.total_tokens)),
estimated_cost = sum(toint(customDimensions.total_tokens)) * 0.00001
by tostring(customDimensions.agent_name), bin(timestamp, 1h)
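The same estimate can be computed in Python, e.g. for a pre-ingestion budget check. Note that the $0.00001/token rate is the placeholder used in the KQL query above, not a real Azure OpenAI price:

```python
# Placeholder rate from the KQL query above, not an actual Azure OpenAI price.
COST_PER_TOKEN_USD = 0.00001

def estimated_cost(token_events: list[dict]) -> dict[str, float]:
    """Sum total_tokens per agent and convert to an estimated dollar cost."""
    costs: dict[str, float] = {}
    for event in token_events:
        agent = event["agent_name"]
        costs[agent] = costs.get(agent, 0.0) + event["total_tokens"] * COST_PER_TOKEN_USD
    return costs

# Illustrative token_usage events in the shape logged by log_token_usage()
events = [
    {"agent_name": "Credit_Agent", "total_tokens": 680},
    {"agent_name": "Credit_Agent", "total_tokens": 320},
    {"agent_name": "Income_Agent", "total_tokens": 500},
]
costs = estimated_cost(events)
print({agent: round(cost, 6) for agent, cost in costs.items()})
```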
Alerting Strategy
Critical Alerts (Immediate Response)
- Service Down
  - Condition: Health check fails for >2 minutes
  - Action: PagerDuty alert, automatic container restart
- High Error Rate
  - Condition: Error rate >5% over 5 minutes
  - Action: Alert engineering team
- Token Budget Exceeded
  - Condition: Daily token usage >$100
  - Action: Alert finance + engineering
Warning Alerts (Business Hours)
- Slow Performance
  - Condition: P95 latency >10 seconds
  - Action: Slack notification
- Increased Retry Rate
  - Condition: Retry rate >10% over 15 minutes
  - Action: Slack notification
- Low Resource Availability
  - Condition: CPU >80% or Memory >80%
  - Action: Slack notification, consider scaling
Log Retention Policies
| Log Type | Retention | Storage | Cost |
|---|---|---|---|
| Console logs (stdout) | 7 days | Azure Monitor | Included |
| File logs (rotating) | 30 days | Container storage | Minimal |
| Application Insights | 90 days | Azure storage | $2.30/GB ingested |
| Archive (compliance) | 7 years | Blob storage | $0.01/GB/month |
Performance Characteristics
Log Volume Analysis
Daily Production Estimates (10,000 loan applications/day):
| Component | Log Level | Entries/Day | Size/Day |
|---|---|---|---|
| API Server | INFO | ~50,000 | ~10 MB |
| MCP Servers | INFO | ~100,000 | ~20 MB |
| Agents | INFO | ~80,000 | ~15 MB |
| Infrastructure | Auto | ~10,000 | ~5 MB |
| Total | - | ~240,000 | ~50 MB |
Before optimization: ~500 MB/day (90% reduction achieved)
Latency Impact
- Structured logging overhead: <1ms per log entry
- Azure Application Insights: Async upload, no request blocking
- File I/O: Buffered, negligible impact
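The "async upload, no request blocking" behavior can be approximated with the stdlib `QueueHandler`/`QueueListener` pair: the request thread only enqueues records while a background thread performs the slow I/O. This is a sketch of the pattern, an assumed design rather than the project's actual handler wiring:

```python
import logging
import queue
from logging.handlers import QueueHandler, QueueListener

log_queue: queue.Queue = queue.Queue(-1)  # unbounded queue

captured: list[str] = []

class CapturingHandler(logging.Handler):
    """Stand-in for a slow exporter (file, network); records what it receives."""
    def emit(self, record: logging.LogRecord) -> None:
        captured.append(record.getMessage())

listener = QueueListener(log_queue, CapturingHandler())
listener.start()  # background thread drains the queue

logger = logging.getLogger("nonblocking_demo")
logger.propagate = False
logger.addHandler(QueueHandler(log_queue))  # enqueue only; never blocks on I/O
logger.setLevel(logging.INFO)

logger.info("Request handled")  # returns immediately
listener.stop()                 # flushes remaining records, joins the thread
print(captured)  # ['Request handled']
```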
Debugging Workflows
Common Debugging Scenarios
1. Find why a loan application failed
traces
| where customDimensions.application_id contains "LN123456"
| order by timestamp asc
| project timestamp, severityLevel, message, customDimensions
2. Identify slow MCP tool
traces
| where message contains "Tool completed"
| extend duration = todouble(customDimensions.duration_seconds)
| where duration > 5.0
| summarize count(), avg(duration) by tostring(customDimensions.tool)
| order by avg_duration desc
3. Trace agent-to-agent communication
traces
| where customDimensions.correlation_id == "abc-123-def"
| where customDimensions.agent_name != ""
| order by timestamp asc
| project timestamp, customDimensions.agent_name, customDimensions.stage, message
4. Find startup failures
traces
| where message contains "failed to start" or message contains "Failed to initialize"
| where timestamp > ago(1h)
| project timestamp, customDimensions.service, customDimensions.error, customDimensions.critical_env_vars
Best Practices
For Developers
- Never use print() - Always use logger from Observability.get_logger()
- Always mask PII - Use masking utilities for sensitive data
- Use correct log level - DEBUG for frequent, INFO for business events
- Include context - Always use extra={} with relevant metadata
- Handle errors - Wrap risky operations in try/except with logging
For Operations
- Monitor MTTR - Track time from alert to resolution
- Review logs weekly - Identify patterns and optimize log levels
- Test failure scenarios - Verify logs are helpful during outages
- Update dashboards - Keep KQL queries relevant to current issues
- Tune alerts - Minimize false positives while catching real issues
For Security
- Never log PII in full - Always mask sensitive data
- Audit log queries - Who is accessing which logs?
- Review retention - Comply with data protection regulations
- Encrypt in transit - Use HTTPS for Application Insights
- Regular security reviews - Verify no sensitive data leaking into logs
Related Documentation
- ADR: ADR-057: Unified Observability and Logging Strategy
- Standards: LOGGING-STANDARDS.md
- Quick Reference: LOGGING-QUICK-REFERENCE.md
- Code: loan_defenders_utils/src/loan_defenders_utils/observability.py
Maintenance
Review Schedule: Quarterly
Next Review: 2026-01-29
Review Checklist:
- [ ] MTTR metrics achieved (<15 minutes)
- [ ] Log volume within budget (<50 MB/day)
- [ ] Zero PII exposure incidents
- [ ] Dashboards actively used by team
- [ ] Alert signal-to-noise ratio acceptable