ADR-048: Azure Key Vault for Cross-Layer Deployment Output Storage
Status: Accepted
Date: 2025-10-23
Decision Makers: Infrastructure Team, Architect
Related ADRs:
- ADR-032: Remove Azure Key Vault from MVP (superseded for infrastructure use case)
- ADR-041: 4-Layer Deployment Architecture
- ADR-047: Layer-Specific RBAC Architecture
Context
Problem Statement
Our 4-layer deployment architecture (ADR-041) requires each layer to retrieve outputs from previous layers to wire dependencies. Currently, we query Azure deployment history using az deployment group show, which has several reliability and performance issues:
Current Approach Issues:
1. Deployment History Retention Risk
   - Azure ARM maintains ~800 deployments per resource group
   - Older deployments may be purged after the limit is reached
   - No SLA guarantee on retention period
   - Layer deployments fail if prerequisite deployment history is purged
2. Poor Performance
   - Layer 4 makes 9 separate `az deployment group show` API calls
   - Each call takes 1-2 seconds
   - Total prerequisite query time: 10-18 seconds
   - Adds significant overhead to every deployment
3. Fragile Name Pattern Matching
   - Relies on patterns like `[?contains(name, 'layer1-foundation')]`
   - Breaks if deployment naming conventions change
   - Requires additional sorting when multiple deployments match
4. Parallel Deployment Conflicts
   - Single resource group = single namespace
   - Concurrent deployments can conflict
   - Difficult to support feature branch deployments
Example from `deploy-layer4-apps.sh` (lines 117-179):

```bash
# Find Layer 1 deployment (API call 1)
LAYER1_DEPLOYMENT=$(az deployment group list \
  --resource-group "$RESOURCE_GROUP" \
  --query "[?contains(name, 'layer1')].name | [0]" -o tsv)

# Get each output individually (API calls 2-9)
MANAGED_IDENTITY_ID=$(az deployment group show \
  --resource-group "$RESOURCE_GROUP" \
  --name "$LAYER1_DEPLOYMENT" \
  --query "properties.outputs.managedIdentityId.value" -o tsv)
APP_INSIGHTS_CONN_STRING=$(az deployment group show ...)
AI_ENDPOINT=$(az deployment group show ...)
# ... 6 more similar calls
```
Why Not Keep Current Approach?
The deployment history approach was acceptable for initial development but has fundamental limitations:
- Cannot guarantee long-term reliability (purge risk)
- Performance degrades as we add more outputs
- Does not scale to parallel development workflows
- Industry has moved to explicit endpoint registries
Relationship to ADR-032
ADR-032 removed Key Vault because the application didn't need it for secrets management. This ADR reintroduces Key Vault for a different purpose: infrastructure deployment output storage. This is a standard pattern in cloud infrastructure (see Alternatives Considered).
Key Distinction:
- ADR-032 Scope: Application runtime secret management (API keys, passwords)
- ADR-048 Scope: Infrastructure deployment coordination (resource IDs, endpoints)
Industry Standard Patterns
All major cloud providers use dedicated services for cross-deployment output storage:
| Platform | Service | Usage |
|---|---|---|
| AWS | Systems Manager Parameter Store | Store CloudFormation outputs |
| AWS | Secrets Manager | Cross-stack references |
| Azure | Key Vault | Store Bicep/ARM deployment outputs |
| Azure | App Configuration | Non-sensitive configuration |
| GCP | Secret Manager | Store Terraform outputs |
| Terraform | State Backend | Output references |
| Pulumi | Stack References | Cross-stack dependencies |
Decision
Add Azure Key Vault to Layer 1 infrastructure specifically for storing deployment outputs, using a simple naming convention that does not require deployment ID tracking.
Implementation Pattern
Simplified Secret Naming Convention:

```
{resourceGroupPrefix}-{layerName}-{outputName}
```

Examples:

```
ldfdevnew-layer1-managedIdentityId
ldfdevnew-layer1-logAnalyticsWorkspaceId
ldfdevnew-layer2-acrLoginServer
ldfdevnew-layer2-containerAppsEnvId
```

No Deployment ID Required:
- Layer scripts read secrets by predictable name only
- Latest secret value is always returned by Key Vault (automatic versioning)
- Tags are metadata only (for troubleshooting, not retrieval)
Each Layer Script Pattern:

```bash
# After deployment completes:
# push outputs to Key Vault with metadata tags
push_endpoint_to_keyvault "$RESOURCE_GROUP" "layer1" "managedIdentityId" "$VALUE"

# Downstream layers retrieve by name only
VALUE=$(get_endpoint_from_keyvault "$RESOURCE_GROUP" "layer1" "managedIdentityId")

# Tags are for metadata/troubleshooting only:
# - updatedAt: timestamp
# - deploymentId: for audit trail
# - layer: for filtering
```
What We're Building
Infrastructure Changes:
1. Add Key Vault to infrastructure/bicep/modules/security.bicep
2. Configure RBAC: Deployment identity gets "Key Vault Secrets Officer"
3. Create helper script: infrastructure/scripts/helpers/keyvault-config.sh
Helper Functions:

```bash
push_endpoint_to_keyvault()   # Store output after layer deployment
get_endpoint_from_keyvault()  # Retrieve output by name
get_all_layer_endpoints()     # Batch retrieve all outputs for a layer
check_keyvault_access()       # Validation for troubleshooting
list_layer_endpoints()        # List all stored endpoints
```
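As a rough sketch of what these helpers could look like in bash, the following is illustrative only: the vault-name derivation (`${rg_prefix}-kv`), the exact function signatures, and the tag set are assumptions, not the final implementation.

```shell
#!/usr/bin/env bash
# Sketch of keyvault-config.sh helpers; assumes the vault is named
# "<resourceGroupPrefix>-kv". Adjust to the actual Layer 1 naming.
set -euo pipefail

# Compose a secret name per the convention {resourceGroupPrefix}-{layerName}-{outputName}
endpoint_secret_name() {
  local rg_prefix="$1" layer="$2" output="$3"
  printf '%s-%s-%s\n' "$rg_prefix" "$layer" "$output"
}

# Store an output after a layer deployment (tags are metadata only)
push_endpoint_to_keyvault() {
  local rg_prefix="$1" layer="$2" output="$3" value="$4"
  az keyvault secret set \
    --vault-name "${rg_prefix}-kv" \
    --name "$(endpoint_secret_name "$rg_prefix" "$layer" "$output")" \
    --value "$value" \
    --tags "layer=$layer" "updatedAt=$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    --output none
}

# Retrieve an output by predictable name; with no version specified,
# Key Vault returns the latest secret version
get_endpoint_from_keyvault() {
  local rg_prefix="$1" layer="$2" output="$3"
  az keyvault secret show \
    --vault-name "${rg_prefix}-kv" \
    --name "$(endpoint_secret_name "$rg_prefix" "$layer" "$output")" \
    --query value --output tsv
}
```

Because retrieval never needs a deployment ID, redeploying a layer simply overwrites the same secret name and downstream layers always read the current value.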
NO Fallback Logic:
- This is a new deployment approach (no existing users)
- All scripts will use Key Vault from day 1
- Simpler implementation, less code to maintain
Consequences
Positive:
1. Reliability
   - Independent of deployment history retention policies
   - Works even if the deployment record is purged after 6+ months
   - Explicit, durable storage of infrastructure state
2. Performance
   - 80-90% faster prerequisite queries (10-18s → 1-2s)
   - Batch retrieval reduces API calls (9 calls → 1-2 calls)
   - Scales better as we add more outputs
3. Parallel Deployments
   - Resource group isolation = namespace isolation
   - Multiple feature branch deployments supported
   - Secret naming `{rgPrefix}-{layer}-{output}` prevents conflicts
4. Better Observability
   - Key Vault audit logs track who accessed what
   - Secret versioning provides history
   - Tags provide deployment correlation
5. Industry Alignment
   - Standard pattern across AWS, Azure, GCP
   - Matches Terraform/Pulumi state management approaches
   - Documented in Azure Well-Architected Framework
Negative:
1. Additional Infrastructure
   - Adds Key Vault to Layer 1 (minimal cost: <$0.01/month)
   - Increases infrastructure complexity slightly
   - One more service to monitor
2. Permission Management
   - Deployment identities need Key Vault RBAC roles
   - One-time setup: ~30 minutes per environment
   - Must document for new developers
3. Operational Overhead
   - Need to monitor Key Vault availability (99.9% SLA)
   - Need to manage Key Vault network access
   - Additional troubleshooting surface
Mitigations:
| Risk | Mitigation |
|---|---|
| Key Vault unavailable during deployment | 99.9% SLA, monitor availability, rare failure mode |
| Permission issues block deployment | Document setup, create validation script |
| Secret naming collisions | Enforce naming convention via helper script |
| Stale endpoint values | Auto-update on every deployment, timestamp tracking |
| Complexity for new developers | Comprehensive documentation, helper script abstraction |
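The "permission issues" mitigation mentions a validation script; it could be a fail-fast probe along these lines. This is a sketch only: the probe secret name, vault-name argument, and error wording are illustrative assumptions.

```shell
#!/usr/bin/env bash
# Sketch of a pre-deployment access check: attempt a throwaway write with the
# current identity and fail with an actionable message if RBAC is not in place.
set -euo pipefail

check_keyvault_access() {
  local vault="$1"
  local probe="access-probe-$$"   # throwaway secret name, unique per run
  if az keyvault secret set --vault-name "$vault" --name "$probe" \
       --value ok --output none 2>/dev/null; then
    az keyvault secret delete --vault-name "$vault" --name "$probe" --output none
    echo "OK: write access to $vault confirmed"
  else
    echo "ERROR: cannot write to $vault; check the 'Key Vault Secrets Officer' role assignment" >&2
    return 1
  fi
}
```

Running this at the top of each layer script turns a mid-deployment permission failure into an immediate, explained exit.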
Architecture Diagrams
Before (Current):

```
┌─────────────────────┐
│   Layer 2 Script    │
└─────────────────────┘
           │
           │ az deployment group list
           │ az deployment group show (×7)
           ▼
┌─────────────────────┐
│   ARM Deployment    │
│     History API     │  ⚠️ 800 deployment limit
│   (10-18 seconds)   │  ⚠️ Fragile name matching
└─────────────────────┘
```

After (Proposed):

```
┌─────────────────────┐
│   Layer 1 Script    │
│  Deploys → Pushes   │
└─────────────────────┘
           │
           │ push outputs
           ▼
┌─────────────────────┐
│   Azure Key Vault   │
│  (Endpoint Store)   │
│      99.9% SLA      │
│    (1-2 seconds)    │
└─────────────────────┘
           ▲
           │ retrieve by name
        ┌──┴──────────────┬─────────────────┐
        │                 │                 │
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ Layer 2 Script│ │ Layer 3 Script│ │ Layer 4 Script│
│   Retrieves   │ │   Retrieves   │ │   Retrieves   │
└───────────────┘ └───────────────┘ └───────────────┘
```
Alternatives Considered
Alternative 1: Status Quo (Deployment History Queries)
Rejected: Retention risk, poor performance, fragile name matching.
Alternative 2: Azure App Configuration
Pros: Designed for application configuration, better query capabilities
Cons:
- Overkill for infrastructure outputs
- Higher cost (~$1.20/month minimum)
- Not designed for sensitive endpoints (resource IDs can be considered sensitive)
Verdict: Key Vault is more appropriate for infrastructure state.
Alternative 3: Git Repository Storage
Pros: Version controlled, easy to review
Cons:
- Security risk (endpoints in version control)
- Requires commit/push after every deployment (slower)
- Merge conflicts in parallel development
Verdict: Unacceptable security and operational complexity.
Alternative 4: Azure Storage Table/Blob
Pros: Simple key-value store, cheap
Cons:
- No built-in audit logging
- Less secure than Key Vault (no RBAC granularity)
- Need to build versioning ourselves
Verdict: Reinventing Key Vault features.
Alternative 5: Bicep Deployment Stacks (Preview)
Pros: Azure-native cross-deployment references
Cons:
- Preview feature, not GA yet (as of Oct 2023)
- Limited documentation
- Unknown SLA and retention policies
Verdict: Wait for GA, reassess in future ADR.
Implementation Plan
Phase 1: Infrastructure Setup (Week 1)
Tasks:
1. Add Key Vault to infrastructure/bicep/modules/security.bicep
   - Standard tier (sufficient for our needs)
   - RBAC authorization mode (modern approach)
   - Network integration with VNet service endpoints
   - Soft delete + purge protection enabled
2. Configure RBAC permissions in Layer 1 Bicep:

   ```bicep
   resource kvRoleAssignment 'Microsoft.Authorization/roleAssignments@2022-04-01' = {
     // Role assignments require a deterministic GUID name
     name: guid(keyVault.id, managedIdentity.id, 'secretsOfficer')
     scope: keyVault
     properties: {
       // Key Vault Secrets Officer built-in role
       roleDefinitionId: subscriptionResourceId('Microsoft.Authorization/roleDefinitions', 'b86a8fe4-44ce-4948-aee5-eccb2c155cd7')
       principalId: managedIdentity.properties.principalId
     }
   }
   ```

3. Create helper script infrastructure/scripts/helpers/keyvault-config.sh with:
   - push_endpoint_to_keyvault()
   - get_endpoint_from_keyvault()
   - get_all_layer_endpoints()
   - Consistent error handling
   - Network/permission validation
4. Document setup in infrastructure/README.md

Deliverables:
- [ ] Key Vault deployed in Layer 1
- [ ] RBAC configured for Managed Identity
- [ ] Helper script created and tested
- [ ] Documentation updated
Phase 2: Layer Script Updates (Week 1-2)
Update deployment scripts:
1. Layer 1: Add push_endpoint_to_keyvault after Bicep deployment
2. Layer 2: Replace deployment history queries with Key Vault reads
3. Layer 3: Same pattern as Layer 2
4. Layer 4: Replace 9 API calls with batch retrieval
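Layer 4's batch retrieval could be sketched roughly as follows: one `az keyvault secret list` call discovers every secret under the layer prefix, replacing the per-output deployment-history queries. The vault-name derivation and the prefix-stripping helper are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Sketch: replace N per-output history queries with one list call
# plus per-secret value reads.
set -euo pipefail

# Strip "{rg_prefix}-{layer}-" to recover the output name from a secret name
endpoint_output_name() {
  local rg_prefix="$1" layer="$2" secret_name="$3"
  printf '%s\n' "${secret_name#"${rg_prefix}-${layer}-"}"
}

# Emit "outputName=value" lines for every stored output of a layer
get_all_layer_endpoints() {
  local rg_prefix="$1" layer="$2" vault="${rg_prefix}-kv"
  local name
  # One API call enumerates all secret names for the layer
  for name in $(az keyvault secret list --vault-name "$vault" \
      --query "[?starts_with(name, '${rg_prefix}-${layer}-')].name" -o tsv); do
    printf '%s=%s\n' \
      "$(endpoint_output_name "$rg_prefix" "$layer" "$name")" \
      "$(az keyvault secret show --vault-name "$vault" --name "$name" \
           --query value -o tsv)"
  done
}
```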
Testing:
- Deploy all layers in dev environment
- Verify outputs stored in Key Vault
- Verify downstream layers retrieve correctly
- Test redeployment scenarios
Deliverables:
- [ ] All layer scripts updated
- [ ] Integration testing complete
- [ ] Performance benchmarks captured
Phase 3: Validation & Rollout (Week 2)
Tasks:
1. Deploy to test environment
2. Monitor Key Vault metrics (availability, latency)
3. Update CI/CD workflows if needed
4. Create troubleshooting runbook
Success Metrics (Track 60 Days):
- Deployment success rate: ≥95% (baseline: ~95%)
- Prerequisite query time: <3 seconds (baseline: 10-18s)
- Key Vault availability: ≥99.9%
- Deployment failures due to Key Vault: <1%
Deliverables:
- [ ] Test environment validated
- [ ] Monitoring configured
- [ ] Runbook created
- [ ] Team trained on new approach
Cost Analysis
Key Vault Costs (Standard Tier):

```
Base:               $0 (no charge for the vault itself)
Secret operations:  $0.03 per 10,000 operations

Monthly usage estimate (active development):
  Write ops: 4 layers × 10 outputs × 20 deploys = 800 ops
  Read ops:  40 retrievals × 20 deploys         = 800 ops
  Total:     1,600 operations
  Cost:      (1,600 / 10,000) × $0.03 = $0.0048/month

Annual cost: ~$0.06/year (6 cents)
```
Production Environment:
Comparison:
- Removed per ADR-032: $5-10/month (app secrets)
- Adding back: <$0.01/month (deployment outputs)
- Net Savings: Still ~$5-10/month vs original
ROI Calculation:

```
Additional cost:  $0.06/year
Time saved:       80% query time reduction
                  ~10 seconds × 240 deploys/year ≈ 40 minutes
Developer time:   40 min × $60/hr = $40/year
ROI:              ≈66,000%
```
Security Considerations
Access Control:
- Deployment scripts: Key Vault Secrets Officer (read/write)
- Container Apps: Key Vault Secrets User (read-only) if needed
- Developers: On-demand access only (troubleshooting)
Network Security:
- VNet service endpoint integration (already configured in Layer 1)
- Firewall rules allow deployment runner IPs
- No public internet access (Azure backbone only)

Audit Trail:
- All secret access logged in Azure Monitor
- Retention: 90 days default, configurable
- Alerts on anomalous access patterns

Data Sensitivity:
- Endpoint values: resource IDs, FQDNs (medium sensitivity)
- No PII, passwords, or API keys
- Appropriate for Key Vault storage
Monitoring & Operations
Key Metrics to Track:
- Key Vault availability (target: ≥99.9%)
- Secret operation latency (target: <500ms)
- Failed secret reads/writes (alert threshold: >5/hour)
- Secret age (alert if >90 days without update)

Dashboards:
- Azure Monitor workbook for Key Vault metrics
- Deployment performance trends (before/after comparison)
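A simple script-level gate on the availability target might look like the sketch below; the metric query (commented out) and vault name are illustrative assumptions, not a prescribed monitoring setup.

```shell
#!/usr/bin/env bash
# Sketch: gate on the Key Vault Availability metric from Azure Monitor.
set -euo pipefail

# Exit 0 when the reported availability meets the ≥99.9% target
availability_ok() {
  awk -v p="$1" 'BEGIN { exit !(p >= 99.9) }'
}

# Illustrative metric pull (needs a live subscription, so commented out):
# KV_ID=$(az keyvault show --name ldfdevnew-kv --query id -o tsv)
# PCT=$(az monitor metrics list --resource "$KV_ID" --metric Availability \
#         --interval PT1H --query 'value[0].timeseries[0].data[-1].average' -o tsv)
# availability_ok "$PCT" || echo "ALERT: Key Vault availability below target"
```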
Troubleshooting Tools:

```bash
# List all endpoints for a layer
./infrastructure/scripts/helpers/keyvault-config.sh list layer1

# Check Key Vault access
./infrastructure/scripts/helpers/keyvault-config.sh check-access

# View secret metadata
az keyvault secret show --vault-name X --name Y --query attributes
```
Documentation Updates
Files to Update:
- infrastructure/README.md - Add Key Vault endpoint storage section
- infrastructure/scripts/README.md - Document helper script usage
- AGENTS.md - Update deployment architecture notes
- docs/architecture/deployment-layers.md - Update diagrams
New Documentation:
- docs/infrastructure/keyvault-endpoint-storage.md - Detailed guide
- docs/troubleshooting/keyvault-deployment-issues.md - Runbook
References
Azure Documentation:
- Azure Key Vault Best Practices
- Azure Well-Architected Framework - Key Vault
- Bicep Deployment Outputs

Industry Patterns:
- Terraform Remote State Backend
- AWS Systems Manager Parameter Store
- Pulumi Stack References

Related ADRs:
- ADR-032: Removed Key Vault for app secrets (still valid)
- ADR-041: 4-layer architecture (dependency management challenge)
- ADR-047: Layer-specific RBAC (Key Vault permissions)
Decision Outcome
Status: Accepted
Rationale:
- Industry standard pattern for deployment output storage
- Solves real reliability and performance issues
- Minimal cost impact (<$0.01/month)
- Simple implementation (8 hours effort)
- Does not conflict with ADR-032 (different use case)
Next Steps:
1. Create GitHub issue for implementation (link to this ADR)
2. Implement Phase 1: Infrastructure setup
3. Implement Phase 2: Script updates
4. Implement Phase 3: Validation and rollout
5. Review after 60 days: Assess success metrics
Review Date: 2025-12-22 (60 days after implementation)
Supersedes: None (ADR-032 remains valid for application secrets)
Superseded By: None
Last Updated: 2025-10-23