ADR-048: Azure Key Vault for Cross-Layer Deployment Output Storage

Status: Accepted
Date: 2025-10-23
Decision Makers: Infrastructure Team, Architect
Related ADRs:
- ADR-032: Remove Azure Key Vault from MVP (superseded for infrastructure use case)
- ADR-041: 4-Layer Deployment Architecture
- ADR-047: Layer-Specific RBAC Architecture


Context

Problem Statement

Our 4-layer deployment architecture (ADR-041) requires each layer to retrieve outputs from previous layers to wire dependencies. Currently, we query Azure deployment history using az deployment group show, which has several reliability and performance issues:

Current Approach Issues:

  1. Deployment History Retention Risk
     - Azure ARM maintains ~800 deployments per resource group
     - Older deployments may be purged after limit reached
     - No SLA guarantee on retention period
     - Layer deployments fail if prerequisite deployment history purged

  2. Poor Performance
     - Layer 4 makes 9 separate az deployment group show API calls
     - Each call takes 1-2 seconds
     - Total prerequisite query time: 10-18 seconds
     - Adds significant overhead to every deployment

  3. Fragile Name Pattern Matching
     - Relies on patterns like [?contains(name, 'layer1-foundation')]
     - Breaks if deployment naming conventions change
     - Requires additional sorting for multiple matches

  4. Parallel Deployment Conflicts
     - Single resource group = single namespace
     - Concurrent deployments can conflict
     - Difficult to support feature branch deployments

Example from deploy-layer4-apps.sh (lines 117-179):

# Find Layer 1 deployment (API call 1)
LAYER1_DEPLOYMENT=$(az deployment group list \
    --resource-group "$RESOURCE_GROUP" \
    --query "[?contains(name, 'layer1')].name | [0]" -o tsv)

# Get each output individually (API calls 2-9)
MANAGED_IDENTITY_ID=$(az deployment group show \
    --resource-group "$RESOURCE_GROUP" \
    --name "$LAYER1_DEPLOYMENT" \
    --query "properties.outputs.managedIdentityId.value" -o tsv)

APP_INSIGHTS_CONN_STRING=$(az deployment group show ...)
AI_ENDPOINT=$(az deployment group show ...)
# ... 6 more similar calls

Why Not Keep Current Approach?

The deployment history approach was acceptable for initial development but has fundamental limitations:
- Cannot guarantee long-term reliability (purge risk)
- Performance degrades as we add more outputs
- Does not scale to parallel development workflows
- Industry has moved to explicit endpoint registries

Relationship to ADR-032

ADR-032 removed Key Vault because the application didn't need it for secrets management. This ADR reintroduces Key Vault for a different purpose: infrastructure deployment output storage. This is a standard pattern in cloud infrastructure (see Alternatives Considered).

Key Distinction:
- ADR-032 Scope: Application runtime secret management (API keys, passwords)
- ADR-048 Scope: Infrastructure deployment coordination (resource IDs, endpoints)

Industry Standard Patterns

Major cloud providers and infrastructure-as-code tools offer dedicated services for cross-deployment output storage:

| Platform  | Service                         | Usage                              |
|-----------|---------------------------------|------------------------------------|
| AWS       | Systems Manager Parameter Store | Store CloudFormation outputs       |
| AWS       | Secrets Manager                 | Cross-stack references             |
| Azure     | Key Vault                       | Store Bicep/ARM deployment outputs |
| Azure     | App Configuration               | Non-sensitive configuration        |
| GCP       | Secret Manager                  | Store Terraform outputs            |
| Terraform | State Backend                   | Output references                  |
| Pulumi    | Stack References                | Cross-stack dependencies           |

Decision

Add Azure Key Vault to Layer 1 infrastructure specifically for storing deployment outputs, using a simple naming convention that does not require deployment ID tracking.

Implementation Pattern

Simplified Secret Naming Convention:

{resourceGroupPrefix}-{layerName}-{outputName}

Examples:
  ldfdevnew-layer1-managedIdentityId
  ldfdevnew-layer1-logAnalyticsWorkspaceId
  ldfdevnew-layer2-acrLoginServer
  ldfdevnew-layer2-containerAppsEnvId

No Deployment ID Required:
- Layer scripts read secrets by predictable name only
- Latest secret value is always returned by Key Vault (automatic versioning)
- Tags are metadata only (for troubleshooting, not retrieval)
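The naming convention above can be sketched as a small shell function. This is a minimal illustration, not the project's helper: the function name kv_secret_name is hypothetical, and the validation reflects Key Vault's rule that secret names may contain only letters, digits, and hyphens, up to 127 characters.

```shell
# Hypothetical helper (name not from this ADR): build a secret name following
# {resourceGroupPrefix}-{layerName}-{outputName} and validate it.
kv_secret_name() {
    local prefix="$1" layer="$2" output="$3"
    local name="${prefix}-${layer}-${output}"

    # All three parts are required.
    if [ -z "$prefix" ] || [ -z "$layer" ] || [ -z "$output" ]; then
        echo "usage: kv_secret_name <prefix> <layer> <output>" >&2
        return 1
    fi

    # Key Vault secret names allow only letters, digits, and hyphens.
    case "$name" in
        *[!A-Za-z0-9-]*)
            echo "invalid characters in secret name: $name" >&2
            return 1
            ;;
    esac

    # Key Vault secret names are limited to 127 characters.
    if [ "${#name}" -gt 127 ]; then
        echo "secret name too long (${#name} > 127): $name" >&2
        return 1
    fi

    printf '%s\n' "$name"
}
```

For example, kv_secret_name ldfdevnew layer1 managedIdentityId prints ldfdevnew-layer1-managedIdentityId, while an output name containing an underscore is rejected before any Key Vault call is made.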

Each Layer Script Pattern:

# After deployment completes
# Push outputs to Key Vault with metadata tags
push_endpoint_to_keyvault "$RESOURCE_GROUP" "layer1" "managedIdentityId" "$VALUE"

# Downstream layers retrieve by name only
VALUE=$(get_endpoint_from_keyvault "$RESOURCE_GROUP" "layer1" "managedIdentityId")

# Tags are for metadata/troubleshooting only:
# - updatedAt: timestamp
# - deploymentId: for audit trail
# - layer: for filtering
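One way the two helpers called above might be implemented is sketched below. Several details are assumptions that this ADR does not specify: the vault name comes from a KEYVAULT_NAME environment variable, the deployment ID from an optional DEPLOYMENT_ID variable, and the secret-name prefix is derived by a hypothetical rg_prefix function that strips non-alphanumeric characters from the resource group name. Only the az keyvault secret set and az keyvault secret show commands are used.

```shell
# Assumption: prefix = resource group name with non-alphanumerics removed.
rg_prefix() {
    printf '%s' "$1" | tr -cd 'A-Za-z0-9'
}

# Store one deployment output as a secret, tagged with troubleshooting metadata.
push_endpoint_to_keyvault() {
    local rg="$1" layer="$2" output="$3" value="$4"
    local name
    name="$(rg_prefix "$rg")-${layer}-${output}"

    # Tags are metadata only; retrieval never reads them.
    az keyvault secret set \
        --vault-name "$KEYVAULT_NAME" \
        --name "$name" \
        --value "$value" \
        --tags "layer=${layer}" \
               "updatedAt=$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
               "deploymentId=${DEPLOYMENT_ID:-unknown}" \
        --output none
}

# Read one deployment output by its predictable name.
get_endpoint_from_keyvault() {
    local rg="$1" layer="$2" output="$3"

    # No --version is passed, so Key Vault returns the latest version.
    az keyvault secret show \
        --vault-name "$KEYVAULT_NAME" \
        --name "$(rg_prefix "$rg")-${layer}-${output}" \
        --query value --output tsv
}
```

Each az keyvault secret set on an existing name creates a new version, which is why downstream layers need no deployment ID to find the current value.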

What We're Building

Infrastructure Changes:
  1. Add Key Vault to infrastructure/bicep/modules/security.bicep
  2. Configure RBAC: Deployment identity gets "Key Vault Secrets Officer"
  3. Create helper script: infrastructure/scripts/helpers/keyvault-config.sh

Helper Functions:

push_endpoint_to_keyvault()    # Store output after layer deployment
get_endpoint_from_keyvault()   # Retrieve output by name
get_all_layer_endpoints()      # Batch retrieve all outputs for layers
check_keyvault_access()        # Validation for troubleshooting
list_layer_endpoints()         # List all stored endpoints

No Fallback Logic:
- Not needed, since this is a new deployment approach (no existing users)
- All scripts will use Key Vault from day 1
- Simpler implementation, less code to maintain
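A possible shape for get_all_layer_endpoints is sketched below. The argument order (vault, prefix, layer) and the name=value output format are assumptions, since this ADR only names the function. Listing secrets in Key Vault returns names and metadata but not values, so this sketch issues one list call followed by one show call per output; it is an illustration of retrieval by prefix rather than a measured implementation.

```shell
# Hypothetical signature: get_all_layer_endpoints <vault> <prefix> <layer>
# Prints one "outputName=value" line per secret stored for the layer.
get_all_layer_endpoints() {
    local vault="$1" prefix="$2" layer="$3" name

    # One call to list the secret names that belong to this layer.
    az keyvault secret list \
        --vault-name "$vault" \
        --query "[?starts_with(name, '${prefix}-${layer}-')].name" \
        --output tsv |
    while IFS= read -r name; do
        [ -n "$name" ] || continue
        # One call per secret to fetch its latest value. stdin is redirected
        # so the inner command cannot consume the list being iterated.
        printf '%s=%s\n' "${name#"${prefix}-${layer}-"}" \
            "$(az keyvault secret show --vault-name "$vault" --name "$name" \
                --query value --output tsv </dev/null)"
    done
}
```

A caller could then load the values with a loop such as: get_all_layer_endpoints "$KEYVAULT_NAME" ldfdevnew layer1 | while IFS='=' read -r key value; do ...; done.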

Consequences

Positive:

  1. Reliability
     - Independent of deployment history retention policies
     - Works even if deployment purged after 6+ months
     - Explicit, durable storage of infrastructure state

  2. Performance
     - 80-90% faster prerequisite queries (10-18s → 1-2s)
     - Batch retrieval reduces API calls (9 calls → 1-2 calls)
     - Scales better as we add more outputs

  3. Parallel Deployments
     - Resource group isolation = namespace isolation
     - Multiple feature branch deployments supported
     - Secret naming: {rgPrefix}-{layer}-{output} prevents conflicts

  4. Better Observability
     - Key Vault audit logs track who accessed what
     - Secret versioning provides history
     - Tags provide deployment correlation

  5. Industry Alignment
     - Standard pattern across AWS, Azure, GCP
     - Matches Terraform/Pulumi state management approaches
     - Documented in Azure Well-Architected Framework

Negative:

  1. Additional Infrastructure
     - Adds Key Vault to Layer 1 (minimal cost: <$0.01/month)
     - Increases infrastructure complexity slightly
     - One more service to monitor

  2. Permission Management
     - Deployment identities need Key Vault RBAC roles
     - One-time setup: ~30 minutes per environment
     - Must document for new developers

  3. Operational Overhead
     - Need to monitor Key Vault availability (99.9% SLA)
     - Need to manage Key Vault network access
     - Additional troubleshooting surface

Mitigations:

| Risk                                    | Mitigation                                          |
|-----------------------------------------|-----------------------------------------------------|
| Key Vault unavailable during deployment | 99.9% SLA, monitor availability, rare failure mode  |
| Permission issues block deployment      | Document setup, create validation script            |
| Secret naming collisions                | Enforce naming convention via helper script         |
| Stale endpoint values                   | Auto-update on every deployment, timestamp tracking |
| Complexity for new developers           | Comprehensive documentation, helper script abstraction |

Architecture Diagrams

Before (Current):

┌─────────────────────┐
│  Layer 2 Script     │
└─────────────────────┘
          │ az deployment group list
          │ az deployment group show (×7)
┌─────────────────────┐
│  ARM Deployment     │
│  History API        │  ⚠️  800 deployment limit
│  (10-18 seconds)    │  ⚠️  Fragile name matching
└─────────────────────┘

After (Proposed):

┌─────────────────────┐
│  Layer 1 Script     │
│  Deploys → Pushes   │────┐
└─────────────────────┘    │
                           │ push outputs
                    ┌─────────────────────┐
┌─────────────────┐ │   Azure Key Vault   │
│  Layer 2 Script │─┤   (Endpoint Store)  │
│  Retrieves      │ │   99.9% SLA         │
└─────────────────┘ │   (1-2 seconds)     │
                    └─────────────────────┘
                           │ retrieve by name
┌─────────────────┐        │
│  Layer 3 Script │────────┤
└─────────────────┘        │
┌─────────────────┐        │
│  Layer 4 Script │────────┘
└─────────────────┘

Alternatives Considered

Alternative 1: Status Quo (Deployment History Queries)

Rejected: Retention risk, poor performance, fragile name matching.

Alternative 2: Azure App Configuration

Pros: Designed for application configuration, better query capabilities
Cons:
- Overkill for infrastructure outputs
- Higher cost (~$1.20/month minimum)
- Not designed for sensitive endpoints (resource IDs can be considered sensitive)

Verdict: Key Vault is more appropriate for infrastructure state.

Alternative 3: Git Repository Storage

Pros: Version controlled, easy to review
Cons:
- Security risk (endpoints in version control)
- Requires commit/push after every deployment (slower)
- Merge conflicts in parallel development

Verdict: Unacceptable security and operational complexity.

Alternative 4: Azure Storage Table/Blob

Pros: Simple key-value store, cheap
Cons:
- No built-in audit logging
- Less secure than Key Vault (no RBAC granularity)
- Need to build versioning ourselves

Verdict: Reinventing Key Vault features.

Alternative 5: Bicep Deployment Stacks (Preview)

Pros: Azure-native cross-deployment references
Cons:
- Preview feature, not GA yet (as of Oct 2023)
- Limited documentation
- Unknown SLA and retention policies

Verdict: Wait for GA, reassess in future ADR.

Implementation Plan

Phase 1: Infrastructure Setup (Week 1)

Tasks:

  1. Add Key Vault to infrastructure/bicep/modules/security.bicep
     - Standard tier (sufficient for our needs)
     - RBAC authorization mode (modern approach)
     - Network integration with VNet service endpoints
     - Soft delete + purge protection enabled

  2. Configure RBAC permissions in Layer 1 Bicep

    resource kvRoleAssignment 'Microsoft.Authorization/roleAssignments@2022-04-01' = {
      // Role assignment names must be GUIDs; derive a deterministic one
      name: guid(keyVault.id, managedIdentity.id, 'Key Vault Secrets Officer')
      scope: keyVault
      properties: {
        // Built-in role definition ID for Key Vault Secrets Officer
        roleDefinitionId: subscriptionResourceId('Microsoft.Authorization/roleDefinitions',
          'b86a8fe4-44ce-4948-aee5-eccb2c155cd7')
        principalId: managedIdentity.properties.principalId
        principalType: 'ServicePrincipal'
      }
    }

  3. Create helper script: infrastructure/scripts/helpers/keyvault-config.sh
     - push_endpoint_to_keyvault()
     - get_endpoint_from_keyvault()
     - get_all_layer_endpoints()
     - Consistent error handling
     - Network/permission validation

  4. Document setup in infrastructure/README.md

Deliverables:
- [ ] Key Vault deployed in Layer 1
- [ ] RBAC configured for Managed Identity
- [ ] Helper script created and tested
- [ ] Documentation updated
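The network/permission validation planned for the helper script could look roughly like the sketch below. It is an assumption-laden illustration: the function takes the vault name as its only argument, and it distinguishes failure causes by looking for the ForbiddenByFirewall and ForbiddenByRbac error codes in the CLI's error text, which may differ between Azure CLI versions.

```shell
# Hypothetical sketch of check_keyvault_access <vault>.
# Returns 0 if the caller can list secrets, 1 otherwise (hint on stderr).
check_keyvault_access() {
    local vault="$1" err

    # Capture stderr only; the listing itself is discarded.
    if err="$(az keyvault secret list --vault-name "$vault" \
                --query "[0].name" --output tsv 2>&1 >/dev/null)"; then
        echo "OK: secrets in $vault are reachable"
        return 0
    fi

    # Assumption: the error text carries Key Vault's inner error code.
    case "$err" in
        *ForbiddenByFirewall*)
            echo "Network problem: $vault firewall rejected this client" >&2 ;;
        *ForbiddenByRbac*)
            echo "Permission problem: check RBAC role assignments on $vault" >&2 ;;
        *)
            echo "Key Vault check failed: $err" >&2 ;;
    esac
    return 1
}
```

Running this at the start of each layer script would surface permission and firewall problems before the Bicep deployment begins rather than after it completes.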

Phase 2: Layer Script Updates (Week 1-2)

Update deployment scripts:
  1. Layer 1: Add push_endpoint_to_keyvault after Bicep deployment
  2. Layer 2: Replace deployment history queries with Key Vault reads
  3. Layer 3: Same pattern as Layer 2
  4. Layer 4: Replace 9 API calls with batch retrieval

Testing:
- Deploy all layers in dev environment
- Verify outputs stored in Key Vault
- Verify downstream layers retrieve correctly
- Test redeployment scenarios

Deliverables:
- [ ] All layer scripts updated
- [ ] Integration testing complete
- [ ] Performance benchmarks captured

Phase 3: Validation & Rollout (Week 2)

Tasks:
  1. Deploy to test environment
  2. Monitor Key Vault metrics (availability, latency)
  3. Update CI/CD workflows if needed
  4. Create troubleshooting runbook

Success Metrics (Track 60 Days):
- Deployment success rate: ≥95% (baseline: ~95%)
- Prerequisite query time: <3 seconds (baseline: 10-18s)
- Key Vault availability: ≥99.9%
- Deployment failures due to Key Vault: <1%

Deliverables:
- [ ] Test environment validated
- [ ] Monitoring configured
- [ ] Runbook created
- [ ] Team trained on new approach

Cost Analysis

Key Vault Costs (Standard Tier):

Base:                  $0 (no charge for vault itself)
Secret Operations:     $0.03 per 10,000 operations

Monthly Usage Estimate (Active Development):
  Write ops:  4 layers × 10 outputs × 20 deploys = 800 ops
  Read ops:   40 retrievals × 20 deploys = 800 ops
  Total:      1,600 operations
  Cost:       (1,600 / 10,000) × $0.03 = $0.0048/month

Annual Cost: ~$0.06/year (6 cents)
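The estimate above can be recomputed for other deployment volumes with a short sketch. The function names are hypothetical, and the default price of $0.03 per 10,000 operations is simply the figure quoted in this section, so it should be checked against current Azure pricing.

```shell
# Monthly operations: writes (layers x outputs x deploys) plus
# reads (retrievals per deploy x deploys).
kv_monthly_ops() {
    local layers="$1" outputs="$2" deploys="$3" retrievals="$4"
    echo $(( layers * outputs * deploys + retrievals * deploys ))
}

# Monthly cost in dollars for a number of operations.
# Second argument (price per 10,000 operations) defaults to 0.03.
kv_monthly_cost() {
    local ops="$1" price_per_10k="${2:-0.03}"
    awk -v ops="$ops" -v price="$price_per_10k" \
        'BEGIN { printf "%.4f\n", ops / 10000 * price }'
}
```

With the active-development inputs, kv_monthly_ops 4 10 20 40 gives 1600, and kv_monthly_cost 1600 gives 0.0048, matching the figures above.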

Production Environment:

Monthly deploys:  5
Operations:       ~500
Cost:            <$0.01/month

Comparison:
- Removed per ADR-032: $5-10/month (app secrets)
- Adding back: <$0.01/month (deployment outputs)
- Net Savings: Still ~$5-10/month vs original

ROI Calculation:

Additional Cost:        $0.06/year
Time Saved:            80% query time reduction
                       10 seconds × 240 deploys/year = 40 minutes
Developer Time Value:   40 min × $60/hr = $40/year
ROI:                   66,000%

Security Considerations

Access Control:
- Deployment scripts: Key Vault Secrets Officer (read/write)
- Container Apps: Key Vault Secrets User (read-only) if needed
- Developers: On-demand access only (troubleshooting)

Network Security:
- VNet service endpoint integration (already configured in Layer 1)
- Firewall rules allow deployment runner IPs
- No public internet access (Azure backbone only)

Audit Trail:
- All secret access logged in Azure Monitor
- Retention: 90 days default, configurable
- Alerts on anomalous access patterns

Data Sensitivity:
- Endpoint values: Resource IDs, FQDNs (medium sensitivity)
- No PII, passwords, or API keys
- Appropriate for Key Vault storage

Monitoring & Operations

Key Metrics to Track:
- Key Vault availability (target: ≥99.9%)
- Secret operation latency (target: <500ms)
- Failed secret reads/writes (alert threshold: >5/hour)
- Secret age (alert if >90 days without update)
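The secret-age metric could be checked with a small filter like the one sketched below. The function name filter_stale_secrets is hypothetical, and the sketch assumes timestamps are ISO 8601 strings in UTC, so that plain string comparison orders them correctly.

```shell
# Reads "name<TAB>updated" lines on stdin and prints the names whose
# 'updated' timestamp sorts before the cutoff (ISO 8601, UTC assumed).
filter_stale_secrets() {
    local cutoff="$1"
    awk -F '\t' -v cutoff="$cutoff" '$2 != "" && $2 < cutoff { print $1 }'
}

# Example wiring (assumes GNU date and a KEYVAULT_NAME variable):
#   cutoff="$(date -u -d '90 days ago' +%Y-%m-%dT%H:%M:%S)"
#   az keyvault secret list --vault-name "$KEYVAULT_NAME" \
#       --query "[].[name, attributes.updated]" --output tsv |
#       filter_stale_secrets "$cutoff"
```

Keeping the comparison in a pure filter separates it from the Azure CLI call, so the 90-day threshold logic can be tested without access to a vault.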

Dashboards:
- Azure Monitor workbook for Key Vault metrics
- Deployment performance trends (before/after comparison)

Troubleshooting Tools:

# List all endpoints for a layer
./infrastructure/scripts/helpers/keyvault-config.sh list layer1

# Check Key Vault access
./infrastructure/scripts/helpers/keyvault-config.sh check-access

# View secret metadata
az keyvault secret show --vault-name X --name Y --query attributes

Documentation Updates

Files to Update:
- infrastructure/README.md - Add Key Vault endpoint storage section
- infrastructure/scripts/README.md - Document helper script usage
- AGENTS.md - Update deployment architecture notes
- docs/architecture/deployment-layers.md - Update diagrams

New Documentation:
- docs/infrastructure/keyvault-endpoint-storage.md - Detailed guide
- docs/troubleshooting/keyvault-deployment-issues.md - Runbook

References

Azure Documentation:
- Azure Key Vault Best Practices
- Azure Well-Architected Framework - Key Vault
- Bicep Deployment Outputs

Industry Patterns:
- Terraform Remote State Backend
- AWS Systems Manager Parameter Store
- Pulumi Stack References

Related ADRs:
- ADR-032: Removed Key Vault for app secrets (still valid)
- ADR-041: 4-layer architecture (dependency management challenge)
- ADR-047: Layer-specific RBAC (Key Vault permissions)

Decision Outcome

Status: Accepted

Rationale:
- Industry standard pattern for deployment output storage
- Solves real reliability and performance issues
- Minimal cost impact (<$0.01/month)
- Simple implementation (8 hours effort)
- Does not conflict with ADR-032 (different use case)

Next Steps:
  1. Create GitHub issue for implementation (link to this ADR)
  2. Implement Phase 1: Infrastructure setup
  3. Implement Phase 2: Script updates
  4. Implement Phase 3: Validation and rollout
  5. Review after 60 days: Assess success metrics

Review Date: 2025-12-22 (60 days after implementation)


Supersedes: None (ADR-032 remains valid for application secrets)
Superseded By: None
Last Updated: 2025-10-23