ADR-048: Azure Key Vault for Cross-Layer Deployment Output Storage
Status: Accepted
Date: 2025-10-23
Decision Makers: Infrastructure Team, Architect
Related ADRs:
- ADR-032: Remove Azure Key Vault from MVP (superseded for infrastructure use case)
- ADR-041: 4-Layer Deployment Architecture
- ADR-047: Layer-Specific RBAC Architecture
Context
Problem Statement
Our 4-layer deployment architecture (ADR-041) requires each layer to retrieve outputs from previous layers to wire dependencies. Currently, we query Azure deployment history using az deployment group show, which has several reliability and performance issues:
Current Approach Issues:
1. Deployment History Retention Risk
   - Azure ARM maintains ~800 deployments per resource group
   - Older deployments may be purged after the limit is reached
   - No SLA guarantee on retention period
   - Layer deployments fail if prerequisite deployment history is purged
2. Poor Performance
   - Layer 4 makes 9 separate `az deployment group show` API calls
   - Each call takes 1-2 seconds
   - Total prerequisite query time: 10-18 seconds
   - Adds significant overhead to every deployment
3. Fragile Name Pattern Matching
   - Relies on patterns like `[?contains(name, 'layer1-foundation')]`
   - Breaks if deployment naming conventions change
   - Requires additional sorting when multiple deployments match
4. Parallel Deployment Conflicts
   - Single resource group = single namespace
   - Concurrent deployments can conflict
   - Difficult to support feature branch deployments
Example from `deploy-layer4-apps.sh` (lines 117-179):

```bash
# Find Layer 1 deployment (API call 1)
LAYER1_DEPLOYMENT=$(az deployment group list \
  --resource-group "$RESOURCE_GROUP" \
  --query "[?contains(name, 'layer1')].name | [0]" -o tsv)

# Get each output individually (API calls 2-9)
MANAGED_IDENTITY_ID=$(az deployment group show \
  --resource-group "$RESOURCE_GROUP" \
  --name "$LAYER1_DEPLOYMENT" \
  --query "properties.outputs.managedIdentityId.value" -o tsv)
APP_INSIGHTS_CONN_STRING=$(az deployment group show ...)
AI_ENDPOINT=$(az deployment group show ...)
# ... 6 more similar calls
```
Why Not Keep Current Approach?
The deployment history approach was acceptable for initial development but has fundamental limitations:
- Cannot guarantee long-term reliability (purge risk)
- Performance degrades as we add more outputs
- Does not scale to parallel development workflows
- Industry has moved to explicit endpoint registries
Relationship to ADR-032
ADR-032 removed Key Vault because the application didn't need it for secrets management. This ADR reintroduces Key Vault for a different purpose: infrastructure deployment output storage. This is a standard pattern in cloud infrastructure (see Alternatives Considered).
Key Distinction:
- ADR-032 Scope: Application runtime secret management (API keys, passwords)
- ADR-048 Scope: Infrastructure deployment coordination (resource IDs, endpoints)
Industry Standard Patterns
All major cloud providers use dedicated services for cross-deployment output storage:
| Platform | Service | Usage |
|---|---|---|
| AWS | Systems Manager Parameter Store | Store CloudFormation outputs |
| AWS | Secrets Manager | Cross-stack references |
| Azure | Key Vault | Store Bicep/ARM deployment outputs |
| Azure | App Configuration | Non-sensitive configuration |
| GCP | Secret Manager | Store Terraform outputs |
| Terraform | State Backend | Output references |
| Pulumi | Stack References | Cross-stack dependencies |
Decision
Add Azure Key Vault to Layer 1 infrastructure specifically for storing deployment outputs, using a simple naming convention that does not require deployment ID tracking.
Implementation Pattern
Simplified Secret Naming Convention:

```
{resourceGroupPrefix}-{layerName}-{outputName}
```

Examples:

```
ldfdevnew-layer1-managedIdentityId
ldfdevnew-layer1-logAnalyticsWorkspaceId
ldfdevnew-layer2-acrLoginServer
ldfdevnew-layer2-containerAppsEnvId
```

No Deployment ID Required:
- Layer scripts read secrets by predictable name only
- Latest secret value is always returned by Key Vault (automatic versioning)
- Tags are metadata only (for troubleshooting, not retrieval)
Each Layer Script Pattern:

```bash
# After deployment completes:
# push outputs to Key Vault with metadata tags
push_endpoint_to_keyvault "$RESOURCE_GROUP" "layer1" "managedIdentityId" "$VALUE"

# Downstream layers retrieve by name only
VALUE=$(get_endpoint_from_keyvault "$RESOURCE_GROUP" "layer1" "managedIdentityId")

# Tags are for metadata/troubleshooting only:
# - updatedAt: timestamp
# - deploymentId: for audit trail
# - layer: for filtering
```
What We're Building
Infrastructure Changes:
1. Add Key Vault to infrastructure/bicep/modules/security.bicep
2. Configure RBAC: Deployment identity gets "Key Vault Secrets Officer"
3. Create helper script: infrastructure/scripts/helpers/keyvault-config.sh
Helper Functions:

```bash
push_endpoint_to_keyvault()   # Store output after layer deployment
get_endpoint_from_keyvault()  # Retrieve output by name
get_all_layer_endpoints()     # Batch retrieve all outputs for a layer
check_keyvault_access()       # Validation for troubleshooting
list_layer_endpoints()        # List all stored endpoints
```
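As a rough sketch of what these helpers could look like in bash, the following is illustrative only: the vault-name derivation (`${rg_prefix}-kv`), the exact function signatures, and the tag set are assumptions, not the final implementation.

```shell
#!/usr/bin/env bash
# Sketch of keyvault-config.sh helpers; assumes the vault is named
# "<resourceGroupPrefix>-kv". Adjust to the actual Layer 1 naming.
set -euo pipefail

# Compose a secret name per the convention {resourceGroupPrefix}-{layerName}-{outputName}
endpoint_secret_name() {
  local rg_prefix="$1" layer="$2" output="$3"
  printf '%s-%s-%s\n' "$rg_prefix" "$layer" "$output"
}

# Store an output after a layer deployment (tags are metadata only)
push_endpoint_to_keyvault() {
  local rg_prefix="$1" layer="$2" output="$3" value="$4"
  az keyvault secret set \
    --vault-name "${rg_prefix}-kv" \
    --name "$(endpoint_secret_name "$rg_prefix" "$layer" "$output")" \
    --value "$value" \
    --tags "layer=$layer" "updatedAt=$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    --output none
}

# Retrieve an output by predictable name; with no version specified,
# Key Vault returns the latest secret version
get_endpoint_from_keyvault() {
  local rg_prefix="$1" layer="$2" output="$3"
  az keyvault secret show \
    --vault-name "${rg_prefix}-kv" \
    --name "$(endpoint_secret_name "$rg_prefix" "$layer" "$output")" \
    --query value --output tsv
}
```

Because retrieval never needs a deployment ID, redeploying a layer simply overwrites the same secret name and downstream layers always read the current value.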
NO Fallback Logic:
- This is a new deployment approach (no existing users)
- All scripts will use Key Vault from day 1
- Simpler implementation, less code to maintain
Consequences
Positive:
1. Reliability
   - Independent of deployment history retention policies
   - Works even if the deployment record is purged after 6+ months
   - Explicit, durable storage of infrastructure state
2. Performance
   - 80-90% faster prerequisite queries (10-18s → 1-2s)
   - Batch retrieval reduces API calls (9 calls → 1-2 calls)
   - Scales better as we add more outputs
3. Parallel Deployments
   - Resource group isolation = namespace isolation
   - Multiple feature branch deployments supported
   - Secret naming `{rgPrefix}-{layer}-{output}` prevents conflicts
4. Better Observability
   - Key Vault audit logs track who accessed what
   - Secret versioning provides history
   - Tags provide deployment correlation
5. Industry Alignment
   - Standard pattern across AWS, Azure, GCP
   - Matches Terraform/Pulumi state management approaches
   - Documented in Azure Well-Architected Framework
Negative:
1. Additional Infrastructure
   - Adds Key Vault to Layer 1 (minimal cost: <$0.01/month)
   - Increases infrastructure complexity slightly
   - One more service to monitor
2. Permission Management
   - Deployment identities need Key Vault RBAC roles
   - One-time setup: ~30 minutes per environment
   - Must document for new developers
3. Operational Overhead
   - Need to monitor Key Vault availability (99.9% SLA)
   - Need to manage Key Vault network access
   - Additional troubleshooting surface
Mitigations:
| Risk | Mitigation |
|---|---|
| Key Vault unavailable during deployment | 99.9% SLA, monitor availability, rare failure mode |
| Permission issues block deployment | Document setup, create validation script |
| Secret naming collisions | Enforce naming convention via helper script |
| Stale endpoint values | Auto-update on every deployment, timestamp tracking |
| Complexity for new developers | Comprehensive documentation, helper script abstraction |
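The "permission issues" mitigation mentions a validation script; it could be a fail-fast probe along these lines. This is a sketch only: the probe secret name, vault-name argument, and error wording are illustrative assumptions.

```shell
#!/usr/bin/env bash
# Sketch of a pre-deployment access check: attempt a throwaway write with the
# current identity and fail with an actionable message if RBAC is not in place.
set -euo pipefail

check_keyvault_access() {
  local vault="$1"
  local probe="access-probe-$$"   # throwaway secret name, unique per run
  if az keyvault secret set --vault-name "$vault" --name "$probe" \
       --value ok --output none 2>/dev/null; then
    az keyvault secret delete --vault-name "$vault" --name "$probe" --output none
    echo "OK: write access to $vault confirmed"
  else
    echo "ERROR: cannot write to $vault; check the 'Key Vault Secrets Officer' role assignment" >&2
    return 1
  fi
}
```

Running this at the top of each layer script turns a mid-deployment permission failure into an immediate, explained exit.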
Architecture Diagrams
Before (Current):

```
┌─────────────────────┐
│   Layer 2 Script    │
└─────────────────────┘
           │
           │ az deployment group list
           │ az deployment group show (×7)
           ▼
┌─────────────────────┐
│   ARM Deployment    │
│     History API     │  ⚠️ 800 deployment limit
│   (10-18 seconds)   │  ⚠️ Fragile name matching
└─────────────────────┘
```

After (Proposed):

```
┌─────────────────────┐
│   Layer 1 Script    │
│  Deploys → Pushes   │
└─────────────────────┘
           │
           │ push outputs
           ▼
┌─────────────────────┐
│   Azure Key Vault   │
│  (Endpoint Store)   │
│      99.9% SLA      │
│    (1-2 seconds)    │
└─────────────────────┘
           ▲
           │ retrieve by name
        ┌──┴──────────────┬─────────────────┐
        │                 │                 │
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ Layer 2 Script│ │ Layer 3 Script│ │ Layer 4 Script│
│   Retrieves   │ │   Retrieves   │ │   Retrieves   │
└───────────────┘ └───────────────┘ └───────────────┘
```
Alternatives Considered
Alternative 1: Status Quo (Deployment History Queries)
Rejected: Retention risk, poor performance, fragile name matching.
Alternative 2: Azure App Configuration
Pros: Designed for application configuration, better query capabilities
Cons:
- Overkill for infrastructure outputs
- Higher cost (~$1.20/month minimum)
- Not designed for sensitive endpoints (resource IDs can be considered sensitive)
Verdict: Key Vault is more appropriate for infrastructure state.
Alternative 3: Git Repository Storage
Pros: Version controlled, easy to review
Cons:
- Security risk (endpoints in version control)
- Requires commit/push after every deployment (slower)
- Merge conflicts in parallel development
Verdict: Unacceptable security and operational complexity.
Alternative 4: Azure Storage Table/Blob
Pros: Simple key-value store, cheap
Cons:
- No built-in audit logging
- Less secure than Key Vault (no RBAC granularity)
- Need to build versioning ourselves
Verdict: Reinventing Key Vault features.
Alternative 5: Bicep Deployment Stacks (Preview)
Pros: Azure-native cross-deployment references
Cons:
- Preview feature, not GA yet (as of Oct 2023)
- Limited documentation
- Unknown SLA and retention policies
Verdict: Wait for GA, reassess in future ADR.
Implementation Plan
Phase 1: Infrastructure Setup (Week 1)
Tasks:
1. Add Key Vault to infrastructure/bicep/modules/security.bicep
   - Standard tier (sufficient for our needs)
   - RBAC authorization mode (modern approach)
   - Network integration with VNet service endpoints
   - Soft delete + purge protection enabled
2. Configure RBAC permissions in Layer 1 Bicep:

   ```bicep
   resource kvRoleAssignment 'Microsoft.Authorization/roleAssignments@2022-04-01' = {
     // Role assignments require a deterministic GUID name
     name: guid(keyVault.id, managedIdentity.id, 'secretsOfficer')
     scope: keyVault
     properties: {
       // Key Vault Secrets Officer built-in role
       roleDefinitionId: subscriptionResourceId('Microsoft.Authorization/roleDefinitions', 'b86a8fe4-44ce-4948-aee5-eccb2c155cd7')
       principalId: managedIdentity.properties.principalId
     }
   }
   ```

3. Create helper script infrastructure/scripts/helpers/keyvault-config.sh with:
   - push_endpoint_to_keyvault()
   - get_endpoint_from_keyvault()
   - get_all_layer_endpoints()
   - Consistent error handling
   - Network/permission validation
4. Document setup in infrastructure/README.md

Deliverables:
- [ ] Key Vault deployed in Layer 1
- [ ] RBAC configured for Managed Identity
- [ ] Helper script created and tested
- [ ] Documentation updated
Phase 2: Layer Script Updates (Week 1-2)
Update deployment scripts:
1. Layer 1: Add push_endpoint_to_keyvault after Bicep deployment
2. Layer 2: Replace deployment history queries with Key Vault reads
3. Layer 3: Same pattern as Layer 2
4. Layer 4: Replace 9 API calls with batch retrieval
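Layer 4's batch retrieval could be sketched roughly as follows: one `az keyvault secret list` call discovers every secret under the layer prefix, replacing the per-output deployment-history queries. The vault-name derivation and the prefix-stripping helper are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Sketch: replace N per-output history queries with one list call
# plus per-secret value reads.
set -euo pipefail

# Strip "{rg_prefix}-{layer}-" to recover the output name from a secret name
endpoint_output_name() {
  local rg_prefix="$1" layer="$2" secret_name="$3"
  printf '%s\n' "${secret_name#"${rg_prefix}-${layer}-"}"
}

# Emit "outputName=value" lines for every stored output of a layer
get_all_layer_endpoints() {
  local rg_prefix="$1" layer="$2" vault="${rg_prefix}-kv"
  local name
  # One API call enumerates all secret names for the layer
  for name in $(az keyvault secret list --vault-name "$vault" \
      --query "[?starts_with(name, '${rg_prefix}-${layer}-')].name" -o tsv); do
    printf '%s=%s\n' \
      "$(endpoint_output_name "$rg_prefix" "$layer" "$name")" \
      "$(az keyvault secret show --vault-name "$vault" --name "$name" \
           --query value -o tsv)"
  done
}
```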
Testing:
- Deploy all layers in dev environment
- Verify outputs stored in Key Vault
- Verify downstream layers retrieve correctly
- Test redeployment scenarios
Deliverables:
- [ ] All layer scripts updated
- [ ] Integration testing complete
- [ ] Performance benchmarks captured
Phase 3: Validation & Rollout (Week 2)
Tasks:
1. Deploy to test environment
2. Monitor Key Vault metrics (availability, latency)
3. Update CI/CD workflows if needed
4. Create troubleshooting runbook
Success Metrics (Track 60 Days):
- Deployment success rate: ≥95% (baseline: ~95%)
- Prerequisite query time: <3 seconds (baseline: 10-18s)
- Key Vault availability: ≥99.9%
- Deployment failures due to Key Vault: <1%
Deliverables:
- [ ] Test environment validated
- [ ] Monitoring configured
- [ ] Runbook created
- [ ] Team trained on new approach
Cost Analysis
Key Vault Costs (Standard Tier):

```
Base:               $0 (no charge for the vault itself)
Secret operations:  $0.03 per 10,000 operations

Monthly usage estimate (active development):
  Write ops: 4 layers × 10 outputs × 20 deploys = 800 ops
  Read ops:  40 retrievals × 20 deploys         = 800 ops
  Total:     1,600 operations
  Cost:      (1,600 / 10,000) × $0.03 = $0.0048/month

Annual cost: ~$0.06/year (6 cents)
```
Production Environment:
Comparison:
- Removed per ADR-032: $5-10/month (app secrets)
- Adding back: <$0.01/month (deployment outputs)
- Net Savings: Still ~$5-10/month vs original
ROI Calculation:

```
Additional cost:  $0.06/year
Time saved:       80% query time reduction
                  ~10 seconds × 240 deploys/year ≈ 40 minutes
Developer time:   40 min × $60/hr = $40/year
ROI:              ≈66,000%
```
Security Considerations
Access Control:
- Deployment scripts: Key Vault Secrets Officer (read/write)
- Container Apps: Key Vault Secrets User (read-only) if needed
- Developers: On-demand access only (troubleshooting)
Network Security:
- VNet service endpoint integration (already configured in Layer 1)
- Firewall rules allow deployment runner IPs
- No public internet access (Azure backbone only)

Audit Trail:
- All secret access logged in Azure Monitor
- Retention: 90 days default, configurable
- Alerts on anomalous access patterns

Data Sensitivity:
- Endpoint values: resource IDs, FQDNs (medium sensitivity)
- No PII, passwords, or API keys
- Appropriate for Key Vault storage
Monitoring & Operations
Key Metrics to Track:
- Key Vault availability (target: ≥99.9%)
- Secret operation latency (target: <500ms)
- Failed secret reads/writes (alert threshold: >5/hour)
- Secret age (alert if >90 days without update)

Dashboards:
- Azure Monitor workbook for Key Vault metrics
- Deployment performance trends (before/after comparison)
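A simple script-level gate on the availability target might look like the sketch below; the metric query (commented out) and vault name are illustrative assumptions, not a prescribed monitoring setup.

```shell
#!/usr/bin/env bash
# Sketch: gate on the Key Vault Availability metric from Azure Monitor.
set -euo pipefail

# Exit 0 when the reported availability meets the ≥99.9% target
availability_ok() {
  awk -v p="$1" 'BEGIN { exit !(p >= 99.9) }'
}

# Illustrative metric pull (needs a live subscription, so commented out):
# KV_ID=$(az keyvault show --name ldfdevnew-kv --query id -o tsv)
# PCT=$(az monitor metrics list --resource "$KV_ID" --metric Availability \
#         --interval PT1H --query 'value[0].timeseries[0].data[-1].average' -o tsv)
# availability_ok "$PCT" || echo "ALERT: Key Vault availability below target"
```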
Troubleshooting Tools:

```bash
# List all endpoints for a layer
./infrastructure/scripts/helpers/keyvault-config.sh list layer1

# Check Key Vault access
./infrastructure/scripts/helpers/keyvault-config.sh check-access

# View secret metadata
az keyvault secret show --vault-name X --name Y --query attributes
```
Documentation Updates
Files to Update:
- infrastructure/README.md - Add Key Vault endpoint storage section
- infrastructure/scripts/README.md - Document helper script usage
- AGENTS.md - Update deployment architecture notes
- docs/architecture/deployment-layers.md - Update diagrams
New Documentation:
- docs/infrastructure/keyvault-endpoint-storage.md - Detailed guide
- docs/troubleshooting/keyvault-deployment-issues.md - Runbook
References
Azure Documentation:
- Azure Key Vault Best Practices
- Azure Well-Architected Framework - Key Vault
- Bicep Deployment Outputs

Industry Patterns:
- Terraform Remote State Backend
- AWS Systems Manager Parameter Store
- Pulumi Stack References

Related ADRs:
- ADR-032: Removed Key Vault for app secrets (still valid)
- ADR-041: 4-layer architecture (dependency management challenge)
- ADR-047: Layer-specific RBAC (Key Vault permissions)
Decision Outcome
Status: Accepted
Rationale:
- Industry standard pattern for deployment output storage
- Solves real reliability and performance issues
- Minimal cost impact (<$0.01/month)
- Simple implementation (8 hours effort)
- Does not conflict with ADR-032 (different use case)
Next Steps:
1. Create GitHub issue for implementation (link to this ADR)
2. Implement Phase 1: Infrastructure setup
3. Implement Phase 2: Script updates
4. Implement Phase 3: Validation and rollout
5. Review after 60 days: Assess success metrics
Review Date: 2025-12-22 (60 days after implementation)
Supersedes: None (ADR-032 remains valid for application secrets)
Superseded By: None
Last Updated: 2025-10-23