ADR-047: Layer-Specific RBAC Architecture

Status: Accepted
Date: 2024-10-21
Authors: Infrastructure Team
Deciders: Architecture Team, Security Team

Context

The original RBAC implementation used a monolithic rbac.bicep module that mixed concerns from multiple deployment layers (Layer 1 foundation permissions and Layer 4 container app permissions). This approach caused critical timing issues:

Problems Identified

  1. Chicken-and-Egg Problem: Container Apps were created with system-assigned managed identities, then RBAC permissions were assigned post-creation. Container Apps tried to pull images immediately upon creation, but ACR Pull permissions hadn't propagated yet (Azure RBAC can take 5-10 minutes).

  2. Mixed Concerns: A single RBAC module handled:

     • Layer 1: Managed Identity → AI Services (Cognitive Services User)
     • Layer 1: Managed Identity → Storage Account (Storage Blob Data Contributor)
     • Layer 4: Container Apps → ACR (AcrPull)

  3. Deployment Failures: Container Apps failed with "Validation of container app creation/update timed out" because they couldn't pull images from ACR during provisioning.

  4. Multiple Role Assignments: Each container app (UI, API, 3 MCP servers) required a separate role assignment with system-assigned identities (5 role assignments total).

Decision

We will restructure RBAC into layer-specific modules that align with the 4-layer deployment architecture and use user-assigned managed identity for container apps.

New Architecture

infrastructure/bicep/modules/
├── rbac-layer1-foundation.bicep      # AI Services, Storage permissions
├── rbac-layer4-container-apps.bicep  # Container Apps identity + ACR permissions
└── [rbac.bicep REMOVED]              # Monolithic module deleted

Key Changes

1. Layer 1 RBAC (Fully Integrated)

Current Implementation: RBAC is handled within Layer 1 modules
Status: ✅ FULLY INTEGRATED - All permissions assigned

What's Working:

  • ✅ Managed Identity created in modules/security.bicep
  • ✅ Managed Identity → AI Services (Cognitive Services User, OpenAI User) - assigned in modules/ai-services.bicep
  • ✅ Managed Identity → Storage Account (Storage Blob Data Contributor) - assigned in modules/security.bicep

Note: RBAC is distributed across modules rather than centralized. See GitHub issue for future centralization.

2. Layer 4 RBAC Module (rbac-layer4-container-apps.bicep)

Purpose: Container Apps ACR access
Deployed: BEFORE container apps in Layer 4
Strategy:

  1. Creates a single user-assigned managed identity for ALL container apps
  2. Assigns AcrPull role to this identity
  3. Returns identity ID for container apps to reference

Key Innovation: Identity and permissions exist BEFORE apps try to pull images.
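
The strategy above can be sketched in Bicep roughly as follows. This is a minimal illustration, not the actual module: the parameter names and the identityId output mirror the Layer 4 call site shown under Deployment Changes, while the identity name suffix and the API versions are assumptions.

```bicep
// Sketch of modules/rbac-layer4-container-apps.bicep (illustrative only)
param deploymentPrefix string
param location string
param acrName string

// Built-in AcrPull role GUID (same value listed under "Azure Built-in Role IDs")
var acrPullRoleId = '7f951dda-4ed3-4680-a7ca-43fe172d538d'

// Reference the registry that already exists in the same resource group
resource acr 'Microsoft.ContainerRegistry/registries@2023-07-01' existing = {
  name: acrName
}

// Step 1: one shared identity for all container apps
// The '-container-apps-identity' suffix is a hypothetical naming convention
resource containerAppsIdentity 'Microsoft.ManagedIdentity/userAssignedIdentities@2023-01-31' = {
  name: '${deploymentPrefix}-container-apps-identity'
  location: location
}

// Step 2: grant AcrPull on the registry to that identity
resource acrPullAssignment 'Microsoft.Authorization/roleAssignments@2022-04-01' = {
  // Deterministic name so redeployments update rather than duplicate the assignment
  name: guid(acr.id, containerAppsIdentity.id, acrPullRoleId)
  scope: acr
  properties: {
    roleDefinitionId: subscriptionResourceId('Microsoft.Authorization/roleDefinitions', acrPullRoleId)
    principalId: containerAppsIdentity.properties.principalId
    principalType: 'ServicePrincipal'
  }
}

// Step 3: return the identity resource ID for container app modules to reference
output identityId string = containerAppsIdentity.id
```

Setting principalType to 'ServicePrincipal' is commonly recommended for identities created in the same deployment, because it helps avoid PrincipalNotFound errors while the new principal replicates in the directory.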

3. User-Assigned vs System-Assigned Identity

Decision: Use User-Assigned Managed Identity for Container Apps

Rationale (per Azure Well-Architected Framework):

  • ✅ Multiple resources (5 apps) need same permissions
  • ✅ Pre-authorization required before resource creation
  • ✅ Timing critical (RBAC must exist before image pull)
  • ✅ Reduced role assignments (1 instead of 5)
  • ✅ Simplified management (single identity lifecycle)

Microsoft Guidance:

"Use user-assigned managed identities when multiple resources need the same set of permissions, to reduce the number of role assignments needed." — Azure Well-Architected Framework - Security

Deployment Flow

Layer 4 Deployment Sequence:

1. rbac-layer4-container-apps module
   ├── Create user-assigned identity
   └── Assign AcrPull to ACR

2. Azure propagates RBAC (automatic)

3. uiContainerApp module (if enabled)
   └── Uses shared identity ID

4. apiContainerApp module (if enabled)
   └── Uses shared identity ID

5. mcpServerContainerApps module (if enabled)
   ├── verification (uses shared identity)
   ├── documents (uses shared identity)
   └── financial (uses shared identity)
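
Step 5 of the sequence above could be implemented with a module loop that hands the same identity ID to each MCP server. The sketch below assumes it lives in modules/container-apps-mcp-servers.bicep; apart from userAssignedIdentityId and acrLoginServer, which appear in the module pattern in this ADR, every parameter name here (including serverName) is hypothetical.

```bicep
// Sketch of modules/container-apps-mcp-servers.bicep (illustrative only)
param location string
param acrLoginServer string

@description('Resource ID of the shared user-assigned identity from the Layer 4 RBAC module')
param userAssignedIdentityId string

// The three MCP servers named in the deployment sequence
var mcpServerNames = [
  'verification'
  'documents'
  'financial'
]

// One child deployment per server; all of them reuse the same identity
module mcpServers 'container-app-mcp-server.bicep' = [for serverName in mcpServerNames: {
  name: 'mcp-${serverName}'
  params: {
    serverName: serverName // hypothetical parameter of the child module
    location: location
    acrLoginServer: acrLoginServer
    userAssignedIdentityId: userAssignedIdentityId
  }
}]
```

Because every child receives the identity ID as a parameter, no per-server role assignment is needed; the single AcrPull assignment on the shared identity covers all three.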

Container App Module Pattern

All container app modules now support both identity types (backward compatible):

@description('User-assigned managed identity resource ID for ACR access')
param userAssignedIdentityId string = ''

var useUserAssignedIdentity = !empty(userAssignedIdentityId)

resource containerApp 'Microsoft.App/containerApps@2024-03-01' = {
  // name, location, environment, and template settings omitted for brevity
  identity: useUserAssignedIdentity ? {
    type: 'UserAssigned'
    userAssignedIdentities: {
      '${userAssignedIdentityId}': {}
    }
  } : {
    type: 'SystemAssigned'  // Fallback
  }
  properties: {
    configuration: {
      registries: [
        {
          server: acrLoginServer
          identity: useUserAssignedIdentity ? userAssignedIdentityId : 'system'
        }
      ]
    }
  }
}

Azure Built-in Role IDs

Decision: Continue using hardcoded Azure built-in role GUIDs.

Rationale:

  • Azure built-in role IDs are universal constants maintained by Microsoft
  • Identical across ALL subscriptions, tenants, regions, and clouds
  • Documented at: https://learn.microsoft.com/en-us/azure/role-based-access-control/built-in-roles
  • Using role names requires dynamic lookup (unnecessary overhead)

Used Role IDs:

var acrPullRoleId = '7f951dda-4ed3-4680-a7ca-43fe172d538d'                    // AcrPull
var cognitiveServicesUserRoleId = 'a97b65f3-24c7-4388-baec-2e87135dc908'      // Cognitive Services User
var storageBlobDataContributorRoleId = 'ba92f5b4-2d11-453d-a403-e96b0029c9fe' // Storage Blob Data Contributor

Note: Custom roles (if created) would require dynamic lookup as their IDs are subscription-specific.
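
As an illustration of how one of these GUIDs is consumed, the sketch below expands the Storage Blob Data Contributor ID into a full role definition ID with subscriptionResourceId() and assigns it at storage account scope. The storageAccountName parameter and the API versions are assumptions. managedIdentityPrincipalId borrows its name from the legacy module call shown under Deployment Changes.

```bicep
// Illustrative only: Layer 1 style assignment using a hardcoded built-in role GUID
param storageAccountName string         // hypothetical parameter
param managedIdentityPrincipalId string // principal (object) ID of the managed identity

var storageBlobDataContributorRoleId = 'ba92f5b4-2d11-453d-a403-e96b0029c9fe'

resource storageAccount 'Microsoft.Storage/storageAccounts@2023-01-01' existing = {
  name: storageAccountName
}

resource blobContributorAssignment 'Microsoft.Authorization/roleAssignments@2022-04-01' = {
  // guid() of scope + principal + role gives a stable, idempotent assignment name
  name: guid(storageAccount.id, managedIdentityPrincipalId, storageBlobDataContributorRoleId)
  scope: storageAccount
  properties: {
    // The GUID alone is not enough; the API expects the full role definition resource ID
    roleDefinitionId: subscriptionResourceId('Microsoft.Authorization/roleDefinitions', storageBlobDataContributorRoleId)
    principalId: managedIdentityPrincipalId
    principalType: 'ServicePrincipal'
  }
}
```

No lookup call is involved: the role definition ID is assembled from the constant GUID, which is the overhead the hardcoded approach avoids.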

Consequences

Positive

  1. No More Timing Issues: Identity and permissions exist before container apps try to pull images
  2. Cleaner Architecture: Each layer manages its own RBAC concerns
  3. Reduced Complexity: 1 identity instead of 5, 1 role assignment instead of 5
  4. Faster Deployments: No waiting for system-assigned identity creation + RBAC propagation
  5. Better Separation of Concerns: Layer-specific modules align with deployment boundaries
  6. Production Ready: Follows Azure best practices and Well-Architected Framework
  7. Backward Compatible: Modules still support system-assigned identity as fallback
  8. Simplified Auditing: Single identity to audit for all container app ACR access

Neutral

  1. ⚠️ Additional Resource: Creates one user-assigned managed identity (minimal cost)
  2. ⚠️ Module Count: Two RBAC modules instead of one (but better organized)
  3. ⚠️ Learning Curve: Team needs to understand user-assigned vs system-assigned identity patterns

Negative

  1. Breaking Change: Existing deployments with system-assigned identities need migration
  2. Manual Cleanup: Old system-assigned identity role assignments may need manual removal

Migration Path

For existing deployments:

Option A: Clean Redeployment (Recommended)

# Delete existing container apps
az containerapp delete --name ldfdev-api-dev --resource-group ldfdev-rg --yes
az containerapp delete --name ldfdev-mcp-verification --resource-group ldfdev-rg --yes
az containerapp delete --name ldfdev-mcp-documents --resource-group ldfdev-rg --yes
az containerapp delete --name ldfdev-mcp-financial --resource-group ldfdev-rg --yes

# Redeploy with new architecture
./infrastructure/scripts/deploy-layer4.sh dev

Option B: In-Place Update

# Bicep will update identity configuration
./infrastructure/scripts/deploy-layer4.sh dev

# Manually clean up old role assignments
az role assignment list --scope <acr-id> -o table
az role assignment delete --ids <old-system-identity-assignment-ids>

Compliance

Azure Well-Architected Framework

| Pillar | Alignment |
|---|---|
| Security | ✅ Follows managed identity best practices |
| Operational Excellence | ✅ Simplified management, reduced role assignments |
| Performance Efficiency | ✅ Faster deployments, no RBAC propagation delays |
| Cost Optimization | ✅ Reduced role assignment overhead |
| Reliability | ✅ Eliminates timing-related deployment failures |

Microsoft Best Practices

| Practice | Status |
|---|---|
| Use managed identities over service principals | ✅ Implemented |
| Use user-assigned for multiple resources | ✅ Implemented |
| Minimize role assignments | ✅ Implemented |
| Assign least privilege | ✅ Implemented (AcrPull only) |
| Layer-specific security boundaries | ✅ Implemented |

Implementation

Files Modified

| File | Status | Purpose |
|---|---|---|
| modules/rbac.bicep | DELETED | Monolithic module removed |
| modules/rbac-layer1-foundation.bicep | NEW | Foundation permissions |
| modules/rbac-layer4-container-apps.bicep | NEW | Container apps identity + ACR |
| layer4-apps.bicep | ✅ Modified | Uses new RBAC module |
| modules/container-app-ui.bicep | ✅ Modified | User-assigned identity support |
| modules/container-app-api.bicep | ✅ Modified | User-assigned identity support |
| modules/container-app-mcp-server.bicep | ✅ Modified | User-assigned identity support |
| modules/container-apps-mcp-servers.bicep | ✅ Modified | Passes identity to children |

Deployment Changes

// Before: Monolithic RBAC after deployment
module rbac 'modules/rbac.bicep' = {
  params: {
    managedIdentityPrincipalId: managedIdentityId
    aiServicesId: aiServicesId
    storageAccountId: storageAccountId
    acrId: acrId
    mcpServerPrincipalIds: {
      verification: mcpVerification.outputs.principalId  // System-assigned
      documents: mcpDocuments.outputs.principalId
      financial: mcpFinancial.outputs.principalId
    }
  }
  dependsOn: [
    apiContainerApp      // ← Apps created FIRST
    mcpContainerApps     // ← Then RBAC assigned (too late!)
  ]
}

// After: Layer-specific RBAC before deployment
module containerAppsRbac 'modules/rbac-layer4-container-apps.bicep' = {
  params: {
    deploymentPrefix: deploymentPrefix
    location: location
    acrName: '${deploymentPrefix}acr'
  }
}  // ← Identity + RBAC created FIRST

module apiContainerApp 'modules/container-app-api.bicep' = {
  params: {
    userAssignedIdentityId: containerAppsRbac.outputs.identityId  // ← Use pre-configured identity
  }
  dependsOn: [
    containerAppsRbac  // ← Wait for RBAC module (already implied by the outputs reference above; kept explicit for clarity)
  ]
}

Alternatives Considered

Alternative 1: Keep Monolithic RBAC, Use ACR Admin Credentials

Approach: Use ACR admin username/password instead of managed identity.

Rejected Because:

  • ❌ Less secure (credentials in Key Vault)
  • ❌ Requires credential rotation
  • ❌ Not aligned with Azure best practices
  • ❌ Doesn't address timing issue fundamentally

Alternative 2: Add Retry Logic and Delays

Approach: Add sleep/retry logic in deployment scripts to wait for RBAC propagation.

Rejected Because:

  • ❌ Unreliable (RBAC propagation time varies)
  • ❌ Slower deployments (waiting 5-10 minutes unnecessarily)
  • ❌ Doesn't fix root cause
  • ❌ Creates fragile deployment process

Alternative 3: Pre-create All System-Assigned Identities

Approach: Create container apps with minimal configuration first, assign RBAC, then update.

Rejected Because:

  • ❌ Two-phase deployment (complex)
  • ❌ Still requires RBAC propagation wait
  • ❌ More role assignments to manage (5 instead of 1)
  • ❌ System-assigned identities deleted if app deleted

References

Microsoft Documentation

  1. Azure Container Apps - Managed Identity
  2. Azure Built-in Roles
  3. Azure Well-Architected Framework - Security
  4. RBAC Best Practices
  5. User-Assigned Managed Identities

Related ADRs

  • ADR-032: Key Vault Removal (simplified infrastructure)
  • ADR-043: ACR Task Builds and Deployment Validation
  • ADR-045: Intelligent Container Image Validation

Internal Documentation

  • /temp/RBAC-RESTRUCTURING.md - Detailed technical implementation
  • /temp/QUESTIONS-ANSWERED.md - FAQ on design decisions
  • /temp/DEPLOYMENT-STATUS.md - Current deployment status
  • /docs/architecture/security.md - Security architecture overview
  • /docs/deployment/rbac-setup.md - RBAC deployment guide

Decision Tracking

Decided: 2024-10-21
Implemented: 2024-10-21
Verified: Pending (next deployment)

Success Metrics

  • Container apps deploy without "validation timeout" errors
  • User-assigned identity created before container apps
  • ACR Pull role assigned before image pull attempts
  • All container apps successfully pull images on first try
  • Zero manual RBAC assignments needed post-deployment
  • Deployment time reduced by 5-10 minutes (no RBAC wait)

Review Criteria

This ADR should be reviewed if:

  1. Azure changes managed identity best practices
  2. Container Apps ACR integration changes
  3. We need to support cross-tenant deployments
  4. Custom roles are introduced requiring dynamic lookup
  5. Additional platform services require ACR access


Supersedes: None (new pattern)
Superseded by: None
Status: ✅ Accepted and Implemented