ADR-049: Container Deployment Platform - ACI vs Container Apps with VNet Integration

Status: Accepted
Date: 2025-10-24
Deciders: Infrastructure Team, Development Team
Technical Story: Container Apps VNet integration provisioning blocked; evaluating Azure Container Instances as alternative

Context and Problem Statement

We need to deploy containerized applications (API, UI, and 3 MCP servers) that must:

  1. Access the Azure AI Foundry service, which is secured with private endpoints in a VNet
  2. Communicate with each other as part of a microservices architecture
  3. Deploy reliably without multi-hour provisioning delays
  4. Support future production requirements (scaling, zero-downtime updates)

Current Blocker: Azure Container Apps with VNet integration + workload profiles has been stuck in "Waiting" provisioning state for 30+ minutes across multiple deployment attempts, despite:

  • Adding required workload profiles for new VNet integration method
  • Moving ACR private endpoint to separate subnet (dedicated Container Apps subnet requirement)
  • Adding subnet delegation for Microsoft.App/environments
  • Configuring all required NSG rules
  • Building and pushing images successfully

Decision Drivers

  • Blocker Severity: Cannot proceed with Container Apps after hours of troubleshooting
  • VNet Requirement: AI Foundry has publicNetworkAccess: Disabled - containers MUST be in VNet
  • Time to Deploy: Need working deployment to unblock development
  • Container Communication: API needs to call 3 MCP servers
  • Production Readiness: Future need for auto-scaling, HA, zero-downtime updates
  • Maintainability: Service discovery, deployment complexity

Research Findings

Azure Container Instances (ACI) with VNet

✅ Capabilities Confirmed

  1. VNet Integration
     • Delegates subnet to container groups
     • Containers get private IPs from VNet subnet
     • Quote from Microsoft Docs: "By deploying container groups into an Azure virtual network, your containers can communicate securely with other resources in the virtual network."

  2. Private Endpoint Access
     • Quote from Microsoft Docs: "Containers deployed to a virtual network can access resources secured with private endpoints in the VNet."
     • Confirmed: ACI CAN access AI Foundry with private endpoint ✅

  3. Container-to-Container Communication
     • Within Same Container Group: Containers share IP, communicate via localhost on any port
     • Between Container Groups: Must use private IPs or DNS
     • Each container group gets its own private IP from delegated subnet
     • Container Group A can reach Container Group B via: http://10.1.0.x:port

  4. Deployment Speed
     • Provisions in seconds (not 30+ minutes)
     • No "environment" provisioning needed
     • Simple, reliable

❌ Limitations Identified

  1. No Auto-Scaling
     • Manual scaling only
     • Must create additional container groups manually

  2. No Built-in Service Discovery
     • No automatic DNS between container groups
     • Must use: hardcoded IPs, environment variables, or Azure Private DNS Zone

  3. No Built-in Load Balancing
     • Single instance per container group
     • Must add Application Gateway or Front Door for HA

  4. No Revision Management
     • No traffic splitting
     • No automatic blue-green deployments
     • Updates require container restart (10-30 sec downtime)

  5. Lifecycle Coupling
     • All containers in same group restart together
     • Cannot scale/update individual containers in a group
     • Implication: Must use separate container groups for each service

Azure Container Apps with VNet

✅ Superior Features

  1. Auto-scaling (0-N replicas based on load)
  2. Built-in load balancing across replicas
  3. Automatic service discovery between apps
  4. Revision management with traffic splitting
  5. Zero-downtime deployments
  6. Health probes and automatic retries

❌ Current Blocker

  • VNet + workload profiles provisioning stuck in "Waiting" state
  • Provisioning attempts fail after 30+ minutes
  • Platform issue, not configuration issue (all requirements met)
  • Unknown resolution timeline

Considered Options

Option 1: Azure Container Instances with Separate Container Groups + Private DNS

Architecture:

VNet (10.1.0.0/16) with Private DNS Zone: loan-defenders.internal
  └── ACI Subnet (10.1.0.0/24)
      ├── API Container Group → api.loan-defenders.internal (10.1.0.4)
      ├── UI Container Group → ui.loan-defenders.internal (10.1.0.5)
      ├── MCP Verification → mcp-verification.loan-defenders.internal (10.1.0.6)
      ├── MCP Documents → mcp-documents.loan-defenders.internal (10.1.0.7)
      └── MCP Financial → mcp-financial.loan-defenders.internal (10.1.0.8)
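The address plan above can be sanity-checked before deployment. The following is a minimal sketch, assuming the addresses shown in the diagram are the ones actually assigned; the function name `validate_address_plan` is hypothetical and not part of any deployment script.

```python
import ipaddress

# Ranges and addresses copied from the architecture above.
VNET = ipaddress.ip_network("10.1.0.0/16")
ACI_SUBNET = ipaddress.ip_network("10.1.0.0/24")

PLANNED_IPS = {
    "api": "10.1.0.4",
    "ui": "10.1.0.5",
    "mcp-verification": "10.1.0.6",
    "mcp-documents": "10.1.0.7",
    "mcp-financial": "10.1.0.8",
}


def validate_address_plan(vnet, subnet, planned):
    """Return a list of problems found in the address plan (empty if none)."""
    problems = []
    if not subnet.subnet_of(vnet):
        problems.append(f"{subnet} is not inside {vnet}")
    seen = {}
    for name, ip_text in planned.items():
        ip = ipaddress.ip_address(ip_text)
        if ip not in subnet:
            problems.append(f"{name} ({ip}) is outside {subnet}")
        if ip in seen:
            problems.append(f"{name} and {seen[ip]} share {ip}")
        seen[ip] = name
    return problems


if __name__ == "__main__":
    issues = validate_address_plan(VNET, ACI_SUBNET, PLANNED_IPS)
    print("address plan OK" if not issues else "\n".join(issues))
```

An empty result only confirms that the plan is internally consistent; it does not confirm what Azure actually assigned to each container group.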

Service Discovery:

API Environment Variables:
  MCP_VERIFICATION_URL: http://mcp-verification.loan-defenders.internal:8010
  MCP_DOCUMENTS_URL: http://mcp-documents.loan-defenders.internal:8011
  MCP_FINANCIAL_URL: http://mcp-financial.loan-defenders.internal:8012
  AI_ENDPOINT: https://ldftest2-test-ai.cognitiveservices.azure.com/
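On the API side, these variables could be read and validated once at startup so that a missing or malformed URL fails fast instead of surfacing on the first MCP call. This is a sketch only: the variable names come from the list above, while `load_endpoints` is a hypothetical helper, not the API's actual configuration code.

```python
import os
from urllib.parse import urlparse

# Variable names taken from the service discovery configuration above.
REQUIRED_VARS = (
    "MCP_VERIFICATION_URL",
    "MCP_DOCUMENTS_URL",
    "MCP_FINANCIAL_URL",
    "AI_ENDPOINT",
)


def load_endpoints(env=None):
    """Read the required endpoint URLs from env, raising ValueError on problems."""
    env = os.environ if env is None else env
    endpoints = {}
    missing = []
    for name in REQUIRED_VARS:
        value = env.get(name, "").strip()
        if not value:
            missing.append(name)
            continue
        parsed = urlparse(value)
        if parsed.scheme not in ("http", "https") or not parsed.hostname:
            raise ValueError(f"{name} is not a valid http(s) URL: {value!r}")
        endpoints[name] = value
    if missing:
        raise ValueError("missing environment variables: " + ", ".join(missing))
    return endpoints
```

Passing a plain dictionary as `env` makes the helper easy to exercise without touching the real process environment.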

Pros:

  • ✅ Deploys in seconds (proven working)
  • ✅ Can access AI Foundry private endpoint
  • ✅ Container-to-container communication via DNS
  • ✅ Each service has an independent lifecycle
  • ✅ Unblocks development immediately
  • ✅ Simpler architecture (fewer moving parts)
  • ✅ Lower cost for always-on workloads

Cons:

  • ❌ No auto-scaling
  • ❌ Manual service discovery setup (Private DNS Zone)
  • ❌ No built-in health probes between services
  • ❌ 10-30 second downtime during updates
  • ❌ Need Application Gateway for production HA

Option 2: Wait for Container Apps VNet to Provision

Pros:

  • ✅ Better production features (auto-scale, HA, zero-downtime)
  • ✅ Built-in service discovery
  • ✅ Revision management

Cons:

  • ❌ Currently blocked (30+ min "Waiting" state)
  • ❌ Unknown resolution timeline
  • ❌ Blocks all development work
  • ❌ May be regional/subscription-specific Azure platform issue

Option 3: Enable Public Network on AI Foundry + Container Apps without VNet

Pros:

  • ✅ Container Apps would deploy immediately (no VNet complexity)
  • ✅ All Container Apps production features available

Cons:

  • ❌ Breaks security architecture (AI Foundry exposed publicly)
  • ❌ Defeats purpose of private endpoints
  • ❌ Not acceptable for production

Option 4: Azure Kubernetes Service (AKS)

Pros:

  • ✅ Full control
  • ✅ VNet integration very stable
  • ✅ Production-grade

Cons:

  • ❌ Much more complex
  • ❌ Higher operational overhead
  • ❌ Overkill for current requirements
  • ❌ Time to set up and learn

Decision Outcome

Chosen Option: Azure Container Instances - Single Container Group (All Services Together)

Reasoning:

  1. Unblocks Development: After hours of troubleshooting Container Apps, ACI provides immediate path forward
  2. Simplest Architecture: All containers in one group communicate via localhost - no service discovery needed
  3. Meets Core Requirements:
     • ✅ VNet integration for AI Foundry private endpoint access
     • ✅ Container-to-container communication via localhost
     • ✅ Security requirements met
  4. Acceptable for Dev/Test: All containers restart together during deployment (acceptable for non-production)
  5. Migration Path: Can migrate to Container Apps or separate container groups later for production
  6. Pragmatic: Get working deployment now, optimize later

Implementation Plan

Phase 1: Initial Deployment (This Week)

  1. Deploy single container group with all services:
     • API container (port 8000)
     • UI container (port 80)
     • MCP Verification server (port 8010)
     • MCP Documents server (port 8011)
     • MCP Financial server (port 8012)
  2. Configure API to call MCP servers via localhost
  3. Test end-to-end connectivity
  4. No DNS setup needed (localhost communication)
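Because containers in one group share an IP, each service needs its own port, and the API can reach the MCP servers at fixed localhost URLs. The sketch below shows one way to derive those URLs and probe them from inside the group. The port numbers are the ones listed above; the helper names `localhost_url` and `port_is_open` are hypothetical.

```python
import socket

# Ports as listed in the Phase 1 plan; all must be distinct because the
# containers in a single group share one network address.
SERVICE_PORTS = {
    "api": 8000,
    "ui": 80,
    "mcp-verification": 8010,
    "mcp-documents": 8011,
    "mcp-financial": 8012,
}


def localhost_url(service):
    """Build the in-group URL for a service, e.g. http://localhost:8010."""
    return f"http://localhost:{SERVICE_PORTS[service]}"


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    for name, port in SERVICE_PORTS.items():
        state = "open" if port_is_open("localhost", port) else "closed"
        print(f"{name:18s} {localhost_url(name):26s} {state}")
```

A TCP connect only shows that something is listening on the port; it does not confirm that the service behind it is healthy.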

Phase 2: Production Hardening (Next Sprint)

  1. Add Application Gateway for HA and zero-downtime updates
  2. Implement health checks
  3. Set up monitoring and alerts
  4. Document deployment and update procedures

Phase 3: Future Migration (When Ready)

  1. Monitor Container Apps VNet integration fixes
  2. When resolved, migrate to Container Apps
  3. Keep ACI deployment scripts as backup/DR option

Positive Consequences

  • ✅ Immediate deployment capability
  • ✅ VNet security maintained
  • ✅ AI Foundry private endpoint access working
  • ✅ Simpler architecture (easier to debug)
  • ✅ Lower cost initially
  • ✅ Migration path to Container Apps preserved

Negative Consequences

  • ❌ Manual service discovery (mitigated by Private DNS)
  • ❌ No auto-scaling (can add manually if needed)
  • ❌ Need to build additional tooling for zero-downtime updates
  • ❌ Less "cloud-native" than Container Apps

Mitigation Strategies

For Missing Auto-Scaling:

  • Monitor usage patterns
  • Create additional container groups manually if load increases
  • Plan Container Apps migration when scaling becomes critical

For Service Discovery:

  • Use Azure Private DNS Zone (reliable, Azure-managed)
  • Document DNS naming convention
  • Automate DNS record creation in deployment scripts

For Zero-Downtime Updates:

  • Short-term: Accept 10-30 second downtime (dev/test acceptable)
  • Medium-term: Add Application Gateway with blue-green backend pools
  • Long-term: Migrate to Container Apps

Validation

Success Criteria

  1. ✅ API container can access AI Foundry private endpoint
  2. ✅ API container can call all 3 MCP servers
  3. ✅ MCP servers can be independently updated
  4. ✅ Deployment completes in < 5 minutes
  5. ✅ Total cost < Container Apps cost

Testing Plan

  1. Deploy test MCP server, verify private IP assignment
  2. Deploy test API, verify it can reach MCP via DNS
  3. Test AI Foundry access from container
  4. Measure deployment time
  5. Test update procedure (redeploy with new image)
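The steps above could be wrapped in a small runner that records pass/fail and elapsed time for each step, then compares the total against the 5-minute target from the success criteria. This is only a sketch: `run_checks` and `within_budget` are hypothetical names, and the real checks (deployment, DNS lookup, AI Foundry call) would need to be supplied as callables.

```python
import time


def run_checks(checks):
    """Run named checks in order; return a list of (name, passed, seconds)."""
    results = []
    for name, check in checks.items():
        start = time.monotonic()
        try:
            passed = bool(check())
        except Exception:
            # A check that raises is treated as a failure, not a crash.
            passed = False
        results.append((name, passed, time.monotonic() - start))
    return results


def within_budget(results, budget_seconds=300.0):
    """True if every check passed and the total time is under the budget."""
    all_passed = all(passed for _, passed, _ in results)
    total = sum(seconds for _, _, seconds in results)
    return all_passed and total < budget_seconds


if __name__ == "__main__":
    # Placeholder checks; replace the lambdas with real test steps.
    demo = run_checks({"placeholder check": lambda: True})
    for name, passed, seconds in demo:
        print(f"{'PASS' if passed else 'FAIL'}  {name}  ({seconds:.2f}s)")
    print("within budget:", within_budget(demo))
```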

References

  • ADR-001: Multi-Agent Architecture Design
  • ADR-014: Azure AI Foundry for Multi-Agent Orchestration
  • ADR-027: Standardized VNet Architecture with Private Endpoints

Appendix A: Container Apps VNet Troubleshooting History

Attempts Made:

  1. Initial deployment with internal: true - Failed (validation timeout)
  2. Changed to internal: false - Failed (validation timeout)
  3. Added workload profiles (required for new VNet method) - Stuck "Waiting" 30+ min
  4. Moved ACR PE to separate subnet - Stuck "Waiting"
  5. Added subnet delegation - Stuck "Waiting"
  6. Added NSG rules for MCR, AzureFrontDoor, CognitiveServices, AzureML, Storage - Stuck "Waiting"
  7. Tried ACR admin credentials instead of managed identity - Failed (environment not provisioned)

Conclusion: Platform issue, not configuration issue. All requirements met per Microsoft documentation.

Appendix B: Cost Comparison

Azure Container Instances (estimated):

  • API: 0.5 vCPU + 1GB RAM = ~$35/month
  • UI: 0.25 vCPU + 0.5GB RAM = ~$15/month
  • MCP servers (3x): 0.25 vCPU + 0.5GB RAM each = ~$45/month total (~$15 each)
  • Total: ~$95/month

Azure Container Apps (estimated):

  • Similar resource allocation
  • Consumption plan with always-on = ~$100/month
  • Total: ~$100/month

Costs are similar, so cost is not a primary decision factor.
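The arithmetic behind the comparison can be reproduced in a few lines. The figures are the rough estimates from this appendix, not quoted Azure prices; splitting the ~$45 MCP figure evenly into $15 per server is an assumption, and `compare_costs` is a hypothetical helper.

```python
# Estimated monthly costs in USD, taken from the appendix above.
# Assumption: the ~$45 for three MCP servers is split evenly at $15 each.
ACI_MONTHLY_USD = {
    "api": 35,
    "ui": 15,
    "mcp-verification": 15,
    "mcp-documents": 15,
    "mcp-financial": 15,
}
CONTAINER_APPS_MONTHLY_USD = 100


def compare_costs(aci_costs, container_apps_total):
    """Return (aci_total, absolute_saving, saving_percent) for the estimates."""
    aci_total = sum(aci_costs.values())
    saving = container_apps_total - aci_total
    saving_percent = 100 * saving / container_apps_total
    return aci_total, saving, saving_percent


if __name__ == "__main__":
    total, saving, percent = compare_costs(ACI_MONTHLY_USD, CONTAINER_APPS_MONTHLY_USD)
    print(f"ACI total: ~${total}/month, saving ~${saving} ({percent:.0f}%)")
```

With these estimates the difference is about $5 per month, or 5% of the Container Apps figure, which is within the uncertainty of the estimates themselves.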

Appendix C: Architecture Diagram

┌─────────────────────────────────────────────────────────────────────────┐
│                        Azure Resource Group                             │
│                                                                          │
│  ┌────────────────────────────────────────────────────────────────┐    │
│  │ VNet: ldftest2-test-vnet (10.1.0.0/16)                         │    │
│  │                                                                 │    │
│  │  ┌───────────────────────────────────────────────────────┐     │    │
│  │  │ ACI Subnet (10.1.0.0/24)                              │     │    │
│  │  │                                                        │     │    │
│  │  │  ┌──────────────────────────────────────────────┐     │     │    │
│  │  │  │ Container Groups (Each with private IP)     │     │     │    │
│  │  │  │                                              │     │     │    │
│  │  │  │  • API (10.1.0.4)                           │     │     │    │
│  │  │  │  • UI (10.1.0.5)                            │     │     │    │
│  │  │  │  • MCP Verification (10.1.0.6)              │     │     │    │
│  │  │  │  • MCP Documents (10.1.0.7)                 │     │     │    │
│  │  │  │  • MCP Financial (10.1.0.8)                 │     │     │    │
│  │  │  └──────────────────────────────────────────────┘     │     │    │
│  │  │                                                        │     │    │
│  │  └───────────────────────────────────────────────────────┘     │    │
│  │                                                                 │    │
│  │  ┌───────────────────────────────────────────────────────┐     │    │
│  │  │ Data Subnet (10.1.3.0/24)                             │     │    │
│  │  │                                                        │     │    │
│  │  │  Private Endpoints:                                   │     │    │
│  │  │  • AI Foundry (ldftest2-test-ai)                     │     │    │
│  │  │  • ACR (ldftest2acr)                                 │     │    │
│  │  └───────────────────────────────────────────────────────┘     │    │
│  │                                                                 │    │
│  └─────────────────────────────────────────────────────────────────┘    │
│                                                                          │
│  ┌─────────────────────────────────────────────────────────────┐        │
│  │ Private DNS Zone: loan-defenders.internal                   │        │
│  │                                                              │        │
│  │  A Records:                                                 │        │
│  │  • api.loan-defenders.internal → 10.1.0.4                  │        │
│  │  • mcp-verification.loan-defenders.internal → 10.1.0.6     │        │
│  │  • mcp-documents.loan-defenders.internal → 10.1.0.7        │        │
│  │  • mcp-financial.loan-defenders.internal → 10.1.0.8        │        │
│  └─────────────────────────────────────────────────────────────┘        │
└─────────────────────────────────────────────────────────────────────────┘

Communication Flow:
1. API → AI Foundry: Via private endpoint (10.1.3.x)
2. API → MCP Servers: Via DNS (mcp-*.loan-defenders.internal)
3. UI → API: Via private IP or public endpoint

Implementation Status: This ADR has been accepted and implementation is in progress via the comprehensive migration guide. ACI provides immediate deployment capability while maintaining VNet security requirements. Migration to Container Apps remains a future option once Azure platform issues are resolved.