ADR-049: Container Deployment Platform - ACI vs Container Apps with VNet Integration
Status: Accepted
Date: 2025-10-24
Deciders: Infrastructure Team, Development Team
Technical Story: Container Apps VNet integration provisioning blocked; evaluating Azure Container Instances as alternative
Context and Problem Statement
We need to deploy containerized applications (API, UI, and 3 MCP servers) that must:
1. Access Azure AI Foundry service secured with private endpoints in a VNet
2. Communicate with each other for microservices architecture
3. Deploy reliably without multi-hour provisioning delays
4. Support future production requirements (scaling, zero-downtime updates)
Current Blocker: Azure Container Apps with VNet integration + workload profiles has been stuck in "Waiting" provisioning state for 30+ minutes across multiple deployment attempts, despite:
- Adding required workload profiles for new VNet integration method
- Moving ACR private endpoint to separate subnet (dedicated Container Apps subnet requirement)
- Adding subnet delegation for Microsoft.App/environments
- Configuring all required NSG rules
- Building and pushing images successfully
Decision Drivers
- Blocker Severity: Cannot proceed with Container Apps after hours of troubleshooting
- VNet Requirement: AI Foundry has publicNetworkAccess: Disabled, so containers MUST be in VNet
- Time to Deploy: Need working deployment to unblock development
- Container Communication: API needs to call 3 MCP servers
- Production Readiness: Future need for auto-scaling, HA, zero-downtime updates
- Maintainability: Service discovery, deployment complexity
Research Findings
Azure Container Instances (ACI) with VNet
✅ Capabilities Confirmed
- VNet Integration
  - Delegates subnet to container groups
  - Containers get private IPs from VNet subnet
  - Quote from Microsoft Docs: "By deploying container groups into an Azure virtual network, your containers can communicate securely with other resources in the virtual network."
- Private Endpoint Access
  - Quote from Microsoft Docs: "Containers deployed to a virtual network can access resources secured with private endpoints in the VNet."
  - Confirmed: ACI CAN access AI Foundry with private endpoint ✅
- Container-to-Container Communication
  - Within Same Container Group: Containers share IP, communicate via localhost on any port
  - Between Container Groups: Must use private IPs or DNS
  - Each container group gets its own private IP from delegated subnet
  - Container Group A can reach Container Group B via: http://10.1.0.x:port
- Deployment Speed
  - Provisions in seconds (not 30+ minutes)
  - No "environment" provisioning needed
  - Simple, reliable
❌ Limitations Identified
- No Auto-Scaling
  - Manual scaling only
  - Must create additional container groups manually
- No Built-in Service Discovery
  - No automatic DNS between container groups
  - Must use: hardcoded IPs, environment variables, or Azure Private DNS Zone
- No Built-in Load Balancing
  - Single instance per container group
  - Must add Application Gateway or Front Door for HA
- No Revision Management
  - No traffic splitting
  - No automatic blue-green deployments
  - Updates require container restart (10-30 sec downtime)
- Lifecycle Coupling
  - All containers in same group restart together
  - Cannot scale/update individual containers in a group
  - Implication: Must use separate container groups for each service
Azure Container Apps with VNet
✅ Superior Features
- Auto-scaling (0-N replicas based on load)
- Built-in load balancing across replicas
- Automatic service discovery between apps
- Revision management with traffic splitting
- Zero-downtime deployments
- Health probes and automatic retries
❌ Current Blocker
- VNet + workload profiles provisioning stuck in "Waiting" state
- Provisioning attempts stall for 30+ minutes or fail outright
- Platform issue, not configuration issue (all requirements met)
- Unknown resolution timeline
Considered Options
Option 1: Azure Container Instances with Separate Container Groups + Private DNS
Architecture:
VNet (10.1.0.0/16) with Private DNS Zone: loan-defenders.internal
└── ACI Subnet (10.1.0.0/24)
    ├── API Container Group → api.loan-defenders.internal (10.1.0.4)
    ├── UI Container Group → ui.loan-defenders.internal (10.1.0.5)
    ├── MCP Verification → mcp-verification.loan-defenders.internal (10.1.0.6)
    ├── MCP Documents → mcp-documents.loan-defenders.internal (10.1.0.7)
    └── MCP Financial → mcp-financial.loan-defenders.internal (10.1.0.8)
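The addressing plan above can be sanity-checked before deployment. The following Python sketch is illustrative only and not part of the project's tooling: it uses the standard ipaddress module to confirm that each planned address lies inside the ACI subnet and avoids the five addresses Azure reserves in every subnet (the first four and the last), which is why the plan starts at 10.1.0.4. The function names are hypothetical. ACI assigns private IPs at deployment time, so the addresses listed here are expected values, not guarantees.

```python
import ipaddress

# Values copied from the Option 1 architecture.
ACI_SUBNET = ipaddress.ip_network("10.1.0.0/24")
PLANNED_IPS = {
    "api": "10.1.0.4",
    "ui": "10.1.0.5",
    "mcp-verification": "10.1.0.6",
    "mcp-documents": "10.1.0.7",
    "mcp-financial": "10.1.0.8",
}


def azure_reserved(subnet):
    """Addresses Azure reserves in every subnet: the first four and the last."""
    return {subnet[0], subnet[1], subnet[2], subnet[3], subnet[-1]}


def check_plan(subnet, plan):
    """Return a list of human-readable problems; an empty list means the plan is consistent."""
    problems = []
    reserved = azure_reserved(subnet)
    seen = {}
    for name, raw in plan.items():
        ip = ipaddress.ip_address(raw)
        if ip not in subnet:
            problems.append(f"{name}: {ip} is outside {subnet}")
        elif ip in reserved:
            problems.append(f"{name}: {ip} is reserved by Azure")
        if ip in seen:
            problems.append(f"{name}: {ip} is already planned for {seen[ip]}")
        else:
            seen[ip] = name
    return problems


if __name__ == "__main__":
    for problem in check_plan(ACI_SUBNET, PLANNED_IPS):
        print(problem)
```

An empty result only shows that the plan is internally consistent; the IPs actually assigned still need to be read back from each container group after deployment.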
Service Discovery:
API Environment Variables:
MCP_VERIFICATION_URL: http://mcp-verification.loan-defenders.internal:8010
MCP_DOCUMENTS_URL: http://mcp-documents.loan-defenders.internal:8011
MCP_FINANCIAL_URL: http://mcp-financial.loan-defenders.internal:8012
AI_ENDPOINT: https://ldftest2-test-ai.cognitiveservices.azure.com/
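As a rough illustration of how the API could consume these variables, the sketch below reads them at startup and fails fast if one is missing or malformed. The variable names come from the list above; the function name load_service_urls and the validation rules are assumptions, not the project's actual code.

```python
import os
from urllib.parse import urlparse

# Environment variable names taken from the service discovery list.
REQUIRED_VARS = (
    "MCP_VERIFICATION_URL",
    "MCP_DOCUMENTS_URL",
    "MCP_FINANCIAL_URL",
    "AI_ENDPOINT",
)


def load_service_urls(env=None):
    """Read the service URLs from the environment and validate their shape."""
    env = os.environ if env is None else env
    urls = {}
    missing = []
    for name in REQUIRED_VARS:
        value = env.get(name, "").strip()
        if not value:
            missing.append(name)
            continue
        parsed = urlparse(value)
        # Require an explicit http(s) scheme and a hostname.
        if parsed.scheme not in ("http", "https") or not parsed.hostname:
            raise ValueError(f"{name} is not a valid http(s) URL: {value!r}")
        urls[name] = value
    if missing:
        raise ValueError("missing environment variables: " + ", ".join(missing))
    return urls
```

Failing at startup surfaces a misconfigured container group immediately instead of on the first request to an MCP server.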
Pros:
- ✅ Deploys in seconds (proven working)
- ✅ Can access AI Foundry private endpoint
- ✅ Container-to-container communication via DNS
- ✅ Each service independent lifecycle
- ✅ Unblocks development immediately
- ✅ Simpler architecture (fewer moving parts)
- ✅ Lower cost for always-on workloads
Cons:
- ❌ No auto-scaling
- ❌ Manual service discovery setup (Private DNS Zone)
- ❌ No built-in health probes between services
- ❌ 10-30 second downtime during updates
- ❌ Need Application Gateway for production HA
Option 2: Wait for Container Apps VNet to Provision
Pros:
- ✅ Better production features (auto-scale, HA, zero-downtime)
- ✅ Built-in service discovery
- ✅ Revision management
Cons:
- ❌ Currently blocked (30+ min "Waiting" state)
- ❌ Unknown resolution timeline
- ❌ Blocks all development work
- ❌ May be regional/subscription-specific Azure platform issue
Option 3: Enable Public Network on AI Foundry + Container Apps without VNet
Pros:
- ✅ Container Apps would deploy immediately (no VNet complexity)
- ✅ All Container Apps production features available
Cons:
- ❌ Breaks security architecture (AI Foundry exposed publicly)
- ❌ Defeats purpose of private endpoints
- ❌ Not acceptable for production
Option 4: Azure Kubernetes Service (AKS)
Pros:
- ✅ Full control
- ✅ VNet integration very stable
- ✅ Production-grade
Cons:
- ❌ Much more complex
- ❌ Higher operational overhead
- ❌ Overkill for current requirements
- ❌ Time to set up and learn
Decision Outcome
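Chosen Option: Azure Container Instances - Single Container Group (All Services Together). This is a simplified variant of Option 1: all services share one container group instead of running in separate container groups with Private DNS.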
Reasoning:
- Unblocks Development: After hours of troubleshooting Container Apps, ACI provides immediate path forward
- Simplest Architecture: All containers in one group communicate via localhost - no service discovery needed
- Meets Core Requirements:
  - ✅ VNet integration for AI Foundry private endpoint access
  - ✅ Container-to-container communication via localhost
  - ✅ Security requirements met
- Acceptable for Dev/Test: All containers restart together during deployment (acceptable for non-production)
- Migration Path: Can migrate to Container Apps or separate container groups later for production
- Pragmatic: Get working deployment now, optimize later
Implementation Plan
Phase 1: Initial Deployment (This Week)
- Deploy single container group with all services:
  - API container (port 8000)
  - UI container (port 80)
  - MCP Verification server (port 8010)
  - MCP Documents server (port 8011)
  - MCP Financial server (port 8012)
- Configure API to call MCP servers via localhost
- Test end-to-end connectivity
- No DNS setup needed (localhost communication)
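Because containers in one group share a single network namespace, every service must listen on a distinct port, and the API reaches the MCP servers through localhost. The sketch below, with hypothetical helper names, checks the Phase 1 port list for collisions and derives the localhost URLs the API would be configured with. The derived variable names follow the MCP_*_URL convention used elsewhere in this ADR.

```python
# Ports copied from the Phase 1 service list.
SERVICE_PORTS = {
    "api": 8000,
    "ui": 80,
    "mcp-verification": 8010,
    "mcp-documents": 8011,
    "mcp-financial": 8012,
}


def find_port_conflicts(ports):
    """Return {port: [services]} for every port claimed by more than one service."""
    by_port = {}
    for name, port in ports.items():
        by_port.setdefault(port, []).append(name)
    return {port: names for port, names in by_port.items() if len(names) > 1}


def localhost_urls(ports):
    """Map each service to the URL other containers in the same group would use."""
    return {name: f"http://localhost:{port}" for name, port in ports.items()}


def mcp_env_vars(ports):
    """Build MCP_*_URL environment variables for the API container."""
    return {
        name.upper().replace("-", "_") + "_URL": url
        for name, url in localhost_urls(ports).items()
        if name.startswith("mcp-")
    }


if __name__ == "__main__":
    conflicts = find_port_conflicts(SERVICE_PORTS)
    if conflicts:
        raise SystemExit(f"port conflicts: {conflicts}")
    for key, value in sorted(mcp_env_vars(SERVICE_PORTS).items()):
        print(f"{key}={value}")
```

Keeping the URLs in environment variables means a later move to separate container groups or Container Apps only changes configuration, not application code.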
Phase 2: Production Hardening (Next Sprint)
- Add Application Gateway for HA and zero-downtime updates
- Implement health checks
- Set up monitoring and alerts
- Document deployment and update procedures
Phase 3: Future Migration (When Ready)
- Monitor Container Apps VNet integration fixes
- When resolved, migrate to Container Apps
- Keep ACI deployment scripts as backup/DR option
Positive Consequences
- ✅ Immediate deployment capability
- ✅ VNet security maintained
- ✅ AI Foundry private endpoint access working
- ✅ Simpler architecture (easier to debug)
- ✅ Lower cost initially
- ✅ Migration path to Container Apps preserved
Negative Consequences
- ❌ Manual service discovery (mitigated by Private DNS)
- ❌ No auto-scaling (can add manually if needed)
- ❌ Need to build additional tooling for zero-downtime updates
- ❌ Less "cloud-native" than Container Apps
Mitigation Strategies
For Missing Auto-Scaling:
- Monitor usage patterns
- Create additional container groups manually if load increases
- Plan Container Apps migration when scaling becomes critical
For Service Discovery:
- Use Azure Private DNS Zone (reliable, Azure-managed)
- Document DNS naming convention
- Automate DNS record creation in deployment scripts
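If services are later split into separate container groups, DNS record creation could be scripted along the following lines. This is a sketch under assumptions: the resource group name example-rg is a placeholder, the IPs are the planned values from Option 1 and would in practice be read back from each deployed container group, and the script only builds Azure CLI commands (az network private-dns record-set a add-record) without running them.

```python
import shlex

ZONE = "loan-defenders.internal"   # zone name from this ADR
RESOURCE_GROUP = "example-rg"      # hypothetical placeholder

# Planned addresses from Option 1; real values come from the deployed groups.
RECORDS = {
    "api": "10.1.0.4",
    "ui": "10.1.0.5",
    "mcp-verification": "10.1.0.6",
    "mcp-documents": "10.1.0.7",
    "mcp-financial": "10.1.0.8",
}


def fqdn(name, zone=ZONE):
    """Apply the naming convention <service>.<zone>."""
    return f"{name}.{zone}"


def add_record_command(resource_group, zone, name, ip):
    """Build the argv for adding one A record to the private DNS zone."""
    return [
        "az", "network", "private-dns", "record-set", "a", "add-record",
        "--resource-group", resource_group,
        "--zone-name", zone,
        "--record-set-name", name,
        "--ipv4-address", ip,
    ]


COMMANDS = [
    add_record_command(RESOURCE_GROUP, ZONE, name, ip)
    for name, ip in RECORDS.items()
]

if __name__ == "__main__":
    for argv in COMMANDS:
        print(shlex.join(argv))
```

Generating the commands from one table keeps the DNS naming convention and the deployment script in sync.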
For Zero-Downtime Updates:
- Short-term: Accept 10-30 second downtime (dev/test acceptable)
- Medium-term: Add Application Gateway with blue-green backend pools
- Long-term: Migrate to Container Apps
Validation
Success Criteria
- ✅ API container can access AI Foundry private endpoint
- ✅ API container can call all 3 MCP servers
- ✅ MCP servers can be independently updated
- ✅ Deployment completes in < 5 minutes
- ✅ Total cost < Container Apps cost
Testing Plan
- Deploy test MCP server, verify private IP assignment
- Deploy test API, verify can reach MCP via DNS
- Test AI Foundry access from container
- Measure deployment time
- Test update procedure (redeploy with new image)
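The connectivity and timing steps in this plan could be scripted with a small probe such as the one below. It is a minimal sketch with hypothetical function names: it only tests whether a TCP connection can be opened, which shows that name resolution and network routing work but says nothing about application health.

```python
import socket
import time


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed_seconds)."""
    start = time.monotonic()
    result = fn(*args)
    return result, time.monotonic() - start


def run_checks(targets, timeout=3.0):
    """Probe each (host, port) pair and return {(host, port): (reachable, seconds)}."""
    return {
        (host, port): timed(can_connect, host, port, timeout)
        for host, port in targets
    }
```

Run from inside a container in the VNet, run_checks could be given pairs such as ("mcp-verification.loan-defenders.internal", 8010) or ("ldftest2-test-ai.cognitiveservices.azure.com", 443). The same timed helper can wrap whatever call triggers a deployment to measure deployment time.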
References
- Microsoft Learn: ACI Virtual Network Scenarios
- Microsoft Learn: Container Groups
- Azure Container Apps VNet Integration
- Azure Private DNS Zones
Related ADRs
- ADR-001: Multi-Agent Architecture Design
- ADR-014: Azure AI Foundry for Multi-Agent Orchestration
- ADR-027: Standardized VNet Architecture with Private Endpoints
Appendix A: Container Apps VNet Troubleshooting History
Attempts Made:
1. Initial deployment with internal: true - Failed (validation timeout)
2. Changed to internal: false - Failed (validation timeout)
3. Added workload profiles (required for new VNet method) - Stuck "Waiting" 30+ min
4. Moved ACR PE to separate subnet - Stuck "Waiting"
5. Added subnet delegation - Stuck "Waiting"
6. Added NSG rules for MCR, AzureFrontDoor, CognitiveServices, AzureML, Storage - Stuck "Waiting"
7. Tried ACR admin credentials instead of managed identity - Failed (environment not provisioned)
Conclusion: Platform issue, not configuration issue. All requirements met per Microsoft documentation.
Appendix B: Cost Comparison
Azure Container Instances (estimated):
- API: 0.5 vCPU + 1GB RAM = ~$35/month
- UI: 0.25 vCPU + 0.5GB RAM = ~$15/month
- MCP servers (3x): 0.25 vCPU + 0.5GB RAM each = ~$45/month
- Total: ~$95/month
Azure Container Apps (estimated):
- Similar resource allocation
- Consumption plan with always-on = ~$100/month
- Total: ~$100/month
Costs similar, so not a primary decision factor
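The arithmetic behind this comparison is simple enough to check directly. The sketch below sums the estimated ACI figures from this appendix and compares them with the Container Apps estimate; all numbers are the rough estimates given above, not measured billing data.

```python
# Estimated monthly costs in USD, copied from Appendix B.
ACI_MONTHLY_USD = {
    "api": 35,
    "ui": 15,
    "mcp-servers (3x)": 45,
}
CONTAINER_APPS_MONTHLY_USD = 100

aci_total = sum(ACI_MONTHLY_USD.values())
difference = CONTAINER_APPS_MONTHLY_USD - aci_total
relative = difference / CONTAINER_APPS_MONTHLY_USD

if __name__ == "__main__":
    print(f"ACI total: ~${aci_total}/month")
    print(f"Difference: ~${difference}/month ({relative:.0%} of the Container Apps estimate)")
```

A gap of about $5 per month is well within the uncertainty of the estimates, which supports treating cost as a secondary factor.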
Appendix C: Architecture Diagram
┌──────────────────────────────────────────────────────────────────────┐
│ Azure Resource Group                                                 │
│                                                                      │
│  ┌────────────────────────────────────────────────────────────────┐  │
│  │ VNet: ldftest2-test-vnet (10.1.0.0/16)                         │  │
│  │                                                                │  │
│  │  ┌──────────────────────────────────────────────────────────┐  │  │
│  │  │ ACI Subnet (10.1.0.0/24)                                 │  │  │
│  │  │                                                          │  │  │
│  │  │  ┌────────────────────────────────────────────────────┐  │  │  │
│  │  │  │ Container Groups (Each with private IP)            │  │  │  │
│  │  │  │                                                    │  │  │  │
│  │  │  │   • API (10.1.0.4)                                 │  │  │  │
│  │  │  │   • UI (10.1.0.5)                                  │  │  │  │
│  │  │  │   • MCP Verification (10.1.0.6)                    │  │  │  │
│  │  │  │   • MCP Documents (10.1.0.7)                       │  │  │  │
│  │  │  │   • MCP Financial (10.1.0.8)                       │  │  │  │
│  │  │  └────────────────────────────────────────────────────┘  │  │  │
│  │  │                                                          │  │  │
│  │  └──────────────────────────────────────────────────────────┘  │  │
│  │                                                                │  │
│  │  ┌──────────────────────────────────────────────────────────┐  │  │
│  │  │ Data Subnet (10.1.3.0/24)                                │  │  │
│  │  │                                                          │  │  │
│  │  │ Private Endpoints:                                       │  │  │
│  │  │   • AI Foundry (ldftest2-test-ai)                        │  │  │
│  │  │   • ACR (ldftest2acr)                                    │  │  │
│  │  └──────────────────────────────────────────────────────────┘  │  │
│  │                                                                │  │
│  └────────────────────────────────────────────────────────────────┘  │
│                                                                      │
│  ┌────────────────────────────────────────────────────────────────┐  │
│  │ Private DNS Zone: loan-defenders.internal                      │  │
│  │                                                                │  │
│  │ A Records:                                                     │  │
│  │   • api.loan-defenders.internal → 10.1.0.4                     │  │
│  │   • mcp-verification.loan-defenders.internal → 10.1.0.6        │  │
│  │   • mcp-documents.loan-defenders.internal → 10.1.0.7           │  │
│  │   • mcp-financial.loan-defenders.internal → 10.1.0.8           │  │
│  └────────────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────────────┘
Communication Flow:
1. API → AI Foundry: Via private endpoint (10.1.3.x)
2. API → MCP Servers: Via DNS (mcp-*.loan-defenders.internal)
3. UI → API: Via private IP or public endpoint
Status: Accepted (2025-10-24)
Implementation Status: This ADR has been accepted and implementation is in progress via the comprehensive migration guide. ACI provides immediate deployment capability while maintaining VNet security requirements. Migration to Container Apps remains a future option once Azure platform issues are resolved.