Common Deployment Errors and Solutions
MCP Server Connection Errors
Error: Context Manager Timeout / MCP Connection Failed
Symptom:
Root Cause:
MCP servers expose their endpoints at the /mcp path, but the API is calling the base URL without that path.
Solution:
Update the MCP server URL environment variables to include the /mcp path:
// infrastructure/bicep/apps.bicep
{
  name: 'MCP_APPLICATION_VERIFICATION_URL'
  value: 'http://localhost:8010/mcp' // ✅ Correct - includes /mcp
}
{
  name: 'MCP_DOCUMENT_PROCESSING_URL'
  value: 'http://localhost:8011/mcp' // ✅ Correct - includes /mcp
}
{
  name: 'MCP_FINANCIAL_CALCULATIONS_URL'
  value: 'http://localhost:8012/mcp' // ✅ Correct - includes /mcp
}
Verification:
# Test MCP endpoint directly
curl http://localhost:8010/mcp/v1/list_tools
# Should return JSON with available tools
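The same rule can be checked before deployment with a few lines of shell. This is a minimal sketch, not part of the project's scripts: the helper names `ensure_mcp_path` and `check_mcp_var` are hypothetical, and the fallback values are the localhost URLs shown above. It only tests whether a URL ends in /mcp; it does not contact the server.

```shell
#!/bin/sh
# Append /mcp to a base URL unless the URL already ends with it.
ensure_mcp_path() {
  url="${1%/}"                      # drop one trailing slash, if present
  case "$url" in
    */mcp) printf '%s\n' "$url" ;;
    *)     printf '%s\n' "$url/mcp" ;;
  esac
}

# Report whether the value of a named variable already includes /mcp.
check_mcp_var() {
  name="$1"; value="$2"
  if [ "$(ensure_mcp_path "$value")" = "${value%/}" ]; then
    echo "OK   $name=$value"
  else
    echo "FIX  $name=$value -> $(ensure_mcp_path "$value")"
    return 1
  fi
}

# Check the three variables; fall back to the documented values when unset.
for pair in \
  "MCP_APPLICATION_VERIFICATION_URL=${MCP_APPLICATION_VERIFICATION_URL:-http://localhost:8010/mcp}" \
  "MCP_DOCUMENT_PROCESSING_URL=${MCP_DOCUMENT_PROCESSING_URL:-http://localhost:8011/mcp}" \
  "MCP_FINANCIAL_CALCULATIONS_URL=${MCP_FINANCIAL_CALCULATIONS_URL:-http://localhost:8012/mcp}"
do
  check_mcp_var "${pair%%=*}" "${pair#*=}" || true
done
```

A value such as http://localhost:8010 is reported with a FIX line and the corrected URL, which makes the missing path visible before the API times out on it.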
Azure AI Authentication Errors
Error: (403) Public access is disabled. Please configure private endpoint.
Symptom:
Processing failed: (403) Public access is disabled. Please configure private endpoint.
Code: 403
Message: Public access is disabled. Please configure private endpoint.
Root Cause:
Azure AI Services has publicNetworkAccess: "Disabled", but either:
1. No private DNS zone is configured for services.ai.azure.com, OR
2. ACI containers resolve the hostname to a public IP instead of the private endpoint
Solution 1: Enable Public Access with VNet Restrictions (Recommended)
This keeps the public endpoint enabled but restricts access to the ACI subnet only:
# Update Azure AI Services network ACLs
az rest --method PATCH \
  --url "https://management.azure.com/subscriptions/{subscription-id}/resourceGroups/{rg}/providers/Microsoft.CognitiveServices/accounts/{ai-name}?api-version=2023-05-01" \
  --body '{
    "properties": {
      "publicNetworkAccess": "Enabled",
      "networkAcls": {
        "defaultAction": "Deny",
        "virtualNetworkRules": [
          {
            "id": "/subscriptions/{subscription-id}/resourceGroups/{rg}/providers/Microsoft.Network/virtualNetworks/{vnet}/subnets/aci-subnet"
          }
        ]
      }
    }
  }'
Bicep Template:
// infrastructure/bicep/modules/ai-services.bicep
resource aiServices 'Microsoft.CognitiveServices/accounts@2023-05-01' = {
  // name, location, kind, and sku omitted here for brevity
  properties: {
    publicNetworkAccess: 'Enabled'
    networkAcls: {
      defaultAction: 'Deny'
      bypass: 'AzureServices'
      virtualNetworkRules: [for subnetId in allowedSubnetIds: {
        id: subnetId
        ignoreMissingVnetServiceEndpoint: false
      }]
    }
  }
}
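Because the template sets ignoreMissingVnetServiceEndpoint: false, the virtual network rule presumably depends on the subnet already having the Microsoft.CognitiveServices service endpoint. The sketch below shows one way to enable it with the Azure CLI. The helper name `enable_ai_service_endpoint` is hypothetical, and whether this step is needed in your environment is an assumption to confirm against the subnet's current configuration. Note that `--service-endpoints` takes the subnet's full endpoint list, so include any endpoints the subnet already uses.

```shell
#!/bin/sh
# Hypothetical helper: enable the Cognitive Services service endpoint on a subnet.
# Arguments: resource group, VNet name, subnet name.
enable_ai_service_endpoint() {
  rg="$1"; vnet="$2"; subnet="$3"
  az network vnet subnet update \
    --resource-group "$rg" \
    --vnet-name "$vnet" \
    --name "$subnet" \
    --service-endpoints Microsoft.CognitiveServices
}

# Example, using the subnet name from the REST call above:
#   enable_ai_service_endpoint {rg} {vnet} aci-subnet
```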
Verification:
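# From the ACI container, test the AI endpoint.
# az container exec cannot pass arguments to the command, so open a shell first:
az container exec --resource-group {rg} \
--container-name api --name {aci-name} \
--exec-command "/bin/sh"
# Then, inside the container:
curl -I https://{ai-name}.services.ai.azure.com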
# Should return HTTP 200
Key Vault Access Errors
Error: You do not have access to List secrets for this resource
Symptom:
- Cannot connect to Bastion VM
- Portal shows: "Public network access is disabled and request is not from a trusted service nor via an approved private link."
Root Cause:
Key Vault has publicNetworkAccess: "Disabled" and you are accessing it from the Azure Portal (over the public internet).
Solution:
1. Enable Public Access with Restrictions
2. Add Your IP to Allowlist
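The two steps can be sketched with the Azure CLI as follows. This is an assumption about how they might be carried out by hand, mirroring the Bicep settings (public access enabled, default action Deny, AzureServices bypass, one IP rule); the helper name `kv_restrict_and_allow` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: keep the public endpoint on, deny by default,
# then admit a single admin IP (CIDR notation).
# Arguments: resource group, Key Vault name, IP or CIDR to allow.
kv_restrict_and_allow() {
  rg="$1"; kv="$2"; ip="$3"
  # Step 1: enable public access, but deny everything not explicitly allowed.
  az keyvault update --resource-group "$rg" --name "$kv" \
    --public-network-access Enabled \
    --default-action Deny \
    --bypass AzureServices &&
  # Step 2: add the admin IP to the allowlist.
  az keyvault network-rule add --resource-group "$rg" --name "$kv" \
    --ip-address "$ip"
}

# Example (replace with your actual IP):
#   kv_restrict_and_allow {rg} {kv-name} 1.2.3.4/32
```

Changes made this way drift from the Bicep template, so the same IP should also go into adminAllowedIpAddresses.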
Bicep Template:
// infrastructure/bicep/modules/security.bicep
module keyVault 'br/public:avm/res/key-vault/vault:0.12.0' = {
  // module name and the vault's name param omitted here for brevity
  params: {
    publicNetworkAccess: 'Enabled'
    networkAcls: {
      bypass: 'AzureServices' // Allows trusted Azure services; Portal access from your machine relies on ipRules
      defaultAction: 'Deny'
      ipRules: [for ip in adminAllowedIpAddresses: {
        value: ip
      }]
    }
  }
}
Parameter File:
// infrastructure/bicep/environments/dev/foundation.bicepparam
param adminAllowedIpAddresses = [
  '1.2.3.4/32' // Replace with your actual IP
]
Verification:
# Test Key Vault access
az keyvault secret list --vault-name {kv-name}
# Should list secrets without error
Application Gateway Access Errors
Error: UI Not Accessible / Connection Timeout
Symptom:
- Application Gateway running
- Containers healthy
- UI timeout when accessing public endpoint
Root Cause: The NSG on the Application Gateway subnet is missing an inbound HTTP rule.
Solution:
# Add HTTP inbound rule to App Gateway NSG
az network nsg rule create \
--resource-group {rg} \
--nsg-name {app-gateway-nsg-name} \
--name AllowHttpInbound \
--priority 100 \
--source-address-prefixes Internet \
--source-port-ranges '*' \
--destination-address-prefixes '*' \
--destination-port-ranges 80 \
--access Allow \
--protocol Tcp \
--description "Allow HTTP traffic from Internet"
⚠️ Note: Application Gateway auto-creates the NSG, so the rule cannot easily be configured via Bicep. This is a known limitation. Consider post-deployment automation.
Verification:
# Check NSG rules
az network nsg rule list --resource-group {rg} \
--nsg-name {app-gateway-nsg-name} \
--query "[?direction=='Inbound' && access=='Allow'].{name:name, port:destinationPortRange}" \
-o table
# Should show AllowHttpInbound on port 80
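For the post-deployment automation mentioned in the note, one option is an idempotent wrapper that creates the rule only when it is missing. This is a sketch under that assumption, reusing the exact flags from the command above; the function name `ensure_http_rule` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: create AllowHttpInbound only if the NSG does not have it yet.
# Arguments: resource group, NSG name.
ensure_http_rule() {
  rg="$1"; nsg="$2"
  if az network nsg rule show --resource-group "$rg" --nsg-name "$nsg" \
       --name AllowHttpInbound >/dev/null 2>&1; then
    echo "AllowHttpInbound already present on $nsg"
  else
    az network nsg rule create \
      --resource-group "$rg" \
      --nsg-name "$nsg" \
      --name AllowHttpInbound \
      --priority 100 \
      --source-address-prefixes Internet \
      --source-port-ranges '*' \
      --destination-address-prefixes '*' \
      --destination-port-ranges 80 \
      --access Allow \
      --protocol Tcp \
      --description "Allow HTTP traffic from Internet"
  fi
}

# Example, run after each infrastructure deployment:
#   ensure_http_rule {rg} {app-gateway-nsg-name}
```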
Rate Limit Errors
Error: Rate limit is exceeded. Try again in X seconds.
Symptom:
Root Cause: The Azure AI model deployment capacity is too low for the multi-agent workflow (4 agents calling the model in sequence).
Solution:
1. Increase Model Capacity
2. Update Bicep Default
Recommended Capacities:
- Development: 150K TPM (handles multi-agent workflows)
- Production: 250K+ TPM (handles concurrent users)
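A rough back-of-the-envelope check shows why the capacity matters for a 4-agent workflow. The sketch below is illustrative only: the 8,000 tokens per agent call is an assumed figure, not a measurement from this project, and the helper name `workflows_per_minute` is hypothetical. Real throttling also depends on request rate and burst behaviour, so treat the result as an upper bound.

```shell
#!/bin/sh
# Estimate how many complete workflows fit into one minute of token budget.
# Arguments: capacity in thousands of TPM, agents per workflow, tokens per agent call.
workflows_per_minute() {
  capacity_k="$1"; agents="$2"; tokens_per_call="$3"
  echo $(( capacity_k * 1000 / (agents * tokens_per_call) ))
}

# Assumed 8,000 tokens per call, 4 agents => 32,000 tokens per workflow.
workflows_per_minute 50 4 8000    # prints 1  (50,000 / 32,000, rounded down)
workflows_per_minute 150 4 8000   # prints 4  (150,000 / 32,000, rounded down)
workflows_per_minute 250 4 8000   # prints 7  (250,000 / 32,000, rounded down)
```

Under these assumptions a capacity of 50 leaves room for about one workflow per minute, so a second concurrent run would likely hit the rate limit, while 150 leaves headroom for a handful.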
Verification:
# Check current capacity
az cognitiveservices account deployment show \
--resource-group {rg} \
--name {ai-name} \
--deployment-name gpt-4o \
--query "sku.capacity" -o tsv
# Should show 150 or higher (capacity is expressed in units of 1,000 TPM)
DNS Resolution Errors
Error: DNS resolving to public IP instead of private endpoint
Symptom:
# DNS check returns public IP
getent hosts {ai-name}.services.ai.azure.com
20.119.156.137 # ❌ Public IP, not private IP 10.0.3.x
Root Cause: The private DNS zone is not configured or not linked to the VNet.
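The symptom check can be scripted so that it classifies the answer instead of relying on reading the IP by eye. This is a minimal sketch: the function names `is_private_ip` and `check_private_dns` are hypothetical, "private" here means the RFC 1918 IPv4 ranges (which cover the 10.0.3.x addresses expected above), and IPv6 answers are not handled.

```shell
#!/bin/sh
# Return 0 if the IPv4 address is in an RFC 1918 private range.
is_private_ip() {
  case "$1" in
    10.*|192.168.*)                        return 0 ;;
    172.1[6-9].*|172.2[0-9].*|172.3[01].*) return 0 ;;
    *)                                     return 1 ;;
  esac
}

# Resolve a hostname with getent and say whether the first answer is private.
# Exit status: 0 = private, 1 = public, 2 = no answer.
check_private_dns() {
  ip="$(getent hosts "$1" | awk '{ print $1; exit }')"
  if [ -z "$ip" ]; then
    echo "$1: no DNS answer"
    return 2
  fi
  if is_private_ip "$ip"; then
    echo "$1 -> $ip (private)"
  else
    echo "$1 -> $ip (public)"
    return 1
  fi
}

# Example, run from inside the container:
#   check_private_dns {ai-name}.services.ai.azure.com
```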
Solution:
1. Create Private DNS Zone
2. Link to VNet
3. Add A Record
⚠️ Note: This is complex. Consider using the public endpoint with VNet restrictions instead (simpler, more reliable).
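If the private endpoint route is chosen anyway, the three steps might look like the following Azure CLI sketch. Several things here are assumptions rather than project facts: the helper name `setup_private_dns`, the link name derived from the VNet name, and the default zone name privatelink.services.ai.azure.com, which should be checked against the zone your private endpoint actually expects. The IP passed in is the private endpoint's address (10.0.3.x in the example above).

```shell
#!/bin/sh
# Hypothetical helper covering: create zone, link to VNet, add A record.
# Arguments: resource group, VNet name, record name, private IP, [zone name].
setup_private_dns() {
  rg="$1"; vnet="$2"; record="$3"; ip="$4"
  zone="${5:-privatelink.services.ai.azure.com}"   # assumed default zone name

  # 1. Create the private DNS zone.
  az network private-dns zone create \
    --resource-group "$rg" --name "$zone" &&
  # 2. Link the zone to the VNet (no auto-registration of VM records).
  az network private-dns link vnet create \
    --resource-group "$rg" --zone-name "$zone" \
    --name "${vnet}-link" --virtual-network "$vnet" \
    --registration-enabled false &&
  # 3. Point the service's record at the private endpoint IP.
  az network private-dns record-set a add-record \
    --resource-group "$rg" --zone-name "$zone" \
    --record-set-name "$record" --ipv4-address "$ip"
}

# Example:
#   setup_private_dns {rg} {vnet} {ai-name} 10.0.3.4
```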
Quick Diagnostic Commands
Check Container Health
# All containers in group
az container show --resource-group {rg} --name {aci-name} \
--query "containers[].{name:name, state:instanceView.currentState.state}" -o table
# API logs
az container logs --resource-group {rg} --name {aci-name} --container-name api | tail -50
Check Network Configuration
# Azure AI network ACLs
az cognitiveservices account show --resource-group {rg} --name {ai-name} \
--query "{publicAccess:properties.publicNetworkAccess, networkAcls:properties.networkAcls}" -o json
# Key Vault network ACLs
az keyvault show --resource-group {rg} --name {kv-name} \
--query "{publicAccess:properties.publicNetworkAccess, networkAcls:properties.networkAcls}" -o json
Check Application Gateway
# Gateway state
az network application-gateway show --resource-group {rg} --name {app-gw-name} \
--query "{state:operationalState, provisioning:provisioningState}" -o table
# Backend health
az network application-gateway show-backend-health --resource-group {rg} --name {app-gw-name}
Rollback Procedures
Rollback: Azure AI VNet Restrictions
# Re-enable public access (emergency only)
az rest --method PATCH \
--url "https://management.azure.com/.../Microsoft.CognitiveServices/accounts/{ai-name}?api-version=2023-05-01" \
--body '{"properties":{"publicNetworkAccess":"Enabled","networkAcls":{"defaultAction":"Allow"}}}'
Rollback: Key Vault IP Restrictions
# Remove IP restrictions
az keyvault update --resource-group {rg} --name {kv-name} \
--default-action Allow
Rollback: Model Capacity
# Reduce capacity (cost saving)
az rest --method PATCH \
--url "https://management.azure.com/.../deployments/{model-name}?api-version=2023-05-01" \
--body '{"sku": {"capacity": 50}}'
Prevention Checklist
Before deploying to production:
- MCP server URLs include /mcp path
- Azure AI has VNet restrictions configured
- Key Vault has admin IPs whitelisted
- Application Gateway NSG allows HTTP/HTTPS
- Model capacity set to 150K+ TPM
- All containers can reach Azure AI endpoint
- Bastion connectivity tested
- DNS resolution verified (private endpoint if used)
- Rate limits tested with full workflow
- Token usage logging enabled
Support Resources
- Architecture Decisions: docs/architecture/decisions/
- Security Configuration: SECURITY.md
- Manual Changes Log: temp/manual-changes-2025-10-30.md
- Deployment Scripts: infrastructure/scripts/