
Common Deployment Errors and Solutions

MCP Server Connection Errors

Error: Context Manager Timeout / MCP Connection Failed

Symptom:

ERROR: Failed to connect to MCP server
ERROR: Context manager timeout

Root Cause: MCP servers expose their endpoints at the /mcp path, but the API is calling the base URL without that path.

Solution: Update the MCP server URL environment variables to include the /mcp path:

// infrastructure/bicep/apps.bicep
{
  name: 'MCP_APPLICATION_VERIFICATION_URL'
  value: 'http://localhost:8010/mcp'  // ✅ Correct - includes /mcp
}
{
  name: 'MCP_DOCUMENT_PROCESSING_URL'
  value: 'http://localhost:8011/mcp'  // ✅ Correct - includes /mcp
}
{
  name: 'MCP_FINANCIAL_CALCULATIONS_URL'
  value: 'http://localhost:8012/mcp'  // ✅ Correct - includes /mcp
}

Verification:

# Test MCP endpoint directly
curl http://localhost:8010/mcp/v1/list_tools

# Should return JSON with available tools
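
A missing /mcp suffix can also be caught before any request is made. The following is a minimal sketch, assuming the three variable names from apps.bicep are exported in the current shell; check_mcp_url and check_mcp_env are hypothetical helper names, not part of the project.

```shell
# check_mcp_url URL: succeeds only if URL ends in /mcp (one trailing slash is tolerated).
check_mcp_url() {
  case "${1%/}" in
    */mcp) return 0 ;;
    *)     return 1 ;;
  esac
}

# check_mcp_env: report each MCP_*_URL variable and fail if any lacks the /mcp path.
check_mcp_env() {
  status=0
  for name in MCP_APPLICATION_VERIFICATION_URL \
              MCP_DOCUMENT_PROCESSING_URL \
              MCP_FINANCIAL_CALCULATIONS_URL; do
    # Indirect lookup of the variable named in $name (empty if unset).
    eval "value=\${$name:-}"
    if check_mcp_url "$value"; then
      echo "ok: $name"
    else
      echo "missing /mcp: $name=$value"
      status=1
    fi
  done
  return "$status"
}
```

Calling check_mcp_env prints one line per variable and returns a non-zero status if any URL is unset or lacks the path, so it can gate a deployment script.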


Azure AI Authentication Errors

Error: (403) Public access is disabled. Please configure private endpoint.

Symptom:

Processing failed: (403) Public access is disabled. Please configure private endpoint.
Code: 403
Message: Public access is disabled. Please configure private endpoint.

Root Cause: Azure AI Services has publicNetworkAccess: "Disabled", but either:

  1. No private DNS zone is configured for services.ai.azure.com, or
  2. ACI containers resolve the hostname to the public IP instead of the private endpoint.

Solution 1: Enable Public Access with VNet Restrictions (Recommended)

This keeps the public endpoint enabled but restricts access to the ACI subnet only:

# Update Azure AI Services network ACLs
az rest --method PATCH \
  --url "https://management.azure.com/subscriptions/{subscription-id}/resourceGroups/{rg}/providers/Microsoft.CognitiveServices/accounts/{ai-name}?api-version=2023-05-01" \
  --body '{
    "properties": {
      "publicNetworkAccess": "Enabled",
      "networkAcls": {
        "defaultAction": "Deny",
        "virtualNetworkRules": [
          {
            "id": "/subscriptions/{subscription-id}/resourceGroups/{rg}/providers/Microsoft.Network/virtualNetworks/{vnet}/subnets/aci-subnet"
          }
        ]
      }
    }
  }'

Bicep Template:

// infrastructure/bicep/modules/ai-services.bicep
resource aiServices 'Microsoft.CognitiveServices/accounts@2023-05-01' = {
  properties: {
    publicNetworkAccess: 'Enabled'
    networkAcls: {
      defaultAction: 'Deny'
      bypass: 'AzureServices'
      virtualNetworkRules: [for subnetId in allowedSubnetIds: {
        id: subnetId
        ignoreMissingVnetServiceEndpoint: false
      }]
    }
  }
}

Verification:

# From ACI container, test AI endpoint
az container exec --resource-group {rg} \
  --container-name api --name {aci-name} \
  --exec-command "curl -I https://{ai-name}.services.ai.azure.com"

# Should return HTTP 200


Key Vault Access Errors

Error: You do not have access to List secrets for this resource

Symptom:

  • Cannot connect to Bastion VM
  • Portal shows: "Public network access is disabled and request is not from a trusted service nor via an approved private link."

Root Cause: Key Vault has publicNetworkAccess: "Disabled" and you are accessing it from the Azure Portal over the public Internet.

Solution:

  1. Enable Public Access with Restrictions:

    az keyvault update --resource-group {rg} --name {kv-name} \
      --public-network-access Enabled \
      --bypass AzureServices \
      --default-action Deny
    

  2. Add Your IP to Allowlist:

    # Get your IP
    MY_IP=$(curl -s ifconfig.me)
    
    # Add to Key Vault firewall
    az keyvault network-rule add --resource-group {rg} \
      --name {kv-name} \
      --ip-address "${MY_IP}/32"
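
The value returned by ifconfig.me is used unchecked in the step above. If the lookup yields an IPv6 address or an error page, the /32 suffix produces an invalid rule. A minimal validation sketch follows; is_ipv4 and to_cidr32 are hypothetical helper names introduced here for illustration.

```shell
# is_ipv4 ADDR: succeeds only for a dotted quad whose four octets are 0-255.
is_ipv4() {
  case "$1" in
    *[!0-9.]*|*..*|.*|*.) return 1 ;;   # reject non-digits and empty octets
  esac
  old_ifs=$IFS
  IFS=.
  set -- $1                             # split the address on dots
  IFS=$old_ifs
  [ "$#" -eq 4 ] || return 1
  for octet in "$@"; do
    [ "${#octet}" -le 3 ] && [ "$octet" -le 255 ] || return 1
  done
  return 0
}

# to_cidr32 ADDR: print ADDR/32, or fail with a message on stderr.
to_cidr32() {
  if is_ipv4 "$1"; then
    printf '%s/32\n' "$1"
  else
    echo "not an IPv4 address: $1" >&2
    return 1
  fi
}
```

With these defined, the firewall step could pass --ip-address "$(to_cidr32 "$MY_IP")" and stop early when the lookup did not return a usable address.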
    

Bicep Template:

// infrastructure/bicep/modules/security.bicep
module keyVault 'br/public:avm/res/key-vault/vault:0.12.0' = {
  params: {
    publicNetworkAccess: 'Enabled'
    networkAcls: {
      bypass: 'AzureServices'  // Allows trusted Azure services; Portal users still need an ipRules entry
      defaultAction: 'Deny'
      ipRules: [for ip in adminAllowedIpAddresses: {
        value: ip
      }]
    }
  }
}

Parameter File:

// infrastructure/bicep/environments/dev/foundation.bicepparam
param adminAllowedIpAddresses = [
  '1.2.3.4/32'  // Replace with your actual IP
]

Verification:

# Test Key Vault access
az keyvault secret list --vault-name {kv-name}

# Should list secrets without error


Application Gateway Access Errors

Error: UI Not Accessible / Connection Timeout

Symptom:

  • Application Gateway is running
  • Containers are healthy
  • UI times out when accessed through the public endpoint

Root Cause: The NSG for the Application Gateway subnet is missing an HTTP inbound rule.

Solution:

# Add HTTP inbound rule to App Gateway NSG
az network nsg rule create \
  --resource-group {rg} \
  --nsg-name {app-gateway-nsg-name} \
  --name AllowHttpInbound \
  --priority 100 \
  --source-address-prefixes Internet \
  --source-port-ranges '*' \
  --destination-address-prefixes '*' \
  --destination-port-ranges 80 \
  --access Allow \
  --protocol Tcp \
  --description "Allow HTTP traffic from Internet"

⚠️ Note: Application Gateway auto-creates its NSG, which cannot easily be configured via Bicep. This is a known limitation. Consider post-deployment automation.
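
One possible shape for that post-deployment automation is an idempotent wrapper around the command above, so it can run after every deployment without failing when the rule already exists. This is a sketch only; ensure_http_rule is a hypothetical function name, and the rule settings are copied from the solution above.

```shell
# ensure_http_rule RG NSG: create AllowHttpInbound only if it is not already present.
ensure_http_rule() {
  rg=$1
  nsg=$2
  # "rule show" exits non-zero when the rule does not exist.
  if az network nsg rule show --resource-group "$rg" --nsg-name "$nsg" \
       --name AllowHttpInbound >/dev/null 2>&1; then
    echo "AllowHttpInbound already present on $nsg"
    return 0
  fi
  az network nsg rule create \
    --resource-group "$rg" --nsg-name "$nsg" \
    --name AllowHttpInbound --priority 100 \
    --source-address-prefixes Internet --source-port-ranges '*' \
    --destination-address-prefixes '*' --destination-port-ranges 80 \
    --access Allow --protocol Tcp \
    --description "Allow HTTP traffic from Internet"
}
```

A deployment script would call ensure_http_rule {rg} {app-gateway-nsg-name} as its last step.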

Verification:

# Check NSG rules
az network nsg rule list --resource-group {rg} \
  --nsg-name {app-gateway-nsg-name} \
  --query "[?direction=='Inbound' && access=='Allow'].{name:name, port:destinationPortRange}" \
  -o table

# Should show AllowHttpInbound on port 80


Rate Limit Errors

Error: Rate limit is exceeded. Try again in X seconds.

Symptom:

System: Processing failed: Rate limit is exceeded. Try again in 22 seconds.

Root Cause: The Azure AI model deployment capacity is too low for the multi-agent workflow (4 agents calling the model in sequence).

Solution:

  1. Increase Model Capacity:

    # Update deployment capacity; sku.name should match the deployment's existing SKU
    # (ai-models.json below uses GlobalStandard)
    az rest --method PATCH \
      --url "https://management.azure.com/subscriptions/{sub-id}/resourceGroups/{rg}/providers/Microsoft.CognitiveServices/accounts/{ai-name}/deployments/{model-name}?api-version=2023-05-01" \
      --body '{"sku": {"name": "Standard", "capacity": 150}}'
    

  2. Update Bicep Default:

    // infrastructure/bicep/environments/dev/ai-models.json
    {
      "modelDeployments": [
        {
          "name": "gpt-4o",
          "sku": {
            "name": "GlobalStandard",
            "capacity": 150  // Increased from 50
          }
        }
      ]
    }
    

Recommended Capacities:

  • Development: 150K TPM (handles multi-agent workflows)
  • Production: 250K+ TPM (handles concurrent users)
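
The capacity value is expressed in thousands of tokens per minute, so 150 corresponds to 150K TPM. A rough sizing sketch follows; required_capacity is a hypothetical helper, and the tokens-per-call and workflows-per-minute figures in the example are illustrative assumptions, not measurements from this system.

```shell
# required_capacity TOKENS_PER_CALL AGENTS WORKFLOWS_PER_MINUTE
# Prints the capacity units (1 unit = 1,000 TPM) needed, rounded up.
required_capacity() {
  tpm=$(( $1 * $2 * $3 ))
  echo $(( (tpm + 999) / 1000 ))
}

# Example: 4 agents, an assumed 8,000 tokens per call, 4 workflows per minute.
required_capacity 8000 4 4   # prints 128
```

Under those assumed numbers a capacity of 150 leaves some headroom, while the earlier default of 50 would not.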

Verification:

# Check current capacity
az cognitiveservices account deployment show \
  --resource-group {rg} \
  --name {ai-name} \
  --deployment-name gpt-4o \
  --query "sku.capacity" -o tsv

# Should show 150 or higher
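
While capacity is being raised, a caller can honor the wait time that the error message reports. The sketch below assumes the failing command exits non-zero and prints the "Try again in X seconds" text shown in the symptom; parse_retry_after, run_with_retry and MAX_ATTEMPTS are hypothetical names introduced for illustration.

```shell
# parse_retry_after MESSAGE: print the number of seconds named in the rate-limit message.
parse_retry_after() {
  printf '%s\n' "$1" | sed -n 's/.*Try again in \([0-9][0-9]*\) second.*/\1/p'
}

# run_with_retry CMD [ARGS...]: rerun CMD while it fails with a rate-limit message,
# sleeping for the advertised interval, up to MAX_ATTEMPTS tries (default 3).
run_with_retry() {
  attempts=0
  max_attempts="${MAX_ATTEMPTS:-3}"
  while :; do
    output=$("$@" 2>&1) && { printf '%s\n' "$output"; return 0; }
    wait_s=$(parse_retry_after "$output")
    attempts=$((attempts + 1))
    # Give up on errors that are not rate limits, or when attempts run out.
    if [ -z "$wait_s" ] || [ "$attempts" -ge "$max_attempts" ]; then
      printf '%s\n' "$output" >&2
      return 1
    fi
    sleep "$wait_s"
  done
}
```

This only smooths over short bursts. If the workflow hits the limit on every run, the capacity increase above remains the actual fix.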


DNS Resolution Errors

Error: DNS resolving to public IP instead of private endpoint

Symptom:

# DNS check returns public IP
getent hosts {ai-name}.services.ai.azure.com
20.119.156.137  # ❌ Public IP, not private IP 10.0.3.x

Root Cause: The private DNS zone is not configured or not linked to the VNet.

Solution:

  1. Create Private DNS Zone:

    # For services.ai.azure.com domain
    az network private-dns zone create \
      --resource-group {rg} \
      --name privatelink.services.ai.azure.com
    

  2. Link to VNet:

    az network private-dns link vnet create \
      --resource-group {rg} \
      --zone-name privatelink.services.ai.azure.com \
      --name {ai-name}-vnet-link \
      --virtual-network {vnet-id} \
      --registration-enabled false
    

  3. Add A Record:

    az network private-dns record-set a add-record \
      --resource-group {rg} \
      --zone-name privatelink.services.ai.azure.com \
      --record-set-name {ai-name} \
      --ipv4-address 10.0.3.8  # Private endpoint IP
    

⚠️ Note: This setup is complex. Consider using the public endpoint with VNet restrictions instead (simpler and more reliable).
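
After the zone, link and record are in place, the DNS check from the symptom can be repeated from inside the VNet. The following sketch classifies the resolved address; is_private_ipv4 and check_private_dns are hypothetical helper names, and the check covers only the RFC 1918 IPv4 ranges (any other answer, including IPv6, is reported as public).

```shell
# is_private_ipv4 ADDR: succeeds for 10/8, 172.16/12 and 192.168/16 addresses.
is_private_ipv4() {
  case "$1" in
    10.*|192.168.*) return 0 ;;
    172.1[6-9].*|172.2[0-9].*|172.3[01].*) return 0 ;;
    *) return 1 ;;
  esac
}

# check_private_dns HOST: resolve HOST and report whether it maps to a private IP.
# Returns 0 for private, 1 for public, 2 if the name does not resolve.
check_private_dns() {
  ip=$(getent hosts "$1" | awk 'NR==1 {print $1}')
  if [ -z "$ip" ]; then
    echo "unresolved: $1"
    return 2
  fi
  if is_private_ipv4 "$ip"; then
    echo "private: $ip"
    return 0
  fi
  echo "public: $ip"
  return 1
}
```

Running check_private_dns {ai-name}.services.ai.azure.com inside an ACI container should report a private address such as the 10.0.3.x endpoint IP once the private DNS zone is resolving correctly.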


Quick Diagnostic Commands

Check Container Health

# All containers in group
az container show --resource-group {rg} --name {aci-name} \
  --query "containers[].{name:name, state:instanceView.currentState.state}" -o table

# API logs
az container logs --resource-group {rg} --name {aci-name} --container-name api | tail -50

Check Network Configuration

# Azure AI network ACLs
az cognitiveservices account show --resource-group {rg} --name {ai-name} \
  --query "{publicAccess:properties.publicNetworkAccess, networkAcls:properties.networkAcls}" -o json

# Key Vault network ACLs
az keyvault show --resource-group {rg} --name {kv-name} \
  --query "{publicAccess:properties.publicNetworkAccess, networkAcls:properties.networkAcls}" -o json

Check Application Gateway

# Gateway state
az network application-gateway show --resource-group {rg} --name {app-gw-name} \
  --query "{state:operationalState, provisioning:provisioningState}" -o table

# Backend health
az network application-gateway show-backend-health --resource-group {rg} --name {app-gw-name}

Rollback Procedures

Rollback: Azure AI VNet Restrictions

# Re-enable public access (emergency only)
az rest --method PATCH \
  --url "https://management.azure.com/.../Microsoft.CognitiveServices/accounts/{ai-name}?api-version=2023-05-01" \
  --body '{"properties":{"publicNetworkAccess":"Enabled","networkAcls":{"defaultAction":"Allow"}}}'

Rollback: Key Vault IP Restrictions

# Remove IP restrictions
az keyvault update --resource-group {rg} --name {kv-name} \
  --default-action Allow

Rollback: Model Capacity

# Reduce capacity (cost saving)
az rest --method PATCH \
  --url "https://management.azure.com/.../deployments/{model-name}?api-version=2023-05-01" \
  --body '{"sku": {"capacity": 50}}'

Prevention Checklist

Before deploying to production:

  • MCP server URLs include /mcp path
  • Azure AI has VNet restrictions configured
  • Key Vault has admin IPs allowlisted
  • Application Gateway NSG allows HTTP/HTTPS
  • Model capacity set to 150K+ TPM
  • All containers can reach Azure AI endpoint
  • Bastion connectivity tested
  • DNS resolution verified (private endpoint if used)
  • Rate limits tested with full workflow
  • Token usage logging enabled

Support Resources

  • Architecture Decisions: docs/architecture/decisions/
  • Security Configuration: SECURITY.md
  • Manual Changes Log: temp/manual-changes-2025-10-30.md
  • Deployment Scripts: infrastructure/scripts/