ADR-058: Network Security - Public Endpoints with VNet Restrictions

Date: 2025-10-30
Status: Accepted
Deciders: System Architecture, Security Engineering
Technical Story: Network security hardening for Azure AI Services, Key Vault, and Application Gateway

Context and Problem Statement

After deploying the multi-agent loan processing system to Azure, we encountered several network access issues:

  1. Azure AI Services: Rate limit errors due to low capacity (50K TPM)
  2. Azure AI Services: 403 "Public access disabled" errors from ACI containers
  3. Key Vault: "Public network access disabled" blocking Bastion and Portal access
  4. Application Gateway: Users unable to access UI (NSG missing HTTP rule)
  5. MCP Servers: Connection timeouts due to missing /mcp endpoint path

We needed to balance three competing requirements:

  • Security: Protect Azure resources from Internet attacks
  • Functionality: Allow legitimate traffic (containers, admin access, users)
  • Developer Experience: Minimize manual configuration steps

The key decision: Should we use private endpoints only, or public endpoints with VNet restrictions?

Decision Drivers

Technical Factors

  • Private Endpoint Complexity: Requires Private DNS zones, A records, VNet links, DNS propagation time
  • DNS Resolution Issues: Azure AI Foundry uses services.ai.azure.com (no auto-configured Private DNS)
  • Development Velocity: Need fast iteration cycles for development/testing
  • Operational Simplicity: Minimize moving parts and failure modes

Security Factors

  • Defense in Depth: Multiple security layers (NSG, VNet ACLs, app auth)
  • Attack Surface: Minimize Internet-exposed endpoints
  • DDoS Protection: Avoid $3,000/month DDoS Protection Standard cost
  • Compliance: Meet enterprise security requirements

Cost Factors

  • Private Endpoint Costs: $0.01/GB processed + endpoint charges
  • Internet Egress: $0.08-0.12/GB for public traffic
  • DDoS Protection: $3,000/month if using public endpoints without restrictions
  • Operational Time: Developer hours spent troubleshooting

User Experience Factors

  • Deployment Time: End-to-end deployment should complete in <60 minutes
  • Manual Steps: Minimize required manual configuration
  • Documentation Burden: Clear, actionable troubleshooting guides
  • Error Messages: Understandable, with clear remediation steps

Considered Options

Option 1: Private Endpoints Only (Most Secure)

Configuration:

  • All Azure resources: publicNetworkAccess: Disabled
  • Private endpoints on data-subnet (10.0.3.0/24)
  • Private DNS zones for all Azure services
  • Access via: Bastion → Jump Box → private IPs

Pros:

  • ✅ Zero Internet exposure
  • ✅ Lowest attack surface
  • ✅ Best compliance posture
  • ✅ Lower latency (VNet internal routing)

Cons:

  • ❌ Complex DNS configuration (multiple Private DNS zones)
  • ❌ services.ai.azure.com domain not auto-configured
  • ❌ DNS propagation delays (5-15 minutes)
  • ❌ Difficult to troubleshoot DNS issues
  • ❌ Bastion required for all admin access
  • ❌ Higher operational complexity

Testing Result: Failed - DNS resolved to public IP (20.119.156.137) instead of private IP (10.0.3.8)
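
The failure recorded above is a hostname resolving to a public address where the private endpoint address was expected. A minimal shell sketch of that check follows, assuming IPv4 and the RFC 1918 private ranges. The function names `is_private_ipv4` and `check_private_resolution` are illustrative and are not part of the repository's scripts.

```shell
# Hypothetical helper: succeed (exit 0) if the IPv4 address is in an RFC 1918 range.
is_private_ipv4() {
  case "$1" in
    10.*|192.168.*) return 0 ;;
    172.1[6-9].*|172.2[0-9].*|172.3[0-1].*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: resolve a hostname and report whether it maps to a private IP.
# Assumes getent is available (typical on a Linux jump box); run it from inside the VNet.
check_private_resolution() {
  host="$1"
  ip="$(getent ahostsv4 "$host" | awk 'NR==1 {print $1}')"
  if [ -z "$ip" ]; then
    echo "unresolved: $host"
    return 2
  fi
  if is_private_ipv4 "$ip"; then
    echo "private: $host -> $ip"
  else
    echo "public: $host -> $ip"
    return 1
  fi
}
```

Run from the jump box against the Azure AI hostname, a "public" result would reproduce the failure described for this option.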

Option 2: Public Endpoints with IP Allowlist

Configuration:

  • All Azure resources: publicNetworkAccess: Enabled
  • Network ACLs: defaultAction: Deny, specific IPs allowed
  • Access via: Internet from whitelisted IPs

Pros:

  • ✅ Simple configuration
  • ✅ No DNS complexity
  • ✅ Easy to add/remove IPs

Cons:

  • ❌ IP addresses change (home, office, VPN)
  • ❌ Doesn't protect container-to-Azure traffic
  • ❌ Still vulnerable to attacks from allowed IPs
  • ❌ Doesn't scale (need to whitelist every developer)
  • ❌ No audit trail of who accessed what

Testing Result: ⚠️ Not tested (insufficient security for production)

Option 3: Public Endpoints with VNet Restrictions (Selected)

Configuration:

  • Azure AI Services: publicNetworkAccess: Enabled, networkAcls.defaultAction: Deny, allow aci-subnet
  • Key Vault: publicNetworkAccess: Enabled, networkAcls.defaultAction: Deny, bypass: AzureServices, IP allowlist
  • Application Gateway: NSG with HTTP rule (automated via deploy-apps.sh)

Pros:

  • ✅ Simple configuration (no Private DNS needed)
  • ✅ Blocks Internet attacks (default deny)
  • ✅ Allows legitimate traffic (VNet rules)
  • ✅ Saves $3,000/month (no DDoS Protection needed)
  • ✅ Works immediately (no DNS propagation)
  • ✅ Easy to troubleshoot (standard Azure networking)
  • ✅ Azure Services bypass for admin access
  • ✅ Can add private endpoints later if needed

Cons:

  • ⚠️ Traffic uses public IP (but restricted by ACL)
  • ⚠️ Slightly higher attack surface than private endpoint only
  • ⚠️ Requires service endpoints on subnets

Testing Result: Success - All services accessible, Internet blocked, admin access works
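
This option depends on a service endpoint being enabled on the subnet named in the VNet rule. A minimal sketch using `az network vnet subnet update` is shown here. The function name and its three arguments are placeholders, and the sketch assumes the subnet needs no other service endpoints, because `--service-endpoints` takes the subnet's full list rather than appending to it.

```shell
# Hypothetical helper: enable the Cognitive Services service endpoint on a subnet.
# Arguments (all placeholders): resource group, VNet name, subnet name.
enable_ai_service_endpoint() {
  if [ "$#" -ne 3 ]; then
    echo "usage: enable_ai_service_endpoint <resource-group> <vnet> <subnet>" >&2
    return 2
  fi
  az network vnet subnet update \
    --resource-group "$1" \
    --vnet-name "$2" \
    --name "$3" \
    --service-endpoints Microsoft.CognitiveServices
}
```

In this project the endpoint would more naturally be declared in the Bicep network module; the CLI form is mainly useful for checking or repairing an existing subnet.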

Option 4: Hybrid Approach

Configuration:

  • Azure AI Services: Public with VNet restrictions (most accessed)
  • Key Vault: Private endpoint (least accessed, most sensitive)
  • Application Gateway: Public with NSG (user-facing, must be public)

Pros:

  • ✅ Best security for Key Vault
  • ✅ Best performance for Azure AI

Cons:

  • ❌ Most complex (mix of configurations)
  • ❌ Key Vault Private DNS still needed
  • ❌ Higher operational complexity
  • ❌ Harder to understand for new developers

Testing Result: ⚠️ Not tested (complexity outweighs benefits for reference implementation)

Decision Outcome

Chosen Option: Option 3 - Public Endpoints with VNet Restrictions

Rationale

  1. Proven to Work: Tested successfully in dev environment (ldfdev14-rg)
  2. Industry Standard: Microsoft Azure Quickstart templates use this approach
  3. Security Sufficient: Defense-in-depth with multiple layers
  4. Developer Experience: Works immediately, no complex DNS troubleshooting
  5. Cost Effective: Saves $3,000+/month vs public + DDoS Protection
  6. Maintainable: Simple to understand, document, and troubleshoot
  7. Flexible: Can add private endpoints later if security requirements change

Implementation Details

Azure AI Services Configuration

// infrastructure/bicep/modules/ai-services.bicep
resource aiServices 'Microsoft.CognitiveServices/accounts@2023-05-01' = {
  properties: {
    publicNetworkAccess: 'Enabled'
    networkAcls: {
      defaultAction: 'Deny'
      bypass: 'AzureServices'
      virtualNetworkRules: [for subnetId in allowedSubnetIds: {
        id: subnetId  // aci-subnet
        ignoreMissingVnetServiceEndpoint: false
      }]
    }
  }
}

Parameters:

// infrastructure/bicep/environments/dev/substrate.bicepparam
param publicNetworkAccess = 'Enabled'  // Changed from 'Disabled'

Security Impact:

  • ✅ Blocks all Internet traffic (defaultAction: Deny)
  • ✅ Allows only ACI containers (via service endpoint)
  • ✅ Azure Services bypass for management operations
  • ✅ Saves ~$3,000/month (no DDoS Protection Standard)
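
After deployment, the default-deny setting can be confirmed from the CLI. The following is a minimal sketch, not part of the repository's scripts: `verify_ai_default_deny` is a hypothetical name, and the resource group and account name are supplied by the caller.

```shell
# Hypothetical check: confirm the account's network ACL default action is Deny.
# Arguments (placeholders): resource group, Azure AI Services account name.
verify_ai_default_deny() {
  action="$(az cognitiveservices account show \
    --resource-group "$1" \
    --name "$2" \
    --query "properties.networkAcls.defaultAction" \
    --output tsv)"
  if [ "$action" = "Deny" ]; then
    echo "OK: defaultAction is Deny"
  else
    echo "WARNING: defaultAction is '${action:-unset}', expected Deny" >&2
    return 1
  fi
}
```

A non-zero exit makes the check usable as a gate in a deployment script.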

Key Vault Configuration

// infrastructure/bicep/modules/security.bicep
module keyVault 'br/public:avm/res/key-vault/vault:0.12.0' = {
  params: {
    publicNetworkAccess: 'Enabled'
    networkAcls: {
      bypass: 'AzureServices'  // Allows Bastion, Portal, deployment scripts
      defaultAction: 'Deny'
      ipRules: [for ip in adminAllowedIpAddresses: {
        value: ip  // e.g., '24.164.141.93/32'
      }]
    }
  }
}

Parameters:

// infrastructure/bicep/environments/dev/foundation.bicepparam
param adminAllowedIpAddresses = [
  '24.164.141.93/32'  // Admin IP - update per environment
]

Security Impact:

  • ✅ Blocks all Internet traffic by default
  • ✅ Azure Services bypass enables Bastion/Portal access
  • ✅ IP allowlist for admin access (configurable per environment)
  • ✅ Easy to add/remove admin IPs via parameter file

Application Gateway NSG Configuration

# infrastructure/scripts/deploy-apps.sh (automated)
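# RESOURCE_GROUP and NSG_NAME are placeholders; set them for your environment
az network nsg rule create \
  --resource-group "$RESOURCE_GROUP" \
  --nsg-name "$NSG_NAME" \
  --name AllowHttpInbound \
  --priority 100 \
  --direction Inbound \
  --source-address-prefixes Internet \
  --destination-port-ranges 80 \
  --access Allow \
  --protocol Tcp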

Automation:

  • ✅ Runs automatically after ACI deployment
  • ✅ Idempotent (checks if rule exists)
  • ✅ Includes security warnings
  • ✅ Logs all actions

Security Impact:

  • ✅ Enables HTTP access for users (required for web UI)
  • ⚠️ HTTP is publicly accessible (port 80)
  • ✅ Clear warnings about HTTPS/WAF for production
  • ✅ Easy to restrict to specific IPs if needed
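
The idempotency and warning behavior described above could be implemented roughly as follows. This is a sketch under assumptions, not the actual contents of deploy-apps.sh: `ensure_http_rule` is a hypothetical name, and the resource group and NSG name are passed in by the caller.

```shell
# Hypothetical helper: create the HTTP allow rule only if it does not already exist.
# Arguments (placeholders): resource group, NSG name.
ensure_http_rule() {
  rg="$1"
  nsg="$2"
  # "rule show" exits non-zero when the rule is missing, which drives the branch.
  if az network nsg rule show --resource-group "$rg" --nsg-name "$nsg" \
       --name AllowHttpInbound >/dev/null 2>&1; then
    echo "AllowHttpInbound already exists on $nsg; skipping"
    return 0
  fi
  echo "WARNING: opening port 80 to the Internet; use HTTPS/WAF for production" >&2
  az network nsg rule create \
    --resource-group "$rg" --nsg-name "$nsg" \
    --name AllowHttpInbound --priority 100 --direction Inbound \
    --source-address-prefixes Internet --destination-port-ranges 80 \
    --access Allow --protocol Tcp >/dev/null
  echo "AllowHttpInbound created on $nsg"
}
```

Running the function twice leaves a single rule in place, so the deployment script can be re-run safely.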

Configuration Summary

| Resource | Public Access | Default Action | Allowed Sources | Security Level |
|---|---|---|---|---|
| Azure AI Services | Enabled | Deny | aci-subnet only | High |
| Key Vault | Enabled | Deny | Azure Services + Admin IPs | High |
| Application Gateway | N/A (public service) | N/A | Internet (port 80) | Medium (WAF available) |
| Storage Account | Disabled | N/A | Private endpoint only | Highest |
| ACI Containers | Private | N/A | VNet-integrated | High |

Consequences

Positive

  1. Security Hardening Achieved
     • Azure AI: Internet attacks blocked, only VNet access
     • Key Vault: IP-restricted, Azure Services bypass working
     • Application Gateway: HTTP accessible with security warnings
     • Cost savings: ~$3,000/month (no DDoS Protection needed)

  2. Developer Experience Improved
     • No manual NSG configuration (automated in deploy-apps.sh)
     • Clear security warnings during deployment
     • Comprehensive troubleshooting documentation
     • Works immediately after deployment

  3. Operational Simplicity
     • No Private DNS zones to manage
     • No DNS propagation delays
     • Standard Azure networking patterns
     • Easy to troubleshoot with Azure Portal

  4. Flexibility Maintained
     • Can add private endpoints later if needed
     • Can restrict IPs further if required
     • Can enable WAF for production
     • Easy to customize per environment

Negative

  1. Traffic Uses Public IPs
     • ACI → Azure AI traffic uses public endpoint (but restricted by ACL)
     • Slightly higher latency than private endpoint
     • Internet egress charges (though minimal)

  2. HTTP Enabled by Default
     • Application Gateway allows HTTP (port 80)
     • Requires explicit HTTPS configuration for production
     • Users must add SSL certificate manually

  3. IP Allowlist Management
     • Key Vault requires updating adminAllowedIpAddresses when IPs change
     • Each environment needs separate IP configuration
     • No automatic IP detection

Mitigation Strategies

  1. For Public IP Concern:
     • Monitor for suspicious access patterns (Application Insights)
     • Consider adding private endpoints for production if needed
     • Document that traffic is still restricted by VNet ACLs

  2. For HTTP Concern:
     • Clear security warnings in deployment output
     • Production hardening checklist includes HTTPS/WAF
     • Document SSL certificate configuration process
     • Environment-based warnings (dev vs prod)

  3. For IP Management:
     • Document how to update adminAllowedIpAddresses
     • Consider Azure Policy to enforce IP updates
     • Provide script to automatically detect current IP
     • Use Key Vault for IP list (if rotating frequently)
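
The IP-detection script mentioned in the mitigations could look something like the sketch below. It assumes a public IP echo service (api.ipify.org is used here only as an example; substitute any service you trust), and the function names `to_cidr32` and `detect_admin_cidr` are hypothetical.

```shell
# Hypothetical helper: validate an IPv4 address and print it in /32 CIDR form.
to_cidr32() {
  ip="$1"
  if ! printf '%s\n' "$ip" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'; then
    echo "not an IPv4 address: $ip" >&2
    return 1
  fi
  # Split on dots and check that each octet is within 0-255.
  old_ifs="$IFS"; IFS=.
  set -- $ip
  IFS="$old_ifs"
  for octet in "$@"; do
    [ "$octet" -le 255 ] || { echo "octet out of range: $octet" >&2; return 1; }
  done
  echo "$ip/32"
}

# Hypothetical helper: print the caller's current public IP as a /32 entry
# suitable for pasting into adminAllowedIpAddresses.
detect_admin_cidr() {
  current_ip="$(curl -fsS https://api.ipify.org)" || return 1
  to_cidr32 "$current_ip"
}
```

The sketch only prints the value, so the parameter file stays the single source of truth for the allowlist.
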

Related ADRs

  • ADR-047: User-Assigned Managed Identity (simplifies RBAC)
  • ADR-048: Key Vault for Deployment Outputs (requires Key Vault access)
  • ADR-049: Azure Container Instances (VNet integration)
  • ADR-050: Azure Bastion for VNet Access (admin access pattern)
  • ADR-054: Application Gateway in Substrate Layer (public endpoint)

References

Microsoft Documentation

Internal Documentation

  • SECURITY.md - Network Security Configuration section
  • docs/troubleshooting/common-deployment-errors.md - NSG, Key Vault, Azure AI troubleshooting
  • temp/app-gateway-nsg-requirements.md - Detailed NSG analysis
  • temp/azure-ai-security-analysis.md - Private endpoint testing results
  • temp/manual-changes-2025-10-30.md - Complete audit trail

Testing Evidence

  • Dev environment: ldfdev14-rg (2025-10-30)
  • Azure AI: VNet restrictions active and working
  • Key Vault: IP allowlist working with Bastion access
  • Application Gateway: NSG automation tested
  • Application: http://ldfdev14-dev.eastus2.cloudapp.azure.com/ (operational)

Implementation Checklist

Foundation Layer

  • Add adminAllowedIpAddresses parameter to foundation.bicep
  • Update security.bicep with Key Vault networkAcls
  • Add parameter to foundation.bicepparam with default IP
  • Document IP update process

Substrate Layer

  • Change publicNetworkAccess from 'Disabled' to 'Enabled' in substrate.bicepparam
  • Verify ai-services.bicep has networkAcls configuration
  • Document VNet restriction behavior

Apps Layer

  • Add NSG automation to deploy-apps.sh
  • Include security warnings in deployment output
  • Make idempotent (check if rule exists)
  • Log all actions to deployment log

Documentation

  • Create ADR-058 (this document)
  • Update SECURITY.md with network security section
  • Create troubleshooting guide for common errors
  • Document production hardening steps

Testing

  • Test in dev environment (ldfdev14-rg)
  • Verify Azure AI accessible from containers only
  • Verify Key Vault accessible via Portal/Bastion
  • Verify Application Gateway UI accessible
  • Test in production environment (pending)

Production Deployment Guidance

Before Production Deployment

  1. Update IP Allowlist

    // foundation.bicepparam (production)
    param adminAllowedIpAddresses = [
      'YOUR_PRODUCTION_IP/32'
    ]

  2. Enable HTTPS
     • Obtain SSL certificate
     • Configure on Application Gateway
     • Redirect HTTP → HTTPS

  3. Enable WAF

    // substrate.bicepparam (production)
    param appGatewaySku = 'WAF_v2'
    param enableWaf = true

  4. Review Security Checklist
     • See SECURITY.md for complete checklist
     • Verify all NSG rules
     • Test failover scenarios
     • Enable monitoring alerts

Production-Specific Considerations

  1. Consider Private Endpoints if:
     • Strict compliance requirements (PCI-DSS, HIPAA)
     • Zero Internet exposure mandated
     • Budget allows for operational complexity

  2. Restrict Application Gateway if:
     • Only specific IP ranges should access
     • Internal application only
     • Using Azure Front Door for public access

  3. Monitor and Alert on:
     • Failed authentication attempts (Key Vault)
     • Unusual traffic patterns (Azure AI)
     • NSG rule changes (auditing)
     • DDoS attempts (if public)

Review Schedule

This ADR should be reviewed:

  • ✅ Immediately: Before next production deployment
  • 📅 Q1 2026: After 3 months of production operation
  • 📅 Q3 2026: When Azure AI Foundry reaches GA
  • 📅 Annually: As part of security review cycle

Next Review Date: 2026-01-30


Decision Made By: System Architecture Team
Approved By: Security Engineering
Implementation Date: 2025-10-30
Status: ✅ Implemented in Dev, Ready for Production