ADR-058: Network Security - Public Endpoints with VNet Restrictions

Date: 2025-10-30
Status: Accepted
Deciders: System Architecture, Security Engineering
Technical Story: Network security hardening for Azure AI Services, Key Vault, and Application Gateway

Context and Problem Statement

After deploying the multi-agent loan processing system to Azure, we encountered several network access issues:

Azure AI Services: Rate limit errors due to low capacity (50K TPM)
Azure AI Services: 403 "Public access disabled" errors from ACI containers
Key Vault: "Public network access disabled" blocking Bastion and Portal access
Application Gateway: Users unable to access UI (NSG missing HTTP rule)
MCP Servers: Connection timeouts due to missing /mcp endpoint path

We needed to balance three competing requirements:

- Security: Protect Azure resources from Internet attacks
- Functionality: Allow legitimate traffic (containers, admin access, users)
- Developer Experience: Minimize manual configuration steps

The key decision: Should we use private endpoints only, or public endpoints with VNet restrictions?

Decision Drivers

Technical Factors

Private Endpoint Complexity: Requires Private DNS zones, A records, VNet links, DNS propagation time
DNS Resolution Issues: Azure AI Foundry uses services.ai.azure.com (no auto-configured Private DNS)
Development Velocity: Need fast iteration cycles for development/testing
Operational Simplicity: Minimize moving parts and failure modes

Security Factors

Defense in Depth: Multiple security layers (NSG, VNet ACLs, app auth)
Attack Surface: Minimize Internet-exposed endpoints
DDoS Protection: Avoid $3,000/month DDoS Protection Standard cost
Compliance: Meet enterprise security requirements

Cost Factors

Private Endpoint Costs: $0.01/GB processed + endpoint charges
Internet Egress: $0.08-0.12/GB for public traffic
DDoS Protection: $3,000/month if using public endpoints without restrictions
Operational Time: Developer hours spent troubleshooting

User Experience Factors

Deployment Time: End-to-end deployment should complete in <60 minutes
Manual Steps: Minimize required manual configuration
Documentation Burden: Clear, actionable troubleshooting guides
Error Messages: Understandable, with clear remediation steps

Considered Options

Option 1: Private Endpoints Only (Most Secure)

Configuration:

- All Azure resources: publicNetworkAccess: Disabled
- Private endpoints on data-subnet (10.0.3.0/24)
- Private DNS zones for all Azure services
- Access via: Bastion → Jump Box → private IPs

Pros:

- ✅ Zero Internet exposure
- ✅ Lowest attack surface
- ✅ Best compliance posture
- ✅ Lower latency (VNet internal routing)

Cons:

- ❌ Complex DNS configuration (multiple Private DNS zones)
- ❌ services.ai.azure.com domain not auto-configured
- ❌ DNS propagation delays (5-15 minutes)
- ❌ Difficult to troubleshoot DNS issues
- ❌ Bastion required for all admin access
- ❌ Higher operational complexity

Testing Result: ❌ Failed - DNS resolved to public IP (20.119.156.137) instead of private IP (10.0.3.8)
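The failure above is a DNS symptom: the hostname resolved to a public address rather than one inside data-subnet (10.0.3.0/24). A minimal sketch of that check follows. The function names `in_data_subnet` and `classify_resolution` are illustrative and do not come from the repository, and the sketch assumes the resolved address has already been obtained (for example with `nslookup` on the jump box).

```shell
#!/bin/sh
# Hypothetical helpers (not from the repository) for the DNS check.

# in_data_subnet IP: succeeds when IP lies inside data-subnet (10.0.3.0/24).
in_data_subnet() {
  case "$1" in
    10.0.3.*) ;;
    *) return 1 ;;
  esac
  last_octet="${1#10.0.3.}"
  case "$last_octet" in
    ''|*[!0-9]*) return 1 ;;   # empty, or contains a non-digit
  esac
  [ "$last_octet" -le 255 ]
}

# classify_resolution IP: report whether a resolved address is in data-subnet.
classify_resolution() {
  if in_data_subnet "$1"; then
    echo "inside data-subnet"
  else
    echo "outside data-subnet"
  fi
}

classify_resolution 10.0.3.8         # address expected via the private endpoint
classify_resolution 20.119.156.137   # address actually returned during testing
```

The two sample calls use the addresses recorded in the testing result; the first reports "inside data-subnet" and the second "outside data-subnet", which is the condition that made Option 1 fail.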

Option 2: Public Endpoints with IP Allowlist

Configuration:

- All Azure resources: publicNetworkAccess: Enabled
- Network ACLs: defaultAction: Deny, specific IPs allowed
- Access via: Internet from whitelisted IPs

Pros:

- ✅ Simple configuration
- ✅ No DNS complexity
- ✅ Easy to add/remove IPs

Cons:

- ❌ IP addresses change (home, office, VPN)
- ❌ Doesn't protect container-to-Azure traffic
- ❌ Still vulnerable to attacks from allowed IPs
- ❌ Doesn't scale (need to whitelist every developer)
- ❌ No audit trail of who accessed what

Testing Result: ⚠️ Not tested (insufficient security for production)

Option 3: Public Endpoints with VNet Restrictions (Selected)

Configuration:

- Azure AI Services: publicNetworkAccess: Enabled, networkAcls.defaultAction: Deny, allow aci-subnet
- Key Vault: publicNetworkAccess: Enabled, networkAcls.defaultAction: Deny, bypass: AzureServices, IP allowlist
- Application Gateway: NSG with HTTP rule (automated via deploy-apps.sh)

Pros:

- ✅ Simple configuration (no Private DNS needed)
- ✅ Blocks Internet attacks (default deny)
- ✅ Allows legitimate traffic (VNet rules)
- ✅ Saves $3,000/month (no DDoS Protection needed)
- ✅ Works immediately (no DNS propagation)
- ✅ Easy to troubleshoot (standard Azure networking)
- ✅ Azure Services bypass for admin access
- ✅ Can add private endpoints later if needed

Cons:

- ⚠️ Traffic uses public IP (but restricted by ACL)
- ⚠️ Slightly higher attack surface than private endpoint only
- ⚠️ Requires service endpoints on subnets

Testing Result: ✅ Success - All services accessible, Internet blocked, admin access works

Option 4: Hybrid Approach

Configuration:

- Azure AI Services: Public with VNet restrictions (most accessed)
- Key Vault: Private endpoint (least accessed, most sensitive)
- Application Gateway: Public with NSG (user-facing, must be public)

Pros:

- ✅ Best security for Key Vault
- ✅ Best performance for Azure AI

Cons:

- ❌ Most complex (mix of configurations)
- ❌ Key Vault Private DNS still needed
- ❌ Higher operational complexity
- ❌ Harder to understand for new developers

Testing Result: ⚠️ Not tested (complexity outweighs benefits for reference implementation)

Decision Outcome

Chosen Option: Option 3 - Public Endpoints with VNet Restrictions

Rationale

Proven to Work: Tested successfully in dev environment (ldfdev14-rg)
Industry Standard: Microsoft Azure Quickstart templates use this approach
Security Sufficient: Defense-in-depth with multiple layers
Developer Experience: Works immediately, no complex DNS troubleshooting
Cost Effective: Saves $3,000+/month vs public + DDoS Protection
Maintainable: Simple to understand, document, and troubleshoot
Flexible: Can add private endpoints later if security requirements change

Implementation Details

Azure AI Services Configuration

// infrastructure/bicep/modules/ai-services.bicep
resource aiServices 'Microsoft.CognitiveServices/accounts@2023-05-01' = {
  // Excerpt: name, location, kind, and sku are omitted here
  properties: {
    publicNetworkAccess: 'Enabled'
    networkAcls: {
      defaultAction: 'Deny'
      bypass: 'AzureServices'
      virtualNetworkRules: [for subnetId in allowedSubnetIds: {
        id: subnetId  // aci-subnet
        ignoreMissingVnetServiceEndpoint: false
      }]
    }
  }
}

Parameters:

// infrastructure/bicep/environments/dev/substrate.bicepparam
param publicNetworkAccess = 'Enabled'  // Changed from 'Disabled'

Security Impact:

- ✅ Blocks all Internet traffic (defaultAction: Deny)
- ✅ Allows only ACI containers (via service endpoint)
- ✅ Azure Services bypass for management operations
- ✅ Saves ~$3,000/month (no DDoS Protection Standard)
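Because the rule sets ignoreMissingVnetServiceEndpoint to false, the subnet needs the Microsoft.CognitiveServices service endpoint before the ACL can reference it. The sketch below is one way to check that ahead of a deployment. The function name `has_cognitive_endpoint` and its arguments are hypothetical and not part of the repository.

```shell
#!/bin/sh
# Hypothetical pre-deployment check; resource names are supplied by the caller.

# has_cognitive_endpoint RESOURCE_GROUP VNET_NAME SUBNET_NAME
# Succeeds when the subnet lists the Microsoft.CognitiveServices service endpoint.
has_cognitive_endpoint() {
  az network vnet subnet show \
    --resource-group "$1" \
    --vnet-name "$2" \
    --name "$3" \
    --query "serviceEndpoints[].service" \
    --output tsv | grep -qx "Microsoft.CognitiveServices"
}
```

A caller might run `has_cognitive_endpoint "$RESOURCE_GROUP" "$VNET_NAME" aci-subnet || echo "service endpoint missing on aci-subnet"`, where both variables are placeholders for the environment's own names.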

Key Vault Configuration

// infrastructure/bicep/modules/security.bicep
module keyVault 'br/public:avm/res/key-vault/vault:0.12.0' = {
  params: {
    publicNetworkAccess: 'Enabled'
    networkAcls: {
      bypass: 'AzureServices'  // Trusted Microsoft services only; Portal and admin clients must match ipRules
      defaultAction: 'Deny'
      ipRules: [for ip in adminAllowedIpAddresses: {
        value: ip  // e.g., '24.164.141.93/32'
      }]
    }
  }
}

Parameters:

// infrastructure/bicep/environments/dev/foundation.bicepparam
param adminAllowedIpAddresses = [
  '24.164.141.93/32'  // Admin IP - update per environment
]

Security Impact:

- ✅ Blocks all Internet traffic by default
- ✅ Azure Services bypass admits trusted Microsoft services; Portal and jump-box clients must still come from an allowlisted IP
- ✅ IP allowlist for admin access (configurable per environment)
- ✅ Easy to add/remove admin IPs via parameter file
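A malformed entry in adminAllowedIpAddresses only surfaces at deployment time, so a quick syntax check before editing the parameter file can save a cycle. The following is a sketch under stated assumptions: `is_ipv4_cidr` and `validate_allowlist` are hypothetical helpers, and they check only the a.b.c.d/n shape, not any Key Vault specific restrictions.

```shell
#!/bin/sh
# Hypothetical validator for adminAllowedIpAddresses entries.

# is_ipv4_cidr ENTRY: succeeds for a dotted-quad IPv4 address with a /0-/32 prefix.
is_ipv4_cidr() {
  # Shape check: four groups of 1-3 digits, a slash, and a prefix length of 0-32
  printf '%s\n' "$1" |
    grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}/([0-9]|[12][0-9]|3[0-2])$' || return 1
  # Range check: each octet must be at most 255
  addr="${1%/*}"
  old_ifs="$IFS"; IFS=.
  set -- $addr
  IFS="$old_ifs"
  for octet in "$@"; do
    [ "$octet" -le 255 ] || return 1
  done
  return 0
}

# validate_allowlist ENTRY...: report each entry; fail if any entry is invalid.
validate_allowlist() {
  status=0
  for entry in "$@"; do
    if is_ipv4_cidr "$entry"; then
      echo "ok: $entry"
    else
      echo "invalid: $entry"
      status=1
    fi
  done
  return "$status"
}

validate_allowlist '24.164.141.93/32'
```

The sample call uses the address from the parameter file above and prints `ok: 24.164.141.93/32`.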

Application Gateway NSG Configuration

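# infrastructure/scripts/deploy-apps.sh (automated)
# RESOURCE_GROUP and NSG_NAME are placeholders; this excerpt omitted both required arguments.
az network nsg rule create \
  --resource-group "$RESOURCE_GROUP" \
  --nsg-name "$NSG_NAME" \
  --name AllowHttpInbound \
  --priority 100 \
  --direction Inbound \
  --source-address-prefixes Internet \
  --destination-port-ranges 80 \
  --access Allow \
  --protocol Tcp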

Automation:

- ✅ Runs automatically after ACI deployment
- ✅ Idempotent (checks if rule exists)
- ✅ Includes security warnings
- ✅ Logs all actions
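The "check, then create" behaviour described above could look roughly like the sketch below. It is not the actual deploy-apps.sh: the function name `ensure_http_rule`, its arguments, and the message text are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch of an idempotent NSG rule step; not the actual deploy-apps.sh.

# ensure_http_rule RESOURCE_GROUP NSG_NAME
ensure_http_rule() {
  rg="$1"; nsg="$2"
  # "show" fails when the rule is absent, so success means nothing to do
  if az network nsg rule show \
       --resource-group "$rg" --nsg-name "$nsg" \
       --name AllowHttpInbound >/dev/null 2>&1; then
    echo "AllowHttpInbound already exists on $nsg; nothing to do"
    return 0
  fi
  echo "WARNING: opening port 80 to the Internet; use HTTPS/WAF in production" >&2
  az network nsg rule create \
    --resource-group "$rg" --nsg-name "$nsg" \
    --name AllowHttpInbound \
    --priority 100 \
    --direction Inbound \
    --source-address-prefixes Internet \
    --destination-port-ranges 80 \
    --access Allow \
    --protocol Tcp >/dev/null || return 1
  echo "AllowHttpInbound created on $nsg"
}
```

Running `ensure_http_rule "$RESOURCE_GROUP" "$NSG_NAME"` twice should create the rule on the first call and report that it already exists on the second; both variables are placeholders.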

Security Impact:

- ✅ Enables HTTP access for users (required for web UI)
- ⚠️ HTTP is publicly accessible (port 80)
- ✅ Clear warnings about HTTPS/WAF for production
- ✅ Easy to restrict to specific IPs if needed

Configuration Summary

| Resource | Public Access | Default Action | Allowed Sources | Security Level |
|---|---|---|---|---|
| Azure AI Services | Enabled | Deny | aci-subnet only | High |
| Key Vault | Enabled | Deny | Azure Services + Admin IPs | High |
| Application Gateway | N/A (public service) | N/A | Internet (port 80) | Medium (WAF available) |
| Storage Account | Disabled | N/A | Private endpoint only | Highest |
| ACI Containers | Private | N/A | VNet-integrated | High |

Consequences

Positive

Security Hardening Achieved:

- Azure AI: Internet attacks blocked, only VNet access
- Key Vault: IP-restricted, Azure Services bypass working
- Application Gateway: HTTP accessible with security warnings
- Cost savings: ~$3,000/month (no DDoS Protection needed)

Developer Experience Improved:

- No manual NSG configuration (automated in deploy-apps.sh)
- Clear security warnings during deployment
- Comprehensive troubleshooting documentation
- Works immediately after deployment

Operational Simplicity:

- No Private DNS zones to manage
- No DNS propagation delays
- Standard Azure networking patterns
- Easy to troubleshoot with Azure Portal

Flexibility Maintained:

- Can add private endpoints later if needed
- Can restrict IPs further if required
- Can enable WAF for production
- Easy to customize per environment

Negative

Traffic Uses Public IPs:

- ACI → Azure AI traffic uses public endpoint (but restricted by ACL)
- Slightly higher latency than private endpoint
- Internet egress charges (though minimal)

HTTP Enabled by Default:

- Application Gateway allows HTTP (port 80)
- Requires explicit HTTPS configuration for production
- Users must add SSL certificate manually

IP Allowlist Management:

- Key Vault requires updating adminAllowedIpAddresses when IPs change
- Each environment needs separate IP configuration
- No automatic IP detection

Mitigation Strategies

For Public IP Concern:

- Monitor for suspicious access patterns (Application Insights)
- Consider adding private endpoints for production if needed
- Document that traffic is still restricted by VNet ACLs

For HTTP Concern:

- Clear security warnings in deployment output
- Production hardening checklist includes HTTPS/WAF
- Document SSL certificate configuration process
- Environment-based warnings (dev vs prod)

For IP Management:

- Document how to update adminAllowedIpAddresses
- Consider Azure Policy to enforce IP updates
- Provide script to automatically detect current IP
- Use Key Vault for IP list (if rotating frequently)
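The "detect current IP" mitigation could be sketched as below. This is an illustration, not an existing script in the repository: the helper name `current_ip_cidr` is hypothetical, and the lookup URL is an assumption that can be swapped for any endpoint returning the caller's address as plain text.

```shell
#!/bin/sh
# Hypothetical helper: print the caller's public IPv4 address as a /32 entry.
# The lookup URL is an assumption; override IP_LOOKUP_URL to use another service.
IP_LOOKUP_URL="${IP_LOOKUP_URL:-https://api.ipify.org}"

current_ip_cidr() {
  ip="$(curl --silent --fail --max-time 10 "$IP_LOOKUP_URL")" || return 1
  # Accept only a dotted-quad reply before it goes anywhere near an allowlist
  printf '%s\n' "$ip" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$' || return 1
  echo "$ip/32"
}
```

The printed value has the same form as the entries in adminAllowedIpAddresses, so it can be pasted into foundation.bicepparam after review. Rejecting anything that is not a dotted quad keeps an error page or empty reply from being written into the allowlist.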

Related Decisions

ADR-047: User-Assigned Managed Identity (simplifies RBAC)
ADR-048: Key Vault for Deployment Outputs (requires Key Vault access)
ADR-049: Azure Container Instances (VNet integration)
ADR-050: Azure Bastion for VNet Access (admin access pattern)
ADR-054: Application Gateway in Substrate Layer (public endpoint)

References

Internal Documentation

SECURITY.md - Network Security Configuration section
docs/troubleshooting/common-deployment-errors.md - NSG, Key Vault, Azure AI troubleshooting
temp/app-gateway-nsg-requirements.md - Detailed NSG analysis
temp/azure-ai-security-analysis.md - Private endpoint testing results
temp/manual-changes-2025-10-30.md - Complete audit trail

Testing Evidence

Dev environment: ldfdev14-rg (2025-10-30)
Azure AI: VNet restrictions active and working
Key Vault: IP allowlist working with Bastion access
Application Gateway: NSG automation tested
Application: http://ldfdev14-dev.eastus2.cloudapp.azure.com/ (operational)

Implementation Checklist

Foundation Layer

Add adminAllowedIpAddresses parameter to foundation.bicep
Update security.bicep with Key Vault networkAcls
Add parameter to foundation.bicepparam with default IP
Document IP update process

Substrate Layer

Change publicNetworkAccess from 'Disabled' to 'Enabled' in substrate.bicepparam
Verify ai-services.bicep has networkAcls configuration
Document VNet restriction behavior

Apps Layer

Add NSG automation to deploy-apps.sh
Include security warnings in deployment output
Make idempotent (check if rule exists)
Log all actions to deployment log

Documentation

Create ADR-058 (this document)
Update SECURITY.md with network security section
Create troubleshooting guide for common errors
Document production hardening steps

Testing

Test in dev environment (ldfdev14-rg)
Verify Azure AI accessible from containers only
Verify Key Vault accessible via Portal/Bastion
Verify Application Gateway UI accessible
Test in production environment (pending)
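The verification items above can be turned into a small repeatable smoke test. The sketch below assumes the checks are expressed as expected HTTP status codes; `http_status` and `expect_status` are hypothetical helpers, and the URLs and expected codes are supplied by the caller.

```shell
#!/bin/sh
# Hypothetical smoke test helpers; URLs and expected codes come from the caller.

# http_status URL: print the HTTP status code curl observed for URL.
http_status() {
  curl --silent --output /dev/null --max-time 15 --write-out '%{http_code}' "$1"
}

# expect_status URL EXPECTED_CODE LABEL: print PASS/FAIL and fail on mismatch.
expect_status() {
  # "|| true" keeps a failed connection from aborting scripts run with set -e
  actual="$(http_status "$1")" || true
  if [ "$actual" = "$2" ]; then
    echo "PASS: $3 returned $actual"
  else
    echo "FAIL: $3 returned $actual, expected $2"
    return 1
  fi
}
```

For example, `expect_status http://ldfdev14-dev.eastus2.cloudapp.azure.com/ 200 "Application Gateway UI"` covers the UI check. For the Azure AI check, a request from outside the VNet would be expected to be rejected; 403 is the code this ADR recorded when access was blocked, but the exact code should be confirmed against the deployed endpoint.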

Production Deployment Guidance

Before Production Deployment

Update IP Allowlist

// foundation.bicepparam (production)
param adminAllowedIpAddresses = [
  'YOUR_PRODUCTION_IP/32'
]

Enable HTTPS

- Obtain SSL certificate
- Configure on Application Gateway
- Redirect HTTP → HTTPS

Enable WAF

// substrate.bicepparam (production)
param appGatewaySku = 'WAF_v2'
param enableWaf = true

Review Security Checklist

- See SECURITY.md for complete checklist
- Verify all NSG rules
- Test failover scenarios
- Enable monitoring alerts

Production-Specific Considerations

Consider Private Endpoints if:

- Strict compliance requirements (PCI-DSS, HIPAA)
- Zero Internet exposure mandated
- Budget allows for operational complexity

Restrict Application Gateway if:

- Only specific IP ranges should access
- Internal application only
- Using Azure Front Door for public access

Monitor and Alert on:

- Failed authentication attempts (Key Vault)
- Unusual traffic patterns (Azure AI)
- NSG rule changes (auditing)
- DDoS attempts (if public)

Review Schedule

This ADR should be reviewed:

- ✅ Immediately: Before next production deployment
- 📅 Q1 2026: After 3 months of production operation
- 📅 Q3 2026: When Azure AI Foundry GA releases
- 📅 Annually: As part of security review cycle

Next Review Date: 2026-01-30

Decision Made By: System Architecture Team
Approved By: Security Engineering
Implementation Date: 2025-10-30
Status: ✅ Implemented in Dev, Ready for Production