ADR-058: Network Security - Public Endpoints with VNet Restrictions
Date: 2025-10-30
Status: Accepted
Deciders: System Architecture, Security Engineering
Technical Story: Network security hardening for Azure AI Services, Key Vault, and Application Gateway
Context and Problem Statement
After deploying the multi-agent loan processing system to Azure, we encountered several network access issues:
- Azure AI Services: Rate limit errors due to low capacity (50K TPM)
- Azure AI Services: 403 "Public access disabled" errors from ACI containers
- Key Vault: "Public network access disabled" blocking Bastion and Portal access
- Application Gateway: Users unable to access UI (NSG missing HTTP rule)
- MCP Servers: Connection timeouts due to missing /mcp endpoint path
We needed to balance three competing requirements:
- Security: Protect Azure resources from Internet attacks
- Functionality: Allow legitimate traffic (containers, admin access, users)
- Developer Experience: Minimize manual configuration steps
The key decision: Should we use private endpoints only, or public endpoints with VNet restrictions?
Decision Drivers
Technical Factors
- Private Endpoint Complexity: Requires Private DNS zones, A records, VNet links, DNS propagation time
- DNS Resolution Issues: Azure AI Foundry uses services.ai.azure.com (no auto-configured Private DNS)
- Development Velocity: Need fast iteration cycles for development/testing
- Operational Simplicity: Minimize moving parts and failure modes
Security Factors
- Defense in Depth: Multiple security layers (NSG, VNet ACLs, app auth)
- Attack Surface: Minimize Internet-exposed endpoints
- DDoS Protection: Avoid $3,000/month DDoS Protection Standard cost
- Compliance: Meet enterprise security requirements
Cost Factors
- Private Endpoint Costs: $0.01/GB processed + endpoint charges
- Internet Egress: $0.08-0.12/GB for public traffic
- DDoS Protection: $3,000/month if using public endpoints without restrictions
- Operational Time: Developer hours spent troubleshooting
User Experience Factors
- Deployment Time: End-to-end deployment should complete in <60 minutes
- Manual Steps: Minimize required manual configuration
- Documentation Burden: Clear, actionable troubleshooting guides
- Error Messages: Understandable, with clear remediation steps
Considered Options
Option 1: Private Endpoints Only (Most Secure)
Configuration:
- All Azure resources: publicNetworkAccess: Disabled
- Private endpoints on data-subnet (10.0.3.0/24)
- Private DNS zones for all Azure services
- Access via: Bastion → Jump Box → private IPs
Pros:
- ✅ Zero Internet exposure
- ✅ Lowest attack surface
- ✅ Best compliance posture
- ✅ Lower latency (VNet internal routing)
Cons:
- ❌ Complex DNS configuration (multiple Private DNS zones)
- ❌ services.ai.azure.com domain not auto-configured
- ❌ DNS propagation delays (5-15 minutes)
- ❌ Difficult to troubleshoot DNS issues
- ❌ Bastion required for all admin access
- ❌ Higher operational complexity
Testing Result: ❌ Failed - DNS resolved to public IP (20.119.156.137) instead of private IP (10.0.3.8)
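The failure above (a hostname resolving to a public IP where a private one was expected) can be checked from inside the VNet with a few lines of Python. This is a minimal sketch, not part of the tested deployment: the function names are hypothetical, and it relies only on the standard `socket` and `ipaddress` modules.

```python
from __future__ import annotations

import ipaddress
import socket


def is_private_address(ip: str) -> bool:
    """True when Python's ipaddress module classifies ip as private."""
    # Drop an IPv6 scope suffix such as "%eth0" before parsing.
    return ipaddress.ip_address(ip.split("%")[0]).is_private


def resolved_addresses(hostname: str) -> set[str]:
    """Resolve hostname with the system resolver and return its addresses."""
    infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
    return {info[4][0] for info in infos}


def resolves_privately(hostname: str) -> bool:
    """True only if every address the hostname resolves to is private."""
    addresses = resolved_addresses(hostname)
    return bool(addresses) and all(is_private_address(a) for a in addresses)
```

Running `resolves_privately(...)` against the AI Services hostname from a container or the jump box would return `False` in the failing case recorded above, because the resolver handed back the public address rather than the private endpoint address.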
Option 2: Public Endpoints with IP Allowlist
Configuration:
- All Azure resources: publicNetworkAccess: Enabled
- Network ACLs: defaultAction: Deny, specific IPs allowed
- Access via: Internet from whitelisted IPs
Pros:
- ✅ Simple configuration
- ✅ No DNS complexity
- ✅ Easy to add/remove IPs
Cons:
- ❌ IP addresses change (home, office, VPN)
- ❌ Doesn't protect container-to-Azure traffic
- ❌ Still vulnerable to attacks from allowed IPs
- ❌ Doesn't scale (need to whitelist every developer)
- ❌ No audit trail of who accessed what
Testing Result: ⚠️ Not tested (insufficient security for production)
Option 3: Public Endpoints with VNet Restrictions (Selected)
Configuration:
- Azure AI Services: publicNetworkAccess: Enabled, networkAcls.defaultAction: Deny, allow aci-subnet
- Key Vault: publicNetworkAccess: Enabled, networkAcls.defaultAction: Deny, bypass: AzureServices, IP allowlist
- Application Gateway: NSG with HTTP rule (automated via deploy-apps.sh)
Pros:
- ✅ Simple configuration (no Private DNS needed)
- ✅ Blocks Internet attacks (default deny)
- ✅ Allows legitimate traffic (VNet rules)
- ✅ Saves $3,000/month (no DDoS Protection needed)
- ✅ Works immediately (no DNS propagation)
- ✅ Easy to troubleshoot (standard Azure networking)
- ✅ Azure Services bypass for admin access
- ✅ Can add private endpoints later if needed
Cons:
- ⚠️ Traffic uses public IP (but restricted by ACL)
- ⚠️ Slightly higher attack surface than private endpoint only
- ⚠️ Requires service endpoints on subnets
Testing Result: ✅ Success - All services accessible, Internet blocked, admin access works
Option 4: Hybrid Approach
Configuration:
- Azure AI Services: Public with VNet restrictions (most accessed)
- Key Vault: Private endpoint (least accessed, most sensitive)
- Application Gateway: Public with NSG (user-facing, must be public)
Pros:
- ✅ Best security for Key Vault
- ✅ Best performance for Azure AI
Cons:
- ❌ Most complex (mix of configurations)
- ❌ Key Vault Private DNS still needed
- ❌ Higher operational complexity
- ❌ Harder to understand for new developers
Testing Result: ⚠️ Not tested (complexity outweighs benefits for reference implementation)
Decision Outcome
Chosen Option: Option 3 - Public Endpoints with VNet Restrictions
Rationale
- Proven to Work: Tested successfully in dev environment (ldfdev14-rg)
- Industry Standard: Microsoft Azure Quickstart templates use this approach
- Security Sufficient: Defense-in-depth with multiple layers
- Developer Experience: Works immediately, no complex DNS troubleshooting
- Cost Effective: Saves $3,000+/month vs public + DDoS Protection
- Maintainable: Simple to understand, document, and troubleshoot
- Flexible: Can add private endpoints later if security requirements change
Implementation Details
Azure AI Services Configuration
// infrastructure/bicep/modules/ai-services.bicep
// Excerpt: name, location, kind, and sku are omitted for brevity.
resource aiServices 'Microsoft.CognitiveServices/accounts@2023-05-01' = {
  properties: {
    publicNetworkAccess: 'Enabled'
    networkAcls: {
      defaultAction: 'Deny'
      bypass: 'AzureServices'
      virtualNetworkRules: [for subnetId in allowedSubnetIds: {
        id: subnetId // aci-subnet
        ignoreMissingVnetServiceEndpoint: false
      }]
    }
  }
}
Parameters:
// infrastructure/bicep/environments/dev/substrate.bicepparam
param publicNetworkAccess = 'Enabled' // Changed from 'Disabled'
Security Impact:
- ✅ Blocks all Internet traffic (defaultAction: Deny)
- ✅ Allows only ACI containers (via service endpoint)
- ✅ Azure Services bypass for management operations
- ✅ Saves ~$3,000/month (no DDoS Protection Standard)
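Because the VNet rule above depends on a service endpoint being enabled on each allowed subnet, a pre-deployment check can catch a missing endpoint early. The sketch below is illustrative only: the function name is hypothetical, and it assumes subnet data shaped like the JSON that `az network vnet subnet list` returns (a `serviceEndpoints` list whose entries carry a `service` field).

```python
from __future__ import annotations


def subnets_missing_endpoint(subnets: list[dict], service: str) -> list[str]:
    """Return the names of subnets that do not have `service` enabled."""
    missing = []
    for subnet in subnets:
        # serviceEndpoints may be absent or null when none are configured.
        endpoints = subnet.get("serviceEndpoints") or []
        enabled = {endpoint.get("service") for endpoint in endpoints}
        if service not in enabled:
            missing.append(subnet.get("name", "<unnamed>"))
    return missing


# Illustrative input only; real data would come from the Azure CLI or SDK.
example_subnets = [
    {
        "name": "aci-subnet",
        "serviceEndpoints": [{"service": "Microsoft.CognitiveServices"}],
    },
    {"name": "data-subnet", "serviceEndpoints": None},
]
```

An empty result for the subnets passed in `allowedSubnetIds` suggests the rule can be applied as written; any name in the result points at a subnet that still needs the endpoint.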
Key Vault Configuration
// infrastructure/bicep/modules/security.bicep
// Excerpt: other module params (such as name) are omitted for brevity.
module keyVault 'br/public:avm/res/key-vault/vault:0.12.0' = {
  params: {
    publicNetworkAccess: 'Enabled'
    networkAcls: {
      bypass: 'AzureServices' // Allows Bastion, Portal, deployment scripts
      defaultAction: 'Deny'
      ipRules: [for ip in adminAllowedIpAddresses: {
        value: ip // e.g., '24.164.141.93/32'
      }]
    }
  }
}
Parameters:
// infrastructure/bicep/environments/dev/foundation.bicepparam
param adminAllowedIpAddresses = [
  '24.164.141.93/32' // Admin IP - update per environment
]
Security Impact:
- ✅ Blocks all Internet traffic by default
- ✅ Azure Services bypass enables Bastion/Portal access
- ✅ IP allowlist for admin access (configurable per environment)
- ✅ Easy to add/remove admin IPs via parameter file
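Since adminAllowedIpAddresses is maintained by hand, a small local check can validate entries before they reach the parameter file. The following is a sketch with hypothetical helper names; it uses only Python's standard `ipaddress` module and does not reproduce how Azure itself evaluates the firewall.

```python
import ipaddress


def parse_allowlist(entries):
    """Parse CIDR strings; raises ValueError on a malformed entry."""
    # strict=True rejects entries with host bits set, e.g. '24.164.141.93/24'.
    return [ipaddress.ip_network(entry, strict=True) for entry in entries]


def is_ip_allowed(ip, entries):
    """True if ip falls inside at least one allowlist entry."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in parse_allowlist(entries))


def to_single_host_cidr(ip):
    """Format one address as a single-host CIDR entry (/32 for IPv4)."""
    address = ipaddress.ip_address(ip)
    return f"{address}/{address.max_prefixlen}"
```

For example, `to_single_host_cidr("24.164.141.93")` yields the `'24.164.141.93/32'` form used in the parameter file above, and `is_ip_allowed` can confirm that an admin's current address is covered before a deployment.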
Application Gateway NSG Configuration
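# infrastructure/scripts/deploy-apps.sh (automated)
# RESOURCE_GROUP and NSG_NAME are placeholders for values set elsewhere in the script.
az network nsg rule create \
  --resource-group "$RESOURCE_GROUP" \
  --nsg-name "$NSG_NAME" \
  --name AllowHttpInbound \
  --priority 100 \
  --direction Inbound \
  --source-address-prefixes Internet \
  --destination-port-ranges 80 \
  --access Allow \
  --protocol Tcp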
Automation:
- ✅ Runs automatically after ACI deployment
- ✅ Idempotent (checks if rule exists)
- ✅ Includes security warnings
- ✅ Logs all actions
Security Impact:
- ✅ Enables HTTP access for users (required for web UI)
- ⚠️ HTTP is publicly accessible (port 80)
- ✅ Clear warnings about HTTPS/WAF for production
- ✅ Easy to restrict to specific IPs if needed
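The idempotency noted above (create the rule only if it does not already exist) can be illustrated as follows. This is a sketch of the idea, not the contents of deploy-apps.sh: the function name and its parameters are hypothetical, and the existing rule names are assumed to have been fetched beforehand, for instance with `az network nsg rule list`.

```python
from __future__ import annotations


def plan_http_rule(
    existing_rule_names: set[str],
    resource_group: str,
    nsg_name: str,
    rule_name: str = "AllowHttpInbound",
) -> list[str] | None:
    """Return the az command to run, or None when the rule already exists."""
    if rule_name in existing_rule_names:
        return None  # nothing to do, so re-running the deployment is safe
    return [
        "az", "network", "nsg", "rule", "create",
        "--resource-group", resource_group,
        "--nsg-name", nsg_name,
        "--name", rule_name,
        "--priority", "100",
        "--direction", "Inbound",
        "--source-address-prefixes", "Internet",
        "--destination-port-ranges", "80",
        "--access", "Allow",
        "--protocol", "Tcp",
    ]
```

A caller could pass a non-None result to `subprocess.run(command, check=True)` and log either outcome. Keeping the decision separate from the execution makes the "already exists" path easy to test without touching Azure.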
Configuration Summary
| Resource | Public Access | Default Action | Allowed Sources | Security Level |
|---|---|---|---|---|
| Azure AI Services | Enabled | Deny | aci-subnet only | High |
| Key Vault | Enabled | Deny | Azure Services + Admin IPs | High |
| Application Gateway | N/A (public service) | N/A | Internet (port 80) | Medium (WAF available) |
| Storage Account | Disabled | N/A | Private endpoint only | Highest |
| ACI Containers | Private | N/A | VNet-integrated | High |
Consequences
Positive
- Security Hardening Achieved
  - Azure AI: Internet attacks blocked, only VNet access
  - Key Vault: IP-restricted, Azure Services bypass working
  - Application Gateway: HTTP accessible with security warnings
  - Cost savings: ~$3,000/month (no DDoS Protection needed)
- Developer Experience Improved
  - No manual NSG configuration (automated in deploy-apps.sh)
  - Clear security warnings during deployment
  - Comprehensive troubleshooting documentation
  - Works immediately after deployment
- Operational Simplicity
  - No Private DNS zones to manage
  - No DNS propagation delays
  - Standard Azure networking patterns
  - Easy to troubleshoot with Azure Portal
- Flexibility Maintained
  - Can add private endpoints later if needed
  - Can restrict IPs further if required
  - Can enable WAF for production
  - Easy to customize per environment
Negative
- Traffic Uses Public IPs
  - ACI → Azure AI traffic uses public endpoint (but restricted by ACL)
  - Slightly higher latency than private endpoint
  - Internet egress charges (though minimal)
- HTTP Enabled by Default
  - Application Gateway allows HTTP (port 80)
  - Requires explicit HTTPS configuration for production
  - Users must add SSL certificate manually
- IP Allowlist Management
  - Key Vault requires updating adminAllowedIpAddresses when IPs change
  - Each environment needs separate IP configuration
  - No automatic IP detection
Mitigation Strategies
- For Public IP Concern:
  - Monitor for suspicious access patterns (Application Insights)
  - Consider adding private endpoints for production if needed
  - Document that traffic is still restricted by VNet ACLs
- For HTTP Concern:
  - Clear security warnings in deployment output
  - Production hardening checklist includes HTTPS/WAF
  - Document SSL certificate configuration process
  - Environment-based warnings (dev vs prod)
- For IP Management:
  - Document how to update adminAllowedIpAddresses
  - Consider Azure Policy to enforce IP updates
  - Provide script to automatically detect current IP
  - Use Key Vault for IP list (if rotating frequently)
Related Decisions
- ADR-047: User-Assigned Managed Identity (simplifies RBAC)
- ADR-048: Key Vault for Deployment Outputs (requires Key Vault access)
- ADR-049: Azure Container Instances (VNet integration)
- ADR-050: Azure Bastion for VNet Access (admin access pattern)
- ADR-054: Application Gateway in Substrate Layer (public endpoint)
References
Microsoft Documentation
- Azure AI Services Network Security
- Application Gateway NSG Requirements
- Key Vault Firewall and Virtual Networks
Internal Documentation
- SECURITY.md - Network Security Configuration section
- docs/troubleshooting/common-deployment-errors.md - NSG, Key Vault, Azure AI troubleshooting
- temp/app-gateway-nsg-requirements.md - Detailed NSG analysis
- temp/azure-ai-security-analysis.md - Private endpoint testing results
- temp/manual-changes-2025-10-30.md - Complete audit trail
Testing Evidence
- Dev environment: ldfdev14-rg (2025-10-30)
- Azure AI: VNet restrictions active and working
- Key Vault: IP allowlist working with Bastion access
- Application Gateway: NSG automation tested
- Application: http://ldfdev14-dev.eastus2.cloudapp.azure.com/ (operational)
Implementation Checklist
Foundation Layer
- Add adminAllowedIpAddresses parameter to foundation.bicep
- Update security.bicep with Key Vault networkAcls
- Add parameter to foundation.bicepparam with default IP
- Document IP update process
Substrate Layer
- Change publicNetworkAccess from 'Disabled' to 'Enabled' in substrate.bicepparam
- Verify ai-services.bicep has networkAcls configuration
- Document VNet restriction behavior
Apps Layer
- Add NSG automation to deploy-apps.sh
- Include security warnings in deployment output
- Make idempotent (check if rule exists)
- Log all actions to deployment log
Documentation
- Create ADR-058 (this document)
- Update SECURITY.md with network security section
- Create troubleshooting guide for common errors
- Document production hardening steps
Testing
- Test in dev environment (ldfdev14-rg)
- Verify Azure AI accessible from containers only
- Verify Key Vault accessible via Portal/Bastion
- Verify Application Gateway UI accessible
- Test in production environment (pending)
Production Deployment Guidance
Before Production Deployment
- Update IP Allowlist
- Enable HTTPS
  - Obtain SSL certificate
  - Configure on Application Gateway
  - Redirect HTTP → HTTPS
- Enable WAF
- Review Security Checklist
  - See SECURITY.md for complete checklist
  - Verify all NSG rules
  - Test failover scenarios
  - Enable monitoring alerts
Production-Specific Considerations
- Consider Private Endpoints if:
  - Strict compliance requirements (PCI-DSS, HIPAA)
  - Zero Internet exposure mandated
  - Budget allows for operational complexity
- Restrict Application Gateway if:
  - Only specific IP ranges should access
  - Internal application only
  - Using Azure Front Door for public access
- Monitor and Alert on:
  - Failed authentication attempts (Key Vault)
  - Unusual traffic patterns (Azure AI)
  - NSG rule changes (auditing)
  - DDoS attempts (if public)
Review Schedule
This ADR should be reviewed:
- ✅ Immediately: Before next production deployment
- 📅 Q1 2026: After 3 months of production operation
- 📅 Q3 2026: When Azure AI Foundry GA releases
- 📅 Annually: As part of security review cycle
Next Review Date: 2026-01-30
Decision Made By: System Architecture Team
Approved By: Security Engineering
Implementation Date: 2025-10-30
Status: ✅ Implemented in Dev, Ready for Production