The Scenario
CloudTech Inc. operates a distributed workforce with 3,000 remote employees across the United States. All remote access is mediated through two Palo Alto Networks GlobalProtect VPN concentrators (active-passive redundancy configuration) serving as the single point of access to internal resources.
On February 13, 2025, Palo Alto Networks released a critical security advisory:
CVE-2025-4847: Remote Code Execution in GlobalProtect
- CVSS Score: 9.8 (Critical)
- Vulnerability: An unauthenticated attacker on the internet can send specially crafted packets to the VPN portal, triggering a buffer overflow that allows arbitrary code execution with the privileges of the VPN service
- Impact: Complete compromise of the VPN concentrator, ability to intercept all remote access traffic, ability to create backdoor accounts
- Active Exploitation: The advisory noted that proof-of-concept exploits are circulating on security forums and the vulnerability is being actively exploited in the wild
- Patch Timeline: Available immediately
Chief Information Security Officer Jennifer Lee’s phone started ringing within the hour. Her peers at other companies were calling with the same concern: their entire remote workforce depends on patching this vulnerability, but patching means downtime.
The patch process required:
- Firmware upgrade on both VPN concentrators
- Server restart — the patch couldn’t be applied without rebooting the VPN service
- Estimated downtime: 45 minutes per concentrator to reboot and verify functionality
- With active-passive failover: When the primary fails over to the secondary, all remote workers can continue accessing resources (no net downtime). But then the secondary must be patched and rebooted (another 45 minutes), and the failover must be tested
The challenge: the secondary VPN concentrator wasn’t fully redundant.
An audit discovered that the active-passive configuration was documented, but in practice:
- The primary VPN concentrator carried 95% of the load (3,000 users), processed by 8 CPU cores running at 87% utilization
- The secondary VPN concentrator was configured to pass traffic but had never been fully tested with 3,000 simultaneous users
- Estimated capacity of secondary: 2,000-2,500 users maximum
If the primary went offline, the secondary would accept maybe 2,000-2,500 of the 3,000 remote workers. The remaining 500-1,000 workers would be locked out of corporate resources.
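The lockout estimate can be sketched directly from the audit numbers. A minimal sketch (the linear CPU-per-user scaling is our simplifying assumption, not something the audit measured):

```python
# Back-of-the-envelope failover capacity check using the audit figures:
# 3,000 users on the primary at 87% CPU; secondary rated for roughly
# 2,000-2,500 simultaneous users. All numbers are from the scenario.

PRIMARY_USERS = 3000
PRIMARY_CPU = 0.87                  # utilization under normal load
SECONDARY_CAPACITY = (2000, 2500)   # estimated min/max simultaneous users

# CPU cost per user on the primary (assumes roughly linear scaling)
cpu_per_user = PRIMARY_CPU / PRIMARY_USERS

def failover_shortfall(total_users: int, secondary_max: int) -> int:
    """Users who cannot reconnect if the primary goes offline."""
    return max(0, total_users - secondary_max)

low = failover_shortfall(PRIMARY_USERS, SECONDARY_CAPACITY[1])   # best case
high = failover_shortfall(PRIMARY_USERS, SECONDARY_CAPACITY[0])  # worst case
print(f"Locked-out workers during failover: {low}-{high}")
```

With these inputs the shortfall comes out to 500-1,000 workers, matching the audit estimate.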
Jennifer convened an emergency call with the VP of Infrastructure, Chief Technology Officer, and General Counsel. The question was stark: What's worse: four hours of VPN downtime while we patch both concentrators in the same window, or running an unpatched critical vulnerability that's actively being exploited?
The General Counsel’s input was immediate: “If we get compromised via an unpatched, known critical vulnerability, our liability is enormous. We knowingly allowed a weakness to exist. Patch immediately.”
Jennifer made the call: patch both VPN concentrators back-to-back on Saturday evening, with a four-hour maintenance window from 6 PM to 10 PM.
The patching plan:
Phase 1: Advance Notice (Thursday)
- Email all 3,000 remote employees: VPN will be down 6-10 PM Saturday for critical security update
- Instructions: download necessary files, prepare laptops to work without VPN access during the window
Phase 2: Pre-Patch Testing (Friday)
- Test the patch on a lab VPN concentrator identical to production
- Verify that the firmware upgrade and reboot process works as expected
- Estimate actual downtime (might be faster than predicted)
Phase 3: Primary VPN Patch (6 PM Saturday)
- Begin upgrade on primary VPN concentrator
- Monitor that all traffic successfully fails over to secondary
- Remote workers in active sessions should experience a brief connection drop, then automatic reconnection to the secondary
- Expected duration: 45 minutes
Phase 4: Secondary VPN Patch (7 PM Saturday)
- Begin upgrade on secondary VPN concentrator (if primary is recovering well)
- ALL remote traffic is now on the primary
- Connection load should be lighter if some workers treated the maintenance window as a cue to stay offline
Phase 5: Verification and Failback (8-10 PM Saturday)
- Verify both concentrators are healthy and patched
- Test failover and failback procedures
- Ensure all remote workers can reconnect
The Execution
The patch execution on Saturday evening went mostly as planned, with one significant issue:
At 6:30 PM, the primary VPN concentrator began the firmware upgrade. The upgrade process was slower than expected: 23 minutes instead of the projected 10. The concentrator rebooted at 6:53 PM.
Remote workers experienced a VPN disconnection. The GlobalProtect client (installed on all laptops) is configured to automatically reconnect, so most workers’ connections automatically failed over to the secondary concentrator.
But there was a surge. 2,847 simultaneous VPN connections hit the secondary concentrator within 60 seconds. The secondary’s CPU spiked to 99% utilization. Connections were being throttled. Some workers couldn’t reconnect. Help desk tickets started coming in.
Jennifer’s team discovered the problem: the automatic failover didn’t distribute load gracefully. All 3,000 workers tried to reconnect to the secondary simultaneously. The secondary simply couldn’t handle it.
By 7:00 PM, the primary had completed its reboot and was starting up the VPN service (another 8 minutes). By 7:08 PM, the primary was accepting connections again, and load naturally balanced back across both concentrators as workers reconnected.
At 7:15 PM, they began the secondary patch. This time, it was less disruptive—the primary was carrying full load (95% utilization but still stable), and only 200-300 workers who had been on the secondary experienced disconnection.
The secondary came back online at 8:03 PM. Both concentrators were now patched. By 8:30 PM, all systems were fully operational and stable.
The final tally: 2 hours and 23 minutes of partial impact (some users couldn't reconnect through the secondary). Shorter than the planned four-hour window, but longer than hoped.
Post-Incident Analysis
Jennifer’s team identified several design weaknesses:
Failure 1: No True Always-On VPN Resilience
- The secondary concentrator was supposed to be hot-standby but wasn’t truly redundant
- It could handle maybe 2,500 users but had never been tested at that load
- The surge of simultaneous reconnections overwhelmed it
Failure 2: No Load Balancing During Failover
- The VPN concentrator’s failover mechanism is binary: primary is up or primary is down
- There’s no gradual load shedding or connection prioritization
- All users tried to reconnect simultaneously, causing a thundering herd problem
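The standard mitigation for a thundering herd is jittered exponential backoff on the client side. A minimal sketch, where `connect()` is a hypothetical stand-in for the VPN client's connection attempt (this is not GlobalProtect's actual reconnect logic):

```python
# Client-side reconnect with exponential backoff and "full jitter":
# each retry waits a random amount up to an exponentially growing cap,
# so thousands of clients that drop at the same instant do not all
# retry in the same second.

import random
import time

def reconnect_with_backoff(connect, base=1.0, cap=60.0, max_attempts=8):
    """Retry connect() with exponentially growing, fully jittered delays."""
    for attempt in range(max_attempts):
        if connect():
            return True
        # Full jitter: sleep a random amount in [0, min(cap, base * 2^attempt)]
        delay = random.uniform(0, min(cap, base * (2 ** attempt)))
        time.sleep(delay)
    return False
```

The point of full jitter is that two clients failing at the same moment almost never retry at the same moment, flattening the reconnection spike into a ramp the secondary can absorb.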
Failure 3: Patch Timing Wasn’t Optimized
- Patching Saturday evening meant weekend downtime (good for minimizing business impact), but also meant fewer staff were available to handle problems
- A weekday patch with advance notice and staff on-site might have been easier to manage
Failure 4: No Stateful Connection Failover
- When the primary went down, connections in-flight were lost—workers had to manually reconnect
- Some VPN products support stateful connection failover (the secondary picks up in-flight TCP connections), but Palo Alto GlobalProtect doesn’t
Remediation
The team began planning upgrades:
1. True Secondary Redundancy
- Upgrade the secondary VPN concentrator to identical specifications as the primary
- Load-balance traffic across both using an external load balancer (DNS-based or appliance-based)
- Test that both concentrators can handle 3,000 simultaneous users
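The selection logic an external load balancer (or a health-aware DNS responder) would apply can be sketched as least-connections routing. The node names and the health flag are illustrative, not configuration from this deployment:

```python
# Least-connections selection across healthy VPN concentrators, the
# routing decision an external load balancer would make per new session.

from dataclasses import dataclass

@dataclass
class Concentrator:
    name: str
    active_connections: int
    healthy: bool = True

def pick_concentrator(pool):
    """Route a new VPN session to the healthy node with the fewest sessions."""
    candidates = [c for c in pool if c.healthy]
    if not candidates:
        raise RuntimeError("no healthy VPN concentrators available")
    return min(candidates, key=lambda c: c.active_connections)

pool = [Concentrator("vpn-primary", 1480), Concentrator("vpn-secondary", 1320)]
print(pick_concentrator(pool).name)  # vpn-secondary: fewer active sessions
```

With both nodes sized identically and traffic split this way, either one failing leaves the survivor with a known, pre-tested load rather than an untested surge.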
2. Gradual Failover
- Implement connection pooling and graceful degradation
- If a VPN concentrator reaches 90% capacity, start rejecting new connections (rather than allowing them to queue)
- Implement circuit breaker patterns for VPN client reconnection logic
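The circuit-breaker idea in the last bullet can be sketched as a small client-side state machine. The thresholds and cool-down are illustrative values, not GlobalProtect settings:

```python
# Circuit breaker for VPN client reconnection: after repeated failures
# the breaker "opens" and the client stops hammering the concentrator
# until a cool-down elapses, then permits a single probe attempt.

import time

class ReconnectCircuitBreaker:
    def __init__(self, failure_threshold=5, cooldown_seconds=30.0):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown_seconds
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def allow_attempt(self, now=None):
        now = time.monotonic() if now is None else now
        if self.opened_at is None:
            return True
        # Half-open: permit one probe attempt after the cool-down
        return now - self.opened_at >= self.cooldown

    def record_failure(self, now=None):
        now = time.monotonic() if now is None else now
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = now

    def record_success(self):
        self.failures = 0
        self.opened_at = None
```

Paired with server-side admission control at the 90% threshold, this keeps a saturated concentrator from being retried into the ground.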
3. Patch Strategy Revision
- For future critical patches, execute phased rollout: patch secondary first, test, then patch primary
- This requires true active-active load balancing, not active-passive failover
4. VPN Concentrator Monitoring
- Implement real-time monitoring of CPU, memory, connection count, and error rates
- Alert if secondary capacity utilization exceeds 80% (signal that it can’t handle primary’s load)
- Track connection churn during failover to measure “reconnection storms”
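A minimal sketch of those alert rules follows. Metric names, thresholds, and the sample window are assumptions; a real deployment would pull these readings from SNMP or the firewall's management API:

```python
# Evaluate the two alert conditions described above for one sample
# window: secondary CPU past 80%, and connection churn indicating a
# reconnection storm.

SECONDARY_UTIL_THRESHOLD = 0.80
CHURN_STORM_THRESHOLD = 500  # new connections per minute

def check_alerts(secondary_cpu: float, new_connections_last_minute: int):
    """Return a list of alert strings for the current sample window."""
    alerts = []
    if secondary_cpu > SECONDARY_UTIL_THRESHOLD:
        alerts.append(f"secondary at {secondary_cpu:.0%}: cannot absorb failover")
    if new_connections_last_minute > CHURN_STORM_THRESHOLD:
        alerts.append(f"reconnection storm: {new_connections_last_minute}/min")
    return alerts

# The values observed during the failover surge would trip both alerts:
print(check_alerts(0.99, 2847))
```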
What Went Right
- Urgent patching decision was made quickly: Once the vulnerability and active exploitation were understood, there was no hesitation about applying the patch.
- Advance notice was given: Employees received 48 hours’ notice, allowing them to plan and avoid critical work during the window.
- Lab testing was performed: Pre-patching testing on a lab instance caught that the upgrade process was slower than expected.
- Both concentrators were eventually patched: Despite the partial impact, the critical vulnerability was remediated.
- No actual compromise occurred: Unlike many organizations, CloudTech patched before being exploited.
What Could Go Wrong
- If the patch had been delayed: Active exploitation meant attackers could have compromised the VPN concentrator within days, gaining access to all remote worker traffic and internal resources.
- If both concentrators had gone offline simultaneously: If the secondary patch had been attempted while the primary was still rebooting, there would have been zero VPN access for everyone.
- If the secondary truly couldn’t handle the failover load: 1,000 workers completely locked out would have escalated to a company-wide incident.
- If no patching plan existed: Waiting for the “perfect time to patch” while a critical vulnerability was active would have been irresponsible.
Key Takeaways
- Always-On VPN requires true redundancy testing: An active-passive system documented on paper is not the same as an active-passive system tested under full load. Load-test the secondary regularly.
- Critical vulnerability patches can’t wait: Active exploitation changes the calculus entirely. Downtime to patch is better than downtime from compromise.
- Patch windows should account for failover surge: Plan for all users attempting to reconnect simultaneously. Load-test the failover path, not just the primary.
- VPN concentrators need capacity headroom: Running at 87% CPU utilization on the primary concentrator means there’s no room for surge traffic or failover load. Target 60-70% for headroom.
- Phased patching is better than a single-window blitz: If possible, patch the secondary first, test failover, then patch the primary. This requires true load balancing, but it's worth it.
- Encrypted connection state can’t be seamlessly migrated: GlobalProtect doesn’t support stateful failover. Plan for connection drops and automatic reconnection as a standard behavior.
- Weekend patching lowers user impact but needs support staffing: Patching at 6 PM on a Saturday meant fewer employees on the system, but also fewer IT staff available. A weekday morning patch with full staffing might have been easier to support.
Related Cases
- case-encryption — Encryption in transit for VPN connections and their protection mechanisms
- case-network-security-architecture — VPN’s role in the broader remote access architecture
- case-zero-trust — Modern alternatives to VPN-based remote access using zero-trust models