The Setup
A client running an OpenCart store and a WordPress site reported intermittent Cloudflare 520 errors. Initial reports mentioned issues with custom security software they were running (a proof-of-work challenge being injected via auto_prepend_file), which we helped them disable. The 520s continued.
The Pattern That Made No Sense
Over the following days, the client did remarkably thorough testing on their end and identified a pattern none of us could explain:
- 520 errors only occurred when traffic routed through Cloudflare’s LAX (Los Angeles) PoP
- Other Cloudflare PoPs worked perfectly
- DNS-only mode (no proxy) worked perfectly
- Direct origin access via hosts file worked perfectly
- A clone of the site on a different host, behind the same Cloudflare configuration, worked perfectly
By every measure available to them, the issue was specific to the combination of their site + Cloudflare LAX + our infrastructure.
Why We Were Skeptical (And Why That Was Reasonable)
Out of ~37,000 hosted domains, roughly 75% of them behind Cloudflare, this was the only such report. A search of ~154,000 historical tickets turned up zero comparable cases. We had no mechanism that would treat traffic from one Cloudflare PoP differently from another. All ~1.5 million Cloudflare IPs were whitelisted in our firewall, and a daily script verified that the whitelist stayed accurate.
We tested ~3,000 sites across our network (specifically, sites actively using Cloudflare proxying) through a Los Angeles proxy. Every single one returned 200 OK.
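For a sense of what that sweep looked like, here is a minimal sketch; the proxy address and the sites.txt hostname list are illustrative placeholders, not our actual tooling:

```bash
#!/bin/bash
# Minimal sketch of the network-wide sweep: request each
# Cloudflare-proxied hostname through a Los Angeles proxy and
# report anything that doesn't come back 200.
# PROXY and sites.txt are illustrative placeholders.
PROXY="socks5h://la-proxy.example.com:1080"

while read -r host; do
    code=$(curl -s -o /dev/null -w '%{http_code}' \
        --proxy "$PROXY" --max-time 15 "https://$host/")
    [ "$code" != "200" ] && echo "$host -> $code"
done < sites.txt
```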
Server load was healthy: 80% CPU idle, sub-1% I/O. Nothing in our metrics suggested a problem.
Given all that, the statistical prior was overwhelmingly that the issue was something specific to the client’s configuration. We pushed for them to either pause Cloudflare or pull the edge-to-origin diagnostics from Cloudflare’s dashboard. Pausing mattered to us not just to see what the origin returned (we could do that ourselves with a hosts-file bypass, and did), but because some of the diagnostic tools we wanted to deploy required Cloudflare to be out of the path temporarily.
A Bit of Background: Why We Block Most Cloudflare Traffic
Before going further, some relevant context. For many years now, we’ve maintained a script that explicitly blocks Cloudflare’s entire IP range at our firewall on every port except 80 and 443.
That sounds counterintuitive on the surface, but the reasoning is straightforward. Cloudflare is a proxy. Anyone can put any service behind it. When attackers route attacks through Cloudflare against a server, the source IP we see is a Cloudflare IP. When fail2ban, brute-force protection, or other automated tools react to those attacks by banning the source IP, they end up banning Cloudflare IPs that thousands of legitimate sites also use. The result: legitimate traffic to other customers gets caught in collateral damage.
By restricting Cloudflare’s IPs to only the ports they should ever legitimately use (80 and 443 for web traffic), we ensure that attempts to brute-force SSH, FTP, cPanel, mail authentication, etc. through Cloudflare simply can’t connect. No connection means no failed login means no automated ban means no collateral blocks. It’s a defensive posture that’s served us well for a long time.
The relevant detail for this story is that this script wrote tcp|in|d=PORT|s=CLOUDFLARE_RANGE deny rules into CSF for every admin/mail port, but it had no corresponding explicit allow rules for ports 80 and 443. Web traffic was simply “not denied” rather than “explicitly allowed.” That distinction matters, and it’s where the failure lived.
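To make the gap concrete, here is roughly what the script produced in /etc/csf/csf.deny; the exact port list shown is illustrative, and the comments are ours:

```bash
# Illustrative csf.deny entries written by the port-blocking script
# (one entry per admin/mail port, per Cloudflare CIDR):
tcp|in|d=22|s=172.64.0.0/13    # SSH
tcp|in|d=2082|s=172.64.0.0/13  # cPanel
tcp|in|d=2083|s=172.64.0.0/13  # cPanel (SSL)
tcp|in|d=587|s=172.64.0.0/13   # mail submission
# Note what is absent: no tcp|in|d=80|... or tcp|in|d=443|... allow
# entries existed anywhere, so web traffic was only implicitly permitted.
```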
The Breakthrough: Reproducing It Ourselves
Eventually, after the client provided extremely specific reproduction steps including a URL and the LAX colo confirmation method, we got onto a Los Angeles proxy and reproduced it ourselves. A specific URL would 520 through LAX consistently, but load fine through ORD (Chicago). Even brand-new test files exhibited the failure.
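For anyone wanting to run the same check: confirming which colo is serving you is straightforward. This is the standard Cloudflare method, though not necessarily the exact steps the client gave us, and example.com is a placeholder:

```bash
# The cf-ray response header ends in the serving colo's code
# (e.g. "cf-ray: 8c1f2a3b4c5d6e7f-LAX").
curl -sI https://example.com/ | grep -i '^cf-ray'

# Or ask the edge directly; the trace output includes a colo= line.
curl -s https://example.com/cdn-cgi/trace | grep colo
```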
Critically, we confirmed: the failed requests never reached our web server. There were no log entries on our side at all. The requests were dying somewhere between Cloudflare LAX and us, but we didn’t yet know where.
The Diagnostic Pivot: Development Mode
We turned on Cloudflare’s Development Mode, which bypasses cache and a handful of optimization features but leaves the proxy, WAF, and security rules fully active. The issue completely vanished. Hundreds of test loads, no 520s.
That isolated the problem to Cloudflare’s caching/optimization layer, or so we thought. We were leaning toward a Cloudflare-internal issue (maybe a poisoned cache entry at LAX, maybe a bad machine in the PoP).
The Real Breakthrough: A Single IP Address
The client opened a ticket with Cloudflare’s support. They responded with a description of the problem along with a curl trace showing exactly what was happening:
“I have been able to reliably reproduce this issue from LAX. All requests from LAX are not receiving a Server Hello during the TLS Handshake from your origin server”
* Trying (client-ip-address):443...
* Name '172.70.206.156' family 2 resolved to '172.70.206.156' family 2
* Local port: 0
* ALPN: curl offers h2,http/1.1
} [5 bytes data]
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
The client passed this back to us, and the trace contained the piece of information we’d been missing the entire time: the specific Cloudflare IP that was failing to connect to our origin (172.70.206.156).
That single IP address told us exactly where to look. With it, we could query our firewall and see in seconds what was happening. Without it, every diagnostic question we asked was guess-and-check.
The catch is that this kind of detail isn’t readily exposed in Cloudflare’s user-facing dashboard, even though it would be enormously helpful for diagnosing origin-side issues. To get it, the client had to open a support ticket, and ticket access for technical issues is not available on Cloudflare’s free plan. The client ended up upgrading to Pro to get this single piece of information relayed back to us.
We’re not faulting Cloudflare for having paid tiers; they’re a business, and support costs money. But for incidents like this one, where only the CDN has visibility into what’s actually failing at the edge, the absence of an easily accessible “here’s the IP we couldn’t reach” indicator in the dashboard adds significant friction. If the client had been able to find that IP themselves on day one, this entire ticket would have been resolved in minutes instead of days.
The Actual Root Cause
Running csf -g 172.70.206.156 against the failing Cloudflare IP revealed three simultaneous truths:
IPSET: Set:i360.ipv4.remote_proxy_static Match:172.70.206.156
IPSET: Set:i360.ipv4.whitelist.static Match:172.70.206.156
IPSET: Set:bl_EMERGINGTHREATS Match:172.70.206.156 Setting:EMERGINGTHREATS
The Cloudflare IP was simultaneously:
- Recognized by Imunify360 as a Cloudflare proxy ✓
- Whitelisted in Imunify360 ✓
- Also matched by the Emerging Threats blocklist in CSF
Here’s the critical realization that took us longer than it should have to reach: the Imunify360 whitelist only applies to Imunify360.
Both whitelists and blocklists were technically loaded as ipsets in the same kernel firewall. But CSF and Imunify360 are separate pieces of software that maintain their own chains, their own rules, and their own logic. Imunify360’s whitelist tells Imunify360’s chain to bypass Imunify360’s blocklists. It says nothing to CSF. And CSF’s chains, including its blocklist matching, run before Imunify360’s chains in the packet processing path.
So the packet flow was:
Packet arrives from Cloudflare LAX (172.70.206.156)
↓
CSF chain: bl_EMERGINGTHREATS matches → DROP
↓
[never reaches Imunify360, never reaches the web server]
The Imunify360 whitelist was working perfectly. It just wasn’t being asked.
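If you want to see this ordering on a box of your own, the rule listing makes it visible; chain layout varies by version and integration mode, so treat the comments here as illustrative rather than exact:

```bash
# Rules in INPUT are evaluated top to bottom. On our stack, the jump
# into CSF's chains (LOCALINPUT and friends, where blocklist matching
# happens) sits above anything Imunify360 evaluates, so a CSF
# blocklist DROP fires before the Imunify360 whitelist is consulted.
iptables -L INPUT -n --line-numbers | head -20
```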
Why only LAX? Cloudflare’s LAX PoP was issuing requests from IPs within 172.64.0.0/13 (a range that includes the failing 172.70.206.156), and some of those specific IPs had ended up on the Emerging Threats compromised-IPs list. Other PoPs were issuing requests from IPs that hadn’t been listed. The blocklist wasn’t blocking all of Cloudflare; it was blocking specific Cloudflare IPs that ET had flagged as compromised, presumably because someone, somewhere, had run something malicious from behind that Cloudflare IP.
Why only this one customer? That’s still partially a mystery. When we tested through Cloudflare LAX from a Los Angeles proxy, our requests came from different Cloudflare IPs that weren’t on the blocklist. Something about the routing or session affinity for this particular zone was consistently sending traffic through a Cloudflare IP that was listed. Other zones at LAX got different IPs and worked fine.
Why Development Mode “Fixed” It
This is the part that surprised even us in retrospect. Development Mode disables caching, so every request has to fetch fresh content from origin. If anything, that should mean more origin connections, not fewer. Why would it prevent the failure?
Best theory: Cloudflare’s cache fill traffic and Cloudflare’s edge-serving traffic come from different IP pools. With cache enabled, certain requests were being routed through a specific Cloudflare cache-tier IP that was on the blocklist. With cache disabled, the requests went through Cloudflare’s standard edge IPs, which weren’t on the blocklist. The “fix” was actually accidentally avoiding the bad IP path, not addressing the real cause.
The Resolution
The fix turned out to be remarkably small once we understood the architecture. We updated the same script that’s been blocking Cloudflare on admin ports for years, and added explicit tcp|in|d=80|s=CLOUDFLARE_RANGE and tcp|in|d=443|s=CLOUDFLARE_RANGE entries to csf.allow for every Cloudflare CIDR (IPv4 and IPv6) plus the Railgun IP.
CSF processes csf.allow entries before csf.deny and before any blocklist match. With the explicit allow in place, Cloudflare web traffic hits the ACCEPT in CSF’s ALLOWIN chain immediately and short-circuits before EMERGINGTHREATS gets a chance to evaluate.
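A simplified sketch of what the updated script now appends, using Cloudflare’s published IPv4 list (the production version also covers the IPv6 ranges and the Railgun IP, and deduplicates on re-run):

```bash
#!/bin/bash
# Simplified sketch: write explicit web-port allows for every
# published Cloudflare IPv4 range into csf.allow, then reload CSF.
for cidr in $(curl -s https://www.cloudflare.com/ips-v4); do
    echo "tcp|in|d=80|s=$cidr # Cloudflare web traffic"  >> /etc/csf/csf.allow
    echo "tcp|in|d=443|s=$cidr # Cloudflare web traffic" >> /etc/csf/csf.allow
done
csf -r   # reload firewall rules so the new allows take effect
```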
After the change, csf -g 172.70.206.156 looks like this:
filter DENYIN tcp ... 172.64.0.0/13 ... tcp dpt:2082 ← cPanel ports DROPPED ✓
filter DENYIN tcp ... 172.64.0.0/13 ... tcp dpt:2083 ← cPanel ports DROPPED ✓
... (other admin/mail ports) ...
filter DENYIN tcp ... 172.64.0.0/13 ... tcp dpt:587 ← mail submission DROPPED ✓
filter ALLOWIN tcp ... 172.64.0.0/13 ... tcp dpt:443 ← HTTPS ACCEPTED ✓
filter ALLOWIN tcp ... 172.64.0.0/13 ... tcp dpt:80 ← HTTP ACCEPTED ✓
IPSET: Set:i360.ipv4.remote_proxy_static Match:172.70.206.156
IPSET: Set:bl_EMERGINGTHREATS Match:172.70.206.156 Setting:EMERGINGTHREATS
The blocklist match is still there, the IP is still considered “bad” by Emerging Threats, but it cannot drop web traffic because the explicit allow on ports 80/443 wins on rule order. Admin ports remain blocked. The Emerging Threats list is back to protecting against everything else. The 520s stopped immediately for the client.
The full lesson, the way we summarized it back to the client and to ourselves: the Imunify360 whitelist only applies to Imunify360. If we want a Cloudflare IP to survive every layer of our defenses, we need to allow it explicitly at every layer that could drop it.
What We Got Wrong
Several things, in order of importance:
- A whitelist in one firewall doesn’t apply to another firewall. This sounds obvious in retrospect. It wasn’t obvious in the moment. We had Cloudflare comprehensively whitelisted in Imunify360 and verified it daily, and we trusted that whitelist as if it were a global truth. It was always only a local truth, governing Imunify360’s behavior. CSF, sitting in front, had no awareness of it. The two layers were independent and we’d internalized them as cooperative.
- Defensive logic should be explicit, not implicit. Our long-standing script blocked Cloudflare on admin ports but didn’t explicitly allow Cloudflare on web ports. It just left web ports unblocked. For years that was indistinguishable from “allowed,” because no other rule existed to drop those packets. The day a third-party blocklist update gave one of those rules a reason to fire, the implicit allow disappeared. Explicit allows survive the addition of new deny rules; implicit ones don’t.
- “Statistical priors say it’s the customer” isn’t a debugging strategy. It’s a reasonable initial hypothesis, but it cannot be the conclusion when the customer has done genuinely good controlled testing. The client’s clone-on-different-host test was strong evidence pointing at our infrastructure, and we underweighted it because the rest of the evidence pattern didn’t fit anything we’d seen before. They were right; we took too long to accept that.
- The single most useful piece of data was the hardest to get. Once we had the failing Cloudflare IP, the answer fell out of a single firewall command. Everything before that was working blind. If that information had been visible to the client in their Cloudflare dashboard from the beginning, the ticket would have closed in an hour.
What We’re Changing
- Updated our long-standing Cloudflare port-blocking script to also write explicit allow rules for ports 80 and 443 across all Cloudflare IPv4 ranges, IPv6 ranges, and the Railgun IP. Future blocklist updates can no longer drop legitimate Cloudflare web traffic, regardless of how the IP got listed.
- Updated our daily Cloudflare monitor to check not just whether the Imunify360 whitelist exists but whether Cloudflare IPs also appear in any CSF-loaded blocklist (a rough sketch follows this list). This is computationally expensive (1.5M IPs × multiple blocklists) but worth it: exactly this scenario is what we were silently vulnerable to.
- Internal note: when a customer presents controlled testing that points at a specific environmental variable (PoP, region, IP range), take “this is a false positive” off the table earlier. Reproduce the conditions ourselves before falling back on statistical priors.
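Here is a rough sketch of that new overlap check, assuming grepcidr is available; the set names follow the bl_ prefix seen earlier in this post, and the rest is illustrative:

```bash
#!/bin/bash
# Rough sketch of the daily overlap check: flag any CSF-loaded
# blocklist (bl_*) entries that fall inside a published Cloudflare
# range. Simplification: this only catches entries whose address
# text lies inside a Cloudflare CIDR, which covers this incident.
curl -s https://www.cloudflare.com/ips-v4 > /tmp/cf-ranges.txt

for set in $(ipset list -n | grep '^bl_'); do
    # Dump only the set's members, skipping the header block.
    ipset list "$set" | sed -n '/^Members:/,$p' | tail -n +2 \
        | grepcidr -f /tmp/cf-ranges.txt \
        && echo "WARNING: $set contains Cloudflare IPs (listed above)"
done
```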
Takeaways For Us Moving Forward
When the CDN is the only one with visibility, escalate to them earlier. A Cloudflare 520 means the failure is between Cloudflare and us, and the customer’s HAR file from their browser will never show that. Next time we see this pattern, we’ll ask the customer to engage Cloudflare support on day one rather than after we’ve exhausted our own ideas, because the failing edge IP is the single piece of data that unblocks everything.
A whitelist only applies to the layer that owns it. Layered firewalls each have their own allowlists, blocklists, and evaluation order. Telling one layer “trust this IP” doesn’t tell any other layer the same thing. Trusted infrastructure needs to be allowed explicitly at every layer that could touch it.
Implicit allows are fragile. “Not denied yet” is a different thing from “explicitly allowed.” It looks the same in normal operation and behaves differently the moment a new deny source enters the picture. Make trust decisions explicit so they survive future rule additions.
Threat intel feeds are statistical instruments, not perfect ones. A CDN’s IP getting on a “compromised IPs” list isn’t impossible (anyone can be behind a CDN) and the cost of a false positive is invisible breakage for one customer somewhere on our network. We should periodically audit any blocklist we subscribe to for overlap with infrastructure we explicitly trust.
A customer reproducing an issue we can’t is signal, not noise. When a customer presents controlled testing that isolates an environmental variable (PoP, region, IP range, etc.), we need to take “this is a false positive on their end” off the table sooner. Reproduce the conditions ourselves before falling back on statistical priors about how rare the issue is across our fleet.