
I use both services heavily at work. The networking in GCP is terrible. We experience minor service degradation multiple times a month due to networking issues in GCP (elevated latency, errors talking to the DB, etc.). We've even had cases where there was packet corruption at the bare metal layer, so we ended up storing a bunch of garbage data in our caches / databases. The networking is also less understandable on GCP than on AWS. For instance, the external HTTP load balancer uses BGP and magic, so you aren't in control of which zones your LB is deployed to. Some zones don't have any LBs deployed, so there is a constant cross-zone latency hit when using those zones. It took us months to discover this, after consistent denials from Google Cloud support that anything was wrong with the specific zone our service was running in.
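
A rough way to surface that kind of cross-zone latency hit is to time repeated TCP connects to the LB from an instance in each zone and compare the distributions; a minimal sketch below, with a hypothetical hostname and port:

    import socket
    import statistics
    import time

    def connect_latency_ms(host: str, port: int, samples: int = 20) -> list[float]:
        # Time a TCP handshake to the LB; run this from instances in different zones.
        timings = []
        for _ in range(samples):
            start = time.perf_counter()
            with socket.create_connection((host, port), timeout=2):
                pass
            timings.append((time.perf_counter() - start) * 1000)
        return timings

    if __name__ == "__main__":
        t = connect_latency_ms("lb.example.internal", 443)  # hypothetical LB endpoint
        print(f"p50={statistics.median(t):.1f}ms p95={sorted(t)[int(0.95 * len(t))]:.1f}ms")

If instances in one zone consistently see higher connect times than instances in another, that points at the cross-zone hop.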

AWS, on the other hand, has given us very few problems. When we do have an issue with an AWS service, we're able to quickly get an engineer on the phone who, thus far, has been able to explain exactly what our issue is and how to fix it.




> We've even had cases where there was packet corruption at the bare metal layer,

I'd love to know how this happens in the modern world. I've seen it myself only once (not GCP, but on our own network with Cisco equipment).

Is something in the chain not checking the packet's CRC?
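
For context, Ethernet frames carry a 32-bit FCS, which is a plain CRC-32, so a flipped bit in transit should normally cause the receiver to drop the frame. A tiny sketch of why, using Python's zlib.crc32 (same CRC-32 polynomial) and made-up frame bytes:

    import zlib

    payload = b"some application data on the wire"   # made-up frame payload
    fcs = zlib.crc32(payload)                        # what the sender appends as the FCS

    corrupted = bytearray(payload)
    corrupted[5] ^= 0x01                             # a single bit flipped in transit

    print(hex(fcs))
    print(hex(zlib.crc32(bytes(corrupted))))         # mismatch -> receiver discards the frame

So for corruption to reach the application, it typically has to happen somewhere the CRC doesn't cover, e.g. inside a device after the check has passed or before the FCS is (re)computed.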


Had something similar last year because of a core router fabric issue. A few years ago there was also a batch of new servers with buggy motherboards corrupting/dropping packets. Can't begin to imagine how hard it was to diagnose.

That's in our own datacenters, not the cloud.


> can't begin to imagine how hard it was to diagnose.

Yeah, when it happened to me, it completely threw me for a loop. We had reports of corruption in video files, which started the debug cycle. It was shocking when we isolated the box causing the issue.

But I guess your broader point has to be right: about the only way to have this sort of error is at the hardware level, because basic CRC checking should otherwise raise some sort of alarm.
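
Which is also the argument for end-to-end checks: if garbage can make it into caches/databases, an application-level checksum stored alongside the payload catches it no matter where in the stack the corruption happened. A minimal sketch (the cache client and key are hypothetical):

    import hashlib

    def wrap(value: bytes) -> bytes:
        # Prepend a SHA-256 digest so corruption is detectable on read.
        return hashlib.sha256(value).digest() + value

    def unwrap(blob: bytes) -> bytes:
        digest, value = blob[:32], blob[32:]
        if hashlib.sha256(value).digest() != digest:
            raise ValueError("payload failed integrity check")
        return value

    # hypothetical usage with some cache client:
    # cache.set("user:42", wrap(serialized_user))
    # user = unwrap(cache.get("user:42"))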


Keep in mind that hardware runs firmware. What is called a hardware issue can actually be a software issue.

It wasn't just one box for us. Basically, the part number was defective (a motherboard NIC), every single one that was manufactured. This affected a variety of things, since servers are bought in batches and shipped to multiple datacenters; damn near impossible to root cause.

The CRC/checksum can be computed by the OS (kernel driver) or offloaded to the NIC. I think it's unlikely for buggy CRC code to be shipped in a finished product; it would be noticed immediately, because nothing would work.
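
On Linux you can at least see which of that work is offloaded: ethtool -k lists the checksum offload features per interface. Rough sketch of pulling that out programmatically (assumes ethtool is installed and "eth0" is the interface name):

    import subprocess

    def checksum_offload(iface: str = "eth0") -> dict:
        # "ethtool -k" prints one "feature: on/off" line per offload setting.
        out = subprocess.run(["ethtool", "-k", iface],
                             capture_output=True, text=True, check=True).stdout
        features = {}
        for line in out.splitlines():
            if "checksumming" in line:
                name, _, state = line.partition(":")
                features[name.strip()] = state.strip()
        return features

    if __name__ == "__main__":
        print(checksum_offload())   # e.g. {'rx-checksumming': 'on', 'tx-checksumming': 'on'}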


Just curious, is this in a specific region (or regions)?



