
I use both services heavily at work. The networking in GCP is terrible. We experience minor service degradation multiple times a month due to networking issues in GCP (elevated latency, errors talking to the DB, etc.). We've even had cases where there was packet corruption at the bare metal layer, so we ended up storing a bunch of garbage data in our caches / databases. The networking is also less understandable on GCP than on AWS. For instance, the external HTTP load balancer uses BGP and magic, so you aren't in control of which zones your LB is deployed to. Some zones don't have any LBs deployed, so there is a constant cross-zone latency hit when using those zones. It took us months to discover this, after consistent denials from Google Cloud support that anything was wrong with the specific zone our service was running in.
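
A rough way to surface that kind of cross-zone latency hit is to time repeated TCP connects to the LB from an instance in each zone and compare the distributions; a minimal sketch below, with a hypothetical hostname and port:

    import socket
    import statistics
    import time

    def connect_latency_ms(host: str, port: int, samples: int = 20) -> list[float]:
        # Time a TCP handshake to the LB; run this from instances in different zones.
        timings = []
        for _ in range(samples):
            start = time.perf_counter()
            with socket.create_connection((host, port), timeout=2):
                pass
            timings.append((time.perf_counter() - start) * 1000)
        return timings

    if __name__ == "__main__":
        t = connect_latency_ms("lb.example.internal", 443)  # hypothetical LB endpoint
        print(f"p50={statistics.median(t):.1f}ms p95={sorted(t)[int(0.95 * len(t))]:.1f}ms")

If instances in one zone consistently see higher connect times than instances in another, that points at the cross-zone hop.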

AWS, on the other hand, has given us very few problems. When we do have an issue with an AWS service, we're able to quickly get an engineer on the phone who, thus far, has been able to explain exactly what our issue is and how to fix it.




> We've even had cases where there was packet corruption at the bare metal layer,

I'd love to know how this happens in the modern world. I've seen it myself only once (not GCP, but on our own network with Cisco equipment).

Is something in the chain not checking the packet's CRC?
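
For context, Ethernet frames carry a 32-bit FCS, which is a plain CRC-32, so a flipped bit in transit should normally cause the receiver to drop the frame. A tiny sketch of why, using Python's zlib.crc32 (same CRC-32 polynomial) and made-up frame bytes:

    import zlib

    payload = b"some application data on the wire"   # made-up frame payload
    fcs = zlib.crc32(payload)                        # what the sender appends as the FCS

    corrupted = bytearray(payload)
    corrupted[5] ^= 0x01                             # a single bit flipped in transit

    print(hex(fcs))
    print(hex(zlib.crc32(bytes(corrupted))))         # mismatch -> receiver discards the frame

So for corruption to reach the application, it typically has to happen somewhere the CRC doesn't cover, e.g. inside a device after the check has passed or before the FCS is (re)computed.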


Had something similar last year because of a core router fabric issue. A few years ago there was also a batch of new servers with buggy motherboards corrupting/dropping packets. Can't begin to imagine how hard it was to diagnose.

That's in our own datacenters, not the cloud.


> can't begin to imagine how hard it was to diagnose.

Yeah, when it happened to me, it completely threw me for a loop. We had reports of corruption in video files, which started the debug cycle. It was shocking when we isolated the box causing the issue.

But I guess your broader point has to be right: about the only way to have this sort of error is at the hardware level, because basic CRC checking should otherwise raise some sort of alarm.
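
Which is also the argument for end-to-end checks: if garbage can make it into caches/databases, an application-level checksum stored alongside the payload catches it no matter where in the stack the corruption happened. A minimal sketch (the cache client and key are hypothetical):

    import hashlib

    def wrap(value: bytes) -> bytes:
        # Prepend a SHA-256 digest so corruption is detectable on read.
        return hashlib.sha256(value).digest() + value

    def unwrap(blob: bytes) -> bytes:
        digest, value = blob[:32], blob[32:]
        if hashlib.sha256(value).digest() != digest:
            raise ValueError("payload failed integrity check")
        return value

    # hypothetical usage with some cache client:
    # cache.set("user:42", wrap(serialized_user))
    # user = unwrap(cache.get("user:42"))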


Keep in mind that hardware runs firmware. What is called a hardware issue can actually be a software issue.

It wasn't just one box for us. Basically, the part number was defective (a motherboard NIC), every single one that was manufactured. This affected a variety of things, since servers are bought in batches and shipped to multiple datacenters; damn near impossible to root cause.

The CRC/checksum can be computed by the OS (kernel driver) or offloaded to the NIC. I think it's unlikely for buggy CRC code to be shipped in a finished product; it would be noticed immediately, because nothing would work.
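
On Linux you can at least see which of that work is offloaded: ethtool -k lists the checksum offload features per interface. Rough sketch of pulling that out programmatically (assumes ethtool is installed and "eth0" is the interface name):

    import subprocess

    def checksum_offload(iface: str = "eth0") -> dict:
        # "ethtool -k" prints one "feature: on/off" line per offload setting.
        out = subprocess.run(["ethtool", "-k", iface],
                             capture_output=True, text=True, check=True).stdout
        features = {}
        for line in out.splitlines():
            if "checksumming" in line:
                name, _, state = line.partition(":")
                features[name.strip()] = state.strip()
        return features

    if __name__ == "__main__":
        print(checksum_offload())   # e.g. {'rx-checksumming': 'on', 'tx-checksumming': 'on'}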


Just curious, is this in a specific region (or regions)?



