
It's worth noting that the instance migration basically null-routed the redis VM for a good 30 minutes, until we manually intervened and restarted it. The instance was completely disconnected from the internal network immediately following the migration. From what we could gather from instance logs, the routing table on the VM was completely dropped and it could not even connect to the magic metadata service (metadata.internal - we saw "no route to host" errors for that). This is a pretty serious bug within GCP and we've already opened a case with them in the hope that they can get a fix out. I think this is the 4th or 5th major bug we've encountered with their live migration system that could have led, or did lead, to an outage or internal service degradation. The GCP team has seriously investigated and fixed every bug we've reported to them so far, so props to them for that! Live migration is incredibly difficult to get right.

We believe this triggered a bug in the redis-py Python driver we use (specifically this one: https://github.com/andymccurdy/redis-py/pull/886), which is what forced us to do a rolling restart of our API cluster in the first place, to get the connection pools back into a working state. redis-sentinel had correctly detected the instance going away and initiated a fail-over almost immediately after the instance went offline, but because of the odd network situation caused by the migration (absolute packet loss instead of connections being reset), the client driver was unable to properly fail over to the new master. We already have work planned on our own connection pooling logic for redis-py, as right now the state of the driver in HA redis setups is actually pretty awful, and the maintainer doesn't appear to have the time to close or look at PRs that address these issues (we opened one in March that fixes a pretty serious bug during fail-over, https://github.com/andymccurdy/redis-py/pull/847, and it has yet to be addressed).
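To illustrate why "absolute packet loss instead of connections being reset" is the nasty case: a reset makes a blocked read fail right away, while a black-holed peer just leaves the read hanging unless the client sets its own timeout. Below is a rough, stdlib-only sketch of that behaviour, not the redis-py code path and not our fix. The helper names `start_silent_server` and `ping_with_timeout` are made up for the example, and a loopback server that accepts but never replies stands in for the unreachable master. (redis-py does accept a `socket_timeout` connection option, which is the analogous knob there.)

```python
import socket
import threading


def start_silent_server():
    """Hypothetical stand-in for a black-holed master: accepts, never replies."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    srv.listen(1)
    held = []  # keep accepted sockets referenced so they are not closed

    def accept_loop():
        try:
            conn, _ = srv.accept()
            held.append(conn)  # hold the connection open and stay silent
        except OSError:
            pass  # listener was closed before anyone connected

    threading.Thread(target=accept_loop, daemon=True).start()
    return srv, held


def ping_with_timeout(address, timeout_s):
    """Send a PING and report what happened within timeout_s seconds."""
    # The timeout passed here also applies to later send/recv calls.
    with socket.create_connection(address, timeout=timeout_s) as sock:
        sock.sendall(b"PING\r\n")
        try:
            data = sock.recv(64)
        except socket.timeout:
            # No reply and no reset: only the client-side timeout saves us.
            return "timed out"
        return "reply" if data else "closed"


if __name__ == "__main__":
    server, _held = start_silent_server()
    print(ping_with_timeout(server.getsockname(), 0.5))
    server.close()
```

Without the timeout, the `recv` call would block indefinitely, which is roughly the state a pooled connection ends up in when the peer silently disappears. A client that treats the timeout as a signal to drop the connection and ask sentinel for the current master again has a chance to recover on its own.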

For those of us unfamiliar with GCP, do you mean that the default route of your VM was unable to route its traffic? Or is there a routing config running on customer VMs that GCP manages live?

GCP has a virtual networking stack to support a bunch of crazy (and awesome) features Google has built. Unfortunately the complexity here seems to hurt power users like us. In this case it appears that, for some unknown reason, the node failed to program its network stack when coming up after the migration, meaning it was completely unreachable (even the metadata service used internally by Google could not be reached).
