Hacker News new | past | comments | ask | show | jobs | submit login
Having fun with Redis Replication between Amazon and Rackspace (3scale.github.com)
79 points by solso on July 31, 2012 | hide | past | favorite | 38 comments

With a 2.6 system this problems will be simpler to detect, because at this point we have explicit (configurable) limits for clients, of type normal, pubsub, slaves, so the connection is dropped if the amount of pending data goes over a given limit (and the event logged).

This added a bit of complexity to a few code paths that now require to estimate the size of objects, data, and so forth, but I think this will pay back in the long run.

This blog post makes me wonder if we should implement compression in the replication link as a feature in the future, or at least binary encoding. In short there is a plan for a binary version of AOF / replication-link that can save a lot of bandwidth.

woops! @antirez in person :-) amazing job you have done with Redis.

Initially we didn't add compression to save bandwidth but to solve the network bandwidth issues between amazon and rackspace. However, it turns out that you can save on the bill quite a lot, in our case about ~ $700.

Regarding the binary, would be very nice to have. The first though is that compression would work better for us than binary since our keys are quite long (50/100 bytes) and very regular on their patterns, so compression works very well in our case. Probably can be extrapolated for other cases where the keys are typically heavier than the values.

Wouldn't compression between redis and hiredis make sense as well?

How much load would compression put on a Redis instance that talks to n clients where each client sends/receives medium sized JSON docs. And how much would be gained in terms of faster responses from Redis.

Something like lzf or snappy maybe?

In my experience, snappy with very small messages (< 200 bytes or so) can increase bandwidth usage, so I'd say the way in which one uses redis can impact if compression is useful or not.

If you're running out of memory, one would think adding more swap would be the best way to offload the overhead. Hacks aside, if you run into a similar problem again, letting the VM take over will save you time to keep replicating until your load settles down.

Also, if you're using ssh and you need maximum performance you should probably be using an arcfour cipher (`ssh -c arcfour`, alternatives are arcfour128 and arcfour256). I often see a 10x increase in bandwidth using these potentially less-secure algorithms, and slightly lower latency. If you need to save even more bandwidth, a UDP-based SSL VPN may improve things (depending on distance; long links are of course notoriously horrible) or SCTP patched into SSH as a compromise (I haven't played with this yet).

If you're not CPU bound, lzma might gain you bigger savings with marginally higher CPU if you need it down the road.

(I'd also like to take this moment to point out to the Redis devs that 'optimization' of the protocol could have prevented this hack from being necessary until much later. And to the admins of the systems, that monitoring metrics of network bandwidth could have shown the immensity of the traffic for replication. But hindsight is 20/20; let's just remember these examples!)

Surely the fastest encryption scheme would be none at all? It's times like this that it would be nice if ssh still supported the equivalent of "-c none" (or was it "-e none" in v1 ssh?).

Of course, this is based on the assumption that the OP was happy to run their redis sync over the Internet without any other form of encryption. Either the underlying protocol provides encryption or they aren't concerned about their traffic being snooped or MITM'd (albeit very unlikely between two large providers) and so adding extra encryption on top is just an unnecessary CPU overhead.

I've run plenty of (production) stuff over ssh tunnels but it was always nervy with it and happy when the deployment was revamped to avoid the use of ssh tunnels completely.

I'd much rather each underlying application was able to provide tunable encryption and compression itself (using existing trusted libraries such as openssl) as part of its protocol. ssh tunnels are a kludge not a viable production solution (IMHO).

Yes, for a real production solution you'd want some kind of hardened VPN, though I recommended UDP or SCTP based transport for less overhead. But hacks are hacks, and since it's just replication, ssh won't break anything that isn't already broken when it fails.

I would never recommend anyone tunnel anything unencrypted over the internet, especially a database. Arcfour is incredibly fast, comparable to 'none' encryption. RFC4345 specifies the main known attack on arcfour is password auth, so using keys may keep it relatively safe (and using arcfour256 for the longest key).

It's best to assume developers will not implement encryption correctly no matter what library they link to. Always consider how highly vetted the application has been first, and how long it's been around without major issues.

HPN-SSH patches -c none back into the server and client, among other patches that help if you have a long and fat pipe. [0]

[0]: http://www.psc.edu/index.php/hpn-ssh

Adding more swap may not help depending in the workload. If you're using Resque then you need to process jobs faster than they are queued. In that case you also need to ensure jobs are queued on a new instance while the old overloaded one drains (very slowly).

It's hard to see running a critical service over an SSH tunnel as a long-term solution. Even with monit and autossh guarding the connection, SSH isn't exactly rock-solid when used this way.

If the redis protocol is so compressible, maybe this behavior should be integrated as a protocol option like [1]?

[1] http://dev.mysql.com/doc/refman/5.5/en/replication-options-s...

A lot of redundancy can be eliminated, and we have good past experience (and results) with this approach because we do it in the RDB format used for storage.

For the replication link it's easy to imagine a simple way to encode command names as operation IDs, the same for prefixed length stuff. So now you have:

But this may become like: <set-opcode><2><3>foo<3>bar where the opcode is 1 byte, and the rest of the prefixed lengths are variable depending on the argument size, but most of the times 1 or two bytes. This of course is much more compact.

However it's not bad to have the current format to be exactly like the AOF, and the client protocol itself, so everything is the same currently... and is future/backward compatible without issues. But well performances always have some price.

It's hard to see running a critical service over an SSH tunnel as a long-term solution.

Since I'm tempted to set this up myself (not for Redis though), what is the alternative?

vtund http://vtun.sourceforge.net/

I've only used it for tunnelling some traffic out of my home network to a VPS but it's been rock solid for me over several years of frequent use.

Has anyone used this for something like the OP's problem in production?

OpenVPN advertises LZO compression but it seems to be poorly documented (just an --comp-lzo argument in the manpage). Is OpenVPN considered "enterprise" enough for this sort of application, even disregarding the compression support?

OpenVPN actually works very well with compression: used it for 2 years to synchronize databases between 2 sites in India and one on the US and the compression cut the sync time nearly in half. It was pretty much fire-and-forget: I think I only looked at it twice in the 2 years after I set it up, both times to replace certs. I'd use that over an ssh tunnel, if only because you wouldn't need the autossh tricks.

that's a good point... as mentioned in the post we use ssh tunnels and have experienced quite a few glitches, but in this case it has been running without a hiccup for 3 weeks (proof of nothing of course).

I suspect (guess) that not having an idle connection help the tunnel to not randomly drop.

Can you talk more about what the glitches were like? We're a few months away from production with our first product that uses redis, so this is very interesting.

Have you simulated a network partition? How does redis handle the reconnect?

The glitches with the ssh tunnel were not for the redis replication setup. This has been running fine for 3 weeks already.

However we also use ssh tunnels for our continuous integration (jenkins) and the irc bots that we use for development (http://3scale.github.com/2012/06/29/irc-driven-development-p...).

In this setup we have the autossh going awol every one or two weeks. But as the blog post mentions, the ssh tunnel runs between our HQ (fiber) and Amazon, not very reliable. Furthermore, there is a lot of idle periods, which seems to trigger most of the issues. We cannot give more specifics since we forcefully just restart the daemon with a monit/munin combo.

Firewalls (which may be beyond your control and that you may not even be aware of) between you and the other end often drop connections that have been idle for a while (anything from 5 minutes to a couple of hours), especially if the firewall is busy tracking tens of thousands of active connections.

The outbound firewall of our company drops my outbound ssh connections if they are idle for just 5 minutes. Adding:-

  Host *
    ServerAliveInterval 60
in my ~/.ssh/config file prevented most of the drops (the rest are explained by proper network outages or my DSL connection at home flapping.)

A corresponding ClientAliveInterval setting in the sshd_config file also helps mask the problem [EDIT] if I ssh home from a machine at work that doesn't share my home directory ssh config file.

Depending on length of time the network is "out" there's a avariety of things that could occur. In general, the master would kick off a bgsave and the slave would re-bootstrap once it received said snapshot.

that's totally correct. In the case of a redis replication, the ssh tunnel failing would be quite costly, since the whole database will have to be send again since the slave would do a SYNC op. And that can be multiple GB in our case. But so far, we haven't experience it.

I don't know what sort of data they're storing in redis, but it seems like you would want any over-the-internet replication of any kind to be encrypted via ssh tunnel or some other means.

I'm trying to understand 3scale's product. On one of their landing pages they state, "Handling 10’s or 100’s of millions of API calls per day? – No problem. We scale 3scale’s management systems so you don’t have to – our infrastructure is highly distributed and scalable on demand for your needs."

Now, their architecture diagrams show that that developers directly connect to the API servers. So the origin servers must be scaled out to handle "10's or 100's of millions of API calls per day" anyway.

So 3scale provides some kind of external dashboard that you can point an event stream at and get analytics? That is, something like New Relic for APIs? Is that correct?

The infrastructure handles rate limits, access keys, analytics, dev portal, billing etc. for the API but does it out of band. So yes, the traffic does go to the API origin and then control agents there enforce the policies, do the tracking etc. (indeed a bit like New Relic).

It's used for some very big APIs but yes, the origin sees the traffic. The system can offload it though - there are agents for Varnish for example which you can sit in front of the app or integration with Akamai, in which case policies are enforced at CDN edge nodes.

If the API is handling high volumes you still have to plan for that - but the point here is that the management elements scale with it. You're not stuck with a black box in front of the API which might blow up at inconvenient moments.

That does sound quite useful; thank you for the explanation.


I've used Redis once over an ssh tunnel (over a domestic internet connection that's everything but fast, it's a long story...). It is slow (as in agonizingly slow), even for a couple of queries, probably not redis fault, still.

One thing you really want to do in this case (described in the article) is to carefully select the encryption protocol for one that uses the least CPU or has the best data rate.

See: http://blog.famzah.net/2010/06/11/openssh-ciphers-performanc...

Here we are not talking about the client on the other end, but just the replication link over ssh that's not an issue because replication in Redis is just an (async) stream so latency is not an issue.

Compressing the replication stream will likely buy you time (at the cost of added complexity resulting from having to maintain and monitor the ssh tunnel), but it won't solve your problem in the long term.

If your replication stream bandwidth continues to grow, the returns on compression will eventually diminish. As the cloud providers can't guarantee network throughput, it might be a good time to start planning to evacuate the cloud (or at least this part of the architecture).

If your business is growing that fast, this is a good problem to have. :)

Indeed it's a nice problem to have :-) we are not yet there, we double every 3 months so far, so we still have 9 months to go :-)

however, getting out of cloud does not help. The problem is the high-availability. To maintain 99.9% we cannot rely on a single data-center, no matter what the claims on availability zones say. We have even seen network partition which are the worst of the worst. Once you go on the "Internet" there is no guarantees (unless dedicated lines).

Of course you'll have to have presence in multiple DCs. I didn't mean to imply that one would suffice.

I'm not sure whether you've analyzed the bandwidth of the replication stream, but my experience is that if you've got a good circuit provider, you should have adequate bandwidth for 30Mbps+ streams, except under extremely unusual conditions, even over the open Internet.

The replication stream is about 20-25 Mbps for the current throughput of 25K-30K req/s, our peak to valley ratio is low.

It seems that we have found this unusual conditions you mention :-) Before compression our replication was often lagging (when below the required bandwidth), which is very dangerous, even if after a minute bandwidth goes well above the threshold.

The case of replication is kind of worst case, it's no good to have an average of 30 Mbps if it stays a substantial fraction of the time below the threshold. For every hours at 75% memory grew 2GB, but even one minute below the threshold is bad... if the replication queue builds up and there is a crash, consistency goes out of the window. So better to over-dimension capacity.

We even did a quick test with Hetzner instead of Rackspace, it was worse. The average over 4 hours was about 15 Mbps.

Isn't there a way to get guaranteed bandwidth / throughput / speed from both parties in this? Not that compression isn't great but this seems to upper-bound pretty much anything.

In the pure theoretical definition, packet switched networks (the internet) are not capable of giving you any guarantees on the throughput. Because of this, parties such as Amazon / Rackspace are unlikely to give you these guarantees either.

The closest thing you can get to guaranteed bandwidth today is to rent (dark) fiber between multiple DC's yourself.

You can use AWS Direct Connect (http://aws.amazon.com/directconnect/) to set up a dedicated 1Gbit or 10 Gbit network connection from an AWS region to the location of your choice.


True - although you can get pretty good approximations. Still though, the point is valid: it would be very expensive to guarantee. Performance profile could drop badly pretty much any time.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact