1. It's tricky to scale out a Redis node when it gets too big. Because RDB files are a single dump of all data, it's not easy to partition the dataset in a specific way. This was a very important requirement for us in order to ease scaling (redis-cluster wasn't ready yet -- we've been following it closely).
2. When you store hundreds of GB of persistent data in Redis, the startup process can be very slow (restoring from RDB/AOF). Since it can't serve reads or writes during this time, you're unavailable (and setting up a slave to cover the gap makes the next problem worse).
3. The per-key overhead in Redis (http://stackoverflow.com/questions/10004565/redis-10x-more-m...). We have many billions of sets that are often only a few elements in size -- think of slicing data by city or device type -- which means that the resulting overhead can be larger than the dataset itself.
If you think about these problems upfront, they're not too difficult to solve for a specific use case (partition data on disk, allow reads from disk on startup), but Redis has to be generic and so can't leverage the optimizations we made.
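For what it's worth, the partition-on-disk idea can be sketched as a deterministic key-to-shard mapping (a toy illustration -- the shard count and key format here are made up, not Amplitude's actual scheme):

```python
import hashlib

# Hypothetical sketch: a stable key -> shard mapping so each partition
# can be dumped and restored independently of the others.
NUM_SHARDS = 16

def shard_for(key: str) -> int:
    # Use a stable hash (md5 here) so the mapping survives process
    # restarts, unlike Python's builtin hash(), which is salted.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# All keys in one partition land on the same shard, so that shard's
# on-disk file covers exactly that slice of the dataset.
assert shard_for("user:12345") == shard_for("user:12345")
```

With a mapping like this, restoring or relocating one slice of the dataset doesn't require touching the others -- which is exactly what a single monolithic RDB dump can't give you.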
Regarding the sets database: I had to solve quite a similar problem at the company where I work, and instead of sets I chose the Redis HyperLogLog structure, because for near-real-time results you only need an approximate count of the sets (or their intersections) -- you don't need to know the specific set members. I just wanted to let you know that it works great for us for doing intersections (PFMERGE) on sets containing hundreds of millions of members. If anybody is interested, I can do a writeup about it.
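For anyone curious, here's a toy version of that trick in plain Python (Redis's actual HLL implementation adds sparse encoding and bias correction that this sketch omits): merging registers gives the union, like PFMERGE, and inclusion-exclusion then recovers an intersection estimate.

```python
import hashlib

# Toy HyperLogLog -- no bias correction or sparse encoding, unlike
# Redis's PFADD/PFCOUNT/PFMERGE, just enough to show the idea.
P = 14                      # 2**14 = 16384 registers, same as Redis
M = 1 << P
ALPHA = 0.7213 / (1 + 1.079 / M)

def _hash(value: str) -> int:
    # 64-bit hash derived from sha1 (illustrative choice).
    return int(hashlib.sha1(value.encode()).hexdigest()[:16], 16)

def new_hll():
    return [0] * M

def add(hll, value: str):
    h = _hash(value)
    idx = h & (M - 1)                      # low P bits pick the register
    rest = h >> P                          # remaining 64 - P bits
    rank = (64 - P) - rest.bit_length() + 1  # position of first set bit
    hll[idx] = max(hll[idx], rank)

def count(hll) -> float:
    # Raw HLL estimate: harmonic mean of register values.
    return ALPHA * M * M / sum(2.0 ** -r for r in hll)

def merge(a, b):
    # Register-wise max = sketch of the union, like PFMERGE.
    return [max(x, y) for x, y in zip(a, b)]

a, b = new_hll(), new_hll()
for i in range(100_000):
    add(a, f"user{i}")
for i in range(50_000, 150_000):
    add(b, f"user{i}")

union = count(merge(a, b))                      # ~150,000
intersection = count(a) + count(b) - union      # ~50,000 via inclusion-exclusion
```

The point is that two sketches of a few KB each stand in for sets of hundreds of millions of members, at the cost of a ~1% counting error -- which is exactly the trade-off that doesn't work if you need the members themselves.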
Did you ever consider using that?
For us, however, it's important to get the set members at the end of the day. Amplitude is unique from other analytics products in that we put a lot of emphasis on the actual users that correspond to a data point on a graph -- one of our key features, Microscope, is the ability to view those users, see more context around the events they are performing, and potentially create a dynamic cohort out of them. As such, approximations that don't allow us to get the set members don't quite satisfy our use case.
If you do need the actual set members in real time then of course you can't use HLL :)
That said, we have looked at Druid, which is also a good example of using lambda architecture in practice (http://druid.io/docs/0.8.0/design/design.html -- note the historical vs realtime distinction). They use many of the same design principles as us, and one of our sub-systems is very similar to it. We still believe the pre-aggregation approach is critical for performance in our use case, though. Lastly, when we started building the architecture (mid-2014), Druid was very new, and I'm generally wary of designing everything around a new and potentially unstable piece of software.
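To make "pre-aggregation" concrete, here's a minimal sketch (field names invented) of rolling raw events up into small per-bucket counts at write time, so queries read tiny aggregates instead of scanning raw events:

```python
from collections import defaultdict

# Illustrative events only -- the schema here is an assumption,
# not Amplitude's actual event format.
events = [
    {"day": "2015-08-01", "type": "signup",   "user": "a"},
    {"day": "2015-08-01", "type": "signup",   "user": "b"},
    {"day": "2015-08-01", "type": "purchase", "user": "a"},
    {"day": "2015-08-02", "type": "signup",   "user": "c"},
]

# Roll up at write time: one counter per (day, event type) bucket.
agg = defaultdict(int)
for e in events:
    agg[(e["day"], e["type"])] += 1

# A dashboard query is now a dictionary lookup, not a scan.
print(agg[("2015-08-01", "signup")])  # 2
```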
So while today Druid may be in use by "numerous large technology companies", at the time the commenter was researching it wasn't showcasing as many large companies.
So how in the heck does this work? At query time you decide which file to get out of S3 (how do you decide this?), parse it, filter it, and merge with the results from the custom-made Redis-like real-time database?
For the real-time layer, I see it as not being mission-critical for most data sets to be 100% correct, but for the ETL part of the process, the guarantees provided by Camus (ensured by the OutputCommitter part of MR, I believe) are invaluable.
MemSQL is not just in-memory -- it also has a column store (note: I don't know VoltDB). Think of MemSQL not as "does everything in memory" but as "uses memory to best effect".
It seems like, without some limits in place, you could end up with a huge number of sets, especially if you're creating them based on event properties.
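One possible guardrail, purely as a sketch (this is an assumption on my part, not how Amplitude actually handles it): cap the number of distinct values per property and funnel the long tail into a catch-all bucket.

```python
# Hypothetical cap on property-derived sets; names are invented.
MAX_VALUES_PER_PROPERTY = 1000

seen_values = {}  # property name -> set of values that got their own set

def set_name(prop: str, value: str) -> str:
    vals = seen_values.setdefault(prop, set())
    if value in vals or len(vals) < MAX_VALUES_PER_PROPERTY:
        vals.add(value)
        return f"{prop}:{value}"
    # Past the cap, new values share one overflow bucket instead of
    # each spawning a fresh set.
    return f"{prop}:__other__"

print(set_name("city", "Berlin"))  # city:Berlin
```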
I guess we'll find out in a future post.