
Scaling Analytics at Amplitude - blader
https://amplitude.com/blog/2015/08/25/scaling-analytics-at-amplitude/
======
jallmann
What shortcomings of Redis set operations does the in-memory data store
address, and how?

Unrelated rant: regardless of its merits, "Lambda" Architecture is probably
the most annoying overloaded term in use today, second only to "Isomorphic"
Javascript. Just because something has a passing resemblance to the functional
style doesn't grant license to re-appropriate a well understood term of art.

~~~
paladin314159
Redis is a great piece of software, and we leverage it for several uses cases
outside of managing sets. For our use case, there were a couple of blockers
that prevented Redis from being a viable solution:

1\. It's tricky to scale out a Redis node when it gets too big. Because RDB
files are just a single dump of all data, it's not easy to make a specific
partitioning of the dataset. This was a very important requirement for us in
order to ease scaling (redis-cluster wasn't ready yet -- we've been following
that carefully).

2\. When you store hundreds of GB of persistent data in Redis, the startup
process can be very slow (restoring from RDB/AOF). Since it can't serve reads
or writes during this time, you're unavailable (setting up a slave worsens the
following problem).

3\. The per-key overhead in Redis
([http://stackoverflow.com/questions/10004565/redis-10x-more-m...](http://stackoverflow.com/questions/10004565/redis-10x-more-
memory-usage-than-data)). We have many billions of sets that are often only a
few elements in size -- think of slicing data by city or device type -- which
means that the resulting overhead can be larger than the dataset itself.

If you think about these problems upfront, they're not too difficult to solve
for a specific use case (partition data on disk, allow reads from disk on
startup), but Redis has to be generic and so can't leverage the optimizations
we made.

~~~
chupy
Hello Jeffrey, First I wanted to say that your post is very nicely written and
full of juicy details! :)

Regarding the sets database, I had to solve quite a similar problem at the
company where I work and instead of sets I actually chose to use the Redis
HypeLogLog structure instead of sets because for near real time results you
just need an approximate count of the sets / or their intersection and you
don't need to know the specific set members. I just wanted to let you know
that it works great for us for with doing intersections (PFMERGE) on sets
containing hundreds of millions of members. If anybody is interested I can do
a writeup about it.

Did you ever consider using that?

~~~
paladin314159
Thanks! We have considered using HLL, and it's a pretty cool algorithm.

For us, however, it's important to get the set members at the end of the day.
Amplitude is unique from other analytics products in that we put a lot of
emphasis on the actual users that correspond to a data point on a graph -- one
of our key features, Microscope, is the ability to view those users, see more
context around the events they are performing, and potentially create a
dynamic cohort out of them. As such, approximations that don't allow us to get
the set members don't quite satisfy our use case.

~~~
chupy
Sorry, I wanted to say that I am not actually not familiar with your product
and was not aware of the feature in which you can create audiences for running
ad campaigns. From the article I thought the sets were used mostly for real
time analytics and that is why I started to talk about HLL.

If you do need the actual set members in real time then of course you can't
use HLL :)

------
angryasian
Out of curiosity why weren't products like Druid
[http://druid.io/](http://druid.io/) or influxdb
[https://influxdb.com/](https://influxdb.com/) or possibly opentsdb taken into
consideration ?

~~~
paladin314159
To be totally honest, there are so many technologies out there that claim to
solve analytics that it's tough to seriously consider all of them.

That said, we have looked at Druid, which is also a good example of using
lambda architecture in practice
([http://druid.io/docs/0.8.0/design/design.html](http://druid.io/docs/0.8.0/design/design.html)
\-- note the historical vs realtime distinction). They use many of the same
design principles as us, and one of our sub-systems is very similar to it. We
still believe the pre-aggregation approach is critical for performance in our
use case, though. Lastly, when we started building the architecture
(mid-2014), Druid was very new, and I'm generally wary of designing everything
around a new and potentially unstable piece of software.

~~~
fangjin
Druid does pre-aggregation (roll-up) of data at ingestion time and is also
used at scale (30+ trillion events, ingesting over 1M+ events/s) by numerous
large technology companies: [http://druid.io/druid-
powered.html](http://druid.io/druid-powered.html)

~~~
dsp1234
Note that the commenter mentioned mid-2014. That page first appeared on (or
about) July 29th, 2014[1], and at that time only contained 4 names:

Metamarkets

Netflix

LiquidM

N3twork

So while today Druid may be in use by "numerous large technology companies",
at the time the commenter was researching it wasn't showcasing as many large
companies.

[1]
[https://web.archive.org/web/20140729014707/http://druid.io/d...](https://web.archive.org/web/20140729014707/http://druid.io/druid-
powered.html)

------
yresnob
"Finally, at query time, we bring together the real-time views from the set
database and the batch views from S3 to compute the result"

so how in the heck does this work? at query time you decide what file to get
our of s3 (hwo do u decide this?), parse it, filter it, and merge with the
results from the custom made Redis like real time database?

~~~
paladin314159
The files in S3 are pre-aggregated results keyed by how we fetch them (e.g.
there will be a file containing all of the users active on a particular day).
What you've described is a pretty accurate description of what happens :)

~~~
sskates
We'll be sharing more about our query architecture in the future as well as
other parts of the stack that we haven't included here. The query layer is an
impressive piece of architecture that handles fast access to multiple
distributed data stores.

------
paladin314159
Author of the post here. Happy to talk about how we've designed/built our
architecture at Amplitude!

~~~
tinco
How did you guys split the databases per customer? Is it all one big stream of
data for you or does it get split at a pretty early level? Is data of multiple
customers in every database or do you maintain a cluster per customer?

~~~
paladin314159
Most of our databases are multi-tenant, so a single cluster will handle all
customer data. The exception is Redshift, which has a separate cluster for
each customer since we allow them to have direct access to it
([https://amplitude.com/blog/2015/06/05/optimizing-redshift-
pe...](https://amplitude.com/blog/2015/06/05/optimizing-redshift-performance-
with-dynamic-schemas/)).

------
ecesena
> in-memory database holds only a limited set of data

MemSQL is not just in-memory, but also has column-store (note: I don't know
VoltDB). You can think of MemSQL not as "does everything in-memory", but "uses
memory at the best".

------
msaspence
How do you decide what sets of users you pre aggregate?

It seems like without some limits in place you could end up with huge number
of sets, especially if you are calculating these based on event properties.

~~~
paladin314159
That's a great observation. Somewhere along the spectrum of query flexibility
you reach a point where pre-aggregation doesn't work anymore. We have a
separate column-store based system in place for certain types of queries which
we'll almost certainly blog about in the future!

~~~
msaspence
How does that then affect your cost efficiency. What clever things are you
doing with this to keep your costs so low?

I guess we'll find out in a future post.

------
hopel
Do you store raw data ingested from Kafka directly in S3 or have an
intermediate database for hot data?

~~~
paladin314159
The raw data is stored directly into S3 every hour, but there are multiple
systems that process the data directly from Kafka in order to produce the
real-time pre-aggregated views.

------
msaspence
Its easy to see how you can calculate segments, retention and trends using
user sets, but how do you calculate funnels.

------
jchrisa
The lambda architecture and the split between the heavy slow processing and
the interactive processing reminds me of how a few of our customers are
blending Hadoop and Couchbase for similar use cases:
[http://www.couchbase.com/fr/ad_platforms](http://www.couchbase.com/fr/ad_platforms)

