
TrailDB – An Efficient Library for Storing and Processing Event Data - vtuulos
http://tech.adroll.com/blog/data/2016/05/24/traildb-open-sourced.html
======
spenczar5
It looks like this must be the successor to Deliroll, which was described in
an absolutely _fantastic_ talk from 2013[0]. I've been hoping this would be
open sourced for years. Really glad to see this!

[0]
[https://www.youtube.com/watch?v=rXj5nayS7Yg](https://www.youtube.com/watch?v=rXj5nayS7Yg)

------
Judson
I was just watching this[0] excellent talk from one of the Heap Analytics
engineers on their use of CitusDB + Postgres to do event processing.

What use-case would TrailDB be the obvious, hands-down way to go (vs maybe
Postgres + Citus)?

[0]:
[https://www.youtube.com/watch?v=NVl9_6J1G60](https://www.youtube.com/watch?v=NVl9_6J1G60)

------
jflatow
I'm so glad AdRoll finally open-sourced this. TrailDB is by far the easiest
way to process trillions of events on a single machine. Can't wait for them to
start publishing their internal ecosystem around it too.

~~~
vtuulos
Thanks! I hope we can open-source the rest of the stack now that the core
libraries are out.

Btw, I am happy to answer any questions about the project here.

~~~
buremba
It looks like TrailDB solves a particular problem (which is actually quite
common, btw) the right way. However, the language barrier makes it hard to use
as a DB. I believe the Python binding is nice for a development environment,
but in production we would probably want to use a low-level language for
performance reasons, because we'll be dealing with billions of events on a
single machine. Do you plan to implement a query language, such as SQL, that
compiles to native code?

How do you handle continuous data with TrailDB? It seems that you store raw
events in Kinesis, buffer events, and periodically write TrailDB files to S3.
When you want to process events for a specific user, the events might normally
be spread across random TrailDB files, so the timeline would be mixed up. Do
you merge TrailDB files on a single instance when processing the data, or do
you have a sharding mechanism?

~~~
dialtone
We have an internal framework that compiles a state machine DSL into lower-
level code to execute queries on TrailDBs. However, the Python binding is
fairly thin, and the same can be said for the other ones, like the Go one. So
before optimizing with a lower-level language or a special DSL, I would verify
that Python or one of the other supported languages doesn't already satisfy
your requirements.
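
A toy illustration of the state-machine idea in plain Python (the states and
event names here are made up; this is not the internal DSL):

```python
def converted(trail):
    # Tiny two-state machine over one trail: did a 'view' ever
    # get followed, at any later point, by a 'purchase'?
    state = 'start'
    for event_type in trail:
        if state == 'start' and event_type == 'view':
            state = 'viewed'
        elif state == 'viewed' and event_type == 'purchase':
            return True
    return False

print(converted(['view', 'click', 'purchase']))  # True
print(converted(['purchase', 'view']))           # False
```

A compiled DSL runs the same kind of per-trail automaton, just without the
interpreter overhead.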

Continuous data is handled by sharding TrailDBs on some fields in our log
lines. This way all the related events for a cookie on a given day belong to
the same shard; the shard mapping is the same each day, so we can just
download the same shard ids from S3 and process the files sequentially using
our DSL. With a bit of code, the whole process of downloading from S3 and
processing can be made completely automated; this is in fact what we do in our
data pipeline[0][1].

[0]: [http://tech.adroll.com/blog/data/2015/09/22/data-pipelines-docker.html](http://tech.adroll.com/blog/data/2015/09/22/data-pipelines-docker.html)
[1]: [http://tech.adroll.com/blog/data/2015/10/15/luigi.html](http://tech.adroll.com/blog/data/2015/10/15/luigi.html)
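
The daily sharding scheme above can be sketched in a few lines of Python (the
shard count and tuple layout are hypothetical, not AdRoll's actual pipeline):

```python
import hashlib

NUM_SHARDS = 64  # hypothetical shard count

def shard_id(cookie):
    # Stable hash: the same cookie maps to the same shard id every
    # day, so daily shard files line up across days.
    digest = hashlib.md5(cookie.encode()).digest()
    return int.from_bytes(digest[:8], 'big') % NUM_SHARDS

def shard_events(events):
    # events: iterable of (cookie, timestamp, payload) tuples.
    shards = {}
    for cookie, ts, payload in events:
        shards.setdefault(shard_id(cookie), []).append((cookie, ts, payload))
    return shards

day = [('cookie-a', 1, 'view'), ('cookie-b', 2, 'click'), ('cookie-a', 3, 'click')]
shards = shard_events(day)
# All of cookie-a's events land in one shard, so a single TrailDB
# file holds that cookie's full trail for the day.
```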

------
biokoda
An easily embeddable C lib. Fantastic choice. If only more projects in this
space did this. So much stuff in this area is Java.

~~~
DLA
Could not agree more, biokoda. A C lib with all the bindings offered with
TrailDB is absolutely a winner. C, Python, and Go. A happy day for sure.

Thank you so much Ville and AdRoll for sharing this with the world! Your talk
on the Trillion Row Data Warehouse in Python was an inspiration for much work
I have done since. This library will carry that even further.

~~~
srean
Indeed. AdRoll seems to be one of those few places not totally blinded by the
JVM. The strong presence of D and Python in their stack seems hard to resist.
Sadly, they are not in India (yet?).

------
georgewfraser
Are SQL window functions really so bad? Setting up a completely separate
system for a particular kind of query seems like the wrong approach.

~~~
dialtone
The approach here is very different. It's not that window functions are bad,
but that a traditional SQL database, or even a column-oriented database,
doesn't have a disk format suited to this type of analysis at the volumes
AdRoll handles. TrailDB can filter through tens of millions of events per
second on a MacBook, and just storing billions of events in a more traditional
DB or warehouse would use far more space than TrailDBs do. On top of that,
it's relatively easy to build visualisations with an interactive UI for
exploration on top of these databases. So the window functions being
complicated was perhaps just the spark, but there are real advantages in
speed, efficiency and expressiveness in this type of database.

------
temuze
How does this compare to PipelineDB? I'm pretty sure the cofounders of
PipelineDB are ex-AdRollers, so there should be a lot of overlap, since it was
made with the same problem in mind.

~~~
dialtone
It's not quite the same thing: PipelineDB is a database for executing
streaming queries at massive volumes. That is definitely one of the challenges
at AdRoll.

TrailDB is more about a different way of grouping events, and granularly
querying and analysing each trail of data. TrailDBs are typically materialized
and stored in S3.

------
angryasian
I'm assuming this is primarily used for analytics? Metamarkets, another ad
company, open-sourced Druid for a similar purpose. I'm curious about the
differences between these solutions.

~~~
dialtone
AFAIK Druid is for time series and is a columnar database. Its format has
dimensions and then pre-aggregated fields on those dimensions. Afterwards you
can run SQL queries on that data format to get full aggregations in return.

TrailDB, by contrast, is a library to read and write a data format that is
optimized to give access to granular events and actors within an event stream.
For example, it could be used to trail all of the events generated by one
entity (credit card, cookie, email, account and so on) over a dataset. At that
point you can choose what you want to do with it: extract features for ML,
train ML directly on raw data, run arbitrary queries for outliers and anomaly
detection, and what have you.
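
The trail-building idea can be sketched in Python (field names hypothetical):
group a raw event stream by actor and sort each actor's events by time, which
is the shape TrailDB exposes.

```python
from collections import defaultdict

def build_trails(events):
    # events: iterable of (actor_id, timestamp, event_type) tuples
    # in arbitrary order. Returns {actor_id: [(timestamp, event_type), ...]}
    # with each trail ordered by time.
    trails = defaultdict(list)
    for actor, ts, etype in events:
        trails[actor].append((ts, etype))
    for trail in trails.values():
        trail.sort()
    return dict(trails)

stream = [('card-1', 30, 'purchase'), ('card-2', 10, 'view'), ('card-1', 5, 'view')]
trails = build_trails(stream)
# trails['card-1'] == [(5, 'view'), (30, 'purchase')]
```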

------
Chris2048
Might have to check these out. I'd be interested in the mechanics of storing
data like this.

Also, another event DB project:
[https://github.com/benbjohnson/skydb.io/blob/master/source/blog/2012-04-30-introduction-to-behavioral-databases.html.markdown](https://github.com/benbjohnson/skydb.io/blob/master/source/blog/2012-04-30-introduction-to-behavioral-databases.html.markdown)

~~~
Leftium
Interesting, but the open source SkyDB seems to be dead:

[https://twitter.com/benbjohnson/status/590626780173721601](https://twitter.com/benbjohnson/status/590626780173721601)

~~~
Chris2048
Actually, just reading:

> So I understand, by "pure go" you mean you've thrown out the lua, julia and
> ruby bits?

wow, that was quite the polyglot codebase...

------
gtrubetskoy
Very interesting. I wonder how this compares to writing events to a Parquet
([http://parquet.io/](http://parquet.io/)) or Avro (or both) file - that way
all manner of distributed tools like Hive or Spark could already read it.

~~~
vtuulos
TrailDB is optimized for a very specific data model
([http://traildb.io/docs/technical_overview/#data-model](http://traildb.io/docs/technical_overview/#data-model)) which allows it
to do compression that you couldn't do with other data layouts.

TrailDB is a C library, Parquet and Avro are not. Depending on your use case
this might be a pro or a con.
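
A toy illustration of why this data model compresses well (not TrailDB's
actual on-disk encoding): within a trail, timestamps only increase and field
values repeat, so delta-encoding times and dictionary-encoding values already
shrinks the data considerably.

```python
def encode_trail(events):
    # events: list of (timestamp, value) pairs, sorted by time.
    # Delta-encode timestamps; map repeated values to small integer ids.
    lexicon, encoded, prev_ts = {}, [], 0
    for ts, value in events:
        vid = lexicon.setdefault(value, len(lexicon))
        encoded.append((ts - prev_ts, vid))
        prev_ts = ts
    return lexicon, encoded

trail = [(1000, 'view'), (1005, 'view'), (1009, 'click'), (1012, 'view')]
lexicon, encoded = encode_trail(trail)
# encoded == [(1000, 0), (5, 0), (4, 1), (3, 0)]
# Small deltas and repeated ids compress far better than raw
# timestamps and strings; a row-per-event layout can't exploit this.
```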

------
fiatjaf
At what point on the curve of users in your app does it become better to use
aggregate queries like this instead of just looking at each user individually?

~~~
dialtone
This is a tool to look at each user individually actually, not aggregated.

~~~
fiatjaf
[http://traildb.io/docs/images/wikipedia-sessions.png](http://traildb.io/docs/images/wikipedia-sessions.png)

~~~
dialtone
That's just what the example script does. In the database all the data is
stored granularly so you can, and we (AdRoll) do, use TrailDB to examine
individual trails one by one after you select the subset that you are
interested in.

------
LaurenceW1
This shows incredible promise for being used as the basis for an Event Store
type database system.

------
lobster_johnson
Looks very promising. Someone really should build a distributed server on top
of this.

------
ademarre
How does this compare to something like InfluxDB and other time-series
databases?

~~~
vtuulos
TrailDB is designed for discrete events (think JSON objects) rather than
continuous metrics (like CPU utilization).

Most time-series databases manage to store a large number of data points by
aggregating data over time. Discrete events can't be aggregated. Instead,
TrailDB leverages the predictability of events to compress data over time.
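
A toy contrast (hypothetical request log, not either system's real storage):
a metrics store can roll samples up per time bucket, but rolling up discrete
events discards exactly the records TrailDB is meant to keep.

```python
from collections import Counter

# Hypothetical request log: (minute, request) pairs.
requests = [('00:00', 'GET /a'), ('00:00', 'GET /b'), ('00:01', 'GET /a')]

# A metrics-style TSDB can aggregate these into one count per minute...
per_minute = Counter(minute for minute, _ in requests)
# per_minute == {'00:00': 2, '00:01': 1}

# ...but the roll-up discards the individual requests. TrailDB keeps
# every discrete event and wins the space back via compression instead.
```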

~~~
ademarre
I've not yet used any of these, but I think InfluxDB stores discrete events.
In fact, your description sounds very similar to this comparison of Prometheus
and InfluxDB[0]:

> _InfluxDB is better geared towards the following use cases: Storing all
> individual events, not just time series of values. E.g. storing every HTTP
> request with full metadata vs. storing the cumulative count of HTTP requests
> for certain dimensions._

[0]
[https://prometheus.io/docs/introduction/comparison/#prometheus-vs.-influxdb](https://prometheus.io/docs/introduction/comparison/#prometheus-vs.-influxdb)

~~~
oavdeev
Yes, schema-wise that's pretty close.

However, last time I looked, that didn't seem like a use case InfluxDB had
invested a ton of effort in optimizing, especially if you tried to store a lot
of separate "event histories", either as separate measurements or by using
tags. By "a lot" I mean 100K-100M histories. It may have improved since.

Besides, there is a somewhat fundamental tradeoff between allowing efficient
realtime granular writes, which I believe is a priority for InfluxDB, and
building an efficient indexed store for millions of histories/event trails,
which is more of a TrailDB use case. Kind of OLAP vs OLTP.

------
ralusek
Any plans for a NodeJS client?

~~~
fenguin
Played around with it today. Made some progress here:
[https://github.com/poynt/traildb-node](https://github.com/poynt/traildb-node)

Let me know if you have any use cases that are in the API but not covered yet
by this library :)

------
erichocean
How important is Judy in achieving the level of performance you reached?

~~~
vtuulos
Good question. Judy arrays tend to be really efficient for low-level mappings
like the 128-bit -> 64-bit map
([https://github.com/traildb/traildb/blob/master/src/judy_128_map.h](https://github.com/traildb/traildb/blob/master/src/judy_128_map.h))
that we use to keep track of UUIDs.

It would be interesting to benchmark Judy against another similar well-
optimized data structure.

------
mrits
If you are considering this, I'd check out druid.io.

------
dedalus
Nicely Done!

------
nametakenobv
traildb-r link to github is 404. Checked the account. No repo for R :(

~~~
vtuulos
It is coming soon, sorry for the delay!

~~~
vtuulos
You can find the R bindings now at
[https://github.com/traildb/traildb-r](https://github.com/traildb/traildb-r)

~~~
nametakenobv
Oh you really meant soon. Great, thanks!

------
chinathrow
What use cases other than obnoxious ad-tech is this setup suited to?

~~~
DLA
Well, let's see. Perhaps processing social media data, analysis of stock
market data, looking at weather or other observational data, working with log
events, handling IoT data streams... and on, and on. If you look at the world,
an amazing number of data use cases reduce to events, by time, by some other
dimension.

