
The Log: Real-time data's unifying abstraction - boredandroid
http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
======
justinsb
The idea of the log goes well beyond just real-time data (as the blog post
describes, although the title does not). I think it might well turn out to be
one of the core building blocks of _all_ stateful systems. Amazon's Kinesis has
for the first time exposed a reliable Log-aaS on the cloud; I think we'll
start to see more systems built around it.

My personal "24 commits for December" project is to build a set of open-source
cloud data-stores, all backed by a distributed log using Raft, "blogging all
the way". I'm half-way through, and I've implemented a simple key-value store,
put a Redis front-end on it, used that to implement a Git server, and am
currently working on building a document store with SQL querying. All with the
same architecture: the log provides fault-tolerance and consistency, we have a
data structure specific to the particular service (e.g. message queue or key-
value store), we periodically take state snapshots so that we don't have to
replay the whole log after every failure.
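The architecture described above can be sketched in a few lines. This is a minimal illustration, not code from the actual project: every write goes to the log first, state is a service-specific structure derived from the log, and a snapshot records a log index so recovery replays only the suffix.

```python
import copy


class LogBackedKVStore:
    """Toy key-value store built on the log-centric architecture sketched above."""

    def __init__(self):
        self.log = []          # in a real system: a replicated log (e.g. via Raft)
        self.state = {}        # the service-specific data structure
        self.snapshot = None   # (log_index, state) pair taken periodically

    def put(self, key, value):
        # Append to the log first; apply to state once the append is durable
        # (trivially so in this in-memory sketch).
        self.log.append({"op": "put", "key": key, "value": value})
        self.state[key] = value

    def get(self, key):
        return self.state.get(key)

    def take_snapshot(self):
        # Record the state as of a log index, so replay can start there.
        self.snapshot = (len(self.log), copy.deepcopy(self.state))

    def recover(self):
        # Rebuild state from the snapshot plus the log suffix,
        # instead of replaying the whole log from the beginning.
        start, state = self.snapshot if self.snapshot else (0, {})
        self.state = copy.deepcopy(state)
        for entry in self.log[start:]:
            if entry["op"] == "put":
                self.state[entry["key"]] = entry["value"]


store = LogBackedKVStore()
store.put("a", 1)
store.take_snapshot()
store.put("b", 2)
store.recover()   # replays only the one entry written after the snapshot
```

The same skeleton works for a message queue or document store: only the `state` structure and the `op` handlers change, while the log keeps providing fault-tolerance and consistency.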

Feel free to follow along / provide feedback:
[http://blog.justinsb.com/blog/categories/cloudata/](http://blog.justinsb.com/blog/categories/cloudata/)

~~~
boredandroid
Yeah totally agree. I focused on real-time data because offline data
processing can often be less principled about its usage of time and
doesn't need an explicit log (e.g. many ETL pipelines work this way).

~~~
justinsb
Hi there - I see you're Mr Kafka :-) I wanted to use Kafka instead of Raft for
my project, but for my application I couldn't tolerate the (even highly
unlikely) possibility of losing a message (when we lose all the nodes in the
In-Sync Replica set). I understand why this tradeoff is there (for the
clickstream use-case), but I hope it might be possible to retrofit reliable
writes to support other use cases as well. The more open-source reliable log
services we have the better!

~~~
boredandroid
Yup, agreed. Wanna help? :-)

~~~
justinsb
I'd love to help (I opened KAFKA-1050 on the issue), and I even started
coding, but I got stuck on how to deal with the rollback problem when a write
isn't acknowledged by a quorum. I think it requires a "real" distributed
consensus protocol (unlike the current design), which is a fairly big change!
It may be that the answer is to use Kafka's efficient storage design with
Raft, which is something I explored a little on day 1 of my project:
[http://blog.justinsb.com/blog/2013/12/07/cloudata-
day-1/](http://blog.justinsb.com/blog/2013/12/07/cloudata-day-1/)
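The quorum concern above can be made concrete with a toy calculation (this is an illustration of the general majority-commit rule, e.g. as Raft applies it, not Kafka's actual ISR logic): an entry counts as committed only once a majority of replicas have stored it, and an unacknowledged entry must be rolled back rather than silently kept.

```python
def commit_index(replica_lengths, quorum):
    """Highest log index stored on at least `quorum` replicas.

    replica_lengths: number of entries durably stored on each replica.
    Entries beyond this index are uncommitted and may be rolled back.
    """
    return sorted(replica_lengths, reverse=True)[quorum - 1]


replicas = [5, 4, 2]   # log lengths on three replicas
quorum = 2             # a majority of 3
print(commit_index(replicas, quorum))   # 4: entry 5 exists on only one replica
```

Losing the single replica holding entry 5 loses nothing committed; acknowledging entry 5 before it reached a quorum is exactly the kind of write a crash could silently discard.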

------
chaz
> Finally, we implemented another pipeline to load data into our key-value
> store for serving results. This mundane data copying ended up being one of
> the dominant items for the original development.

I'm always amazed by how difficult it is to simply get the data. Sounds so
simple, but it's fraught with all kinds of issues in structure, timing,
reliability, and scale -- and it's usually underestimated. Every BI project
I've ever worked on was mostly spent simply getting the required data together
into a single database. After that, it's a relative snap.

A bit off topic -- I'm guessing that healthcare.gov's biggest technical
hurdles were similar. Simplistically put, it's a shopping comparison site and
the UI/functionality is fairly trivial (which is why several HN comments
suggested that a small team could build out the whole thing in days/weeks).
But imagine if the data to support it was from 36 different legacy sources
that were unreliable, poorly documented, and built/managed by completely
different vendors. That's going to take up the majority of your time and
frustration. Database-driven websites are easy if the data is already built
for you.

Great article -- thanks for writing it.

------
strictfp
Hmm.

>Event Sourcing. As far as I can tell this is basically the enterprise
software engineer's way of saying "state machine replication". It's
interesting that the same idea would be invented again in such a different
context. Event sourcing seems to focus on smaller, in-memory use cases.

Fowler's article doesn't read like that at all if you ask me. He is talking
about the general concept of using an event log as a storage mechanism, very
similar to the OP's article. And if you look at the date, Martin's article is
from 2005. Credit where credit is due.

In the Java world, the idea of event sourcing was made publicly known by the
Prevayler project. These guys boldly suggested storing the current snapshot
in RAM, yes, but they had also built mechanisms for event logging, snapshotting,
and replay. The log was persisted on disk and replayed at startup.

The Prevayler guys were in fact not enterprise at all, quite the opposite.
Their ideas caused quite a stir in the enterprise world.

~~~
kitsune_
A couple of years ago, Event Sourcing / CQRS was a really hot topic in the DDD
enterprise world, especially within the .NET community:

[http://msdn.microsoft.com/en-
us/library/jj554200.aspx](http://msdn.microsoft.com/en-
us/library/jj554200.aspx)

[http://codebetter.com/gregyoung/2010/02/13/cqrs-and-event-
so...](http://codebetter.com/gregyoung/2010/02/13/cqrs-and-event-sourcing/)

[http://blog.jonathanoliver.com/2011/05/why-i-still-love-
cqrs...](http://blog.jonathanoliver.com/2011/05/why-i-still-love-cqrs-and-
messaging-and-event-sourcing/)

[http://vimeo.com/28457510](http://vimeo.com/28457510)

LinkedIn's approach sounds fairly similar in a lot of ways.

------
pixelmonkey
Great article. A highly relevant quote:

    
    
        The log is similar to the list of all credits and debits 
        and bank processes; a table is all the current account
        balances. If you have a log of changes, you can apply 
        these changes in order to create the table capturing the
        current state. This table will record the latest state 
        for each key (as of a particular log time). There is a
        sense in which the log is the more fundamental data
        structure: in addition to creating the original 
        table you can also transform it to create all kinds
        of derived tables.
    

Also, a good architecture diagram:

[http://engineering.linkedin.com/sites/default/files/full-
sta...](http://engineering.linkedin.com/sites/default/files/full-stack.png)
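The bank-ledger analogy in the quote reduces to a simple fold. A hedged sketch (the account names and amounts are made up for illustration): the log is the ordered list of credits and debits, and the table of balances is derived by applying the log in order.

```python
# The log: an ordered sequence of (account, signed amount) changes.
ledger = [
    ("alice", +100),   # credit
    ("bob",   +50),
    ("alice", -30),    # debit
]


def materialize(log_entries):
    """Fold the log into a table of current balances, one row per key."""
    balances = {}
    for account, delta in log_entries:
        balances[account] = balances.get(account, 0) + delta
    return balances


print(materialize(ledger))   # {'alice': 70, 'bob': 50}
```

The log is the more fundamental structure in exactly this sense: the same `ledger` can be re-folded into any number of derived tables (per-account totals, daily sums, and so on), while the balances table alone cannot recover the history.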

At Parse.ly, we just adopted Kafka widely in our backend to address just these
use cases for data integration and real-time/historical analysis for the
large-scale web analytics use case. Prior, we were using ZeroMQ, which is
good, but Kafka is better for this use case.

We have always had a log-centric infrastructure, not born out of any
understanding of theory, but simply of requirements. We knew that as a data
analysis company, we needed to keep data as raw as possible in order to do
derived analysis, and we knew that we needed to harden our data collection
services and make it easy to prototype data aggregates atop them.

I also recently read Nathan Marz's book (creator of Apache Storm), which
proposes a similar "log-centric" architecture, though Marz calls it a "master
dataset" and uses the fanciful term, "Lambda Architecture". In his case, he
describes that atop a "timestamped set of facts" (essentially, a log) you can
build any historical / real-time aggregates of your data via dedicated "batch"
and "speed" layers. There is a lot of overlap of thinking in that book and in
this article. It's great to see all the various threads of large-scale data
analytics / integration coming together into a unified whole of similar theory
and practice. Interestingly, I also recently discovered that Kafka + Storm are
widely deployed at Outbrain, Loggly, & Twitter. LinkedIn with Kafka + Samza
and AWS deploying a developer preview of Kinesis suggests to me that real-time
stream processing atop log architectures has gone mainstream.
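The batch/speed split Marz describes can be sketched in a few lines (an illustrative toy, not Storm code): all facts sit in an immutable timestamped log, the batch layer periodically recomputes an aggregate from scratch, and the speed layer folds in only the facts newer than the last batch run.

```python
# The master dataset: an append-only log of timestamped facts.
facts = [(1, "pageview"), (2, "pageview"), (3, "click"), (4, "pageview")]


def batch_view(log, up_to):
    # Batch layer: recomputed from scratch, e.g. nightly, up to a timestamp.
    return sum(1 for ts, event in log if ts <= up_to and event == "pageview")


def realtime_delta(log, since):
    # Speed layer: incremental count over facts the batch layer hasn't seen.
    return sum(1 for ts, event in log if ts > since and event == "pageview")


last_batch_ts = 2
total = batch_view(facts, last_batch_ts) + realtime_delta(facts, last_batch_ts)
print(total)   # 3 pageviews: 2 from the batch view, 1 from the speed layer
```

The appeal is that the speed layer's approximations are transient: the next batch run recomputes from the raw log and absorbs them.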

~~~
pragmatic
The link to Nathan Marz's book:
[http://www.manning.com/marz/](http://www.manning.com/marz/)

Still in early access as of Dec 17, 2013.

------
adolgert
It makes me so happy to see such a clear picture of how service logs relate to
the input and output token streams of finite state machines. We usually think
of finite state machines as these little objects with a few states, but the
category theory version of them is an essential definition of what it means
for a system to be deterministic and depend only on current state and given
inputs. It's a set of allowable input tokens (the input log), a set of
allowable output tokens (the output log of actions taken by the system), an
internal state set Q, a dynamics delta that decides the next state from the
previous state and input token, and an output function lambda that decides what
output token to return given the current state and input token.

By making the statement that logs are streams of tokens which have
deterministic effect, this author is assuming that the services are finite
state machines. This may not be the case if, for instance, they do not set
random number generators to a known state. Any way in which a service depends on
more than its previous state and inputs violates this principle. If it meets this
principle, then the logs are, by definition, taken from the strings of
allowable input tokens X or the strings of allowable output tokens Y.

The debate in the article about at what level to log is a debate about which
portion of the service to treat as an FSM. It boils down nicely.
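The delta/lambda formulation above is straightforward to write down (a toy Mealy-style machine with made-up transition rules): replaying the same input log against the same initial state deterministically reproduces the same output log, which is exactly the property the article relies on.

```python
def delta(state, token):
    """Dynamics: next state from (current state, input token)."""
    return 1 - state if token == "flip" else state   # toggle between 0 and 1


def lam(state, token):
    """Output function: output token from (current state, input token)."""
    return f"{token}@{state}"


def run(input_log, state=0):
    """Replay an input log, producing the output log of actions taken."""
    output_log = []
    for token in input_log:
        output_log.append(lam(state, token))
        state = delta(state, token)
    return output_log


# Determinism: two replays of the same input log agree token for token.
assert run(["flip", "x", "flip"]) == run(["flip", "x", "flip"])
```

A service that consulted an unseeded random number generator inside `delta` or `lam` would break this assertion, which is the violation described above.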

Oh, a dense but beautiful article on this is Machines in a Category by Arbib
and Manes.

------
akrymski
Logs should be studied in CS together with Turing Machines - they are a vital
component of today's architectures. I applaud the effort of clearly describing
the role of logs in today's distributed architectures in a concise and easy-to-
grasp way. Everyone studying database systems and distributed architectures
should read this article. Thanks, Jay!

We too have arrived at using logs at Post.fm, however with a slightly
different application: syncing email clients that can go offline with remote
servers (similar to Exchange). Instead of the traditional approach taken by
most web apps - calling remote APIs directly (the new-age remote procedure
calls, in effect) - I believe new client-server architectures for web apps
will use logs to synchronise state. This is increasingly possible with the
availability of local storage (web-sql, indexed-db, etc).

Another fascinating concept is Acid-State ([http://acid-
state.seize.it](http://acid-state.seize.it)) which "keeps a history of all the
functions (along with their arguments) that have modified the state. Thus,
recreating the state after an unforeseen error is as simple as rerunning the
functions in the history log." The idea of a log being transparently generated
at application run-time is fascinating. Function calls elegantly map to
'transactions' when modifying multiple rows this way.
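The acid-state idea can be imitated in a few lines (a Python stand-in for illustration, not the actual Haskell library): each state-modifying call is appended to a history log with its arguments before being applied, so the state can be rebuilt by rerunning the history.

```python
FUNCS = {}   # registry of state-modifying "transactions"


def transaction(fn):
    """Register a function so it can be logged and replayed by name."""
    FUNCS[fn.__name__] = fn
    return fn


@transaction
def deposit(state, account, amount):
    state[account] = state.get(account, 0) + amount


def execute(state, history, name, *args):
    history.append((name, args))   # log the call (the "transaction")...
    FUNCS[name](state, *args)      # ...then apply it to the state


def replay(history):
    """Rebuild the state from scratch by rerunning the logged calls."""
    state = {}
    for name, args in history:
        FUNCS[name](state, *args)
    return state


state, history = {}, []
execute(state, history, "deposit", "alice", 100)
execute(state, history, "deposit", "alice", -30)
assert replay(history) == state == {"alice": 70}
```

The function-call granularity is what gives the elegant mapping to transactions: one logged call can touch many rows, yet replay applies it atomically as a unit.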

Another interesting outcome of thinking about database systems as logs, is
that the tables are in effect read-only. You don't really "modify a row in a
table", but add an entry to the log. At some point the database system updates
the table to reflect the additions to the log (eventual consistency). If you
make the database system wait for the log processing to complete before
returning - you essentially get ACID.

Sometimes I wish there was a simpler, more transparent database system that
made the log front and center, letting me specify if a SELECT requires the
table to be updated with respect to the log or not. Current DBMSes seem to
hide lots of functionality instead of providing a simple model that can be
tweaked to a particular application.
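The wished-for model is easy to prototype (an illustrative sketch with invented names, not any existing DBMS): writes only append to the log, the table lags behind, and each read chooses whether to force the table up to date with respect to the log first.

```python
class LogFirstDB:
    """Toy database where the log is front and center and the table is derived."""

    def __init__(self):
        self.log = []
        self.table = {}
        self.applied = 0   # how much of the log the table currently reflects

    def write(self, key, value):
        self.log.append((key, value))   # writes touch only the log

    def _catch_up(self):
        # Apply the unprocessed log suffix to the table.
        for key, value in self.log[self.applied:]:
            self.table[key] = value
        self.applied = len(self.log)

    def select(self, key, consistent=False):
        if consistent:        # "wait for log processing" -> read-your-writes
            self._catch_up()
        return self.table.get(key)


db = LogFirstDB()
db.write("x", 1)
db.select("x")                    # None: stale, eventually consistent read
db.select("x", consistent=True)   # 1: table updated with respect to the log
```

The `consistent` flag is exactly the per-SELECT knob the comment asks for: pay the catch-up cost when you need ACID-style reads, skip it when stale data is acceptable.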

------
irickt
Just a coincidence, I guess, that logs will be denied to LinkedIn's customers.
From an email this week:

"... We'll be retiring the LinkedIn Network RSS Feed on Dec. 19th. All of your
LinkedIn updates and content can still be viewed on LinkedIn, or through the
LinkedIn mobile app. ..."

------
chubot
This is a fantastic and insightful article, and I'm sure the relatively few
comments are a result of people's minds slowly bending to this new way of
thinking :)

I particularly liked the list of related resources at the end. I have been
looking through academic papers, open source, and also "enterprise
integration" stuff, and it always strikes me how people re-invent the same
things under different names.

One question though: What about access control and security? Everyone having
the chance to subscribe to all data at a company is of course fantastic for
product development and productivity. But as a company grows it will also
become the case that not every system _should_ potentially access all
information.

------
timmclean
> I suspect we will end up focusing more on the log as a commoditized building
> block irrespective of its implementation in the same way we often talk about
> a hash table without bothering to get in the details of whether we mean the
> murmur hash with linear probing or some other variant. The log will become
> something of a commoditized interface, with many algorithms and
> implementations competing to provide the best guarantees and optimal
> performance.

Very insightful. Thanks for the in-depth write-up.

------
EGreg
What can I say:
[http://qbixstaging.com/QP/features/streams#column=2](http://qbixstaging.com/QP/features/streams#column=2)

