
Logs Are Streams, Not Files (2011) - fbuilesv
http://adam.herokuapp.com/past/2011/4/1/logs_are_streams_not_files/
======
colmmacc
From the article:

> a better conceptual model is to treat logs as time-ordered streams

At scale it's probably better still to re-think logs as weakly-ordered, lossy
streams. One form of weak ordering is the inevitable jitter that comes with
having multiple processes, threads or machines; without some kind of global
lock (which would hurt performance) it stops being possible to have a true
before/after relationship between individual log entries.

Another form of weak ordering is that it's very common for log entries to be
recorded only at the end of an operation, irrespective of its duration; so a
single instantaneous entry really represents a time-span of activity with all
sorts of fuzzy before/after/concurrent-to relationships to other entries.

But maybe the most overlooked kind of weak ordering is one that is rarely
found in logging systems, but is highly desirable: log streams should ideally
be processed in LIFO order. If you're building some kind of analytical or
visualisation system or near real-time processor for log data, you care most
about "now". Inevitably there are processing queues and batches and so on to
deal with; but practically every logging system just orders the entries by
"time" and handles those queues as FIFO. If a backlog arises; you must wait
for the old data to process before seeing the new. Change these queues and
batching systems to LIFOs and you get really powerful behavior; recent data
always takes priority but you can still backfill historical gaps. Unix files
are particularly poorly suited to this pattern though - even though a stack is
a simple data-structure, it's not something that you can easily emulate with a
file-system and command line tools.
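
In code, the idea is roughly this (an illustrative sketch only; the names and
structure are made up, not taken from any real logging system):

    import heapq

    backlog = []  # max-heap of (-timestamp, batch): newest batch on top

    def enqueue(timestamp, batch):
        # arrival order doesn't matter; the heap keeps the most recent first
        heapq.heappush(backlog, (-timestamp, batch))

    def process_next(handle):
        # newest-first: "now" is served before the historical backlog
        if backlog:
            neg_ts, batch = heapq.heappop(backlog)
            handle(-neg_ts, batch)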

~~~
jimmaswell
>it stops being possible to have a true before/after relationship between
individual log entries.

Timestamps solve this easily. Synchronizing the time between machines isn't
hard.

~~~
jasode
> Synchronizing the time between machines isn't hard

Well, maybe "hard" can be relative to scope and scale. It took google
engineers some trial and error before settling on the concept of "smearing a
leap second" across the entire day. Relying on plain vanilla NTP sync and GPS
satellites was not enough.

http://googleblog.blogspot.com/2011/09/time-technology-and-leaping-seconds.html
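
For the curious, the fix described there spreads the extra second out as a
smoothly varying offset over a window before the leap, roughly like this (my
paraphrase of the idea, not Google's actual code):

    import math

    def smear_offset(t, w):
        # fraction of the leap second applied t seconds into a w-second
        # window: 0 at the start, 1 at the end, with no sudden step
        return (1.0 - math.cos(math.pi * t / w)) / 2.0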

------
falcolas
> Programs that send their logs directly to a logfile lose all the power and
> flexibility of unix streams.

That's because if they push their data to stdout and nothing is reading the
other end of the pipe, the program will block once the OS's pipe buffer fills.
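
You can see that blocking behaviour with a toy writer nobody reads
(illustration only):

    import os, sys

    r, w = os.pipe()               # stand-in for stdout hooked to an unread pipe
    written = 0
    while True:
        os.write(w, b"x" * 4096)   # blocks for good once the kernel pipe
        written += 4096            # buffer (commonly 64 KiB on Linux) is full
        print("wrote %d bytes" % written, file=sys.stderr)

It prints progress for a few iterations and then hangs on the write, which is
exactly what happens to a chatty program whose stdout pipe has no consumer.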

> How many programs end up re-writing log rotation, for example?

This one is because, on some filesystems and kernels, files past a certain
size break logging. If the program didn't rotate its files, the system would
become unresponsive at worst, or the program would go down at best. Plus, if
you take care of rotation and compression yourself (either directly or
through a logrotate conf), you don't have to worry about filling a disk and
causing an outage.

In short, logging is hard, because systems are managed by people. And people
rarely get the logging setups right the first time.

------
philsnow
sink/drain, not source/sink? Does anybody use "sink" to mean the place where
stuff comes out of (from a particular system's perspective) rather than the
place where stuff goes?

~~~
stinos
The last 'streaming' applications we wrote, some for audio processing, others
for image processing, used the source/processor/sink notion (we didn't even
think of using drain): source = origin of data, sink = destination for data,
processor = data goes in and comes out again, so basically a sink at the
input end and a source at the output end. But I don't think I've ever seen
"sink" used to mean a source of data.

------
farva
It's not like there's really a difference between the two, under *nix.

~~~
kiyoto
I see what you are getting at, but my experience as one of the "mailing list"
people for Fluentd (an open source log collector) has been a bit different,
especially when a lot of data is written to a file rapidly. For example, when
you try to reliably tail a log file that gets rotated, you run into all sorts
of edge cases. Such issues shouldn't exist if files and streams were truly
interchangeable in the context of logging.

~~~
antocv
> when you try to tail a log file with log rotation reliably,

Then you do tail --follow -f yourfile.

Files and streams are the same thing.

~~~
bluefinity
I think you mean tail -F yourfile.

--follow and -f are the same thing.

~~~
antocv
No, I meant --follow=name because the default is descriptor.

-f is short for --follow=descriptor.
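
Concretely, with GNU coreutils tail:

    tail -f app.log    # -f = --follow=descriptor: keeps reading the original
                       # file even after logrotate renames it away
    tail -F app.log    # -F = --follow=name --retry: re-opens the path, so you
                       # keep following the freshly rotated log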

------
antocv
Files are streams.

