Grepping logs is terrible (madhouse-project.org)
94 points by madhouse on May 6, 2015 | 102 comments

Binary logs are opaque! Just as much as text logs.

I don't agree with the second assertion there. Text logs are only opaque as far as the format is concerned, but not so much as far as the content goes. Using the example in the article: - - [04/May/2015:16:02:53 +0200] "GET / HTTP/1.1" 304 0 "-" "Mozilla/5.0"
You can read a lot of information without knowing the format, the application that generated it, or even which file it was in - you know it's something to do with localhost, you know when it happened, you know the protocol, from which you can infer the "304" means Not Modified, and you know it came from a Mozilla agent. That's a lot more information than you could get from a binary log without any tools.

That isn't necessarily an argument against binary logging, but the notion that text log files are opaque in the same way as binary logs isn't really true.
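To illustrate how much structure a line like that carries, here is a minimal Python sketch that splits it into named fields. It is an illustration only, not something from the article: the client address 127.0.0.1 in the sample line is an assumption, the field names are my own, and the pattern assumes Apache's Combined Log Format.

```python
import re

# Hypothetical sample line: the client address 127.0.0.1 is assumed for illustration.
LINE = '127.0.0.1 - - [04/May/2015:16:02:53 +0200] "GET / HTTP/1.1" 304 0 "-" "Mozilla/5.0"'

# Combined Log Format: host ident user [time] "request" status bytes "referer" "agent"
COMBINED = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return the named fields of one access-log line, or None if it does not match."""
    match = COMBINED.match(line)
    return match.groupdict() if match else None


fields = parse_line(LINE)
print(fields["host"], fields["status"], fields["agent"])
```

The same fields a person reads off by eye (host, time, request, status, agent) fall out of one regular expression, which is roughly what a structured log store would do at ingest time.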

> That's a lot more information than you could get from a binary log without any tools.

In the environment I work in, I am frequently looking at logs that other teams generate. If I needed to ramp up on their custom logging toolset just to perform simple queries, I would give up and waste the team's time by getting them to perform the queries for me.

(Background: I'm not a journal apologist. For a fact I'm finding it challenging to make the mental shifts required to become adept with this new suite of system tools on my myriad Debian boxen.)

> That's a lot more information than you could get from a binary log without any tools.

Arguably you need a tool to get the information you showed above - a single line from an apache log. The tool may have been grep, cat, vi, awk, less, or whatever. That it was installed as part of a base-build on your computer, or at the behest of your usual configuration management system, is either kind of aside, or kind of the point.

Journal uses a bunch of diagnostic & query tools that get installed at the same time that the journal is installed. Yes, the tool / command to get the same type of data you're looking at above -- something that is comparably readable to a line from an apache log file -- is going to be different. But only different.

It is not only different. It is also less universal.

With a text based logging system, I can take the usb stick with the system that does not boot on my headless homeserver to any computer and read the logs there. I could even boot the original linux system on that server, running a really old kernel and practically no userland tools, and read them there. Because that server was using journald, that was not possible. Still don't know what went wrong.

Bringing a few extra tools for forensics should not be a problem. You bring grep, strings, less, and a bunch more to read text logs. Why not bring one more to read the binary dump too?

I'm sorry, but I don't find the "but I can view text on a machine from the last century" argument convincing. We're not in the past century, and when doing forensics, we usually do that on a reasonable machine, where all the tools we need are available. Otherwise it's an exercise in futility.

> Why not bring one more to read the binary dump too?

For one, because it is not packaged for my distribution. For two, because I get exactly nothing in return. All binary logs do for me is force me to use an additional tool.

> I'm sorry, but I don't find the "but I can view text on a machine from the last century" argument convincing.

POGO-E02. I really don't know how old this is, but it has USB-2 and I bought it 2 years ago, though it was marked as classic then. Maybe 2009?

> and when doing forensics, we usually do that on a reasonable machine, where all the tools we need are available

I'm normally doing that in my own environment, with the tools I am used to, and on my machine. None of that includes a binary log viewer.

> For one, because it is not packaged for my distribution.

Are you running Slackware?

As I wrote below, it is an Ubuntu 14.04 LTS. The important point is that the theoretical availability - there probably is a PPA somewhere - of journalctl is an additional, unnecessary hurdle.

A few years ago, I was called by the manufacturing team to troubleshoot a simulator bench of our own system. The bench had been developed 15 years ago by a subcontractor and was working happily since then. The issue was pressing because it could halt the manufacturing line. I had 0 information or documentation on the bench. All the respective owners had left years ago. I was quite happy when I found there was a basic logging system on a serial line.

I think it says more about a lack of organization than about a wrong technical choice, but sometimes you have to deal with legacy systems and it is good to be able to rely on something as universal as text.

I know it's pretty weak to say so, but that universality is a result of the stage of the migration -- that particular problem will undeniably reduce over time.

But you're right that there are some work flows and use cases where it'll bite you big time. A recent migration to systemd on my Debian lvm-on-dmcrypt laptop caused me some hours of pain, so I'm not unsympathetic.

Back in the early 90's I was involved in managing a very large network of MS-DOS + Windows 3.x machines. The migration to Windows 95 introduced the same concerns, with similar responses. That's the nice thing about working in IT long enough.

Thanks for the sympathy ;)

> The migration to Windows 95 introduced the same concerns, with similar responses.

For me, that is the second big negative point, apart from the missing universal access (which like you said might get better over time, maybe). This route of having a binary journal with its dedicated journal viewers feels an awful lot like being on Windows. It's the same negative feeling I get when I come into contact with GNOME's regedit clone. Stepping back to Windows 95 is hardly progress.

Yeah, I may have mis-worded that sentiment. My point is, and I'm not the first to have noticed, that much of IT seems to be profoundly cyclical. Not necessarily bad, other than the implication we don't really learn from the mistakes of each cycle. Compare and contrast the trisolarans.

Anyway, memory may be failing, but the big problem was one of configuration data (typically small volumes) that used to be kept in .ini (text) files, now being shuffled into the registry. There wasn't a size or complexity issue that drove that move, unlike the challenge of managing and merging many large log files from disparate services on multiple hosts.

In that particular case the toolkit did eventually catch up, but it took a very long time (3-5 years for us, I think, to recover the same level of deployment, configuration, automation). With Journal, in contrast, the toolkit's already there, and ultimately I'm just not convinced that 'I don't have Journal tools installed on this computer' is a persuasive argument against the tool.

I'm not saying there are no compelling arguments, just that one isn't.

Could you not view the journal on another system? I'm curious why:

journalctl -D /<mnt>/<other_system>/var/log/journal

wouldn't work in this case

You assume the other system is a linux box with systemd installed. That may be true, or may not be true at all. :)

Yes, exactly. That would've worked if my other system had had journalctl. It is an Ubuntu 14.04 LTS, and to my knowledge it does not have a package for that. Even if it had and I just did not know, that is kind of the point why binary journals (and systemd) suck.

It's not like specific parts of journalctl can't be ported to other systems and packaged separately. It's a distribution issue, not a fundamental flaw in binary logs.

Sure, the question is why fix something that is not broken? Why should someone write a systemd journal format reader (but journald is just an example here) when text files work already? Why rewrite tools to support it?

If you get to the point when standard unix tools are useless, well, it's time to use a _real_ database and/or log management system. Not the time to write your own. No one is (should be?) going around grepping 100 GBs/day worth of logs.

If your logs aren't text, and it's a small system, I'm not going to look at them. Therefore they don't exist. That's one reason why people don't like binary logs - they are effectively useless.

On the flip side, if the system is huge - then we can use tools like splunk.

grep/tail/awk are the first three tools I use on any system - if you create logs that I can't manipulate with those three tools, then you haven't created logs for your system that I can use.

Yes, grepping logs is terrible if "you have 100Gb of logs a day". I'm not sure why the author is thinking his use case is anything near the norm or why he's shocked in most use cases people prefer text files.

I'm also not getting why he just doesn't use scripts to parse the logs and insert them into a database at that point. Why use some ad-hoc logging binary format if you're doing complex queries that SQL would be better suited for anyway, on proven db systems?

Maybe I'm missing something.
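A minimal sketch of the "parse the logs and insert them into a database" idea, using Python's built-in sqlite3 module. The table layout, the column names and the sample records are all hypothetical; a real setup would feed in parsed lines from the actual log files.

```python
import sqlite3

# Hypothetical, already-split records: (timestamp, host, status, message).
RECORDS = [
    ("2015-05-04T16:02:53", "web1", 304, "GET /"),
    ("2015-05-04T16:02:54", "web1", 500, "GET /api"),
    ("2015-05-04T16:02:55", "web2", 500, "GET /api"),
]


def load(records):
    """Create an in-memory database and insert the parsed records."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE logs (ts TEXT, host TEXT, status INTEGER, message TEXT)")
    db.executemany("INSERT INTO logs VALUES (?, ?, ?, ?)", records)
    db.commit()
    return db


def errors_per_host(db):
    """Count server errors (status >= 500) per host."""
    rows = db.execute(
        "SELECT host, COUNT(*) FROM logs WHERE status >= 500 GROUP BY host ORDER BY host"
    )
    return rows.fetchall()


db = load(RECORDS)
print(errors_per_host(db))
```

Note that the database file SQLite writes is itself a binary format, so this approach keeps text at the source and uses binary storage only for querying.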

Grepping is just fine if you have a few hundred megabytes of data a day, so wanting to kill text based logging, because YOU reached multiple gigabytes a day is going to be met with resistance from the people who don't have those issues.

As the author himself points out: "I'm sorry, but deciding how much and what we log is not your job. Its ours, and this is the amount we have to deal with."

That goes both ways. If I only have one or two servers, having to run a centralized logging service doesn't scale either, the overhead is not worth the trouble.

If I want to look for an IP in logs from multiple services, text files are perfect. Doing the same across multiple servers, yes, then you want centralized logging. Binary logging ruins the first case, while text based works in both (sort of).

I don't really see the point of binary logs. Either you're small enough that text files won't be an issue, or you're large enough to have centralized logging.

It seems that there's a push towards "scaleable solution" for everything, but people keep forgetting that you need to scale down as well. Most of us will never have to run more than a handful of servers, and in these cases the Twitter/Google/Facebook-like infrastructure just isn't worth the hassle.

I think I'm missing the same thing. He keeps going on about structure, but it wasn't obvious to me where the solution (?) actually introduces query-able structure.

He needs a log database, clearly. And when you put it that way, it's obvious why grepping logs is a nice, quick solution in many cases when you aren't getting "100Gb of logs a day".

I think his point was that querying structured data is better than grepping unstructured text. SQL vs Regex, for example. I get the impression that he didn't state what solution to use but simply that binary/structured > text/unstructured. He even says that Journal isn't his ideal solution and never will be.

Throwing 100GB of data into a relational database and being able to run your queries quickly isn't exactly a no brainer

One of the initial challenges I see from an OPS perspective is that the most recent logs are often the most interesting. The latency of the logs being ingested into a DB would prevent me from using the DB. Generally, I find myself grepping logs on the prod servers.

"you have 100Gb of logs a day"

Logs have lots of redundancy, so they compress quite nicely. So it is actually practical to grep those files since on disk they are not so large, and 100Gb of memory data is not a problem to grep.
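As a rough sketch of what searching compressed logs looks like without unpacking them to disk first, here is a small Python version of the idea (the same thing zgrep does on the command line). The file name and the sample lines are made up for the example.

```python
import gzip
import os
import re
import tempfile


def grep_gz(path, pattern):
    """Yield lines from a gzip-compressed text log that match the regular expression."""
    regex = re.compile(pattern)
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:  # decompresses as a stream, one line at a time
            if regex.search(line):
                yield line.rstrip("\n")


# Build a small compressed log so the sketch is self-contained.
workdir = tempfile.mkdtemp()
sample = os.path.join(workdir, "access.log.gz")
with gzip.open(sample, "wt", encoding="utf-8") as handle:
    handle.write("10.0.0.1 GET / 200\n10.0.0.2 GET /admin 403\n10.0.0.1 GET /x 404\n")

matches = list(grep_gz(sample, r"^10\.0\.0\.1 "))
print(matches)
```

Because the file is read as a stream, memory use stays small regardless of the uncompressed size; the cost is the CPU time spent decompressing on every search.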

The author shows a use case for both a small and a large logging system. The use case is complex queries which span multiple applications and don't need regex ninja skills but sensible queries.

He does not. His small logging system is not small at all, it spans multiple systems and has requirements that are not at all typical for small systems.

My small system is usually two computers. Having a router/proxy/firewall box at home is not all that uncommon, and some examples I gave apply there nicely.

Do you really think that your requirements apply? I think none do in the case of a really small system. See, I think I understand your requirements, but I think they are not common at all.

For a small system - a desktop PC and maybe a custom router box - you do not need one central place for the logs. Thus you don't need an easy way to change it. You don't need to preserve logs in a more efficient way than logrotate does. They don't need to be stored more structured than the filesystem does, the queries are local, and grep is more than efficient enough.

Maybe a binary log is the best choice for you - it seems to be what you want. But that does not generalize to the general public. That is why the rant feels very misplaced for me.

You're missing the point. I'm not using a custom logging format. I'm using binary log storage, with emphasis on the storage. There is a database and a search engine behind it.

Logging format and log storage format are two very different things.

Also, I'm not shocked people prefer text files. I'm shocked that they're so strongly against binary log storage. There's an important distinction between the two: you can prefer text, if that fits your case better, without hating on binary storage.

> There's an important distinction between the two: you can prefer text, if that fits your case better, without hating on binary storage.

Except according to the article (which you posted and are defending all over this thread, so I'm guessing you actually wrote it?) the author has NO intention of honoring those who prefer text logs, in fact using the phrase "so vigilantly against text based log storage". To use your own reply, you can prefer binary, if it fits your case better, BUT DON'T HATE ON TEXT STORAGE.

Change for the sake of change is anti-engineering. It is anti-productive. Your changes must be improvements, and they must not cost more than they save or generate in a reasonable period of time.

Many organizations have a fully functional, well-debugged logging infrastructure. The basic design happened years ago, was implemented years ago, and was expected to be useful basically forever. Growth was planned for. Ongoing expenses expected to be small.

That's what happens when you build reliable systems on technologies that are as well understood as bricks and mortar. You get multiple independent implementations which are generally interoperable. You get robustness. And you get cost-efficiency, because any changes you decide to make can be incremental.

Where are the rsyslogd and syslog-ng competitors to systemd's journald? Where is the interoperability? Where is the smooth, useful upgrade mechanism?

Short term solutions are generally non-optimal in the long term. Using AWS, Google Compute and other instant-service cloud mechanisms trades money, security and control for speed of deployment. An efficient mature company may well wish to trade in the opposite direction: reducing operating costs by planning, understanding growth and making investments instead of paying rent.

Forcing a major incompatible change in basic infrastructure rather than offering it as an option to people who want to take advantage of it is an anti-pattern.

"Growth was planned for."

One interesting problem with almost all of the "advantages" of binary logs, is if they're good reasons today, they would have been really awesome reasons in '93 when I started admining my first linux box. The problem with changing the way I've been doing things is I'm already used to the staggering change in performance from a 40 meg non-DMA PATA drive in '93 to dual raid fractional terabyte SSDs. It's really quite a boost in raw power. Yet what I need to log hasn't changed much. So performance gains have been spectacular. So the comparative appeal is incredibly low. It wasn't a "real problem" in '93. It's maybe a thousandth of that problem level today due to technological improvement.

"Hey, if you change everything in your infrastructure, and all your machines, and all your command lines and procedures and ways of thinking to access logs, you MIGHT be 5% more efficient, well, eventually, in the long term" "Eh so what I remember transitioning from spinning rust to SSD and getting 100x the overall system-wide performance a couple years ago, if I want 5% it's more economical just to wait for the next tech boost. Also shrinking basically zero load and effort by half is worthless if there's any cost at all, and unfortunately the cost is absolutely huge."

I assume this comment is related to the journal. The article is not.

But, to reply: yes, many organisations have fully functional, well-debugged logging infrastructures. A lot of them also use binary log storage, and have been for over a decade, and are more than satisfied with the solution.

Both rsyslog and syslog-ng have been able to assist with setting such a thing up for about a decade now.

> Where are the rsyslogd and syslog-ng competitors to systemd's journald? Where is the interoperability? Where is the smooth, useful upgrade mechanism?

The journal has a syslog forwarder, but both rsyslog and syslog-ng can read directly from the journal. Interoperability was there from day one. Smooth upgrade mechanism took a while to iron out, but it's there now, too.

People don't want it because it's binary, not because you can't grep it.

* you need to use a new proprietary tool to interact with them

* all scripts relating to logs are now broken

* binary logs are easy to corrupt, e.g. if they didn't get closed properly.

>You can have a binary index and text logs too! / You can. But what's the point?

The point is having human-readable logs without having to use a proprietary piece of crap to read them. A binary index would actually be a perfect solution - if you're worried about the extra space readable logs take, just .gz/.bz2 them; on decent hardware, the performance penalty for reading is almost nonexistent.
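To sketch what "text logs plus a binary index" could look like: the log stays plain text, and a separate sidecar file holds one fixed-width byte offset per line, so a tool can jump to any entry without scanning. This is only an illustration; the file names and the `.idx` suffix are hypothetical, and a real index would more likely be keyed on timestamps or fields than on line numbers.

```python
import os
import struct
import tempfile

OFFSET = struct.Struct("<Q")  # one unsigned 64-bit byte offset per log line


def build_index(log_path, index_path):
    """Write the starting byte offset of every line in the log to a sidecar file."""
    with open(log_path, "rb") as log, open(index_path, "wb") as index:
        position = 0
        for line in log:
            index.write(OFFSET.pack(position))
            position += len(line)


def read_line(log_path, index_path, number):
    """Fetch line `number` (0-based) by seeking, without scanning the log."""
    with open(index_path, "rb") as index:
        index.seek(number * OFFSET.size)
        (position,) = OFFSET.unpack(index.read(OFFSET.size))
    with open(log_path, "rb") as log:
        log.seek(position)
        return log.readline().decode("utf-8").rstrip("\n")


# Build a small text log so the sketch is self-contained.
workdir = tempfile.mkdtemp()
log_path = os.path.join(workdir, "app.log")
index_path = log_path + ".idx"
with open(log_path, "w", encoding="utf-8", newline="\n") as handle:
    handle.write("first entry\nsecond entry\nthird entry\n")

build_index(log_path, index_path)
print(read_line(log_path, index_path, 2))
```

If the index is lost or corrupted it can be rebuilt from the text, and grep, tail and awk keep working on the log itself the whole time.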

If you generate 100GB/day, you should be feeding them into logstash and using elasticsearch to go through them (or use splunk if $money > $sense), not keeping them as files. Grepping logs can't do all the stuff the author wants anyway, but existing tools can, that are compatible with rsyslog, meaning there is no need for the monstrosity that is systemd.

What's wrong with Splunk? Honest question.

Price, mostly. It's good, but there are alternatives that aren't as ridiculously expensive.

Any alternative that you'd recommend? Thanks.

ELK stack - Elasticsearch, Logstash, and Kibana. The whole stack is opensource :)

Interesting, but it's not a SaaS. It doesn't look like a direct rival to Splunk.

It is a direct rival to Splunk :) They do very similar things; however, IMHO Splunk is the better solution right now. There are LaaS companies that use ELK if you need a cloud solution - Loggly is the first one that springs to mind and I think another is LogSene.

It's expensive

* Why would you need a proprietary tool?

* What if they get broken? I don't want to look at them raw anyway.

* Text logs are easy to corrupt as well. Oh, append only? Well, you can do that with binary storage too.

And again, there is no need for proprietary tools at all. Everything I want to do is achievable with free software - so much so, that I use only such software in all my systems.

As for compressing - yeah, no. Please try compressing 100Gb of data and tell me the performance cost is nonexistent.

As for LogStash & ES: Guess what: their storage is binary.

Also note that my article explicitly said that the Journal is unfit for my use cases.

Why does it have to be proprietary?

It doesn't have to be - but let's look at reality here. NIH syndrome is everywhere, we have millions of competing protocols and formats, everyone thinks they can build a better solution than someone else, etc.

I suppose that if there was a large push to universally log things in binary the possibility exists that sanity would prevail and we'd get one format that everyone agreed upon, but I don't see any reason that this would be the case when historically it basically never happens.

So, at least from my prediction of a future where binary logging is the norm, we have a half dozen or so competing primary formats, and then random edge cases where people have rolled their own, all with different tools needed to parse them.

Or we could stick with good ol' regular text files and if you want to make it binary or throw it in ELK/splunk or log4j or pipe it over netcat across an ISDN line to a server you hid with a solar panel and a satellite phone in Angkor Wat upon which you apply ROT13 and copy it to 5.25 floppy, you can do it on your own and inflict whatever madness you want while leaving me out of it.

It's not like text formats are universal. Thankfully we have settled on utf-8. The same could happen to a slightly more structured universal binary format that would be more suitable for many applications (like logging) and would have an established toolset just like 'text' now.

That doesn't make sense. There can be no such thing as a universal binary format, unless you reinvent text/plain. Analyze a little on why we refer to some formats as being text-based and you'll see it.

It doesn't, but nothing is universal like `grep`. If you find a machine that's logging stuff which doesn't have `grep`, you're already having a bad day.

You just can't say that about binary log formats. Text is a lowest common denominator; and yes, that cuts both ways, but the advantages of universality can't be trivially thrown away.

The machines I'm administering will all log the same way, therefore, within that context, they are universal. We have well documented tools and workflows, so anyone new to the system can catch up and start working with the logs within minutes.

We don't unexpectedly find machines that don't conform to our policies. We control the machines, we know where and how to find the logs. If we found any where we had to grep, we'd be having a very bad day.

Our lowest common denominator is not text, because we control the environment, and we can raise the bar. Being able to do that is - I believe - important for any admin.

Right, but this isn't an argument about log formats. You're making a bigger argument about workflows, and you're saying that yours is unconditionally better. In your environment, it might be most appropriate to put in the up-front investment to totally control all your log formats. Within your context, you get to define a lowest common denominator which isn't text, and it sounds like that makes sense. With the services you run, you might be able to dictate that the log formats are restrictive enough that writing a parser for each one isn't a problematic overhead.

To get the benefits you're claiming, the storage format of your logs is actually irrelevant. If you're going to have an environment where you have to exert that much control over the output of your applications, when you parse the logs doesn't matter. You could do your parsing with grep and awk as the very last step before the user sees results, and you'd see the same benefits. Parsing up-front, assuming you know what data you can safely throw away, might appear to some as a premature optimisation.

> We have well documented tools and workflows, so anyone new to the system can catch up and start working with the logs within minutes.

It sounds like this is something which could be usefully open-sourced, to show how it's done.

> Our lowest common denominator is not text, because we control the environment, and we can raise the bar. Being able to do that is - I believe - important for any admin.

It's a question of what you choose to optimise for. Pre-parsed binary logs in a locked-down environment might be as flexible as freeform text, but I'd need to see a running system to properly judge.

> you're saying that yours is unconditionally better

I don't think I'm saying that. The article presents two setups and a few related use cases, where I believe binary log storage is superior.

> With the services you run, you might be able to dictate that the log formats are restrictive enough that writing a parser for each one isn't a problematic overhead.

I don't need to dictate all log formats. If I can't parse one, I'll just store it as-is, with some meta-data (timestamp, origin host, and so on). My processed logs do not need to be completely uniform. As long as they have a few common keys, I can work with them.

For some apps or groups of apps, I can create special parsers, but I don't necessarily need that from day one. If I'm ok with only new logs being parsed according to the new rules (and most often, I am), I can add new rules anytime.
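A minimal sketch of that "common keys plus optional parsing" approach, under my own assumptions: the key names, the `RULES` table and the sshd pattern are hypothetical and are not taken from the poster's actual system. Every message gets the same few metadata keys, the raw text is always kept, and extra fields appear only when a rule matches.

```python
import re
import time

# Hypothetical rule for one application; more rules can be appended later.
RULES = [
    ("sshd", re.compile(r"Failed password for (?P<user>\S+) from (?P<source_ip>\S+)")),
]


def process(raw, host, program, timestamp=None):
    """Attach common keys to a message; add parsed fields when a rule matches."""
    record = {
        "timestamp": time.time() if timestamp is None else timestamp,
        "host": host,
        "program": program,
        "MESSAGE": raw,  # the freeform text is always kept as-is
    }
    for name, pattern in RULES:
        if name == program:
            match = pattern.search(raw)
            if match:
                record.update(match.groupdict())
                break
    return record


parsed = process("Failed password for root from 10.0.0.7 port 22", "gw", "sshd", 0)
unparsed = process("something nobody wrote a parser for", "gw", "cron", 0)
print(parsed["user"], parsed["source_ip"], sorted(unparsed))
```

Adding a rule later only affects messages processed from then on, which matches the "only new logs are parsed according to the new rules" trade-off described above.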

> Parsing up-front, assuming you know what data you can safely throw away, might appear to some as a premature optimisation.

>> We have well documented tools and workflows, so anyone new to the system can catch up and start working with the logs within minutes.

> It sounds like this is something which could be usefully open-sourced, to show how it's done.

LogStash is a reasonable starting point. Our solution has a lot in common with it, at least on the idea level.

> Pre-parsed binary logs in a locked-down environment might be as flexible as freeform text, but I'd need to see a running system to properly judge.

Only our storage is binary. That is all the article is talking about. Within that binary blob, there are many traces of freeform text, mostly in the MESSAGE keys of application logs which we care less about (and thus, parse no further than basic syslog parsing). You still have the flexibility of freeform text, even if you store it in a binary storage format.

    > Embedded systems don't have the resources!
    > ...
    > I'd still use a binary log storage, because
    > I find that more efficient to write and parse,
    > but the indexing part is useless in this case.
This is yet again a case of a programmer completely misjudging how an actual implementation will perform in the real world.

When I wrote the logging system for this thing http://optores.com/index.php/products/1-1310nm-mhz-fdml-lase... I first fell for the very same misjudgement: "This is running on a small, embedded processor: Binary will probably be much more efficient and simpler."

So I actually did first implement a binary logging system. Not only logging, but also the code to retrieve and display the logs via the front panel user interface. And the performance was absolutely terrible. Also the code to manage the binary structure in the round robin staging area, working in concert with the storage dump became an absolute mess; mind you the whole thing is thread safe, so this also means that logging can cause inter thread synchronization on a device that puts hard realtime demands on some threads.

Eventually I came to the conclusion to go back and try a simple, text only log dumper with some text pattern matching for the log retrieval. Result: The text based logging system code is only about 35% of the binary logging code and it's about 10 times faster because it doesn't spend all these CPU cycles structuring the binary. And even that text pattern matching is faster than walking the binary structure.

Like so often... premature optimization.

I've worked with a number of implementations, both embedded and others (ranging from a PC under my desk, through dozen-node clusters to ~hundred nodes). For most cases, binary storage triumphed. Most often, we kept text based transport.

Again, transport and storage are different. While I prefer binary storage, most of my transports are text (at least in large part, some binary wrapping may be present here and there).

Cool tech you have there, but I only understood it once I saw the video. You basically have a very fast laser that can do volumetric scans at a high framerate, did I get this right? What do people typically use it for?

    > You basically have a very fast laser that
    > can do volumetric scans at a high framerate,
    > did I get this right?
Sort of. The laser itself is constantly sweeping its wavelength (over a bandwidth of >100nm). Using it as a light source in an interferometer where one leg is reflected by a fixed mirror and the other leg goes into the sample something interesting happens: The interferometric fringes produced for a certain wavelength correspond to the spatial frequency of scattering in the sample. So the fringe distribution over wavelengths is the Fourier transform of the scattering distribution. So by applying an inverse Fourier transform to the wavelength spectrum of the light coming out of the interferometer you get a depth profile.

Now the challenge is to get the wavelength spectrum. You can either use a broadband CW light source and a spectrometer. But these are slow, so you can't generate depth scans at more than about 30kHz (which is too slow for 3D but suffices for 2D imaging). Or you can encode the wavelength in time and use a very fast photodetector (those go up to well over 4GHz bandwidth).

This is what we do: Have a laser that sweeps over 100nm at a rate >1.5MHz and use a very fast digitizer (1.8GS/s) to obtain an interference spectrum with over 1k sampling points. Then apply a little bit of DSP (mapping time to wavelength, resampling, windowing, iFFT, dynamic range compression) and you get a volume dataset.
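The core of the "fringes in, depth profile out" step can be sketched in a few lines of Python. This is a toy illustration, not the poster's processing code: it assumes the spectrum is already sampled uniformly in wavenumber (so the time-to-wavelength mapping and resampling are skipped), it omits windowing and dynamic range compression, it uses a slow direct DFT instead of an FFT, and the sample count and reflector position are arbitrary.

```python
import cmath
import math


def depth_profile(fringes):
    """Magnitude of the inverse DFT of a spectrum sampled uniformly in wavenumber."""
    n = len(fringes)
    profile = []
    for depth_bin in range(n // 2):  # the upper half mirrors the lower half for real input
        total = sum(
            value * cmath.exp(2j * math.pi * depth_bin * index / n)
            for index, value in enumerate(fringes)
        )
        profile.append(abs(total) / n)
    return profile


# Synthetic fringes for a single reflector; bin 20 and 128 samples are arbitrary choices.
SAMPLES = 128
REFLECTOR_BIN = 20
mean = 1.0
fringes = [mean + math.cos(2 * math.pi * REFLECTOR_BIN * i / SAMPLES) for i in range(SAMPLES)]

# Remove the constant background so that the DC term does not dominate.
centered = [value - mean for value in fringes]
profile = depth_profile(centered)
peak = max(range(len(profile)), key=profile.__getitem__)
print(peak)
```

A single reflector produces a cosine across the spectrum whose frequency grows with depth, so the transform shows one peak at the bin corresponding to that depth; a real sample is a sum of many such cosines.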

BTW, all the GPU OCT processing and visualization code I wrote, too.

    > What do people typically use it for?
Mostly for OCT, but you can also use it for fiber sensing (using fiber optics as sensors in harsh environments), Raman spectroscopic imaging, short pulse generation and a few other applications. But OCT is the bread and butter application for these things.

Alright, gotta say, that's cool.

Frequency-sweeping... How are you doing that? Is the laser itself able to frequency sweep? Or are you chirping pulses?

    > Frequency-sweeping... How are you doing that?
The basic principle is called FDML; there's a short description of how it works on our company website:


A much more thorough description is found in the paper that introduced FDML for the first time:


    > Is the laser itself able to frequency sweep?
The laser itself is doing the sweeps.

    > Or are you chirping pulses?
No. In fact one of the PhDs that came out of our group was generating pulses by compressing the sweeps:



What you're doing sounds a lot like time-domain spectroscopy in an odd sort of way.

What are the advantages of this versus just chirping a pulsed supercontinuum source?

    > What you're doing sounds a lot like time-domain
    > spectroscopy in an odd sort of way.
The measurement principle is definitely related to TDS.

    > What are the advantages of this versus just
    > chirping a pulsed supercontinuum source?
Output power: Our system can emit >100mW

Sweep uniformity: The phase evolution of the sweeps is very stable; the mean deviation in phase differences between sweeps is on the order of millirad. This means that for the time→k-space mapping the phase evolution has to be determined only once and can then be used for hours of operation; in fact the system operates so repeatably that even after being powered off overnight, you can often reuse the previous day's phase calibration the next morning. Without that, you'd have to use a second interferometer and sample a k-space reference signal for each and every sweep in parallel and use that for k-space remapping.

Ease of synchronization: Trigger signals have very small jitter. Also the jitter between electrical and optical synchronization is on the order of a few ps, which is important for things like Doppler-OCT.

Coherence: Supercontinuum Sources have issues with coherence stability, which degrades the imaging range.

Sensitivity issues: Chirping pulsed supercontinuum sources (which are actually used for OCT) is challenging. It requires a lot of dispersion. High dispersion means a lot of loss, which in turn means it requires another output amplification stage, which in turn will also produce significant optical noise. And optical noise is the bane of OCT, since it reduces the sensitivity. In contrast, a properly dispersion-compensated FDML laser will exhibit very little noise.

Price: Pulsed supercontinuum sources suitable for chirping and OCT applications are quite expensive. Our laser is not cheap either, but it's still more cost-effective.

All good answers. Thank you.

I guess that much of the resistance against the binary logs of systemd is the unfamiliarity and to some extent lack of well known tools for dealing with them. Sysadmins that have years of experience with traditional Unix tools now suddenly have to start almost from scratch when it comes to everyday tools for examining the system. Not only that, programmers are also most familiar with text based formats, and libraries for handling these formats have to become more available in the most popular programming languages and become familiar for programmers that develop tools for analysing systems. Until that happens, sysadmins feel that they are set back by the introduction of binary logs, even if binary logs are technically superior.

It's like no one remembers the reasons we switched away from fixed format records. The biggest of which is that text based logging is a lot more future proof. Sure I might have to change a regex when time stamps improve their resolution to milliseconds, but at least I won't have to rebuild my entire suite and deal with two incompatible binary files on disk.

I don't have experience with binary logs. I think the fragility of binary logs is not baseless though. AFAIK there was (is?) a problem in systemd's journal where a local corruption of the log could cause a global unavailability of the logged data.

People like text logs because local corruptions remain local. Some lines could be gibberish, but that's all. I'm not suggesting that this couldn't be done with binary logs, but you have to carefully design your binary logging format to keep this property.
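To make the "carefully design your binary format" point concrete, here is a minimal sketch of a framed binary log in which damage stays local. Everything in it is an assumption made for illustration (the `MAGIC` marker, the header layout, the function names); it is not how the systemd journal or any other real format works.

```python
import struct
import zlib

MAGIC = b"\xabLOG"                # hypothetical 4-byte sync marker
HEADER = struct.Struct("<4sII")   # marker, payload length, CRC32 of payload


def encode_record(payload: bytes) -> bytes:
    """Frame one log entry so a reader can validate it independently."""
    return HEADER.pack(MAGIC, len(payload), zlib.crc32(payload)) + payload


def read_records(buf: bytes) -> list[bytes]:
    """Return every record that passes its checksum, skipping damaged regions."""
    records = []
    pos = 0
    while True:
        pos = buf.find(MAGIC, pos)
        if pos < 0 or pos + HEADER.size > len(buf):
            break
        _, length, crc = HEADER.unpack_from(buf, pos)
        start = pos + HEADER.size
        payload = buf[start:start + length]
        if len(payload) == length and zlib.crc32(payload) == crc:
            records.append(payload)
            pos = start + length
        else:
            # Damaged record: step one byte forward and look for the next marker.
            pos += 1
    return records
```

A flipped byte costs you the record it lands in and nothing else, which is roughly the property people like about line-oriented text.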

Otherwise I agree with the author that we shouldn't be afraid of binary formats in general, we need much more general formats and tools though (grep, less equivalents).

I'm not fond of "human readable" tree formats like XML or JSON either. bencode could be equally "human readable" as an utf-8 text if one has a less equivalent for bencode.

> I don't have experience with binary logs. I think the fragility of binary logs is not baseless though. AFAIK there was (is?) a problem in systemd's journal where a local corruption of the log could cause a global unavailability of the logged data.

From my experience (I do not want to troll and presume you have not tried it), systemd picks up where it left off when an old log is corrupted and starts a new one. There is a command line utility to verify the integrity of these files (I am on my Windows laptop at work and cannot check). Now, I am not sure about the state of log file repair. I was told it is not possible. However, it seems this means the file is corrupted in a way that makes it hard to index. It is likely still readable. I wish I had seen this last time.


Granted, I use Arch Linux on an old laptop. I had these corruptions routinely happen when I had disabled ACPI controls (I do not use the fancy WMs, I am back to Ratpoison) and completely, and I mean completely, drained the battery until it came crashing to a halt. So, I am not surprised about these corruptions.

Anyone using systemd boxes in production who can comment on this? Flamewar or not, I would like to know more. I do not really care for it one way or the other. Parts I like, parts I do not.

I was thinking exactly the same, once you want to create a binary efficient format which you can query, you then have the same problems as a database. And if there is something we have learned in the history of computing, it's that databases are hard to design properly, and especially from scratch.

And especially when you want it to be immune to random failures without data loss.

The last few entries of a log file before something catastrophic happens are precisely the entries that are the most important to make sure they aren't lost.

This applies more generally than just to logs. I love Unix, but "everything is text" is not actually great. It's better that Unix utils output arbitrary ASCII than that they output arbitrary binary data, but it's obvious why people don't do serious IPC 'the Unix way.' Imagine if instead of exchanging JSON, or ProtoBufs, or whatever, your programs all exchanged text you had to regex into some sort of adhoc structure. So why do we manage our logs and our pipelines that way? There's no actual reason that the terminal couldn't interpret structured data into text for us so that, in the world of intercommunicating processes on the other side of the TTY, everything is well-structured, semantically comprehensible data.

This is the PowerShell argument. It's a step in the right direction, but it needs the tooling and user community to come along with it.

The advantage of the traditional unix pipe manipulation tools is that most of them are simpler and faster than regex.

> There's no actual reason that the terminal couldn't interpret structured data into text for us so that, in the world of intercommunicating processes on the other side of the TTY, everything is well-structured, semantically comprehensible data.

I think you just described PowerShell (or things that follow down the same path, e.g. TermKit) ;-)

JSON is text!

Text is not synonymous with unstructured.

Of course JSON is encoded in Unicode, making it "text," but when it is said that text is the universal protocol of Unix, it means that the only guarantee a well-behaving Unix utility can make is that it will output ASCII. You cannot leverage the further structure of JSON or any other protocol because utilities that interpret JSON do not compose with those many Unix utilities which emit non-JSON data.

Only entropic bits are truly "unstructured data." The question is one of how much semantic structure you can rely on in the data you are processing, which is a continuum.

The title is misleading. I was expecting to discover a better way of dealing with logs in the general case. Instead I got served an attempt by the author to generalize his way of doing things, as if his quite specific use case could apply to the outside world.

Reading this was a waste of my time.

Being a universal open format, text is a better format than binary, unless you don't care about being able to read your data in the future. There are already enough issues with filesystems and storage media; no need to add more complexity.

Logs should be in text. The last thing you want is to find out that your binary format cannot be decoded due to a bug in the logging or because the file got corrupted. Not to mention that you won't be able to integrate with a lot of log systems like Splunk and friends.

On the other hand, if you have logs, you need to store them in a centralized place and have an aging policy, etc... Grepping is definitely not the answer. Systems like Splunk exist for a reason.

Please don't confuse log storage with log transport. We can transform the stored format into any other, if so need be.

(For example, I use Kibana at home. Works great, though I have no text logs stored.)

The greatest thing that I've found recently was fluentd and elasticsearch - we have fluentd on all of our nodes that aggregate logs to a central fluentd search which dumps all of the data into elasticsearch, then we use kibana as a graphical frontend to elasticsearch.

It took a while to get developers to use it, but now it's indispensable - particularly when someone asks me 'what happened to the 1000 emails I sent last month'

Now I know; previously, the data would have been logrotated away.

I think the author is conflating several problems here. There are several ways logs can be used, and efficiency is a scale. For example, if I receive a bug report, I like to be able to locate the textual logs from when the incident occurred and actually just sit and read what was happening at the time. On the other hand, if I'm doing higher-level analysis such as what features users use most, clearly it's more efficient to have some sort of structured format because you're interested in the logs in aggregate. The author makes it sound like they're advocating optimizing for the aggregate use case at the expense of other use cases. I think that the declaration that textual logs are terrible is an oversimplification of the considerations in play.

Also, if the author has a 5-node cluster producing 100Gbs of logs a day, the logs may also be too verbose or poorly organized. I work on a system that produces 100s of Gbs of logs a day but with proper organization they're perfectly manageable.

I think that a more nuanced solution is to log things that are useful to manual examination in text form, but high-frequency events that are not particularly useful could reasonably be logged elsewhere (e.g. a database or binary log that is asynchronously fed into a database).

In conclusion, as is frequently the case with engineering, I think the author oversimplifies the problem here and tries to present a one-size-fits-all solution instead of taking a more pragmatic solution. Textual logs are useful when meant for human consumption (debugging) and when they can be organized such that the logs of interest at any time are limited in size, and some other binary-based format is useful for aggregate higher-level analysis.

With a binary log storage system, nothing stops you from browsing all logs that happened around the time of the incident. Instead of locating the files, you just tell the engine to show you the logs from that time onwards (or from a little bit before).

As for our logs being too verbose: nope, read the article.

Also, it's not a one-size-fits-all solution: I have no problem with people using text. All the article wants to show is that binary logs are not evil, bad, useless, etc., and that there are actually very good reasons to use them.

For example, storing logs in a database is one kind of binary log storage: most databases don't store the data as text.

One solution to the problem of too much logging data can be what I call "session-based logging" (also known as tracing). You can enable logging on a single session (e.g. a phone call), and for that call you get a lot of logging data, much more than a typical logging system.

This obviously only works when you are troubleshooting a specific issue, not when you need to investigate something that happened in the past (where the logging for the session wasn't enabled). However, it has proven to be an excellent tool for troubleshooting issues in the system.
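For readers who have not seen the idea, a rough sketch of session-scoped tracing using Python's standard `logging` module follows. The `SessionTraceFilter` class and the `session` attribute are names invented for this illustration; this is not how AXE or Nobill implement it.

```python
import logging


class SessionTraceFilter(logging.Filter):
    """Pass INFO and above for everyone, DEBUG only for traced sessions."""

    def __init__(self):
        super().__init__()
        self.traced = set()  # session ids currently being traced

    def filter(self, record):
        if record.levelno >= logging.INFO:
            return True
        # The caller attaches the session id via extra={"session": ...}.
        return getattr(record, "session", None) in self.traced


if __name__ == "__main__":
    log = logging.getLogger("calls")
    log.setLevel(logging.DEBUG)
    handler = logging.StreamHandler()
    trace = SessionTraceFilter()
    handler.addFilter(trace)
    log.addHandler(handler)

    trace.traced.add("call-42")  # switch tracing on for one session
    log.debug("codec negotiated", extra={"session": "call-42"})  # emitted
    log.debug("codec negotiated", extra={"session": "call-7"})   # dropped
```

Note that the logger itself still has to run at DEBUG for this to work, so the log calls are evaluated for every session even when their output is dropped.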

I have used session-based logging both when I worked at Ericsson (the AXE system), and at Symsoft (the Nobill system), and both were excellent. However, I get a feeling that they are not in widespread use (may be wrong on that though), so that's why I wrote a description of them: http://henrikwarne.com/2014/01/21/session-based-logging/

Depending on the language, this can be expensive even if you're not actually logging the data.

And it invites timing-based heisenbugs (enable tracing, problem goes away).

Still a neat approach, however.

Text logs let me do all the things I want to do.

Grep them, tail them, copy and paste, search, transform them, look at them in less, open them in any editor. I love to write little bash oneliners that answer questions about logs. I can use these oneliners everywhere, anytime.

I don't have any of the efficiency problems the author talks about.

The author's use of logs is sophisticated and proactive. Sadly, most Linux installations I've dealt with are lazy and reactive, where logs are kept around "just in case" for future forensics (hah!).

I think binary logging is the wrong word to use. As far as I can tell it's not binary he means, but database logging. Storing things in a database sounds far less scary than binary.

At best it's a NUL separated database structure where the fields are not compressed, which IS greppable just use \x00 in your regexp. At worst he might mean BER, which is an ASN.1 data encoding structure.
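A small sketch of the "just use \x00 in your regexp" approach, in Python rather than grep. The record layout (NUL between fields, newline between records) and the sample data are assumptions for the example, not the format the article's author uses.

```python
import re

# Hypothetical layout: fields separated by NUL, records separated by newline.
SAMPLE = (
    b"2015-05-06T12:00:01\x00sshd\x00Failed password for root\n"
    b"2015-05-06T12:00:02\x00cron\x00job started\n"
)


def grep_records(raw: bytes, pattern: bytes) -> list[bytes]:
    """Return the records that match a bytes regex, like grep would."""
    rx = re.compile(pattern)
    return [rec for rec in raw.split(b"\n") if rec and rx.search(rec)]


if __name__ == "__main__":
    # Anchoring on the NUL separators matches "sshd" only as a whole field.
    for rec in grep_records(SAMPLE, rb"\x00sshd\x00"):
        print(rec.split(b"\x00"))
```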


So some people want a log format that is more structured than plain text lines. That is going to require some sort of specialized tool. So if a dependency is allowable (instead of leaving the log in a format that is already readable by ~everything), why can't the specialized tool generate an efficient index?

A traditional log with a parallel index would be completely backwards compatible, the query tool should work the same way, and you could even treat the index file as a rebuildable cache which can be useful. The interface presented by a specialized tool doesn't have to depend on any specific storage method.
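A minimal sketch of what such a parallel index could look like: a sidecar file of byte offsets that is thrown away and rebuilt whenever it no longer matches the log. The `.idx` suffix, the JSON layout and the function names are all made up for this example.

```python
import json
import os


def build_index(log_path: str, every: int = 1000) -> dict:
    """Record the byte offset of every `every`-th line; the log is untouched."""
    offsets = []
    with open(log_path, "rb") as f:
        lineno = 0
        while True:
            pos = f.tell()
            if not f.readline():
                break
            if lineno % every == 0:
                offsets.append(pos)
            lineno += 1
    return {"size": os.path.getsize(log_path), "every": every, "offsets": offsets}


def load_index(log_path: str, every: int = 1000) -> dict:
    """Use the sidecar index if it matches the log, otherwise rebuild it."""
    idx_path = log_path + ".idx"  # hypothetical sidecar name
    try:
        with open(idx_path) as f:
            index = json.load(f)
        if index["size"] == os.path.getsize(log_path) and index["every"] == every:
            return index
    except (OSError, ValueError, KeyError, TypeError):
        pass  # missing or damaged index is fine: it is only a cache
    index = build_index(log_path, every)
    with open(idx_path, "w") as f:
        json.dump(index, f)
    return index


def read_line(log_path: str, index: dict, lineno: int) -> str:
    """Seek to the nearest indexed line, then scan forward to `lineno`."""
    block, skip = divmod(lineno, index["every"])
    with open(log_path, "rb") as f:
        f.seek(index["offsets"][block])
        for _ in range(skip):
            f.readline()
        return f.readline().decode("utf-8", errors="replace").rstrip("\n")
```

The same trick works with timestamps instead of line numbers; grep, tail and less keep working on the log because nothing about it changed.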

Really, this recent fad of trying to remove old formats in the belief that the old format was somehow preventing any new format from working in parallel reminds me of JWZ's recommendations[1] on mbox "summary files" over the complexity of an actual database. Sometimes you can get the features you want without sacrificing performance or compatibility.

[1] http://www.jwz.org/doc/mailsum.html

This is all well and good if you want to, and can, spend time up front figuring out how to parse each and every log line format which might appear in syslog so you can drop it in your structured store.

The alternative is to leave everything unstructured, and understand the formats minimally and lazily. Laziness is a virtue, right?

Why would I need to be able to parse everything up front? Taking the syslog example, that has a commonly understood format. As a default case, I can just split the parts and have structured data (esp. with RFC5424, where structured data is part of the protocol to begin with).

Then, I can add further parsers for the MESSAGE part whenever I feel like it, or whenever there is need. I don't need that up front.
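As a rough illustration of "split first, parse MESSAGE later", here is a sketch in Python. The regex is a simplified reading of the RFC 5424 header, and `MESSAGE_PARSERS`, `parse_sshd` and `enrich` are hypothetical names for this example, not part of any real tool.

```python
import re

# Simplified RFC 5424 header: <PRI>VERSION TIMESTAMP HOST APP PROCID MSGID SD [MSG]
RFC5424 = re.compile(
    r"<(?P<pri>\d{1,3})>(?P<version>\d{1,2}) "
    r"(?P<timestamp>\S+) (?P<hostname>\S+) (?P<app>\S+) "
    r"(?P<procid>\S+) (?P<msgid>\S+) "
    r"(?P<sd>-|(?:\[.*?(?<!\\)\])+)"
    r"(?: (?P<message>.*))?$",
    re.DOTALL,
)


def parse_syslog(line: str) -> dict:
    """Split the common header; fall back to the raw line if it does not match."""
    m = RFC5424.match(line)
    if m is None:
        return {"message": line}
    rec = m.groupdict()
    pri = int(rec.pop("pri"))
    rec["facility"], rec["severity"] = pri >> 3, pri & 7
    return rec


def parse_sshd(message: str) -> dict:
    """Example MESSAGE parser, written only once somebody needs it."""
    m = re.match(r"Failed password for (?P<user>\S+)", message)
    return m.groupdict() if m else {}


# Parsers are registered per application, whenever the need comes up.
MESSAGE_PARSERS = {"sshd": parse_sshd}


def enrich(rec: dict) -> dict:
    parser = MESSAGE_PARSERS.get(rec.get("app"))
    if parser and rec.get("message"):
        rec.update(parser(rec["message"]))
    return rec
```

Records from applications without a registered parser still carry timestamp, host, app and severity as fields; only their MESSAGE stays an opaque string.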

Because in my experience, the interesting stuff isn't in the syslog metadata. It's in the message part. Until you add that further parser, you're grepping.

What binary logging solution is the author using if he's not using the systemd journal?

Look at a first year computer science student. He will already put prints in his programs and if he is smart and has a bigger assignment he might already start to write other programs to parse that output. You can't beat that, because it is nearly impossible for a newbie to even know that there might be a problem with text logging and that binary logging might be a solution. In fact he might not even know that what he does is called logging. But he is already doing it!

So even if binary logging is way better (I can't say, not enough experience) you simply can't beat text logging, because text logging is natural. It just happens.

print("Hello World!")

If you need to grep logs on regular basis, you're doing it wrong.

Store important data in the database so that you can query it efficiently.

Keep logs for random searches when something unexpected happens. I log gigabytes per day, but only grep maybe once-twice a year.

Agreed. But if I keep my logs in a database, I may as well use the database to query my logs, instead of grepping in them.

(And voila, you have binary log storage.)

Separate the data you are sure you need often and only store that in the database. Store everything else in the textual logs.

On a slightly unrelated note, as a largely amateur Linux user: have people made systems that, instead of grepping for info, use machine learning to detect normal patterns in a log file (like what types of events occur, how similar they are, at what intervals) and report the anomalous output via email or a report to an admin?

I was thinking this would be a cool area of research for me to try programming again, but it seems so daunting I am not sure where to start.

I don't know of any systems that do this.

As a software developer, I generally use log levels to indicate severity in my logs. So grepping for ERROR should catch anything I had the foresight to log at the ERROR level.

Simple heuristics like the number of WARN level logs a minute may be useful.
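A quick sketch of that heuristic in Python. It assumes each line starts with `YYYY-MM-DD HH:MM:SS LEVEL`, which is a made-up layout for the example; adjust the regex to whatever your logger actually writes.

```python
import re
from collections import Counter

# Assumed line layout: "2015-05-06 12:00:01 WARN message..."
LINE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2} (\w+) ")


def count_per_minute(lines, level="WARN"):
    """Count entries of one level, bucketed by minute."""
    counts = Counter()
    for line in lines:
        m = LINE.match(line)
        if m and m.group(2) == level:
            counts[m.group(1)] += 1
    return counts


def noisy_minutes(counts, threshold):
    """Minutes with more than `threshold` entries, oldest first."""
    return sorted(minute for minute, n in counts.items() if n > threshold)
```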

Beyond that it sounds interesting. It may be hard to do in a general way, so focusing on Apache logs or something common may be a simpler task.

In addition to logging, you can send out a statsd[0] message, graph it, and use something like Skyline[1] for alerting based on trend issues. You can also use logstash to generate metrics on logs when sending them up to Elasticsearch.

[0] https://github.com/etsy/statsd [1] https://github.com/etsy/skyline

Excellent sample projects, especially Skyline. This seems the closest thing to what I had envisioned so far.

Very cool stuff. Do you use it?

Skyline was a bit too much overhead, but we took the concept and adapted it to our needs.

When I say too much overhead, I'm referring to the carbon proxy and redis requirements. We found that just using the json output from graphite was sufficient to feed a trend monitoring system.

The output is pretty sensitive, moreso than Icinga2 (Nagios) expects, so we had to turn down a few of the "is this really down" re-checks, since it would silence legitimate trend alerts.

I use the logwatch program for that. There is no machine learning, it's entirely manual with a large list of things it filters out, but the defaults are quite good.

It emails me any log entries it doesn't know about. I did have to add a large number of ssh lines that it should not bother me about, but other than that it works very well and I find it very useful.

Cool tip. I remember hearing the name but not knowing it had these features. I will definitely check it out.

You can use fail2ban for this. It is used to automatically ban IPs that, for instance, try to bruteforce your SSH, but it really is an engine that matches log file lines against regexps and fires an action when a regexp matches.

So you can use it for other purposes (such as mailing an admin if your server suddenly sends 500 errors, or an unusual amount of 404 errors, for instance).

Of course. Not to be dismissive, but I am familiar with fail2ban. I was wondering if anyone had this idea in a form that did not require manual or pre-set rules: the program would go passive for a few days, reading log files and learning that certain log entries will be identical minus the timestamp, that some change by only a small amount of text, and that others have never been seen (or will not match in the next stage). The next stage turns active, and the machine filters down and sends you anything it has not seen over time and knows must be something anomalous.
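The passive/active idea can be approximated without any machine learning at all. The sketch below is a deliberately crude stand-in: it masks the variable parts of each line with hand-written placeholders, remembers the resulting templates during a learning phase, and afterwards reports lines whose template was never seen. The masking rules and the `NoveltyFilter` name are assumptions for illustration, not an existing tool.

```python
import re

# Crude masking rules (assumptions): replace the parts that vary between
# otherwise identical entries with placeholders.
MASKS = [
    (re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}\S*"), "<TS>"),
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"\b\d+\b"), "<N>"),
]


def template(line: str) -> str:
    """Reduce a log line to its shape by masking timestamps, IPs and numbers."""
    for rx, placeholder in MASKS:
        line = rx.sub(placeholder, line)
    return line.strip()


class NoveltyFilter:
    def __init__(self):
        self.known = set()

    def learn(self, lines):
        """Passive stage: remember every template that shows up."""
        for line in lines:
            self.known.add(template(line))

    def unseen(self, lines):
        """Active stage: return lines whose template was never learned."""
        return [line for line in lines if template(line) not in self.known]
```

A real system would have to learn the masks as well instead of hard-coding them, which is where the actual machine learning would come in.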

I like fail2ban, a lot, and alternatives in that field, but when I looked at the Arch Linux package last time there were dozens of commented-out, but heavily commented nonetheless regexp template files like you describe. I think this would be a neat machine learning thing.

What I am going for: use AI to train a passive entry-level sysadmin to warn you.

I experimented with that, and heard others toying with the idea too. There are even products out there that do something similar.

Of course grepping logs is terrible! Grep is a generic tool, why shouldn't it be defeated by specialised tools?

