
I've seen several articles like this, and there are a number of things to consider.

Logging to ASCII means that the standard unix tools work out of the box with your log files. If you use something like a tab delimiter, you typically don't even need to specify the delimiter.
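
For instance, with a hypothetical tab-delimited access log (the column layout here is invented), cut and awk need no configuration at all:

    # cut's default delimiter is already a tab: count requests per status code (column 2)
    cut -f2 access.log | sort | uniq -c | sort -rn
    # awk's default field separator also splits on tabs
    awk '$2 == 500 {print $1, $3}' access.log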

As an upside, you aren't storing the column definitions in every single line, which definitely matters if you are handling large-volume traffic. For instance, we store gigabytes of log files per hour; grossing up that space by a significant margin impacts storage, transit, and processing time during writes (marshallers and custom log formatting). Writes are the hardest thing to scale, so if I'm going to add scale or extra parsing time, I'd rather handle that in Hadoop, where I can throw massive parallel resources at it.
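
To make that concrete, here is the same made-up event in both shapes; the repeated JSON keys nearly double the line:

    printf '2012-06-01T00:00:00Z\t200\t/index.html\n' | wc -c   # 37 bytes
    printf '{"ts":"2012-06-01T00:00:00Z","status":200,"path":"/index.html"}\n' | wc -c   # 64 bytes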

Next, you can get many of the advantages of JSON or protocol buffers by having a defined format and a structured release process before anyone can change the format. Add fields to the end and don't remove defunct fields. This is the same process you have to use with protocol buffers, or conceptually with JSON, to make it work.
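
A sketch of that rule in practice (columns invented): because consumers address fields by position, appending a column leaves every existing reader working.

    # v1 lines: timestamp<TAB>status<TAB>path
    # v2 lines: timestamp<TAB>status<TAB>path<TAB>user_agent  (new field appended)
    # a consumer written against v1 keeps working on v2 lines unchanged:
    awk -F'\t' '{print $1, $2}' access.log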

Overall there are advantages to these other formats, but articles like this tend to gloss over the havoc they create with a standard linux tool chain. You can process a LOT of data with simple tools like gawk and bash pipelines. It turns out you can even scale those same processes all the way up to Hadoop streaming.
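
For example, the same pipeline that runs on one box can be handed to Hadoop streaming nearly verbatim (the jar path and HDFS paths below are illustrative):

    # local: request counts per path (column 3)
    gawk -F'\t' '{print $3}' access.log | sort | uniq -c
    # distributed: identical logic; streaming sorts between map and reduce,
    # so `uniq -c` works as the reducer
    hadoop jar "$HADOOP_HOME"/contrib/streaming/hadoop-streaming-*.jar \
        -input /logs/access -output /tmp/path_counts \
        -mapper "gawk -F'\t' '{print \$3}'" \
        -reducer "uniq -c"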




You make a very good point about unix tools. However, as a JavaScript/JSON guy myself, I really like the way JSON works. And for a small site like mine, JSON would work much better out of the box than some sort of tab structure.

I work with node.js, which means I can console.log and then pipe into a log file. Any object sent into console.log is automatically converted to JSON. I can also do stack traces with JSON if I ever end up with some sort of nasty bug.
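
Concretely, that workflow is just stdout redirection plus the usual tools (app.js and the field name are hypothetical):

    node app.js >> app.log          # one JSON object per line, per the setup above
    grep '"level":"error"' app.log  # pull out just the errors later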

When you log gigabytes a day, absolutely, tabs are the way to go. But when you have a tiny little thing like mine, saving the effort for something more value-adding is probably the better choice.


The use of JSON logging plus a tool like RecordStream [1] is very powerful and solves the tool chain issue. Recs is complementary to the standard unix tools as well.

1- http://search.cpan.org/~bernard/App-RecordStream-3.7.3/READM...
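
As a taste of what that looks like, a sketch using the recs-* commands from that distribution (the log fields here are invented):

    # keep only 5xx responses, count them per host, render as a table
    cat app.log \
      | recs-grep '{{status}} >= 500' \
      | recs-collate --key host -a count \
      | recs-totable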


We used to do just that, logging as CSV, but switched to JSON.

At a certain scale you don't use the Unix tool chain anymore, except for tail, and that's for pure debugging. We log >10TB per day, and it all goes to Hadoop for processing.

JSON is crazy verbose, but you pay for flexibility: we want to remove and add fields at will every day as new business requirements come in. It's a pain to maintain CSV log versions or live with the "never remove a column" rule.

Compression rocks on JSON: you easily get 90% compression with gzip.
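
Easy to verify against your own logs (file name hypothetical):

    wc -c < app.log              # raw bytes
    gzip -9 -c app.log | wc -c   # gzipped bytes; the repeated keys compress very well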

We would have considered logging to Thrift if we didn't have this huge flexibility requirement.


> If you use something like a tab delimiter, you typically don't even need to specify the delimiter.

What if the fields contain tabs? For every human-readable format that can contain arbitrary user input (nearly all of them) you need some form of escaping. (I guess you could do length prefixing in ASCII decimal, but that wouldn't be pretty either, and it would be incompatible with the basic tools.)
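
A two-line demonstration of the failure mode (log format invented): one stray tab inside a user-controlled field silently shifts every column after it.

    # the user-agent field contains an embedded tab, so "field 4" silently truncates
    printf 'GET\t/x\t200\tEvil\tAgent/1.0\n' > bad.log
    cut -f4 bad.log    # prints "Evil", not the full user-agent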

But by far the biggest problem is the logging of text messages aimed at humans, not the delimiting. Regular expressions can help in searching logs in quick-hack jobs, but if you need to parse logs for visualization or reporting, which is very common in organizations, using them is error-prone. After all, you are relying on English messages of a certain form. The complexity of that can quickly move from "easy with regexps" to "we need NLP in our log parser!" (never mind the security problems of one field leaking into another due to a slightly wrong regexp).

The application might change the message to make it more readable for humans, or even move fields around, and your automated parser breaks. Structured messages, on the other hand, won't change for those reasons, as the formatting for humans happens in the back-end.
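
In other words: store fields, render sentences at read time. A sketch with a tool like jq (not mentioned in the thread; field names invented):

    # the stored record stays stable...
    echo '{"event":"login","user":"alice","ip":"10.0.0.1"}' >> app.log
    # ...and the human-readable wording is produced by the reader
    jq -r '"\(.user) logged in from \(.ip)"' app.log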

I get a bit annoyed at the "WTF kid, learn UNIX tools" kind of responses here. UNIX tools are one way of doing things, not the one holy, perfect way. Tool support is important, but there are also tools available for processing JSON, XML, and streams. Shouldn't you use the best tool for the job, not the one you happen to know?

(I don't mean that JSON is necessarily the best format in every use case, but for automated processing, any structured format at all trumps arbitrarily delimited and escaped files. You can easily convert between structured formats should the need arise.)
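
For instance, going from JSON lines back to the tab-delimited form discussed above is a one-liner with jq (one such tool; field names invented):

    # JSON lines -> TSV, choosing the columns you want
    jq -r '[.ts, .status, .path] | @tsv' app.log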


> "Shouldn't you use the best tool for the job, not the one you happen to know?"

Well, 'best' in this case should be determined by combining a number of factors. Certainly technological superiority should be weighted quite heavily, but in certain circumstances the "Nobody ever got fired for buying IBM" effect is also very important.

Of course there is also the "The best tool for the job is the one you have/know." quip, which I don't generally agree with myself... so I guess what I'm saying is that your mileage may vary.

(disclaimer: I log with JSON)


But such thinking can block innovation. It's a form of path lock-in. Both UNIX and Windows "gurus" are guilty of this, of seeing their way as the "one true way", just because it's always been that way.

New ideas are not always better, but sometimes they might be. In the longer run, "I'm used to this" (on its own) is not a good argument, as there will always be new people who are not used to your specific blub, and if they can be more productive or build more reliable/secure systems, then eventually you'll be out of the market.

See also: https://news.ycombinator.com/item?id=3892410


Oh certainly, I fully agree. I'm just saying that logging is going to be one of those things, for better or worse, that most developers look at and do the mental calculus of "It's a good idea, but do I want to go out on a limb here, with this issue?" In at least some cases, all the factors added together just won't make it worth the risk/effort.

Basically just the IBM thing. Was IBM always the best choice? No. But even so, it was often the best choice for the individuals making that call. This is the sort of thing that you have to recognize and contend with if you want to introduce change.


Hosting sysadmin here -- logs don't change that often unless you're developing an app that logs, and for that, I like logging to MySQL ("just add a column in a few places, and voila").

You should use the best tool for the job. If your job is web-page stats for Apache, there are literally hundreds of pre-existing tools to parse Apache logs, and the name of the *nix game is plaintext.


Why would logging to MySQL be a better tool for the job than NoSQL?

Unless you're purging the MySQL tables fairly often (in which case it almost doesn't matter what you're using), my best guess is that you're going to end up with a slower database on reads than you would with almost any NoSQL alternative.

Of course, you could keep the MySQL tables flat, but if that's what you're doing, why use MySQL at all?


> Logging to ASCII means that the standard unix tools work out of the box with your log files.

Check out Jshon, which acts as a bridge between JSON and the standard unix tools. For small files it is faster than cat.

http://kmkeen.com/jshon/
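
A minimal example (values invented):

    # -e extracts a key, -u unquotes the resulting JSON string
    echo '{"name":"jshon","stars":7}' | jshon -e name -u   # prints: jshon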


I like jsonpipe better - then you don't have to remember all the command line flags, and you can compose pipes with your familiar unix tools.

https://github.com/dvxhouse/jsonpipe

  $ echo '{"a": 1, "b": 2}' | jsonpipe
  /   {}
  /a  1
  /b  2
  $ echo '["foo", "bar", "baz"]' | jsonpipe
  /   []
  /0  "foo"
  /1  "bar"
  /2  "baz"


I would not say better - different, certainly. Jsonpipe is a lot slower and heavier. (This matters when you are adding JSON-based web 2.0 integration to your router. For small cases jshon is 15x faster and uses 1/14th the RAM.) And I would really want to avoid using jsonpipe inside of a loop.

It also seems to handle typical use cases inelegantly. Probably the most common thing I use Jshon for is turning JSON into a tab-delimited text file. To compare the two, here is a query that returns JSON search results:

    curl -s "https://aur.archlinux.org/rpc.php?type=search&arg=python"
If I want to get the name, version, votes and description into a single tab-delimited output with jshon:

    jshon -e results -a -e Name -u -p -e Version -u -p -e NumVotes -u -p -e Description -u | \
    sed 's/^$/-/' |  paste -s -d "\t\t\t\n"
With jsonpipe it looks like:

    jsonpipe | grep -E $'^results/[0-9]*/(Name|Version|NumVotes|Description)\t' > matches.tmp
    grep $'Name\t' matches.tmp | cut -f 2 > name.tmp
    grep $'Version\t' matches.tmp | cut -f 2 > version.tmp
    grep $'NumVotes\t' matches.tmp | cut -f 2 > vote.tmp
    grep $'Description\t' matches.tmp | cut -f 2 > desc.tmp
    sed -i 's/^$/-/' *.tmp
    paste -d '\t\t\t\n' name.tmp version.tmp vote.tmp desc.tmp
    rm {name,version,vote,desc}.tmp
Most of that awkwardness is from `paste` really wanting real files to operate on. But if you are going to use jsonpipe, you might as well just write the whole thing in Python.

The one thing that I do like about jsonpipe is that each line has the fully self-contained path, so you can shuffle (or otherwise destroy) the output and still have something with usable context. Except for the example above, where the order matters a lot. For really simple cases jsonpipe's method is nice. I might just port it to C so that there can be a fair comparison.


Thanks for the very enlightening reply. :)


> Logging to ASCII means that the standard unix tools work out of the box with your log files.

Oh Perl, you came, and you gave without taking...

https://github.com/rcaputo/app-pipefilter


If you gzip your logs (which you should be doing anyway), the column definitions (i.e., data that is widely repeated) will take only a few bytes.

Use BSON (MongoDB's binary JSON format) if you are really worried about this.




