

High-Performance Log Parsing in Haskell: Part One - qubitcoder
https://blog.safaribooksonline.com/2015/03/30/high-performance-log-parsing-in-haskell-part-one/

======
fnord123
""" We write a lot of Python here at Safari, in fact it’s our most widely-used
language. Because it’s a very high-level language, it was likely it would
satisfy my second requirement. But my experience with writing log parsers in
Python before–even when using tricks like lazy evaluation–led me to suspect
that it would not do so well in satisfying the first requirement. """

How can one say one parser is high-performance and another isn't if no
comparison is made? E.g. the author could use pylogsparser and benchmark it:

[https://pypi.python.org/pypi/pylogsparser/0.8](https://pypi.python.org/pypi/pylogsparser/0.8)

Then we can get an idea of whether the project met its goal of
being quicker than Python.

In any event, I'd be interested to see how luajit-lpeg compares to their
Haskell implementation. It even has a nice online test tool:

[http://lpeg.trink.com/share/syslog](http://lpeg.trink.com/share/syslog)

~~~
mazelife
Author here. Thanks for pointing me to pylogsparser, I'll definitely take a
look at that. Your point is well taken: without building a parallel
implementation in Python, we don't have a way of knowing for sure if Haskell
is faster. The only data points I have are that we've built a couple of
logparsers for custom formats in Python before this, and the number of lines
parsed/second was far smaller than the attoparsec-based parser.[1] It's not
apples-to-apples, since the formats differ a bit, but I don't think that it
has no predictive value. So in the second part of this post, which I'm working
on now, I'm hoping to be able to provide a fully-functional NCSA combined log
format parser in Haskell alongside the blog post. I think that would be fairly
easy to benchmark since it's a common-enough log format.

[1] that's just measuring the time to parse log files into some sort of
structured data, not necessarily to do anything with it
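
For anyone wanting to reproduce a rough lines-per-second number, here is a minimal sketch of the measurement described in [1], using only `base`. The `Entry` type, `parseLine`, and `linesPerSecond` are hypothetical names made up for illustration; they are not the article's attoparsec parser, and a serious comparison would want a proper benchmarking library and real log files.

```haskell
import Control.Exception (evaluate)
import Data.List (foldl')
import System.CPUTime (getCPUTime)

-- Hypothetical structured result: a host field plus the rest of the line.
data Entry = Entry { entryHost :: String, entryRest :: String }
  deriving (Eq, Show)

-- Toy stand-in for a real log parser: split on the first space.
parseLine :: String -> Maybe Entry
parseLine line = case break (== ' ') line of
  (host, ' ' : rest) | not (null host) -> Just (Entry host rest)
  _ -> Nothing

-- Count successfully parsed lines, forcing both fields so the parsing
-- work really happens inside the timed region.
countParsed :: [String] -> Int
countParsed = foldl' step 0
  where
    step n l = case parseLine l of
      Just (Entry h r) -> length h `seq` length r `seq` n + 1
      Nothing -> n

-- Parsed lines per CPU second; getCPUTime reports picoseconds.
linesPerSecond :: [String] -> IO Double
linesPerSecond ls = do
  start <- getCPUTime
  n <- evaluate (countParsed ls)
  end <- getCPUTime
  let seconds = fromIntegral (end - start) / 1e12 :: Double
  pure (fromIntegral n / max seconds 1e-9)
```

As in the footnote, this only times parsing into structured data and does nothing further with the result.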

~~~
mijoharas
Thanks for the article, I think you're missing some code, it seems that every
time you have a `do` block in your code samples most of the code is cut off (I
assume).

~~~
mazelife
Thanks for pointing that out; I think the syntax highlighter may have
clobbered some code. It's now fixed.

------
slashnull
Hey, that's pretty cool!

After playing with Haskell for the last two years, I figured that it would be
pretty damn hard to write conventional LAMP/JS webpages in Haskell, but I
still wanted to try to solve some of my _real world_ problems with it.

So I'm parsing the log files that my LAMP stuff produces.

So far it's a very fun and frustrating waste of time, but given my recent
progress, it will probably become a slightly less fun and frustrating time
saver.

I just recently cleared up some confusion and type errors coming from
ByteString/Lazy ByteString/Text/Whatever conversion issues (which made up 95%
of the recent frustration, as every other language I'm using right now has
_one_ single sort of string (which more often than not serves to contain ints ;))

... and database CRUD and JSON typeclass instances have appeared as if by
magic.

Awesome!

And now I can celebrate by unleashing that stuff on tens of megs of logs, opening
top, and then seeing the executable instantly eat up all my RAM, before slowly
conceding space to the postgres server.

Hopefully your post will help me implement streaming.

Cheers!
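
One way the memory blow-up described above can be avoided is to fold over the lines strictly instead of building the whole parsed log in memory. The following is a minimal sketch using only `base`; the `Stats` type and the "line contains error" rule are hypothetical placeholders for whatever is actually being extracted.

```haskell
import Data.List (foldl', isInfixOf)

-- Hypothetical running summary; strict fields keep thunks from piling up.
data Stats = Stats { totalLines :: !Int, errorLines :: !Int }
  deriving (Eq, Show)

-- Update the summary with one log line.
step :: Stats -> String -> Stats
step (Stats t e) line
  | "error" `isInfixOf` line = Stats (t + 1) (e + 1)
  | otherwise = Stats (t + 1) e

-- Strict left fold: each line can be discarded once it has been counted.
summarize :: [String] -> Stats
summarize = foldl' step (Stats 0 0)

-- readFile is lazy, so the file is read in chunks as the fold demands them.
summarizeFile :: FilePath -> IO Stats
summarizeFile path = do
  contents <- readFile path
  return $! summarize (lines contents)
```

Because nothing else holds on to `contents`, this should run in roughly constant memory regardless of file size. A dedicated streaming library would give more control over resources, but the strict fold is the core idea.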

~~~
codygman
> After playing with Haskell for the last two years, I figured that it would
> be pretty damn hard to write conventional LAMP/JS webpages in Haskell, but I
> still wanted to try to solve some of my real world problems with it.

Really?

You should check out:

[http://adit.io/posts/2013-04-15-making-a-website-with-haskell.html](http://adit.io/posts/2013-04-15-making-a-website-with-haskell.html)

~~~
slashnull
I already got a few libraries and frameworks up and running (including scotty,
actually, which is really cute), but somehow I can never have everything I
want at the _same time_...

Some tutorials will have a few routes and a DB, some others will have an
intricate rendering and templating architecture and basically no data storage
solution, some are more or less a hello world with a session manager... And
there's Yesod, which seems to do everything out of the box, but which has such
a colossal amount of dependencies that I never got anything to build beyond
fresh yesod-init setups, which still manage to fail on my current setup,
despite using Stackage _and_ sandboxes.

Among what I've seen in Haskell, ORMs and web frameworks are the two types of
libraries that carry the largest and most cumbersome monad transformer stacks
around; making a website involves having to typecheck a huge monad transformer
into another huge monad transformer. This is not easy.

Not to mention that, in the world of open-source web development, nearly all
the documentation, expertise, copy-pastable code examples and standard
practices are in dynamic OOP languages.

Perhaps the ease of RoR, PHP and Node has let everybody forget how much stuff
is involved in making modern webpages : )

~~~
codygman
> including scotty, actually, which is really cute

You know people use Scotty in production right? It's more than cute, though I
agree that most Haskell library/framework documentation could use some work.

> And there's Yesod, which seems to do everything out of the box, but which
> has such a colossal amount of dependencies that I never got anything to
> build beyond fresh yesod-init setups, which still manage to fail on my
> current setup, despite using Stackage and sandboxes.

Hm, did you try only using Stackage or sandboxes? If you're up for another
solution, I know Halcyon[0] is supposed to be very frictionless.

> Among what I've seen in Haskell, ORMs and web frameworks are the two types
> of libraries that carry the largest and most cumbersome monad transformer
> stacks around; making a website involves having to typecheck a huge monad
> transformer into another huge monad transformer. This is not easy.

Matter of opinion? I find Scotty's and Snap's monad transformer stacks to be
pretty simple. This is something that becomes easy with experience I think,
just like getting used to using composition over inheritance.

> Not to mention that, in the world of open-source web development, nearly
> all the documentation, expertise, copy-pastable code examples and standard
> practices are in dynamic OOP languages.

Mostly a function of manpower available I'm guessing. I find when you ask for
examples or file bugs against libraries/frameworks they are responded to
quickly though.

> Perhaps the ease of RoR, PHP and Node has let everybody forget how much
> stuff is involved in making modern webpages : )

You just have to remember how much polish they have and how it was nowhere
near this easy in the beginning.

~~~
slashnull
Yeah, I meant "cute" more in the sense of "elegant". Unfortunately it felt too
barebones and there wasn't enough documentation to let me get everything
running. I like that lib, though.

...

Halcyon, I'll remember that. I still have to try Nix one day, too.

...

Matter of experience, I guess, yeah

...

Precisely, there's a virtuous cycle going around popular platforms due to the
larger amounts of features and documentation getting written, and in turn, the
larger amount of new devs getting into it. Shame those platforms are built on
such bad languages... It's cool to see the web evolving in a direction where
Haskell can solve very specific problems without disrupting people's stacks.

------
dllthomas
_" In type-system theory this is called a "Sum Type" because we can define all
possible representations"_

It is indeed a sum type, but I don't think that is why (or true of all sum
types, or less true of product types). It sounds like they are conflating it
with "enum"?

As an example of a product type for which we can define every possible
representation: ((),()) has precisely one representation.

As an example of a sum type for which there are infinitely many
representations: (Either String Integer)

My understanding has it that it is called a sum type because 1) for finite
types the number of representations is the sum of the number of
representations under each tag, and 2) it behaves more generally like a sum
(a^x * a^y = a^(x+y) is an equality in arithmetic, (x -> a, y -> a) is
isomorphic to ((Either x y) -> a) in programming).
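
Both points can be sketched in a few lines of Haskell. This is a minimal illustration using only the Prelude; the names `toSum`, `fromSum`, `sumValues`, and `productValues` are made up for the example.

```haskell
-- Point 2: a pair of handlers is the same information as one function
-- on the sum. The forward direction is just the Prelude's `either`.
toSum :: (x -> a, y -> a) -> (Either x y -> a)
toSum (f, g) = either f g

-- The inverse direction: restrict the function to each tag.
fromSum :: (Either x y -> a) -> (x -> a, y -> a)
fromSum h = (h . Left, h . Right)

-- Point 1: for finite types, the values of a sum are the values under
-- each tag, listed one after the other.
sumValues :: (Bounded a, Enum a, Bounded b, Enum b) => [Either a b]
sumValues =
  map Left [minBound .. maxBound] ++ map Right [minBound .. maxBound]

-- For comparison, the values of a product are all the pairings.
productValues :: (Bounded a, Enum a, Bounded b, Enum b) => [(a, b)]
productValues =
  [(a, b) | a <- [minBound .. maxBound], b <- [minBound .. maxBound]]
```

With `Bool` (two values) and `()` (one value), `sumValues :: [Either Bool ()]` has 2 + 1 = 3 elements and `productValues :: [(Bool, ())]` has 2 * 1 = 2.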

~~~
mazelife
That's a good point. A sum type/disjoint union (by my understanding) means the
type is in the union of subsets and that the subsets are pairwise disjoint,
but it doesn't say that the subsets have to be finite. Since this post is
really aimed at Haskell beginners, I was trying to avoid going down the rabbit
hole as regards type system or set theory and I might have oversimplified in
the process.

------
sriku
> Sequencing parsers applicatively allows the compiler to perform static
> analysis on a parser without running it. This knowledge can be used to avoid
> things like backtracking that may slow your parser down. This is not
> possible when sequencing parsers monadically because the grammar of each
> parser depends on the previous one. However, performance results in this
> case are probably negligible; don’t hesitate to choose do-notation if you
> find it easier to read.

Would be good to talk more about this and quantify too. Very interesting
topic.
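
To make the distinction in the quote concrete, here is a minimal sketch with a hand-rolled `Parser` type (an assumption for illustration, not attoparsec, so that it compiles with `base` alone). It does no static analysis itself; it only shows the two sequencing styles and a case where the monadic style is actually required.

```haskell
import Data.Char (isDigit)

-- A toy parser: consume a prefix of the input or fail.
newtype Parser a = Parser { runParser :: String -> Maybe (a, String) }

instance Functor Parser where
  fmap f (Parser p) = Parser $ \s -> case p s of
    Nothing -> Nothing
    Just (a, rest) -> Just (f a, rest)

instance Applicative Parser where
  pure a = Parser $ \s -> Just (a, s)
  Parser pf <*> Parser pa = Parser $ \s -> case pf s of
    Nothing -> Nothing
    Just (f, rest) -> case pa rest of
      Nothing -> Nothing
      Just (a, rest') -> Just (f a, rest')

instance Monad Parser where
  Parser p >>= f = Parser $ \s -> case p s of
    Nothing -> Nothing
    Just (a, rest) -> runParser (f a) rest

char :: Char -> Parser Char
char c = Parser $ \s -> case s of
  (x : xs) | x == c -> Just (c, xs)
  _ -> Nothing

number :: Parser Int
number = Parser $ \s -> case span isDigit s of
  ("", _) -> Nothing
  (ds, rest) -> Just (read ds, rest)

data Time = Time Int Int deriving (Eq, Show)

-- Applicative style: the shape of the parser is fixed before any input
-- is seen (always number, ':', number).
timeA :: Parser Time
timeA = Time <$> number <* char ':' <*> number

-- The same parser in do-notation.
timeM :: Parser Time
timeM = do
  h <- number
  _ <- char ':'
  m <- number
  return (Time h m)

-- Needs Monad: how much to read next depends on the value just parsed.
lengthPrefixed :: Parser String
lengthPrefixed = do
  n <- number
  _ <- char ':'
  Parser $ \s -> if length s >= n then Just (splitAt n s) else Nothing
```

`timeA` and `timeM` accept the same inputs. `lengthPrefixed` is the kind of grammar a library cannot inspect ahead of time, because the last step is computed from `n` at run time.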

~~~
codygman
Maybe this paper[0] on using Applicatives for performance and concurrency in
Haskell's Haxl will help.

Also note that there is a feature proposal[1] for Applicative Do notation.

0: [http://community.haskell.org/~simonmar/papers/haxl-icfp14.pdf](http://community.haskell.org/~simonmar/papers/haxl-icfp14.pdf)

1: [https://ghc.haskell.org/trac/ghc/wiki/ApplicativeDo](https://ghc.haskell.org/trac/ghc/wiki/ApplicativeDo)

