

Haskell improves log processing 4x over Python - jmintz
http://devblog.bu.mp/haskell-at-bump

======
andrewcooke
The work sounds very cool (and they are hiring), but (only) a factor of 4
speedup over Python is (to repeat a phrase from elsewhere today) like boasting
that you're the tallest midget ;o)

~~~
jamwt
Hi, article author here.

It's important to note that this particular job is largely bound by a.) I/O
and b.) format serialization tasks. Both Python's BSON and JSON libraries are
mature and have their critical sections written in C, so a speedup of 4x is
still noteworthy. The Haskell version, on the other hand, is pure Haskell.

~~~
andrewcooke
Neat - thanks.

------
Peaker
Sounds great. I'm a very big Haskell fan.

I'd love to point people to this when trying to convey some advantages of
Haskell. To make it more compelling, can you expand some on the downsides and
maybe obstacles you encountered?

The thing I'm unsure about is how difficult it would be for (very) talented
developers to just jump in. We have really talented developers, and everyone
is super time-constrained, so many are wary of diving into a language as
different as Haskell. Was it hard for your developers to figure Haskell out?
Did your previous use of Scala help? How long did it take them to dive into
Scala?

~~~
jamwt
I would say the two real barriers to writing effective Haskell projects are
a.) "getting" monads, and b.) understanding the implications of laziness,
especially with regard to space leaks and unconsumed thunks. Everything else
isn't that big of a deal.
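As a concrete illustration of the laziness pitfall (a standard textbook example, not code from the article): a lazy left fold silently accumulates unevaluated thunks, which is exactly the kind of space leak described above, while its strict cousin runs in constant space.

```haskell
import Data.List (foldl')

-- foldl builds a chain of thunks ((((0+1)+2)+3)+...) and only
-- evaluates it at the very end, using O(n) memory: a classic space leak.
leakySum :: [Int] -> Int
leakySum = foldl (+) 0

-- foldl' forces the accumulator at each step, so the same fold
-- runs in constant space.
strictSum :: [Int] -> Int
strictSum = foldl' (+) 0

main :: IO ()
main = print (strictSum [1 .. 1000000])  -- prints 500000500000
```

Both functions compute the same result; the difference only shows up in the heap profile, which is why these leaks are easy to write and hard to spot.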

It's all much easier to digest, though, even for "really talented developers",
if they have some experience with another functional language first. OCaml is
a nice stepping stone before digging into the abstractions involved in
understanding Haskell's powerful type system. Scala is good too, but having
the object stuff mixed in there can lead you to rely on some patterns that
aren't going to be available in a non-OOP language. I think the scheme/clojure
path isn't bad either, but it's probably ideal to spend some time in the
"statically typed" wing of the functional universe before going to Haskell.

~~~
samstokes
Could you say more about why "getting" monads was needed?

I came to Haskell with no understanding of monads, started writing code, and
eventually used my knowledge of Haskell to learn about monads. Not
understanding monads just meant I was lacking a useful design pattern, and
found certain API docs confusing, but it didn't stop me from writing
reasonable code in most circumstances.

On the other hand what you describe in your (awesome) blog post is a more
significant Haskell project than any I've worked on, so I'd be interested to
hear your experience.

I've not really written my _own_ monad, or properly looked into monad
transformer stacks, and I'm aware that I could probably clean up a lot of code
using them - is that the sort of thing you mean?

~~~
jamwt
Sure, I can say more--put bluntly, and despite anything you might hear to the
contrary, you basically _do_ need to grok monads and monad transformers to
tackle any nontrivial project in Haskell. Even if only to understand the code
and APIs of libraries your application will interact with.

Basically, you can't swing a dead cat without hitting monads in the Haskell
library ecosystem; therefore, you'll need to know what they are.
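To make that concrete, here is a minimal sketch (the names and the `Config` type are invented for illustration, not taken from the article) of the kind of monad transformer stack that nearly every nontrivial Haskell codebase ends up touching: a `ReaderT` environment layered over `IO`.

```haskell
import Control.Monad.IO.Class (liftIO)
import Control.Monad.Trans.Reader (ReaderT, asks, runReaderT)

-- A hypothetical application environment; many real libraries hand
-- you a stack shaped much like this one.
data Config = Config { logPrefix :: String }

type App a = ReaderT Config IO a

-- Reading the environment (asks) and performing side effects (liftIO)
-- both go through the monad, which is why "getting" monads and
-- transformers is unavoidable in practice.
logMsg :: String -> App ()
logMsg msg = do
  prefix <- asks logPrefix
  liftIO (putStrLn (prefix ++ msg))

main :: IO ()
main = runReaderT (logMsg "flushing events to disk") (Config "worker: ")
```

Even a reader who never writes their own monad has to recognize this pattern just to call into libraries that expose one.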

~~~
mrshoe
Catch phrase thief!

~~~
jamwt
You're just jealous I worked it in so naturally.

------
Locke1689
The author is mostly right about the use cases of Haskell, but simply saying
"systems" is a bit misleading, because there are certain performance
characteristics of lazy programs which make them bad choices for some systems
programs. Any type of real-time system, for example, can suffer unpredictable
performance in critical sections, which is pretty undesirable.

~~~
awj
Not to argue the example, but Python's garbage collection disqualifies it for
real-time systems as well. In fact, I'm having a hard time finding a "system"
task for which Python (as a language) is qualified but Haskell is not.

~~~
Locke1689
Python is not a systems programming language.

~~~
danieldk
Maybe not, but it is certainly used as one. Everything from package managers
(yum) to operating system installers (Anaconda) has been written in Python.

Besides that, the grandparent is right: possibly every situation where Python
was used as a systems programming language, Haskell could fit in (and more).

~~~
Locke1689
I think you're confused about what "systems" programming entails. User-space
packaging with a bunch of scripts is not systems programming. Neither is
writing OS installers.

To give you an example of what systems programming entails: I have helped
develop operating system kernels, virtual machine monitors, and distributed
networked systems. All of these would be considered systems programming.

See <http://en.wikipedia.org/wiki/System_programming> for more information.

~~~
danieldk
That's just one definition. Read up on Ousterhout's dichotomy:

<http://en.wikipedia.org/wiki/Ousterhouts_dichotomy>

~~~
Locke1689
You should just read <http://home.pacbell.net/ouster/scripting.html> (his
original paper), because all his statements about systems programming languages
reinforce my claim that Python is not a systems programming language.

------
ynniv
Are the logs being read from disk? In my experience, python is highly
optimized for reading (possibly compressed) files from disk. If your
infrastructure keeps logs in memory, python will lose this advantage and
compete on computational performance where Haskell has the advantage. This is
important for those of us who grind logs on disk and might be considering a
language switch.

~~~
enneff
What do you mean by optimized? Python makes the same read and write syscalls
everyone else does.

What you're probably observing is Python's slow code generation being masked
by the inherent slowness of I/O.

~~~
ynniv
_Python makes the same read and write syscalls everyone else does_

Except, when python's pants are on, it makes gold records.

I haven't looked to see if there are any explicit optimizations, but your
statement is ridiculous; an effective IO strategy can have an enormous effect
on performance.

~~~
enneff
I'm sorry, what? Just being disagreeable is not an answer. What does "IO
strategy" mean? You are being incredibly vague and unhelpful.

Reading data from a file handle into a buffer is a trivial operation. It's
what you do with that data afterwards that is important. In C (or Go) you have
complete control over what happens next. As for Python, I don't know what
happens, but I don't see how it could possibly be more efficient than any
other sane language.

~~~
ynniv
_Reading data from a file handle into a buffer is a trivial operation._

If that is all you're doing, then yes; there isn't a much more efficient way
of doing that.

 _In C (or Go) you have complete control over what happens next._

It is up to the programmer to know what to do next. Does Haskell strike you as
a language of micromanagement? Python can sometimes be multiples faster than
command line grep. I haven't looked into why, but I have some ideas.

 _You are being incredibly vague and unhelpful._

If you don't understand what I'm talking about, it doesn't make /me/ wrong.
And just saying so is also not an appropriate response. I had a specific
question that was answered by the OP, and was useful to me.

Hacker News comment threads are rarely a place of education, but I will
reinterpret what you said as a question.

An IO strategy is how and when you make those system calls. Reading from disk
takes a vast amount of time, during which you can be doing computation. To be
fast, you should be asking for the appropriate amount of lookahead, at the
correct offset. Is that 4k? 1Meg? 100Megs? 1GB? Do you use threads for this?
Can you skip any of the input stream? Do you let the operating system,
programming language, library, or program code decide how big the read is?
Where that data is stored after being read from disk is also important.
Especially fast strategies use mmap to avoid copying from kernel space into
user space. And of course everything is always chunked at specific intervals,
so knowing where those are can sometimes reduce the number of calls. The
ability to optimize for these is one thing that makes dedicated database
software so successful.
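One of those knobs, the read chunk size, can be sketched in a few lines of Haskell (the 64 KiB figure is an arbitrary assumption for illustration, not a recommendation; the right value depends on the workload):

```haskell
{-# LANGUAGE BangPatterns #-}

import qualified Data.ByteString as B
import System.IO (Handle, IOMode (ReadMode), withFile)

-- Count newline bytes by issuing reads of an explicitly chosen size,
-- rather than whatever the runtime's default buffering decides.
countLines :: Handle -> IO Int
countLines h = go 0
  where
    chunkSize = 64 * 1024  -- one "I/O strategy" knob; 64 KiB is a guess
    go !n = do
      chunk <- B.hGetSome h chunkSize
      if B.null chunk
        then return n
        else go (n + B.count 10 chunk)  -- 10 is the byte value of '\n'

main :: IO ()
main = do
  writeFile "sample.log" (unlines (replicate 5 "event"))
  n <- withFile "sample.log" ReadMode countLines
  print n  -- prints 5
```

Swapping `hGetSome` for an `mmap`-based reader, or overlapping reads with computation, are the heavier-weight versions of the same idea.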

It is a dark art, and it is not expected that the average person know these
things. If you happen to be the kind of person working on a programming
language, it could be useful to be aware of them. Here are some quick links,
but there is a vast amount written on the subject.

<http://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html>

<http://tangentsoft.net/wskfaq/articles/io-strategies.html>

------
kordless
I'd be interested in hearing more about how the author is using the resulting
data set. Doing extractions at event generation time can be very useful if you
know what you are after in advance, but not so good for adhoc analysis.

Any reason why you didn't use Hadoop for this, then run batch jobs to extract
summaries?

~~~
jamwt
Yeah, the whole pipeline is actually rather more multifaceted than can be deduced
from this summary. This stage actually just persists the events into a
consolidated transaction log. Then, there are secondary processes that scan
these transaction logs (in batch) and distribute data into various databases
for system, business, and user analytics. I can't go into too much detail
there, but the actual digesting and reporting side is more involved.

~~~
kordless
I'd like to hear more about the use case if you have time, and can talk about
it. I'm kordless at loggly dot com.

------
aristus
Awesome work. If you haven't heard of Tim Bray's WideFinder challenge, it's
worth a look; the results were really interesting.

<http://tartarus.org/james/diary/2008/06/17/widefinder-final-results>

