

Fancy algorithms lose; dumb C code wins - begriffs
http://blog.begriffsschrift.com/2011/11/11#real-c

======
lukesandberg
This doesn't strike me as particularly interesting. CL is just a programming
language, a functional assembly, so it seems intuitive that it would be
infeasible to cache all computations. Even just caching all programs up to a
given size is a big issue because their number grows exponentially with
program size, so even if you could cache all programs of a given size it
wouldn't necessarily help to solve (reduce) larger programs very much.

It would be like trying to write a compiler with a peephole optimization for
every form of the language; you just end up with a large, slow compiler.

Still, it makes me want to write an SKI interpreter.
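
If anyone else is tempted: a minimal sketch of leftmost-outermost SKI
reduction in Python (term representation and step limit are my own choices):

    # Terms are 'S', 'K', 'I', a variable string, or an application
    # tuple (f, x). Reduction order is leftmost-outermost.

    def spine(term):
        """Unwind the application spine into (head, [args])."""
        args = []
        while isinstance(term, tuple):
            term, arg = term
            args.append(arg)
        return term, args[::-1]

    def apply_all(f, args):
        for a in args:
            f = (f, a)
        return f

    def step(term):
        """One leftmost-outermost step, or None if in normal form."""
        head, args = spine(term)
        if head == 'I' and len(args) >= 1:
            return apply_all(args[0], args[1:])
        if head == 'K' and len(args) >= 2:
            return apply_all(args[0], args[2:])
        if head == 'S' and len(args) >= 3:
            x, y, z = args[:3]
            return apply_all(((x, z), (y, z)), args[3:])
        for i, a in enumerate(args):   # head is stuck: recurse into args
            r = step(a)
            if r is not None:
                args[i] = r
                return apply_all(head, args)
        return None

    def normalize(term, limit=100000):
        for _ in range(limit):
            nxt = step(term)
            if nxt is None:
                return term
            term = nxt
        raise RuntimeError("no normal form within step limit")

    assert normalize(((('S', 'K'), 'K'), 'x')) == 'x'   # S K K x -> x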

~~~
p4bl0
> Still it makes me want to write an SKI interpreter.

See unlambda at <http://en.wikipedia.org/wiki/Unlambda>.

~~~
Sniffnoy
Bah, Unlambda is clearly artificial. You want something that is ridiculous in
a natural way, you use Iota: <http://en.wikipedia.org/wiki/Iota_and_Jot>

------
stuhood
The first solution that didn't involve disk or network IO was the fast one?

~~~
Mesmoria
My thoughts exactly. It has nothing to do with language at all.

~~~
archangel_one
Potentially it still might have done; since C is more efficient with memory,
it may not have been practical to take that approach with PHP.

------
StavrosK
Ah, nothing like a false dichotomy to start the weekend.

------
secure
So _not_ putting stuff into a transactional database and using a compiled
language instead of an interpreted one is better for computationally intensive
tasks? Big surprise.

------
mjbellantoni
Anyone wanting to experiment with pre-computing logical expressions should
take a look at Binary Decision Diagrams.

<http://en.wikipedia.org/wiki/Binary_decision_diagram>
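
A toy sketch of the idea in Python (real BDD packages build these with
apply/ite rather than the exponential enumeration below, but it shows the
node sharing):

    # Nodes are (var, low, high) tuples; terminals are True/False. The
    # unique table hash-conses nodes, so equal subfunctions are shared
    # and redundant tests vanish -- that sharing is what makes a BDD a
    # compact pre-computation of a logical expression.

    unique = {}

    def mk(var, low, high):
        if low is high:              # both branches equal: drop the test
            return low
        key = (var, low, high)
        if key not in unique:
            unique[key] = key
        return unique[key]

    def build(f, names, env=None, i=0):
        """Shannon-expand f over names[i:] into a shared BDD."""
        env = env or {}
        if i == len(names):
            return f(env)
        lo = build(f, names, {**env, names[i]: False}, i + 1)
        hi = build(f, names, {**env, names[i]: True}, i + 1)
        return mk(names[i], lo, hi)

    def evaluate(node, env):
        while isinstance(node, tuple):
            var, low, high = node
            node = high if env[var] else low
        return node

    bdd = build(lambda e: (e['a'] and e['b']) or e['c'], ['a', 'b', 'c'])
    assert evaluate(bdd, {'a': True, 'b': False, 'c': True}) is True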

------
codex
This story reminds me of the (apocryphal?) claim that C++ running on a single
machine can often beat a cluster of Hadoop machines. Scalability is another
matter.

~~~
bpodgursky
That's a totally problem/domain-specific assertion. A single machine with 8
or so cores simply isn't going to make it through a hundred terabytes of
data, and many (most?) hadoop jobs are IO-bound anyway.

If you find that micro-optimization greatly increases your performance, you
probably shouldn't be using hadoop anyway...

~~~
beagle3
> If you find that micro-optimization greatly increases your performance, you
> probably shouldn't be using hadoop anyway...

It's all a matter of what you optimize for. In a recent project, a C++
version using memory-mapped files with fixed-length records was about 20
times faster than the Hadoop version, and that's just the CPU time.

The C++ version needed no deserializing, gave random access to everything,
and multiple runs on the same data had _NO_ I/O requirements (the mmap was
cached between runs). It was about as hard to write.
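
The same pattern as a toy Python sketch (the real thing was C++, and the
record layout here is invented for illustration):

    import mmap
    import struct

    REC = struct.Struct('<qd')  # say, int64 key + float64 value = 16 bytes

    def sum_values(path):
        """One linear pass; re-runs hit the OS page cache, so no I/O."""
        with open(path, 'rb') as f:
            mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
            total = 0.0
            for off in range(0, len(mm), REC.size):
                _key, value = REC.unpack_from(mm, off)
                total += value
            return total

    def record(mm, n):
        """Fixed-length records make record n just an offset: no parsing
        pass, no deserialization, random access for free."""
        return REC.unpack_from(mm, n * REC.size)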

So, instead of running a 100-core-strong Hadoop install (which is small, but
far from trivial), I was able to make do with one hefty 8-core machine.

Scaling up is "easier" with Hadoop, in the sense that you can just throw
money at it and get more EC2/Rackspace nodes when needed. But it is a _lot_
of money.

Hadoop makes sense if you've got lots of money to waste, _and_ actually need
_thousands_ of cores (because you'll only need a few tens of cores if you use
your hardware effectively).

~~~
jeffffff
If your working set fits entirely in RAM on a single machine, hadoop is the
wrong tool for the job. Hadoop is more about getting obscene IO bandwidth
from hundreds of hard disk spindles going in parallel than about the raw
number of cores or parallel computation.

~~~
beagle3
While that is what Hadoop is about, and it is successful at it, it is very
far from being efficient; in my experience, you pay 10x compared to good use
of the same hardware. "Obscene IO bandwidth" is right, but at 10% of the
practical maximum.

This is a tradeoff that people in computing have been happily taking for
years -- e.g. using a higher-level language is a similar kind of tradeoff
(use Python instead of C: pay x10 in performance, get x10 shorter development
time).

But this tradeoff only makes sense if the x10 thing you get is cheaper than
the x10 thing you pay for. With Hadoop, you pay x10 for hardware/computing
time and more for administration (unless you use something like EC2 Elastic
MapReduce, at which point you pay x20 for hardware/computing but nothing for
administration), and you save some development time.

Two Hadoop projects I was consulting on (or rather, consulting _off_ Hadoop)
were paying ~$10K/month each to Amazon for a while, until <$20K in programmer
time for each reduced them to running on one beefy machine that was already
sitting in the office, at ~$100/month in costs (power, cooling).

~~~
jeffffff
I've found the slowdown from hadoop to be very problem-dependent. I've seen
anywhere from no slowdown to 1000x slower. It sounds to me like those
projects should never have been using hadoop in the first place. If your
problem is not naturally IO-constrained, moving it to hadoop will make it
IO-constrained. In my experience, doing that will get you a 10x-100x slowdown
versus a more appropriate solution on the same hardware.

If your problem is IO-constrained, for example processing 10 TB of log data,
hadoop will get all your disks to 100% utilization just as well as anything
else will. It will leave your processors horribly underutilized, but that's
just the nature of IO-constrained problems. If you could get 10 TB of RAM
dedicated to your dataset then you could easily get a massive speedup over
the hadoop solution, but that's just not realistic.

------
cageface
Must be a slow news day when this hits the front page.

------
hmottestad
From the post: "This post is about sucking it up and just writing low-level
code"

So, I'm quite sure that C is a high-level language.

Assembly is low-level. It allows you to control the CPU directly by telling
it what instructions to run and which registers should be used for what. In
C, the closest thing you have is specifying whether a register should be used
at all. Or inline assembly, which again... is just assembly.

And yes... PHP is even higher level.

Caching to disk is only useful if your operation takes longer than reading
from disk. In this case, reading from disk was slower than doing the
calculation. What he should have considered is an in-memory cache.
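
In Python terms (with slow_op standing in for whatever expensive pure
computation is being cached):

    # A cache hit is a dict lookup in RAM. A disk-backed cache only wins
    # when the computation is slower than the disk read -- which, per
    # the post, it wasn't.
    from functools import lru_cache
    import time

    @lru_cache(maxsize=1000000)
    def slow_op(n):
        time.sleep(0.01)        # stand-in for the expensive computation
        return n * n

    t = time.perf_counter()
    for _ in range(1000):
        slow_op(42)             # 1 miss, 999 in-memory hits
    print(time.perf_counter() - t)   # ~0.01s instead of ~10s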

~~~
slowpoke
_> So I'm quite sure that C is a high level language._

Except it isn't. At least not anymore. This statement was true a few decades
back, when C _was_ the highest level available. Today, C is barely a level
above Assembly on the abstraction scale of programming languages.

Mind you, I'm not saying that's bad (and most other people don't mean it as an
insult, either). C is a powerful language that will probably be needed by
generations to come. But calling it "high level" today is being stuck in the
1970s.

------
viraptor
I think this might be much faster with caching, but the mistake was to put
the cache in some external database rather than in local process memory.

There are only 4 characters possible: `,i,k,s - that means on a 64-bit
machine you can fit 29 elements into a native integer (5 bits for length + 29
2-bit elements). Use that as the key into a hash cache and compute the longer
terms in the standard way. To avoid many small mallocs, preallocate large
chunks of memory which can be filled in a linear way... I'd hope for a
1.5-2x speedup over the original "dumb C".
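
A sketch of that packing in Python (in C you'd do the same shifts on a
uint64_t):

    # 5 low bits hold the length; each symbol takes 2 bits, so terms of
    # up to 29 characters fit in 5 + 58 = 63 bits of one machine word.
    CODE = {'`': 0, 'i': 1, 'k': 2, 's': 3}

    def pack(term):
        assert len(term) <= 29
        key = len(term)
        for i, ch in enumerate(term):
            key |= CODE[ch] << (5 + 2 * i)
        return key

    cache = {}
    cache[pack('```skkk')] = pack('k')   # ```skkk reduces to k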

Sigh... now I'm tempted to actually try that. Does anyone know of a good
library of SKI fragments for testing?

~~~
begriffs
I'd love to see what you come up with! Here are lots of CL terms you can use:
<http://share.begriffsschrift.com/joe/bank-17.txt.bz2>

It is a tab-delimited file. The first column is all CL terms of length 17
which reduce to a normal form. The second column is what they reduce to, and
the last column is the number of reduction steps it took under
leftmost-outermost evaluation order.
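
For anyone playing along, reading it in Python is one loop (assuming plain
bzip2, as the suffix suggests):

    import bz2

    with bz2.open('bank-17.txt.bz2', 'rt') as f:
        for line in f:
            term, normal_form, steps = line.rstrip('\n').split('\t')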

~~~
viraptor
I'm not sure how comparable it is to your program since I cannot run it in
batch mode (memory leaking...), but I'm able to process the first 340k rows
of that file in under 7 seconds with the cache limited to 15 characters. That
makes 20 microseconds per row (including checking the result).

Or another way - 7 microseconds per reduction step on average.

Without the cache it takes minutes, so I didn't want to wait for the actual
time. See <http://imgur.com/FL2Eg> for seconds per row vs. cache limit. Ah...
written in Python without using the bit-packing tricks. I guess the 100+
times speedup is enough anyway... <https://bitbucket.org/viraptor/ski/src>

~~~
begriffs
Nicely done, and good analysis. It's a fascinating problem, isn't it? Simple
rules, but an opportunity to approach it several ways and refine the solution.

------
3pt14159
Once you get good enough at Python or Ruby, knowing C is so very, very
important. It opens up whole swaths of problems to you, and binding it to
Ruby or Python is _easy_.
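
For example, with ctypes (file, library, and function names here are
hypothetical):

    # Given a C file fast.c containing, say,
    #     long count_reductions(const char *term) { ... }
    # built with "cc -shared -fPIC fast.c -o libfast.so", the Python
    # side of the binding is just:
    import ctypes

    lib = ctypes.CDLL('./libfast.so')
    lib.count_reductions.argtypes = [ctypes.c_char_p]
    lib.count_reductions.restype = ctypes.c_long

    steps = lib.count_reductions(b'```skkk')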

------
listrophy
Hey begriffs, next time we're hanging at CE or a tech social, let's talk about
this kinda stuff. I may be able to shed some light on making "fancy" stuff
work faster/better... though a straight-up C implementation without I/O is
gonna win (though it perhaps won't be as persistent as, say, anything with
actual persistence)

------
chuinard
You complain about App Engine expiring requests, but I think the new 'backend
instances' concept allows you to have really long-running requests.

