

The Best Tool for the Join: Scaling Node.js with Unix - jonahkagan
http://engineering.clever.com/2014/06/18/the-best-tool-for-the-join-scaling-node.js-with-unix/

======
valarauca1
To quote an older Hacker News comment:

"I love when people rediscover basic unix utilities and have this light bulb
moment and realize that people have been working on the same problems as them
for much longer and basically solved it 20 years ago."

------
rgarcia
If you think improving education with tools backed by years of Unix wisdom is
interesting, we're hiring! [https://clever.com/about/jobs#engineer-full-stack](https://clever.com/about/jobs#engineer-full-stack)

------
spasquali
It's great to remind everyone how Node follows various rules of the Unix
Philosophy, and how it is designed to make process spawning/streaming as
natural as on the OS.

I would prefer, though, that the implication weren't that a failure in Node's
design is responsible for the failure of this in-process, in-memory technique
for sorting massive data sets. From the article:

"However, as more and more districts began relying on Clever, it quickly
became apparent that in-memory joins were a huge bottleneck."

Indeed...

"Plus, Node.js processes tend to conk out when they reach their 1.7 GB memory
limit, a threshold we were starting to get uncomfortably close to."

Maybe simply "processes" rather than "Node processes"? -- I don't think this
is a Node-only problem.

"Once some of the country’s largest districts started using Clever, we
realized that loading all of a district’s data into memory at once simply
wouldn’t scale."

I think this was predictable. Earlier in the article I noticed this line:

"We implemented the join logic we needed using a simple in-memory hash join,
avoiding premature optimization."

The "premature optimization" line is becoming something of a trope. It is not
bad engineering to think at least as far as your business model. It sounds
like reaching 1/6 of your market led to a system failure. This could (should?)
have been anticipated.
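For context, the "simple in-memory hash join" quoted above can be sketched in
a few lines (the record fields below are invented for illustration; this is
the general technique, not Clever's actual code):

```javascript
// Minimal in-memory hash join: build a Map on one side, probe with the other.
function hashJoin(left, right, key) {
  const table = new Map();
  for (const row of left) {
    const k = row[key];
    if (!table.has(k)) table.set(k, []);
    table.get(k).push(row);
  }
  const out = [];
  for (const row of right) {
    for (const match of table.get(row[key]) || []) {
      out.push(Object.assign({}, match, row));
    }
  }
  return out;
}

// Hypothetical example records:
const students = [{ id: 1, name: 'Ada' }, { id: 2, name: 'Alan' }];
const grades = [{ id: 1, grade: 'A' }];
console.log(hashJoin(students, grades, 'id'));
// returns [{ id: 1, name: 'Ada', grade: 'A' }]
```

The scaling problem the post describes falls directly out of this shape: the
entire left-hand input lives in `table`, so memory grows with district size.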

~~~
rgarcia
To some extent we knew that in-memory joins would eventually cause problems,
but we were certainly surprised at how quickly Node memory usage became the
bottleneck. Here's a little gist I used to test it a while ago
[https://gist.github.com/rgarcia/6170213](https://gist.github.com/rgarcia/6170213).

As for your point about premature optimization, in my opinion a startup's
first priority is getting something in front of users in order to start
improving and iterating. The first version of the data pipeline discussed in
the blog post was built when Clever was in 0 schools, so designing it to scale
to some of the largest school districts in the country would have been fairly
presumptuous.

------
waylonflinn
Cool solution. Reminds me of the Flow-Based programming movement happening in
the node community. [http://noflojs.org](http://noflojs.org)

It's also interesting to consider why we don't actually build more tools this
way (generalizing the Unix philosophy). Some Real Talk about that here:
[http://memerocket.com/2006/12/01/the-unix-tools-philosophy-the-big-lie-or-the-big-missed-opportunity/](http://memerocket.com/2006/12/01/the-unix-tools-philosophy-the-big-lie-or-the-big-missed-opportunity/)

------
platz
Glad they found a solution, but wonder how the initial planning went when they
started up.

Was there a conscious decision at the beginning to avoid a database? If so,
why not? It kind of sounds like they'd like to go in that direction now, but
it'd require a rewrite.

------
erjiang
I thought the 1.7GB memory limit for v8 was lifted? This bug[0] seems to
indicate that was fixed.

[0]
[https://code.google.com/p/v8/issues/detail?id=847](https://code.google.com/p/v8/issues/detail?id=847)

~~~
jonahkagan
Good find! I can't seem to find a comment that indicates what the new limit is
though. Experimentally, we were seeing processes dying at around the 1.7GB to
2GB range.

Did you find anything indicating what the expected new limit is?

------
siculars
Hi Clever gang, this hack is rather clever ;) I do a similar thing dealing
with complicated relational structures, but I shell out to bash scripts
rather than using your stream solution.

Is your stack CoffeeScript? It doesn't look like plain JavaScript to me...

~~~
jonahkagan
Yep, we use CoffeeScript.

