
Performance Improvements Using Judy Arrays - drewolson
https://github.com/blog/1489-hey-judy-don-t-make-it-bad
======
jemfinch
The article's credibility is significantly reduced by referring to Judy Trie
lookups as O(log n) operations. They're actually O(log w) operations: the
number of operations is bounded by the array's _word size_, not the size of
the Judy array. A million elements in a Judy32 array will cause at most 7 node
hops, not the 20 you'd expect from a binary tree.
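
For intuition, here's a toy 256-way digital trie over 32-bit keys in C. This
is not Judy's adaptively compressed node design, just the underlying idea
(all names are made up for illustration): each level consumes one byte of the
key, so a lookup touches at most 4 nodes no matter how many keys are stored.

    #include <stdint.h>
    #include <stdlib.h>

    /* One trie level per byte of the key; the value lives at the leaf. */
    typedef struct Node {
        struct Node *child[256];
        void *value;
    } Node;

    /* At most 4 child hops for a 32-bit key, regardless of population. */
    void *trie_get(Node *root, uint32_t key) {
        Node *n = root;
        for (int shift = 24; shift >= 0 && n != NULL; shift -= 8)
            n = n->child[(key >> shift) & 0xFF];
        return n ? n->value : NULL;
    }

    void trie_put(Node *root, uint32_t key, void *value) {
        Node *n = root;
        for (int shift = 24; shift >= 0; shift -= 8) {
            uint8_t digit = (key >> shift) & 0xFF;
            if (n->child[digit] == NULL)
                n->child[digit] = calloc(1, sizeof(Node));
            n = n->child[digit];
        }
        n->value = value;
    }

The uncompressed 256-pointer nodes make this toy memory-hungry; Judy's
contribution is compressing those nodes adaptively so the depth bound
survives without the space blowup.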

~~~
mcintyre1994
This is the genius of tries. For our coursework we had to code up Twitter's
data structures, for some reason as if it was all in memory. I found that a
suffix tree (not as clever: it branches on character, so 36 branches after
normalisation, and it's designed for full-text search) was a really effective
way to store that data, because search time for a phrase doesn't increase as
your data set does. Unrealistic for something like Twitter, but a really
useful trick that this article definitely misses.

~~~
Peaker
Twitter's data _is_ all in memory. That memory may not be random-access
memory, though.

A pet peeve of mine is how people can do all the CS necessary to get nice data
structures for their in-RAM data, but seem to forget everything they know and
use very bad structures when they spill to flash or disk storage.

~~~
snprbob86
Can you recommend or link to introductory material on data structures
optimized for flash and disk?

~~~
dantheman
You could start by looking at databases.

One useful data structure is the B+ tree
<http://en.wikipedia.org/wiki/B%2B_tree>

~~~
nostrademons
Also log-structured merge trees:

<http://en.wikipedia.org/wiki/Log-structured_merge-tree>

There are a bunch of fascinating performance tradeoffs between B+ trees and
LSM trees, which come out when you're choosing between something like MySQL
InnoDB or Postgres (B+ trees) vs. LevelDB or Apache Cassandra (LSM trees).
LSM trees tend to be faster for inserts and for write-heavy applications, and
they offer very fast reads and writes for frequently-accessed data. They also
depend upon the _bandwidth_ of the disk rather than the seek time, and
bandwidth has of late been increasing significantly faster than seek times.
OTOH, they have highly variable latency, as an insert might trigger a major
compaction, while B+ trees have a bounded worst-case latency. A lot of the
distributed systems work at Google (a big user of LSM trees via BigTable)
comes from the need to work around the poor 99th-percentile latencies of LSM
trees.
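
For a feel of why the write path rides on bandwidth, here's a minimal,
single-threaded, LSM-flavoured sketch in C (my own toy: tiny fixed-size
memtable, made-up file names, no compaction, bloom filters, indexes, or crash
safety). Writes land in a sorted in-memory buffer that gets flushed as one
sequential run; reads check the memtable, then the runs newest-first.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define MEMTABLE_MAX 1024
    typedef struct { char key[16]; char val[16]; } Entry;

    static Entry memtable[MEMTABLE_MAX];
    static int mem_count = 0;
    static int segments = 0;             /* number of on-disk sorted runs */

    static int by_key(const void *a, const void *b) {
        return strcmp(((const Entry *)a)->key, ((const Entry *)b)->key);
    }

    /* Flush = sort in RAM, then ONE sequential write -- no random seeks,
     * which is why throughput tracks disk bandwidth, not seek time. */
    static void flush_memtable(void) {
        char name[32];
        qsort(memtable, mem_count, sizeof(Entry), by_key);
        snprintf(name, sizeof name, "seg-%06d.dat", segments++);
        FILE *f = fopen(name, "wb");
        fwrite(memtable, sizeof(Entry), mem_count, f);
        fclose(f);
        mem_count = 0;
    }

    void put(const char *k, const char *v) {
        snprintf(memtable[mem_count].key, 16, "%s", k);
        snprintf(memtable[mem_count].val, 16, "%s", v);
        /* The occasional flush (and, in real systems, compaction) is the
         * latency spike mentioned above. */
        if (++mem_count == MEMTABLE_MAX)
            flush_memtable();
    }

    const char *get(const char *k) {
        static Entry e;
        for (int i = mem_count - 1; i >= 0; i--)      /* memtable first */
            if (strcmp(memtable[i].key, k) == 0) return memtable[i].val;
        for (int s = segments - 1; s >= 0; s--) {     /* newest run first */
            char name[32];
            snprintf(name, sizeof name, "seg-%06d.dat", s);
            FILE *f = fopen(name, "rb");
            if (!f) continue;
            while (fread(&e, sizeof e, 1, f) == 1)
                if (strcmp(e.key, k) == 0) { fclose(f); return e.val; }
            fclose(f);
        }
        return NULL;
    }

Real implementations binary-search each sorted run (or keep an index and
bloom filter per run) instead of scanning, and merge runs in the background.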

------
lobster_johnson
While this is very interesting, the real problem here is not the Linguist
stuff, but the fact that it's crammed into the web app.

A web app behind a site as complex as Github's should really contain only the
parts required to serve web pages. A service like language classification is
clearly better designed as a standalone service.

Aside from working around GC overhead, composing a big app from many smaller
apps has many advantages. You force yourself to encapsulate complexity behind
strictly focused, network-friendly APIs (we use REST ourselves), which changes
the way you have to think about programs. Hiding the implementation behind an
API also allows you to swap out the implementation without the client ever
knowing. Since the client and server are separated, you have fine-grained
control over how many resources you are willing to spend on which components.
And so on.

Having done monolithic Rails apps for many years, it's a model I am not going
back to. Today, our web apps contain only UI logic. No database, no models, no
data logic, just UI all the way. Everything else is handled by specialized
components.

~~~
lobster_johnson
[Sorry about the horrendous spelling/grammar errors. I was in a hurry.]

------
jekub
There are some interesting things in this post, but the main problem is not
using Judy or something else, nor talking about memory or complexity. The
main problem I see here is using the wrong tool.

Replace the token matcher with a simple classifier (a maxent model will work
very well here) using character n-gram features through a hash kernel, and
you get a very accurate, fast, and low-memory system.

I built one two years ago that accurately classified a bit over 100 different
languages with only 64k features in the end (so requiring only 8*64 KB of
memory). And this was without using file extensions, as they weren't
available in our case.
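
A minimal sketch in C of the feature-hashing half, under my own assumptions
(character 3-grams, FNV-1a as the hash, one shared 64k-weight table indexed
by hashing the language id together with each n-gram, weights trained offline
by some maxent package; all names are hypothetical):

    #include <stddef.h>
    #include <stdint.h>

    #define N_FEATURES 65536   /* 64k buckets, as in the parent comment */
    #define N_LANGS 100

    /* 64k weights * 8 bytes = 512 KB, matching the memory figure above.
     * Learned offline; zeroed here only to keep the sketch compilable. */
    static double weights[N_FEATURES];

    /* FNV-1a -- any cheap hash works for a hash kernel. */
    static uint32_t fnv1a(const uint8_t *s, size_t len, uint32_t h) {
        while (len--) { h ^= *s++; h *= 16777619u; }
        return h;
    }

    /* Hash (language, 3-gram) pairs into the shared table and sum the
     * weights; collisions are just noise that training absorbs. */
    static double score_lang(const char *text, size_t len, uint32_t lang) {
        double s = 0.0;
        for (size_t i = 0; i + 3 <= len; i++) {
            uint32_t h = fnv1a((const uint8_t *)&lang, sizeof lang,
                               2166136261u);
            h = fnv1a((const uint8_t *)(text + i), 3, h);
            s += weights[h % N_FEATURES];
        }
        return s;
    }

    int classify(const char *text, size_t len) {
        int best = 0;
        double best_score = score_lang(text, len, 0);
        for (uint32_t l = 1; l < N_LANGS; l++) {
            double s = score_lang(text, len, l);
            if (s > best_score) { best_score = s; best = (int)l; }
        }
        return best;
    }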

Before any hard optimization, first check the method used, then the
algorithm; everything else should come after.

------
prezjordan
Some previous discussion on Judy Arrays [0]. Apparently they are patented!

[0]: <https://news.ycombinator.com/item?id=5043667>

------
habosa
Wow, this is an incredibly interesting and accessible article on performance
tuning. I have two questions:

1) Is there an easy tutorial somewhere on calling out to native code from
Ruby?

2) Could the team at Github possibly give a little more detail on what you
did to get all of those pretty benchmarking graphs? How did you get all of
the info on memory usage and CPU activity? (I assume it wasn't just time
{command} > file.txt.)

~~~
steveklabnik
> 1) Is there an easy tutorial somewhere on calling out to native code from
> Ruby?

The canonical documentation is here:
<https://github.com/ruby/ruby/blob/trunk/README.EXT>
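
For the flavor of it, here's about the smallest possible MRI extension
(hypothetical names; the rb_* calls are the real C API documented in
README.EXT):

    /* hello.c */
    #include "ruby.h"

    /* Hello.greet("world") -> "Hello, world!" */
    static VALUE hello_greet(VALUE self, VALUE name) {
        return rb_sprintf("Hello, %s!", StringValueCStr(name));
    }

    /* Called automatically on require 'hello'. */
    void Init_hello(void) {
        VALUE mod = rb_define_module("Hello");
        rb_define_module_function(mod, "greet", hello_greet, 1);
    }

The matching extconf.rb is just require 'mkmf' followed by
create_makefile('hello'); then ruby extconf.rb && make builds the shared
object you can require.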

------
exabrial
I would have liked to have seen the Judy array implementation in pure Ruby,
so we could compare apples to apples. I'm not trying to troll... but
basically they solved their problem by:

* Avoiding Ruby language features

* Rewriting it in a different language

This is why I lean towards statically typed languages like Go, Scala, and
Java.

~~~
twistedpair
So if you want to run a big site on Ruby, you slowly migrate all the Ruby
bits you use heavily to C/C++? That seems to be the storyline of their blog.

~~~
quandrum
Well, maybe to put it more generously:

1\. Build the site in Ruby (insert the language/framework the creators are
most comfortable/familiar with and can iterate quickly in).

2\. Build a community of a size that causes scaling problems.

3\. Identify the hotspots and re-implement them in a more performant way.

This seems entirely reasonable to me. In fact, I'm guessing this is the path
that all applications that get to Github scale take. Of course, it's easy to
be snarky and identify Ruby as the problem. (Building the community is the
hard part!)

~~~
r00fus
Back in my Smalltalk days (mid-90s), we used to call this "going below the C
level": take the sections of code that need to be optimized and rewrite them
in C as a DLL/shared library, to be called by the Smalltalk native layer.

The percentage of C code was less than 3%, from what I remember, and C is
significantly more verbose than Smalltalk, too.

I hear that happened a lot with Java as well, using JNI for
performance-critical sections (this was before the HotSpot compiler got
decent).

------
jrochkind1
This is great. Someone should change the title so it's descriptive of what's
going on, although I have no idea what that title should be, heh.

This post is about Github moving some code from Ruby to C for performance,
MRI's GC, and the Judy trie-like data structure and how it compares with hash
implementations with respect to both performance and memory consumption.

~~~
127001brewer
_"... I have no idea what that title should be, heh."_

It's a reference to the Beatles' song, "Hey Jude" (which opens with "Hey,
Jude, don't make it bad | Take a sad song and make it better ...").[1]

[1] <http://en.wikipedia.org/wiki/Hey_Jude>

_Edited for clarity._

~~~
mitchh
The issue is that the title, while clever, doesn't give even a remote idea of
what the post is about.

~~~
parennoob
I guess I still prefer a catchy title to preachy ones that extensively use RFC
2119 keywords. e.g. "Why you SHOULD profile Ruby code to catch expensive regex
bottlenecks" :)

Personal preference though.

------
pjmlp
I wonder if something like Crystal could have helped.

<https://github.com/manastech/crystal>

~~~
IanChiles
That looks amazing! Hadn't seen this before, but it's what I've been after for
a long time.

------
thezilch
For starters, the blog post was informative and should get some creative
juices flowing for applications in others' stacks -- it piqued my interest in
hacking on core utilities. However...

The graphs never show a 30K token mark, which is what the heap-allocation
counter showed for their [programming language] classifier.

It's not clear to me that much RSS was saved. Maybe 20 MB? So the next
question is how many classifiers are running, where such a "hit" actually
matters. Again, there is no 30K mark on the runtime graphs, but let's assume
the trend to the graphs' left is linear. It looks like we're saving half a
second and removing most of the jitter, but it's not made clear how removing
the GC churn affects processing outside of the classifier. I just can't
imagine that a couple of these classifiers can't keep up with the rate files
are pushed to Github -- the classification improvements are neither
low-hanging fruit nor part of the big O of Github's stack. The process seems
perfectly able to run asynchronously at push time, and un-deferred at page
view if it hasn't yet run. Are they really seeing more than 1 million
pageviews/sec? (Per classifier; just run more? One 20 MB service per httpd
server?) I can't really imagine hitting more than 10x their token count (300K
tokens), which is still only ~50 MB; the RSS graph, again, has a weirdly
large scale, as if they might grow their tokens by 67x.

~~~
driverdan
They said in the article that they want to expand the classification data, so
my guess is they will be going well beyond 30K tokens.

------
jerrysievert
Judy arrays are also faster in node.js:
<http://legitimatesounding.com/blog/Faster_sometimes_Associative_Arrays_with_Node_js.html>

The issue is with the patents -- has anyone been contacted yet about their
usage?

------
nthitz
Sometimes Mods change the title to be the original article title. And
sometimes they change it from the original article title to something else!

~~~
anateus
Yes... they change it back to the original unless it's really clear that the
submitter's change was super necessary. In cases where the original title
tells you _nothing_ of the content, they change it to something useful. Seems
like a pretty transparent and reasonable heuristic to me.

~~~
nthitz
I agree that it's reasonable. I don't feel it's applied uniformly though.

~~~
rhizome
No policy ever is.

------
jongraehl
1\. this could be done offline, or cached

2\. unconvinced that they can't do better w/ the right hash table (which
would be thousands of lines less code). but since they're using a library,
i'll give them a pass. still, the absence of even a benchmark vs. a quick
hash table version makes all the claims that "judy arrays are better for our
purpose" suspect.

------
ww520
Isn't the Judy array patented? What's the legal status of using it?

~~~
dalke
If a patent does apply (and there have been no statements or claims about
which ones definitely do), then people assume it's US6735595: "Data structure
and storage and retrieval method supporting ordinality based searching and
data retrieval."

The software from HP is under the LGPLv2.1 which says "Finally, software
patents pose a constant threat to the existence of any free program. We wish
to make sure that a company cannot effectively restrict the users of a free
program by obtaining a restrictive license from a patent holder. Therefore, we
insist that any patent license obtained for a version of the library must be
consistent with the full freedom of use specified in this license."

That can be read to imply that no additional patent restrictions apply should
you use the LGPL'ed version of Judy or its derivatives. YMMV. Consult your
nearest patent attorney for details.

The patent expires in 7.5 years.

------
kristianp
So each language's tokens are stored with a language prefix in the Judy
array. This means that to identify the language of a token, you have to loop
over all languages, prepending each prefix to the token and looking it up,
keeping a count of matches per language. Does that sound correct?

I wouldn't have used the prefix approach; instead I'd store each token once
in the Judy array, and use the stored data to indicate which languages match
the token, as sketched below.
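
A sketch in C of that alternative (hypothetical function names, but
JSLI/JSLG are the real JudySL macros): one JudySL keyed by the bare token,
whose Word_t value is a bitmask of the languages the token appears in, so
classification does one lookup per token instead of one per (language, token)
pair.

    #include <Judy.h>

    static Pvoid_t tokens = (Pvoid_t)NULL;  /* JudySL: token -> bitmask */

    void add_token(const char *tok, int lang_id) {
        Word_t *pvalue;
        JSLI(pvalue, tokens, (const uint8_t *)tok);  /* insert or find */
        *pvalue |= (Word_t)1 << lang_id;             /* mark this language */
    }

    Word_t languages_for(const char *tok) {
        Word_t *pvalue;
        JSLG(pvalue, tokens, (const uint8_t *)tok);  /* lookup only */
        return pvalue ? *pvalue : 0;                 /* set bits = matches */
    }

A Word_t bitmask caps out at 64 languages on a 64-bit build, so past that the
value would have to point at a wider set.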

------
alexdowad
This is awesome!!! I remember looking for an object allocation profiling tool
for Ruby 1.9+ some time ago, and not finding one
(<http://stackoverflow.com/questions/12204012/allocation-tracer-which-works-with-ruby-1-9>).
When will Github's --debug-objects option make it into MRI???

------
ikeepforgetting
Somebody finally found a use for them!

------
revelation
Something here doesn't sit right with me. Please don't launch into a lecture
on algorithmic complexity and cache lines if you are running on Ruby on Rails.

~~~
jknightco
Slow platforms are exactly where you need to talk algorithmic complexity.
Switching to a faster environment would only reduce a routine's execution time
by a constant multiplier—switching algorithms has the potential to do much
more.

~~~
revelation
That's not what I mean. I think that any half-decent implementation in C
would have been better than their current Ruby solution, and instead of
spending all that time researching Judy arrays and other magic bullets, they
could have already identified another problem area and fixed it.

This just feels like "we have this huge and slow Ruby behemoth here, but using
magic Judy arrays we made this little gear turn insanely fast!".

~~~
jknightco
Execution time isn't the only cost in building software, however. Building a
web application in C may be "cheap" with regard to scaling performance, but
it would sure as hell increase development costs a ton.

I think what Github has done is exactly what you're supposed to do: build your
product on a platform that allows for quick growth and easy iteration,
identify the slow parts, then swap those out for faster "gears" as you said.

Edit: I find it rather humorous that you find algorithmic research to be a
magic bullet, but believe switching platforms to be a viable solution when
faced with a bottleneck. I believe you have that backwards :)

~~~
oinksoft
I don't think parent is suggesting the whole website be written in C. I think
parent sees the CS focus as a sideshow, and that a C version of the "escaper,"
even if not designed ideally, would still smoke the Ruby version that was
acting as a bottleneck.

~~~
jknightco
"...parent sees the CS focus as a sideshow..."

My point is that the CS is never a sideshow. Improvements gained from changes
in algorithmic complexity vastly outpace improvements gained from switching
languages/platforms.

Worry about your algorithm first and your platform second.

