

Kill hashtables, get shorter code. - JRM
http://herdrick.tumblr.com/post/825096887/kill-hashtables-get-shorter-code

======
nostrademons
This is actually a pretty common technique in Haskell. If you look at the Alex
lexical analyzer generator's source code, it represents CharSets as functions
from Char -> Bool: they return true if the character is in the charset or
false if it is not. Internally, they work just like the article describes: the
union of two CharSets doesn't actually enumerate every possible character in
the set, it just calls the two CharSets and performs an OR.
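
The predicate representation can be sketched outside Haskell too. A rough Ruby
analogue of the idea (not Alex's actual code; names are illustrative):

```ruby
# A charset represented as a predicate: character in -> true/false out.
digits = ->(c) { c >= '0' && c <= '9' }
lowers = ->(c) { c >= 'a' && c <= 'z' }

# Union never enumerates the members; it just ORs the two predicates.
union = ->(a, b) { ->(c) { a.call(c) || b.call(c) } }

alnum = union.call(digits, lowers)
alnum.call('7')  # => true
alnum.call('x')  # => true
alnum.call('!')  # => false
```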

There are other languages that do this too, e.g. JavaScript written by
functional programmers often ends up looking like this. In Arc, the
distinction is completely meaningless, because you "call" a hashtable to index
an object anyway.

The problem with it, I've found, is the lack of debuggability. You can
enumerate a hashtable, and you can inspect it in the debugger. You generally
can't enumerate a function. I've found that with complex systems, the biggest
correlation to programming productivity is the amount of debug information
available, _not_ the size of the code. Trading off debuggability to get short
code is usually a poor trade, once you get past a certain code size.
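
A tiny Ruby sketch of that trade-off (hypothetical names): the hash version of
a set can be enumerated and printed, while the function version answers
membership queries but reveals nothing else.

```ruby
vowels_hash = { 'a' => true, 'e' => true, 'i' => true, 'o' => true, 'u' => true }
vowels_fn   = ->(c) { vowels_hash.key?(c) }

# The hash can be enumerated and inspected in a debugger:
vowels_hash.keys     # => ["a", "e", "i", "o", "u"]

# The function is opaque; all you can do is query it:
vowels_fn.call('e')  # => true
vowels_fn.inspect    # => "#<Proc:0x... (lambda)>"
```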

~~~
cageface
_I've found that with complex systems, the biggest correlation to programming
productivity is the amount of debug information available, not the size of the
code._

I don't think this point is made often enough. Brevity is certainly a virtue,
but ultimately what counts is not LOC, bug counts, iterations/sec, etc., but
features delivered to users. Worse-is-better reigns supreme here, which is why
you so often see engineering and design disasters like PHP, Craigslist,
MySpace, and Windows brushing aside far more elegant but less engaging
alternatives. For most of us the pride we take in our craft makes this a very
hard pill to swallow, but swallow it we must.

------
ajg1977
"But it’s hard to argue with brevity."

No, it's really not. Brevity != clarity. Clarity is hard to argue with,
brevity is not. By his own admission this new code is harder to write and
harder to debug.

It's interesting to me that in the article the author dismisses any
performance impact of his changes because "we can profile and deal with it
later", but fails to realize that by attempting to reduce the length of his
code he risks creating exactly the same kind of needlessly hard-to-read,
hard-to-debug code that premature optimization does.

~~~
herdrick
That's a pretty old debate. I think brevity does a lot for clarity, and the
more you work at making your code brief the better you'll be at reading brief
code.

Of course if other people are reading your code then you should consider what
current programming practice is. But the current practice in Clojure is, for
cultural reasons, to make things really succinct. So you can assume those
reading your code are ok with that.

~~~
gnaritas
> That's a pretty old debate. I think brevity does a lot for clarity

Up to a point, that's true, but if you start shortening words by removing
letters, you've gone too far.

~~~
gruseom
I disagree in two senses. First, it's a poor measurement of brevity that
depends on identifier length. Even LOC (surely the gold standard of bad
metrics) doesn't depend that much on that. Second and more interestingly, as
code gets more concise, long explicit identifiers begin to detract from
readability. Well-written concise code discloses valuable information just by
the shape it takes. With experience reading such code, you start to grok it at
a whole-expression level. Lengthy identifiers interfere with this gestalt:
they provide an initial boost when reading the code, but at the long-term cost
of overwhelming the code with lexical noise.

Concise programs often appear laughably unreadable to people who are used to a
definition of "clarity" that derives from more verbose languages. At one time
they appeared that way to me too. But with time spent working in less verbose
languages, to my surprise I found myself wanting to use shorter and shorter
names, not to interfere with clarity but to enhance it.

~~~
gnaritas
The length of an identifier should be proportional to its scope. But
specifically I'm talking about non-descriptive function names shortened for
brevity: map, filter, join, split are better names than mp, flt, jn, sp no
matter how often you use them.

It takes less effort to read a word than to trip over an unfamiliar
abbreviation, so while the code may be more concise, it is less clear. If your
code relies upon the reader to know a bunch of specific idioms, then it might
be concise, but it isn't clear; it's jargon.

~~~
gwern
All words are arbitrary collections of letters. map, filter, join, and split -
for at least 2 of those, the most common meaning is so far removed from the
programming meaning that they might as well be something shorter like 'mp'.

~~~
gnaritas
That might be true if your users were a blank slate, but they aren't: they
have existing language skills, and it's easier to remember words from ordinary
language than some library writer's abbreviations.

Nor are all words arbitrary collections of letters; most words have some form
of alternating consonant-vowel structure that makes them easy to speak and
remember. Write a program that arbitrarily combines letters and I doubt you'll
get much output that looks like words; there are lots of patterns in how we
create words, it isn't arbitrary. If you can't read your code aloud and sound
reasonable, then you've probably inappropriately abbreviated your identifiers.

Map is by far a better name than mp; there is no shortage of vowels; map is
already a short identifier and a real word.

~~~
herdrick
_map is already a short identifier and a real word_

This is heresy, but I actually think it might be better if it weren't a real
word. I find I benefit from inventing names for functions or data
abstractions. It helps to clear the mind of assumptions about what those
things might be.

Plus making map into mp or whatever makes it much more googleable. And don't
get me started on Clojure having map and a Map.

~~~
gnaritas
I don't care if you invent words, if you call your function zork that's fine,
it's still a pronounceable word. When I say real word, I mean it has
consonants and vowels and is speakable. It's abbreviations that annoy me.

~~~
herdrick
Really? OK, that's interesting. I guess I prefer speakable ones too.

~~~
gruseom
Agreed all around. Whether or not a name sounds right when you say it is one
of the criteria we use on our project for naming design ideas. If you're
interested in the rationale behind this, Eric Evans' book _Domain Driven
Design_ articulates a concept of design as the "ubiquitous language" of a
project that I have found very useful.

------
jfager
One of my favorite features of Coffeescript (and other languages) is
destructuring assignment, which lets you use pattern matching over an object
to pull out values from nested structures. It's way more useful than this
proposal: brief when you want brevity, but you still get the full object
returned from the method when you need to debug, and apis don't have to be
adapted or rewritten with the usage pattern in mind.
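
For illustration, a comparable destructure using Ruby 3's pattern matching (a
stand-in for the CoffeeScript syntax; the nested structure here is made up):

```ruby
response = { status: 200, body: { user: { name: "ada", id: 7 } } }

# Pull nested values out in one step; `response` itself is still
# around for inspection when you need to debug.
case response
in { status:, body: { user: { name: } } }
  # status and name are now bound locals
end

status  # => 200
name    # => "ada"
```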

~~~
swannodette
Coffeescript is certainly cool, but Clojure has sophisticated destructuring
assignment as well.

~~~
jfager
The point was not "Coffeescript is better than Clojure", it was "destructuring
assignment (like in Coffeescript, for instance) is a better solution than the
one proposed in the post". If Clojure has sophisticated destructuring
assignment, awesome. Use that instead.

~~~
swannodette
I suppose I failed to see how destructuring assignment on hash-maps
accomplishes anything interesting at all related to the post. The post
eliminates hash-maps altogether.

~~~
jfager
It eliminates them from the api, but how does memoize work? There are still
hashtables, you just never get a hook on them.

I'm just saying that I prefer having that hook, and that destructuring
assignment lets me do so with brevity and clarity.
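
A bare-bones memoize in Ruby makes the point concrete; the hashtable is still
there, just hidden in a closure (a sketch, not Clojure's actual
implementation):

```ruby
def memoize(fn)
  cache = {}  # the hashtable survives, hidden from the caller
  ->(*args) { cache.fetch(args) { cache[args] = fn.call(*args) } }
end

calls = 0
slow_double = ->(n) { calls += 1; n * 2 }
fast_double = memoize(slow_double)

fast_double.call(21)  # => 42 (computed)
fast_double.call(21)  # => 42 (served from the hidden hash; calls is still 1)
```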

------
idoh
Good stuff. My only comment is to make a macro like defmemo. That way instead
of writing something like: (def set-m (memoize set)) you could write (defmemo
set-m set).
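
Ruby has no macros, but the convenience being described can be sketched as a
class-level helper (all names here are hypothetical):

```ruby
class Analysis
  # defmemo(:name) { ... } defines a memoized method in one step,
  # instead of defining the function and wrapping it separately.
  def self.defmemo(name, &fn)
    cache = {}
    define_method(name) do |*args|
      cache.fetch(args) { cache[args] = fn.call(*args) }
    end
  end

  defmemo(:square) { |n| n * n }
end

Analysis.new.square(4)  # => 16
```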

~~~
herdrick
Yep, that'd be better. I didn't want to have to explain a macro, though I
guess anyone bothering to read the code can handle a simple macro. Thanks,
maybe I'll change that.

------
amalcon
When he talks about performance, he mentions that memoization keeps
performance from suffering very much, but what he doesn't mention is that
almost the entire performance gain comes from memoizing _frequencies_...
which stores the hashtables anyway; it just does so behind the scenes. This
is fine for CPU usage, but it carries a significant storage penalty depending
on the characteristics of the input.

There's some low-hanging fruit to cut down on the required storage, namely
wrapping to-words and frequencies together, and memoizing that. This doesn't
reduce the number of hashtables, but at least there's no need to store the
entire word lists. As it stands, this will simply fail if your corpus is too
large to fit in your RAM all at once.
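
That fix can be sketched like this (Ruby stand-ins for the Clojure functions;
to_words and frequencies are illustrative):

```ruby
memoize = lambda do |fn|
  cache = {}
  ->(arg) { cache.fetch(arg) { cache[arg] = fn.call(arg) } }
end

to_words    = ->(text) { text.downcase.scan(/[a-z]+/) }
frequencies = ->(words) { words.tally }

# Memoizing only the composition caches the small frequency maps;
# the full word lists are computed, used, and discarded.
word_freqs = memoize.call(->(text) { frequencies.call(to_words.call(text)) })

word_freqs.call("the cat and the hat")
# => {"the"=>2, "cat"=>1, "and"=>1, "hat"=>1}
```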

~~~
herdrick
Without profiling it I'd guess it's memoizing the function that does the disk
i/o that makes the biggest difference.

~~~
amalcon
Right, after writing that I realized I was forgetting about that one. Still,
memoizing frequencies takes it from O(n^2) to O(n) right now, so it's easily
the _second_ largest improvement.

------
Groxx
In summary: instead of getting a hash back from a function and pulling out the
(single!) desired key, pass the key to the function and get the value back
directly.

A limited-use but handy technique, and they're not arguing for dogmatic
adherence to the rule. Something to keep in mind, but don't go re-wiring old
code unless you're bored. You could also code the function so if you pass no
key-args, you get the whole hash back.
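
That summary, sketched in Ruby (the word-splitting logic is illustrative, not
the post's actual code):

```ruby
# Instead of returning the whole frequency hash, take the key as an
# argument; with no key, fall back to returning the full hash.
def frequency(doc, word = nil)
  freqs = doc.downcase.scan(/[a-z]+/).tally
  word ? freqs.fetch(word, 0) : freqs
end

frequency("the cat and the hat", "the")  # => 2
frequency("the cat and the hat")         # => the whole hash, for debugging
```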

I'm personally a fan (sometimes) of code like this when I _really_ want to
one-line something but need multiple values:

    
    
      # ruby
      value1, value2 = [(x=aFunction(arg))[:key1],x[:key2]]
    

:)

~~~
cageface
I hate it when I see code like that. It's so much harder to maintain. You've
saved two lines of code but made the intent at least 10x harder to infer. I've
been wading through the Rails codebase lately and this kind of thing gets to
be annoying fast, particularly when you're trying to debug a function that's
littered with these kinds of expressions, often nested.

Honestly it's this kind of thing that makes me wonder if the Pythonistas don't
have a point about clarity vs cleverness.

~~~
chc
That's just excessive "cleverness." I don't think you should equate that with
concise coding. I could write an equally "clever" snippet that takes 12 lines
to do the same thing. Really, you're only saving one line of code over a
perfectly readable version:

    
    
      x = aFunction(arg)
      value1, value2 = x[:key1], x[:key2]
    

And you can still write a one-liner much better than the original:

    
    
      value1, value2 = aFunction(args).values_at(:key1, :key2)
    

In essence: That style of coding is not really clever, and I would argue it's
not idiomatic. It's doing more work to compensate for a missing tool (in this
case, not knowing about the values_at method).

~~~
wisty
Aside from the use of a temporary variable (which can be used for self-
documentation anyway), people aren't saving any operations with the one-liner.
They are just cramming more operations into a single line.

If LOC was so important for readability, we could just write in C without
using any newlines.

~~~
chc
The point isn't to save operations or lines of code — it's to reduce semantic
load and make code more readable. In most languages, calling a function
actually results in _more_ operations than just doing copy-and-paste with the
same code, but I don't think you'd dismiss well-factored code on those
grounds.

A rough gauge for better code might be how much a skilled coder can grok by
glancing at N lines of code. This means that reducing LOC _can_ be helpful,
but it's not a 1:1 correspondence — your example of C with no newlines would
make it harder to read. On the other hand, putting what is essentially a hash
destructuring on one line actually does tighten up the semantics.

------
thaumaturgy
Aside from the maintainability of this technique, it sounds as though it's
taking one of the least efficient routes towards doing its job. (I'm not a
LISPer, and although I should stop and think about the code there until I grok
it before commenting, I'm looking for a lazier answer.)

So, let's say we have two documents, and want to calculate their distance.
Each document has 2,000 words. We can either scan each word in each document
once, counting them in memory and processing a total of 4,000 words, or we can
start with a word in one document, scan both documents for all occurrences of
that word, move on to the next word, scan both documents for all occurrences
of that word, and so on, for a total of ~N! scans. Given a choice between a
really efficient function running in factorial time, or a longer, uglier
function running in constant time ... I'd kind of prefer the latter.

Am I way off the mark here? Is there something I misunderstood?

~~~
herdrick
Sorry, I don't understand what you're asking. The way it works now is: each
document is read and summarized into a bunch of word frequencies, once. It
looks like each is being read many times, but since I'm memoizing, it only
happens once. Is that helpful or are you suggesting something else?

------
jimbokun
"That sort of sucks. Further, it makes some functions, like euclidean,
hopelessly non-general, since euclidean now has to know what function to call
to get a frequency [3]."

No. You can just pass in the appropriate, memoized frequency function to call.
In the new code, that just means that "freq" would be an additional argument
to euclidean, instead of a globally defined function.
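
In Ruby terms, the generalized version might look like this (all names
illustrative):

```ruby
# euclidean takes the frequency function as a parameter instead of
# knowing which globally defined, memoized function to call.
def euclidean(freq, doc_a, doc_b, words)
  Math.sqrt(words.sum { |w| (freq.call(doc_a, w) - freq.call(doc_b, w))**2 })
end

freq = ->(doc, word) { doc.split.count(word) }
euclidean(freq, "a b b", "a a b", %w[a b])  # => Math.sqrt(2)
```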

~~~
herdrick
Right. That's what I said in the footnote, [3].

------
jrockway
Yup, that's what abstraction does for you. The less policy the caller has to
implicitly understand, the more flexibility you have in the underlying
representation, and the less code you have to write.

------
yason
I can't remember which philosopher it was, but he said something like: "If
you have something that you can actually express, you can then also express
it concisely."

