
Some performance tweaks - espeed
https://github.com/bitly/dablooms/pull/19
======
Smerity
This is a fun example of the power of open source. Bitly releases software
they use internally but which isn't secret sauce. Github comes along, uses it
internally, improves the original software, and ends up benefiting both.
Contributing to OSS is a non-zero-sum game and the sooner companies realise
this, the better.

Not only that but the pull request is done in a fun and informal style -- a
perfect example of Github's use by a Github employee =] He frankly admits that
some of the changes are substantial and weren't requested or set out before
hand, so there's no pressure for them to be merged into mainline if not
appropriate.

It's important to note that this is an example where both sides work optimally
though. I've contributed code to OSS projects backed by companies previously
and it's not uncommon to end up with "dangling" pull requests -- no-one looks
at it either for months or at all.

I'm still appreciative of these companies, don't get me wrong, but if it takes
months for a short but critical bugfix to get through then you're not playing
the OSS model properly. Either admit it's a "dump and release" or ensure your
open projects are handled properly. Developers will look at you in the future
and decide that's your attitude towards all your projects (see: Oracle). This
ends up being a major problem when you need to win the trust of third party
developers for your start-up/service/tool.

(I'm also really glad the dablooms library is getting more exposure due to
this -- the initial Hacker News post fizzled out)

~~~
dfc
How could contributing to an OSS project be a zero sum game?

~~~
rictic
Imagine a simple market (if you're wincing already, I'm right there with you,
but you asked for this right?). There are 100 customers, and two products. The
products are interchangeable, save for the performance of their bloom filters.

This is a zero sum game from the perspective of the two companies providing
the products.

~~~
dfc
Thanks for responding. I'm afraid my question was unclear. This was supposed
to be an answer to how could contributing to an open source project be a zero
sum game? You did not define what the open source project was and/or the
companies' rationales for considering contributions to be zero-sum or
otherwise.

~~~
rictic
If we further assume that the two companies who produce the two products only
produce those products, then for those two companies it's zero sum for them to
contribute in good faith to open source bloom filters. Any gain that one
company gets in the quality of their bloom filter is likely to translate
directly into fewer customers for the other.

It isn't globally zero-sum, notably the users of these companies' products are
presumably better off, as are other software projects and their users who
might benefit from any improvements made to open source bloom filters.

------
Kynlyn
Good grief, some of you folks take yourself way too seriously. No, his pull
request wasn't groundbreaking, but it was useful. His presentation was
whimsical and light-hearted. So what? Is dry and boring better because it
seems more academic or professional? Bleh.

Life is too short not to have some fun in your day job and kudos to vmg for
doing exactly that. For the rest of you..lighten up. Seriously.

~~~
apawloski
Sometimes it feels like people always have to find a flaw to latch onto and
take some meaningless stand on, no matter how irrelevant. I don't know if it's
for the fleeting sense of superiority, for the sake of discussion, or (as you
put it) because they take themselves too seriously. I value the importance of
criticism, but not when its primary motivations are self-serving.

For what it's worth, this is a phenomenon you see in most tech communities --
it happened on slashdot, proggit, and it's present here too.

~~~
evilduck
God help you if you attempt to defend the use of anything except statically
typed languages on proggit.

------
MattRogish
This is why, at a certain level, after a certain length of time, most software
companies (including startups, although it could be argued you're
transitioning from startup to real company at that stage) need the oft-
maligned "neckbeard" type folks (vmg sports some stubble, perhaps he's a
closet neckbeard? :D).

Yes, it's great to have the latest JavaScript ninja working on your front-end
and whiz-bang Ruby folks on the back-end but eventually you're gonna run into
problems that require RealHardComputerScience(TM) to fix. Or, you just throw
more hardware at it and forget about it, and end up paying for that oversight
over-and-over-and-over (it looks like it didn't take him much time to fix it)

~~~
netnichols
Not to take anything away from vmg (what a great pull request!)... but
"RealHardComputerScience" is _writing_ a hash function, not swapping out one
hash function for another based on some profiling. ;-)

~~~
mikle
I'd argue that writing a hash function is RealHardMathematics.

~~~
neilc
It really involves both: particularly if you want high speed hashing, you need
to pay close attention to how your hash function is executed by the hardware.

------
breckinloggins
Could HN please, PRETTY please introduce some feature to parse well-known URLs
so it could give you a little more sense of the source? This looks like it's
going to be an announcement from github.com about some cool performance
improvements. Not that I don't appreciate the actual article, but it was kind
of a let-down.

I know this has been discussed before, but I'm honestly mystified why this is
still an issue.

~~~
espeed
The title was changed. It used to say: "Some performance tweaks -- 'this is
the most amazing pull request ever'".

See <https://twitter.com/neha/status/233225033332445184>

------
ocharles
When I initially saw this posted here, I was irritated. The tone grates with
me, and yes - I would prefer a much drier, concise explanation if I were to
receive such a pull request for my projects. I kept this to myself though, and
had a look at the comments here, and I've had time to mull on why I think this
is potentially dangerous behaviour.

The tone and humour in that post requires a large amount of confidence in the
changes being made, in order to write about them humorously, but also in the
author themselves to actually present their work in such a tone. vmg is
perfectly entitled to do both of these; the pull request is detailed, shows
clear motivation and research, and vmg seems to know his stuff. The problem is
that GitHub encourages networking. The damage comes when other people who are
less experienced, or frankly, less knowledgeable, copy his style and do
produce noise.

I worry about a risk of imitation of this culture, but missing the crucial
underlying detail and explanation that's hidden in vmg's writing. I worry
reasoning with these people will be difficult because they have trained
themselves to have such arrogance in their work.

I prefer a dry report not only because it is succinct, not only because it
makes my life easier to understand, but also because it encourages a
disciplined state of mind. If you aren't able to write about something in a
mature dry tone and back it up (that is, not cover up with humour), then you
should doubt your work until you can amply support it. Yes, life is short, but
it's also so short that I would like to get things done; rather than have to
potentially argue past people to get important points across. Let's put this
creativity into making great stuff, not making great pull request comments,
eh?

Finally, all of this stuff builds a record for the project. A succinct, yet
detailed, pull request is much more accessible a year down the line to
understand the changes in more detail. Of course, this detail should be in the
commit messages (and I _do_ criticise vmg on poor commit messages here), but
every bit of writing contributes towards project documentation, at some level.
The more we can create a habit to create mature, if somewhat monotonous,
technical writing, I do think the better.

So no, it's not just a "I HATE HIS FUN" argument; there are more reaching
concerns, no matter how exaggerated you might think they are.

------
angersock
Great engineering, but a little bit brogrammer in the presentation.

~~~
ionforce
As long as the claims he is making are true, I wouldn't consider it
brogrammery at all (just a bit fun/Internetty).

It's when unsubstantiated claims are being taken as gospel that brogramming
gets in the way. HEROKU AND ORM ALL THE THINGS~!!!! WHAAAAAT

------
tocomment
One thing on my "bucket list" is to "use a bloom filter for something". They
seem like such awesome data structures but I've never found a place to use one
:-(

~~~
WALoeIII
Next time you are caching a big list with memcached and running out of space,
you may replace that big list with a bloom filter.

~~~
tocomment
Could you explain that a bit more?

~~~
btilly
This is exactly the scenario that a bloom filter is for.

You have an expensive lookup. You're caching information on success/fail so
that you don't have to do the expensive lookup every time. But the caches are
getting large.

What you do is replace the local caches with a bloom filter. That data
structure takes a bounded amount of memory. When it says, "No, I have not seen
you before," you really haven't. And when it says, "I might recognize you," it
is only sometimes right. However its mistakes will not really matter because
you'll do the expensive lookup.
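A minimal sketch of that pattern in Python, for anyone who wants to try it. Everything here is illustrative: the `BloomFilter` class, `guarded_lookup`, the dict standing in for the expensive backend, and the salted SHA-256 used to derive bit positions are all assumptions of this example, not how dablooms does it.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: never a false negative, sometimes a false positive."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions by prefixing the key with an index.
        # SHA-256 is used only for convenience; a real filter would use a
        # much cheaper non-cryptographic hash.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


def guarded_lookup(key, seen, backing_store, calls):
    """Skip the expensive lookup when the filter says the key was never added."""
    if not seen.might_contain(key):
        return None  # a "no" from the filter is always correct
    calls.append(key)  # record that we paid for the expensive lookup
    return backing_store.get(key)  # hypothetical stand-in for the slow backend


store = {"alice": 1, "bob": 2}
seen = BloomFilter(num_bits=1024, num_hashes=3)
for name in store:
    seen.add(name)

calls = []
print(guarded_lookup("alice", seen, store, calls))
print(guarded_lookup("carol", seen, store, calls))
```

A false positive only costs one wasted trip to the backend, which still returns the right answer, so the result is correct either way.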

The tradeoff is that the more data you put into a bloom filter, the higher the
odds are that it will think you might have seen things before, and
therefore the less useful it becomes. But in this caching situation, it saves
you work even if the false positive rate is fairly high.
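To put rough numbers on that tradeoff, here is a small sketch using the usual textbook approximation for the false positive rate, (1 - e^(-kn/m))^k, which assumes the k hash functions behave independently and uniformly. The function name and the sizes below are made up for illustration.

```python
import math


def false_positive_rate(num_bits, num_hashes, num_items):
    """Approximate false positive rate of a Bloom filter with m bits,
    k hash functions and n inserted items: (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-num_hashes * num_items / num_bits)) ** num_hashes


# Same filter size, more and more items: the rate only goes up.
for n in (1_000, 5_000, 20_000, 80_000):
    rate = false_positive_rate(num_bits=65_536, num_hashes=4, num_items=n)
    print(f"{n:>6} items -> approx. false positive rate {rate:.4f}")
```

Even at a fairly high rate, every "no" still saves one expensive lookup, which is the point made above.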

------
jw_
What an obnoxious write-up for a pretty straightforward optimization. I don't
really see how this is worthy of any discussion, unless we want to discuss how
some developers think they're a lot more clever than they really are.

"Developer profiles code; replaces slow library call A with faster library
call B; ensures B does not change any important behaviour; writes self-
congratulatory pull request."

~~~
bherms
Yeah, god forbid any of us actually have fun while working. _GASP_

------
playhard
"Hey, I just met you, and this is crazy, but I rewrote your bloom hashes, so
merge me, maybe?"

------
seanwoods
I find the style that this is written in to be annoying, distracting, and a
little arrogant. It's not funny at all.

Just write the facts and let them stand for themselves.

~~~
MBlume
All the facts are there. He presented them entirely clearly.

~~~
voltagex_
Agreed. It's clear, it kept me interested and I made sure I read it thoroughly
instead of just skimming

------
asharp
Cool speed hacks.

As an improvement though, you only need two independent hash functions to run
your bloom filter[1]. Strangely enough, this isn't well known and as such
isn't implemented anywhere near as often as it should be (ie. it's not
implemented here).

[1] www.eecs.harvard.edu/~kirsch/pubs/bbbf/rsa.pdf
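For the curious, the trick from the paper is to build the i-th bit position as h1(x) + i * h2(x), modulo the filter size. A rough Python sketch follows; note that splitting a single SHA-256 digest into two halves is just this example's stand-in for two independent hash functions, and the function names are invented here.

```python
import hashlib


def two_hashes(key):
    """Return two 64-bit values for a key.

    Assumption for this sketch: two halves of one SHA-256 digest stand in
    for the two independent hash functions.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    h1 = int.from_bytes(digest[:8], "little")
    h2 = int.from_bytes(digest[8:16], "little")
    return h1, h2


def bloom_positions(key, num_hashes, num_bits):
    """Derive num_hashes bit positions as (h1 + i * h2) mod num_bits."""
    h1, h2 = two_hashes(key)
    return [(h1 + i * h2) % num_bits for i in range(num_hashes)]


print(bloom_positions("hello", num_hashes=5, num_bits=1024))
```

The key is hashed once no matter how many positions the filter needs, so raising the number of hashes only adds a multiply and a modulo per extra position.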

~~~
zheng
I was just talking with a colleague the other day about how to simplify the
number of hash functions needed for a bloom filter. He recommended something
similar to what the paper describes, but we both dismissed it as "probably
won't work". Just goes to show you that sometimes the simple solution is worth
closer examination. Thanks for the paper!

------
dllthomas
"MD5, being cryptographically sound"

Uh, not for a while now...

Which is not to say that the general point isn't sound - MD5 was aimed at
generating high quality entropy while most non-crypto hashes are aimed at
generating entropy-enough _fast_ - but don't use MD5 for crypto stuff
anymore.

~~~
ketralnis
That has nothing to do with the context of the conversation.

The point is that it was _designed_ to be cryptographically sound -- and
therefore more heavily optimised towards entropy over performance -- whereas
the need here is for the hypothetical entropy/performance slider.

------
silentbicycle
This pull request has a LOT of attitude for something that just swaps out a
hash algorithm and uses ftruncate(2).

------
brown9-2
The quote in the title here doesn't appear in the actual pull request - am I
missing a reference somewhere?

~~~
espeed
It's how it's being passed around in Twitterland
(<https://twitter.com/neha/status/233225033332445184>).

~~~
nehan
thanks for the attribution!

------
eranation
I feel so stupid now you have no idea, shame on you, I now started reading
about MurmurHash instead of working. And I've been writing software for 12 years and
have a CS degree.

~~~
zanny
The manual doesn't say you have to know any hashing algorithms. I think that
_should_ be the desired behavior - if your hashes work, you shouldn't need to
know their internals as long as a reputable specialized verifier and some form
of peer review can authenticate their worth. You have more _important_ stuff
to be doing!

... I also read up on Murmur though, since I know SHA and MD5 already. ~3
months out CS grad myself.

------
jimmytucson
Impressive, considering this could have been an even better pull request
without all the "like, white guy speak, yo".

------
outside1234
vmg clearly has a career in standup if this coding thing doesn't work out.

~~~
evan_
because he made some tired meme jokes?

------
tripzilch
Nice optimisation, but who thought it would be a good idea to use MD5 in a
Bloom filter?!

MD5 is a cryptographic hash (even though it's not secure anymore for most
purposes) and while it's _pretty_ fast, you don't need any of its crypto
properties, just the properties of a good quality regular hash function. Such
as Murmur, or even simply FNV.
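To show how little code "simply FNV" is, here is a sketch of the 32-bit FNV-1a variant in Python, using the published offset basis and prime. It only illustrates the algorithm; in pure Python it will not be fast, and it is not a claim about what dablooms should ship.

```python
FNV32_OFFSET_BASIS = 0x811C9DC5  # standard 32-bit FNV offset basis
FNV32_PRIME = 0x01000193         # standard 32-bit FNV prime


def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a: XOR each byte in, then multiply by the FNV prime."""
    h = FNV32_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV32_PRIME) & 0xFFFFFFFF  # keep the result in 32 bits
    return h


# Map a key to a bit position in a hypothetical 1024-bit filter.
print(fnv1a_32(b"some key") % 1024)
```

The whole hash is one XOR and one multiply per byte, with none of the padding and rounds that MD5 needs.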

------
bonaldi
ye gods, another useful headline edited to meaningless moronicity. Seriously,
mods, enough of this shit.

------
noveltyaccount
"Glad you like the changes! Sorry it took me a while to answer, I was watching
SCIENCE."

------
arnarbi
Why the essay?

~~~
apawloski
Because it's a clear description of the motivations behind the changes without
restricting the discussion to an audience of a particular skill level. Also
because, when possible, things should be fun - and he seemed to have fun with
the essay, so why not?

~~~
arnarbi
I don't think that is a clear description by any measure. It could be
summarized in a couple of paragraphs. It's perhaps good for beginner
programmers to learn from, so I'd encourage him to write a blog post. Pull
request motivations should be to the point, especially for small changes,
because otherwise they just waste time.

~~~
apawloski
Meh, to each their own I guess. While it is long, that does not in any way
make it unclear. In fact, he explicitly states the motivations and
repercussions of each decision. To persuade the master developers, who made
the "mistakes" (for lack of a better word) in the first place, this seems like
a worthwhile pursuit.

The only thing that would have made this more entertaining would be if he had
submitted a similar pull request to Linus..

------
mvkel
Cool. As we all know, ideas are worthless. It's the execution that matters.

