This is a fun example of the power of open source. Bitly releases software they use internally but which isn't secret sauce. Github comes along, uses it internally, improves the original software, and ends up benefiting both.
Contributing to OSS is a non-zero-sum game and the sooner companies realise this, the better.
Not only that but the pull request is done in a fun and informal style -- a perfect example of Github's use by a Github employee =] He frankly admits that some of the changes are substantial and weren't requested or set out beforehand, so there's no pressure for them to be merged into mainline if not appropriate.
It's important to note that this is an example where both sides work optimally though.
I've contributed code to OSS projects backed by companies before, and it's not uncommon to end up with "dangling" pull requests -- no one looks at them for months, or at all.
I'm still appreciative of these companies, don't get me wrong, but if it takes months for a short but critical bugfix to get through then you're not playing the OSS model properly. Either admit it's a "dump and release" or ensure your open projects are handled properly. Developers will look at you in the future and decide that's your attitude towards all your projects (see: Oracle).
This ends up being a major problem when you need to win the trust of third party developers for your start-up/service/tool.
(I'm also really glad the dablooms library is getting more exposure due to this -- the initial Hacker News post fizzled out)
Imagine a simple market (if you're wincing already, I'm right there with you, but you asked for this, right?). There are 100 customers and two products. The products are interchangeable, save for the performance of their bloom filters.
This is a zero sum game from the perspective of the two companies providing the products.
Thanks for responding. I'm afraid my question was unclear. It was meant as a response to the question of how contributing to an open source project could be a zero-sum game. You did not define what the open source project was, or the companies' rationales for considering contributions to be zero-sum or otherwise.
If we further assume that the two companies who produce the two products only produce those products, then for those two companies it's zero sum for them to contribute in good faith to open source bloom filters. Any gain that one company gets in the quality of their bloom filter is likely to translate directly into fewer customers for the other.
It isn't globally zero-sum, notably the users of these companies' products are presumably better off, as are other software projects and their users who might benefit from any improvements made to open source bloom filters.
"Contributing to OSS is a non-zero-sum game and the sooner companies realize this, the better."
So I am asking him: how could companies possibly think that contributing to OSS is a zero-sum game? It's either zero sum or it isn't. If they need to realize it's non-zero-sum, they would have to be under the impression that it was a zero-sum game.
I hope that cleared up my question. I'm sorry if it was opaque; that was not my intention.
The default assumption ascribed to enterprise is that they don't want to contribute to open source because there is no "return" on such an investment. It is a zero-sum mentality which says "I am worse off for having shared my technology".
This may actually be the case in many situations, and where it's not, the possibility of outside contributions subsequently improving the technology is often overlooked.
Good grief, some of you folks take yourself way too seriously. No, his pull request wasn't groundbreaking, but it was useful. His presentation was whimsical and light-hearted. So what? Is dry and boring better because it seems more academic or professional? Bleh.
Life is too short not to have some fun in your day job and kudos to vmg for doing exactly that. For the rest of you..lighten up. Seriously.
A few paragraphs into the pull request and I just knew the HN comments would be full of energy-sucking Debbie Downers fazed and miffed by the word "yo", using the word "brogrammer" in their epithets, and saying things like "just present the facts" in a comment that not only creates zero value but squanders it from the pool.
In a bout of absolute irony-blindness, these Value Vacuums are far more insufferable than the very material they declared "distracting".
Sometimes it feels like people always have to find a flaw to latch onto and take some meaningless stand on, no matter how irrelevant. I don't know if it's for the fleeting sense of superiority, for the sake of discussion, or (as you put it) because they take themselves too seriously. I value the importance of criticism, but not when its primary motivations are self-serving.
For what it's worth, this is a phenomenon you see in most tech communities -- it happened on slashdot, proggit, and it's present here too.
I've noticed it's worse here than it was about two years ago. Lots of people nitpicking on things that have nothing to do with the main point of the OP.
This is why, at a certain level, after a certain length of time, most software companies (including startups, although it could be argued you're transitioning from startup to real company at that stage) need the oft-maligned "neckbeard" type folks (vmg sports some stubble, perhaps he's a closet neckbeard? :D).
Yes, it's great to have the latest JavaScript ninja working on your front-end and whiz-bang Ruby folks on the back-end, but eventually you're gonna run into problems that require RealHardComputerScience(TM) to fix. Or you just throw more hardware at it and forget about it, and end up paying for that oversight over-and-over-and-over (it looks like it didn't take him much time to fix it).
Not to take anything away from vmg (what a great pull request!)... but "RealHardComputerScience" is writing a hash function, not swapping out one hash function for another based on some profiling. ;-)
It really involves both: particularly if you want high speed hashing, you need to pay close attention to how your hash function is executed by the hardware.
Not to be mean, but how was any of this hard? True, I work on server-side development and I'm not a neckbeard, but this is part of my day-to-day job and I think the approach is pretty straightforward.
Step 1. Design a test performance data set. Great if it comes from your production data!
Step 2. Run your algorithm and attach a profiler.
Step 3. Look at any method above 5% utilization. Can you use a better algorithm, a more compact data structure, improve memory locality, or tighten the loop? Googling for fast hash functions would give you a good answer here.
Step 4. Repeat steps 2-3 until you give up.
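The loop above can be sketched with Python's built-in cProfile and a hypothetical workload (dablooms itself is C, where you'd reach for perf or Instruments instead; `slow_hash` and `process` are stand-in names, not anything from the library):

```python
import cProfile
import pstats

def slow_hash(item):
    # Stand-in for an expensive hash function (hypothetical workload).
    h = 0
    for ch in item:
        h = (h * 31 + ord(ch)) % (2 ** 64)
    return h

def process(dataset):
    # Step 2: the algorithm under test.
    return [slow_hash(item) for item in dataset]

if __name__ == "__main__":
    # Step 1: a test data set (ideally drawn from production data).
    dataset = ["item-%d" % i for i in range(10000)]

    profiler = cProfile.Profile()
    profiler.enable()
    process(dataset)
    profiler.disable()

    # Step 3: sort by cumulative time and inspect the top entries.
    pstats.Stats(profiler).sort_stats("cumulative").print_stats(5)
```

Anything dominating the cumulative-time column is your candidate for a better algorithm or data structure; then repeat until you give up.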
The difference with this pull request is the level of documentation that went into it. We're impressed because nobody shows this level of documentation and humor when writing up their changes. That's what's impressive.
On one hand I agree with you completely about its apparent "obviousness".
On the other, I've seen so many devs that wouldn't even have been able to go through the process of identifying properly what to change, let alone do it, that I can recognize this pull as great.
The reality is that a majority of coders out there do not know how to properly do their job and they do not even know it (thus they don't try to improve).
(but to be fair, that's also what allows us to ask for the kind of salaries we get)
I think this has more to do with the fact that in most workplaces once something works it's taboo to 'waste' more time on it (or even if it's not taboo, it becomes low priority compared to more 'critical' issues), than it does with the particular changes requiring super skills. Though I'd love to work somewhere that was even set up cleanly enough to do meaningful test profiling.
Yeah, when dablooms was posted on HN I wrote my own version to learn it. The first thing I did was use Murmur. I am clean shaven.
Few companies need RealHardComputerScience guys. Instead, they need people who can apply the results of CS research to messy business problems. JS rockstar ninja brogrammers can't really do that.
Agreed. As a Ruby/JS programmer I absolutely do not think that JS or Ruby folks can't do hard CS stuff, just that it's really easy to be a commercially successful Ruby or JS programmer and not have any idea what a hash algorithm is. Or know anything about bloom filters.
The problem domain often doesn't expose you to that, and that's unfortunate. There's a lot of research over the last 60 years in CS that can make our lives much easier (obviously lots of research is ongoing, too).
There's a certain elitist tone in your comment which I found unwarranted: RealHardComputerScience can happen in JavaScript or Ruby just as easily as C - often easier as you can focus on the computer science rather than memory management.
I'd also second the other comments that this wasn't a great example of it: profiling and swapping hash functions is classic software engineering.
Could HN please, PRETTY please introduce some feature to parse well-known URLs so it could give you a little more sense of the source? This looks like it's going to be an announcement from github.com about some cool performance improvements. Not that I don't appreciate the actual article, but it was kind of a let-down.
I know this has been discussed before, but I'm honestly mystified why this is still an issue.
When I initially saw this posted here, I was irritated. The tone grates on me, and yes - I would prefer a much drier, more concise explanation if I were to receive such a pull request on my projects. I kept this to myself though, had a look at the comments here, and have had time to mull over why I think this is potentially dangerous behaviour.
The tone and humour in that post require a large amount of confidence in the changes being made, in order to write about them humorously, but also in the author themselves, to actually present their work in such a tone. vmg is perfectly entitled to both; the pull request is detailed, shows clear motivation and research, and vmg seems to know his stuff. The problem is that GitHub encourages networking. The damage comes when other people who are less experienced, or frankly, less knowledgeable, copy his style and do produce noise.
I worry about the risk of imitation of this culture while missing the crucial underlying detail and explanation in vmg's writing. I worry that reasoning with these people will be difficult, because they have trained themselves to have such arrogance in their work.
I prefer a dry report not only because it is succinct, not only because it makes my life easier, but also because it encourages a disciplined state of mind. If you aren't able to write about something in a mature, dry tone and back it up (that is, not cover up with humour), then you should doubt your work until you can amply support it. Yes, life is short, but it's also so short that I would like to get things done, rather than have to argue past people to get important points across. Let's put this creativity into making great stuff, not making great pull request comments, eh?
Finally, all of this writing builds a record for the project. A succinct yet detailed pull request is much more accessible a year down the line, when trying to understand the changes in more detail. Of course, this detail should be in the commit messages (and I do criticise vmg on poor commit messages here), but every bit of writing contributes towards project documentation at some level. The more we make a habit of mature, if somewhat monotonous, technical writing, the better.
So no, it's not just an "I HATE HIS FUN" argument; there are farther-reaching concerns, no matter how exaggerated you might think they are.
I found it whimsical and funny, I don't see it as brogramming. It was also a very good pull request, so this further decreases my brogramming expectation.
One thing on my "bucket list" is to "use a bloom filter for something". They seem like such awesome data structures but I've never found a place to use one :-(
Chrome uses them for the "safe browsing" filter. Google HQ makes a bloom filter out of all the blocked websites and sends out the filter occasionally in an update. That way Chrome can test the URL you're about to visit for membership in the huge blacklist without taking much network/disk/memory/CPU. Of course there could be false positives but Chrome only has to phone home to double-check on possible matches. http://blog.alexyakunin.com/2010/03/nice-bloom-filter-applic...
This is exactly the scenario that a bloom filter is for.
You have an expensive lookup. You're caching information on success/fail so that you don't have to do the expensive lookup every time. But the caches are getting large.
What you do is replace the local caches with a bloomfilter. That data structure takes a bounded amount of memory. When it says, "No, I have not seen you before," you really haven't. And when it says, "I might recognize you," it is only sometimes right. However its mistakes will not really matter because you'll do the expensive lookup.
The tradeoff is that the more data you put into a bloom filter, the higher the odds that it will say it might have seen something before, and therefore the less useful it becomes. But in this caching situation, it saves you work even if the false positive rate is fairly high.
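The pattern above can be sketched in a few lines of Python. This is a minimal illustration with made-up names, not the dablooms API; SHA-256 is used only as a convenient source of bits, whereas a real implementation would use a fast non-crypto hash like Murmur:

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1 << 16, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key):
        # Carve k bit positions out of one digest. A real filter
        # would use k fast hashes (or the two-hash trick) instead.
        digest = hashlib.sha256(key.encode()).digest()
        for i in range(self.num_hashes):
            chunk = digest[i * 4:(i + 1) * 4]
            yield int.from_bytes(chunk, "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means definitely not seen; True means "maybe,
        # go do the expensive lookup to be sure".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))
```

The cache logic then becomes: if `might_contain` says no, skip the expensive lookup entirely; if it says maybe, do the lookup and tolerate the occasional false positive.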
What an obnoxious write-up for a pretty straightforward optimization. I don't really see how this is worthy of any discussion, unless we want to discuss how some developers think they're a lot more clever than they really are.
"Developer profiles code; replaces slow library call A with faster library call B; ensures B does not change any important behaviour; writes self-congratulatory pull request."
Just write the facts and let them stand for themselves.
Life's too short for that shit. That's boring and dull and bland and insipid. Live a little, loosen up, we'll all have time to worry about "just the facts" when we're dead. Which, ironically enough, will probably happen far too soon, unless Ray Kurzweil turns out to be right about some of his crazy "life extension" ideas.
I found the article to be quite terse, well-presented and readable, compared to many (verbose, boring) blog posts that get discussed on HN ... Plus, I learned something new. No reason at all to flame the author in my opinion.
As an improvement though, you only need two independent hash functions to run your bloom filter[1]. Strangely enough, this isn't well known and as such isn't implemented anywhere near as often as it should be (i.e. it's not implemented here).
I was just talking with a colleague the other day about how to simplify the number of hash functions needed for a bloom filter. He recommended something similar to what the paper describes, but we both dismissed it as "probably won't work". Just goes to show you that sometimes the simple solution is worth closer examination. Thanks for the paper!
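For the curious, the trick in the paper is to simulate k hash functions from just two, via g_i(x) = h1(x) + i * h2(x) mod m. A minimal sketch (splitting one MD5 digest into h1 and h2 is purely for illustration; in practice you'd use two cheap hashes like Murmur with different seeds):

```python
import hashlib

def two_hashes(key):
    # Derive two 32-bit base hashes from one digest. Illustrative
    # only: any pair of independent fast hashes works.
    digest = hashlib.md5(key.encode()).digest()
    h1 = int.from_bytes(digest[:4], "big")
    h2 = int.from_bytes(digest[4:8], "big")
    return h1, h2

def bloom_positions(key, k, m):
    # Kirsch-Mitzenmacher double hashing:
    # g_i(x) = (h1(x) + i * h2(x)) mod m for i in 0..k-1.
    h1, h2 = two_hashes(key)
    return [(h1 + i * h2) % m for i in range(k)]
```

The surprising result is that this loses essentially nothing in false-positive rate versus k truly independent hashes, while cutting the hashing cost to two evaluations per key.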
Which is not to say that the general point isn't sound - MD5 was aimed at generating high-quality entropy, while most non-crypto hashes are aimed at generating good-enough entropy quickly - but don't use MD5 for crypto stuff anymore.
That has nothing to do with the context of the conversation.
The point is that it was designed to be cryptographically sound -- and therefore more heavily optimised towards entropy over performance -- whereas the need here is for the hypothetical entropy/performance slider.
It's a cryptographic MAC that's almost as fast as MurmurHash. It was designed to be used in hash tables, to protect against denial-of-service attacks from people trying to cause a lot of hash bucket collisions.
Yeah all the "yo"s had me wincing. That doesn't change the fact that this fellow is a much more experienced programmer than me, but the style did turn me off a bit. Good show overall. :)
I feel so stupid now, you have no idea. Shame on you; I've now started reading about MurmurHash instead of working. And I've been writing software for 12 years and have a CS degree.
The manual doesn't say you have to know any hashing algorithms. I think that should be the desired behavior - if your hashes work, you shouldn't need to know their internals, as long as a reputable specialized verifier and some form of peer review can authenticate their worth. You have more important stuff to be doing!
... I also read up on Murmur though, since I already know SHA and MD5. I'm a CS grad about three months out myself.
Nice optimisation, but who thought it would be a good idea to use MD5 in a Bloom filter?!
MD5 is a cryptographic hash (even though it's not secure anymore for most purposes) and while it's pretty fast, you don't need any of its crypto properties, just the properties of a good quality regular hash function. Such as Murmur, or even simply FNV.
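For reference, FNV-1a really is only a few lines. A sketch of the 32-bit variant, with the offset basis and prime constants from the FNV specification:

```python
def fnv1a_32(data: bytes) -> int:
    # FNV-1a: xor each byte in, then multiply by the FNV prime,
    # keeping only the low 32 bits.
    h = 0x811c9dc5                        # FNV-1a 32-bit offset basis
    for byte in data:
        h ^= byte
        h = (h * 0x01000193) % (1 << 32)  # FNV 32-bit prime
    return h
```

It's nowhere near as fast or as well distributed as Murmur, but it's trivially portable and plenty good for small keys, and it has none of MD5's (wasted, here) crypto overhead.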
Because it's a clear description of the motivations behind the changes without restricting the discussion to an audience of a particular skill level. Also because, when possible, things should be fun - and he seemed to have fun with the essay, so why not?
I don't think that is a clear description by any measure. It could be summarized in a couple of paragraphs. It's perhaps good for beginner programmers to learn from, so I'd encourage him to write a blog post. Pull request motivations should be to the point, especially for small changes, because otherwise they just waste time.
Meh, to each their own I guess. While it is long, that does not in any way make it unclear. In fact, he explicitly states the motivations and repercussions of each decision. To persuade the master developers, who made the "mistakes" (for lack of a better word) in the first place, this seems like a worthwhile pursuit.
The only thing that would have made this more entertaining would be if he had submitted a similar pull request to Linus..
OS X's Instruments, which is based on DTrace (from Solaris).
More pressing question for me - has anyone gotten to that kind of UI and capability (profiling userland) under Linux, with the SystemTap/DTrace port or anything else?