Some of the most common icebergs are:
- form validation (seriously: one of the most heavily exercised user-interaction paths; it's all over the place, and it scales semi-exponentially with the number of fields, since cross-field rules multiply)
- search ("how hard could it be? you just put an input form there, then figure out what the user thought, then display it" - exact quote)
- anything that has to process natural language. I mean everything. Wanna split up a text into sentences? How do you differentiate between "Dr.", "Mr.", "2004. jun.", and valid sentence-enders? Generating an indefinite article ("a", "an") before a noun? Keep in mind that 1, 2, @, $, =, and other characters might also be valid first characters of a noun :) (see the sketch right after this list)
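To make the article point concrete, here's a tiny sketch (hypothetical, Python) of the "obvious" vowel-check approach, which falls apart immediately:

    # Naive indefinite-article chooser -- the "obvious" fix that isn't.
    def naive_article(noun):
        return "an" if noun[:1].lower() in "aeiou" else "a"

    print(naive_article("hour"), "hour")              # "a hour" -- wrong: pronunciation, not spelling
    print(naive_article("unicorn"), "unicorn")        # "an unicorn" -- wrong again ("yoo-nicorn")
    print(naive_article("18-wheeler"), "18-wheeler")  # "a 18-wheeler" -- wrong: "eighteen" starts with a vowel sound
    print(naive_article("$5 bill"), "$5 bill")        # "a $5 bill" -- right, but only by luck

Handling pronunciation, acronyms ("an FBI agent"), and numbers correctly basically requires a dictionary plus heuristics.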
In my experience, the best anti-iceberg pattern is a portfolio approach: for each requirement that smells like an iceberg, have a fallback plan in place, i.e. after N hours of sunk investment, execution shifts to plan B. That usually works out much better than banging away on the same problem for days.
We had made a point of asking before the project if regionalisation was ever going to be an issue and no, it would only ever be in English. Shortly after go live we were asked to regionalise everything into Chinese.
I'm still not sure what a capital letter looks like in Chinese.
Fun fact: there exists a convention stipulating a double space after a period that ends a sentence. Not that I'm advocating relying on this for any serious purposes.
In fact I just checked, and HN is honoring the two spaces, in that they are output in the actual HTML sent to your browser. And of course, yes, other trends would have ended this anyhow: "two spaces" is meaningless in a non-monospaced font, and ever more text is going proportional as the computing power necessary for that continues its steady march from "prohibitive" to "trivial". But the WWW certainly beat the corpse to death again.
It took me forever to stop hitting the space bar twice after ending a sentence.
If you want consecutive spaces in a bare HTML page, you have to use "&nbsp;", which doesn't work here because pg isn't a moron.
So, one particular solution with these performance characteristics is building a decision tree from a bunch of training data, using e.g. a maximum entropy classifier. Add some sample data from any openly available corpora (or fire up mturk and create your own), and you're pretty much done with it.
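A minimal sketch of that kind of approach, assuming scikit-learn, with made-up features and toy training data (a real system would train on a proper annotated corpus):

    # Classify whether a period is a sentence boundary -- illustrative only.
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression  # a maximum entropy classifier

    def features(text, i):
        """Context features for the period at index i."""
        before = text[:i].split()
        prev_word = before[-1] if before else ""
        nxt = text[i + 1:].lstrip()
        return {
            "prev_is_known_abbrev": prev_word.lower() in {"dr", "mr", "jun", "sec"},
            "next_starts_upper": nxt[:1].isupper(),
            "prev_word_len": len(prev_word),
        }

    # (text, index of the period, True if it really ends a sentence)
    samples = [
        ("Dr. Smith arrived.", 2, False),
        ("Dr. Smith arrived.", 17, True),
        ("It rained. We left.", 9, True),
        ("See sec. 3 below.", 7, False),
    ]
    vec = DictVectorizer()
    X = vec.fit_transform([features(t, i) for t, i, _ in samples])
    y = [label for _, _, label in samples]
    clf = LogisticRegression().fit(X, y)

    # Is the period after "Mr" a sentence boundary? (Should lean towards no.)
    print(clf.predict(vec.transform([features("Mr. Jones left.", 2)])))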
Of course, sentence-tokenization is only the tip of the iceberg :)
I spent 4 months, full-time (I think I was a bit depressed and unproductive, though, which might have had to do with the difficulty of the problem) making a goddamn "error message merging" system, where you specify some merge rules for error messages, and then said messages are "merged" efficiently at runtime (with another of my adaptations of the awesome Rete algorithm).
For instance, as a trivial example merging "Please enter your Username." and "Please enter your Password." could yield "Please enter your Username and Password."
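Something like this toy sketch, say (hypothetical rule format in Python; the actual Rete-based system described above is far more general):

    import re

    # One hypothetical merge rule: messages matching the same template
    # collapse into a single message listing all captured fields.
    TEMPLATE = re.compile(r"Please enter your (\w+)\.")

    def merge(messages):
        fields, rest = [], []
        for msg in messages:
            m = TEMPLATE.fullmatch(msg)
            if m:
                fields.append(m.group(1))
            else:
                rest.append(msg)
        if len(fields) > 1:
            merged = "Please enter your {} and {}.".format(", ".join(fields[:-1]), fields[-1])
            return [merged] + rest
        return messages

    print(merge(["Please enter your Username.", "Please enter your Password."]))
    # -> ['Please enter your Username and Password.']

The hard parts alluded to here (rule interactions, doing the merging incrementally and efficiently at runtime) are exactly what the Rete adaptation was for.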
Merging error messages efficiently with a nice, concise syntax looked SO EASY =/ I was wondering why the hell no websites (that I know of) do it, because it's a pretty obvious feature to me... Well, now I know: people don't really care that much about these things (maybe they just have low expectations), AND it's really hard to implement.
I finally made it. Right now the implementation is utter crap, with some missing features and some bugs, but the general architecture is there and works. I'll clean it up and document it within a few months, probably.
It's one of the hardest things I've ever built in programming.
Fortunately we were able to convince them it wasn't time well spent, but it would be neat.
Well worth the read: http://steve-yegge.blogspot.com/2009/04/have-you-ever-legali....
For those of us who weren't around at that time or have forgotten, ktharavaad agrees to attempt it in a grandchild post.
Now, you can put all those crazy foobar-izing things into XYZApp, and that'll work—but they should really either go into libfoo itself, or into a new library (libfoobarize) that uses libfoo.
This is the case with the example in the article: DuckDuckGo shouldn't be parsing Wikipedia to make its own abstracts. MediaWiki already creates abstracts—they're just bad abstracts. The correct thing to do, since MediaWiki is just a regular ol' FOSS project, is to write a patch that makes MediaWiki spit out good abstracts, which are then trivial to use in DuckDuckGo. Or, even better, if you know MediaWiki cares about having good abstracts, just submit it as an issue to their tracker and let them do it for you.

In other words, repeat the programmer's litany to stave off NIH: "It's not my job. I shall buy, not build. 80% of the features at 20% of the cost. Don't ask a question, send a message. No god-objects. Encapsulate, encapsulate, encapsulate."
Note that, of course, there are cases where there really is no libfoo—but then you're doing something totally new, and you can tell the client right up-front "no one's ever done this before, so we have to schedule time for R&D before we can even tell you how much time this feature will take."
There is also the case where the only libfoo/libfoobarize is a proprietary one used by the people you're trying to steal market-share from by implementing this feature, in which case you can tell your client "we know it's possible, but we don't know how long it took them to build it. What we do know is that no one else has yet copied them, which means that foobar-ization isn't trivial. It'll probably take a while."
You wouldn't believe how many times I've been in a discussion about something and the other person has said "oh that's easy to do" or "it can be done in a few hours", when in fact, if they were to go into the details, they would see the devil hiding there...
Favorite occurrence: "You just need to build a state machine." Yeah, saving the whole browser-side state of an application (did I mention third-party GUI components?) and re-establishing the server-side session state to match it is really easy with this piece of sage advice.
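For scale: the "sage advice" part is roughly this much code (a toy sketch); everything it leaves out is the iceberg:

    from enum import Enum, auto

    class State(Enum):
        FORM = auto()
        REVIEW = auto()
        DONE = auto()

    # "Just build a state machine" -- this is the whole of that advice.
    TRANSITIONS = {
        (State.FORM, "submit"): State.REVIEW,
        (State.REVIEW, "back"): State.FORM,
        (State.REVIEW, "confirm"): State.DONE,
    }

    def step(state, event):
        return TRANSITIONS.get((state, event), state)

    print(step(State.FORM, "submit"))  # State.REVIEW
    # Serializing the browser-side state of every third-party widget and
    # restoring the matching server-side session is left as an exercise
    # (i.e., the iceberg).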
Naturally, there's an uncountable infinity of ways that comes out in reality, but Code Icebergs are a common one.
Freebase (http://freebase.com) isn't bad, either.
"DBpedia is derived from Wikipedia and is distributed under the same licensing terms as Wikipedia itself"
Please correct me if I'm mistaken!
Not to be rude, but I doubt it.
Twitter currently has 175 million users. Estimates in 1999 put the online population of the entire internet at 259 million, with 110 million in the US.
In 1999, I imagine Yahoo and maybe a couple of other sites (Microsoft/Excite/AOL/Lycos?) were getting traffic numbers similar to what Twitter sees today. BUT the scaling is very different, because Twitter requires fan-out of messages, which none of those sites did.
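Roughly, fan-out-on-write means one post becomes one timeline write per follower. A hypothetical in-memory sketch (real systems shard this across many machines and special-case high-follower accounts):

    from collections import defaultdict, deque

    followers = defaultdict(set)                        # user -> who follows them
    timelines = defaultdict(lambda: deque(maxlen=800))  # user -> recent home timeline

    def follow(who, whom):
        followers[whom].add(who)

    def post(author, text):
        # One write by the author turns into N writes, one per follower.
        for f in followers[author]:
            timelines[f].appendleft((author, text))

    follow("alice", "celebrity")
    follow("bob", "celebrity")
    post("celebrity", "hello world")
    print(timelines["alice"][0])  # ('celebrity', 'hello world')

A portal serving N pageviews mostly reads; here, every single post multiplies into writes proportional to follower count, which is a very different load profile.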