I've heard through the grapevine that they were able to index and serve 100 billion documents on 100 machines, which is a pretty impressive technical accomplishment if true. I'm surprised they weren't acquired for that. It's unfortunate that their search quality wasn't up to snuff yet.
How many queries per second can they handle on those nodes, and with what latency? What kind of relevancy calculations were they able to do at query time in their system with 1B documents per node? Were they able to support query-time aggregation of structured fields in their documents? Was the index stale, or did they support continuous feeding and indexing of new documents? If the latter, how well did they meet their QPS and latency SLAs while indexing new documents?
I can set up a single search node and fill it up with God knows how many documents any day, but the difference between supporting 10 QPS with ~500ms latency and 3000 QPS with the 99th percentile below 40ms is really more interesting than exactly how many documents I have per node.
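To make the distinction concrete, here's a minimal sketch of the measurement I mean. The `search` function is a hypothetical placeholder for a query against the node under test, and a real load test would issue queries concurrently rather than in a single loop; this just shows why you report throughput together with tail latency:

```python
import random
import time

def search(query: str) -> None:
    # Hypothetical stand-in for a real query call; pretends a query
    # takes somewhere between 5 and 80 ms.
    time.sleep(random.uniform(0.005, 0.080))

def measure(queries, n=200):
    latencies = []
    start = time.perf_counter()
    for i in range(n):
        t0 = time.perf_counter()
        search(queries[i % len(queries)])
        latencies.append((time.perf_counter() - t0) * 1000.0)  # ms
    elapsed = time.perf_counter() - start
    latencies.sort()
    p99 = latencies[int(0.99 * (len(latencies) - 1))]
    print(f"QPS: {n / elapsed:.0f}, p99 latency: {p99:.1f} ms")

measure(["foo", "bar", "baz"])
```

A node can look great on mean latency and still blow its SLA at the 99th percentile, which is exactly what document-count bragging hides.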
I started out before those APIs existed, and did all my own crawling & indexing. When they came out, I decided to focus on my value-adds because I thought that was a quicker path to customer acquisition.
Furthermore, I don't use Yahoo/Bing straight up, e.g. I re-rank, omit, etc. I also mix them with my own index/negative spam index from my own crawling efforts.
Any re-ranking you can do with these services is very limited because they are black boxes: you don't see what factors went into ranking a page the way it was, you can't tweak the weight of different factors, and you can't add new factors. All you can have is hardcoded rules like "if there is a Wikipedia page in the first 20 results, bring it to the top", which don't really add much value, because if that Wikipedia page is any good it would be on top already. It's similar with spam results: you can provide impressive customer service by blacklisting spam on user requests, but major search engines are already so good at down-ranking spam that you don't see it if there are any other meaningful results.
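To show how shallow that kind of rule is, here's a sketch of the Wikipedia example, assuming the black-box API returns an ordered list of results with a "url" field (a hypothetical shape, not any particular API's):

```python
def promote_wikipedia(results):
    # Hardcoded rule over black-box results: if a Wikipedia page
    # appears in the first 20 results, move it to the top.
    for i, r in enumerate(results[:20]):
        if "wikipedia.org" in r["url"]:
            return [r] + results[:i] + results[i + 1:]
    return results
```

Note that it can only shuffle an ordering someone else produced; there's no signal here you could re-weight or combine.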
Your marketing here on HN has been brilliant, and you have some very interesting UI decisions and possibilities that Google doesn't have, but your added value is definitely not improved ranking of results.
They also got a lot of marketing out of their "ex-Googlers take on Google" narrative, which probably wouldn't have worked as well if they were using something like your strategy.
I've heard this in other discussions of DuckDuckGo here, and I don't understand why Bing/Yahoo allow a potential competitor free access to data that is so important to their search businesses. What's in it for Yahoo/Microsoft? Or is DDG paying for the privilege?
"We are exploring a potential fee-based structure as well as ad-revenue models that will enable BOSS developers to monetize their offerings. When we roll out these changes, BOSS will no longer be a free service to developers."
I don't think the ordering of the results is Google's competitive advantage anymore; it's branding and habit.
I think Cuil should sell their index as a service, over which businesses could implement PageRank and similar algorithms (http://en.wikipedia.org/wiki/Pagerank#See_also).
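For instance, if the index service exposed the link graph, a buyer could run the classic PageRank power iteration over it. A minimal sketch, where the `graph` shape ({page: [outbound links]}) is my assumption for illustration, not anything Cuil actually offered:

```python
def pagerank(graph, damping=0.85, iterations=30):
    pages = list(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the teleportation share.
        new = {p: (1.0 - damping) / n for p in pages}
        for p, links in graph.items():
            if links:
                share = damping * rank[p] / len(links)
                for q in links:
                    if q in new:  # ignore links pointing outside the graph
                        new[q] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank

print(pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]}))
```

The point is that the hard part (crawling and indexing at scale) would be Cuil's product, while the ranking experiments would belong to the customer.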
Technical achievements are great, but Gabriel is much better placed: he is self-funded, he is building on existing tools (always good advice), he leveraged us, the hacker crowd, who can be very loyal, and he clearly listens to his customers, etc.
Cuil, on the other hand, produced some very confusing (if technically interesting) things and then ranted about those who criticised them. They had a lot of big bucks VC money (always a warning sign) and didn't appear to be leveraging loyalty from any user base.
Even if the problems these two startups are facing are different, there is a lesson here: one is how not to build a product, and one is :)
To me, Cuil looked like a prime example of design by committee whereas DDG is clearly opinionated but thrives because of it.
By being small, DDG can address issues that others will not even think of or bother thinking about, e.g. enhanced privacy controls, Tor utilization, etc.
I believe that Cuil was a dream that went south. Unfortunately, that dream had a hefty bill ($33m).
Strangely enough, I visited that site once before and could not put a name to the site until I saw a screenshot of it.
I once contacted Cuil about some worthless search results and got a standard reply asking me to be patient since they were a small company. But there wasn't any hint in the email that they would actually address the issue, so I drastically reduced my use of them and never bothered to contact them again. Listening to users probably would have helped, if my experience indicates a pattern.