Hacker News

Many people like DuckDuckGo for good reasons but let's be clear:

It does not do any of its own indexing.

It's just a frontend to other very, very expensive backends that have millions of dollars behind them.

The entire company can be shut down overnight if its data feeds are cut.




According to DuckDuckGo's FAQ, you are wrong: http://help.duckduckgo.com/customer/portal/articles/216399-s...

The DuckDuckBot crawls and indexes the web. http://duckduckgo.com/duckduckbot.html


I suspect it's trivial at best.

Every test search I've ever done on DDG shows near-identical results to Bing.

I'd like to see a search that uses its own data. Examples?

Gigablast was the last serious third-party backend that had a chance for independent data. It's like old-school Google.

Gabriel should try to buy Gigablast and merge it with DDG so he has his own independent dataset.


Blekko has millions behind it and does a full index of the web.


Ah now that is an interesting engine.

Checking to see if it's hit any of our sites.

3k pages, not bad. Data is kinda stale though.


Gigablast's creator has now made this: http://procog.com/ and the results are better than Gigablast's.


Has anyone ever seen DuckDuckBot hit their site? I don't have any web property large enough to appear via DuckDuckBot, but maybe someone else does? I'm fairly certain it crawls Quora, per this tweet: https://twitter.com/yegg/status/33693491838066688


It's not just you. Maybe he only crawls the Alexa top 10k or some similar minimal set.


I checked my logs and there are several fetches from 72.94.249.37 and 72.94.249.38, across a number of domains that I host. None are particularly popular as far as the greater internet is concerned; one is a semi-private site that I set up for my daughter's photos, another is one that has not yet been developed, apart from a few words of text and an image.

Interestingly, the fetches do not have a user-agent that identifies itself as the DDG crawler:

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)

I'm assuming this is the crawler because it does not fetch anything besides text/html.
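For anyone who wants to repeat this check on their own logs, here is a rough sketch of the kind of filter I mean. It assumes the Apache/nginx "combined" log format, and the function name `summarize` is just made up for illustration; the two IPs are the ones from my logs above and may well have changed.

```python
import re
from collections import Counter

# Assumption: Apache/nginx "combined" log format. Adjust the regex otherwise.
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# The two addresses seen in my logs; treat them as a starting point only.
SUSPECT_IPS = {"72.94.249.37", "72.94.249.38"}


def summarize(lines, ips=SUSPECT_IPS):
    """Count user agents and requested paths for hits from the given IPs."""
    agents = Counter()
    paths = Counter()
    for line in lines:
        m = LOG_RE.match(line)
        if m is None or m.group("ip") not in ips:
            continue  # unparseable line, or not one of the suspect hosts
        agents[m.group("agent")] += 1
        paths[m.group("path")] += 1
    return agents, paths
```

Feed it an open log file, e.g. `summarize(open("access.log"))`, and look at which paths come back. If only HTML pages show up and never images or CSS, that is consistent with a crawler, though it does not prove whose crawler it is.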


That's interesting.

Gabriel, does DuckDuckGo's crawler have a distinct user agent? Can you talk more about how DuckDuckGo observes/respects robots.txt?
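In the meantime, here is a quick sketch for checking what a given robots.txt would permit for a bot that does identify itself. It uses Python's standard `urllib.robotparser`. The `DuckDuckBot` token and the example rules are assumptions on my part, and whether the crawler actually honors them is exactly the open question.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; the "DuckDuckBot" token is an assumption.
ROBOTS_TXT = """\
User-agent: DuckDuckBot
Disallow: /private/

User-agent: *
Disallow:
"""


def allowed(agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt permits `agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)
```

A fetch from the crawler's IPs for a path where `allowed(...)` is False would suggest the rules are being ignored, but only if the crawler matches that user-agent token in the first place.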


Follow-up to this (it's been a week, so probably nobody will see this):

http://techzinglive.com/page/1028/179-tz-interview-gabriel-w...

Around the 70-minute mark, Gabriel mentions that DuckDuckBot is mostly about determining if pages are spam.


No, it crawls non-Alexa sites too. I host two sites, one PR4 with Alexa rank 1.6+ million and another with PR3 and no Alexa rank. Only the latter was accessed in the last year according to my server's access logs (and only with a single "GET /" request).


Try searching for something using their API, and compare with web search. There is a world of difference between the search results. IIRC they aren't allowed to share results from some of their other sources via their API, so you can get a good idea of how much they take from other sources and how much from their own index.

But that is a non-issue as far as I'm concerned. As long as the results are relevant and they got them legally, then who cares where or how they came from?
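If you want to put a number on that comparison rather than eyeballing it, here is a minimal sketch. It assumes you have already collected the two lists of result URLs yourself (by hand or however you like); it does not call any search API, and `normalize` and `overlap` are just names I picked.

```python
from urllib.parse import urlsplit


def normalize(url):
    """Reduce a URL to host + path so trivial differences don't count."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.rstrip("/")


def overlap(results_a, results_b):
    """Jaccard similarity of two result lists: 0.0 is disjoint, 1.0 is identical."""
    a = {normalize(u) for u in results_a}
    b = {normalize(u) for u in results_b}
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)
```

This ignores ranking order, so two engines returning the same ten links in a different order would still score 1.0. It is a crude measure, but enough to tell "nearly identical" from "a world of difference".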


I've asked Gabriel specifically and he said that they use their own crawler/indexer.


Their results are nearly identical to Bing in every case I've tried.

Let me know if you find a search result that is different.


They definitely use Bing as their backbone as Gabriel has said somewhere before.



Those aren't actually website results. Those are links to entries for the single word on other sites, followed by Bing results.

How often do you do single-word searches in the real world?


Both are search/website results. "Links to entries on other sites" are exactly that, and DDG's special sauce is presenting third-party results contextually.

As for "followed by Bing results", there is a single one that is the same as Bing's. The results that follow are also far from similar (this is exacerbated by Bing insisting on giving me local results regardless of relevance, which DDG purposely avoids).

And yes, I do single-word searches very often, but if you're so inclined, here are results for a longer search: http://cl.ly/image/0m1E3I3J0M1U (DDG shows Hulu, tv.com, CTV, Amazon, and doesn't repeat the Wikipedia entry)


DDG is really only displaying definitions, aside from hacker.org. Given that this is a dictionary "word", I'm not sure how much of this result is really due to indexing and how much is due to the query being identified as a single word. Because search engines apply so much query-specific customization (context-dependent results), it's hard to identify what results would have been returned from their raw indexes.

If you adjust the term to "hacker movie", you get more similar results between DDG and Bing. But, overall, it does seem like DDG is returning more differing results now than in the past.


But they are using legitimate means to provide search results, and feed/API providers are not shutting them out.

The same argument can be made about Google: they do not produce their own data, they just copy webpages from content providers. If content providers decided to block them from indexing their websites, Google would be irrelevant.


Look at the Twitter token fiasco for an example of what happens when you build your business model on someone else's API.


its



