DuckDuckGo Architecture - 1 Million Deep Searches a Day and Growing (highscalability.com)
202 points by orrsella on Jan 28, 2013 | 126 comments



Good god, the comments in this thread really show how douchey the HN community can be.

Tons of anecdotal stories with no evidence. Obvious outsiders trying to imagine what it is like on the inside (all the guesses about how DDG works), and of course no facts to back anything up.

I would say this pandering and uninformed behaviour is not common on HN, but it is.

People who engage in language flame wars simply do not understand that when you become a good programmer, languages do not matter; only the platforms matter.


Leaving aside one trolly user, most of the conversation here is either referencing data mentioned in the article or asking questions/having discussions about semi-related points.

I'd hardly call the argument over language speed a 'flame war' by any internet standard I've ever seen.

There are probably better examples than this one to pull the 'HN commenters suck' card. The conversation here is mostly civil and focused on learning things, one troll aside.


Agree. The choice of Perl genuinely does surprise people. As mentioned elsewhere, even Blekko uses Perl. So obviously these people know what they are doing.

But that said, choosing anything interpreted does come back to haunt you later on, in one way or another.

We run Java on Jetty for most things in our app (traffic is around 60k/70k a day, so much less in comparison). But even with this traffic, we need to use 2 large EC2 instances in the daytime. And mysteriously the Jettys keep running out of memory every now and then.*

I am sure the same thing done in C++ would need only 1 large EC2 instance. It would also help latency a bit, as a parallel gain. At present I am analyzing the cost/benefit of such a move. Inputs are welcome.

* With Java it's always the memory that hurts you first. Latency-wise, not much of a difference in most cases.

Edit: Down vote? Surprised. Why??


> People who engage in language flame wars simply do not understand that when you become a good programmer, languages do not matter; only the platforms matter.

Guess you're building the next killer app in brainfuck?


Brainfuck really does not have much going for it as a platform. For an example of what I mean, check out Jeff Atwood's article: http://www.codinghorror.com/blog/2007/04/reddit-language-vs-...


Surprised to hear that crawling still runs out of the basement. For a general-purpose search engine, what kind of bandwidth does that require? Or is it manageable only because DDG proxies so many searches to other search engines?

Also, doesn't it seem inefficient to have crawlers here and indexes there?


I run a couple of boxes that are crawl-only for Nuuton right out of my office. I do have a good internet connection, but it's a good way to save on infrastructure.


I don't suppose you have a blog or something for that? My hobby/interest is search engines and I love reading about people's experiences getting them running.


There is one, but I have been so busy building it that I have not posted anything yet. You can always get in touch through email. I like chatting with other hackers.


That's a pity. Just for the record, for anyone looking at this space, here are some links to get started:

  http://www.yioop.com/blog.php
  http://www.gigablast.com/rants.html
  http://queue.acm.org/detail.cfm?id=988407
  http://blog.procog.com/
  http://www.thebananatree.org/
  http://blog.blekko.com/
I will even suggest my own small implementation (created purely for SEO value, but it does work):

  http://searchco.de/blog/view/code-for-a-search-engine-in-php-part-1/
BTW email sent.


These are all great links to read if you'd like to dig deeper into search engine land.

Thanks for the searchco.de articles, they were nice to read (Saw them in the previous searchco.de HN) :-)


Thanks. I collect any that come up since it's really such a dark area. I should probably write a blog post about it, since I have a few more I have since dug up. Glad you like the articles :) It was something I had been writing for weeks and then finally got my act together and finished.


I believe yegg has a pretty awesome internet connection in his basement: http://www.gabrielweinberg.com/blog/2011/12/duckduckgo-used-...


DDG is using a dynamic language like Perl[1], which is the slowest of all languages, to achieve high scalability. This proves languages don't matter much; it's all about architecture.

I guess it's time to stop worrying about the performance of your programming language and start building better, highly scalable architectures.

[1] Slowest of all languages http://benchmarksgame.alioth.debian.org/u32/which-programs-a...


Those benchmarks are extremely suspicious. I remember comparing Perl and Python for bioinformatic work (non-trivial computational workloads), and finding that Perl was about 2x the speed of Python, on average.

Later, I did similar, non-trivial benchmarks with Python and Ruby, and found a similar, 2x factor. Ruby has improved since then, but unless Perl has become dramatically slower in the same interval, I suspect that these benchmarks are either trivial (i.e. simple loops), or badly written.


>> ... extremely suspicious ... I suspect ...<<

Please - less FUD and more looking at program source code (which is 2 clicks from the URL you were given).


If you want to debug the metrics they're using, you're more than welcome to do it. I've got plenty of direct experience to question the results shown by a random webpage on the internet, and not much incentive to figure out why this particular set of benchmarks looks wrong.

A quick inspection of the tests suggests a high enough bogon count that it already begins to confirm my suspicions: there's a "fasta-redux" test for Perl which dramatically outperforms the "fasta" test that's implemented cross-language. Also, many of these tests are written in least-common-denominator style, which doesn't reflect the way the languages are actually used.


Nope -- The Perl fasta program is 109x slower than the C fasta program; and the Perl fasta-redux program is about 112x slower than the C fasta-redux program.


Now you're just being deliberately obtuse; the Perl redux program is almost 30% faster than the Perl fasta program.


Now you're just being deliberately insulting, to avoid answering the obvious question -- what change was made to the algorithm?


Your benchmarks are likely very biased. The vast majority of the work done by your bioinformatics programs should be done in C code, with thin wrappers so that the code can be accessed through Perl and Python. If anything, your benchmarks are only comparing specific implementations, such as BioPerl vs BioPython.


I wasn't using bioperl or biopython. I wrote the algorithms myself.


1 million searches per day is not really that much. If evenly distributed, it's only 11.6 searches per second.


We also do about 12M API requests per day: http://duckduckgo.com/traffic.html


The interesting measure is the peak rps.

EDIT: I mean, ~150 rps doesn't seem that much, but I'm sure the requests are not uniformly distributed, so the system should be able to handle much more than that within reasonable latency bounds.


Indeed. Peak RPS is what is going to be interesting. Depending on where traffic is from, it's very possible to have 12 million requests in a day coming only during 9-to-5 type hours, with a big empty space during the night (this is the kind of traffic my company sees).


> within reasonable latency bounds

Have you used DDG? Requests regularly take 500-600ms for me. Compare to Google's instant search and excellent latency for non-instant queries... which handle orders of magnitude more traffic.


I'm not a DDG user, but it's easy to measure:

    $ time I=$((I+1)) curl -s  "https://duckduckgo.com/d.js?q=test&t=A&l=us-en&p=1&s=0" >/dev/null
    I=$((I+1)) curl -s "https://duckduckgo.com/d.js?q=test$i&t=A&l=us-en&p=1&s=0"  0.01s user 0.00s system 3% cpu 0.256 total
I saw it go over 800ms a couple of times; anyway, I don't intend to run a DDG benchmark.

However, just for the sake of the discussion about architecture and language choice, the interesting part would be to see how many rps it can handle before starting to degrade. There is no point bashing it and comparing it to other search engines that have more resources behind them.
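
If you'd rather repeat the measurement from Perl, here is a minimal sketch (assuming LWP::UserAgent plus LWP::Protocol::https for the https URL, and the same d.js endpoint quoted above; it only times sequential requests, it is not a load test):

  #!/usr/bin/perl
  use strict;
  use warnings;
  use LWP::UserAgent;
  use Time::HiRes qw(gettimeofday tv_interval);

  my $ua = LWP::UserAgent->new(timeout => 10);

  # Fire a handful of sequential queries and report per-request latency.
  for my $i (1 .. 10) {
      my $t0  = [gettimeofday];
      my $res = $ua->get("https://duckduckgo.com/d.js?q=test$i&t=A&l=us-en&p=1&s=0");
      printf "request %2d: HTTP %s in %.0f ms\n",
          $i, $res->code, 1000 * tv_interval($t0);
  }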


Great article, thanks. Could you please share the things you did that made direct searches jump between Jan 2012 and Mar 2012?


Interesting. Any public big users of the API?


FastestFox is the biggest.


Serious question: at what volume of traffic do you plan to close it down or start charging for the API?


DDG is mostly a front end for a bunch of search engines. Perl, PHP and Python are great for making front ends like that, because for the most part you are only dispatching queries, parsing XML feeds and building the GUI for the end user.

Writing an internet-scale search engine, on the other hand, would require different tools. Most successful projects have probably been done in C, C++ and maybe Java.
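
To make the "dispatch, parse, render" point concrete, here is a minimal sketch of such a thin front end in Perl. The upstream endpoint and the results/url/title field names are hypothetical placeholders for illustration, not DDG's actual backend; it assumes LWP::UserAgent, JSON::PP, URI::Escape and HTML::Entities.

  #!/usr/bin/perl
  use strict;
  use warnings;
  use LWP::UserAgent;
  use JSON::PP;
  use URI::Escape qw(uri_escape);
  use HTML::Entities qw(encode_entities);

  my $query = shift // 'duck duck go';
  my $ua    = LWP::UserAgent->new(timeout => 5);

  # 1. Dispatch the query to a (hypothetical) upstream JSON API.
  my $res = $ua->get('https://upstream.example/search?q=' . uri_escape($query));
  die 'upstream error: ' . $res->status_line unless $res->is_success;

  # 2. Parse the feed.
  my $data = JSON::PP->new->decode($res->decoded_content);

  # 3. Render a result list for the end user.
  for my $hit (@{ $data->{results} || [] }) {
      printf qq{<a href="%s">%s</a><br>\n},
          encode_entities($hit->{url}), encode_entities($hit->{title});
  }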


That page shows Perl as being roughly 5% slower than Ruby; however, when comparing Ruby and Perl directly, things look significantly different:

http://benchmarksgame.alioth.debian.org/u32/benchmark.php?te...

So I have to say I am quite suspicious of the chart on the page you linked.


Take away those last three benchmarks (spectral-norm, fasta & binary-trees) and things do look quite different.

For example, here is "Which programs are best" with those benchmarks removed: http://benchmarksgame.alioth.debian.org/u32/which-programs-a...

This shows Perl higher up the chart, with it being 49% better (on average) than Ruby.

Some other things to note:

1. Perl has never been fast with binary-trees. But Ruby 1.8 is even worse (I recall it being many times slower than Perl at this). So hats off to the Ruby 1.9 VM guys, because they've turned things around and now it's even outpacing Lua on this benchmark! - http://benchmarksgame.alioth.debian.org/u32/performance.php?...

2. Fasta in Ruby may be twice as quick as Perl, however it uses over 100 times more memory! This is because it's inlining some code (eval), giving it a huge speed bump at the expense of a bigger program footprint (see the toy sketch at the end of this comment). If I do a port of this to Perl then this fasta.pl runs 2.6 times faster than alioth's Perl version (which now means the Ruby version is about 30% slower than its direct equivalent Perl version).

3. My idiomatic make_repeat_fasta subroutine is actually a little slower than the one in fasta.pl on alioth. So both my fasta.pl & alioth's fasta.rb programs could be sped up more :)

4. I may even be able to shave a little bit more off fasta.pl's time. The same might be true for spectral-norm & binary-trees, however the Perl versions aren't showing up on the alioth site at the moment :( ... IIRC, when I last looked at the Perl binary-trees code on alioth a couple of years ago I was able to shave 10% off.

ref: My perl port of fasta.rb on alioth - https://gist.github.com/4675254
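
For anyone wondering what the eval "inlining" trick in point 2 looks like, here is a toy sketch (my own illustration, not the actual fasta code): generate a specialised subroutine as a string and eval it, trading a bigger code footprint for fewer operations per call.

  #!/usr/bin/perl
  use strict;
  use warnings;

  my @weights = (0.27, 0.12, 0.61);   # cumulative-probability style lookup

  # Generic version: walks the array on every call.
  sub pick_generic {
      my $r = shift;
      my $i = 0;
      for my $w (@weights) { return $i if ($r -= $w) < 0; $i++ }
      return $#weights;
  }

  # Specialised version: the thresholds are baked into generated code.
  my $src = "sub pick_inlined { my \$r = shift;\n";
  my $acc = 0;
  for my $i (0 .. $#weights) {
      $acc += $weights[$i];
      $src .= "  return $i if \$r < $acc;\n";
  }
  $src .= "  return $#weights;\n}\n";
  eval $src;
  die $@ if $@;

  print pick_generic(0.5), " ", pick_inlined(0.5), "\n";   # both print 2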


>> Take away those last three benchmarks... <<

What if we are selective with the evidence a different way: what if we take away k-nucleotide and pi-digits and reverse-complement :-)

>> My perl port of fasta.rb <<

http://benchmarksgame.alioth.debian.org/play.php#contribute


On the contrary my dear igouy, specialisation can be key and knowing which tool in my toolset is best for different tasks is indeed insightful & helpful ;-)

re: contribute - Looking on my hard disk I see that I downloaded the bencher/shootout-scm back on 1st Jan 2010. However, IIRC, the process of contributing code back was a bit unwieldy. If some free tuits come my way then I may take another look at it.


Contributing code is a simple matter of attaching a complete tested source code file to a tracker item ticket. Really not difficult.


Looking at my keychain I see I have three logins for alioth... draegtun, draegtun-guest & draegtun_guest... all created on 1st Jan 2010.

So it looks like I had issues logging (back) onto Alioth at that time :(

Anyway, it's resolved now, because I see that draegtun-guest does work for me :)

It's a little convoluted, and the contribute notes don't initially match what you see at login, but after rummaging around I found the required tracker and have now submitted the faster fasta.pl.


Frankly, I'm not convinced there's any point in investing time on improving the benchmarks there, considering the overall comparison directly contradicts the detailed comparison by putting Perl in a worse position than languages it outperforms.


I think it might well be worth it. My improved fasta.pl is now on Alioth, and (I believe) this one change moved Perl above Ruby & also Python on "Which programs are fastest" - http://benchmarksgame.alioth.debian.org/u32/which-programs-a...

Here's a link to my fasta.pl on Alioth - http://benchmarksgame.alioth.debian.org/u32/program.php?test...

NB. For posterity here are the bottom five on this benchmarks at this moment in time:

  PHP       40.49
  Perl      51.96
  Python 3  55.45
  Ruby 1.9  62.40
  Mozart/Oz 74.77
Previously Perl was second bottom with something like 69.3


Replying to myself because I want to add something interesting (to posterity) that I noticed today on Alioth:

Perl (or the OS) was upgraded from 5.14.* to 5.16.2. From a cursory glance this gave all the Perl benchmarks a little boost. For example, my fasta is about 3 secs quicker and the "interesting alternative" fasta dropped below the 2.0 barrier (now timed at 1.96).

However on the summary Perl slowed down a few points (here's the new bottom five on u32 single-core benchmark):

  Lua       31.09
  PHP       40.49
  Perl      54.90
  Python 3  55.45
  Ruby 1.9  62.40
I think the drop is because the Perl pidigits benchmark is now failing. The Math::GMP module can't be found. Pretty sure this wasn't a core module, so perhaps a Perl dependency has been removed in the OS (Debian).

PS. This may be a temporary glitch, so that dependency may be restored soon. If not, then I may amend the pidigits benchmark accordingly.

PPS. I see that the Python pidigits is using gmpy and is working fine. This means that GMP is installed (as is the gmpy Python library), so it's just the Math::GMP Perl module that's missing :(


You don't seem to understand what the overall comparison shows.

Are you familiar with descriptive statistics? Quartiles? Box plots?


> You don't seem to understand what the overall comparison shows.

I told you I don't. It looks entirely nonsensical. I asked you for clarification. So far your only response has been to parrot my saying that I don't understand why your data representations are dissonant.

> Are you familiar with descriptive statistics?

Possibly under another name, but I don't know what you mean when you say that.

> Quartiles?

In theory yes, but I am unsure how you're applying it here, since we're not talking about binnable quantities.

> Box plots?

Yes.


>> I told you i don't. It looks entirely nonsensical. I asked you for clarification. So far your only response has been to parrot my saying that i don't understand why your data representations are dissonant. <<

It would have been better if you had said -- "English is not my primary language and especially English maths are hard for me to grasp." -- instead of saying "it certainly seems deceptive".

You say you are familiar with box plots, so you should have no difficulty understanding what the box plot shows - the Perl and Ruby programs have very similar performance when compared to the fastest programs.

"Visual Presentation of Data by Means of Box Plots"

http://www.lcgceurope.com/lcgceurope/data/articlestandard/lc...


Those two sentences are not a contradiction. I may not be good with reading English descriptions of math, but I am good with applied math. The calculations I did with your numbers disagree with what your graph showed. So to me the graph seems deceptive. There is no contradiction in this.

Further, if you show me the actual calculations done, I will understand it perfectly fine. Yet you refuse to do so. I do not understand why, and I hope you can understand how that makes me even more distrustful.

On the graph on the overview page, Perl was shown to significantly outperform Ruby in a number of benchmarks, yes, I could see that. Yet the median of Perl was still set higher than the median of Ruby, which could possibly be explained by Perl also being outperformed significantly in one benchmark, but which was not supported by the actual direct comparison numbers.

So I ask again: please show me the actual calculations performed to arrive at the median values shown in the overview graph.


>> but i am good with applied math <<

Really? http://news.ycombinator.com/item?id=5141025

>> Please show me the actual calculations performed to arrive at the median values shown in the overview graph <<

http://anonscm.debian.org/viewvc/benchmarksgame/benchmarksga...


Okay, I worked it out, no thanks to you. Actually, I fucking worked it out IN SPITE of you. All your condescending hints and links and such were entirely bullshit and did not even remotely lead in the direction of explaining why the data seems dissonant. They were flat out orthogonal to the entire problem.

The important thing, which you did not bother to point out here even once, is that the comparisons on the overview page are done against the fastest programs of all languages, thus weighting the results by a factor that is simply not present when one language is compared directly against another.

So, alright, the graphs do entirely make sense.

Would you be open to a patch that reworks the language vs. language comparison pages in such a manner as to make this relationship obvious?


>> The important thing <<

Is stated in plain sight - twice - on the overview page.


That still does not change the fact that your presentation of the data is not consistent across all its parts, namely the language comparison.


I know it can be easy to mistake that, but I said calculations intentionally. I was not asking for the code.


Are you suspicious that you may not have understood what is shown? Do both show the same thing?


In fact, I know I do not understand how the table on this page is generated:

http://benchmarksgame.alioth.debian.org/u32/which-programs-a...

That is precisely why I am suspicious. I cannot say for a fact that it is deceptive, but it certainly seems deceptive.

On that page Ruby is being shown as 10% faster than Perl. Yet on the direct comparison page things look quite different:

http://benchmarksgame.alioth.debian.org/u32/benchmark.php?te...

On that page, for all benchmarks that can be compared, Perl has used an overall time of 9255 seconds, while Ruby has used an overall time of 10662 seconds. As such Ruby is actually 10% slower than Perl.

Where does this difference come from?


>> but it certainly seems deceptive <<

You go too far -- your lack of understanding is simply your lack of understanding ;)

What are you told the table shows?

>> Where does this difference come from? <<

Check the same thing for 2 other language implementations where the arithmetic should be easy. For example, Java median 2.04 and Smalltalk median 21.22 -- the direct comparison shows 11x as the rounded median of the Smalltalk/Java program times.


So, basically: Because Perl is considerably slower in one single comparison, even though it is faster in 7 others, it gets judged as slower overall?

Seems like your graphs up top in the language-versus-language comparison need to be reworked to make it clear how big the difference between 3x and 1/3 is in reality, because right now it is deceptive.


>> Because Perl is considerably slower in one single comparison <<

What are you looking at? Perl is shown slower on 3 tasks.


Note the word "considerably". There is only one single task in which Perl takes a considerably longer time than Ruby.


Another addendum: I'm not sure I'm getting what's happening here.

If I calculate the average of the time Perl took divided by the time Ruby took, I get this:

((226/724)+(5.35/16.8)+(3/9)+(2750/3960)+(1120/1368)+(3236/3837)+(30.5/35.8)+(939/618)+(263/135)+(662/214))/10 = 1.07

Which I understand to mean that Perl on average took 7% longer.

-----

However, if I turn this around I get:

((724/226)+(16.8/5.35)+(9/3)+(3960/2750)+(1368/1120)+(3837/3236)+(35.8/30.5)+(618/939)+(135/263)+(214/662))/10 = 1.58

Which I interpret to mean that Ruby took, on average, 58% longer.

------

These things contradict, so I made some mistake here. Can you clear up what I should've been doing?


>> i made some mistake here <<

The table you don't understand shows the median but you seem to be calculating the arithmetic mean.


OK, I don't get it. Can you show me the formula I should be using?

Or even better: try to explain in detail why on the overview page Perl is claimed to be slower than Ruby, when in a direct comparison it is not.


You don't know how to calculate the median?

http://www.robertniles.com/stats/median.shtml


English is not my primary language and especially English maths are hard for me to grasp. That's why I am asking you to demonstrate, using the actual numbers for Ruby and Perl, what calculations should be performed to arrive at the numbers your site is showing.

In addition, after reading your link, the situation seems even worse: using the median, Perl outperforms Ruby by ~15%, but the main site does not reflect that at all.
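
For what it's worth, here is a minimal sketch (using the ten Perl/Ruby ratios from my earlier comment) of the difference between the arithmetic mean and the median:

  #!/usr/bin/perl
  use strict;
  use warnings;
  use List::Util qw(sum);

  # Perl-time / Ruby-time ratios from the benchmark numbers quoted above.
  my @ratios = (226/724, 5.35/16.8, 3/9, 2750/3960, 1120/1368,
                3236/3837, 30.5/35.8, 939/618, 263/135, 662/214);

  my $mean   = sum(@ratios) / @ratios;
  my @sorted = sort { $a <=> $b } @ratios;
  my $mid    = int(@sorted / 2);
  my $median = @sorted % 2 ? $sorted[$mid]
                           : ($sorted[$mid - 1] + $sorted[$mid]) / 2;

  printf "mean   = %.2f\n", $mean;    # ~1.07, dragged up by the one 3x outlier
  printf "median = %.2f\n", $median;  # ~0.83, i.e. Perl ahead on the typical task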



Also see: Which programs are best? - http://benchmarksgame.alioth.debian.org/u32/which-programs-a...

One useful metric change is to add "memory (usage)" weight - http://benchmarksgame.alioth.debian.org/u32/which-programs-a...

Based on this we should all be using Pascal ;-)


Free Pascal statically links programs by default, avoiding libc.


Perl is not even the slowest of all the language implementations shown on the web page you reference.


This is simply not true. Not every problem can be solved with caching. If DuckDuckGo were a real search engine, they would not touch Perl with a 10-foot pole for any non-trivial algorithm/function.

edit: fix the name


Uh, hello!? Blekko is a web scale search engine that is built using Perl.

It's not the language that makes handling big data slow. It's the algorithms and how you move the bits around.


Well, sure, if you want to use 1000 servers instead of 100, go ahead and use it. No wonder Facebook noticed this and is desperately trying to compile PHP to C. Perhaps if Blekko ever gets popular we will see if their choice will bite them.

Also an anecdotal example: two months ago a friend of mine wrote a rather complex algorithm in Perl (he is very fluent with it). It was a novel sentence alignment algorithm using Gibbs sampling. The algorithm was hard to parallelize and not cache-friendly. He needed to wait more than a day to train the system on a server. Well, long story short, with his help another developer converted it to Java in a short time and it ran around 200 times faster. So there. Moving bits around did not help Perl here at all.


>>Blekko ever gets popular we will see if their choice will bite them.

What choice?

The only reason Blekko or Facebook is even able to launch and turn things around quickly enough to survive the competition is that they opt to use dynamic languages like PHP and Perl. If they started doing their projects in C, with the current growth rate of complexity they would never finish their projects.

>> No wonder Facebook noticed this and is desperately trying to compile PHP to C.

Dynamic languages aren't slower because they are not C. They are slow because they do a lot of magic. C is fast because the magic is left for you to perform. I can't see how anyone could pull pace out of compiling PHP to C directly, unless they sacrifice things along the way, which really defeats the purpose of using PHP in the first place.

>> Well, long story short, with his help another developer converted it to Java in a short time and it ran around 200 times faster. So there. Moving bits around did not help Perl here at all.

The number of programmers who even need to hear the word 'bits' in their day-to-day activities (talking of application programmers) is small enough to make their case totally exceptional.

Besides, there are places where C makes perfect sense. There is hardly any other language heard of in the embedded programming world.


Is the backend/search kernel in Perl too?

I wrote the first prototype of Boitho (a Norwegian internet search engine, now defunct) in Perl in 2000. That did not work at all because memory access and sorting were so slow. We had to write the next version in C. Of course a lot has happened with Perl in the last 13 years, but a search kernel in Perl still sounds odd to my ears.


Granted, sometimes flipping bits is important. Then you write that part in C.

There is no reason to pay the development tax of C for everything else. Not when you can build it in Perl, get it working, and then find the hotspots and convert those to C.
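
As a rough illustration of that workflow, here is a minimal sketch assuming the Inline::C module from CPAN and a made-up hotspot; it's just a sketch of the "rewrite the hot spot in C" idea, not anything DDG actually does:

  #!/usr/bin/perl
  use strict;
  use warnings;

  # A made-up hotspot, found by profiling, rewritten in C via Inline::C.
  # Everything around it stays ordinary Perl.
  use Inline C => q{
      int checksum(char *buf) {
          int sum = 0;
          while (*buf) sum = (sum * 31 + *buf++) & 0x7fffffff;
          return sum;
      }
  };

  print checksum("profile first, rewrite only the hot spot"), "\n";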


> There is no reason to pay the development tax of C for everything else. Not when you can build it in Perl, get it working, and then find the hotspots and convert those to C.

Certainly not with fewer than 20 million req/day.


Uhm, no. Google can't be implemented in Perl.


Modern-day software architecture is too complicated to run on one and only one programming language.

But regardless of that, Perl continues to power some very serious work in some very important places all over the world. And that is not likely to change any time soon, no matter how many web frameworks get written in PHP, Ruby or Python.

The reason for that is that Perl has little or almost no competition in the niche it occupies. And anything that is likely to be invented to replace Perl will by and large (like 99%) look like Perl (read: Perl 6 or whatever). This being the case, we are likely to be using Perl (or a Perl-like language) very far into the future.


I've gotta say, I love DDG, I rarely ever use Big-G for anything anymore. If I had to guess, it saves me a couple hours a week!


Just curious, as a non-DDG user: what features are you using that are saving that much time?


Sometimes I get really frustrated with Google when doing programming-related searches. After spending a minute trying to coerce Google into recognizing an API call exactly and having to exclude popular typos and alternate spellings of what Google thinks is the root word, I end up switching to DDG and getting the answer I was looking for. Google is awesome for general queries, but lately I'm finding more and more that it tries to be too helpful and the search results for my precise, specifically crafted query end up being so broad that it's useless.


To make Google search for the exact word you type, put it in quotes.


That is the coercion I mentioned. I wish I could provide an example, but there have been times when I have had to both "quote" the terms I really want and -"quote" the terms it corrected things to. It usually only happens for longer queries when I'm trying to drill into a specific issue.

Come to think of it, perhaps this is their feature which uses previous searches and treats your current search as a continuation of the last one; though I've altered my query, the search terms I abandoned get resurrected.

[Edit: confirmed!] I just tried this in relation to another comment I posted. I typed peer to peer lending into Chrome's bar to search for this. A bunch of finance related results came up. I was looking for movie lending, so I altered my search to peer to peer lending movies and looked at the results. Still unsatisfied, I thought that maybe the term borrowing was better than lending. So I changed my search to peer to peer borrowing movies. Lo and behold, the search results show a page full of results with the word lending bolded and no results that show borrowing at all: http://imgur.com/lPPQmxY

If I absolutely didn't want the word lending in the results, I would need to alter the search to peer to peer borrowing -lending movies to avoid this. For development searches where I'm sometimes trying to find a needle in a haystack, I don't want to have to keep excluding numerous terms I have already decided are undesirable. As I never sign in when searching and I can't be bothered to find and change whatever Google setting causes this every time I fire up my browser, I find DDG to work the way I expect and often with better results too.


The major one for me is being able to do meta-queries that will be sent to the relevant website, e.g. "!maps 401 broadway to Canal St Station" or "!wa how many stars are in the galaxy"

http://duckduckgo.com/bang.html


But Google does that already, without needing to type !maps...


You get better (short, sane and email-friendly) URLs if you access Google searches via DDG bangs. !gm = Google Maps, !g = Google search, !gi = Google Images, etc.


So we have DDG as a better solution when you're e-mailing links to google searches. That doesn't add up to hours a week.


I do the same sort of queries but just use the search keywords feature of Firefox - so that's surely even faster, as DDG isn't acting as an intermediary to interpret the search request and redirect.

That said, I think the competition DDG has provided to the other search providers has been a benefit to me.


I think people choose DDG for its emotional appeal (they like that it's a small company, privacy benefits, all kind of reasons besides search quality). Later they rationalize about it being better.

Frankly, I think this is morally wrong. When they use an inferior search engine, or an inferior maps app for that matter, people are dumber than they might otherwise be. They miss out on a boost of intelligence, perhaps not IQ, but they are less effective operating in the real world. The world as a whole is worse because of it.


I disagree about the ordering (i.e. for me, for certain queries, it was objectively better first, before the other criteria played a part).

However, you're saying that the only thing (or the best thing) that will make the world better is better quality results. This is wrong. Playing the long game by supporting such search engines means that quality can improve and you end up with both better quality search engines AND morally improved companies. The whole world will be better because of it. You're thinking too short term.


The thing I like a lot about DDG is that you can easily configure the region and language explicitly. Google keeps redirecting me depending on geolocation (which isn't always correct). When searching for shopping, I want things to be local. When programming, I much prefer stuff in English to German or Italian.


Most of the time what I want is in "the little box" right at the top of the list (aka the "Instant Answer"), so I don't have to click and scroll... Google had "I'm feeling lucky," but just putting the summary of the primary result at the top is better/faster/visually cleaner.


Google has some instant-answer-style results now (e.g. search "barack obama"). I'm not sure if they're a result of DDG's work on "instant answers", but it looked that way from where I'm sitting. Just general industry development, though, I'd guess.


The same search on DDG gives a less "newsy" and more useful and explorable result, like a software encyclopedia. It looks like a Facebook profile (because it's taken from the Wikipedia page, which reads that way). I think that's the right approach.

https://duckduckgo.com/?q=barak%20obama&kp=-1&k1=-1

Now, if Google can pull off something similar with heuristics alone they may have an advantage, because those little boxes are run by code covering a whole bunch of special cases, and that could be a bottleneck.

But I love the exploration options of the instant answers.. What Google is doing is definitely more time-sensitive (they're justifiably tooting their own horn a bit with it), but it doesn't feel as useful in the typical case (does anyone really use Google like a newspaper?). If you search something like "barak obama news" on DDG, you get basically the same result as the Google headline summary, and it didn't get in the way of the list on the "barak obama" results page. It feels more natural, to me at least.

"I came for the privacy, but I stayed for the features..."


For your own crawler, what do you estimate the size of your index (# of objects)? Just curious how your own data compares to the rest of your sources (Bing, Yandex, Blekko, etc)


1 million doesn't seem that big a number on the internet?


On average that's around 11.6 requests per second. That's about 500 million CPU cycles per request on a single low-end CPU. Of course DDG probably isn't CPU-bound, but it does give an indication that 11 requests per second is not a lot relative to the raw power of modern hardware.


Not sure why yegg's (Gabriel Weinberg) comment was killed. It says:

> We also do about 12M API requests per day: http://duckduckgo.com/traffic.html


The same comment was posted in another thread.


With their system centered around APIs, hashmaps and message encode/decode are CPU-bound.


Doesn't Google do around 3 billion searches/day? I was under the impression DuckDuckGo was a bigger player.


Seems like a mostly straightforward architecture. I have only one question (and it's a genuine question, as I'm not sure under what circumstances it's being used and whether there is an issue that was being run up against): why is Solr replication being run using a 'custom' solution rather than SolrCloud/ZooKeeper or similar, or even just the standard master/slave architecture if the data is small enough?


Regarding the "specialized semantic data", isn't Google already offering similar stuff and/or in the position to offer more? The only example that comes to mind is searching for word definitions, but I'm sure there's more.


"Front-end development uses a lot low level JavaScript. Thinking of moving from YUI to jQuery."

So they are using Node? I can't think of YUI as "low level JavaScript" at all.


I'd be very interested to hear a ballpark estimate of how much data they're storing in their PostgreSQL instances. Any word on that?


startpage.com seems to have a better front end, is faster, and the results are actually what you're searching for. I haven't dug into their security to see if it really is "the world's most private search engine".


StartPage just serves you Google results through a proxy. If you're trying to avoid Google, this is pointless.


Most people are just trying to avoid Google tracking. Startpage claims to not even record your IP address. And it was hosted outside of the USA, but it seems that isn't the case anymore...


Where can I find info on how the DDG crawler is built?


When will Google block DDG?


DDG doesn't rely on Google. Their results come from their own systems and from Yahoo (through BOSS), embed.ly, WolframAlpha, EntireWeb, Bing, Yandex, and Blekko, according to http://help.duckduckgo.com/customer/portal/articles/216399-s...


Well, DuckDuckGo searches their own stock and then points you towards Google for anything DuckDuckGo doesn't have in stock. Google should be paying them commission, if anything.


The bad press among techheads might make it not worth the bother. Remember, the people who use DDG are the candidate pool for most hiring positions at Google. Regular people use Google/Bing etc.


Custom Perl solution O_o. OK seriously, no matter what, DuckDuckGo is still just a front end for some search engines.


Don't be so nasty.


OK, I was harsh, but I wonder if what I said was wrong. I do not understand the overprotective behavior around DuckDuckGo news and the DDG brand on HN.


It's not about being protective over DDG. You are being an asshole again and again in this thread.

By your own admission, you don't have any experience with search engines, and yet you have a definitive opinion about everything related to them.


DuckDuckGo is not a search engine. You don't need to be an expert to see that fact.


1) He never called you an asshole. He just posted his views about DDG.

2) If you are offended by what he said, it doesn't give you the right to call him an asshole.

For example, I know many people whose projects you didn't complete on time and who think you are a rotten dick for rightly being one. But I didn't call you a rotten dick, did I?

And also, stop defending DDG like a pussy that's everywhere, dude.


I thought about responding, but then "wrestling, pigs" etc.


I see only one pig here


>> I see only one pig here

... and I see accounts (like davidpayne11) with just enough comments/karma to be useful for trolling -- and insults from them.

Please get a life or at least take this pathetic shit to some other place than HN. :-(


Yep, not many people respond to insults. But you actually gave an explanation as to why you didn't, to make yourself sound cooler. Bravo 'chuthiya' (you were the one who taught me this)

Andddd, next time, before you waste time insulting people on an international forum, try to use it constructively to at least complete the projects of people whose money you've taken after promising to complete them, and then didn't. Makes more sense than replying to my comments, right?


HN is helping Davids against Goliaths and I consider it a very good thing.


What happened to merit?


Could you share your search engine development experience?


Sadly, I do not have that experience. I know what everybody knows. I guess it is extremely hard. Software-infrastructure-wise it is possibly one of the most challenging products. Competition is stiff, so going global is almost not a choice. You either go vertical, or local/national. Yes, you can use available open source software (Nutch, Hadoop, Solr etc.), but that only takes you up to a point. If you want to go really big, then you may need to write your own crawlers, indexing/ranking, clustering, and duplicate/spam detection algorithms, distributed file systems, ad systems etc. Making a search engine of course is not enough; you need to add a gazillion services to make it attractive. Many of those services also require cutting-edge technology (translation etc.). And it burns money 24/7.

I understand why people value DuckDuckGo for privacy reasons. I respect that. But surely, developing it is far, far easier than developing a real search engine.


If you're interested in what it takes to write a search engine with its own multi-billion-webpage crawl/index, check out the blog series starting with http://highscalability.com/blog/2012/4/25/the-anatomy-of-sea...


And when they say they're moving towards knowledge-graph-like instant results, latency is not going to be pretty: lots of small fetches after general queries from the core search servers.



