
DuckDuckGo Architecture - 1 Million Deep Searches a Day and Growing - orrsella
http://highscalability.com/blog/2013/1/28/duckduckgo-architecture-1-million-deep-searches-a-day-and-gr.html
======
kimmel
Good god, the comments in this thread really show how douchy the HN community
can be.

Tons of anecdotal stories with no evidence. Obvious outsiders trying to imagine
what it is like on the inside (all the guesses about how DDG works) and of
course no facts to back anything up.

I would say this pandering and uninformed behaviour is not common on HN, but
it is.

People who engage in language flame wars simply do not understand that when
you become a good programmer, languages do not matter; only the platforms
matter.

~~~
hammerzeit
Leaving aside one trolly user, most of the conversation here is either
referencing data mentioned in the article or asking questions/having
discussions about semi-related points.

I'd hardly call the argument over language speed a 'flame war' by any internet
standard I've ever seen.

There are probably better examples than this one to pull the 'HN commenters
suck' card. The conversation here is mostly civil and trying to learn things,
one troll aside.

~~~
rehack
Agree. The choice of Perl actually does surprise people, genuinely. As
mentioned elsewhere, even blekko uses Perl. So obviously these people
know what they are doing.

But that said, choosing anything interpreted does come back to haunt you
later on, in one way or another.

We run Java on Jettys for most of the things for our app (traffic is around
60k/70k in a day, so much less in comparison). But even with this traffic, we
need to use 2 large EC2 instances in the daytime. And mysteriously the Jettys
keep running out of memory every now and then.*

I am sure the same thing if done in C++ would need only 1 large EC2 instance.
And it will also help the latency a bit, as a parallel gain. At present I am
analyzing the cost/benefit of such a move. Inputs are welcome.

* With Java it's always the memory that hurts you first. Latency-wise, there is not much of a difference in most cases.

Edit: Down vote? Surprised. Why??

------
pjungwir
Surprised to hear that crawling still runs out of the basement. For a general-
purpose search engine, what kind of bandwidth does that require? Or is it
manageable only because DDG proxies so many searches to other search engines?

Also, doesn't it seem inefficient to have crawlers here and indexes there?

~~~
orangethirty
I run a couple of boxes that are crawl only for Nuuton right out of my office.
I do have a good internet connection. But it's a good way to save on
infrastructure.

~~~
boyter
I don't suppose you have a blog or something for that? My hobby/interest is
search engines and I love reading about people's experiences getting them
running.

~~~
orangethirty
There is one, but I have been too busy building it to post anything yet. You
can always get in touch through email. I like chatting with other hackers.

~~~
boyter
That's a pity. Just for the record, for anyone looking at this space, here are
some links to get started:

    
    
      http://www.yioop.com/blog.php
      http://www.gigablast.com/rants.html
      http://queue.acm.org/detail.cfm?id=988407
      http://blog.procog.com/
      http://www.thebananatree.org/
      http://blog.blekko.com/
    

I will even suggest my own small implementation (created purely for SEO value
but it does work)

    
    
      http://searchco.de/blog/view/code-for-a-search-engine-in-php-part-1/
    

BTW email sent.

~~~
ersii
These are all great links to read if you'd like to dig deeper into search
engine land.

Thanks for the searchco.de articles, they were nice to read (Saw them in the
previous searchco.de HN) :-)

~~~
boyter
Thanks. I collect any that come up since it's really such a dark area. I should
probably write a blog post about it since I have a few more I have since dug
up. Glad you like the articles :) it was something I had been writing for
weeks and then finally got my act together and finished it.

------
gary4gar
DDG is using a dynamic language like Perl[1], which is the slowest of all
languages, to achieve high scalability. This proves languages don't matter
much; it's all about architecture.

I guess it's time to stop worrying about the performance of your programming
language & start building better, highly scalable architectures.

[1] Slowest of all languages
[http://benchmarksgame.alioth.debian.org/u32/which-programs-a...](http://benchmarksgame.alioth.debian.org/u32/which-programs-are-fastest.php?gcc=on&gpp=on&ifc=on&java=on&v8=on&hipe=on&jruby=on&php=on&python3=on&yarv=on&perl=on&calc=chart)

~~~
timr
Those benchmarks are extremely suspicious. I remember comparing Perl and
Python for bioinformatic work (non-trivial computational workloads), and
finding that Perl was about 2x the speed of Python, on average.

Later, I did similar, non-trivial benchmarks with Python and Ruby, and found a
similar, 2x factor. Ruby has improved since then, but unless Perl has become
dramatically slower in the same interval, I suspect that these benchmarks are
either trivial (i.e. simple loops), or badly written.

~~~
sreque
Your benchmarks are likely very biased. The vast majority of the work done by
your bioinformatics programs should be done in C code, with thin wrappers so
that the code can be accessed through Perl and Python. If anything, your
benchmarks are only comparing specific implementations, such as BioPerl vs
BioPython.

~~~
timr
I wasn't using bioperl or biopython. I wrote the algorithms myself.

------
bifrost
I've gotta say, I love DDG, I rarely ever use Big-G for anything anymore. If I
had to guess, it saves me a couple hours a week!

~~~
wiremine
Just curious, as a non-DDG user: what features are you using that are saving
that much time?

~~~
orta
The major one for me is being able to do meta-queries that will be sent to the
relevant website, e.g. "!maps 401 broadway to Canal St Station" or "!wa how
many stars are in the galaxy"
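
A rough sketch of how a leading `!bang` could be turned into a redirect. The table and URL templates below are illustrative assumptions, not DDG's actual bang list or implementation, and only a bang at the start of the query is handled:

```python
from urllib.parse import quote_plus

# Hypothetical bang table for illustration only; the real list is far larger
# and these URL templates are assumptions, not taken from DDG.
BANGS = {
    "g": "https://www.google.com/search?q={q}",
    "wa": "https://www.wolframalpha.com/input?i={q}",
}


def resolve(query):
    """Return a redirect URL if the query starts with a known !bang, else None."""
    if not query.startswith("!"):
        return None
    # Split "!g some terms" into the bang name and the remaining search terms.
    bang, _, rest = query[1:].partition(" ")
    template = BANGS.get(bang.lower())
    terms = rest.strip()
    if template is None or not terms:
        return None
    return template.format(q=quote_plus(terms))


print(resolve("!g duck duck go"))
```

A query without a recognised bang returns `None`, which a caller would treat as a normal search.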

<http://duckduckgo.com/bang.html>

~~~
lutz
But Google does that already, without needing to type !maps...

~~~
nonamegiven
You get better (short, sane and email-friendly) URLs if you access google
searches via DDG bang. !gm = google maps, !g = google search, !gi = google
images, etc.

~~~
badgar
So we have DDG as a better solution when you're e-mailing links to google
searches. That doesn't add up to hours a week.

------
photorized
For your own crawler, what do you estimate the size of your index (# of
objects)? Just curious how your own data compares to the rest of your sources
(Bing, Yandex, Blekko, etc)

------
account_taken
1 million doesn't seem that big a number on the internet?

~~~
jules
On average that's around 11.6 requests per second. That's about 500 million
CPU cycles per request on a single low-end CPU. Of course DDG probably isn't
CPU bound, but it does give an indication that 11 requests per second is not a
lot relative to the raw power of modern hardware.
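
The arithmetic behind those figures, as a quick sketch. The clock speed and core count are assumptions picked to show what hardware budget reproduces roughly 500 million cycles per request; they are not figures from the article:

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def requests_per_second(requests_per_day):
    """Average request rate, ignoring daily peaks and troughs."""
    return requests_per_day / SECONDS_PER_DAY


def cycles_per_request(requests_per_day, clock_hz, cores=1):
    """CPU cycles available per request if the load were spread evenly."""
    return clock_hz * cores / requests_per_second(requests_per_day)


rps = requests_per_second(1_000_000)
print(f"{rps:.1f} requests/second")

# Assumed hardware: 2 cores at 2.9 GHz (hypothetical, chosen to match ~500M).
budget = cycles_per_request(1_000_000, clock_hz=2.9e9, cores=2)
print(f"{budget / 1e6:.0f} million cycles/request")
```

This is an average only; real traffic is bursty, so peak load would leave a smaller per-request budget.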

~~~
jules
Not sure why yegg's (Gabriel Weinberg) comment was killed. It says:

> We also do about 12M API requests per day:
> <http://duckduckgo.com/traffic.html>

~~~
tedunangst
Same comment was posted in another thread.

------
druiid
Seems like a mostly straightforward arch. I have only one question (and it's a
genuine question as I'm not sure under what circumstances it's being used and
if there is an issue that was being run up against). My question is why is
Solr replication being run using a 'custom' solution rather than
cloud/Zookeeper or similar, or even just the standard master/slave arch if the
data was small enough?

------
knowaveragejoe
Regarding the "specialized semantic data", isn't Google already offering
similar stuff and/or in the position to offer more? The only example that
comes to mind is searching for word definitions, but I'm sure there's more.

------
fourstar
"Front-end development uses a lot low level JavaScript. Thinking of moving
from YUI to jQuery."

So they are using node? I can't think of YUI being "low level javascript" at
all.

------
pepr
I'd be very interested to hear a ballpark estimate of how much data they're
storing in their PostgreSQL instances. Any word on that?

------
hexonexxon
startpage.com seems to have a better front end, is faster and the results are
actually what you're searching for. I haven't dug into their security to see
if it really is "the world's most private search engine"

~~~
dublinben
StartPage just serves you Google results through a proxy. If you're trying to
avoid Google, this is pointless.

~~~
temes
Most people are just trying to avoid Google tracking. Startpage claims to not
even record your IP address. And it was hosted outside of the USA, but it
seems that isn't the case anymore...

------
umenline
where can i find info on how the ddg crawler is built?

------
abbert
When will google block DDG?

~~~
runarb
DDG doesn't rely on Google. Their results come from their own systems and Yahoo
(through BOSS), embed.ly, WolframAlpha, EntireWeb, Bing, Yandex, and Blekko
according to
[http://help.duckduckgo.com/customer/portal/articles/216399-s...](http://help.duckduckgo.com/customer/portal/articles/216399-sources)

------
afsina
Custom Perl solution O_o. Ok seriously, no matter what, DuckDuckGo is still
just a front end of some search engines.

~~~
JoachimSchipper
Don't be so nasty.

~~~
afsina
Ok I was harsh, but I wonder if what I said was wrong. I do not understand the
overprotective behavior over DuckDuckGo news and brand on HN.

~~~
irahul
It's not about being protective over DDG. You are being an asshole again and
again in this thread.

By your own admission, you don't have any experience with search engines, and
yet you have a definitive opinion about everything related to them.

~~~
davidpayne11
1) He never called you an asshole. He just posted his views about DDG.

2) If you are offended by what he said, it doesn't give you the right to call
him an asshole.

For example, I know many people whose projects you didn't complete on time and
who think you are a rotten dick for rightly being one. But I didn't call you a
rotten dick, did I?

And also, stop defending DDG like a pussy that's everywhere, dude.

~~~
irahul
I thought about responding, but then "wrestling, pigs" etc.

~~~
afsina
I see only one pig here

~~~
berntb
>> I see only one pig here

... and I see accounts (with davidpayne11) with just enough comments/karma to
be useful for trolling -- and insults from them.

Please get a life or at least take this pathetic shit to some other place than
HN. :-(

