

Why Writing Your Own Search Engine Is Hard - helwr
http://queue.acm.org/detail.cfm?id=988407

======
iamelgringo
Search is a gold mine, and I don't understand why there aren't more people
diving into building niche search engines. Sure, you can't really compete with
Google on size, but there's a lot of nooks and crannies online where you can
pick up valuable search traffic around the edges.

At least, that's why I'm working on a search engine for financial news at
<http://Newsley.com/search>. (We're focused on building the crawlers and the
index right now. Search is _very_ alpha).

After reading this article, I feel validated for a bunch of the decisions that
I've been making. I've been running on EC2, but their disk IO is slow as
molasses. So, I'm starting to build servers and throw them in my garage. I'll
be migrating to garage servers in the next few months. Pretty much everyone I
talk to thinks running servers in your garage is a terrible idea, but I can't
think of any other way to do this more cheaply and still have control over my
hardware. It's nice to read that I'm not crazy for thinking this.

It was also great to read that on early search engines, the bulk of the work
is done by small teams. Being the only dev, at times I think I'm a bit crazy
for trying to bootstrap a search startup. Again, it was nice to read that it's
not all that crazy to try and do it on my own.

~~~
ljlolel
She specifically says to go with slow disk I/O (not in-memory indices) because
the most important thing is not to have to deal with failing servers.

This was written in 2004, and the equivalent statement would be that you
should not deal with maintaining physical servers but have the cloud handle it
so you can focus on algorithms/parallelism/runtime.

Also, unrelated to what you wrote, this is 2004, and statements like "people
search for words not phrases" are frankly no longer true. Average query length
is way up (even before Google Suggest and Instant) and people have been
searching for phrases more and more since 2004.

She's completely right, but if she were to rewrite this today, in 2010, it
would be an even longer article. A single-word search engine would not return
acceptable results.
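
To make the word-versus-phrase distinction concrete, here is a minimal sketch of a positional inverted index in Python. It is not from the article; the function names and the sample documents are made up for illustration, and tokenization is just lowercased whitespace splitting.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to {doc_id: [positions where the term occurs]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index

def word_search(index, term):
    """Single-word lookup: every document containing the term."""
    return set(index.get(term.lower(), {}))

def phrase_search(index, phrase):
    """Documents where the terms appear consecutively, in order."""
    terms = phrase.lower().split()
    if not terms:
        return set()
    # Candidates must contain every term of the phrase somewhere.
    candidates = set(index.get(terms[0], {}))
    for term in terms[1:]:
        candidates &= set(index.get(term, {}))
    hits = set()
    for doc_id in candidates:
        # A hit needs term i at position start + i for some start.
        for start in index[terms[0]][doc_id]:
            if all(start + i in index[t][doc_id] for i, t in enumerate(terms)):
                hits.add(doc_id)
                break
    return hits

# Hypothetical sample documents.
DOCS = {
    1: "cheap garage servers beat cloud servers",
    2: "cloud servers in a garage",
}
INDEX = build_index(DOCS)
```

With these sample documents, `word_search(INDEX, "garage")` matches both, while `phrase_search(INDEX, "garage servers")` matches only the first. Storing positions is what makes the phrase query possible, and it is also part of why a phrase-capable index is larger than a word-only one.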

~~~
iamelgringo
re: Cloud Storage vs Garage data center

Elastic Block storage is NAS. This is why Elastic Block storage is two to three
orders of magnitude slower than an SSD attached to a server in your garage. 1 EC2
compute unit is analogous to a 1.6 GHz 2005 Opteron; performance-wise, it's
similar to a fast Atom processor. I can build a 4-core server with 8 GB of
RAM, 120 GB of SSD, and 2 TB of spinning disk for around $600. SSD latencies
are low enough that they can be thought of as slow local memory.

So, for the cost of 3 to 4 months of a medium or large EC2 instance,
you can have roughly 8 times the processing power and 2 orders of magnitude
more local memory/SSD.

By setting up a couple of cheap servers in my garage, I can get as much
processing power as 20 to 30 EC2 instances. That means I can spend a lot more
time worrying about coding and less worrying about SysOps.

re: single word search

You are absolutely right. As I said, our search is very much in alpha right
now, and I really haven't spent much time at all on it.

------
bradleyland
This is from 2004. A lot of the paper still applies, in principle, but I'd
argue that there are far fewer people chomping at the bit to get into the
search business these days. Now it's all "social" or "game" related.

~~~
mmaunder
Agree, and I think there are a few big opportunities, e.g. a better job search
engine. Search in some form is what most of us spend a huge amount of time
doing every day. Even searching within web pages for the relevant content
we're looking for.

Also Google are barely keeping their head above the web spam and filtering out
spam is a very hard problem which may be better solved with a blend of human
intelligence. Before you disagree with me consider that there are vast farms
of humans creating SEO content and the content-gen business is growing
extremely fast. It's the thing that scares Google the most in search.

------
korussian
I think for most people writing a search engine is overkill when there are
existing options out there.

If you want to search a subset of sites, then Google CSE is really all you
need + whatever bells & whistles you'd like to add around it. I've done that
here: <http://searchESLCafe.com>, adding "recent searches", search via
wildcard subdomain (i.e. foo.searchESLCafe.com or bar.searchESLCafe.com or
foo_bar.searchESLCafe.com, etc), and customizing the heck out of Google CSE's
options.

Is there a demand out there for the search engine to parse the results into
something informative at-a-glance? I'm not so sure it's the user's first
priority. Or, to put it another way, there's plenty of hard-to-reach info out
there that you can hand users via a customized Google CSE, and they don't mind
doing the leg-work of clicking on the query results and finding their own
answers.

It's a lot more important to have an accurate search algorithm than drill-
down-related bells & whistles.

Google does a great job of returning solid results for any subset of sites, so
why not let Google handle it, and concentrate on the other stuff?

------
rwmj
I wonder what happened to the Internet Archive search tool she wrote
(recall.archive.org)?

~~~
gojomo
'Recall' was at the promising prototype quality level, with results more Cuil-
like, centered around linguistic concepts detected during indexing, than
Google-like. So for example the index was compact but there was no phrase-
search, and even after receiving a page as a result to one query, a followup
query with words on that page might return zero results if that exact set of
words hadn't been concept-extracted.

It had nice graphs of concepts over time. But, its cleverness was probably
overkill for such visualizations, except to the extent (IIRC) it disambiguated
similar concepts from context. As the recent Google Books-based word frequency
tool has shown, simple word and n-gram counts provide plenty of similar value.
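
As a rough illustration of how little machinery plain n-gram counts over time need, here is a minimal Python sketch. It is not how Recall or the Google Books tool work; the function names and the sample data are invented for the example, and tokenization is plain whitespace splitting.

```python
from collections import Counter, defaultdict

def ngrams(tokens, n):
    """Yield consecutive n-token tuples from a token list."""
    return zip(*(tokens[i:] for i in range(n)))

def counts_by_year(dated_texts, n=2):
    """Count n-grams per year from (year, text) pairs."""
    by_year = defaultdict(Counter)
    for year, text in dated_texts:
        tokens = text.lower().split()
        by_year[year].update(" ".join(gram) for gram in ngrams(tokens, n))
    return by_year

# Hypothetical sample data: (year, text) pairs.
SAMPLE = [
    (2004, "search engine design"),
    (2004, "niche search engine"),
    (2010, "social search engine"),
]
BY_YEAR = counts_by_year(SAMPLE, n=2)
```

Plotting `BY_YEAR[year]["search engine"]` across years gives a frequency-over-time curve for that bigram. What it does not do is tell apart different senses of the same words, which is the disambiguation the concept extraction was aiming at.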

Ultimately, though, as a bit of advanced, non-open-source Lisp code by a
single expert, there was no one appropriate to productionize, maintain, and
improve it after she joined Google.

(I work at the Internet Archive.)

------
iwwr
In other words, avoid spending money, refine your algorithms first. Faster
machines may be tempting, but that makes scaling horribly expensive down the
road.

~~~
jamesaguilar
Following our own advice is hard. <http://en.wikipedia.org/wiki/Cuil>

~~~
iwwr
Wow, the same Anna Patterson. I don't understand how Cuil became such a
disaster. Even early Google seemed alright.

~~~
jseliger
I don't either, and I'd love to know who does and whether they've written
about it.

------
joshbaptiste
Heh.. wonder what yegg of DuckDuckGo thinks of this article.

------
known
We can roll out our own Google search engine via <http://aspseek.org>

------
mixmax
_Application server is busy. Either there are too many concurrent requests or
the server still is starting up_

Apparently scaling is hard too.

