

Ask HN: What strategy would you take to build a search engine today? - matt1

If you were to begin working on a search startup from scratch today, what strategy would you take? I'm thinking the best route would be to focus on a specific niche and tailor the search features around that topic, since you wouldn't be able to compete on breadth.

What would you do?
======
jacquesm
I would aggregate all the knowledge in link dumps / fora where voting is
possible, and possibly seed the database by creating a giant repository of
links, such as a social bookmarking site.

It would get built up slowly but it would be of a higher quality than you
could probably reach with crawling. Effectively you'd be crowdsourcing the
ratings system of your search engine.

A karma system would keep the spammers out, or at least have them identified
fast enough.

Of course there are plenty of things that would need to be fleshed out wrt
abuse potential and 'gaming the system', but I think it would stand a fighting
chance.

Crawling the web is just going to get you tons of garbage, comparable with
moving the contents of the local trash dump into your house because you know
there must be a pair of earrings in there somewhere.

There will come a time when 'pagecount' is not the defining measure of search
engine quality. Google seems to be on to that, because they dropped their
'indexing xxx pages' message from the homepage long ago.

I would also allow users to 'hide' certain domains from their future search
results, and use that to help identify sites that contain mostly (or
exclusively) trash.
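
To make the karma idea concrete, here's a minimal sketch (Python; the URLs,
the karma threshold, and the square-root damping are all illustrative
assumptions, not anything specified above):

```python
from collections import defaultdict

# Hypothetical data: (voter_karma, url) pairs collected from the community.
votes = [
    (120, "http://example.com/a"),
    (15,  "http://example.com/b"),
    (300, "http://example.com/a"),
    (5,   "http://example.com/spam"),
]

MIN_KARMA = 10  # votes from low-karma accounts are ignored to deter spammers

def rank(votes):
    scores = defaultdict(float)
    for karma, url in votes:
        if karma >= MIN_KARMA:
            # Diminishing returns: high karma helps, but can't dominate outright.
            scores[url] += karma ** 0.5
    return sorted(scores, key=scores.get, reverse=True)

print(rank(votes))  # ranked list; the spam-voted URL never makes it in
```

The karma floor is the anti-spam lever: a fresh account's votes simply don't
count until it has earned some standing.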

~~~
jdrock
This approach would be suitable for "mainstream" search results, but wouldn't
be able to serve the long tail.

Or to put it more humorously, you wouldn't be able to find the earrings
because only you (and no one else) were interested in them :)

~~~
jacquesm
I'm not so sure about that.

The 'long tail' in search is - for Google at least - everything beyond page
100 (or position 1,000, whichever way you want to slice it). Those pages
might just as well not exist for those keywords, but because there is plenty
of other content, nobody notices.

The situation with the 'long tail' for page sets where there are fewer than
1,000 'results' can be handled the exact same way it is being done today (here
the 'long tail' probably refers to the rarity of search keywords / combinations).

For really small sets (1 full page of results or less) the ranking is pretty
much irrelevant.

~~~
jdrock
Let me clarify...

How would you know how to value votes for different queries? Certain pages are
more relevant to certain queries than others, and likewise for votes. Your
approach doesn't have a way of accounting for this, and implementing something
that does is a very, very hard problem.

~~~
jacquesm
That's true, you would not have the 'text in the link' to guide you.

But you might be able to get around some of that by allowing users to tag the
urls.

I realize it's a hard problem; I assume the OP does not expect to walk out of
here with a bulletproof business plan for a new search engine. There are
bound to be issues with almost any suggestion you could make here.

But it might give some useful hint or starting point.

------
Perceval
I would rethink things fundamentally. That requires a lot of abstract thinking
about ontology, but that's how the past revolutions in search began.

Yahoo treated the web like a phone book or directory. Altavista relied on
self-categorization efforts in meta tags. Google treated links as votes.

You need to come up with a new (and hopefully better) way of thinking about
what the web _is_. Come up with an inventive way of thinking about what
linking means, what DOM structure means, how to think about non-standard types
of content, and so on.

If you start with the same premises about the web that Google started with in
1997, you'll never surpass them, much less carve out anything more than a toy
niche.

------
shorbaji
I would focus on anticipating the user's search needs even before he/she
formulates them and types out a search query.

I would use as much context as the user is willing to provide - location,
recent email messages, voice call transcripts, unread messages, web browsing
history, etc. - to try to anticipate what the user is likely to query for.

For example, a sales engineer who receives a technical query in his email
inbox is likely to be searching for product information once behind his/her
laptop. Also, a student who lands at JFK airport in New York one morning will
likely be searching for restaurants in the vicinity by evening.

I would focus on anticipating such queries (among a range of others) in
advance, and would let the user choose which ones he/she wants answers for.

I'm sure this is easier said than done, but is a direction I think is worth
exploring.

~~~
anamax
If you can anticipate a user's search needs, why not take the next step and
provide the answers before the user gets around to searching?

~~~
sundeep
One step at a time :)

What you suggest would benefit from the performance/feedback of what the OP
suggests.

This is basically what I am interested in working on currently: using "clues"
gleaned from what the user is doing (a la RescueTime) to
reorganize/reformulate a user's search query.

~~~
shorbaji
I would be happy to chat about this. Please drop me a note. My email is in my
profile.

~~~
anamax
Your e-mail address doesn't seem to be visible in your profile.

~~~
shorbaji
You're right. Fixed.

------
scumola
Write a crawler: Make it nice, so it doesn't hammer sites; convert relative
URLs to absolute; pull out the relevant data from the HTML; cycle URLs back
through the crawler for more crawling. Make sure your ISP doesn't have a cap
on your monthly bandwidth (i.e. don't use a residential-scale ISP). Take into
account pages that are dupes of other pages: compare MD5s of all content to
detect URLs that generate content duplicating what you've already pulled, and
ignore those. Integrate sitemap detection for smarter crawls.
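
Two of those steps are easy to sketch in Python with the standard library
(the page bodies here are just placeholder bytes):

```python
import hashlib
from urllib.parse import urljoin

seen_hashes = set()

def is_duplicate(body: bytes) -> bool:
    # Hash the page body; identical content under different URLs hashes the same.
    digest = hashlib.md5(body).hexdigest()
    if digest in seen_hashes:
        return True
    seen_hashes.add(digest)
    return False

# First fetch of a page is kept; a second URL serving the same bytes is skipped.
assert is_duplicate(b"<html>hello</html>") is False
assert is_duplicate(b"<html>hello</html>") is True

# Relative -> absolute URL conversion before re-queueing links for the crawler.
assert urljoin("http://example.com/a/", "../b.html") == "http://example.com/b.html"
```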

Index the data: Figure out a way to index the massive volume of data that's
not a live query system (Lucene starts to gag at indexes > 50GB or so), so
generate all possible results for all possible queries in a database and
update those results occasionally as you pull in more data. When someone
queries for a search string, pull up the results from a pre-generated query
in the database; don't do a live search of all of your terabytes of data, or
the query will take days.
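
The precompute-then-lookup split can be sketched like this (Python, with a
dict standing in for the database and made-up documents):

```python
from collections import defaultdict

# Offline: build a term -> ranked-URL-list table once per crawl cycle.
docs = {
    "http://example.com/py":  "python tutorial for beginners",
    "http://example.com/ml":  "machine learning with python",
    "http://example.com/cat": "pictures of cats",
}

index = defaultdict(list)
for url, text in docs.items():
    for term in set(text.split()):
        index[term].append(url)

# Online: a query is a cheap key lookup, not a scan over terabytes of pages.
def search(term):
    return index.get(term, [])

print(search("python"))  # both python pages, in precomputed order
```

All the expensive work happens offline; the serving path is a single read.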

Generate interesting results: Find a niche. Don't plan to take on Google, Bing
& Yahoo on a personal scale. People have put together good engines, but they
target main sites and front pages, or use a shallow-depth crawl. Don't plan on
indexing every forum and every blog on the internet.

I've been impressed by this guy's search engine: <http://gigablast.com/>

In short, unless you've got a TON of money, machinery, people and time, don't
try to compete with Google. Find a niche like shopping search or movie search,
or be human-powered like Mahalo. Google's got a dedicated computer for every
possible search query out there, or close to it, plus a team of 500,000
Chinese people making sure that popular results are relevant (Google does
human-validated results for many of the most popular queries, though not for
obscure ones like an error-message query).

------
jdrock
Use us. <http://www.80legs.com>.

:)

~~~
jpeterson
Why are comments like this sometimes voted to the top, and other times buried
with downvotes? Where's the line between good self-promotion and spam?
Seriously asking.

~~~
matt1
To their credit, 80legs does make crawling the web much easier than starting
from scratch so it does answer the question of how to begin.

------
yannis
Search technology - perhaps because of the cost of setting it up - has lagged
behind. I would like to see:

(01) User ability to adjust the 'algo' for ranking results. I may want all the
newer websites and news in my field rather than the websites with the highest
page rank.

(02) Ability to distinguish an 'authority website', i.e. I search for Topic X,
I do not want Wikipedia. I want the website, perhaps, of a Ph.D. student
with no SEO but with 500 pages on the topic (i.e. not only rank pages, but
rank websites).

(03) and fast as hell :)
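
Point (01) is straightforward to prototype: expose the ranking weights to the
user. A minimal sketch (Python; the documents, signals, and decay constant are
all invented for illustration):

```python
# Hypothetical per-document signals.
docs = [
    {"url": "http://old-authority.example",  "pagerank": 0.9, "age_days": 2000},
    {"url": "http://fresh-phd-site.example", "pagerank": 0.2, "age_days": 10},
]

def score(doc, w_rank=1.0, w_fresh=0.0):
    freshness = 1.0 / (1.0 + doc["age_days"] / 30.0)  # decays over months
    return w_rank * doc["pagerank"] + w_fresh * freshness

def search(docs, **weights):
    return [d["url"]
            for d in sorted(docs, key=lambda d: score(d, **weights), reverse=True)]

print(search(docs))                           # default: authority wins
print(search(docs, w_rank=0.1, w_fresh=1.0))  # user dials up freshness
```

The same two documents reorder themselves as the user moves the sliders, which
is the whole point of a user-adjustable 'algo'.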

~~~
ajordan
<http://hounder.org/> does 01 and 03 quite well, and 02 through bayesian
filters you can train to find and rank relevant sites.

:)

------
delano
There is a lot of room to carve out your niche, although this is true of
most software products (a product can be considered a market leader in one
vertical, yet be almost unknown in another).

Whether you choose to seek out a niche or not, you still need a novel
approach. I know one search company that in 2000/2001 started building a
solution that relied heavily on in-memory indexes. At the time memory was
expensive but it was a smart play b/c it became cheap very quickly which gave
them a huge advantage in the depth and breadth of calculations they could do.
They got a contract with Verizon to provide search for superpages.com and
became quite popular in the YP space. They were acquired in 2007.

I mention the story because it's important to remember that there are many,
many successful companies doing cool stuff that we've never heard about.
That's the norm. The ones we do hear about are the outliers.

------
Tichy
I still like the idea of using p2p computing to get the computing power
(Search@home?), though I am not sure how viable it is.

Another thing I would be interested in experimenting with is browsing search
results. Maybe a flat list of results is not the end of the story? For
example, maybe it could be interesting to be able to click results and say
"more like this" or "more like that" (maybe Google does it already with
voting on results - did they continue that experiment?).

I am not sure Google's algorithm is even that good (the SEOs still succeed,
after all). Computing power might not be an edge forever, either. Where they
have a big lead might be the number of data sources they have: not only
crawling the web, but people using Google Groups, Google Maps, the book
scanning thing, and so on...

------
charlesju
Search is a function of 2 problems:

1. Mainstream Search - This is the search for information that a lot of
people want to know: Britney Spears, how good the new Transformers movie is,
1 + 1 = ?, etc. I think I would just go through slowly and optimize each page
to show results from the various information portals on the web, then make
competing websites bid for positions, i.e. list game reviews from GameSpot,
IGN, etc. with Rotten Tomatoes' algorithm.

2. Long Tail Search - This is random information throughout the web. I don't
think there is a better way to aggregate this data than what traditional
search engines are doing. Perhaps look into more advanced spam filtering
algorithms, but that's a tweak, not a feature revolution. I'd probably just
use something like Yahoo BOSS to get started.

------
byrneseyeview
There's room for a search engine that can take more advanced boolean queries
and can handle custom ranking. For example, a while ago I was trying to find a
section on a blog that mentioned someone whose last name is "White". The blog
mentions politics a lot, so nearly every article also had the phrase "white
house". It would be nice to search for mentions on the blog of "white" _not
including_ "white house", but not excluding it, either.

The closest you can get on Google is to search for white and NOT "white
house," then search for "white house," and search within the results for
"white" to see if anything else pops up.

------
mbenjaminsmith
Not sure I have an answer, but after writing a niche search engine I have this
observation:

The more narrowly focused you get, the less keyword rich the data is likely to
become. This creates obvious problems. I've been working on a search engine
where last.fm is a major source of data, and their data is comprehensive, but
keyword poor. How to work around this? There are ways but they're far from
trivial or resource friendly.

------
moonchuck
Unless you are ready to get your hands dirty with the semantic side of search,
I would focus on creating a more enjoyable and emotionally appealing
experience for a particular demographic of consumers. Maybe teens, students,
mothers, whatever...focus on building a loyal, targeted user base.

Otherwise you run the risk of being just a couple of cool features that the
big boys can use as inspiration for their own work.

~~~
matt1
PowerSet comes to mind.

How hard was it to do what they did? (I know nothing about semantic search
with the exception of what it is.)

~~~
jdrock
It's hard. Natural language processing and semantic analysis are fairly deep
fields of knowledge.

One piece of evidence of its complexity is PowerSet itself. PowerSet launched
being able to search just Wikipedia. Wikipedia is a highly, highly structured
body of text that is much, much easier for NLP and semantic technology to
analyze. Taking the same technology to the garbled soup that is the web is a
whole different ballgame.

------
jdrock
Most of the ideas mentioned here don't really consider the costs involved in
building a search engine. New technologies/concepts/ideas are great and all,
but unless you're building a very niche/vertical search engine, you're going
to need several million dollars in servers.

Any strategy on building a search engine needs to address the costs. (Raising
VC money is not an answer.)

------
byrneseyeview
Or you could do one that takes data like clicks on new stories, and uses it to
promote old stuff. For example, take political news: if there's a news story
about a politician's sex scandal, older content that mentions the politician
and sex scandals could get a bump; if there's a story about a politician
demanding a tax cut, older stories about the politician's attitude towards
taxes could be promoted, instead.
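
One way to sketch that bump (Python; the archive, the term overlap measure,
and the boost constant are all illustrative assumptions):

```python
# Hypothetical archive: older documents with topic terms and a base score.
archive = [
    {"url": "http://example.com/2005-tax-vote", "terms": {"senator", "tax"},     "base": 0.3},
    {"url": "http://example.com/2004-scandal",  "terms": {"senator", "scandal"}, "base": 0.4},
]

def promote(archive, trending_terms, boost=0.5):
    # Overlap with the trending story's terms adds a temporary boost.
    ranked = []
    for doc in archive:
        overlap = len(doc["terms"] & trending_terms) / len(trending_terms)
        ranked.append((doc["base"] + boost * overlap, doc["url"]))
    return [url for _, url in sorted(ranked, reverse=True)]

# A breaking tax-cut story lifts the old tax-vote piece above the scandal piece.
print(promote(archive, {"senator", "tax"}))
```

Clicks on the fresh story would supply the trending terms; the boost decays
once the story does.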

------
sh1mmer
I'd use one of the many search engine APIs available (BOSS, Bing, etc.) to
build a vertical search engine that enhances the basic results provided to me.

Re-inventing crawling, relevance clustering, etc. isn't worth the trouble or
the cost. Finding ways to enhance a specific market segment, however, would
be a differentiator worth pursuing.

disclaimer: I work for Y!

------
kasunh
I would opt for a completely new type of search engine, one with completely
different features from present-day search engines:

1) A mashup type of search engine, which would pull results from different
locations and combine them into one result

2) An intelligent search engine, which gives one or a few accurate answers to
whatever we ask of it.

------
jamesgpearce
Very simple.

Mobile search today is an absolute fright.

(Google is about as good at mobile search as Alta Vista was at web search 10
years ago. Why not be this decade's Google and show them how it could be
done?)

------
mark_l_watson
I would try to build something like clusty.com, except I would try to identify
themes in user search requests and track them for each user over time, using
click-throughs per user to keep search results on theme.

------
thorax
We focused on a specific niche.

<http://www.errorhelp.com>

------
jakestorm
I would make this engine: <http://www.badabingle.com> because what else do you
need?

~~~
matt1
That's cheating :)

------
clistctrl
like you can compete with cuil

