
Ask HN: How would you build a search engine in 2019? - throwaway13000
So, I was wondering how to build a competitor to Google. We have Common Crawl and the Internet Archive. Distributed systems are pretty well understood. How would one go about building a search engine in 2019?

Would you do what DuckDuckGo did, which is to use somebody else's index and ranking, or would you build your own index from Common Crawl? How are Ecosia and Startpage.com able to stay profitable without doing either?

Does that mean we can have many niche search engines? Can we crawl Common Crawl and build an index for less than $10K (what one individual can do out of pocket)?
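For a sense of the mechanics involved, here is a toy inverted index in Python. A real pipeline would stream Common Crawl's WET (extracted-text) archives rather than a hardcoded sample; the documents below are invented for illustration:

```python
from collections import defaultdict

def build_index(docs):
    # Map each term to the set of document IDs containing it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    # Intersect posting sets for a simple AND query.
    terms = query.lower().split()
    if not terms:
        return set()
    result = index.get(terms[0], set()).copy()
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

docs = {
    "a": "building a search engine from common crawl",
    "b": "distributed systems are well understood",
    "c": "common crawl publishes monthly crawl archives",
}
idx = build_index(docs)
print(sorted(search(idx, "common crawl")))  # → ['a', 'c']
```

The expensive part is not this data structure but fetching, parsing, and storing billions of pages, which is where the $10K question really lives.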
======
bwb
I think the why changes the approach...

My takeaway from the last 10 years is that a lot of the info on
websites that was written by real people has disappeared, replaced by spammy
blogs and heavily commercial sites. Most of the real info has moved
into Facebook groups, Quora, Reddit comments, Slack, Twitter, and so on.

The problem is that those are all closed-door ecosystems in a lot of ways, and the
lasting knowledge is hard to separate from the ephemeral chatter.

I think if I were going to approach this, I would build software that users run,
or browser add-ons that let users tag and save information in some kind of
format, which then feeds into a knowledge search index.

For example, I am a member of several FB groups for expats where I live.
There are great pieces of wisdom and hard-to-find info in there. I'd love
to be able to say, via a Chrome extension, "save this, and here is a little
context" (or have it infer the context from the format, if it can).

Then try to figure out how to make that public and searchable.
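As a sketch of what such an extension might save, here is a toy snippet store; the field names and tag-search are invented for illustration, not a real format:

```python
import time

def make_snippet(url, text, tags, context=""):
    # Hypothetical record a browser extension might save for a tagged snippet.
    return {
        "url": url,
        "text": text,
        "tags": sorted(tags),
        "context": context,
        "saved_at": int(time.time()),
    }

def find_by_tag(snippets, tag):
    # Naive tag lookup over the saved records.
    return [s for s in snippets if tag in s["tags"]]

store = [
    make_snippet("https://example.com/visa", "Renew at the main office.",
                 {"expat", "visa"}, context="from an FB group thread"),
    make_snippet("https://example.com/tax", "File before March.",
                 {"expat", "tax"}),
]
print(find_by_tag(store, "visa")[0]["url"])  # → https://example.com/visa
```

Making records like these public and searchable is then "just" an aggregation and ranking problem.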

~~~
ai_ia
>I think if I were going to approach this, I would build software that users
run, or browser add-ons that let users tag and save information in some kind
of format, which then feeds into a knowledge search index.

Something like Delicious or Pinboard, I suppose.

I agree. A social bookmarking website would be pretty interesting if done
right.

~~~
bwb
Right, personal storage fed into a centralized system: a human knowledge search
engine. I think that was some of the magic of the early internet, but it's now
buried in FB groups and the other locations I mentioned.

~~~
K0SM0S
Cue some deep learning over this centralized dataset... we might just see
'profiles' emerge that allow the system to build very ad hoc results depending
on who searches what.

E.g. one of the strengths of Google (which I've used for a decade without ever
deleting history) is to give me very hands-on blogs and procedures whenever I
search for some tech. I don't even have to add "guide" or "tutorial" or
"hands-on" or "how to"; whenever I use my dev account, Google just knows what
I'm after: _"this guy MAKES, he don't care for fluff"_. I've compared
searches with friends (non-tech, non-nerds): they get the usual wiki / reddit /
commercial stuff, with my meaty top-10 stuff often on page 2-3-4 for them
(if even that!).

This is the kind of smarts I don't know how to replicate without ML. It is
also where Spotify, which has had years to figure it out, remains extremely
generic (frankly bad at deep cuts, which at a basic level used to be one of
its initial strengths).
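A crude non-ML stand-in for that kind of personalization: re-rank results by their overlap with terms from the user's search history. Everything here (titles, the profile terms, the scoring) is invented for illustration:

```python
def rerank(results, profile_terms):
    # Boost results whose titles share words with the user's profile.
    def score(result):
        words = set(result["title"].lower().split())
        return len(words & profile_terms)
    return sorted(results, key=score, reverse=True)

results = [
    {"title": "Wikipedia overview of Rust"},
    {"title": "Hands-on Rust tutorial with code"},
]
# A "dev" profile accumulated from past clicks (hypothetical).
dev_profile = {"tutorial", "hands-on", "code", "guide"}
print(rerank(results, dev_profile)[0]["title"])  # → Hands-on Rust tutorial with code
```

Real personalization learns the profile automatically, but even a static term set changes the ordering noticeably.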

------
rossdavidh
So, if you want to make a competitor to Google, you need your own index, but
it would be prohibitively expensive to build one at Google's scale. Instead,
you need a search engine that simply declines to index most websites, because
they are clearly not what your users are looking for. That way, you "only"
index the 1% or so of the internet that is in your niche. Some ideas:

- indie websites only (no news, no Medium, no ecommerce, etc.), for those who want to find individuals who still maintain their own website and say something interesting on it

- low-size websites only, for people with very low bandwidth; anything above a certain size (e.g. 1 MB) doesn't get indexed

- recipes (but there are some niche websites for this already)

- websites with no ads on them (but this may conflict with your business model, if you have one)

- websites focused on a certain geographic area (e.g. websites with information by, for, and about Texas, or Slovakia, or Buenos Aires)

- websites with no JavaScript on them (for people who want to be able to turn off JavaScript, but don't have a good way of finding out which websites they can still use to get a particular piece of info)
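Most of these niche filters boil down to an admission predicate applied at crawl time. A toy sketch, with arbitrary thresholds and made-up page metadata:

```python
def admit(page):
    # Illustrative admission rules for a niche index: size cap,
    # no ads, no JavaScript. Thresholds are arbitrary.
    if page["size_bytes"] > 1_000_000:  # low-bandwidth cap (~1 MB)
        return False
    if page["has_ads"]:
        return False
    if page["uses_javascript"]:
        return False
    return True

pages = [
    {"url": "a", "size_bytes": 40_000, "has_ads": False, "uses_javascript": False},
    {"url": "b", "size_bytes": 3_000_000, "has_ads": False, "uses_javascript": False},
    {"url": "c", "size_bytes": 20_000, "has_ads": True, "uses_javascript": True},
]
print([p["url"] for p in pages if admit(p)])  # → ['a']
```

The hard part in practice is computing the metadata (detecting ads, measuring transfer size) reliably, not applying the predicate.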

------
pouta
I've been having this idea of crowdsourcing a better Google via a browser
extension.

Every time I repeat a search query and end up finding the answer on the same
website I visited before, I wish I had marked that link as the 'definitive answer'.

The next time I search for the same information, the extension would point me
directly to my previously marked link.

Maybe by letting people subscribe to each other's answers I could
bootstrap a better Google. Developers seem like a good initial target market.
Students too...
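The core of such an extension could be as simple as a map from a normalized query to the marked link. The normalization below is deliberately naive (sorted lowercase words) and everything is invented for illustration:

```python
def normalize(query):
    # Naive normalization: same word set => same key.
    return " ".join(sorted(query.lower().split()))

class AnswerStore:
    def __init__(self):
        self.answers = {}

    def mark(self, query, url):
        # Remember which link answered this query.
        self.answers[normalize(query)] = url

    def lookup(self, query):
        # Surface the marked link on a repeat search, if any.
        return self.answers.get(normalize(query))

store = AnswerStore()
store.mark("python list sort stability", "https://example.com/sort-faq")
print(store.lookup("stability sort python list"))  # → https://example.com/sort-faq
```

Sharing these stores between subscribed users is then a sync problem rather than a search problem.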

------
ian0
In my mind there are two flavours of queries. The first are "ctrl-F" in
nature, i.e. I want to query the entire web to find string x; these are
obviously better suited to indexes. The second are more knowledge-based in
nature, i.e. I want to find quality information about a certain topic. These
benefit from curation, as anyone who has added "stackoverflow" or "reddit" or
"wiki" to a Google query will know.

So I would start with curated, crowdsourced first-page results for the
top x% of these knowledge-based queries, with Wikipedia-like guidelines and
moderation to ensure the quality of the sites mentioned is up to scratch,
coupled with a built-in feedback mechanism for people browsing. I think
Wikipedia has proven that, while difficult, a scheme like this is indeed
possible.
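A minimal sketch of such a curation scheme, with propose/feedback operations and an invented scoring rule (URLs and queries are illustrative):

```python
from collections import defaultdict

class CuratedPage:
    def __init__(self):
        # (query, url) -> accumulated feedback score
        self.scores = defaultdict(int)

    def propose(self, query, url):
        # An editor registers a URL for a query at a neutral score.
        self.scores[(query, url)] += 0

    def feedback(self, query, url, helpful):
        # Browsing feedback nudges the score up or down.
        self.scores[(query, url)] += 1 if helpful else -1

    def first_page(self, query, n=10):
        # Top-scored URLs form the curated first page.
        entries = [(u, s) for (q, u), s in self.scores.items() if q == query]
        return [u for u, s in sorted(entries, key=lambda e: -e[1])[:n]]

page = CuratedPage()
page.propose("motorcycle reviews", "https://example.com/deep-review")
page.propose("motorcycle reviews", "https://example.com/spam")
page.feedback("motorcycle reviews", "https://example.com/deep-review", True)
page.feedback("motorcycle reviews", "https://example.com/spam", False)
print(page.first_page("motorcycle reviews"))
```

The Wikipedia-style hard problems (vandalism, brigading, dispute resolution) live in the moderation layer around this, not in the data structure.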

I also think you can start very niche and play with the results structure. For
example, I like motorcycles and after years of browsing I have discovered the
best places for reviews and information. Even just this use case could benefit
from a better structure of results page and removal of all the spammy sites.
The same goes for other niches like cooking & programming languages.

~~~
ian0
Another interesting one is illustrating "depth" of knowledge on a results
page. For example, I type "Sagittarius A*". It would be nice to have the
results grouped by depth, from general-audience explainer videos on
YouTube down to arXiv results, with current news separated from general
informational content.

This works for a lot of niches. For example, motorcycle reviews run from
entertaining comparisons down to detailed analysis. Recipes for omelettes run
from the basics, through adding interesting ingredients, right through to the
art of perfectly preparing a difficult-to-make French omelette.
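Grouping by depth could be as simple as attaching a depth label to each result and clustering in a fixed order. The labels and results below are made up for illustration:

```python
# Hypothetical depth ladder, shallowest first.
DEPTH_ORDER = ["explainer", "intermediate", "technical", "primary"]

def group_by_depth(results):
    # Bucket result titles by their depth label.
    groups = {d: [] for d in DEPTH_ORDER}
    for r in results:
        groups[r["depth"]].append(r["title"])
    return groups

results = [
    {"title": "Sagittarius A* explained in 5 minutes", "depth": "explainer"},
    {"title": "EHT imaging paper (arXiv)", "depth": "primary"},
    {"title": "Black hole accretion overview", "depth": "intermediate"},
]
grouped = group_by_depth(results)
print([d for d in DEPTH_ORDER if grouped[d]])  # → ['explainer', 'intermediate', 'primary']
```

Assigning the labels automatically is the genuinely hard part; readability metrics or source type (YouTube vs. arXiv) are plausible proxies.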

------
freediver
I like this topic a lot.

Obviously, the easiest route is to build on top of an existing index. This in
turn makes it a purely marketing play (DDG's marketing play is "privacy").

Here is one exploration of the following concept: a search engine built on top
of high-quality sites only (as vetted by HN submission history):

[https://cse.google.com/cse?cx=014479775183020491825:c2lrlzro...](https://cse.google.com/cse?cx=014479775183020491825:c2lrlzrogb5)

Described in full here:
[https://news.ycombinator.com/item?id=21209358](https://news.ycombinator.com/item?id=21209358)

------
rootshelled
You can't build something that competes with Google without some very, very
deep pockets and lots of data.

The same problem applies to niche search engines: unless they have some unique
features or properties that Google doesn't, you are plainly better off with
Google.

Google has a monopoly on search which, looking at the market, will hold for
the foreseeable future.

So unless you can offer a specific feature set for a niche, or you just want
to build it for the hell of it, I wouldn't recommend anyone go into search
engines.

I would probably go with what DuckDuckGo does, but offer unique features that
are useful for one or more niches.

------
probinso
I would build a discovery system. Search implies that you know what
you're looking for; it also implies a short query.

Imagine instead that you start writing out your ideas in natural form.
Documents appear that are relevant to your ideas, but with the goal of
diversifying by category: instead of the top 100 results, you may get only
a few results per category.

As you continue to write, the relevant documents become more constrained, but
the system keeps trying to maximize diversity.

I don't know if this would work for everything, but it would be interesting.
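This is close to greedy diversified selection, in the spirit of maximal marginal relevance. A toy version that caps results at one document per category (documents, categories, and relevance scores are invented):

```python
def diversify(docs, k=3):
    # Pick up to k docs, most relevant first, at most one per category.
    chosen, seen_categories = [], set()
    for doc in sorted(docs, key=lambda d: -d["relevance"]):
        if doc["category"] not in seen_categories:
            chosen.append(doc["title"])
            seen_categories.add(doc["category"])
        if len(chosen) == k:
            break
    return chosen

docs = [
    {"title": "Intro blog post", "category": "blog", "relevance": 0.9},
    {"title": "Another blog post", "category": "blog", "relevance": 0.8},
    {"title": "Survey paper", "category": "paper", "relevance": 0.7},
    {"title": "Conference talk", "category": "video", "relevance": 0.6},
]
print(diversify(docs))  # → ['Intro blog post', 'Survey paper', 'Conference talk']
```

As the user keeps writing, relevance scores would shift and the same selection rule would tighten the pool while keeping the categories spread.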

------
hakejpc
Can we leverage torrents somehow to create distributed indices?

~~~
rootshelled
Wouldn't blockchain make more sense? (Never thought I would say this but here
we are)

The problem with torrents/blockchain in this case is twofold:

- unless enough people are using it, it's slow

- since you are basically crowdsourcing, you need to create an enticing
incentive that works for everyone involved

~~~
fortran77
Torrents + Blockchain + _Rust_ would be even better!

~~~
quickthrower2
Don’t forget webassembly

------
kamutuna
With something like ProxyCrawl you should be able to build the crawler
yourself and start crawling to build a big index. Once you get big enough and
start bringing traffic to the sites, you can stop using them, create your own
user agent, and ask sites to whitelist you. It will take some time, but it
can be done.
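Whatever user agent you end up with, you'd want to honor robots.txt from day one. Python's standard `urllib.robotparser` handles this; the robots.txt is inlined here so the sketch runs offline, and the "MyNicheBot" agent name is hypothetical:

```python
import urllib.robotparser

# Example robots.txt: everyone is barred from /private/ except
# the (hypothetical) whitelisted MyNicheBot agent.
robots_txt = """\
User-agent: *
Disallow: /private/

User-agent: MyNicheBot
Disallow:
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("SomeOtherBot", "https://example.com/private/page"))  # → False
print(rp.can_fetch("MyNicheBot", "https://example.com/private/page"))    # → True
```

A real crawler would fetch each site's /robots.txt (e.g. via `rp.set_url(...)` plus `rp.read()`) and also throttle per host, which is what makes sites willing to whitelist you in the first place.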

