

Ask HN: Has Google/Bing won the search engine war? - palakchokshi

Recently I've been working on a project that required me to search a limited set of domains.
1) I tried Google Custom Search, but usage is capped at 10000 API calls per day at $50/day. I cannot go higher than this even if I wanted to pay.
2) So I decided to write a crawler that will visit these domains and index their pages so I can put them in Elasticsearch and create a basic search engine.
3) When I wrote the crawler I came across domains that do not allow crawling of their content, e.g. Yelp, Craigslist, etc. have locked down their sites against crawlers. However, I see results from these sites on Google.
4) It seems the only option is to use their APIs (if they provide one) to get data from these domains. This can be a nightmare to maintain as the number of domains increases.
5) I want to respect the domains' policies and not use shady tactics to crawl their pages.

So essentially these domains allow Google and Bing to crawl their sites because those engines are big and established, and not having their pages show up on Google or Bing would drastically hurt their web traffic, but smaller startups are left out in the cold.

So my question is: what is the possibility of a new search engine emerging if the web is locked down for crawlers?
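For what it's worth, the "respect the domains' policies" check can be done mechanically before fetching anything, using Python's stdlib robots.txt parser. A minimal sketch; the rules and bot names below are made up for illustration, but they mirror the whitelist-the-big-engines pattern described above:

```python
import urllib.robotparser

# An illustrative robots.txt, similar in spirit to what locked-down
# sites serve: named crawlers are allowed, everyone else is not.
SAMPLE_ROBOTS = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
""".splitlines()

rp = urllib.robotparser.RobotFileParser()
rp.parse(SAMPLE_ROBOTS)

# An established engine gets in; an unknown startup crawler does not.
print(rp.can_fetch("Googlebot", "/biz/some-restaurant"))      # True
print(rp.can_fetch("MyStartupBot", "/biz/some-restaurant"))   # False
```

A polite crawler would run this check (against the live `/robots.txt`, via `rp.set_url(...)` and `rp.read()`) before queueing any URL from a domain.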
======
krishna2
Have you tried blekko.com? The slashtags feature specifically does what you
are describing here.

Of course, using this method will not get you data from sites that explicitly
ban crawling (like yelp.com).

Disclaimer: I used to work at blekko and I implemented the slashtags feature.

~~~
greglindahl
blekko's "ScoutJet" crawler is allowed to crawl yelp. We sent them an email
before we launched, and they were nice enough to allow it.

~~~
palakchokshi
I see ScoutJet in Yelp's robots.txt file. I guess I'll have to ask them nicely
to allow my crawler too. But Yelp is one of many sites that don't allow
crawlers, so I'll have to ask each one nicely to allow mine. I had even set up
a temporary Google Site explaining what my crawler uses the data for
(essentially showing results with correct attribution and driving traffic
back to the original site).

------
xpose2000
This is slightly off-topic, but still very much related.

There will never be an all-encompassing search engine to rival Google. I would
argue one could not even rival Bing.

The future of search, as with most internet companies moving forward, is all
about mastering a niche. You may not be able to beat Google as an overall
search engine, but if you dedicate your time and money towards one specific
niche, say sports, then you have a fighting chance to beat them at that.

~~~
palakchokshi
The era in which Google and Bing became what they are is one in which the web
was more open. With walled gardens cropping up everywhere, it is becoming
tougher to create a search engine. Another point against an all-encompassing
search engine is the sheer size of the web today, which makes indexing the
entire web a daunting task. Google and Bing started when the web was much
smaller, and hence they were able to grow with the web.

IMHO there are multiple ways to create a search engine:

1) Master search within a vertical, e.g. sports, music, etc., and then expand to other verticals.

2) Go after a limited, popular area of the web and provide a more personalized search experience (this limits discoverability of less popular sites, but there are ways to overcome that).

3) Enhance the results of existing search engines by providing context, personalization, privacy features, etc. (this relies heavily on existing engines, which makes it subject to their TOS).

4) Come up with an algorithm that blows Google's or Bing's out of the water, so VCs line up to give you money to build the infrastructure to make it a reality (I wish I had this too... sadly I don't).

------
int64
If a service is exposed publicly on the web, it can be crawled regardless of
whatever guards the service provider puts in place. Browser emulation would
be a good start.

~~~
palakchokshi
I know we can crawl them; there are no technical issues with crawling them.
The issue is that I want to respect their TOS. There are multiple ways to
circumvent their anti-crawling code, but that means any new search engine will
have its roots in "shady" crawling tactics. IMHO crawling should be allowed
if the purpose of the crawler is to show results that drive traffic back to
the sites, not to mash up the content and deliver it as "original content" on
the site that crawled those domains. However, Yelp's TOS, for example, does
not allow even these types of crawlers that essentially drive traffic back to
Yelp.

~~~
argumentum
> The issue is I want to respect their TOS.

Permitting _certain_ search engines to crawl but not others is anticompetitive
and violates the principle of an open web.

> any new search engine will have its roots in "shady" tactics of crawling.

Who cares? If you are successful enough, then you'll negotiate with them
later. Don't worry about it now.

Crawl away!

~~~
palakchokshi
[http://yelp.com/robots.txt](http://yelp.com/robots.txt)

It seems you can contact Yelp and tell them how you plan to use their data and
maybe they'll let you crawl their site.

I really want to explore any and all alternative options before I decide to
crawl away :)

~~~
argumentum
Are you a _hacker_, or aren't you? Dumb rules exist to be broken.

~~~
palakchokshi
Agreed, but there's no harm in tapping the vast knowledge of the HN community
to exhaust alternatives before tightening my hacker cap and plunging in head
first.

------
dm2
It's all about innovating and offering something that Google doesn't offer,
such as privacy or smart widgets. See: DuckDuckGo.com

You're not going to crawl / index better than Google.

Maybe you could combine the DMOZ directory and Google results somehow.

Rather than replacing Google, try to supplement their services or just somehow
provide more value to users.

~~~
palakchokshi
You are right. I don't want to make a better search engine than Google's; I
do want to provide more personalized results than Google's. To do that, I
need to give my users a basic result set and then use their behavior as input
to provide a more personalized experience. And I want to be able to search
over a limited set of domains without hitting a 10000 req/day limit.

I typed "sushi" into duckduckgo.com and saw a result from Yelp. If
duckduckgo.com used Yelp's API to get that result, then you can imagine the
complexity of maintaining API connections to hundreds of services like Yelp
that do not allow crawling. However, if duckduckgo.com crawled/scraped Yelp
to get that result, then that's against Yelp's TOS, which would cause Yelp to
send a cease-and-desist order to duckduckgo.com.

IMHO any crawler that drives traffic back to the original site (like Google,
Bing, DuckDuckGo, etc.) should be allowed to crawl the site, but Yelp's TOS
prevents that.

~~~
timr
DDG uses Bing's search API. That's why you see Yelp on DDG.

I used to work at Yelp, and I can tell you that if there were no restrictions
on who crawled the site, we'd have been regularly DOSed by robots. I don't
(and didn't) like the precedent, but I understand some of the motivation.

~~~
palakchokshi
I understand the motivation too. My request is that there should be some
option for crawlers to apply to be whitelisted. I know Yelp asks you to
contact them if you want to crawl them, and they decide on a case-by-case
basis. It would be nice if there were a standard that sites could implement
to whitelist crawlers and create crawl schedules for them, something along
the lines of robots.txt but with attributes that tell a specific crawler
when and how frequently it can crawl.
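For what it's worth, a small piece of this already exists as the non-standard `Crawl-delay` directive, which a per-crawler whitelist entry can carry and which Python's stdlib parser understands (Python 3.6+). A sketch with made-up bot names and values:

```python
import urllib.robotparser

# Hypothetical robots.txt: a whitelisted crawler is allowed in,
# but asked to wait 10 seconds between requests; everyone else
# is disallowed entirely.
SAMPLE_ROBOTS = """\
User-agent: WhitelistedBot
Crawl-delay: 10
Disallow:

User-agent: *
Disallow: /
""".splitlines()

rp = urllib.robotparser.RobotFileParser()
rp.parse(SAMPLE_ROBOTS)

print(rp.can_fetch("WhitelistedBot", "/some/page"))  # True
print(rp.crawl_delay("WhitelistedBot"))              # 10
print(rp.can_fetch("RandomBot", "/some/page"))       # False
```

It's only a scheduling hint, not the application process described above, but a crawler that honors it is at least demonstrably polite.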

~~~
palakchokshi
So a "legit" crawler has no recourse but to implement technology that
circumvents anti-crawler code? Can't say I didn't try to go legit :)

~~~
timr
Well, no: you ask, and if they say no, you don't get to crawl. The rules don't
change because you think they're unjust.

If they catch you crawling, they'll send you a cease and desist letter, so do
it at your own risk.

~~~
palakchokshi
Agreed. I would rather not go down the route of building my product and then
having a bunch of cease-and-desist orders take away a lot of my search
results. Asking nicely is the next step for me.

------
rip747
if the cap on google custom search is the only roadblock, then why not use
bing's:

[http://datamarket.azure.com/dataset/8818F55E-2FE5-4CE3-A617-...](http://datamarket.azure.com/dataset/8818F55E-2FE5-4CE3-A617-0B8BA8419F65)

~~~
palakchokshi
The data returned by Bing is limited. For example, Bing does not return an
image with its results, while Google does. I want to show one image from the
page along with the title, description, and preview text. Bing has separate
searches for web and images; however, the results are not from the same
domains, so I can't correlate results and associate an image with a web
result. Secondly, Bing does not allow search across a limited set of domains.
The Yahoo BOSS API allows search across a limited set of domains, but it's
done by passing a comma-separated list in the URL, which means I can't put
1000 domains in the URL. Moreover, its search results are in a similar format
to Bing's.

Google Custom Search is the best option in terms of how it lets you specify
the domains you want to search, save that as a custom search engine, and then
use the custom search engine's ID in the API request. Furthermore, Google
Custom Search's JSON results are more comprehensive than either Bing's or
Yahoo BOSS's.
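For anyone comparing the options above, a Custom Search JSON API request is just a GET with a key, the search engine ID (`cx`), and a query. A minimal sketch of building the request URL; the key and `cx` values are placeholders you would get from the Google API console:

```python
from urllib.parse import urlencode

# Placeholder credentials -- substitute your own API key and
# custom search engine ID (cx) from the Google API console.
API_KEY = "YOUR_API_KEY"
CX = "YOUR_SEARCH_ENGINE_ID"

def custom_search_url(query: str, start: int = 1) -> str:
    """Build a Custom Search JSON API request URL.

    The cx identifies the custom search engine whose configured
    domain list restricts what gets searched -- the "limited set
    of domains" piece discussed above.
    """
    params = {"key": API_KEY, "cx": CX, "q": query, "start": start}
    return "https://www.googleapis.com/customsearch/v1?" + urlencode(params)

print(custom_search_url("sushi"))
```

As I understand it, the JSON response's `items` can include `pagemap` data (thumbnails and other page metadata), which is the richer result format being compared against Bing and BOSS here.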

~~~
rip747
Fair enough. Have you checked out Google enterprise search then?

[https://www.google.com/enterprise/search/products/gss.html#p...](https://www.google.com/enterprise/search/products/gss.html#pricing_content)

~~~
palakchokshi
Enterprise Search is for searching within your own site (or bunch of domains
you own). I actually use Google Search Appliance on my day job so I have a
good idea of what it does and it's not what I need.

