
There is an irony in Google preventing web scraping given that their business is pretty much built on web scraping.



Why is there irony in that? Anyone can go build a crawler and scrape the web the way Google scrapes it so they can compete with Google. Google protecting its site from scraping means you can't compete with Google using Google's own resources.

That said, automated research fascinates me; I wouldn't want to scrape Google to make my own Google, but rather to make private repositories of information that I can then query efficiently. I would love to find any kind of scriptable search engine access, paid or free. Not entirely sure how to look, though.


Think different. Try Bing; it has an API.
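Something along these lines, with a subscription key from Azure (a sketch from memory of the v7 docs; the endpoint and field names may have changed):

    # Minimal sketch of a Bing Web Search API (v7) query.
    # Assumes an Azure subscription key; verify the endpoint against current docs.
    import requests

    API_KEY = "YOUR_SUBSCRIPTION_KEY"  # placeholder

    def bing_search(query, count=10):
        resp = requests.get(
            "https://api.bing.microsoft.com/v7.0/search",
            headers={"Ocp-Apim-Subscription-Key": API_KEY},
            params={"q": query, "count": count},
            timeout=10,
        )
        resp.raise_for_status()
        results = resp.json().get("webPages", {}).get("value", [])
        return [(r["name"], r["url"]) for r in results]

    print(bing_search("scriptable search engine access"))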

I think Bing is close to Google in quality. Some people might even like it better. On the other hand I think DDG is the Sprint of search engines.

Google used to have a search API and they discontinued it because they said most of the people who used it were SEO people.

People who do pay-per-click are into A/B testing and other quantitative testing. Google is all for you doing that if you pay for advertising. Their main anti-SEO tactic is doing arbitrary and random things to make it impossible for SEOs to go at it quantitatively. (They have patents on this!)

One reason so many sites go to a harvesting business model is that once a site is established you can make the slightest change and your search rankings plummet. If you depend on search engine traffic it is a huge risk that you can't do anything about, unless you are about.com (which bought a 'competitive' search engine and just might be able to make an antitrust case against Google).


Can you elaborate more on this statement? "On the other hand I think DDG is the Sprint of search engines."

I've been interested in switching to DDG for a while, but as a former Sprint customer that statement scares me. Maybe some explanation from you would help me understand your opinion better.


I'm not sure about the comparison itself... I've tried DDG several times, I search for technical things in generic ways a lot. DDG almost never gives me what I want in the first page. Google almost always does.


Same here. It’s hard to blame DDG though: Google’s search index of Stack Overflow is better than SO’s own.


It's not a matter of blame at all... I'd love to see some challengers. In the end, Google knows a lot about me and is really good at delivering personalized results because of it.


Is there any site that indexes itself better than Google?


Maybe if I used DDG more I would learn to parse the results better but the first thing I see are many results that have a "dark pattern" appearance to me.

Often I do get a good result on the first page but often results #1-#N vary from third rate to non-sequitur and then result #N+1 is the one that should be at #1, where maybe N is drawn from Uniform(3,6). I see this so much I can't imagine it is an accident. If anything it seems to be 70% more evil than Google.


I find DDG to be slightly better for technical things. It is pretty similar though.

The real difference is non-technical things. Google filters out unflattering results and one side of anything even remotely political. It's a nerfed world, kind of like a Disney theme park. I'm an adult and I don't need to be led with blinders to the googly viewpoint.


> I think bing is close to Google in quality. Some people might even like it better. On the other hand I think DDG is the Sprint of search engines.

Isn't DDG just a Bing wrapper with a few frills in the results?


The irony is the "do as I say, not as I do".


Google's web scrapers obey robots.txt; you can stop Google from crawling your website if you want. Google doesn't want you crawling their website.
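For instance, two lines of robots.txt are enough to keep Googlebot out entirely:

    User-agent: Googlebot
    Disallow: /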

That word, I don't think it means what you think it means.


Google supports consensual scraping, and respects sites that opt out (using robots.txt), just as it has opted out itself. It's no more ironic than someone selling a product they don't happen to use themselves.


I think there's a credible argument that it's not purely consensual. Websites are forced to allow search engines with a lot of market share to scrape them or they won't be found.

No matter how well-intentioned you are, if you write your own scraper and have it abide by robots.txt, you'll never get nearly the same access as Google or Bing. Many websites allow only those big crawlers and ban everything else outright.

I don't have anything against the large search engines, it's just not really easy to say no to their scrapers for most websites.


I didn't consent to all this debt. It was just not really easy to say no to all these great credit cards.


The irony is that if every site protected themselves from web scraping, there would be no Google.


>Anyone can go build a crawler and scrape the web the way Google scrapes it so they can compete with Google.

Unfortunately that is not the case. Many paywalled sites will let googlebot index their content but block other crawlers.

They may have good reasons for doing that in some cases, but as a consequence the level playing field you're talking about no longer exists.

Also, the purpose of using Google as part of some automated process is usually not to compete with Google's search engine, but to complete some specific and limited task.

I don't understand why Google does not have a general search API offering. I'm sure many people would happily pay for it.


I have had web crawlers from China crawl my site multiple times a day but never send me traffic. Same with Yandex. I like the bing search engine but often it does not like my site. If it doesn't send any traffic, why let them run up my AWS bill?


I understand that, but I think there are good reasons why we shouldn't always act in the narrowest sense of our self interest (provided we have enough financial wiggle room).

A search monopoly is not good for website owners. It makes us very dependent on the whims of that monopolist.

If you block all crawlers that don't already have a large market share and send back a lot of traffic, you're killing any possibility for new competitors to get a foot in the door.

Also, you're killing any chance for something unexpected to happen, such as someone having a great idea based on crawled data that could change all our lives for the better without ever sending traffic to your site.

Now, I'm not telling you what you can and cannot afford. If crawlers cost me a ton of money that I didn't have, I would certainly act exactly as you suggested.


> It makes us very dependent on the whims of that monopolist.

Very true. Only allow Google and you are helping them build their monopoly. And once they have a full monopoly they can do whatever they want, including asking you for money to be included in the search results.


I can't even imagine how many businesses would be ecstatic about the ability to do this. Might as well cut out the SEO middleman.


>Many paywalled sites will let googlebot index their content but block other crawlers.

Doesn't that infringe upon Google's own rules? I always thought Google didn't like it when sites served its crawler content that's different from what users get when they follow Google's link.


That's why many paywalled sites give you a few free articles per month if you're coming from a Google search results page.

But it no longer works on all sites. Maybe the rule has been dropped now that paywalls are becoming more popular (with publishers, that is).


But as he describes it, the GP is not trying to "compete" with Google (whatever that means); he is only trying to do some comprehensive searches.

He is not selling advertising.

He is not even running a public website.

Google is preventing you from using automation to create private (i.e. personal) repositories of information, even when that information is public and (ironically) Google itself relied on automation ("bots") to collect it.


> I wouldn't want to scrape Google to make my own Google, but rather to make private repositories of information that I can then query efficiently.

That's what Apify in the original post does, including a public database of scrapers, so there is a good chance you could use an already finished scraper :).


> Anyone can go build a crawler and scrape the web the way Google scrapes it so they can compete with Google.

I don't think that making a scraper will make you competitive with Google. If you can make a site-ranking algorithm that competes with Google's, on the other hand, you might have a chance.


The site ranking algorithm is a solved problem.

The one reason Google is competitive is that they take advantage of the cheap labour that keeps track of ranking manipulation.

Luckily most of the search problems have nothing to do with ranking manipulation.
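For reference, the textbook version of that algorithm (PageRank) fits in a few lines. A toy sketch that ignores everything which actually makes ranking hard (spam, freshness, personalization):

    # Toy PageRank power iteration over a tiny made-up link graph.
    def pagerank(links, damping=0.85, iterations=50):
        pages = list(links)
        rank = {p: 1.0 / len(pages) for p in pages}
        for _ in range(iterations):
            new_rank = {p: (1 - damping) / len(pages) for p in pages}
            for page, outgoing in links.items():
                if not outgoing:  # dangling page: spread its rank evenly
                    for p in pages:
                        new_rank[p] += damping * rank[page] / len(pages)
                else:
                    for target in outgoing:
                        new_rank[target] += damping * rank[page] / len(outgoing)
            rank = new_rank
        return rank

    print(pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]}))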


Site ranking is not a “solved problem”: Google tries to solve it all the time, and yet finding anything other than trending or popular stuff still takes more than several attempts (and often doesn’t even surface the best results).


Google has a set of contradictory requirements for the interface they've got on their website.

On one side it's a natural-language interface along the lines of Alexa; on the other side it's a search interface for people who generally just need access to information.

If Google exposed interfaces similar to Elasticsearch, search would never be an issue anymore, but they would not be easy for ordinary users to use.
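Something like this kind of structured query, say, posted straight to an Elasticsearch _search endpoint (a sketch; the index and field names are made up):

    # Sketch of the structured querying an Elasticsearch-style interface allows.
    # "webpages", "body" and "language" are hypothetical index/field names.
    import requests

    query = {
        "query": {
            "bool": {
                "must": [{"match": {"body": "blend modes"}}],
                "filter": [{"term": {"language": "javascript"}}],
            }
        },
        "size": 10,
    }

    resp = requests.post("http://localhost:9200/webpages/_search", json=query, timeout=10)
    for hit in resp.json()["hits"]["hits"]:
        print(hit["_score"], hit["_source"].get("url"))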


That’s precisely why they don’t want you cheating the hard part and just storing the results. It makes sense to me. Work on your own machine learning if you want good results.


There is no such thing as cheating, only staying within boundaries that don't land you in jail or get you sued in your own jurisdiction. If you can get an edge by using Google's own data, do so.


Bing bing bing!

Er, ughm. I mean,

Ding ding ding!


You could argue that Google should work on their own knowledge database instead of learning from other people's content and/or presenting other people's content in their own frontends (shopping etc)...


This is what Common Crawl does: http://commoncrawl.org/. I think more people should know about it.
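As a sketch, their URL index (the CDX API) can be queried directly; the CC-MAIN collection id below is only an example and changes with every crawl:

    # Query the Common Crawl CDX index for captures of a domain.
    # The collection id is an example; pick a current one from commoncrawl.org.
    import requests

    resp = requests.get(
        "https://index.commoncrawl.org/CC-MAIN-2023-50-index",
        params={"url": "example.com/*", "output": "json", "limit": 5},
        timeout=30,
    )
    for line in resp.text.splitlines():
        print(line)  # one JSON record per capture: url, WARC filename, offset, length, ...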


> Anyone can go build a crawler and scrape the web the way Google scrapes it so they can compete with Google.

They cannot. Googlebot & some other search engine bots (like Bing's & Yandex's) get special treatment on various websites. This includes things like bans on non-whitelisted scrapers & bypassing paywalls. If you are not already an established player in the field, you will not be able to scrape the same websites that the established players can.
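That special treatment usually isn't just a user-agent check, either. A rough sketch of the double DNS lookup sites commonly use to verify the real Googlebot:

    # Verify a claimed Googlebot IP: reverse-resolve it, check the domain,
    # then forward-resolve the name and confirm it maps back to the same IP.
    import socket

    def is_real_googlebot(ip):
        try:
            host = socket.gethostbyaddr(ip)[0]  # e.g. crawl-66-249-66-1.googlebot.com
        except socket.herror:
            return False
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        try:
            return ip in socket.gethostbyname_ex(host)[2]
        except socket.gaierror:
            return False

    print(is_real_googlebot("66.249.66.1"))  # an address in Google's crawl range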


As I understand it, this was the rationale behind the courts' decision to prohibit LinkedIn from banning people from scraping public profiles.

Basically it was anti competitive to grant certain privileges to major players around 'public data,' but to block smaller players.

No telling if/when ramifications from that decision (last year) will hit existing anti-scraping measures, though.


Google crawls the web; they don't scrape it. There's a big difference.


> to make private repositories of information that I can then query efficiently

You and me both :)

I still haven't gotten around to doing much about it, but for example one thing I've been thinking about is to have my system integrated with my desktop so that it has some situational context.

For example, it would look at the programs that I have currently running.

Let's say that it saw that I had PyCharm open where I was editing some Python 3 files. Furthermore I also had Vim open where I was editing some HTML, CSS and JavaScript files.

It would maintain a list of all items that had been in focus during the previous 30 minutes or something.

When I then searched for, let's say, "sort list", it would look at that list and see that most recently I had been editing a Python file in PyCharm, so result number 1 would be how to sort a list in Python 3. Before that I had also focused Vim with a JS file, so sorting arrays in JS would be result number 2.

Results:

1. Python 3. Sort list "a_list". In-place: a_list.sort(). Build new sorted list from iterable: b_list = sorted(a_list).

2. JavaScript. Sort array "an_array". In-place: an_array.sort(). Create a new shallow copy and sort it: let another_array = an_array.concat().sort().

And if the system was even smarter, it would also be able to know details about what I'd been doing. For example it could see that while editing a JavaScript file I had most recently been writing code that was doing some operations with WebGL, and before that I was editing code that was changing style properties and before that something that was working with Canvas, so if I then search for blend, it would use this information.

Results:

1. WebGL Lesson 8 – the depth buffer, transparency and blending. http://learningwebgl.com/blog/?p=859

2. Basics of CSS Blend Modes. https://css-tricks.com/basics-css-blend-modes/

3. CanvasRenderingContext2D.globalCompositeOperation. https://developer.mozilla.org/en-US/docs/Web/API/CanvasRende...

Something like that.

And because it's for the limited amount of things that I am interested in and developed for myself only (as opposed to trying to give super relevant information for every person in the world), it might be doable to some extent.
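As a rough illustration of the re-ranking idea (all the names and the scoring rule here are made up):

    # Toy sketch: boost results whose context matches recently focused windows.
    import time

    focus_history = []  # (timestamp, context tag), newest appended last

    def record_focus(context_tag):
        focus_history.append((time.time(), context_tag))

    def context_weights(window_seconds=30 * 60):
        """More recently focused contexts get a higher weight."""
        cutoff = time.time() - window_seconds
        recent = [tag for ts, tag in focus_history if ts >= cutoff]
        return {tag: 1.0 + i for i, tag in enumerate(recent)}

    def rank(query, documents):
        """documents: list of (title, context_tag) pairs."""
        weights = context_weights()
        scored = []
        for title, tag in documents:
            base = 1.0 if query.lower() in title.lower() else 0.0
            scored.append((base * weights.get(tag, 0.1), title))
        return [title for score, title in sorted(scored, reverse=True) if score > 0]

    record_focus("javascript")
    record_focus("python")  # most recent focus wins
    docs = [
        ("How to sort a list in Python 3", "python"),
        ("Sorting arrays in JavaScript", "javascript"),
    ]
    print(rank("sort", docs))  # Python result first, JavaScript second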

Here is a book that might be of interest to you: Relevant Search (https://www.manning.com/books/relevant-search). I bought a copy myself but have yet to read it.


Shameless plug: it’s why we made SerpApi! (https://serpapi.com)
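A query is roughly one HTTP call (a sketch; double-check the exact parameter and field names against the docs):

    # Sketch of fetching Google results through SerpApi.
    import requests

    resp = requests.get(
        "https://serpapi.com/search.json",
        params={"engine": "google", "q": "web scraping", "api_key": "YOUR_KEY"},
        timeout=30,
    )
    for result in resp.json().get("organic_results", []):
        print(result["position"], result["title"], result["link"])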


I find SerpApi very interesting, but the big data plan is still very expensive for medium-sized companies. Does it really work for Google?


just scrape startpage.com


Google indexers respect robots.txt, so there goes the irony.


They also provide something of value to the operators of the websites they scrape, namely search traffic.


I've heard that they [sometimes] visit but don't index; is that true?



