
Google2Csv is a simple Google scraper that saves the results on a CSV file - panos_sa
https://github.com/psalias2006/Google2Csv
======
elektor
If you're looking to do any serious search engine scraping, you'll find
yourself needing to use proxies.

For my thesis which required millions of datapoints, I used this tool:
[http://www.scrapebox.com/](http://www.scrapebox.com/)

~~~
mateus1
This seems like what I was looking for. What kind of proxy were you using?

------
stinos
Nice, until the CSV part really. I mean you have a nice DataFrame there but
then it gets serialized using probably the worst format out there for totally
variable data like what you scrape from a site.

Unless the file uses the actual ASCII record separator you'll end up with a
CSV file which can only be read by a handful of software, after telling it
explicitly what the separator and quoting rules are. And even then it's hit or
miss. And it likely doesn't use the RS, because even though that greatly
increases the chance of the file being unambiguous, and the RS was actually
meant for exactly this, software typically doesn't use it: when CSV was
invented, its existence was unknown or ignored because it's not human readable
(I guess, I don't really know), and so the sad story began.

As you can see: I'm not a fan of CSV :) Just today I again had to waste time
because at one point in the development of this otherwise fine piece of
software I'm working on - even though I knew I'd regret it - I allowed it to
export CSV files. Customer moved software to another machine, forgot that they
once told the CSV exporter part to use the system settings, and now has CSV
files with a comma separator (you know, the C in CSV). Oh the irony, that's
not what they wanted.

~~~
iaabtpbtpnn
CSV is totally fine, and I don't understand why people have such an issue with
the format. It is very simple; here is the entire specification: the file
should be encoded in UTF-8, and fields are separated by the , character.
Fields may be quoted with ". If a field contains any of [",\r\n] then it must
be quoted. The quoting process is: replace all " characters in the field with
"" and then surround the field with ". I would expect a competent programmer
to be able to implement this specification _as an interview question_!

The existence of software that produces braindead CSVs is beside the point.
Don't use that software. Just do the above and everything will work.
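A minimal sketch of the quoting rules spelled out above (the helper names are made up for illustration):

```python
def quote_field(field: str) -> str:
    """Quote a field if it contains a comma, quote, CR, or LF:
    double any embedded quotes, then wrap the field in quotes."""
    if any(c in field for c in ',"\r\n'):
        return '"' + field.replace('"', '""') + '"'
    return field

def encode_row(fields) -> str:
    """Encode one record as a comma-separated line."""
    return ",".join(quote_field(f) for f in fields)

print(encode_row(['plain', 'has,comma', 'says "hi"', 'two\nlines']))
# plain,"has,comma","says ""hi""","two
# lines"
```

In practice Python's stdlib `csv` module already implements these rules, so hand-rolling this is rarely necessary.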

~~~
panopticon
The answer to your first question is in your last paragraph. CSV is an abysmal
format for data exchange because the standards are weak and implementations
are extremely varied.

If you're writing an (en|de)coder for a specific or limited target, then sure,
it's fine. Otherwise I would steer clear of it.

------
lazyjones
> _Scraping google search results is illegal._

IANAL, can someone please elaborate? This sounds wrong in several ways, one of
which being that Google results are almost exclusively scraped from somewhere
else already.

~~~
ttul
It is decidedly not illegal to scrape web sites. This was recently established
by a court [1].

[1]
[https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn](https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn)

------
jll29
Further to the practical/technical issue of being blocked, there is a legal
issue: this way of coding it (= not using an API key) violates Google's Terms
and Conditions.

~~~
xur17
Are Google's terms legally binding if you are just searching? I don't recall
ever agreeing to them in that case.

~~~
ggggtez
> is the ToS legally binding if you are just searching

Based on the LinkedIn case my guess is no, because it could be considered
"generally public". But that ruling is highly suspect and likely to go to the
supreme court, imo.

------
lbj
I made a wonderful little Google scraper in Clojure once. I was surprised to
see I got IP blocked after only 20 or so searches.

~~~
tantalor
A rite of passage.

------
nurettin
This is a great way of getting your ip permanently blacklisted and swamped by
captchas.

~~~
dvfjsdhgfv
It's not going to be permanently blocked, but the low rate limits are
annoying, so for any serious use a set of proxies is a must.

~~~
nurettin
I don't know about not being blocked. I have a VPS where I used a similar
naive script and it got blacklisted by Google. I sometimes tunnel through it,
and it has been getting captchas from Google for the past 7 years. That's
pretty permanent.

------
Minor49er
This is a cool project. I've used Scraperr[0] for years, but it's always great
to have alternatives.

[0] [http://scraperr.com/](http://scraperr.com/)

------
xur17
Neat!

Have you run into issues with getting blocked by Google / issued captchas?

~~~
damowangcy
Did something similar using requests; there's a rate limit around 80+ requests
(not sure how long it lasts though) no matter how much I delayed each request.

My workaround was to run it in the cloud with multiple IPs, so I just switch
between them. Mine's a one-time thing so this is workable.
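The switch-between-IPs workaround could be sketched roughly like this (the proxy addresses are placeholders; the commenter used `requests`, but this sketch sticks to the stdlib):

```python
import itertools
import time
import urllib.request

# Placeholder proxy endpoints -- substitute your own pool of IPs.
PROXIES = ["http://203.0.113.1:8080", "http://203.0.113.2:8080"]
proxy_cycle = itertools.cycle(PROXIES)

def fetch(url: str) -> str:
    """Fetch a URL through the next proxy in the pool,
    pausing between requests to stay under the rate limit."""
    proxy = next(proxy_cycle)
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy, "https": proxy}))
    time.sleep(5)  # space out requests
    with opener.open(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

This only spreads the load; each proxy IP is still subject to the same per-IP limit.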

If you're querying for news, Reddit might be a wise choice; they have quite a
few free APIs afaik.

