

How to scrape data from sites you can't log into - geoscripting
http://ssscripting.wordpress.com/2009/02/24/how-to-scrape-data-from-sites-you-cant-log-into/

======
tarmac
You're still technically logging in by providing the copied cookie. I don't
see any difference here.

Or say the title should be changed to "How to scrape _your_ data from _your_
sites that require login"

~~~
almost
Agreed, this is more of a "how to scrape data from sites you can log into". It's
not even a very useful example at that.

For anyone who wants to scrape sites that require login I'd recommend Python
with Twill. That lets you do the whole thing with ease.

~~~
geoscripting
Twill is an option indeed, but this way you don't miss out on the javascript.
You can take advantage of all of your browser's features.

~~~
almost
But the article is suggesting just copying cookies from Firefox into a simple
Java-based scraper. That won't support Javascript either.

If you needed Javascript you could use one of the Firefox scripting bridges
(Selenium or MozRepl).

------
timf
Seems to me this is just saying: log in via the browser (with the right
password), then use the generated cookie with the scraping code. You can also
do that with wget --load-cookies
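
A minimal Python sketch of that idea, using only the standard library. It assumes the cookie string was copied by hand from the browser; the URL, the `sessionid` name and the `SESSION_COOKIE` environment variable are hypothetical placeholders, not anything from the article.

```python
import urllib.request


def build_request(url, cookie_header, user_agent="example-scraper/0.1"):
    """Return a Request carrying a cookie string copied from the browser.

    cookie_header is the raw "name=value; name2=value2" text; the default
    user_agent is a made-up placeholder.
    """
    return urllib.request.Request(
        url,
        headers={"Cookie": cookie_header, "User-Agent": user_agent},
    )

# Usage (not run here; URL and env var are hypothetical):
#   import os
#   req = build_request("http://example.com/private", os.environ["SESSION_COOKIE"])
#   with urllib.request.urlopen(req) as resp:
#       html = resp.read()
```

Reading the cookie from the environment or a file means the script does not need to be edited when the session cookie changes.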

~~~
geoscripting
True, you can use wget as well, but this way you have more control over your
code.

~~~
timf
For sure, I was responding to the title, it sounded like a security thing.

~~~
geoscripting
What did you think this would be about? :)

------
apgwoz
But, if the site is using only HTTP and cookies, there's no reason not to
first make a request to the login page with the username/password and retrieve
the cookie via the "Set-Cookie" header that comes back... Did I totally misread
the article, or was it just dumb?
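
Roughly what that looks like in Python, as a sketch only. It assumes a plain HTML form login; the field names `username` and `password` and the example.com URLs are guesses that would have to be replaced with whatever the real form uses.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_session():
    """Return (opener, jar); the jar stores cookies from Set-Cookie headers."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def login_request(login_url, username, password):
    """Build the login POST. Field names are assumptions; check the real form."""
    body = urllib.parse.urlencode({"username": username, "password": password})
    return urllib.request.Request(login_url, data=body.encode("ascii"))

# Usage (not run here; URLs are hypothetical):
#   opener, jar = make_session()
#   opener.open(login_request("http://example.com/login", "me", "secret"))
#   page = opener.open("http://example.com/private").read()
```

Because the same opener is reused, the cookie returned by the login response is sent automatically on later requests.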

~~~
timf
A lot of scraping frameworks will handle it for you, too. It's even the front
page example code @ <http://jwebunit.sourceforge.net/>

I think most people are reacting to the title... I for example thought it was
a security posting.

~~~
apgwoz
Of course there are frameworks to do this for you, and I'm sure jwebunit will
work perfectly well. I guess I'm disappointed that the author didn't
understand the fact that Firefox doesn't perform magic, and that a login page
is no different than anything else being scraped.

~~~
ewiethoff
I'm disappointed the author didn't realize that he doesn't need Cookie Monster
to see Firefox cookies: Tools -> Options -> Privacy -> Show Cookies

------
babo
I do my scraping with BeautifulSoup from Python, actually from an IPython
shell. With a urllib2 opener you can handle cookies and the User-Agent header
pretty easily; the latter is also important for some sites.
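
A sketch of that setup for Python 3, where urllib2 became `urllib.request`. To stay within the standard library it swaps BeautifulSoup for `html.parser`, so the link extraction is a stand-in and not BeautifulSoup's API; the User-Agent string in the usage comment is made up.

```python
import http.cookiejar
import urllib.request
from html.parser import HTMLParser


def make_opener(user_agent):
    """Opener that keeps cookies and sends a custom User-Agent header."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    opener.addheaders = [("User-Agent", user_agent)]
    return opener


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags (stand-in for BeautifulSoup)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html):
    parser = LinkCollector()
    parser.feed(html)
    return parser.links

# Usage (not run here; URL and agent string are hypothetical):
#   opener = make_opener("Mozilla/5.0 (compatible; example-scraper)")
#   html = opener.open("http://example.com/").read().decode("utf-8", "replace")
#   print(extract_links(html))
```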

~~~
geoscripting
There's also mechanize for Python. It can be found here:
<http://wwwsearch.sourceforge.net/mechanize/>. It handles cookies by default,
and is a pretty good tool.

------
fugue88
man wget

wget --save-cookies cookies.txt ...

wget --load-cookies cookies.txt ...
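
If the rest of the scraper is in Python, the same cookies.txt can probably be reused there. This is a sketch, assuming the file is in the Netscape cookies.txt format and starts with the header line that `http.cookiejar.MozillaCookieJar` expects; a file whose header differs may be rejected with `LoadError`.

```python
import http.cookiejar
import urllib.request


def opener_from_cookie_file(path):
    """Load a Netscape-format cookies.txt and return (opener, jar)."""
    jar = http.cookiejar.MozillaCookieJar(path)
    jar.load()  # raises http.cookiejar.LoadError if the format is not recognised
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar

# Usage (not run here; URL is hypothetical):
#   opener, jar = opener_from_cookie_file("cookies.txt")
#   page = opener.open("http://example.com/private").read()
#   jar.save(ignore_discard=True)  # write back, keeping session cookies too
```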

------
rams
Can someone suggest a browser-independent way of handling login pages that
require javascript ? Twill doesn't handle javascript.

~~~
geoscripting
You could give HtmlUnit a try.

------
sho
Huh? What he did is completely automatable, you don't have to touch Firefox.
And if you really did need JS to log in, which I have _never_ seen outside
misguided banks, there are tools for that too. Selenium and JSSH come to mind
but for 99% of sites you'd just need Mechanize.

And Java? Why the hell would anyone write a _script_ in a compiled language
like Java? Desperate for that 2ms time saving between 10 second waits for the
pages to come down, eh? And any for-real scraping script would have a time
delay built in anyway.

The guy doesn't know what he's talking about.
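
For what it's worth, the time delay mentioned above can be sketched in a few lines of Python. The function name, the two-second default and the injectable `fetch` and `sleep` parameters are illustrative choices, not anything from the article.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=2.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # pause between requests, not before the first
        results.append(fetch(url))
    return results

# Usage (not run here): pass any callable that downloads one URL, e.g.
#   import urllib.request
#   pages = fetch_politely(urls, lambda u: urllib.request.urlopen(u).read())
```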

~~~
geoscripting
The example wasn't written in Java for performance gain. It so happened that I
had NetBeans open, and it was easier for me to write it in Java at the moment
:).

~~~
sho
Easier!? You wrote pages and pages describing the most inefficient way
imaginable to do something I can do in 5 lines of Ruby, and you call it
_easier?_ And unless I'm very much mistaken, you'd have to compile the code
anew whenever the cookie changed?

Well, good luck to you, and the more script kiddies you confuse the better I
guess, but there are seriously much better ways to do this. Go look at Ruby
Mechanize (I think it's also available for Python); coming from Java you will
be blown away by just how easy this kind of thing is. How do you think we all
test? ; )

Update: Oh I see you know Mechanize from another article. So why not just use
that ... you do know it can do all that logging in stuff for you, right?

~~~
geoscripting
Yes, I have worked with mechanize before. I was using mechanize even when
there was only the Perl version. I added a comment to the article explaining
my choice.

~~~
sho
Fair enough. I guess the surprised reaction you're getting is because web
testing frequently involves doing this kind of thing, so, being a community of
web programmers, everyone here knows it backwards. I didn't really think of
the angle you mentioned where someone wouldn't know all the relevant
techniques and just want to get _something_ working ASAP. For that, taking the
cookie from FF might indeed be a time saver.

Anyway, it's always good to see everyone chime in with their opinion, so thanks
for the conversation starter.

BTW, is anyone else nervous about the day the teenage h4xx0rs discover how
easy this kind of thing is these days?

------
spoiledtechie
interesting.

------
Mystalic
I wonder if this works every time...

And if it did, how long until a defense is made...

~~~
Retric
There is nothing magic about web browsers: telnet to port 80 at www.google.com
and with a simple GET request they will spit back their website. You can make
it a little harder to do this stuff but a packet sniffer is always going to
let you pretend to be any software you want unless they are using encryption.
Also because Firefox is open source you can't prevent people from scripting
using it anyway.

PS: I recommend all aspiring coders to telnet to www.google.com at least once
just to feel the magic.
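
The same exercise as a Python sketch, for anyone without a telnet client. It sends the text you would otherwise type by hand over a plain socket; the `Connection: close` header is added so the server closes the connection when it has finished replying. Whether a given site answers a bare request like this is up to that site.

```python
import socket


def build_get(host, path="/"):
    """Return the bytes of a minimal HTTP/1.1 GET request."""
    lines = [f"GET {path} HTTP/1.1", f"Host: {host}", "Connection: close", "", ""]
    return "\r\n".join(lines).encode("ascii")


def raw_get(host, port=80, path="/"):
    """Send the request over a plain TCP socket and return the raw reply."""
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall(build_get(host, path))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks)

# Usage (not run here):
#   reply = raw_get("www.google.com")
#   print(reply[:300].decode("latin-1"))
```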

~~~
radu_floricica
> PS: I recommend all aspiring coders to telent to www.google.com at least
> once just to feel the magic.

I'm pretty sure it's illegal in a couple of ways. </bitter>

But yes, the magic is there.

~~~
xenophanes
illegal!? why!?

~~~
geoscripting
It is recommended that you use the API that Google provides for searching, but
I fail to see why telnetting would be illegal. After all, both Firefox and
telnet use sockets to do their job.

~~~
radu_floricica
Again, the comment above wasn't meant to be taken literally, but just because
they both use sockets doesn't mean they're just as legal.

Imagine a web site whose terms of service state you cannot use software to
circumvent ads. Or where part of the security is done client-side (stupid,
yes, but not impossible). Skipping the browser breaches at least the terms of
service, and may be construed as hacking. I think even Google discourages
automated searching and prefers you use its API, which (at least some years
ago) wasn't free for commercial use. I may be wrong in this particular case,
but the important point is you may want to check the specific TOS before
skipping the browser.

