

Ghost.py - a webkit web client written in python. - aeurielesn
http://jeanphix.me/Ghost.py

======
kanzure
Individuals might also be interested in the reimplemented version of phantomjs
in python (pyphantomjs): <http://github.com/kanzure/pyphantomjs>

~~~
civilian
pyphantomjs is not under active development anymore, unfortunately.

------
RBerenguel
Good! A few days ago I was playing with mechanize to automate some form
filling in WordPress posts (iTunes app details, automatically downloaded and
then batch-added as post drafts). I gave up because of the lack of
AJAX/JavaScript support and turned instead to the Selenium web driver, which
solved the problem in "seconds". I'll have to give Ghost.py a spin :)

------
dylanpyle
Has anyone had any experience with both ghost and phantom (or any other
options I may not have found), and know how they stack up in terms of
rendering speed/etc? I'd imagine they're fairly similar, but if that's not the
case I'd be heavily biased towards the faster of the two.

------
DaNmarner
This is the Python equivalent of PhantomJS, which provides a programming
interface for testing rendered web pages without the overhead of actually
opening up a browser (a la Selenium).

------
NiekvdMaas
In order for this to be usable for a broad range of projects, it must contain:

      * Cookie support (I see this is partially implemented)
      * File download support
      * Mouse movement API (move to pos X,Y - click)
      * NSPlugins support (Flash, etc.)

I cannot find a reference to the last three, so I think they are not working
yet. Once they are implemented, this is a nice alternative to PhantomJS.

~~~
revenz
This is a wrapper around PyQt, which is a wrapper around WebKit in Qt. So if
those things are supported in Qt and PyQt, there is hope.

------
jc4p
Can it run inline Javascript as the page is loaded or do I have to explicitly
tell it what JS to run? I want to scrape some pages that use JS packers to
obfuscate their code so that it's only loaded by real browsers, but if I just
use curl all I see is JS that needs to be evaluated before I can get anything
useful out of it.

~~~
catshirt
" _JS packers to obfuscate their code so that it's only loaded by real
browsers_ "

this is probably not what's happening. more likely, it's obfuscated for other
reasons. curl doesn't parse or execute javascript.

~~~
jc4p
It actually is in the case I'm talking about. I'm talking about illegal
websites where the only money generated is by advertisements on human
eyeballs. They go way out of their way to make sure no scrapers/robots can see
the videos on the page since it costs them money for bandwidth. In addition to
referrer checking and captcha, they also have inline javascript that evals
itself to un-obfuscate itself and load the video on the page so that if
someone somehow beats the first two methods and loads it by a command line
interface, they still don't get the URL to the video.

~~~
catshirt
pretty cool, wasn't aware of this at all. thanks for the explanation. but even
if it were unpacked, curl wouldn't execute it.
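
To illustrate the point about curl: a plain HTTP client only ever receives the page source, so a URL that inline JavaScript assembles at run time never appears in what it downloads. A minimal sketch using only the Python standard library follows. The page source, the `example.invalid` host and the names `RAW_HTML`, `InlineScriptCollector` and `inline_scripts` are made up for illustration and are not taken from any real site or from Ghost.py.

```python
from html.parser import HTMLParser

# Hypothetical page source, as a plain HTTP client such as curl would receive it.
# The video URL is only assembled when the inline script is executed.
RAW_HTML = """<html><body>
<div id="player"></div>
<script>
var p = ["mp4", "video", "example.invalid", "http"];
document.getElementById("player").innerHTML =
    p[3] + "://" + p[2] + "/" + p[1] + "." + p[0];
</script>
</body></html>"""


class InlineScriptCollector(HTMLParser):
    """Collect the text of inline <script> elements without running them."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Script bodies arrive here as plain text; nothing is evaluated.
        if self.in_script:
            self.scripts[-1] += data


def inline_scripts(html):
    """Return the source text of every inline script in the given HTML."""
    collector = InlineScriptCollector()
    collector.feed(html)
    collector.close()
    return collector.scripts


scripts = inline_scripts(RAW_HTML)
# The assembled URL is not present anywhere in the raw source.
url_visible = "http://example.invalid/video.mp4" in RAW_HTML
```

Getting the final URL requires something that actually runs the script, which is what a headless browser provides and a bare download does not.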

------
clemesha
Can this be used to suck in streaming Flash video?

There is this streaming camera of the ocean that I check often, but it's Flash
and I'd love to check it from my iPhone. Could Ghost.py be used to get the
Flash video? (then turn it into images by other means). Thanks.

~~~
kanzure
You could just rip the RTMP stream itself, or whatever the source data is.
Decompile the swf and check it out for yourself, or the source url for the
video feed is probably provided in some lame xml config file. No need to write
software around an entire browser to get your ocean feed.
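
If the feed URL does turn out to live in an XML config file, pulling it out takes only a few lines. This is a rough sketch under heavy assumptions: the config layout, the `example.invalid` host and the names `SAMPLE_CONFIG`, `STREAM_SCHEMES` and `find_stream_urls` are all invented for illustration, and a real player's config may look nothing like this.

```python
import xml.etree.ElementTree as ET

# Hypothetical player config; real files will differ in element and attribute names.
SAMPLE_CONFIG = """<config>
  <player width="640" height="480"/>
  <stream server="rtmp://example.invalid/live" name="oceancam"/>
  <poster>http://example.invalid/poster.jpg</poster>
</config>"""

# URL schemes commonly used by RTMP and its variants.
STREAM_SCHEMES = ("rtmp://", "rtmpe://", "rtmps://", "rtmpt://")


def find_stream_urls(xml_text):
    """Return attribute values and element text that look like RTMP URLs."""
    root = ET.fromstring(xml_text)
    urls = []
    for element in root.iter():
        # Check every attribute value, plus the element's own text if any.
        candidates = list(element.attrib.values())
        if element.text:
            candidates.append(element.text.strip())
        for value in candidates:
            if value.lower().startswith(STREAM_SCHEMES):
                urls.append(value)
    return urls


stream_urls = find_stream_urls(SAMPLE_CONFIG)
```

The function scans every attribute and text node rather than a fixed path, since the element names in an unknown config cannot be predicted.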

~~~
eli
Just turn on the Net Inspector in your favorite browser's debugger and start
the stream; you'll probably see what url it's loading.

------
andreif
Earlier discussion: <http://news.ycombinator.com/item?id=3896441>

------
greattypo
Anyone know - how does it handle file downloads?

------
antihero
Seems to get a timeout error on a lot of stuff, even if I set the timeout to
~60 seconds with wait_timeout :(

------
sscheper
What is a webkit web client?

~~~
grakic
WebKit is a well-known browser rendering engine. "Client" means that you can
use ghost.py to "operate" or "drive" a WebKit instance as a headless browser.

------
dlsym
Giggle at the ghostie! (Ok - and now burn my karma)

~~~
daeken
I don't know which one scares me more: the fact that there are My Little Pony
references on HN, or that I actually _got_ the reference.

