

Harvesting data from websites using WebKit and PyQt4 - silkodyssey
http://www.rkblog.rk.edu.pl/w/p/harvesting-data-websites-using-webkit-and-pyqt4-part-1/

======
mahmud
Ugh! This is wrong, immoral and idiotic:

 _Using WebKit in PyQt4 we can write an app that will collect data of all ads
on a webpage and parse the data for marketing guys .. [snip] .. Next stage
will be to make application that will open ad URLs saved in DB_

NO! Any half-decent ad network will quickly detect the unusual clicking and
freeze the site owner's account (while still billing the advertiser for the
click!). My ad engine will learn your behavior in about 5 clicks, and after
that, welcome to my de-optimization hell. Hope you like PSAs, non-profits and
humanitarian causes (until you become too much of a nuisance; then you're a
"drop" rule in a proxy filter).

If you're trying to explore an ad network's inventory for competitive
advertiser poaching (you want to bring their advertisers to your site), you
just need to save the banner ads and use actual humans to read each ad, google
the firm and contact them. Every advertising link you see online is a 302
redirect and someone is paying for it, usually the advertiser; if you click on
it fraudulently like this, both the advertiser and the publisher end up
paying. Besides, don't refresh the same page; the ads on that page are already
contextually targeted, so you only see matching assets. Instead, hit multiple
pages on multiple sites, each a few times, to get the big picture.

There is no bigger danger online than a moron with a "for" loop.

~~~
riklaunim
Note that in the third tutorial, where the app "clicks" the ads, the USER
AGENT is changed to a bot-like one, so it's not a hidden, evil click-fraud
app. This use of the app is only one of many possible (it also may not click
them at all, just count how many ads show up on certain pages, etc.). And not
all ads are contextual like Google Ads; some are displayed randomly from a
pool of ads for a given page, so refreshing gets more data.

This tutorial shows what's possible with WebKit that isn't possible with cURL
:)
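
For the non-clicking case (just counting ads on a page), here is a minimal sketch using only the standard library. It assumes you already have the rendered DOM as an HTML string (for example from `QWebFrame.toHtml()` in PyQt4). The host names in `AD_HOSTS` and the names `AdLinkCounter` and `count_ads` are made up for illustration; they are not from the tutorial.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical ad-server host names, for illustration only.
AD_HOSTS = {"ads.example.com", "banners.example.net"}


class AdLinkCounter(HTMLParser):
    """Collects the href of every <a> tag that points at a known ad host."""

    def __init__(self):
        super().__init__()
        self.ad_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Relative links have no hostname, so they never match.
        if href and urlparse(href).hostname in AD_HOSTS:
            self.ad_links.append(href)


def count_ads(html):
    """Return how many ad links appear in the given HTML string."""
    parser = AdLinkCounter()
    parser.feed(html)
    parser.close()
    return len(parser.ad_links)
```

Nothing here requests or "clicks" an ad URL; it only inspects markup that has already been loaded.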

~~~
mahmud
What is a "bot-like" user agent? Is there a list somewhere of user-agents
excluded from advertising clickstreams?

The networks can detect fraudulent clicks easily, it's the site owners who
will have their accounts frozen.

~~~
riklaunim
They may get frozen accounts, but the ad publishers don't freeze them just
like that (otherwise anyone could paralyze ad publishing on sites they don't
like). So if they see that a "click" was made with an "I'm a bot that clicks
things" user agent, they will treat it as an "invalid" click, not a sneaky
fraud attempt :) (Someone could also use PyQt4 WebKit apps for fraud, but
that's not the point of these code tutorials.) It may be used in "bad" or
"good" ways.
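
As a rough illustration of what a "bot-like" user agent could mean: the client simply announces itself in the `User-Agent` header instead of pretending to be a browser. The sketch below uses the standard library's `urllib.request.Request` rather than PyQt4; the agent string `ExampleAdAuditBot/0.1` and the helper `build_bot_request` are hypothetical. Whether a given ad network actually discounts such requests is an assumption, not something this thread confirms.

```python
from urllib.request import Request

# Hypothetical, self-identifying agent string (not from the tutorial).
BOT_USER_AGENT = "ExampleAdAuditBot/0.1 (+http://example.com/bot-info)"


def build_bot_request(url):
    """Build (but do not send) a request that openly identifies itself as a bot."""
    return Request(url, headers={"User-Agent": BOT_USER_AGENT})
```

Building the `Request` object does not touch the network; nothing is sent until it is passed to `urllib.request.urlopen`.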

------
newsio
For anyone who is interested, parts II and III:

[http://www.rkblog.rk.edu.pl/w/p/harvesting-data-websites-usi...](http://www.rkblog.rk.edu.pl/w/p/harvesting-data-websites-using-webkit-and-pyqt4-part-2/)

[http://www.rkblog.rk.edu.pl/w/p/harvesting-data-websites-usi...](http://www.rkblog.rk.edu.pl/w/p/harvesting-data-websites-using-webkit-and-pyqt4-part-3/)

------
catch23
Why WebKit and PyQt4? Couldn't the same be done with something simpler, like
Selenium RC?

~~~
riklaunim
PyQt4 is a GUI framework that includes a full browser engine, WebKit. Selenium
tests run in the browser, so you would have to write tests that dump the DOM
tree (the parsed HTML source) from the browser rather than the standard
non-parsed page source.

~~~
zackattack
Indeed! I am very excited about this. I've been trying to figure out how to
parse JavaScript-rendered pages for so long. Currently installing PyQt on my
workstation ;)

