
WebXray - snake117
http://webxray.org/
======
coleifer
Seems like a great opportunity to use SQLite, then folks could share around
database files. Why the hell don't more people use SQLite for things like
this?

~~~
TeMPOraL
The author ('tlibert) is hellbanned. You can see his comments when you turn on
showdead in your profile.

Re SQLite, he replied: "@coleifer: there was a sqlite branch, but it doesn't
scale well to many-million record sets which is what I have been doing. the
design of the software allows drop-in db replacement, it just lacks the code.
I can't decide to go back and sqlite or to just make a web front-end."

~~~
dang
> The author is hellbanned

Comments by some (not all) new accounts are killed by default when they look
like possible spam or troll activity. It's not a great solution because it
leads to false positives like this one. On the other hand, doing nothing is
worse. It's a hard problem, and it certainly doesn't mean that the user is
banned.

We're about to release software to let the community unkill these, which we
hope will be a much better solution.

------
BinaryIdiot
> The core of webXray is a python program which ingests addresses of webpages,
> passes them to the headless web browser PhantomJS, and parses requests in
> order to determine those which go to domains which are exogenous to the
> primary (or first-party) domain of the site.

Naturally this means a different user agent and fingerprint, which could
ultimately mean the script is fed a different page altogether. The odds of
that are probably low, but still: someone could have a really shitty website
that uses hundreds of trackers yet serve webXray a page completely without
them.

I would like to see this type of stuff as a web browser extension. That way we
can get the exact, most correct information possible. It would also sidestep a
semi-convoluted build process that seems to have tripped up a few readers.
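The first-party vs. third-party split the README describes is easy to sketch. The snippet below is not the project's actual code, just a minimal illustration assuming a naive "last two host labels" heuristic for the registered domain (real tools consult the Public Suffix List):

```python
from urllib.parse import urlparse

def registered_domain(url):
    # Crude registered-domain guess: last two host labels.
    # A real implementation would use the Public Suffix List.
    host = urlparse(url).hostname or ""
    parts = host.split(".")
    return ".".join(parts[-2:]) if len(parts) >= 2 else host

def third_party_requests(page_url, request_urls):
    # Keep only requests whose registered domain differs from the page's
    # first-party domain, i.e. the "exogenous" requests.
    first_party = registered_domain(page_url)
    return [u for u in request_urls if registered_domain(u) != first_party]

reqs = [
    "https://static.example.com/app.js",    # same registered domain
    "https://tracker.adnetwork.net/px.gif",  # exogenous -> third party
]
print(third_party_requests("https://www.example.com/", reqs))
# -> ['https://tracker.adnetwork.net/px.gif']
```

The URLs and helper names here are made up for illustration; the real tool classifies requests captured at the network level by PhantomJS rather than a pre-made list.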

~~~
jonknee
The major ad blocking extensions could do this. They already know about the
requests. It's also possible to change the user-agent (say, use the top 10)
and quickly gather the data using a bunch of cloud servers.

~~~
tlibert
also, the main extensions use blacklists; webXray grabs all the requests.

~~~
jonknee
Yes, but they can figure out what on the blacklist was used for a given site.
For example, Ghostery has Ghostrank which is pretty similar to this--it sends
back to advertisers what stuff was blocked.

[https://www.ghostery.com/en/faq/how-does-ghostery-make-money...](https://www.ghostery.com/en/faq/how-does-ghostery-make-money-from-the-add-on/)

------
tlibert
hey, the captcha system here is an f'n nightmare - anyway, I wrote the
software! happy to answer questions, will try to do so below.

since I am hellbanned I can only edit this comment to reply.

as for video, I discuss research here:
[https://www.youtube.com/watch?v=OqW8erWi1Wo](https://www.youtube.com/watch?v=OqW8erWi1Wo)

but if you mean screencast of the software I haven't had time.

\---

@coleifer: there was a sqlite branch, but it doesn't scale well to many-
million record sets which is what I have been doing. the design of the
software allows drop-in db replacement, it just lacks the code. I can't decide
to go back and sqlite or to just make a web front-end.

\---

@captn3m0 I have an academic paper in revision that is an analysis of the
alexa 1M list, I also have other projects in development.

\---

@snorrah: this was my first python project, so I went with the newest version.
it's made a lot of things very difficult, especially porting to a web version.

\---

@linuxlizard: a proxy is a problem when you want to do a lot of concurrent
tests; I usually load about 64 pages in tandem to get good speeds on large
sets.

\---

@radmuzom: webXray runs large batch jobs, so you can get lunch and come back
with all of the pages analyzed. I know it does work on windows, and I
apologize for not being able to provide directions...see comments above.

\---

@TeMPOraL: thanks!

\---

@pearjuice: yeah, I wish there was an easier way to get python3 to talk to
mysql, that's the biggest PITA.

------
zeman
At SpeedCurve we've built something similar for tracking and understanding the
impact on website performance that third party requests can have. It's a big
issue for websites when their user experience can be affected by resources
that are not even under their control. We've seen websites where over 90% of
the requests on a page are made to third parties.

Here's a dashboard showing third party usage for The Guardian over the last 30
days:
[https://speedcurve.com/demo/thirdparty/1/1/chrome/1/30/39tfn...](https://speedcurve.com/demo/thirdparty/1/1/chrome/1/30/39tfnozeq94p1o0hndk1kpbg4vb7cg/)

Great to see an open source list of domains linked to organizations. We've
built our own list as well and we'll look at contributing them to this
project.

(Disclaimer: I'm the founder of SpeedCurve)

~~~
tlibert
Would love help building the org_domain.json list, please get in touch!

------
linuxlizard
I think this would be quite interesting integrated into a web proxy. Surf
through the proxy and it gathers all the nth-party HTTP traffic.

~~~
tlibert
a proxy is a problem when you want to do a lot of concurrent tests; I usually
load about 64 pages in tandem to get good speeds on large sets.

------
codewithcheese
Here is the domain->org data
[https://github.com/timlib/webXray/tree/master/webxray/resour...](https://github.com/timlib/webXray/tree/master/webxray/resources/org_domains).
It is missing a lot of mobile ads and tracking players. Anyone know how we
can fix that?
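For anyone curious what matching against that data might look like: the sketch below is hypothetical, assuming a flat domain-to-organization JSON mapping (the repository's actual schema may differ), and walks up the host labels so subdomains still match:

```python
import json

# Hypothetical shape for a domain->organization mapping in the spirit of
# the project's org_domains resources; the real schema may differ.
ORG_DOMAINS = json.loads("""
{
  "doubleclick.net": "Google",
  "scorecardresearch.com": "comScore"
}
""")

def org_for(domain):
    # Walk up the labels so sub.doubleclick.net still matches the
    # doubleclick.net entry.
    parts = domain.split(".")
    for i in range(len(parts) - 1):
        candidate = ".".join(parts[i:])
        if candidate in ORG_DOMAINS:
            return ORG_DOMAINS[candidate]
    return None

print(org_for("stats.g.doubleclick.net"))  # -> Google
```

A contribution to the real list would then just be adding entries to the JSON, which is presumably why tlibert is asking for help with it.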

~~~
tlibert
constantly being updated; my hope with open-source is people will help add to
it. if you want to help, I'd love that!

\---

I see the hell ban, I'm posting all my replies at the bottom in a comment I'm
editing - could somebody note this above?

~~~
codewithcheese
I'll see if I can make a contribution :)

Do you have a technique you're using to match domains to organisations?
Sometimes it can be hard to discover.

~~~
tlibert
whois, detective work, crunchbase...also do work in china which is even
tougher: [http://www.theguardian.com/technology/2015/sep/21/google-is-...](http://www.theguardian.com/technology/2015/sep/21/google-is-returning-to-china-it-never-really-left)

------
radmuzom
Since I am on Windows, it will take some time for me to set it up. Can someone
please explain what extra information I get as compared to the Lightbeam
extension (formerly Collusion) in Firefox?

~~~
tlibert
webXray runs large batch jobs, so you can get lunch and come back with all of
the pages analyzed. I know it does work on windows, and I apologize for not
being able to provide directions...see comments above.

------
pearjuice
I am pretty frustrated by build processes of modern day applications. Wanted
to give this a quick spin, but looking at the installation instructions all I
see is compilers, optimizers, minifiers, interpreters, package managers,
package-package managers, dependency systems and then, maybe then, you pray to
your configuration-God that everything clicks together and runs on your
system.

I was about to ask why there isn't a simple, unified build tool for ANYTHING,
but I think that is what got us here in the first place...

~~~
moron4hire
It's the technological singularity. Thanks to the various "code academy"
initiatives going on around the world, there is a growing middle area--between
software developers and users--of scripters, people who plug components
together but don't do a lot of greenfield programming.

It used to be that being a scripter was a stepping stone on the way to
becoming a developer, mostly because back then scripting could only get you
so far. Now, you can apparently make an entire career of being a scripter, if
said code academies are correct in the promise that they can find you a job
with the extremely shallow curriculum I've seen them provide.

This isn't a bad thing; it's pretty amazing that it doesn't take a decade of
dedicated study to do so much anymore. It's just that in our current culture
we lump them in with developers because they are clearly more than just users.
We still expect a set of resources put into a Github repository to have a
significant amount of new programming, rather than just being glue code
between a few commonly available libraries. But that's more of a problem of
our lack of ability to differentiate between large, greenfield projects and
small, configuration-oriented projects at-a-glance than it is a problem of
programming being "too easy".

Though, if we recast scripters as users instead of developers, then it's a
terrible thing. It means that the real software developers of the world have
written a bunch of software with a really, really shitty user interface.

Either way, it's not the scripter's fault.

~~~
tlibert
I don't use any extra python libraries, it's all to get python3 to talk to
mysql. (also I'm not a skiddie...)

------
snorrah
Pleasantly surprised to see something requiring Python 3!

~~~
tlibert
this was my first python project, so I went with the newest version. it's made
a lot of things very difficult, especially porting to a web version.

------
captn3m0
A great project on top of this would be to run this over the Alexa top 20k
sites list to a depth of say 5-10 and see the results.

~~~
tlibert
I have an academic paper in revision that is an analysis of the alexa 1M list,
I also have other projects in development.

------
Already__Taken
Funny how this uses all platform-agnostic software and yet the Windows
install instructions are to buy an ubuntu VPS.

~~~
olig15
Because then they can link to DigitalOcean with their referral code.

~~~
cloakandswagger
Ha, they actually did. Tacky.

~~~
tlibert
see comments below, I don't know windows so can't write directions. I'm poor
so just wanted to cover my hosting, but I removed the referral regardless as I
didn't realize I would be attacked for it.

~~~
iconjack
Why are you hellbanned?

~~~
tlibert
it thought I was spamming b/c I was replying quickly.

------
tonylemesmer
a screencast / video of an example would be really helpful.

~~~
tlibert
as for video, I discuss research here:
[https://www.youtube.com/watch?v=OqW8erWi1Wo](https://www.youtube.com/watch?v=OqW8erWi1Wo)
but if you mean screencast of the software I haven't had time.

~~~
tonylemesmer
I kind of meant screencast but saw from other comments you were kinda busy.
Great project ;)

------
phelmig
Very nice, this could be the basis for some cool visualizations!

~~~
tlibert
I have done a few, but it's not my forte. Vice included some I made in an
article they did about my work a few months back:
[http://motherboard.vice.com/read/looking-up-symptoms-online-...](http://motherboard.vice.com/read/looking-up-symptoms-online-these-companies-are-collecting-your-data)

------
icebraining
tilbert: you're hellbanned, apparently the system thought you were a spammer
or troll, you should send an email to HN (hn@ycombinator.com) asking them to
un-ban you.

~~~
codewithcheese
i am not tilbert...?

~~~
nickthemagicman
We are all tilbert.

~~~
tlibert
I certainly am.

~~~
nickthemagicman
HA welcome back

------
bildung
tlibert, you are hellbanned. Perhaps a moderator can help with that? I think
you triggered spam detection by posting too fast.

~~~
leoedin
tlibert should also note that although they're editing an existing comment,
that comment is [dead] and so can only be seen by people who show dead
comments (not many). I guess they need a new account or to contact admins.

------
kazinator
Very confusing presentation here. For one thing, the author seems to be
referring to links/anchors in a web page as "requests"!

A request is the dynamic, transient action which occurs when a client such as
a browser initiates a connection to a server and presents a command like GET
or POST.

I suggest an opening paragraph along the following lines:

 _"WebXray" is a sort of web crawler which analyzes a given cluster of pages
for their relationships with each other, as well as external pages which they
link to._

 _It provides information about pages which direct a user's browser to
various sites for the purposes of tracking, using tricks like hidden images._

(Do I have that approximately right?)

~~~
tlibert
it monitors third-party HTTP requests, not links: "webXray is a tool for
detecting third-party HTTP requests on large numbers of web pages and matching
them to the companies which receive user data...The core of webXray is a
python program which ingests addresses of webpages, passes them to the
headless web browser PhantomJS, and parses requests in order to determine
those which go to domains which are exogenous to the primary (or first-party)
domain of the site. This data is then stored in MySQL for later analysis."

~~~
kazinator
So in other words PhantomJS is used as an API to have some pages crawled, and
the links emanating from those pages are captured.

~~~
tlibert
sorry, it doesn't monitor links, it captures network-level requests, that's
what I analyze.

------
moron4hire

        Windows Specific Instructions
    
        Get a linux cloud server (which cost fractions of a cent per hour these days).
        Ubuntu is the easiest flavor of Linux to get started with and the directions
        above will serve you well. Seriously, this is your best option. You can do it.
        I'm both confident in your abilities and proud of you for taking this important
        step in life.
    

This is not a very helpful attitude. The only "UNIX-y" thing I see it doing
is forking for concurrency. I understand that Python's global interpreter
lock limitation makes processes more desirable than threads for concurrency,
and on UNIX-like systems this isn't a problem because starting new processes
is very cheap. But that doesn't mean it wouldn't "work" on Windows; it would
just be a little slow starting each subprocess.
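The fork-vs-thread point can be made concrete with a small sketch. This is hypothetical code, not webXray's actual implementation: a process pool sidesteps the GIL for CPU-bound or browser-driving work, and Python's `multiprocessing` spawns workers on Windows rather than forking, which is slower to start but still runs:

```python
from multiprocessing import Pool

def analyze_page(url):
    # Stand-in for driving a headless browser against one page and
    # collecting its requests; here we just return something trivial.
    return (url, len(url))

if __name__ == "__main__":
    urls = ["https://example.com/page%d" % i for i in range(8)]
    # On UNIX the workers are forked (cheap); on Windows they are
    # spawned (slower startup), but the same code works either way.
    with Pool(processes=4) as pool:
        results = pool.map(analyze_page, urls)
    print(len(results))  # -> 8
```

The `if __name__ == "__main__":` guard is exactly what Windows' spawn start method requires, so a process-based design is portable, just slower to warm up there.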

Or is it more about not wanting to track down how to install software on
Windows?

(EDIT: and as others have pointed out, it's kind of cheesy to use the moment
to plug your referral code for DigitalOcean)

~~~
aninteger
I can't speak for getting a Linux cloud server, but I've seen plenty of
recommendations to use a VM when trying to use some software on Windows.
Unless you're specifically writing C# or using IIS or SQL Server I don't
necessarily believe it's bad advice. In the coming years, with the way C# is
changing, Linux/FreeBSD may become the desired platform to run C# code (just
as it is for Java).

~~~
phkahler
>> I've seen plenty of recommendations to use a VM when trying to use some
software on Windows...

Is that a sign that developers are starting to not care about Windows?

