
Some analysis of the 1M most popular sites on the web - jacquesm
http://jacquesmattheij.com/one-million-websites
======
jwr
Likely culprits are "performance analyzers" that grade a website and report an
"F" (failing) grade for not using CDN-hosted common libraries.

This is a red herring: the idea that the user will already have a cached copy
of CDN-hosted jQuery is bogus. Even for a common library like jQuery, the
number of versions in use is likely above 50, and the number of popular CDNs
that host jQuery is surely above 10. So we are hoping that the user has a
cached copy of that exact jQuery version from that exact CDN.

This is somewhat similar to the situation we have with operating systems: we
created shared libraries to save disk space and memory. These days using them
is pretty much pointless and incurs a performance penalty, yet everybody still
uses them.

For JavaScript, a much better approach is to a) make code Google Closure-
compatible, b) compile everything using advanced mode into a single JavaScript
file. That way you get an optimized subset of all the code that the site
_actually uses_ (this works wonders for ClojureScript apps). Most sites
probably use less than 10% of jQuery, so why include all of it?
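
As a rough illustration (the file names, flags and code below are just an
example of how one might drive the Closure Compiler, not anything from this
thread):

    // compiled with something like:
    //   java -jar closure-compiler.jar --compilation_level ADVANCED_OPTIMIZATIONS \
    //        --js src/app.js --js_output_file dist/app.min.js
    // In ADVANCED mode, unreferenced code is removed and symbols are renamed,
    // so only what the site actually calls survives in the output.
    function helperNobodyCalls() { return 'dropped from the output'; }
    function formatPrice(cents) { return '$' + (cents / 100).toFixed(2); }
    console.log(formatPrice(1999)); // only this code path ends up in app.min.js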

~~~
pyre
> These days using them is pretty much pointless and incurs a performance
> penalty, yet everybody still uses them.

Would you rather that when (e.g.) there is a security patch for OpenSSL, you
have to wait for _all_ software using OpenSSL to deploy updates? Or would you
rather that one update to OpenSSL (likely from your OS vendor) fixes all of
the software depending on it?

Edit: People seem to be commenting on this through the lens of CDNs and
JavaScript, but the sentence preceding the one I quoted was:

> This is somewhat similar to the situation we have with operating systems: we
> created shared libraries to save disk space and memory.

Which is _not_ talking about CDNs and JavaScript, but about shared libraries
on your desktop. I'm not saying that all usage of shared libraries is valid.
I'm just saying that tossing out the concept as entirely useless (and having
no redeeming value) in a modern setting is a stretch.

~~~
xorcist
Google doesn't back-port fixes to jQuery.

You _can_ link without specifying the version number, but then you don't get
full caching, so it's not common in practice.
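
If memory serves, the difference looks something like this (the URLs are
illustrative and the partial-version alias is from memory, so treat it as an
assumption):

    <!-- pinned to an exact version: long cache lifetime -->
    <script src="https://ajax.googleapis.com/ajax/libs/jquery/1.11.3/jquery.min.js"></script>
    <!-- version-less / partial-version link: served with a short cache lifetime -->
    <script src="https://ajax.googleapis.com/ajax/libs/jquery/1/jquery.min.js"></script>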

~~~
cbsmith
You get pretty close to full caching... often better than using your own copy.
The reason it isn't a common practice is more about potential bugs caused by
newer versions.

------
gingerlime
I wonder what's considered external though. If I compile / minify my
JavaScript and CSS and then use a CDN to cache or host it -- is this
considered external? If so, how is it different from trusting my hosting
provider to host my site in the first place, or my domain provider to resolve
it for me? How can this analysis know whether or not a resource is external?
Based on DNS records alone? Because I can still use *.my-domain.com but point
it at an external resource...

The concerns raised are valid, but I'd like to see the methodology for
analyzing the data because it can definitely skew results.

~~~
jacquesm
If the URL used to fetch the file is not related to the domain the original
HTML comes from, then it is counted as external.

You can point *.my-domain.com at an external resource, but the analysis would
see that resource as still being under your control.

I will post the code soon.

~~~
mixologic
It is pretty standard practice to host assets on a "cookieless" domain you
control, but _not_ on the same domain as the original site. For example,
www.example.com has all the html, but all of the images are hosted at
www.images-example.com. That would skew the results considerably.

~~~
lifeisstillgood
Why use another domain and not a sub-domain? I assume it's something to do
with the cookieless comment - but it's not clear to me what.

~~~
mixologic
The main reason is that sometimes you have *.domain.com authentication cookies
for single sign-on across a suite of sites, but you do not want those
authentication cookies sent along with requests to hosts that do not need
authentication.
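
To illustrate (values are made up): a cookie scoped to the parent domain rides
along with every request to every subdomain, but never to an unrelated domain:

    Set-Cookie: session=abc123; Domain=.example.com; Secure; HttpOnly

    GET https://www.example.com/          -> Cookie: session=abc123 (needed)
    GET https://static.example.com/x.png  -> Cookie: session=abc123 (wasted bytes, wider exposure)
    GET https://images-example.com/x.png  -> no cookie sent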

------
leeoniya
i have always been irked by the fact that _my bank_ and most websites behind
an https connection (especially after login) have ANY external resources. this
is a major flaw on numerous levels; how such a thing is still allowed by
browser vendors is IMO borderline voluntary negligence. all such sites, if
they must host external resources, should do so only within an iframe
w/sandbox [1]. the fact that the external resource is also "https" [on some
foreign property] is completely and utterly meaningless.

[1]
[http://www.w3schools.com/tags/att_iframe_sandbox.asp](http://www.w3schools.com/tags/att_iframe_sandbox.asp)

~~~
jacquesm
My bank (ABN/Amro in NL) is even worse: they not only include external
resources, they include external resources that are critical for their site to
function. In other words, if I disable the various trackers and analytics
elements on their page, the site simply no longer works. You'd expect the
opposite!

~~~
leeoniya
wow! i think there should be a place on the internet for publicly naming and
shaming such practices. like a Darwin Award or Razzies [1] of webdev.

[1]
[https://en.wikipedia.org/wiki/Golden_Raspberry_Awards](https://en.wikipedia.org/wiki/Golden_Raspberry_Awards)

~~~
jacquesm
That's one of the things I'm considering right now: to write up the top 1000
or so with annotations, sort them by category, and give an example of a site
that is 'clean' in the same category.

There are a ton of offenders and some of them are very well known.

One of the interesting things you find when you look at this data is that the
bigger sites really do have their stuff set up better (for instance, by using
in-house analytics) but there are lots and lots of exceptions and some of them
are quite shocking. For instance, I've found two major car brands that include
evercookies on their corporate websites in Eastern Europe, a thing that no
respectable company should ever do. I suspect an ad agency is the cause of
this so I'm still digging away at that.

~~~
corford
If you want a never ending list of shocking offenders, be sure to check out
most airline sites. They really are abysmal.

------
mangeletti
[http://www.w3.org/TR/SRI/](http://www.w3.org/TR/SRI/)

~~~
jacquesm
Indeed. I can't wait until that is implemented across the board. It still
leaves the privacy issues unaddressed, though.

------
captainmuon
One thing in this context is that it is basically impossible for a website to
check the integrity of an external (js) resource without loading it. This is a
consequence of the web security model.

It's basically impossible to get the contents of a .js file without executing
it, say for checksum verification (at least without CORS, and even with CORS
you might trigger an additional download; I haven't tested it). But it's
trivially easy to include an external .js in the page, with the same access
rights as a directly embedded script (including access to credentials).

That's what we're used to, but it seems completely backwards to me. It would
be much better IMO if a script could make arbitrary HTTP requests to other
sites - but without having access to those sites' credentials. (Remember in
the 2000s when "mashups" were all the rage? I spent a weekend parsing some
data source in JavaScript to display it on a map, only to realize that what
worked locally didn't work over HTTP. Imagine the disappointment.)

What's also missing is a way to run an external script sandboxed, or in a sub-
interpreter. There ought to be a way to restrict what banner ads or font
loaders can do to my page.

~~~
throwaway41597
Look SRI up on this very page.

Web pages can already make requests to other origins (GET an image or script,
POST to an iframe, XHR), and CORS allows you to read the response. But what
you're asking for would probably be hard to transition the whole web to
without opening the door to spam and DoS'ing.

The sandboxing you want for an external script is already feasible with an
iframe on a different origin.
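
Roughly (the URL is a placeholder; see the sandbox attribute mentioned
upthread):

    <!-- the embedded script can run, but it executes in its own (unique) origin
         and has no access to the embedding page's DOM, cookies or storage -->
    <iframe src="https://widgets.third-party.example/ad.html"
            sandbox="allow-scripts"></iframe>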

------
cbr
Subresource Integrity hashes [1] should let sites get the caching and CDN
benefit of using shared resources like jQuery without letting 3rd parties have
the ability to XSS them. Basically, you specify a hash of what the URL should
point to, and if the content doesn't match then the load is blocked.

This isn't quite out yet: it's in Chrome trunk [2] and still under review in
Firefox [3].
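
Once it ships, usage should look roughly like this (the hash value below is a
placeholder, not a real digest):

    <script src="https://cdn.example.com/jquery.min.js"
            integrity="sha384-BASE64HASHOFTHEEXPECTEDFILE"
            crossorigin="anonymous"></script>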

[1]
[https://w3c.github.io/webappsec/specs/subresourceintegrity/](https://w3c.github.io/webappsec/specs/subresourceintegrity/)

[2]
[https://code.google.com/p/chromium/issues/detail?id=355467](https://code.google.com/p/chromium/issues/detail?id=355467)

[3]
[https://bugzilla.mozilla.org/show_bug.cgi?id=992096](https://bugzilla.mozilla.org/show_bug.cgi?id=992096)

------
theg2
We offload a ton of our scripts to S3 buckets on random unrelated domains and
it's a pretty common practice. Did this take that into account?

~~~
jacquesm
No, it did not. It would have to tie in the whois data to make that match (and
even then it might not). The analysis is URL based; I don't think accounting
for those sites that use random domains to store chunks of their content would
make a huge difference, but it's a valid criticism.

------
cbr
It seems like you're marking sites down for using a cookieless domain for
resources, even though that's faster and no less secure? For example, you'd
mark Google down for referencing gstatic.com or Facebook down for referencing
fbcdn.com.

I realize there's no publicly available way to tell that yahoo.com and
yimg.com are the same entity, but it would be good to at least note this as an
issue with the analysis.

~~~
jacquesm
I'll do so.

Edit: done.

~~~
cbr
thanks!

------
pjungwir
I agree re: not using externally-hosted JavaScript. In fact I seem to remember
Google Code having connectivity issues a year or so ago and jQuery failing to
load all over the place. I was glad on that day that I always host my own
jQuery.

Re tracking, I ran into this embedded in some webfonts CSS a project was using
(downloaded from one of those font websites):

    
    
        /* @import must be at top of file, otherwise CSS will not work */
        @import url("//elided.example.com/count/35d82f");
    
        @font-face {font-family: 'Foo'; font-weight: 300; src: url('/webfonts/foo.eot');.....}
    
    

That @import returns nothing. It is just part of their tracking/licensing. And
it was really slow! And I love the lying comment they included.

~~~
Raphael
Well, the @import itself only works if it precedes other statements.

------
rufugee
Evercookies sound terrifying. Not that I'm doing anything that I really worry
about hiding, but I can't stand invasion of privacy like this.

Are there effective protections against them? If not, I wonder why the EFF
hasn't taken up the charge to fight them?

~~~
puredemo
Too many battles to wage for one fairly small organization.

~~~
rufugee
Then should the rest of us not step up? Or is it that there's no effective way
to combat it without making major changes to browsers we have little control
over? At the very least, someone could have a website which listed steps you
can take to protect yourself.

~~~
schoen
You can get some benefits from EFF's Privacy Badger and from Mozilla's
Tracking Protection feature.

[https://support.mozilla.org/en-US/kb/tracking-protection-firefox](https://support.mozilla.org/en-US/kb/tracking-protection-firefox)

[https://www.eff.org/privacybadger](https://www.eff.org/privacybadger)

Both of these tools are focused on cross-site ("third-party") tracking, rather
than cross-session tracking by an individual site ("first-party"). Third-party
tracking is technically easier to try to detect, and some people regard it as
more intrusive.

As I mentioned upthread, EFF's own research on browser fingerprinting shows
that it's hard to stop all user tracking (because your browser and OS and
device might be different enough from others to be unique in a population in
ways that could be observable by a remote site). Tor Browser is doing great
work on this

[https://www.torproject.org/projects/torbrowser/design/#fingerprinting-linkability](https://www.torproject.org/projects/torbrowser/design/#fingerprinting-linkability)

and I think they've made concrete progress. (I think the Tor Browser
developers might say that the privacy benefits of using their changes _without
Tor_ are unclear because you could also so easily be tracked by IP address.
But it's possible that some of their changes will find their way into mainline
Firefox, at least as options.)

------
lifeisstillgood
So, as someone who has never really bothered with blockers of any sort, what
would be the ideal blocker to install / write? (I am thinking iOS as sadly
that is my primary medium these days)

\- able to prevent download of any third-party hosted assets

\- able to hash the above assets and allow the user to approve their use (i.e.
approve jQuery v1.5 from cdn.google.com)

\- is this whitelist approach going to work? Does Ghostery or similar already
do this?

I vastly prefer a whitelist approach - but if 2/3 of the web will break I am
at a loss ...

------
ktusznio
What about services like npm that distribute code? Are these analogous or do
they have additional security in place?

~~~
jacquesm
Isn't that server side?

~~~
ktusznio
Yes, but the same attack could happen if an attacker gains control of an npm
module. Users without tight control over their modules could unwittingly pull
in malicious code.

------
lifeisstillgood
some relatively serious questions on the methodology:

\- how did you define third party assets vs domain-managed assets? Is anything
not hosted under example.com automatically third party? What about Twitter.com
and t.co? I know this one is picky but would like a feel for the figures.

\- how deep did you scrape the (million!) sites? If it's front page or similar
I would not be surprised to see figures revised upwards significantly - once
off the beaten track of even major sites the number of "let this one slide"
decisions spikes a lot.

\- how long did polling a million sites take?! What was the setup you used -
very interested even if it has nothing to do with methodology :-)

Thank you - you have at least made me rethink my lack of blockers

~~~
jacquesm
I will release code + data for bootstrapping but until then here are my
answers to your questions:

> how did you define third party assets vs domain-managed assets? Is anything
> not hosted under example.com automatically third party? What about
> Twitter.com and t.co? I know this one is picky but would like a feel for the
> figures.

That's based on the hosting domain being the same or a superset of the domain
that the page originally came from.
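
As a rough sketch of that rule (illustrative only, not the actual crawler
code):

    // a resource counts as first-party when its host equals the page's host,
    // or one is a parent domain of the other (e.g. www.example.com / example.com)
    function isFirstParty(pageHost, resourceHost) {
      return pageHost === resourceHost ||
             pageHost.endsWith('.' + resourceHost) ||
             resourceHost.endsWith('.' + pageHost);
    }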

> how deep did you scrape the (million!) sites?

Just the homepage.

> If it's front page or similar I would not be surprised to see figures
> revised upwards significantly - once off the beaten track of even major
> sites the number of "let this one slide" decisions spikes a lot.

That's true.

> how long did polling a million sites take?!

20 days. About 50K sites per day which significantly cramped my ability to do
other work here.

> What was the setup you used - very interested even if it has nothing to do
> with methodology :-)

A simple laptop with 16G of RAM and a regular (spinning) drive on a 200/20
cable connection. 40 worker threads ran concurrently, with a simple PHP script
to supervise the crawler and another script to do the analysis.

Most of the data was discarded right after crawling a page; only the URLs that
were loaded as a result of loading the homepage were kept, as well as the MIME
type of each result.

~~~
lifeisstillgood
Thanks!

Two things leap out. Firstly I love the way you chose to do 1 million sites. I
would have gone, hmm, maybe top thousand, and called it a representative
sample :-) The scale of the modern world is still something I am grappling
with.

Secondly, is that 200 Mbps down / 20 Mbps up? I think the UK has some
broadband access lessons to learn if that's true. My wet piece of string is
getting threadbare.

~~~
jacquesm
It's maybe overkill to do it on the whole set instead of just a sample;
probably the numbers would not change all that much.

The 200/20 is indeed 200 Mbps down and 20 up. This little trick saturated the
line pretty well, though. I probably could have saved some time and bandwidth
by letting phantomjs abort on image content, but I was lazy.

~~~
lifeisstillgood
I'm slap bang in the commuter belt round London - and broadband availability
is having an actual effect on house prices and on decisions to move out of the
area.

It's surprisingly low on the political agenda nationwide.

I'm about to get all English middle class over this so I will stop now :-)

~~~
jacquesm
Code has been released to:
[https://github.com/jacquesmattheij/remoteresources](https://github.com/jacquesmattheij/remoteresources)
have fun.

------
ifdefdebug
Are modern updated browsers resilient against those "evercookies" or not so
much?

To be more specific: I have my Firefox configured to delete cookies on exit.
Does that deal with "evercookies"? I must admit I had never heard of them
before...

~~~
jacquesm
Evercookies go a lot further than the regular cookies that you can delete per
session.

------
rhblake
> The request for the code contains a referring url which tells the entity
> hosting the script who is visiting your pages and which pages they are
> visiting (this goes for _all_ externally hosted content (fonts, images etc),
> not just javascript)

This can now be mitigated thanks to Referrer Policy [0]:

"The simplest policy is No Referrer, which specifies that no referrer
information is to be sent along with requests made from a particular settings
object to any origin. The header will be omitted entirely."

Voilà:

    
    
      <meta name="referrer" content="no-referrer">
    

It's a W3C draft, but it's supported by latest FF/Chrome/Safari, _and_
Microsoft Edge [1], although currently, with Edge, you'll want to use the
legacy keyword "never" instead. (AFAIK "never" works with all the
aforementioned browsers.)

> Google analytics junkies in particular will have to weigh whether they feel
> their users privacy is more important to them than their ability to analyze
> their users movements on the site.

There's a nice alternative - Piwik [2]. It's very much like GA, but GPL and
self-hosted, and with various options for privacy [3]. You can even use it
without cookies, if you don't mind the somewhat reduced accuracy and
functionality.

Regarding fonts from Google Fonts, it's super-easy to host them yourself.
There's a nice bash script [4] that downloads the font you want in all its
formats/weights and generates the proper CSS. There's also the google-
webfonts-helper service [5], and Font Squirrel has a webfont generator [6].
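
The end result is just an ordinary @font-face rule pointing at your own
server, something like (paths and family name are made up):

    @font-face {
      font-family: 'MyWebFont';
      font-style: normal;
      font-weight: 400;
      src: url('/webfonts/mywebfont.woff2') format('woff2'),
           url('/webfonts/mywebfont.woff') format('woff');
    }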

[0] [https://w3c.github.io/webappsec/specs/referrer-policy/](https://w3c.github.io/webappsec/specs/referrer-policy/)

[1] [https://msdn.microsoft.com/en-us/library/dn904194%28v=vs.85%29.aspx](https://msdn.microsoft.com/en-us/library/dn904194%28v=vs.85%29.aspx)

[2] [https://piwik.org/](https://piwik.org/)

[3] [https://piwik.org/docs/privacy/](https://piwik.org/docs/privacy/)

[4] [https://github.com/neverpanic/google-font-download](https://github.com/neverpanic/google-font-download)

[5] [https://github.com/majodev/google-webfonts-helper](https://github.com/majodev/google-webfonts-helper)

[6] [http://www.fontsquirrel.com/tools/webfont-generator](http://www.fontsquirrel.com/tools/webfont-generator)

~~~
chadscira
I'm impressed with the amount of browser support this has already. Thanks for
the info.

------
jcr
jacquesm, maybe you didn't want to wade into the details too much, but you
didn't mention a major attack vector on third party scripts, namely the
transparent caches run by nearly all ISPs. Unless a third party script is
served over HTTPS, regularly verifying the script is useless, since _your_ ISP
will give you _their_ cached copy, and the same is true for all of the site's
users. Transparent CDNs are another consideration for the related caching
problem.

~~~
jacquesm
The examples given were just examples; I can see a lot more possibilities
beyond the ones mentioned in the article, but to be honest I had not thought
about the ISP caches.

------
aw3c2
> Flash seems to be very rapidly on the way out, less than 1% of the domains I
> looked at still contained flash content

What exactly did you look at? Homepages?

~~~
jacquesm
Yes, homepages and all the content subsequently loaded (directly or indirectly
through multiple layers of scripting or iframes). Essentially what you'd get
if you were to visit each and every homepage on the top list and log the URLs
that were loaded as a consequence of that.

~~~
aw3c2
As much as I hate Flash, I don't think you can infer that, then. Does your
analysis consider [https://www.youtube.com/](https://www.youtube.com/) as
using Flash? I see no Flash on it here.

~~~
jacquesm
If it's not on the homepage then the analysis would not consider the site as
using Flash. The analysis was run on the homepages, _not_ on all the pages in
those websites. (That would require a lot more work on my part and likely
would not change the results all that much.)

I believe overall Flash usage on the web is now about 10%, but larger sites
are generally much better at keeping their sites up-to-date and at following
trends.

Advertising is another good indicator. The typical trick nowadays is to check
whether Flash is installed, using some javascript or header inspection, and to
only serve it up if support has been detected.

Websites that include Flash unconditionally are the ones that were detected.

That's a good point though, I should update the text to that effect.

Edit: OK, updated the text to be much more precise about Flash usage and the
conditions of the crawl, which will lead to under-representation of Flash.

------
throwaway41597
How deep did you crawl? I would have guessed the flash usage to be higher.

How big is the dataset? How long did it take? Which tools did you use besides
phantomjs?

Nice job!

~~~
jacquesm
> How deep did you crawl?

Front pages only.

> I would have guessed the flash usage to be higher.

When adding all the pages in a site it no doubt will be. I'll update the
article to clarify this.

> How big is the dataset?

In flight: huge, but after culling and keeping only the bits that I needed it
was a lot smaller, about 20G.

> How long did it take?

About 10 days.

> Which tools did you use besides phantomjs?

Just some php glue scripts, nothing fancy, about 500 lines.

------
heyalexej
This is very interesting. Will you release the data and code at some point?

~~~
jacquesm
Yes, I will definitely release the code and the dataset required to bootstrap
the rest. It takes a long long time to run and you'll need a good bit of
bandwidth. I won't be releasing the raw data because there is simply too much
of it.

~~~
heyalexej
Sorry for bugging you. Did you store results from the _response_ ¹ metadata
object for every domain and process it later or use Regex to parse the HTML
content?

I crawl large-ish websites (most recently
[https://code.google.com](https://code.google.com) with 1.8MM repos) often and
am really looking forward to your dataset & code.

[1] [http://phantomjs.org/api/webpage/handler/on-resource-received.html](http://phantomjs.org/api/webpage/handler/on-resource-received.html)

~~~
jacquesm
> Did you store results from the response¹ metadata object for every domain
> and process it later or use Regex to parse the HTML content?

That would have constrained throughput too much, so I opted for culling it
during the crawl to just the content-type and URL; this was then processed to
extract the various bits of information. I did use the 'resource received'
trick you linked above. Very useful.
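
For reference, that hook looks roughly like this (a minimal sketch, not the
code that produced the article):

    // PhantomJS: log content-type + URL for every resource a page pulls in
    var page = require('webpage').create();
    page.onResourceReceived = function (response) {
        if (response.stage === 'end') {
            console.log(response.contentType + ' ' + response.url);
        }
    };
    page.open('http://example.com/', function () {
        phantom.exit();
    });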

------
nvk
Run it against Coinkite.com, especially on signed-in pages.

------
beamatronic
>> "50% of the domains contained advertising of some form."

That's much lower than I would have expected

------
andrewljohnson
This article should be read as ranty research, not practical advice. It'd be
fine to fix these issues, but not at the website developer level.

 _" If you have to use externally hosted resources such as javascript
libraries then at a minimum you should verify regularly that the code has not
changed "_

No, you shouldn't. You should focus on stuff that matters to users, not
existential internet security holes. Let someone else fix this problem for
you, when it stops being existential.

 _By far the safest approach for website owners that care about their users
and their users privacy is to simply not include anything at all from other
people’s servers._

FTFY "A safe, but impractical and productivity-destroying aproach..."

~~~
larrys
Agree the advice is well intentioned and is correct (in theory according to
what I read) but not entirely practical. For example:

"then at a minimum you should verify regularly that the code has not changed
(you have to hope that you are looking at the same code that your users see)"

Who exactly is the "you" in the above statement and who pays the "you" money
to fix this and keep on top of it on an ongoing basis? And for how long?

In the physical world, the difference between ideal and practical can be
described by my experience with production machinery. The machinery came with
guards to protect the operators from getting their hands caught or cut off.
The guards also came with switches to prevent the machines from running when
the covers were taken off. But what would happen is that the operators would
want to oil or tweak the machines, so they would take the covers off and
disable the sensors so that the machine would run bare. Of course you would
tell them not to do this, but they would still forget to put the covers back
on, or be lazy, quite often, and there was little you could do about it. You
had production to get done under deadline and weren't likely to fire someone
even though you knew there was a small safety risk in doing this type of
thing. (Older machines of course came with no safety guards at all; operators
just had to be careful at their own peril. And good operators were impossible
to find anyway, so once you had someone they became a prima donna...)

~~~
jacquesm
Regularly pulling a hash for the libraries you include and alerting you when a
hash changes unexpectedly is no work at all.
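
Something along these lines would do (a sketch with a placeholder URL and
hash; run it from cron and wire the failure up to whatever alerting you use):

    // Node sketch: warn when a remotely hosted script's hash changes
    var https = require('https');
    var crypto = require('crypto');
    var EXPECTED = 'previously-recorded-sha256-hex-digest'; // placeholder
    https.get('https://cdn.example.com/jquery.min.js', function (res) {
      var hash = crypto.createHash('sha256');
      res.on('data', function (d) { hash.update(d); });
      res.on('end', function () {
        if (hash.digest('hex') !== EXPECTED) {
          console.error('remote script changed unexpectedly!');
          process.exit(1);
        }
      });
    });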

And if you need to be paid money to fix it then you have a problem anyway, so
one would assume that you'd be paid just as much to fix it when you're alerted
to it by a cron job as you would be when you're alerted by a horde of users.

As for machines without guards: I've worked (extensively) in the metal working
industry, and the number of people missing digits and limbs has decreased
steadily ever since tampering with guards, safety-interlocks and lock-outs
became a firing offense, so I don't think that's a very good example.

~~~
larrys
Machines: yes, this was the 80's (sorry, I didn't point that out, my mistake)
and things have changed. However, to that point: if you have your golden
machine operator turning out good work (and he is only 1 of 2 on a particular
line) and it's not easy to hire a replacement, let alone a good replacement,
you tend to get a bit lax.

Security: I am primarily a business guy (who does some light programming and
has known Unix since the 80's), so I hire others to do work for me. I am just
thinking, for the people that I have hired in the past, how would anyone know
if any of this is happening (other than through code audits), and what is the
mechanism to make sure the right thing happens even if you know what the right
thing is? It's kind of a version of the advice "make backups, but make sure
that you test your backups as well".

~~~
jacquesm
The motto is 'trust but verify', and indeed that goes for your backups as
well. Incidentally, that's one of the most-failed items during the due
diligences I've done; after verification, several companies turned out to have
been living without backups at all.

It usually takes two things to go wrong for a disaster to happen: some $0.05
part that fails _and_ a procedural error.

And the consequences can be just about anything.

~~~
larrys
One of the first books that I read talked about the story of the backup tapes
on a car seat that were erased when someone in Sweden (?) with heated seats
drove home. (Urban legend, IIRC.)

~~~
jacquesm
IIRC Saab pioneered heated seats because one of their engineers had colon
cancer, and Saabs are pretty common in Sweden, but I'd still wager that's an
urban legend: the heating is done with DC current, and to reliably alter the
contents of a tape you'd need a much stronger magnetic field to overcome the
resistance of the magnetic particles to changing direction (coercivity), and
you'd want that field to alternate.

