
WordPress usage worldwide - tanrax
https://github.com/tanrax/calculate-wordpress-usage
======
rbritton
By only profiling the top 1M sites, I wonder if this is sampling from a set
that isn't representative of the web as a whole. I suspect the frequency of
WordPress use goes up the further down the list you go: some random blog is
less likely to be on the top 1M list, yet more likely to use WordPress.

~~~
rdiddly
I suspect this is so. The top million (specifically a subset of that) is where
the sites fall whose owners can afford to hire people to build their own
custom stuff. Small independent sites seem more likely to rely on something
like WordPress.

~~~
AJ007
That is definitely not true. I think that statement is correct roughly for the
top 50,000. The fall off in economic viability is fairly abrupt.

There are some websites in certain markets that earn a disproportionately
high amount of revenue per visitor, and these are the exception. But, given
the nature of the web and how hard it is to trace how the user actually found
that site, a good chunk of their margins may have accrued to another party
such as Google.

~~~
rdiddly
I said "a subset of that." The top 50,000 is absolutely a subset of the top
million. But you don't need to conjecture about that number 50,000. Again I'm
saying the top million most-visited sites is where all the profitable sites
likely fall. (Not that they're all profitable.)

------
modernerd
The test is likely to be under-reporting WordPress installations as it stands.

The script seems to detect a WordPress site by looking for a meta generator
tag containing WordPress:

[https://github.com/tanrax/calculate-wordpress-usage/blob/5aa...](https://github.com/tanrax/calculate-wordpress-usage/blob/5aa344aa2d55c43ae0885b1b8c8f4fa09a787c0a/src/wordpress_used/core.clj#L24)

It's pretty common to remove that meta tag — popular WordPress theme
frameworks like Genesis do it by default.

A more reliable test would be to look for additional strings in the source
that point to the use of WordPress, such as “wp-content” and “wp-includes”.
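
A minimal sketch of that combined check, in Python rather than the repo's Clojure. The function names and the regex are my own illustration, not the script's code, and the regex assumes the `name` attribute comes before `content` in the meta tag:

```python
import re

# Hypothetical helper names; the markers themselves come from the comment above.
# Assumes name="generator" appears before content="..." inside the tag.
GENERATOR_RE = re.compile(
    r'<meta[^>]+name=["\']generator["\'][^>]*content=["\'][^"\']*WordPress',
    re.IGNORECASE,
)
PATH_MARKERS = ("wp-content", "wp-includes")


def wordpress_signals(html: str) -> list:
    """Return the names of the WordPress hints found in an HTML string."""
    signals = []
    if GENERATOR_RE.search(html):
        signals.append("generator")
    lowered = html.lower()
    for marker in PATH_MARKERS:
        if marker in lowered:
            signals.append(marker)
    return signals


def looks_like_wordpress(html: str) -> bool:
    """True if at least one hint is present (any single hint can be a false positive)."""
    return bool(wordpress_signals(html))
```

Returning the list of matched signals, rather than only a boolean, makes it possible to count how many sites strip the generator tag but still expose the path markers.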

A faster way that avoids string searches would be to send an HTTP HEAD request
to `/wp-login.php` and check for:

Set-Cookie: wordpress_test_cookie

(/wp-login.php doesn't always appear in the root directory and it's not always
accessible to all IPs, but that setup is most common).
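
A rough sketch of that probe using only the Python standard library. The function names are hypothetical, it does not follow redirects, and it assumes `/wp-login.php` lives at the site root, with the caveats noted above:

```python
import http.client
from urllib.parse import urlsplit

COOKIE_NAME = "wordpress_test_cookie"


def has_test_cookie(set_cookie_values) -> bool:
    """True if any Set-Cookie header value sets wordpress_test_cookie."""
    return any(v.strip().startswith(COOKIE_NAME + "=") for v in set_cookie_values)


def head_wp_login(base_url: str, timeout: float = 10.0) -> bool:
    """Send HEAD /wp-login.php to base_url and inspect the Set-Cookie headers.

    Network errors propagate to the caller; redirects are not followed.
    """
    parts = urlsplit(base_url)
    conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                else http.client.HTTPConnection)
    conn = conn_cls(parts.netloc, timeout=timeout)
    try:
        conn.request("HEAD", "/wp-login.php")
        response = conn.getresponse()
        # get_all returns None when the header is absent, hence the fallback.
        return has_test_cookie(response.headers.get_all("Set-Cookie") or [])
    finally:
        conn.close()
```

Something like `head_wp_login("https://example.com")` would then return `True` only when the cookie is offered. The header check is kept separate from the request so it can be exercised without touching the network.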

~~~
eugenekolo2
It'd be interesting to do all that you said (and more) and then determine
what's the combined amount, as well as what % of sites do some sort of
obfuscation... and why?

------
benbristow
The way this works seems to check if there's the "generator" meta tag in the
<head>.

Custom themes etc. might choose to omit that, so it's not a 100% reliable check.

~~~
sandov
The article says that they also look for "wp-admin" et al in robots.txt
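
For illustration, a robots.txt check along those lines might look like the sketch below. This is a guess at the idea, not the article's code: only "wp-admin" is named here, so the other two markers are my assumption.

```python
# Assumed marker list: only "wp-admin" is named in this thread.
ROBOTS_MARKERS = ("wp-admin", "wp-content", "wp-includes")


def robots_mentions_wordpress(robots_txt: str) -> bool:
    """True if an Allow/Disallow rule in robots.txt references a WordPress path."""
    for line in robots_txt.splitlines():
        # Drop trailing comments so a marker inside a comment does not count.
        rule = line.split("#", 1)[0].strip().lower()
        if not rule.startswith(("allow:", "disallow:")):
            continue
        path = rule.split(":", 1)[1].strip()
        if any(marker in path for marker in ROBOTS_MARKERS):
            return True
    return False
```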

~~~
reaperducer
I suspect (hope) that means none of the WordPress sites I maintain will be in
the tally.

Customizing the metadata, stripping out unnecessary cruft, and moving the user
and admin login pages behind IP-restricted firewalls (when possible) are among
the first things I do.

------
mxpxrocks10
You have to use a larger sample. There are 300M+ domains. The longer tail will
surely have more WordPress in it.

~~~
mxpxrocks10
Some other notes: 1) you're not checking subdomains like blog.company.com or
paths like company.com/blog 2) if you use something like zgrab you can do a
1M-site crawl in a couple of hours. Consider checking it out.

------
buboard
> the list of the first million domains with the most visits

How is that a random sample?

~~~
mxpxrocks10
Are you using HTTP or HTTPS? Following redirects? How many? This is all going
to change the study.

------
audessuscest
Only? Looks like a big number for worldwide usage, no?

~~~
rbritton
WP itself usually puts the estimate at 25-30%.

~~~
mfer
W3techs measures the top 10 million sites and has stats like that. See
[https://w3techs.com/technologies/overview/content_management...](https://w3techs.com/technologies/overview/content_management/all)

------
ga-vu
Google Translate link for the article at the end of the repo:

[https://translate.google.com/translate?sl=auto&tl=en&u=https...](https://translate.google.com/translate?sl=auto&tl=en&u=https%3A%2F%2Fprogramadorwebvalencia.com%2Fanalizando-un-millon-de-paginas-para-saber-cuanto-se-usa-wordpress-2019%2F)

------
capableweb
The Readme says "Warning that it can take a long time: between 20 to 30 days."

How in the world can it take such a long time? The CSV file seems to be 24 MB
in size and the computation performed can't be that advanced. Did the author
do something seriously wrong?
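
For scale, here is the back-of-envelope arithmetic, assuming (the thread doesn't confirm this) that the script processes the 1M sites one at a time:

```python
SITES = 1_000_000
SECONDS_PER_DAY = 24 * 60 * 60


def seconds_per_site(days: float, sites: int = SITES) -> float:
    """Average wall-clock seconds spent per site for a run lasting `days`."""
    return days * SECONDS_PER_DAY / sites


for days in (20, 30):
    print(f"{days} days -> {seconds_per_site(days):.2f} s per site")
```

That works out to roughly 1.7 to 2.6 seconds per site, which is plausible for one blocking HTTP request after another. On that assumption the time would be going to network waits rather than computation.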

~~~
buboard
> To run it you'll need either 2Gb of RAM

What? I guess this is a toy program used to learn Clojure or something - it
even uses sed for line parsing. A 10-line PHP script could do the same with a
few MB of RAM

~~~
tanrax
I agree, even a Bash script would be more efficient. Everything was born as an
exercise to learn Clojure. I can tell you that it actually uses 1.1Gb, but for
safety I recommend twice as much. Why does it use so much RAM? Ask Java :) (I
am the author of the script)

------
mfer
This is a pretty amazing feat. The top 1 million sites include many that have
the money to afford custom sites, and yet WordPress is still almost 1/5 of
sites.

~~~
smacktoward
WordPress is the software HN loves to hate, but while it certainly has plenty
of warts it's also a very flexible, pliable system for building the kinds of
web sites that most people want to build. It'll never win any architectural
beauty contests, but market share is driven by utility, not beauty. And
WordPress can be very useful software.

------
blondin
So they studied the top 1M websites and found 19% are using WordPress? Man,
that's a good chunk if you ask me.

~~~
tanrax
It's amazing. So we can say that 19% use PHP (at least) and MySQL.

------
hestefisk
I’m wondering why this takes 20-30 days to run all up? Seems crazy for 1M
requests. Could one make this a concurrent task and get much greater
efficiency?
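
One way this could be made concurrent is sketched below with a thread pool from the Python standard library. `check_all` and its parameters are hypothetical names, and `check` stands in for whatever per-site detection is used. This is not how the repo's Clojure script works.

```python
from concurrent.futures import ThreadPoolExecutor


def check_all(domains, check, max_workers=64):
    """Run check(domain) across a thread pool; return {domain: result or None}."""
    domains = list(domains)  # materialize so the input can be iterated twice

    def safe_check(domain):
        try:
            return check(domain)
        except Exception:
            return None  # treat unreachable or failing sites as unknown

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input domains.
        results = pool.map(safe_check, domains)
        return dict(zip(domains, results))
```

Since the work is mostly waiting on the network, threads should overlap well. At around 2 seconds per site, 64 workers would bring 1M requests down to something like 9 hours, ignoring rate limits, DNS, and bandwidth.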

