Analyzing HN readers' personal blogs (dannysalzman.com)
243 points by dsalzman on April 11, 2020 | 94 comments



Of those who responded to the 'what is your blog and why should I read it' thread. HN is large. Likely larger than many of its visitors realize because the participants are a relatively small fraction of the readership.

So HN readers are not necessarily contributors. And not all contributors would plug their blogs in a thread asking them to do so.

If you want to get an idea of the HN readership rather than of the HN contributors you may want to start off by scraping all the profile pages instead, it will give you a much larger set of sample data to work with.


> you may want to start off by scraping all the profile pages instead

Oddly enough, I'm in the middle of this project right now! There are over 600,000 users, and it turns out that many of them use their profiles to share links to things other than personal blogs. I've done some scripting to automate deciding if a link is a personal site or not, but the whole endeavor has been significantly less trivial than I had hoped.
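A minimal sketch of what such a filter could look like, not the actual script mentioned above. It assumes a small hand-maintained blocklist of platform hosts; the `PLATFORM_HOSTS` set and the function name are hypothetical, and the list is nowhere near complete.

```python
from urllib.parse import urlsplit

# Hypothetical, incomplete list of hosts that usually are not personal sites.
PLATFORM_HOSTS = {
    "github.com",
    "twitter.com",
    "linkedin.com",
    "keybase.io",
    "youtube.com",
}


def is_probably_personal_site(url):
    """Guess whether a profile link points at a personal site.

    Returns False for known platform hosts (and their subdomains) and True
    for every other host, so false positives are expected.
    """
    host = (urlsplit(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if not host:
        # Not a parseable absolute URL.
        return False
    return not any(host == p or host.endswith("." + p) for p in PLATFORM_HOSTS)
```

A blocklist like this only removes the obvious cases; company sites, project pages, and link shorteners would still need a separate check or manual review.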

Regardless, I plan on sharing results soon, so be on the lookout!


This sounds like a very interesting project. Will definitely be on the lookout for this!


That’s a fair point! I would have to read HN’s terms of use though. Not sure if that’s allowed or not. I felt good scraping the comments section since everyone there “opted in” to share their website with the broader community.




You can take it as read that people who write blog posts do that to share them with the broader community.

Could one of you downvoters please explain why you think people would post blog posts if they do not want to share them? I fail to see how that would be possible.


And you never foresee the ripples. I wanted to buy books online and trade stocks when I was a student. MasterCard/Visa in Algiers, Algeria was a rarity, and banks didn't communicate.

I called every bank in the country that was listed by the central bank, I knocked on their doors until I found one that offered students a card.

I even had to write a document because they didn't have a form for that case. I finally got my card after a week of trips to the bank, city hall, administration, providing documents that were unlisted, etc.

I figured that this wasted time, multiplied by the number of people who would eventually go through the process, plus the fact that almost nobody in the country had the card, warranted detailing the steps. It would save time and help reduce the "underbanking". I wrote a blog post explaining how to do it step by step: provide the following documents, take these amounts in these currencies. I attached the document/form I wrote so people could fill it in and take it directly to the bank and save a trip. That asymmetry in information bothered me.

That 2013 post is read hundreds of times per day. It has had more than a thousand comments, although it only lists 700+. People asking me questions, then getting their own cards, then themselves answering other people's questions, then updating me on what has changed.

People I would meet in real life would tell me I looked familiar, and then they'd put it together and tell me they followed the post to get their card. Sometimes people referred me to my own post in case I wanted to get a card.

I received emails from people who freelanced online and wanted to bring money back here. One person even sent me credentials to their online account with thousands of dollars in it and asked me if I could find a way to transfer it here (I told that person not to do that again and she said she felt she could trust me).

Many times I'd receive a phone call from a friend who'd say they wanted to get a card, looked online, found a great post, then saw my name and laughed out loud because they knew me.

And of course, I met interesting people.


Yes, indeed. I have a few like that and I always wonder why those were the ones that got legs. This one for instance generates a couple of emails per week still years after:

https://jacquesmattheij.com/if-you-have-nothing-to-hide/

This one too, even inspired a play!:

https://jacquesmattheij.com/trackers/


I read the two pieces sequentially: from tragic to tragicomic.


Not a downvoter, but I did have a blog that I didn't promote, which I used mostly to get practice with my writing. I enjoyed it being public but with zero traffic, because writing that is 100% private can easily turn into a private rant. If there is the chance it will be seen, I hold myself to a higher standard, and hopefully that practice comes across when I engage in written communications with others, whether on HN, other sites, or at work.

Ultimately, though, all the sharing of our blog URLs and this related discussion made me realize that I didn't really want an audience, so I killed off my domain.


Direct personal experience is why I think someone would write a blog post and not have much interest in sharing. I’m not an enthusiastic self promoter. Turning up my lizard brain volume gets in the way of living in ways that I find more fulfilling.


I occasionally blog about interesting technical problems I encountered and how I solved them. Someone who has the same problem might happen upon my posts through a search engine and find them helpful, but I don't see the point of "promoting my blog" to people who're not looking to solve those problems. So the answer to "what is your blog and why should I read it" is that you probably should not, until you find my blog while researching a problem.


One reason is they don't actually want the "broader" community to see them. More than once I've seen people lament the fact that the "orange site" picked up one of their posts. There are Twitter accounts dedicated to making fun of things people say in the comments here. There are people who wrote scripts that do special things when the referrer to their site is HN.


Can you point me to some of those twitters? Sounds entertaining.


All real quotes:

https://twitter.com/shit_hn_says?lang=ca

I think the threads on imageboards mocking HN are generally funnier, though.


Thanks, and hell - I'll take those too if you got em


N-gate would be one, not sure if they're on Twitter though.

http://n-gate.com/


#hnwatch


> And not all contributors would plug their blogs in a thread asking them to do so.

Exactly. On a pseudonymous forum that’s as good as asking people to doxx themselves!


Or maybe they've seen the thread and decided not to participate. For instance myself - my blog is nothing special, so I didn't include it, even though it's linked in my bio.


There is a difference between acknowledging publicly who you are and doxxing, which I consider to be publicly releasing your phone number and physical address (at least).


If you participate under a handle here and use your real name on your blog, it may effectively amount to the same thing.

People can have a variety of personal concerns, from a nutcase stalker ex to "I work for BigCo and want to spout off online without getting fired" to "I happen to have some bizarrely unique name, so using my real name anywhere amounts to doxxing myself."

Lots of people on HN use a throwaway email just for HN and don't want general HN readership to know much about their lives.

People use a wide variety of approaches to having an online life while looking out for their own specific privacy concerns. Please note that most people with privacy concerns will not chime in to this discussion to explain to you why they make the specific choices they make as that would tend to be counterproductive and undermine their goals of protecting their privacy.


Linking to your website has this effect. For example, one of the few pieces of information on my website is my amateur radio callsign. You can take that to the FCC's helpful database and then get my home address. I have it there because I think crime is low enough that it's worth having a Google result for "who is that KD2DTW guy that I just heard?"


>doxxing

Where does this strange typo come from?


> "Doxing" is a neologism that has evolved over its brief history. It comes from a spelling alteration of the abbreviation "docs" (for "documents") and refers to "compiling and releasing a dossier of personal information on someone".[9] Essentially, doxing is revealing and publicizing records of an individual, which were previously private or difficult to obtain.

The term dox derives from the slang "dropping dox" which, according to Wired writer Mat Honan, was "an old-school revenge tactic that emerged from hacker culture in 1990s". Hackers operating outside the law in that era used the breach of an opponent's anonymity as a means to expose opponents to harassment or legal repercussions.[9]

https://en.wikipedia.org/wiki/Doxing


Doxing, not “doxxing”.


From the same place as phising, leet etc.


>phising

>leet

one of these is not like the other


Yes. I'm afraid to post my website because the mods might ban my account because they know me and don't like me.

I run a popular website, but I remind people Apple is Evil.


Link to the raw file is broken - Try this http://dannysalzman.com/files/hn-blogs.csv


Interesting to see the high numbers for GA. A bit of ‘do as I say, not as I do’ going on?


Every time I see a privacy outrage thread over here I think about how many readers/commenters on this site work for advertising/tracking companies or companies whose products include tracking code. (Full disclosure: I don’t.)


And the number of readers who just aren't that bothered (and are even less bothered about getting in to an argument on the internet about it).


Probably because there isn’t a good free analytics service that is easy to use (no need for self-hosting) and is able to collect the info that one wants. GA is free, and easy.


Piwik is free and does everything GA does. You can self host the analytics dashboard so no 3rd party privacy issues.


Self hosting is about as far from easy as you can get though.


“User privacy is only great so long as it’s effortless for the site operator”?


This was sourced from a posting asking people to promote their blog. People that promote their blogs are likely going to have some kind of analytics tool installed. Also, lots of blog platform sites come with analytics built in and some of those use Google.


Or perhaps you think there are more people against GA here than there actually are.


I've wanted to dump GA forever, but it was only a year or so ago I finally took the plunge and set up Fathom analytics on my sites. Now I'm happier with the faster load times and the ability to control my own analytics data (though some of it is more limited since no cookies are used)... but I do have one more small VPS I'm maintaining pretty much in perpetuity (eventually it'll move to a personal K8s cluster but resources still cost a little).

The friction is just great enough that most people still stick with GA, since it's already pretty much everywhere.


What does the "Programming Languages" section mean? Blogs that discuss the languages? Use the languages in snippets within posts?


I'm assuming it means the programming languages that the CMSs people are using are written in, hence the dominance of PHP and Node.js.


I like the analysis. However, I am curious why entries are not sorted by counts. As a rule of thumb, sorting alphabetically makes no sense!

Also, 382 sounds like a very small number, given the size of HN. I did try to find many blogs I read, but couldn't find one. So, crucially - was it a random sample? Or sample from top-liked, or from a particular month?

Some findings (e.g. the prevalence of WordPress) may depend on this procedure.


When I saw the thread, it had like 500 comments. My comment wouldn't have made a blip and maybe five people would have seen it, so I just didn't post it.


Thanks for the advice! I went in and sorted the tables. I love Markdown for its simplicity and for forcing me to focus on the content, but sometimes simple things like sorting a table are a pain. Such is life.
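For what it's worth, sorting a Markdown table by a count column can be scripted. A rough sketch, assuming a simple pipe table with a header row, a separator row, and integer counts in the chosen column; the function name is made up for illustration.

```python
def sort_markdown_table(table, column=1):
    """Sort the body rows of a pipe table by an integer column, largest first."""
    lines = [line for line in table.strip().splitlines() if line.strip()]
    header, separator, rows = lines[0], lines[1], lines[2:]

    def count_of(row):
        # "| nginx | 120 |" -> ["nginx", "120"]
        cells = [cell.strip() for cell in row.strip().strip("|").split("|")]
        return int(cells[column])

    return "\n".join([header, separator] + sorted(rows, key=count_of, reverse=True))
```

It will raise a `ValueError` on cells that are not plain integers (percentages, thousands separators), which is probably what you want for a one-off cleanup.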


At the top he writes how he got the data, from the 'what is your blog and why should I read it' post.


I missed that, since I hadn't clicked the link. (I understood the thread to be an inspiration, rather than this being an "Analysis of self-posted blogs in thread X".)


If you’re interested in more about what these blogs contain, I review the links in all these threads at https://kickscondor.com/hrefhunt/.

For example, here’s one from last year’s thread at this time: https://www.kickscondor.com/hrefhunt/6/


[off topic] FYI, I found this comment by subscribing to the RSS feed of your HN comments on Fraidycat (by linking to https://edavis.github.io/hnrss/ for your username)

Very cool!


Oh hey - just saw this. Thank you for this link - I’ve been looking for just such a thing!


I think I can spot my blog in the Analytics section, I guess I'm the only one using Gauges as analytics.


Ha -- I'm one of the 4 hosted with Caddy web server!

All of my other attributes were in the majority (Hugo, Google Analytics, etc...)


Yep. That came from annoying.technology


Hehe. My blog falls through every statistic: No tracking and self made generator.


Yeah I can spot my blog as the only Pelican user as well, oops.


> Static Site Generators

How can that be detected? I'm curious.


Only 22% had a static site generator detected. Beyond that, 33% had a CMS detected.

That leaves nearly half of the analyzed sites as unknowns! Speculation on what might be effective aside, the real answer wrt OP is basically "it wasn't".


I generate my blog with a static site generator I wrote, but I don't see how anyone would be able to tell by inspecting the output.

A heuristic that might work would be to add a cachebuster query parameter to the page url (?cb=$RANDOM) and see how long it takes to respond. The idea is that the three most common setups are:

* static site served with apache/nginx/etc, which will just ignore the query param

* dynamic site, which will regenerate the page

* dynamic site behind a cache where the cache doesn't know that the query parameter isn't needed, and so the cachebuster will cause the page to be regenerated
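A rough sketch of that heuristic, using only the standard library. The function names and the three-tries default are my own choices, and timings over the public internet are noisy, so treat the numbers as a hint rather than a verdict.

```python
import secrets
import statistics
import time
import urllib.request
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_cachebuster(url, token=None):
    """Return url with an extra cb=<token> query parameter appended."""
    token = token or secrets.token_hex(8)
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(("cb", token))
    return urlunsplit(parts._replace(query=urlencode(query)))


def fetch_seconds(url, timeout=10.0):
    """Time one GET request, including reading the whole body."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()
    return time.perf_counter() - start


def compare_timings(url, tries=3):
    """Return (median plain seconds, median cache-busted seconds)."""
    plain = statistics.median(fetch_seconds(url) for _ in range(tries))
    busted = statistics.median(
        fetch_seconds(with_cachebuster(url)) for _ in range(tries)
    )
    return plain, busted
```

Reading the result against the three setups: both numbers small suggests a static site, both large suggests an uncached dynamic site, and a small plain time with a clearly larger busted time suggests a dynamic site behind a cache. Each busted request gets a fresh random token so it cannot hit a cached copy of the previous one.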



Jekyll only adds that if you use their “SEO” plugin: https://github.com/jekyll/jekyll-seo-tag/blob/0943563d0aac60.... (I don’t use that plugin, so I add it manually to mine.)


If I'm not mistaken, Jekyll also adds it in the RSS feed metadata, if you use that plugin, but that's already a bit far fetched.


I checked my own site which is built with Hugo and that's what it uses

<meta name="generator" content="Hugo 0.56.3">


I added this to my RSS feed:

<generator>Jekyll v{{ jekyll.version }}</generator>

But it's optional.


Some of them output META-tags, or footers such as "powered by XXX".
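A small sketch of how both signals could be checked, assuming you already have the page HTML as a string. The class and function names are made up, and the "powered by" pattern is a guess that will miss plenty of footers.

```python
import re
from html.parser import HTMLParser


class GeneratorFinder(HTMLParser):
    """Collect the content of every <meta name="generator"> tag."""

    def __init__(self):
        super().__init__()
        self.generators = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "generator":
            content = attributes.get("content")
            if content:
                self.generators.append(content.strip())


TAG = re.compile(r"<[^>]+>")
POWERED_BY = re.compile(r"powered by\s+([A-Za-z][\w.-]*)", re.IGNORECASE)


def detect_generator(html):
    """Return the generator meta content, else a 'powered by' name, else None."""
    finder = GeneratorFinder()
    finder.feed(html)
    if finder.generators:
        return finder.generators[0]
    # Fall back to footer text such as "Powered by <a ...>Jekyll</a>".
    text = TAG.sub(" ", html)
    match = POWERED_BY.search(text)
    return match.group(1) if match else None
```

Neither signal is reliable on its own: the meta tag is optional in most generators, and footers are routinely edited out of themes.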


Or look for a link to the website source on github, gitlab etc and go from there. But that's more or less manual work given how many custom made SSGs there could be.


I'd say identify common templates? Or "made with Jekyll" blocks

If it's a custom template I think it's impossible.


That’s a helpful list of Google Analytics alternatives to check out, thanks!

There are 13 websites using Parse.ly, which starts at $500 per month! For a personal blog??



Neat. Here’s my blog if you do another run. https://www.forrestthewoods.com/blog/

It’s artisanally hand crafted HTML with a little VanillaJS on a few pages. No static generator used. Also hosted on Netlify. Although I use BunnyCDN for large media. I post very infrequently.


I’m surprised and a bit disappointed that only 4 of them use Matomo. I assumed this to be the quasi-standard alternative to Google...


When I announced my own analytics thing here a few months ago I got quite a few signups from the HN thread, and based on the bug reports, support requests, etc quite a few of the users are just running it on their personal blog. I'm aware of some other solutions as well that aren't mentioned at all in that table.

Personally I found Matomo rather hard to use. I tried it for a while but decided that writing my own was easier than figuring out how to get Matomo to do what I want (also: cheaper and easier). I think it's a good alternative for some use cases, but far from all. Related comment I left on Lobsters a few days ago: https://lobste.rs/s/cdrrty/why_you_should_stop_using_google#...


I looked into Matomo. Too much work. My blog is a hobbyist blog. I’m not maintaining a server for the next 30 years for my stats.


It's just a blog; Matomo is more common in production.


It's not an alternative if it's not free.


The self-hosted version is free.


I suspect that the larger-than-one-might-expect representation here for Erlang and Cowboy (an Erlang web server) is caused by sites hosted on Heroku, which would return Cowboy/Vegur in the Server header, rather than because many HN users are actually maintaining Erlang application servers to serve their blogs.
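If someone wanted to separate the two cases, a check along these lines might work. It assumes, from what I remember of Heroku responses, that the router sends `Server: Cowboy` together with a `Via` header mentioning vegur; that pairing is my assumption and is not taken from the article's data. The function name is made up.

```python
def looks_like_heroku_router(headers):
    """Guess whether response headers came from Heroku's routing layer.

    Assumption: Heroku answers with "Server: Cowboy" plus "vegur" in either
    the Via or the Server header. Cowboy on its own is not enough, since it
    may be a genuine Erlang or Elixir application.
    """
    normalized = {name.lower(): value.lower() for name, value in headers.items()}
    server = normalized.get("server", "")
    via = normalized.get("via", "")
    return "cowboy" in server and ("vegur" in via or "vegur" in server)
```

Sites that match could then be counted as "Heroku, backend unknown" instead of being attributed to Erlang.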


Interesting, I didn’t know Heroku used that stuff. It could be from Elixir/Phoenix websites, right?


I would like to see a word frequency count gathered from all the articles written on all these blogs and see if there are any interesting patterns.

Bonus if you can segment the data by date as well so we can see trends over time.

A person that builds such a system would have access to some pretty useful data.
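The counting part is the easy bit. A minimal sketch, assuming the article text has already been fetched and paired with a publication year (the hard part, not shown); the function name and the input shape are hypothetical.

```python
import re
from collections import Counter, defaultdict

# Lowercase words, allowing apostrophes so "don't" stays one token.
WORD = re.compile(r"[a-z']+")


def word_counts_by_year(posts):
    """Count words per year from an iterable of (year, text) pairs."""
    counts = defaultdict(Counter)
    for year, text in posts:
        counts[year].update(WORD.findall(text.lower()))
    return counts
```

Calling `counts[2020].most_common(20)` on the result would give the top words for that year; filtering out stop words first would make the output far more interesting.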


> ended up copy and pasting all of the text from the entire post and then using regex on the command line to spit out a list of URLS.

I had a chuckle. This is how so much data analysis happens in practice. Nothing like the command line for quickly cleaning some data.
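For anyone who wants to repeat it without a shell, roughly the same thing in Python. This is a sketch with a deliberately loose pattern, not the regex the author used (the post doesn't show it).

```python
import re

# Loose pattern: scheme, then everything up to whitespace, quotes or angle brackets.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def extract_urls(text):
    """Return unique URLs in order of first appearance."""
    seen = set()
    urls = []
    for match in URL_PATTERN.findall(text):
        # Drop sentence punctuation that got glued onto the end of the link.
        url = match.rstrip(".,;:!?")
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

It will still pick up a closing parenthesis after a link written in brackets, so a quick manual pass over the output is worth it.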

Great work!


I missed the original HN post, added my comment to the thread. Great work on the analysis.


Thank you for putting this together!

Surprising insights: many actually use Google AdSense, nginx over apache

I'd definitely be curious about: PageSpeed Insights (score, load time and size), post frequency/length over the past 12 months, external link density


I answered some of my own questions:

  "time": {
    "minTime": 0.7,
    "maxTime": 53.5,
    "meanTime": 5.5,
    "medianTime": 3.1
  }

 "size": {
  "minSize": 1,
  "maxSize": 33484,
  "meanSize": 1536,
  "medianSize": 1565
 }
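The comment doesn't say how these figures were produced or what units they are in. As an illustration only, a summary with the same four keys can be built from a list of measurements like this; the function name and the one-decimal rounding are assumptions.

```python
import statistics


def summarize(values, suffix):
    """Build a min/max/mean/median summary with keys like "minTime"."""
    return {
        f"min{suffix}": min(values),
        f"max{suffix}": max(values),
        f"mean{suffix}": round(statistics.mean(values), 1),
        f"median{suffix}": round(statistics.median(values), 1),
    }
```

A mean well above the median, as in the "time" block, usually points to a few very slow outliers pulling the average up.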


It would also be interesting to analyse the domain name registrars and DNS hosting providers used.

Based on my experience with the HN crowd, I'd predict that Gandi + Cloudflare would be a common one, with NameCheap closely behind.


Looks like the link to the CSV file is broken, which is a shame.



Whoops. Should be fixed now


One interesting thing that this didn’t catch was my CMS, Craft—which I think is a signifier that Craft didn’t leave a lot of fingerprints.


It would be interesting, in a future part of this series, to cross-reference the performance data with the technology used.


Not sure if I added my blog late to that thread but it's not in the dataset.

Granted there were a lot of URLs in that post!


It's interesting, and telling, that Windows/IIS doesn't appear at all.

Or did I miss something?


We run our entire website on ASP.NET Core, but we have a forum and blog which basically automatically mandate MySQL. We played around with PHP on IIS [0] for many years before giving up. We also tried the opposite and hosted our app on Mono before giving up on that due to bugs that would randomly cause compilation errors [1].

Long story short, we have two completely separate backends, each running on the most reliable platform for its stack. And nginx is in front of it all. Ping the root domain and you’ll think you are on a bog-standard Linux/nginx config, even though it definitely is not.

So don’t assume no IIS!

[0]: https://neosmart.net/blog/s=php+iis

[1]: https://neosmart.net/blog/tag/mono/


I would have thought that more blogs were on dev.to than these technology analytics suggest.

Their stack is Rails+Preact.


Nice work.



