

HN long term change - jacquesm
http://jacquesmattheij.com/hn-long-term-change

======
onewland
I'm relatively new, so it's difficult for me to see any major change that's
happened since I first joined. My personal opinion is that there has been
little change since I arrived, but much talk about how much has changed.

That said, I think an article about economics or politics can be more profound
and deep than a lightweight article about technology and be better for the
tone of the site. My personal preference is for news here to be technical, but
an article being technical is not enough for it to be interesting. Overall,
however, I want articles with meat, ones that dig a little deeper than "Head
First SQL" or "How I made a blog engine with Erlang" (or "I am Philip
Greenspun and I don't like people in Northern California").

While I love beautifully designed languages as much as the next guy, I've seen
probably 50 blog articles posted here about "Why language design matters" or
"Why language design doesn't matter" or "Why there can never be a better
Lisp". While these may be technical in nature, they're often very shallow and
redundant.

As far as I know, it would be an impressive feat to determine depth
automatically, but I think it would give you a better picture of how the tone
is changing over time in a relevant way. And maybe it would be a good filter
for submitted articles.

------
pg
If anyone wonders why there was such a high proportion of articles about
startups at first, it was because the site was initially called "Startup
News." After 6 months or so we changed the name and focus.

~~~
zitterbewegung
Do you think that as a site becomes large enough people start to change their
own focus and sort of get lost in the crowd? That eventually people are just
disillusioned by the site itself? I noticed this is sort of happening with
reddit / digg.

~~~
pg
I don't think largeness is the problem in itself so much as the decline in
quality and civility that usually accompanies it. If we can avoid decline
we're ok. This is mostly uncharted territory, but I'm hopeful. I'm going to be
working this year on tweaks to encourage people to be more civil in comment
threads.

------
goodside
Wonderful idea, but I'd like to see this with:

* Objectively ranked categories based on word usage, not author-provided tags

* Legible graphs with fewer colors

* Analysis that weights comments by karma (or, better, a reasonable non-linear function on karma)

* Open source for reproducibility and better outside critique
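
One way to read the "non-linear function on karma" bullet is logarithmic damping. This is a minimal sketch, assuming each item comes as a (category, karma) pair. The names `karma_weight` and `weighted_category_shares` are made up for illustration, and `log1p` is just one reasonable choice, not something anyone in the thread specified.

```python
import math


def karma_weight(karma):
    """Dampen karma with log1p; negative karma is treated as zero (an assumption)."""
    return math.log1p(max(karma, 0))


def weighted_category_shares(items):
    """items: iterable of (category, karma) pairs -> {category: share of total weight}."""
    totals = {}
    for category, karma in items:
        totals[category] = totals.get(category, 0.0) + karma_weight(karma)
    grand_total = sum(totals.values())
    if grand_total == 0:
        # Every item had zero weight, so no category gets a share.
        return {category: 0.0 for category in totals}
    return {category: weight / grand_total for category, weight in totals.items()}
```

With a log weight, a user with 10,000 karma counts for roughly twice as much as one with 100, rather than a hundred times as much.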

Edit: In fact, I'd like to see it enough that I might build it. Any other
ideas?

~~~
BearOfNH
I'd be interested in a breakdown of the 10 or so most popular source sites on
a (say) month-by-month basis. You'd expect to see a lot of articles
referencing Google, techcrunch, HN and the like but I'm also surprised by the
number of articles from the NY Times, WSJ and such.

Maybe determine "popularity" first by number of articles and then again by
score.
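
A rough sketch of that breakdown, assuming each post is available as a (month, url, score) tuple. That shape is an assumption, as is the function name `top_sites_by_month`; the date parsing needed to get the month is not shown.

```python
from collections import Counter, defaultdict
from urllib.parse import urlparse


def top_sites_by_month(posts, n=10):
    """posts: iterable of (month, url, score) -> {month: (top_by_count, top_by_score)}."""
    counts = defaultdict(Counter)
    scores = defaultdict(Counter)
    for month, url, score in posts:
        host = urlparse(url).netloc.lower()
        if host.startswith("www."):
            host = host[4:]  # treat www.example.com and example.com as one site
        counts[month][host] += 1
        scores[month][host] += score
    return {
        month: (counts[month].most_common(n), scores[month].most_common(n))
        for month in counts
    }
```

Each month maps to two ranked lists, so the "by number of articles" and "by score" views can be compared side by side.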

~~~
jacquesm
I'll see if I can do something like that.

I have to do all kinds of tricky date parsing to figure out when an item was
posted so this is not as easy as it seems at first glance.

------
jacquesm
I'll build another graph tomorrow that is weighted by votes, that might give
another perspective.

There seems to be some confusion about how to interpret the graph: the Y axis
is 'per post' and the X axis is per block of 1,000 posts.

The graph would have been harder to read at 138x1000 pixels so I've stretched
it a bit.

------
chaosmachine
One thing this graph can't account for is the quality of submissions. A
technology article about 8051 assembly language is a lot different than a
technology article about the top 10 SEO tips for bloggers.

~~~
jacquesm
That's absolutely true.

When this is all done I'll make all the data available so that people can mine
it for what it is worth.

A fair amount of work went into this little project, not all of it automated
(unfortunately); a lot of time went into making sure the tags would be
somewhat relevant.

If there had been any major trends that I could not explain by looking at
samples of the data, I would certainly have investigated them.

I hope that any such trends I missed will come out in the follow-up (weighted
by votes, see above). If not, the analysis will have to be _much_ more
detailed, and that will probably mean a lot more handwork than what went into
making this graph.

The 'good news' for me is that there is no unbounded growth of the
'unspecified' category, that would be a fairly large indicator of trouble.

------
tdedecko
Can you provide more information about how the sampling was done and how you
categorized the articles?

~~~
sidmitra
The graph also needs better labelling. For example, what is the X axis? Is
that snapshots over time?

~~~
jacquesm
The x axis is the rank number of the posting divided by 1000, so that's a
constant sampling interval in blocks of 1,000 but more compressed in time
towards the right because of the higher posting frequency.
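
The binning described above could look roughly like this. It is a sketch only, assuming the top-level tags are already in a list in posting order; `shares_per_block` is a hypothetical name, not the code actually used for the graph.

```python
from collections import Counter


def shares_per_block(tags, block_size=1000):
    """tags: top-level tags in posting order -> one {tag: share} dict per block."""
    blocks = []
    for start in range(0, len(tags), block_size):
        block = tags[start:start + block_size]
        counts = Counter(block)
        # Divide by the block's own length so a short final block still sums to 1.
        blocks.append({tag: n / len(block) for tag, n in counts.items()})
    return blocks
```

Each dict is one column of the stacked graph. The column index is the post's rank divided by the block size, not a date.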

~~~
jacoblyles
It would make more sense to use a constant time x-axis. Also, how did you do
the article labeling/clustering?

~~~
jacquesm
It wouldn't make much difference actually, apart from greatly complicating the
matching up of the Y axis.

The bigger issue is that this is just everything that is posted and not
flagged, so it is, if you wish, a view of the 'new' page; it has nothing to do
with the 'home' page. I'll try to address that tomorrow.

As for the labelling and clustering, that was based on keywords in the title
from a fair sized sample, and from the urls the links pointed to.
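
For illustration, a title-then-URL matcher of this kind might be sketched as follows. The keyword and domain tables are invented placeholders (the real lists are not given here), and `tag_post` is a hypothetical name.

```python
import re
from urllib.parse import urlparse

# Placeholder rules for illustration only; the real tag lists are not shown in the thread.
TITLE_KEYWORDS = {"startup": "startups", "funding": "startups", "python": "technology"}
DOMAIN_TAGS = {"schneier.com": "hacking", "nytimes.com": "mainstream"}


def tag_post(title, url):
    """Return a top-level tag: title keywords first, then the link's domain."""
    for token in re.findall(r"[a-z0-9]+", title.lower()):
        if token in TITLE_KEYWORDS:
            return TITLE_KEYWORDS[token]
    host = urlparse(url).netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    # Anything that matches neither table lands in 'unspecified'.
    return DOMAIN_TAGS.get(host, "unspecified")
```

Matching on whole tokens keeps this simple, but it also means "startups" would not match the "startup" keyword unless both forms are listed.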

What I am specifically searching for is larger trends, smaller trends would be
very difficult to catch using this method.

I'm actually quite surprised how even the graphs come out over the longer
term, I would have expected more variation in the submissions.

So if there is a problem at this point in time I would conclude that the
problem is not in the submissions, they seem to have roughly the same subjects
over the long term as they did in the beginning, with the exception of a shift
of focus away from 'startups' in the first year or so of operation.

I think that has to do with an influx of programmers / people interested in
technology in general whereas originally most of the people on news.yc were
active in the startup scene.

~~~
gruseom
_they seem to have roughly the same subjects over the long term as they did in
the beginning_

That's been my impression for a long time. Do your techniques allow you to
measure the trend of people complaining about the site deteriorating? Because
that's been going on for a long time too, and in approximately the same way
(though possibly in cycles).

_I think that has to do with an influx of programmers / people interested in
technology in general whereas originally most of the people on news.yc were
active in the startup scene._

Pretty clearly that is because the site was originally named Startup News and
had a relatively narrow scope, then was renamed to Hacker News as part of
explicitly broadening the scope.

~~~
jacquesm
> Do your techniques allow you to measure the trend of people complaining
> about the site deteriorating?

No, especially not because plenty of those get flagged and die.

------
teuobk
That looks surprisingly consistent to my eye.

Any data on how other similar sites (e.g., reddit) have changed?

~~~
JacobAldridge
Agreed, with the exception that the proportion of articles tagged 'startups'
decreased reasonably fast in the first quarter of the graph.

This was probably part of the iterative change from 'Startup News' to 'Hacker
News'. I can't recall exactly when the name changed, or whether it was a
response to that trend or precipitated the wider focus.

Edit: Change was made 14 August 2007.

------
mahmud
FWIW, yesterday I submitted a link to the history of Waite Group publishing by
Mitch Waite himself. It was an awesome piece on entrepreneurship, failure, and
the history of personal computing and it's still sitting at 1 point:

<http://news.ycombinator.com/item?id=1047482>

Meanwhile less topical, but more sensational stories shot to the top.

~~~
wallflower
I'm tempted to create some kind of site that checks the 1 pt submissions
automatically. I'm not sure how much semantic analysis it would need, but even
minimal keyword checking could flag possibly good articles. Hidden.HN

~~~
jacquesm
That's actually a piece of cake to do.

------
dill_day
I think this is interesting, but wonder if the change people talk about could
be more in the quality of discussion than type of submitted articles?

------
dunstad
I'm having difficulty matching the key with the graph for the thinner
categories, and for some similarly-colored categories.

~~~
diN0bot
i believe the layers (stacked lines) and legend are in the same order.

~~~
jacquesm
yep.

------
johnfn
Seems like the proportion of articles actually focused on hacking has
decreased, but everything else is pretty consistent.

Also, what is "as khn" and "as kyc"? It looks like one replaced the other.

~~~
tdm911
I believe it's just the text formatting gone awry. They are:

ask hn (ask hacker news) and ask yc (ask Y Combinator).

ask yc was obviously more popular in the early days.

~~~
jacquesm
You're correct. I don't know why the plot headings did that, in the tags they
are correct.

------
DTrejo
Technology is listed twice in the legend?

~~~
jacquesm
Hey, sharp eye! I must have misspelled it at some point, and the 'matcher'
then used both labels.

Those two should be summed, but the misspelled one is used very rarely
(fortunately).

I'll re-do the graph tomorrow when I'm awake, it won't affect any other rows
or the shapes though.

There were many subcategories as well, but I've used only the top level of the
tags to make the graph legible.

In total I used about 200 different tags.

------
fauigerzigerk
Thanks, that's interesting. However, I'm a little confused about the
categorisation. It looks like the categories add up to 100%. If that is the
case, the category "blogs" doesn't make sense in my view. All other categories
characterise the subject of the content whereas "blogs" says something about
the publication channel. In my view, there is no sensible way to label a blog
post about technology either "technology" or "blogs".

~~~
jacquesm
Correct, the problem here is that even though most of the blogs are technology
blogs it is very hard to categorize the majority of them as something
specific. For instance, Bruce Schneier blogs about security, most of the time,
so all articles that could be tagged like that are now under hacking,security.

But he also has lots of stuff that is not so easy to categorize, so that ended
up depending on the ease with which the title let itself be identified either
under 'technology' or, in the worst case under 'blogs'.

A similar problem appears with the 'mainstream' media websites, and it was
solved in the same way with the top level category as a catch-all after other
matches were ruled out.

------
gambling8nt
It looks like there are too many "unspecified" articles to learn much from
this visualization, other than a moderate decrease in the number of articles
in your "startup" category, supplanted largely by "ask" topics--a trend that
largely leveled off early in the x-axis on this graph (which would be
dramatically more useful if it had some amount of real-time benchmarks to give
some sense of scale).

~~~
jacquesm
See the text in the article about the 'unspecified'.

As for the scale, it doesn't get much more precise than this, the only
concession to legibility is to stretch the graph horizontally because
otherwise it would be only 138 pixels wide, vertical is very close to one
posting per pixel.

As the volume of postings on news.ycombinator increases due to increased
traffic to the site, the graph will stretch further to the right.

This could be counteracted by changing the algorithm to 'bin' more posts to
the right hand side to get for instance one month per bin, but in practice the
outcome would be the same, you'd just have another weighting to do to get the
Y-axis of the bins to line up.
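
A sketch of the per-month variant, assuming each post is a (date, tag) pair; `shares_per_month` is a hypothetical name. The extra weighting is just a division by each month's post count, so busy months and quiet months end up on the same 0-to-1 scale.

```python
from collections import Counter, defaultdict
from datetime import date


def shares_per_month(posts):
    """posts: iterable of (posted_on, tag) -> {(year, month): {tag: share}}."""
    counts = defaultdict(Counter)
    for posted_on, tag in posts:
        counts[(posted_on.year, posted_on.month)][tag] += 1
    shares = {}
    for month, counter in sorted(counts.items()):
        total = sum(counter.values())
        # Normalise by the month's own volume so the Y axes of all bins line up.
        shares[month] = {tag: n / total for tag, n in counter.items()}
    return shares
```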

------
shib71
Does this data include flagged/dead items?

~~~
jacquesm
No.

Those are only available to logged in members directly from HN.

I agree that that would make it a lot better.

------
vinutheraj
The green area is marked _technlogy_ and the orange area is marked
_technology_! Isn't that an error?

~~~
jacquesm
Yes, it was remarked on before, but you're wrong about the areas, if you look
closely you'll see that the rows are 'in order' and the green area is actually
a one time occurrence somewhere near the top. The other green area is the one
that has unclassified submissions in it.

------
mhb
I wonder if tracking just the number of flagged posts would provide similar
insight.

~~~
jacquesm
That would have to include some points threshold, since lots of spam gets
flagged as well.

The biggest indicator of something being 'populist' but not 'HN' is when it
gets killed after receiving more than 10 upvotes.
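
As a filter, that indicator might be sketched like this. It assumes each post is a dict with `points` and `dead` keys, which are hypothetical field names rather than anything HN exposes under those names.

```python
def populist_but_killed(posts, min_points=10):
    """Return posts that were killed after collecting more than min_points upvotes."""
    return [post for post in posts if post["dead"] and post["points"] > min_points]
```

The threshold is what separates these from plain spam, which usually dies with few or no upvotes.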

------
revorad
Nice one jacquesm. Do you mind sharing the data? I'd like to play around with
some graphs too.

~~~
jacquesm
All in good time, 3 or 4 more days before it is really done, this is the first
bit of usable info that I could extract.

The tagging has been a lot of work, to put it mildly, and it is far from
finished. Eventually I hope to crowdsource that part to get it perfect.

------
clistctrl
Interesting. Some quantitative proof to support comment rule #7.

~~~
jacquesm
I'm not quite ready to draw that conclusion, more work needs to be done for
that. Stay tuned :)

