You could also try pulling the handles from https://news.ycombinator.com/active and comparing the distributions you get, since that comparison will give you an idea of the volatility.
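One possible way to make that comparison concrete is sketched below. It assumes you already have two lists of account ages in days, one from the front page and one from /active. The two-sample Kolmogorov-Smirnov statistic is my own choice of measure, not something prescribed here, and the function names are hypothetical.

```python
# Sketch: compare two samples of account ages (in days).
# ks_statistic and compare are illustrative names, not part of any library.
from bisect import bisect_right
from statistics import median


def ks_statistic(sample_a, sample_b):
    """Largest gap between the two empirical CDFs (two-sample KS statistic)."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    for x in a + b:
        # Fraction of each sample that is <= x
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def compare(front_page_ages, active_ages):
    """Summarize how far apart the two samples are."""
    return {
        "median_front": median(front_page_ages),
        "median_active": median(active_ages),
        "ks": ks_statistic(front_page_ages, active_ages),
    }
```

A statistic near 0 would suggest the two pages draw on similar populations, and a value near 1 would suggest they barely overlap. With small samples either reading is only a rough hint.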
Then write up your results in a blog post and submit it here. I'm sure people would be interested.
Or is "codegeek" just a wannabe name?
========
EDIT: You know, I did wonder before hitting "submit" as to whether this comment was a bit harsh. I know that the YC mods want us to be less negative, but go look at my other comments. I'm usually pretty appreciative of the work people do, encouraging of the efforts they make, and free with suggestions, advice, and positive feedback. On this occasion I just felt that the best thing to do was point out that HN is intended for hackers, and that I expect that hackers will go away and do something, or build something, will experiment and explore. So that's why I'm asking the question: why did this question get asked? If you feel it's harsh and out of place then fine, downvote me. I feel that it's actually a positive contribution to the ethos of HN, and stand by it.
Along those lines, here's working Python 2 code, with a deliberately long (60 second) crawl delay and severely truncated search criteria (only the first 5 items on the front page, only the first 10 users). I'll let someone else do the full analysis, and risk being IP banned.
    from __future__ import print_function
    from bs4 import BeautifulSoup # pip install beautifulsoup4
    import urllib2
    import time
    import sys

    # https://news.ycombinator.com/robots.txt has a 30 second crawl delay
    DELAY = 60 # I don't need speed.
    sys.stderr.write("Using a %d second delay\n" % (DELAY,))

    def get(rest):
        s = urllib2.urlopen("http://news.ycombinator.com/" + rest).read()
        time.sleep(DELAY) # Play nicely with robots.txt
        return BeautifulSoup(s)

    soup = get("") # get the main page

    # Pull out the links to the items: <a href="item?id=9374889">
    item_hrefs = [tag.attrs["href"] for tag in soup.find_all("a")
                  if tag.attrs.get("href", "").startswith("item?")]

    # Find the users with comments in the stories from the front page
    # Users look like: <a href="user?id=dalke">
    users = set()
    for i, item_href in enumerate(item_hrefs, 1):
        sys.stderr.write("processing item %d of %d (%r)\n" % (i, len(item_hrefs), item_href))
        soup = get(item_href)
        users.update(tag.attrs["href"] for tag in soup.findAll("a")
                     if tag.attrs.get("href", "").startswith("user?"))
        if i == 5:
            sys.stderr.write(" ... 5 is good enough. Stopping.\n")
            break

    creation = []
    for i, user in enumerate(users, 1):
        sys.stderr.write("processing user %d of %d (%r)\n" % (i, len(users), user))
        soup = get(user)
        try:
            created = soup.find(text="created:").findParent().findNextSibling().text
        except AttributeError:
            sys.stderr.write("Could not find 'created:' for %r\n" % (user,))
            sys.stderr.write("Soup: %s\n" % (soup,))
            continue
        fields = created.split()
        creation.append((int(fields[0]), fields[1], user.partition("?")[2]))
        if i == 10:
            sys.stderr.write(" ... 10 is good enough. Stopping.\n")
            break

    creation.sort()
    for delta, unit, name in creation:
        print("%s - %d %s" % (name, delta, unit))
Here is what I found from the first 10 arbitrarily selected recently active users:
id=VieElm - 151 days
id=cushychicken - 514 days
id=fit2rule - 550 days
id=david-given - 760 days
id=Already__Taken - 883 days
id=justin66 - 1038 days
id=revelation - 1162 days
id=cygx - 1538 days
id=whatupdave - 1664 days
id=InclinedPlane - 2030 days
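For anyone who wants a quick summary of those lines, here is a minimal Python 3 sketch. The helper names `parse_line` and `summarize` are hypothetical, and it assumes every line has the `id=name - N days` shape shown above.

```python
from statistics import median

# Output lines copied from the run reported above.
REPORTED = """\
id=VieElm - 151 days
id=cushychicken - 514 days
id=fit2rule - 550 days
id=david-given - 760 days
id=Already__Taken - 883 days
id=justin66 - 1038 days
id=revelation - 1162 days
id=cygx - 1538 days
id=whatupdave - 1664 days
id=InclinedPlane - 2030 days
""".splitlines()


def parse_line(line):
    """Split 'id=name - 151 days' into ('name', 151)."""
    name, _, rest = line.partition(" - ")
    count, unit = rest.split()
    if unit not in ("day", "days"):
        # Assumption: ages are always reported in days, as in the run above.
        raise ValueError("unexpected unit: %r" % (unit,))
    return name.partition("=")[2], int(count)


def summarize(lines):
    """Return count, minimum, median and maximum account age in days."""
    ages = sorted(days for _, days in map(parse_line, lines))
    return {"n": len(ages), "min": ages[0], "median": median(ages), "max": ages[-1]}


print(summarize(REPORTED))
```

Ten users is far too small a sample to say much, so treat the summary as a smoke test for the pipeline and not as a result.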
I am not at all offended by this comment. You have a good comment history and you are one of the first on HN.
Having said that, the reason I did this poll was not just to get the numbers but to let fellow HN'ers reflect on how their HN experience has evolved over the time they have been on HN. I should perhaps have clarified this better in the poll description.
Oh and I am a "wannabe" totally :). So you got that right. I don't mean this as a sarcastic comment; really, my coding skills are at best being able to write a few scripts or edit HTML/CSS. I would totally label myself as a codegeek wannabe.
* Pull down the top 5 pages
* For each handle:
* * Look up their profile page
* * Compute how long they've been on HN.
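The steps above can be sketched offline, with no network access, roughly as follows. This is a Python 3 sketch using only the standard library. It assumes user links look like `<a href="user?id=NAME">`, as noted in the script elsewhere in this thread, and that you have somehow obtained each account's creation date. `HandleCollector`, `extract_handles` and `days_on_hn` are hypothetical names.

```python
from datetime import date
from html.parser import HTMLParser


class HandleCollector(HTMLParser):
    """Collect handles from links shaped like <a href="user?id=NAME">."""

    def __init__(self):
        super().__init__()
        self.handles = set()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None
        href = dict(attrs).get("href") or ""
        if tag == "a" and href.startswith("user?id="):
            self.handles.add(href[len("user?id="):])


def extract_handles(html):
    """Return the set of handles linked from one page of HTML."""
    collector = HandleCollector()
    collector.feed(html)
    return collector.handles


def days_on_hn(created, today):
    """Account age in whole days, given two datetime.date values."""
    return (today - created).days
```

Fetching the pages is deliberately left out. Whatever does the fetching should respect the crawl delay in robots.txt.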