What would be great for kartik's app is a live "push"-style feed of all the new content on Hacker News (possibly including votes too). That way, the load on the site would be minimal.
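To make the idea concrete, here's roughly what consuming such a feed might look like. This is just a sketch in Python; the endpoint URL and the JSON format are made up, since no such feed exists:

    # Hypothetical push feed: one JSON object per line for each new
    # item or vote. The URL and field names are invented for illustration.
    import json
    import requests

    FEED_URL = "http://news.ycombinator.com/pushfeed"  # made up

    with requests.get(FEED_URL, stream=True) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line:
                continue  # ignore keep-alive blank lines
            event = json.loads(line)
            # e.g. {"type": "comment", "id": 1234, "by": "kartik"}
            print(event["type"], event.get("id"))

The client holds one long-lived connection open instead of re-polling the site, which is where the load savings would come from.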
This would also let a lot of other people scratch their own itch to try out outlandish new ideas. Few of us newsyc readers are even going to bother thinking up ideas for your site because we realize that there's a low chance of them being implemented. You just couldn't implement them all if you wanted to. If you provide this feed though, it would allow people to get the instant gratification of coding up their own ideas and, I think, a lot more innovation would occur.
I don't think you're trying to get rich by selling this site so there's not much to lose. Brand dilution likewise seems unlikely in this case. Any other downsides?
"This would also let a lot of other people scratch their own itch to try out outlandish new ideas."
There's an even better way to do that, which is to let users write little Arc programs to control the way pages are generated. That is the eventual plan.
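Something like this, presumably, though sketched in Python here rather than Arc, and with an API invented purely for illustration: each user supplies a small function that the server runs over the page data before rendering.

    # Hypothetical shape of a user customization: a little program that
    # filters the front page items before they are rendered.
    def my_view(items):
        """User-written hook: hide low-scoring stories."""
        return [it for it in items if it["score"] > 5]

    # The server might then do something like:
    #   page = render(my_view(frontpage_items))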
Great idea. I always knew my crawler was a bit of a hack and would eventually be made obsolete.
Until we get there, though, may my crawler be let back in? As I said in email, I promise to restart it only after building in controls that cap how many URLs it fetches per hour, and to log my crawling activity and keep an eye on it in the future.
The reason I was crawling every few minutes was to get notified of updates to conversations as they happen, before the 'herd' moves on to new feeding grounds. But I can make it crawl even just hourly and still stay reasonably abreast of the conversation here.
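For what it's worth, here's the shape of the throttling I have in mind, sketched in Python (the cap, log format, and user-agent string are placeholders, not what hystry actually does):

    # Hard cap of MAX_PER_HOUR fetches, enforced inside the crawler,
    # with every hit logged so I can keep an eye on it.
    import time
    import requests

    MAX_PER_HOUR = 60                    # placeholder cap
    INTERVAL = 3600.0 / MAX_PER_HOUR     # seconds between fetches

    def polite_fetch(urls, logfile="crawl.log"):
        with open(logfile, "a") as log:
            for url in urls:
                started = time.time()
                resp = requests.get(url, headers={"User-Agent": "hystry"})
                log.write("%s %s %d\n" % (time.ctime(), url, resp.status_code))
                # sleep off the rest of this fetch's time slot
                time.sleep(max(0.0, INTERVAL - (time.time() - started)))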
"... I believe that in the long run, all credible large-scale Internet companies will provide Level 3 platforms. Those that don't won't be competitive with those that do, because those that do will give their users the ability to so easily customize and program as to unleash supernovas of creativity. ..."
It would be a good way for hackers to develop, share, and debug Arc code, and it would boost the language's development using the pre-release code.
Until ojbyrne mentioned it, I had never really looked at the comments page. It's almost right, but it would be much easier to follow the conversation if each comment was tagged with the story it refers to. Right now the comment threads are disjointed from each other (that's the price of a chronological list) but they're also disjointed from their source article.
On hystry, it's 2 columns, with the left column being the title and link and the right being the comment. This makes it easier to mentally sort the comments into conversations. Like I said, most of it is already in the Comments page, which I now use instead of hystry. You might think of another way to do it that also works well.
It's a useful addition, but it messes up the visual flow, making it hard to quickly scan the comments page. How about putting the title at the end of the line with the rest of the metadata? E.g.:
3 points by pg 8 minutes ago | link | parent | source: Reinstate hystry's Hacker News
A 'source' or 'root' link would also be useful when navigating comments, to save having to click up through each parent.
Perfect! I actually thought about doing it that way (root next to link | parent) while I was driving home. The only caveat is that all of the Ask YC type discussions (not links to other sites) have news.ycombinator.com as the URL (without a link to the specific conversation). Other than that, it's perfect (until the Arc customizations are released)!
"When one of the customer support people came to me with a report of a bug in the editor, I would load the code into the Lisp interpreter and log into the user's account. If I was able to reproduce the bug I'd get an actual break loop, telling me exactly what was going wrong. Often I could fix the code and release a fix right away. And when I say right away, I mean while the user was still on the phone.
Such fast turnaround on bug fixes put us into an impossibly tempting position. If we could catch and fix a bug while the user was still on the phone, it was very tempting for us to give the user the impression that they were imagining it. And so we sometimes (to their delight) had the customer support people tell the user to just try logging in again and see if they still had the problem.
And of course when the user logged back in they'd get the newly released version of the software with the bug fixed, and everything would work fine. I realize this was a bit sneaky of us, but it was also a lot of fun."
pg, I think the current implementation has a problem. Most of the time, "on:" links go to news.YC comment pages, but when the comments are replies to the posted article itself (and not to other comments), the "on:" links go directly to the posted article/website. I think all the links should always go to the comment pages, to keep the "on:" behavior consistent.
Thanks a lot for implementing this feature, that was fast!
Edit: this happens only on the "threads" page, I think.
Much better! Just adding pagination means I don't have to worry about missing new comments if I crawl too infrequently.
But the crawler is almost entirely unnecessary now. If you could make it convenient (i.e., no reload) to show the ancestors of comments (the context of each conversation), that would cover pretty much all the features I built.
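In data terms the ancestor feature is tiny; here's a sketch in Python, assuming comments are stored by id with a parent pointer (the field names are my guess at a reasonable shape, not how News actually stores them):

    def ancestors(comment_id, comments):
        """Return the chain of ancestor comments, root first."""
        chain = []
        node = comments.get(comment_id)
        while node is not None and node.get("parent") is not None:
            node = comments.get(node["parent"])
            if node is not None:
                chain.append(node)
        chain.reverse()  # root first, so the context reads top-down
        return chain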
I've been using the new /newcomments. One other thing I noticed is missing: it's not stateful. There's no way for me to kill off some users/threads so I don't see them again. I think my query mechanisms were pretty elegant and lightweight: hystry permitted filtering conversations (rather than comments) involving specific people, and also made it easy to blacklist going-nowhere threads.
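The statefulness I mean is just a small kill file consulted before showing each comment. A sketch in Python, with invented field names standing in for whatever /newcomments actually exposes:

    killed_users = {"noisy_user"}     # users I never want to see again
    killed_threads = {1234}           # root story ids I've given up on

    def visible(comment):
        return (comment["by"] not in killed_users
                and comment["root"] not in killed_threads)

    new_comments = []  # assume: items parsed from /newcomments
    fresh = [c for c in new_comments if visible(c)]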
"... It was really pounding the server. Crawling for a search index is ok, but retrieving pages to dynamically update stuff generates a lot of requests. ..."
I hit the RSS feed every 15 minutes and have for months. I fetch the RSS file, extract the necessary data from it, then re-hit user pages once [0] to get their contact info. Some of the things I looked at to try to reduce hits were basic HTTP mechanisms [1]:
- 'Last-Modified'/'If-Modified-Since', to re-download only if the resource has been updated
- 'ETag'/'If-None-Match', to re-download only if the content has actually changed
- 'gzip' encoding, to reduce the size of downloads
But none of the above are implemented (probably for good reason: complicated and not really required). Checking what the server supports of the HTTP spec, it's pretty bare; I only found basic server status. I could also check the robots.txt file, but like I said I only download the RSS file and user pages, and a quick check for robots.txt reveals there isn't one anyway.
So you could automate/enforce crawling rules using a combination of HTTP headers and/or a robots.txt file, but that does require clients to correctly implement them (a conditional-GET sketch follows the notes below). Having said that, increasing the RSS feed from 25 to 30/35 entries to cover the front page [2] is about all I can really think of.
[0] I have to check this. Mainly I get contact info and cache it if found.
[2] Having only 25 entries means you might be tempted to grab the front page (and maybe the second page) instead of RSS because you miss the bottom 5 entries.
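And here's what the client side of a conditional GET would look like if the server did implement those headers. This is a sketch using Python's requests library, which also handles gzip decoding transparently:

    import requests

    URL = "http://news.ycombinator.com/rss"
    last_modified = None  # remembered from the previous fetch
    etag = None

    headers = {"Accept-Encoding": "gzip"}
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    if etag:
        headers["If-None-Match"] = etag

    resp = requests.get(URL, headers=headers)
    if resp.status_code == 304:
        pass  # not modified: nothing new to download or parse
    else:
        last_modified = resp.headers.get("Last-Modified")
        etag = resp.headers.get("ETag")
        # ... parse resp.text as the fresh feed ...

With those validators in place, a well-behaved client pays for a full download only when the feed has actually changed.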
If you can tell me what you think is missing from News, I can try to add it.