HN has a hair trigger about banning IPs that request too fast (sorry about that; we don't have a lot of spare performance), so I wrote something people can use to get their IP unbanned once if it gets banned by accident.
Obviously you have to use it from another IP address, like your phone.
...and if they are all hitting HN at the exact same millisecond, then their connections should be delayed.
HN serves with connection-close, not keep-alive, so as soon as one request is done, the connection is freed for the next visitor on the same IP. This would just force them to be in single file on a very quickly moving line instead of requiring dozens of connections to be served all at the same time.
Think of a grocery store with one super-fast express lane vs. no express lane, a dozen very slow cashiers, and people with full carts ahead of you.
Don't knock connlimit until you try it. Again, it's not a ban; it just backlogs the requests.
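To make the "single file" idea concrete, here is a minimal user-space sketch in Python. It is not how xt_connlimit is implemented and it is not anything HN runs; the class name `PerIPLimiter` and its `limit` parameter are made up for illustration. It only shows the behavior described above: extra requests from the same IP wait their turn instead of being refused.

```python
import threading


class PerIPLimiter:
    """Hypothetical helper: at most `limit` in-flight requests per IP.

    Requests over the limit are not rejected; they block until a slot frees up.
    """

    def __init__(self, limit=1):
        self._limit = limit
        self._guard = threading.Lock()  # protects the slot table
        self._slots = {}                # ip -> semaphore

    def _slot_for(self, ip):
        with self._guard:
            if ip not in self._slots:
                self._slots[ip] = threading.BoundedSemaphore(self._limit)
            return self._slots[ip]

    def handle(self, ip, work):
        """Run `work()` for `ip`, waiting in line if the IP is at its limit."""
        with self._slot_for(ip):
            return work()
```

With `limit=1`, a dozen people behind one NAT are served one after another, which is only pleasant if each request finishes quickly, as the express-lane analogy assumes.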
That sounds better, but it feels like a band-aid solution to me. For example, I worry about whether it will actually fix the load problems if a bad network has lots of requests, resulting in a very long queue and lots of open connections. It sounds like it's worth trying, at least.
As it currently stands, they would simply be unable to use HN if they were loading it at the same time, as the server would just ban them; do you feel that is really a better solution than the proposed delay?
I think that the proposed solution gives preferential treatment to users who were around long enough (or have enough money) to be on a network where they are assigned their very own personal IPv4 address. If IP addresses mapped 1:1 to users or machines, then I'd be all for using xt_connlimit to throttle users who perform excess requests.
Even if you add a proposed delay, a user behind one of these NATted networks could (unintentionally, I hope) cause a DoS by sending lots of requests to make the queue unreasonably long, which, to someone behind the NAT, is just as bad as a server ban.
pg: I have a fair bit of lisp dev experience. If, as a weekend project, I modified the HN src to use postgres and memcache, would you consider using it in production? Obviously, I don't expect carte blanche prior agreement, but I wouldn't want to invest the time unless I thought it was plausible the work could actually help.
I would expect it to solve most of your performance problems for the foreseeable future (at the very least, by letting you scale horizontally and move the DB, frontends, and memcaches to separate boxes - plus ending memory leaks/etc by moving most of the data off the MzScheme heap).
The obvious downside is that it would use your (or someone at YC's) time: first to merge the changes I make to http://ycombinator.com/arc/arc3.tar into the production code, then to buy/set up some extra boxes and do the migration. We're probably talking, roughly, a day. It also has the unfortunate side effect of costing HN's src some of its pedagogical value, since it adds external dependencies and loses 'purity'.
Been looking for an excuse to learn arc for a while now ...
Careful now :) It's not like there's anything stopping HN attracting a wider audience anyway; there's no restriction on who can register. Anyone can come and join in, which (in my opinion) is as it should be.
Of course. I'm not suggesting that there should be any limitations on who can join, but as the community moves more mainstream, quality will dilute. As the site is rather un-sexy right now, it seems to attract those who are genuinely interested. Remember what happened to Digg...
Very generous offer, but I would argue that HN's slow performance is a feature, not a bug. The average drive-by visitor, who is attracted to sensationalist articles and titles, simply doesn't have the patience for slow load times on every page. The user who is seeking intelligent conversation, however, is more than willing to put up with 5+ second wait times if they know that they will be getting valuable content. Couple that with page loads being consistently slow, without surges of performance, and I wouldn't put it past PG to have built a delay into page loads to act as a sort of filter. Even if it's unintentional, I would still argue that it is useful in driving out some riff-raff.
I also believe that Hacker News runs on a small stack of services developed by some past companies from Y Combinator.
I would agree that there is also little to no desire to make Hacker News "the news place", where it supports thousands of posts a second and is extremely popular. In general Hacker News is used by startups and people interested in startups (and the hope is that it stays that way). It's slowly growing to include more types of people (marketers, companies, bloggers who just want a lot of hits, etc.), and not many people want to purposely support that.
It makes me think that one non-negotiable feature of any webapp architecture is to detect situations when inbound strings are placed in any context where they can be interpreted as code, and either refuse to run or at least spit out a severe warning.
And there are no webapp architectures which do this.
Neat. Something like SafeBuffer is a practical way to approach the problem.
It seems like with the rise of 'zero copy' approaches we could do even better - simply designate a memory region as unsafe, and transform it into a safe version depending on which context it is used. These transforms would want to add a little metadata pointing to the original unsafe region in case the transformed region is ever subsequently used in a different execution context. Alas, from the perspective of one program the input to another always just looks like a string, which means that somehow our host program (and programmer) needs to signal the appropriate transform on, say, concatenation. The only way I can think of around this requirement is to force implementors of contexts to tag their interfaces as a context, and for callers to construct arguments to those functions such that constituents that derive from unsafe regions are detectable. For example we have a SQL context that takes an array of string pointers, where some of the pointers point to 'unsafe' regions, and we just concatenate the elements of the array to construct the context argument.
Ah, the fun part of this is "interpreted as code". Which language? html, xml, js, css, json? Get that part wrong or slightly off, and what you sanitized for one isn't for the other. And sometimes there can be nested contexts.
While the idea of "taint" is useful, it is only half the battle. The other half is accounting for the context.
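A small Python sketch of "taint plus context", in the spirit of the array-of-parts idea a few comments up. Everything here is illustrative: `Untrusted`, `ESCAPERS`, and `render` are hypothetical names, not part of SafeBuffer or any framework, and only two contexts are covered. Nested contexts (say, a URL inside an HTML attribute) would need the escapers applied in order and are not handled.

```python
import html
from urllib.parse import quote


class Untrusted(str):
    """Marker type for a string that came from outside the program."""


# One escaper per output context; the same input needs different treatment in each.
ESCAPERS = {
    "html": html.escape,               # escapes & < > " '
    "url": lambda s: quote(s, safe=""),  # percent-encodes everything reserved
}


def render(context, parts):
    """Join `parts`, escaping only the Untrusted ones for the given context."""
    escape = ESCAPERS[context]
    out = []
    for part in parts:
        if isinstance(part, Untrusted):
            out.append(escape(str(part)))
        else:
            out.append(part)
    return "".join(out)


if __name__ == "__main__":
    name = Untrusted('<script>alert("hi")</script>')
    print(render("html", ["<p>Hello, ", name, "</p>"]))
    print(render("url", ["/search?q=", name]))
```

Note that the marker is lost as soon as someone concatenates with a plain `+`, since that returns an ordinary `str`. That is why the sketch takes a list of parts and defers joining until the context is known, which is the requirement described above.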
Do you have a rough set of guidelines for how fast we should request from HN? For a side project, I was thinking of writing something that scraped the HN frontpage and all the associated comment threads every 10 minutes or so, and I'd rather not cause performance issues or get banned. I'd be happy to rate-limit requests to whatever is convenient.
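On the client side, the rate limiting could look roughly like the Python sketch below. The actual safe request rate is exactly what is unknown here, so `min_interval` is a placeholder to be filled in with whatever number the site operators give; `Throttle` is a hypothetical helper, and the clock and sleep functions are injectable only so it can be tested without waiting.

```python
import time


class Throttle:
    """Hypothetical helper: enforce a minimum gap between successive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds; placeholder, not an official HN limit
        self._clock = clock
        self._sleep = sleep
        self._last = None                 # time of the previous request, if any

    def wait(self):
        """Block until at least `min_interval` has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

A scraper would call `throttle.wait()` before every page fetch, so a run over the front page and its comment threads gets spread out instead of arriving in one burst.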
The first and foremost reason for me to visit HN is that it is fast. I am in China, and usually spend time on the web only on my phone, over a 3G connection.
HN's speed beats all other link aggregators, blogs, news sites, and even Google search, and -- most interestingly -- even fast Chinese sites.
I don't know why it is so fast (except when it is dead, obviously), maybe because of this flat-file architecture, which could just make sense. (Git is very fast too, right?)
And I think it is interesting that "make it fast" is a leitmotiv that has been forgotten by so many people, Google first among them, but is still a reason for some (me, at least) to pick this site over that site.
If the old ways work well enough, why bother to change?
I still call my Windows scripts .bat files (instead of .cmd, that's clearly for OS/2 programs).
It was only in the last three or four years that I stopped naming my files with all-uppercase names not longer than eight letters, with an extension not longer than three letters, to be sure they would be compatible with a FAT16 filesystem.
I'm rather distrustful of GUIs for doing things like moving or copying files.
I never drag-and-drop files into programs, partly because I seldom use GUI file managers, but mainly because most programs didn't support the metaphor when Windows 95 first came out, and I haven't bothered to check if things have gotten better yet.
Given these facts, you might find it surprising to learn that my age is less than 30.
Still and all, you can't help but notice how awesomely rock-solid browser textareas have gotten. I've actually had my laptop run out of charge and unceremoniously die on me in the midst of a humongous comment. Reboot, login, open browser, tabs all pop up -- and there's my comment. It's utterly amazing.
Yeah... if I open Chrome I am pretty much guaranteed to be banned for days. :( The mechanism should really be changed to account for this: a ton of requests per second for only a few seconds should not trigger an issue, it should be a number of requests per second spike along with some sustained usage per minute. I actually made modifications to Chrome to change how it loads tabs mainly because of Hacker News' weird IP ban system, but I still got burned recently as I accidentally hit "undo close tab" one too many times, which reopened an entire window.
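Something along these lines, as a rough Python sketch. I have no idea how HN's actual check works, so the class name, the limits, and the window sizes are all made-up assumptions. The point is only that a request gets flagged when a per-second spike coincides with heavy usage over the last minute, so a browser restoring a window full of tabs passes while a sustained hammering does not.

```python
from collections import deque


class BurstAndSustainedDetector:
    """Hypothetical check for one client (e.g. one IP).

    Flags a request only if BOTH the short-window count and the
    long-window count are over their limits.
    """

    def __init__(self, burst_limit, sustained_limit,
                 burst_window=1.0, sustained_window=60.0):
        self.burst_limit = burst_limit
        self.sustained_limit = sustained_limit
        self.burst_window = burst_window
        self.sustained_window = sustained_window
        self._hits = deque()  # timestamps inside the sustained window

    def record(self, now):
        """Record a request at time `now` (seconds); return True if flagged."""
        self._hits.append(now)
        # Drop timestamps that have aged out of the long window.
        while self._hits and self._hits[0] <= now - self.sustained_window:
            self._hits.popleft()
        sustained = len(self._hits)
        burst = sum(1 for t in self._hits if t > now - self.burst_window)
        return burst > self.burst_limit and sustained > self.sustained_limit
```

With, say, `burst_limit=5` and `sustained_limit=50`, twenty tabs loading within a fraction of a second exceed the burst limit but not the sustained one, so nothing is flagged.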
Yeah: I ended up figuring out a way to add it. I now generally like having the feature, but it was a complete necessity due to the Hacker News IP ban rules (although, as I mentioned, still doesn't solve the underlying problem for this site, which is incredibly touchy).
It is so annoying that all the other browsers STILL have not implemented this small but very effective idea. Please speak up about this more loudly, as it seems most developers here have not even noticed this feature...
My solution is to use a firewall with per-application rules and just turn off network access for Chrome before I launch it. On my laptop I just unplug the wired/wireless network during the launch. This was mainly because of HN, but it also has the added benefit of using fewer system resources, since a blank page is typically less resource hungry than a real page.
Firefox has a better solution for this but then again, I don't use firefox.
Repost from "Show dead" that relates to this issue:
[−]sunstone1 10 hours ago | link [dead]
Well I never had my IP banned but I did have my account hell banned after about a dozen posts as you can see. Oh, actually no, you can't see, because it's banned. No, I never bothered to get another account, now I'm just a taker not a giver.
Most of the time it's clear why a user was banned, but looking at sunstone's history I don't really see a reason. While the algorithm will never be perfect, it would be nice if there was a clearer solution for misfires.
Great news! I was banned last week (http://news.ycombinator.com/item?id=4736919); the ban has been lifted in the meantime. But this will come in handy the next time I'm developing an extension for HN and refreshing it all the time :)
Well, I might as well try striking while the code is hot..
It occurs to me that I would like to interact with noprocrast in a different manner. Currently, I leave noprocrast disabled most of the time. I like to use longish minaway times (~day), but this makes me feel as if my first visit to HN will start the clock ticking, and I'd better be sure to get my HN fill before the timer runs out (yes, this is kind of ridiculous). So I only enable noprocrast (with a short maxvisit) upon realizing I'm stuck in a web loop.
The mechanism that I envision is either a button that immediately starts a one-shot noprocrast ban, or a page-count based maxvisit. The latter might be better since it could always be left enabled.
Thanks Paul! I'm reluctant to try this in conjunction with developing any HN scrapers since I'm not sure what set it off in the first place and your language suggests it will only unban the IP once (I will, however, make sure the CMU IP I was using gets unbanned). It would be helpful to know what, precisely, that hair trigger is so we can make sure to avoid it.