

What we did about expired links on HN - kevin
http://pastebin.com/bSW5dfRQ

======
kevin
Written by dang in response to a question asked here:
[https://news.ycombinator.com/item?id=8423482](https://news.ycombinator.com/item?id=8423482)

------
timdierks
This is clever, and the incremental approach is a good engineering direction.
That said, readers should be aware that the closures approach is a dead end
for a large & complex service: unless you can figure out how to serialize the
full closure state to data that can be exchanged between machines, you're
always going to be RAM-constrained and subject to machine failure.

The loss of state isn't a big deal for a news site, but if you lost the user's
state in the middle of a significant process (e.g. dropping the cart mid-
checkout), it would be a big problem, even if it was infrequent.

What would be great is if a language provided a convenient way to turn closures
into data blobs that could be exchanged between machines or sent via the
browser (an untrusted channel that would have to be secured with
cryptography). That said, it's not obvious how to capture the semantics of the
full closure of the system state without significant work (look at how fragile
object serialization / pickling is). Definitely a good area for language
research; I'm sure there are some advances I'm not familiar with.
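
A minimal sketch of what I mean, in Python (invented names; HMAC only
authenticates the blob, a real scheme would probably also encrypt it):

    import base64, hashlib, hmac, json
    
    SECRET = b"server-side secret key"   # assumption: shared by the app servers
    
    def freeze(state: dict) -> str:
        """Turn captured state into a signed, URL-safe token for the client."""
        payload = base64.urlsafe_b64encode(json.dumps(state).encode())
        sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
        return payload.decode() + "." + sig
    
    def thaw(token: str) -> dict:
        """Verify the signature and recover the state on any server."""
        payload, sig = token.rsplit(".", 1)
        expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
        if not hmac.compare_digest(sig, expected):
            raise ValueError("tampered token")
        return json.loads(base64.urlsafe_b64decode(payload))
    
    token = freeze({"handler": "more_stories", "start": 30})
    print(thaw(token))   # {'handler': 'more_stories', 'start': 30}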

~~~
drostie
One thing that was exciting about Datomic to me (and still is, I just don't
have any projects big enough to use it for) is that you get an immutable
database state. You can emulate this in any normal database, but it gets
harder to enforce the discipline among other members of your team. The basic
idea is that you can serialize those same closures by just pointing to "the
state of the database at revision #5968", and then, though the database moves
on, you can always use that ID to compute the view of that database at that
point. It does the same "heavy lifting" that storing these closures is doing,
but you can easily share that ID across a distributed service with no "expired
links" problems.

It's worth mentioning that you _can't_ send a closure in a non-functional
context. That is, if Alice sends a closure to Bob, it can no longer be the
case that Alice's other operations mutate Bob's state. So you must be
serializing an "orphaned" environment tree with a bunch of closures which
point at different nodes of that tree.

You could definitely do this even better by stealing some ideas from
Smalltalk: encapsulate all of the state in some computational node (the
original notion of "object" in OOP) which interacts with all of the other
parts of the system by message-passing, and nothing else. To change the code
on-the-fly, you just swap out the "code part" of some node for a new code
part, and perhaps transform the state, queuing up the messages while you do
so; then you can start replaying those messages to the new code. The benefit
is that now at any time the nodes can move around servers arbitrarily, as long
as you've got a good name-resolution service to tell you where the object is
now.
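
A toy sketch of that in Python (all names invented): a node that queues
messages while its code part is being swapped, then replays them:

    from collections import deque
    
    class Node:
        def __init__(self, handler, state):
            self.handler, self.state = handler, state
            self.queue, self.upgrading = deque(), False
    
        def send(self, msg):
            if self.upgrading:
                self.queue.append(msg)          # hold messages during the swap
            else:
                self.state = self.handler(self.state, msg)
    
        def swap_code(self, new_handler, migrate=lambda s: s):
            self.upgrading = True
            self.handler, self.state = new_handler, migrate(self.state)
            self.upgrading = False
            while self.queue:                   # replay what arrived mid-swap
                self.send(self.queue.popleft())
    
    counter = Node(lambda s, m: s + m, 0)
    counter.send(5)                             # state: 5
    counter.upgrading = True                    # pretend a swap has started
    counter.send(7)                             # held in the mailbox
    counter.swap_code(lambda s, m: s + 2 * m)   # new code; replays the queued 7
    print(counter.state)                        # 5 + 2*7 = 19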

In other words: (1) interpret all the things so that code and data are the
same; (2) shared-state is your enemy; serialize orphaned states only; (3) you
have to explicitly handle the case where someone makes a request while you are
sending their closure to another server.

------
lacker
I'm curious, if you don't mind sharing: how many people are working full time
on the HN codebase right now? I don't even know whether that number is less
than or greater than 1.

I'm also curious whether there's a straightforward way to serialize these
closures. I'm guessing not, or at least that they wouldn't persist across
restarts.

FWIW, if I were working on HN, I would be concerned that at some point HN
will require more hardware resources to scale, and that it may require some
fundamental architectural changes, like more separation between app logic and
the DB. I wonder to what degree you have scaling plans, along the lines of "if
we hit capacity X we would need to replace subsystem Y."

~~~
kevin
There are two people dedicated to HN development: Daniel and Scott. Four other
people from the YC software team also contribute, but not full time because
they also have to work on other projects: Nick, Brett, Trevor and myself.
Other people on the software team: Garry and Dalton.

We actually just upgraded the hardware for HN very recently. A lot goes on
behind the scenes, and it's a testament to the team that it's all fairly
transparent to the community, to the point where most of you think nothing
changes at all. The point is to keep the community focused on contributing,
voting on, and commenting on the very best content for hackers.

It's one of the reasons I was delighted to see Dan write about some of the
work they do, like reducing expired links. It's a rare look at how much
thought goes into what feels relatively simple.

~~~
jacquesm
Making the complicated look simple is the hallmark of any great success. You
guys are doing absolutely stellar work. Every now and then I get a glimpse of
what is being done, and I feel that HN has definitely changed for the better
over the last year or so in a technical sense; the moderation transparency
has also greatly contributed to the change in atmosphere.

~~~
drostie
If you think about what magicians do and how they do it, "making the
complicated look simple" is also a nice definition of magic.

In other words, if you're ever thinking, "how could she have _known_ that I'd
choose the 7 of diamonds?!" you're probably the victim of some implicit thought
like, "she couldn't _possibly_ have bought 52 decks, pulled out the 7 of
diamonds from each of them, and built a deck consisting only of the 7 of
diamonds, so that no matter how I shuffled and cut that deck the top card
would be the same. No one in their right mind would spend the time and effort
to do _that_." But, that's exactly what she did. She did a lot of _abstracted
preparation_ so that the hard work behind the scenes just vanished at the
higher level of performance. That's what a good cook does, it's what a
magician does, and it's the entire role of "administration" and "middle-men,"
theoretically-speaking.

------
ams6110
The lesson is don't cling to a technically or intellectually elegant
architecture if it's falling over in the real world.

~~~
kevin
I wouldn't exactly describe Hacker News as failing. It allowed one person to
run a site for a growing community in his spare time, and the sacrifice was
the occasional expired link for some of its users. Not only do I think it was
a fairly elegant solution, I feel it was certainly a reasonable trade-off.
I'm sure pg would have been the first to change it had it actually affected
things that truly mattered for a community site like Hacker News: story and
comment quality.

------
barrkel
I've occasionally mused about the possible value of introspectable and
serializable closures. Rather than being memory-only, it would be nice if the
weight of keeping them around could be palmed off to the browser using cookies
or hidden fields.

To be practical, it would require that the activation record chain kept alive
by the closure is reasonably short, that the number of live variables in the
chain is fairly small, and that there be a reliable way of mapping code
references in and out. But I think it can be done.
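
As a small illustration of why those conditions matter: in CPython you can
already enumerate exactly which live variables a closure keeps alive, so if
that set is small, "serialize a code reference plus the captured values" is at
least plausible (hypothetical example, obviously not how HN does it):

    def make_more_link(stories, start):
        def render_more():
            return stories[start:start + 30]
        return render_more
    
    link = make_more_link(["story%d" % i for i in range(100)], 30)
    captured = dict(zip(link.__code__.co_freevars,
                        (c.cell_contents for c in link.__closure__)))
    print(sorted(captured))    # ['start', 'stories'] -- the live variables
    print(captured["start"])   # 30
    # A real scheme would map link.__qualname__ (the code reference) in and
    # out, and store only `captured` in the cookie or hidden field.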

One of these days, I'm going to implement a toy language with this feature
combined with my other favourite, automatically recalculated data flow
variables (think: "variables" that work like spreadsheet cells). These guys
are highly applicable to data binding, and making them a first-order language
concept makes them much more elegant to use.
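
A tiny sketch of the spreadsheet-cell idea (invented API, pull-based with no
caching or invalidation, which a real implementation would need):

    class Cell:
        def __init__(self, value=None, formula=None, deps=()):
            self.value, self.formula, self.deps = value, formula, deps
    
        def get(self):
            if self.formula is not None:        # recompute from dependencies
                self.value = self.formula(*(d.get() for d in self.deps))
            return self.value
    
    price = Cell(10)
    quantity = Cell(3)
    total = Cell(formula=lambda p, q: p * q, deps=(price, quantity))
    print(total.get())   # 30
    quantity.value = 4   # change an input...
    print(total.get())   # ...and the dependent cell "recalculates" to 40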

~~~
brudgers
Storing closures on the browser creates dependencies on JavaScript, user
browser settings, and implementation details of various browsers. At the time
HN was written, IE was common and Google Chrome did not exist. Doing arbitrary
work on an arbitrary client increases application complexity significantly.
Sometimes there's a big payoff. Sometimes there isn't.

~~~
barrkel
I meant closures (or rather, continuations) serialized, encrypted and stored
as a string. They wouldn't be callable in the browser.

~~~
JoachimSchipper
Take a look at Termite for Gambit Scheme; its focus is different from yours,
but it seems to be very good at serializing stuff (including continuations)
and automatically proxying the rest (e.g. file descriptors).

(I don't think this is actually a good idea - intra-datacenter traffic is much
faster, upgrading becomes hard in this scheme, and your security model needs
to be quite complicated - but I'd be interested in learning what you find.)

~~~
barrkel
I will take a look. Thanks for the suggestion.

Upgrading is hard, yes, but I have some ideas. A lot depends on the level of
the language, and I'm thinking of something fairly high-level.

------
redstripe
Does that mean HN runs in a single process and doesn't do any concurrent
request processing? That's pretty impressive if that's the case.

~~~
jacquesm
Actually, it shows you how incredibly inefficient most other ways of building
a web application are. The amount of overhead is stupendous.

~~~
kijin
There are different kinds of overhead, different kinds of inefficiency.

The HN codebase doesn't have the overhead of parsing hidden POST fields (for
example), but it does incur a massive RAM overhead to store all that
information as closures.

A run-of-the-mill web app, on the other hand, would incur the overhead of
passing state back and forth, and perhaps even of storing sessions on disk.
But it might consume less RAM.

What matters is what kind of inefficiency you're willing to tolerate in
exchange for what kind of benefits. RAM overhead is a smart choice if you want
your app to be very fast and you can afford to use a lot of RAM. A different
organization, however, might choose to incur a bit more code-complexity and
slower execution in exchange for other benefits. It all depends on what your
priorities are.

~~~
couchand
Yes, there are tradeoffs, but I'm not sure this is all that great an example:
a slower application and harder-to-maintain code are simply not worth it.
Computers are cheap - devs and customers are expensive.

------
IgorPartola
Would it be possible to extend this system to allow you to serialize and
deserialize the closures? Then you could store them externally in a "real
cache" and let that take care of the expiration, etc. You could then also open
up a nice security hole :)
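
Something like this minimal sketch (invented names): the serialized closure
goes into an external store with a TTL, and expiry becomes the cache's problem
rather than the web process's:

    import time
    
    class TTLCache:
        def __init__(self, ttl_seconds):
            self.ttl, self.items = ttl_seconds, {}
    
        def put(self, key, blob):
            self.items[key] = (time.time() + self.ttl, blob)
    
        def get(self, key):
            deadline, blob = self.items.get(key, (0, None))
            return blob if time.time() < deadline else None   # None == expired
    
    cache = TTLCache(ttl_seconds=3600)
    cache.put("fnid123", b"...serialized closure...")
    print(cache.get("fnid123"))   # the blob, until an hour passes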

~~~
pshc
Oh! Taking that a step further, what if you mmap'd an empty file first, say
1MB of zeroes. Then, just start writing closures to it, one after another.
When you hit 1MB, use mremap to add another 1MB. And just keep going! The
mapped file would become a fossil record of the webserver's access history.
The earlier in the file, the older the request, and the less likely it would
ever be mapped into physical RAM again. On a 64-bit machine with a modern hard
drive, you could probably go forever :)
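
A back-of-the-envelope sketch of that layout in Python (toy code: no locking,
no real record format, and mmap resize behaviour varies by platform):

    import mmap, struct
    
    CHUNK = 1 << 20                           # grow the mapping 1 MB at a time
    
    f = open("closure.log", "w+b")
    f.truncate(CHUNK)                         # start with 1 MB of zeroes
    mm = mmap.mmap(f.fileno(), CHUNK)
    offset = 0                                # next free byte in the mapping
    
    def append(record: bytes):
        """Write one length-prefixed record, growing the mapping as needed."""
        global offset
        needed = offset + 4 + len(record)
        if needed > mm.size():
            mm.resize(mm.size() + CHUNK)      # the mremap step: another 1 MB
        mm[offset:offset + 4] = struct.pack("<I", len(record))
        mm[offset + 4:needed] = record
        offset = needed                       # older records drift out of RAM
    
    append(b"serialized closure #1")
    append(b"serialized closure #2")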

~~~
espeed
Chronicle Map would be great for that: [http://openhft.net/products/chronicle-map/](http://openhft.net/products/chronicle-map/)

------
aneeskA
Thank you for the changes. I appreciate very much not seeing "Expired link"
these days :)

------
iaw
Thank you for this. I thought I had noticed fewer expired links and figured
something had changed on the back-end, but this was very informative.

------
perlgeek
Shouldn't it be possible to create a very compact way of serializing the
closures, and then use that serialization as a URL parameter?

If you can uniquely identify the function, you can know the variables that are
being closed over, so you would "just" have to serialize the function ID +
list of values of the variables, and URL-encode it.

Though I haven't done enough Lisp (or even Arc) to evaluate whether that's
doable without too much effort and fragile magic.
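
Roughly what I'm picturing, sketched in Python rather than Arc (invented
names, and the toy assumes the captured values are ints):

    import urllib.parse
    
    HANDLERS = {}                        # stable function ID -> handler
    
    def register(handler_id):
        def deco(fn):
            HANDLERS[handler_id] = fn
            return fn
        return deco
    
    @register("news_page")
    def news_page(start, per_page):
        return "render stories %d..%d" % (start, start + per_page)
    
    def make_url(handler_id, **closed_over):
        return "/x?" + urllib.parse.urlencode({"fn": handler_id, **closed_over})
    
    def handle(url):
        params = dict(urllib.parse.parse_qsl(urllib.parse.urlsplit(url).query))
        fn = HANDLERS[params.pop("fn")]
        return fn(**{k: int(v) for k, v in params.items()})
    
    url = make_url("news_page", start=30, per_page=30)
    print(url)           # /x?fn=news_page&start=30&per_page=30
    print(handle(url))   # render stories 30..60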

(EDIT: clarified a bit)

------
rtpg
This seems like a pretty good example of perfect being the enemy of good.
Would there be that much of a usability conflict if /?page=2 30 minutes ago
was slightly different than /?page=2 now?

Not to mention that if that much time has passed, there's a high chance that
the closure-style links would have expired anyway.

~~~
EpicEng
Agreed. I hated the "expired link" issue. It constantly sent me back to the
front page. I would much rather see a newer version of page 2 than to be
forced back to the beginning (after a backspace and a refresh).

~~~
couchand
It sounds like you don't mind the tradeoff made here but you'd like a better
failure mode. That's a great point.

~~~
EpicEng
Yes, exactly. I'm fine with an imperfect solution to this problem, but I think
that seeing a page N which was not the page N I would have seen had I clicked
the link 30 minutes ago is far preferable to getting a silly error and being
forced to navigate back to the front page.

------
d23
Can someone explain to a non-functional guy how a closure makes this more
simple, perhaps with a bit of pseudocode?

~~~
TeMPOraL
Disclaimer: I haven't looked at the code, I'm making an educated guess here.

I think it's implemented like this:

    
    
        // Pseudocode in a PHP-ish syntax; in actual PHP the closure would need
        // an explicit `use ($stories)` to capture the variable.
        function renderPage($stories) {
            renderHeaders();
            renderStories($stories);
            // storeLink() saves the closure under a fresh id and returns the
            // matching URL; the closure carries $stories along with it.
            renderMoreLink(storeLink(function() use ($stories) {
                renderPage(next30($stories));
            }));
        }

        function handleLinkClicked($id) {
            $code = fetchLink($id);   // look up the stored closure by its id
            $code();                  // run it to render the next page
        }
    

Using storeLink() you save a _function_ that can be executed later, when the
user clicks a particular link (remember, HN runs as a single process and does
not restart itself every time like, say, PHP does). This function remembers
the context in which it was created - in this case, the _$stories_ variable -
so all the data required to fulfil the request at a later time gets stored
with the function.

------
duckingtest
Perhaps sending the same data to everyone and letting JS filter it would be
much simpler and faster on the server end.

------
SiVal
Of course, I may be overlooking something, but this sure seems like a case of
overeducated engineers overengineering a solution that then underserves users
relative to a naive solution.

Individualized closures to keep track of what each user has already seen? Why
not just have a single, current ranking of stories for everybody and let me
pick how many of them I want on each page (up to a point)? Say I have a
preference that says I want 100 stories per page. Well then I get 3-1/3 pages
worth of non-repeating stories without all the computer science. If I later
click the "more" button, I get the second hundred, whatever they happen to be
at the time I click "next".
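
In code, the naive version is just an offset into whatever the shared ranking
happens to be right now (a sketch, with invented names):

    def page(ranking, page_num, per_page=100):
        start = (page_num - 1) * per_page
        return ranking[start:start + per_page]
    
    ranking = ["story%d" % i for i in range(334)]
    print(len(page(ranking, 2)))   # the "second hundred", whatever it is now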

That's easy enough for me to deal with. If it has only been a few minutes
since I loaded the front page, most of the second page will be new to me. I
don't care if a few of them aren't. I'll just skip past them. I'll have plenty
of titles to scan over. But if it has been, say, a day since I loaded the
front page, instead of clicking "more", I'll probably just reload the front
page. I can take care of that myself. No big deal.

With the new API, of course, we can now just build it ourselves. But all this
engineering sophistication, which resulted in a fixed, 30-article front page
and the maddening inability to ever get far past the first 30, reminded me of
Ted Nelson's Xanadu Project, which was so cleverly designed to prevent dead
links that it never went anywhere, while the naive Web (dead link? oh well)
changed the world.

~~~
brudgers
My impression is that Hacker News is largely a case of a single over-educated
engineer hacking up a side project. Closures were duct tape...not in the
"Alabama chrome" sense, but in that they are a stable, well-understood
technology. Files 'on disk' rather than an RDBMS are a similar engineering
simplification.

Sometimes HN chars a bagel. It's a toaster, not a microwave oven.

~~~
rthomas6
>Sometimes HN chars a bagel. It's a toaster not a microwave oven.

This is one of the strangest analogies I've ever heard.

~~~
brudgers
HN is a simple mechanism. It fails in a few predictable ways. No raw spots in
the chicken or exploding bowls of oatmeal.

------
pbreit
I am the last person to advocate a rewrite, but such a thing seems appropriate
at this point. Has anyone attempted one in Rails, Python, or Node? Considering
the source code is available, it should be feasible, right?

~~~
zachbeane
The code hasn't been released in five years. There have been a bunch of
security fixes since then, and likely many other changes as well.

------
chris_mahan
I just resigned myself to stay on the first page on HN.

------
chrisBob
Is there a good reason to use pastebin for this type of post instead of just
putting all of the info in the HN post? The only thing I can think of is that
it avoids being grouped into the _ask_ section, which I think has some
disadvantages.

~~~
spb
And what about Gists?

~~~
dang
I forgot about those!

------
dang
Url changed from [http://pastebin.com/dETyYtpX](http://pastebin.com/dETyYtpX)
because I added some things.

