Tell HN: The site was offline. What changed?
229 points by kogir on June 10, 2014 | 100 comments
Obviously things took longer than intended and expected, but in the end:

Items moved from /12345 to /12/34/12345. HN now starts in one fifth the time, and better utilizes the filesystem cache. Backup speeds are also improved.
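A rough sketch of the item layout described above, in Python rather than Arc. The function name `item_path` and the zero-padding of short ids are assumptions made for illustration; the post only gives the /12345 to /12/34/12345 example.

```python
from pathlib import PurePosixPath


def item_path(item_id: int, root: str = "/") -> PurePosixPath:
    """Map an item id such as 12345 to /12/34/12345."""
    name = str(item_id)
    # Assumption: ids shorter than four digits are zero-padded for the
    # directory components only. The post does not say how they are handled.
    padded = name.zfill(4)
    return PurePosixPath(root, padded[:2], padded[2:4], name)


print(item_path(12345))  # /12/34/12345
```

With two levels of two digits each, no directory level holds more than 100 subdirectories, which keeps every directory small.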

Profiles moved from /kogir to /ko/gi/kogir, and from /KogIr to /%k/og/%kog%ir, which works on case-insensitive filesystems. Similar performance improvements were observed.
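One encoding that reproduces both examples above is to escape each uppercase letter as "%" plus its lowercase form, then take the first two pairs of characters of the escaped name as directories. This is only a guess at the rule, sketched in Python; `encode_name`, `profile_path`, and the padding of short names are hypothetical, and it assumes usernames never contain a literal "%".

```python
from pathlib import PurePosixPath


def encode_name(user: str) -> str:
    """Escape each uppercase letter as '%' plus its lowercase form."""
    return "".join("%" + c.lower() if c.isupper() else c for c in user)


def profile_path(user: str, root: str = "/") -> PurePosixPath:
    encoded = encode_name(user)
    # Assumption: names whose encoded form is shorter than four characters
    # are padded with '_' for the directory components only.
    padded = encoded.ljust(4, "_")
    return PurePosixPath(root, padded[:2], padded[2:4], encoded)


print(profile_path("kogir"))  # /ko/gi/kogir
print(profile_path("KogIr"))  # /%k/og/%kog%ir
```

Because the escaped form contains no uppercase letters, two names that differ only in case can never collide on a case-insensitive filesystem.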

Passwords moved from a 45MB user->hash mapping file into user profiles themselves. Previously, this mapping file was re-written in its entirety every time a new account was created, and is why the site went down on May 18th. New account creation is now incredibly lightweight, and should allow us to further limit and possibly eliminate our use of captchas.

There were additional changes to support new features currently in the pipeline, but which we're not yet ready to announce.

I'm sorry that we were offline for so long. Nothing else we currently have planned should require anything more than a simple restart, so with any luck this will be the last major disruption of the quarter, and maybe even this year.

I know it's crazy talk, but glancing at my own profile, I count maybe 100 bytes of data? Yet to represent that data in memory, it's going to blow up to 4096 bytes plus structs to represent the inode and directory entry/entries because you put each profile in its own file.

By that count, you might get somewhere near a 40x cache utilization improvement if you just used a real database like the rest of us do - even just an embedded database.

This of course before saying anything about transactional safety of writing directly to the filesystem
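The 40x figure in the comment above is just the ratio of a page to a profile. A back-of-the-envelope check in Python, using the commenter's own estimates (100 bytes per profile, one 4096-byte page per file) and ignoring the inode and directory entry overhead:

```python
PAGE_SIZE = 4096      # bytes of page cache spent on one tiny file
PROFILE_BYTES = 100   # the commenter's rough estimate of one profile

ratio = PAGE_SIZE / PROFILE_BYTES
print(f"about {ratio:.0f}x")  # about 41x
```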

We're on the same page. First you stay up, then you improve with the time you bought.

Sometimes it isn't worth the effort to fix the old, and instead just go to the new and improved.

When the load balancer for reddit broke once, we didn't bother fixing it, we just replaced it with better (though untested) technology on the assumption it would work better. We figured it couldn't be any worse than it was, and we'd rather spend our limited time moving forward instead of treading water.

We considered this pretty seriously, and it might be required some day, but we think we'll be able to incrementally move toward a more highly available, better performing architecture without a continuity break.

Why not just port to the Reddit codebase? The functionality seems similar.

The HN code base is as much an experiment in Arc as it is a social news site. Also the feature set is surprisingly different (although I wish some of the features were here like comment collapsing and async comment submission).

Maybe it would be too much work to rewrite the specific logic that prevents HN 'manipulation' for reddit? Although I suppose much of the behavior and target audience is similar...

reddit.com/r/hackernews the new hacker news.

You might be joking, but in case not: reddit's code base is open source, so others can use it without moving the community under the reddit umbrella.

When we upgraded LBs we didn't have a continuity break. We just flipped the IP when it was ready.

For most DB upgrades we did dual writing so we didn't have to have a break.

> This of course before saying anything about transactional safety of writing directly to the filesystem

You do realize that rename(2), open(2) with O_CREAT | O_EXCL, mkdir(2) and still other POSIX filesystem operations are fully atomic, right?


See followup comment below. You're confusing logical atomicity with physical consistency and durability: yes, these operations may have certain atomicity guarantees from the perspective of the application, but they are entirely asynchronous from the perspective of the storage medium unless you explicitly fsync(), and for example on Linux, even then the default behaviour of ext4 is to allow metadata updates to complete prior to data updates (no "write barrier").

In other words:

1. fd = open("super-safe-file.tmp", O_CREAT|O_RDWR, 0644);

2. write(fd, "super-safe-data", 15);

3. close(fd);

4. rename("super-safe-file.tmp", "super-safe-file");

5. (kernel flushes file and directory metadata to disk)

6. (power is lost before the file data is flushed)

7. Machine reboots, "super-safe-file" exists, but no longer contains any data, since file data itself was never flushed.

8. Tears are shed, programmers are fired, backups are restored
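For comparison, here is a minimal sketch of the same write-then-rename pattern with the flushes made explicit, written in Python. `durable_write` is a hypothetical helper, not anything from news.arc, and whether fsync really reaches stable storage still depends on the filesystem, mount options, and hardware. Opening a directory for fsync works on POSIX systems but not on Windows.

```python
import os
import tempfile


def durable_write(path: str, data: bytes) -> None:
    """Replace path with data so a crash leaves the old or the new contents."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory so the rename
    # never crosses a filesystem boundary.
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # flush file data before the rename
        os.replace(tmp, path)     # atomic rename over the destination
    except BaseException:
        os.unlink(tmp)            # the rename did not happen, so tmp remains
        raise
    # Flush the directory entry so the rename itself survives a crash.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

The ordering is the point: data is flushed before the rename, and the directory is flushed after it, so step 7 above cannot find a renamed but empty file.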

I understand the problem scenario with ext3/ext4 journalling you're referring to here and below.

However, HN runs on FreeBSD, and my understanding is that the combination of soft-updates + journalling there actually does provide atomic rename, even in the case of catastrophic failure. McKusick talks about it here: http://www.mckusick.com/softdep/suj.pdf

Also, just to anchor the discussion a bit, the HN code does use the "write foo.tmp; mv foo.tmp foo" trick all over the place. (Or at least, the most recent version of news.arc I've seen does.)


You said POSIX, which makes no such guarantee. Soft updates are cool, though as far as I know they still don't provide durability. Still, that's far better than the default Linux behaviour.

And this is one of many reasons why you should use ZFS for your data. ZFS guarantees the atomicity of renames and would not have this problem. On Solaris and FreeBSD at least. I don't know about ZFS on Linux.

Using less virtual memory to store user data doesn't imply better cache use, only a smaller cache size. The tradeoff is memory over CPU. With a database you're using more CPU. It's a fair bet to say that your memory capacity will increase at a greater rate than your CPU, not to mention costing less to power and/or cool, and a simpler software architecture to support. Wasting memory is a simple hack to increase performance and decrease complexity.

> This of course before saying anything about transactional safety of writing directly to the filesystem

1) transactions aren't made or broken by what or when they're written, they're made or broken by being verified after being written, and 2) this is a user forum for people to comment on news stories, not an e-commerce site. Worst case the filesystem's journal gets replayed and you lose some pithy comments.

> With a database you're using more CPU

That's incorrect: you're potentially trading several system calls (open, read, close) and their associated copies, which have high fixed costs, for (with the right database) no system calls at all. I've spent most of the past year working with LMDB, and can decisively say that filesystems can in no way be competitive with an embedded database, by virtue of the UNIX filesystem interface.

> this is a user forum for people to comment on news stories, not an e-commerce site

That much is true, though based on what we've learned in the parent post, until today all passwords on the site were stored in one file. Many popular filesystems on Linux exhibit surprising results rewriting files, unless you're incredibly careful with fsync and suchlike. For example, http://lwn.net/Articles/322823/ is a famous case where the many-decades traditional approach of writing out "foo.tmp" before renaming to "foo" could result in complete data loss should an outage occur at just the right moment.

So you're saying LMDB looking up a user-specific record and returning it will always be faster than either an lseek() and read() on a cached mmapped file [old model] or an open(), read(), close() on a cached file [new model] ? Is the Linux VFS that slow?

In terms of transaction guarantees, I thought the commenter was talking about the newer model where each profile is an independent (and tiny) file; if that's the case, then deleting and renaming files wouldn't be necessary, and any failures in writing could be rolled back in the journal rather than be a file that's now non-existent or renamed. From what I understand, the most the ext4 issue would affect this newer model would be to revert newly-created profile files, which again I think would be a minor setback for this forum.

Yes, absolutely LMDB will always be faster, because LMDB can return a record with zero system calls.

Can't make the same guarantee about other DB engines. Take a look: http://symas.com/mdb/inmem/

> …with the right database, no system calls at all.

How does that work? Doesn't the database talk to the filesystem? Aren't there a bunch of syscalls going on there?

Serious databases can use raw partitions with no filesystem for storage. Even when storing data on a filesystem a database is unlikely to be using a single file for each entry; the database might make one mmap system call when it starts, and none thereafter (simplified example). The point is that the database can do O(1) system calls for n queries, whereas using the filesystem with a separate file for each entry you're going to need O(n) system calls.

You could of course avoid this problem by using a single large file, but that has its own problems (aforementioned possibility of corruption). Working around those problems probably amounts to embedding a database in your application.
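A toy illustration of the "one mmap, then no further system calls" idea, in Python. This is not how LMDB works internally; the fixed 16-byte record layout and the names `build_store`, `open_store`, and `lookup` are invented for the example. Page faults on cold pages still enter the kernel, so "zero system calls" only describes the warm-cache case.

```python
import mmap
import os
import tempfile

RECORD_SIZE = 16  # hypothetical fixed record width, for illustration only


def build_store(path, records):
    """Write each record padded to RECORD_SIZE bytes into one file."""
    with open(path, "wb") as f:
        for record in records:
            f.write(record.ljust(RECORD_SIZE, b"\0"))


def open_store(path):
    """Map the whole file once; later lookups are plain memory reads."""
    with open(path, "rb") as f:
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)


def lookup(store, index):
    """Slice the mapping directly: no open, read, or close per record."""
    start = index * RECORD_SIZE
    return store[start:start + RECORD_SIZE].rstrip(b"\0")


path = os.path.join(tempfile.mkdtemp(), "store.bin")
build_store(path, [b"alice", b"bob", b"carol"])
store = open_store(path)
print(lookup(store, 1))  # b'bob'
```

With one file per record, the same three lookups would cost an open, a read, and a close each.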

In the read-only case, pretty much any embedded DB with a large userspace cache configured won't read data back in redundantly.

In the specific case of LMDB, this is further extended since read transactions are managed entirely in shared memory (no system calls or locks required), and the cache just happens to be the OS page cache.

Per a post a few weeks back, the complete size of the HN dataset is well under 10GB, it comfortably fits in RAM.

I guess even most persistent embedded databases are on top of a good auld FS... SCNR...

Can someone explain why "Items moved from /12345 to /12/34/12345. HN now starts in one fifth the time" increases performance? Why is it better?

Large directories can be very slow to scan (readdir). By fanning out the items into more directories of fewer items, performance is improved. Git does something similar with its objects directory. This is an old Unix trick.

Note that Linux ext3/ext4 have a dir_index option which improves readdir performance.

Edit: it's not readdir performance so much as just name lookups which are improved by this technique (and dir_index): http://ext2.sourceforge.net/2005-ols/paper-html/node3.html

My experience was the opposite. In code I wrote

- I never needed to scan the directory to find "all users" (after all, if you have millions of users, this is going to take a while whatever directory structure you use)

- Modern filesystems either use a tree or hash structure to identify files, meaning that lookup by name, and creating/deleting files, is quick, even if you have millions of files.

- Given no performance benefits are to be had by directory nesting, I always went with the option of simplicity, i.e. having everything in one directory.

(I blogged about this here: http://www.databasesandlife.com/flat-directories/)

But no doubt the HN developers had a reason for doing this change, I'd love to know what it is (e.g. if they need to do something I never needed to do, or if they need to do the same things but I was wrong.)

I read your blog entry. Your experience was with tru64 and you also mention zfs. These and other file systems may indeed use data structures to make filename lookup performant.

But traditionally, ufs and ext2/3/4 (without dir_index) have to perform a linear scan through a linked list for lookup, and so they do indeed grow slower with number of files. This is likely where the fanout strategy originated from.

So as usual, YMMV and you should test on your file system of choice.

Personally, I don't really consider that fanout adds much complexity and I'd be surprised if it hurt performance.

edit: HN runs on FreeBSD. Not sure if they are using zfs or ufs, but I'm going to guess ufs. UFS apparently has a dirhash which improves directory lookups, but it's an in-memory structure so it won't help in the cold-cache case after reboot and it can be purged in low memory situations too.


edit2: I wonder whether the HN admins ever tried tuning the dirhash settings? http://lists.freebsd.org/pipermail/freebsd-stable/2013-Augus...

We used to run UFS and we tuned both the kernel itself and the dirhash settings. Now we run ZFS.

The site loads 5 times faster. So clearly there are performance benefits to directory nesting.

Sure, but that's the least helpful possible response you could have made. We've got an observation:

> I never needed to scan the directory to find "all users"

> lookup by name, and creating/deleting files, is quick, even if you have millions of files.

And a question: given these observations, where do the benefits of filesystem fanout come from? Is it not true that looking up a file by name is fast no matter how many other files sit in the same directory? Is HN doing something weird?

You can't answer the question "where do the performance benefits come from?" by saying "look, the performance benefits exist".

> You can't answer the question "where do the performance benefits come from?" by saying "look, the performance benefits exist".

I think what he is trying to say is that the parent poster's observations must be wrong. After all, we are talking about an unsubstantiated claim ("there's no benefit to fanning out files") that directly contradicts another claim which we have data for ("HN is 5x faster after fanning out files").

Again, when someone asks why they're wrong, it's not useful to tell them "but you're wrong". Parent poster already acknowledged that the combination of his ideas and the facts on the ground didn't make sense. What good does it do anyone to repeat it back to him?

I guess when you're absolutely sure that you're right, but the observation proves you wrong, you have to be prepared to consider the possibility that you're wrong.

The comment I was replying to was saying that the file system takes care of it automatically, so there's no purpose to arranging millions of files into directories. I'm not going to speculate how it all works under the hood.

But this wasn't all they changed, e.g. password storage is now different.

A large number of files in a Windows directory was a killer. I cannot remember the full details as it was years ago, but once you went above 10,000 or 20,000 files, performance just died on Windows. I believe it was because a bunch of the main API calls for accessing files in directories were inefficient.

I agree on Windows (esp when using a GUI), but HN does not host on Windows.

An aside: When storing large quantities of things like this my personal preference is to split the id from the end.

I was once in charge of a large number of images of book jackets named by the books' ISBN. At least in that population (which of course is an extreme example considering how an ISBN is created) the distribution is much more even (that is, the directory sizes are relatively equal) when using the end than the beginning, but I would not be surprised if that is a normal outcome.
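The effect is easy to reproduce with plain sequential ids, sketched here in Python. The id range is made up for illustration and is neither ISBN data nor real HN ids; it only shows why leading digits of unpadded numbers give lopsided buckets while trailing digits do not.

```python
from collections import Counter

ids = range(1, 25001)  # hypothetical sequential ids

# Bucket by the first two characters versus the last two digits.
by_prefix = Counter(str(i)[:2] for i in ids)
by_suffix = Counter(f"{i % 100:02d}" for i in ids)


def spread(buckets):
    """Return (smallest, largest) bucket size."""
    return min(buckets.values()), max(buckets.values())


print("prefix:", spread(by_prefix))  # prefix: (1, 1111)
print("suffix:", spread(by_suffix))  # suffix: (250, 250)
```

The trade-off is locality: suffix buckets spread recent ids across every directory, while prefix buckets keep neighbouring ids together.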

Maybe it's an application of Benford's law [1].

[1] http://en.wikipedia.org/wiki/Benford%27s_law

Directory scans suck. So if you break up the space into sets of prefixes you limit the number of files in each dir and traversal gets much faster. Ditto adding/removing items.

Some filesystems are worse at this than others (xfs... let's not go there).

Wasn't the big miracle of Reiser4 supposed to usher in an era where this was no longer a problem?

Reiser3 is enough to avoid this issue; I still use it when I have to run linux.

I think Hans killed that era.


Would the HN software need to scan the directory e.g. to read in all users, at any point? I don't know the source code of HN but I can't see why that would be necessary.

(And if it did need to, presumably it would now need to recursively scan all sub-directories, which would also take a while?)

No, you need to scan the directory every time you read a file. So most filesystems do a lot of work to optimize this but it is still a significant factor.

Not sure I agree with that. I ran the software for a community with 6M users, and on 2003 hardware we had millions of files in one directory. That was with advfs on tru64 so things might be different with other file systems. But e.g. zfs can do this no problem as well. I just sort of assumed other FSs must have caught up in the intervening 10+ years but I haven't looked into their source code so I am prepared to admit I might be wrong about them.

The same way a database fetch doesn't load the whole table, filesystems can and do use trees and hashes to organize directories so that file lookup, creation and deletion by name can be fast and can be concurrent.

I posted this in another comment, but this was my understanding of the situation in 2010. http://www.databasesandlife.com/flat-directories/

There must be a reason why they did this change, either I am wrong about performance (perhaps my results really were particular to those filesystems) or perhaps I am right and they made the change for another reason. I'd like to learn the answer.

I don't think the fact that advfs on tru64 did well is any evidence that other filesystems are not dealing with this poorly. I'm running an XFS filesystem right now that still totally sucks at this particular aspect but I'm loath to move all the data off the machine, rebuild it all and then move it back.

For one it would need a vast amount of temp space, the site would be down while doing it and the end result would be much the same as what it is today (I rarely modify the filesystem).

You'd want to do so to check for dup usernames, for instance.

Someone will explain this better but, as I understand it, having a huge number of files in a single directory becomes inefficient at a certain number; organising those files hierarchically brings an improvement. If HN really stores each item as a file in one big directory, that directory used to contain almost 8 million files ...

Not only inefficient; some linux commands fail when they're used on more than a few million files at a time - there is a maximum number of arguments that they can handle.

EDIT: bzbarsky's explanation below is more accurate.

Typically the issue is not the commands themselves but the shell. Trying to do "something *" on the shell command line will expand out the glob, and if the resulting string is too long (e.g. you fail to malloc() it!) the shell will do something ranging from crashing to not running the command and giving a useful error message.

Not claiming to be an expert here but I typically do this when confronted with a large number of files.

Instead of: command *

  for i in [someregex]*
  do
    command "$i"
  done


I know I could also do command [someregex]* but I like the comfort of having each item echo back to the terminal so I know the progress.

That still relies on the shell expanding a glob of millions of files. Another method is to use 'find' and 'xargs' to avoid specifying the files as arguments explicitly.
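The same idea, streaming directory entries instead of expanding them into one giant argument list, can be sketched in Python with `os.scandir`, which yields entries lazily. `for_each_file` is a hypothetical helper and the three-file demo directory is only there to make the example runnable.

```python
import os
import tempfile


def for_each_file(directory, action):
    """Apply action to every regular file without building an argument list."""
    count = 0
    with os.scandir(directory) as entries:  # entries are produced lazily
        for entry in entries:
            if entry.is_file():
                action(entry.path)
                count += 1
    return count


demo = tempfile.mkdtemp()
for name in ("a", "b", "c"):
    open(os.path.join(demo, name), "w").close()

seen = []
print(for_each_file(demo, seen.append))  # 3
```

Nothing here ever holds all the names at once except the `seen` list used for the demo, so the approach does not hit an argument-length limit.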

This is a common technique to limit the number of files in any single directory when you store your data in many, many files on disk. You will also see this in things like web cache software and the like.

That's what it is. The specific "why" is that there aren't many O(1) directory traversals.

So a million files in a dir is going to take longer to access any individual file than if there's only 3 files to pick from.

And if it scales worse than linear, a tree structure, although hitting the FS multiple times once at each level, in total can take less time.

Finally, if you can avoid a smooth distribution hash and intentionally order by something important (time?), then you only need to cache the most recent directories in memory, and the deep historical archive can fend for itself, rarely accessed, without getting in the way of the busy files. If you rarely if ever leave /stuff/thisYear/today/ then whatever is in /stuff/2011/dec25 will never slow today down or get in the way.

If you look up "directory hashing" (possibly also "directory sharding"), that'll explain things. Particularly relevant to things like email servers and shared web hosts.

Wow that's interesting that you use files to store the data. Is there any sharding across machines or is it all just one machine? Do you use big SSDs or old spinning disks?

It's a single machine.

Newbie question here - I'm just curious. Is it quicker to store the data this way in loads of files or to use Postgres?

For new development I'd recommend PostgreSQL over flat files for most projects.

Really depends on what you're trying to store though. For large data (images, audio) that can't fit in a table row, the filesystem is way better.

In our case we started with flat files, and buying breathing room is the first step to move past them.

Curiosity: Why did you start with flat files? It looks like hackernews was started in 2007, relational databases had been around for quite some time at that point, and were the standard way to store such forum data (see: every popular forum framework at the time)... the decision to store this sort of data (news/link forum with comments, 100% text) as flat files is very confusing to me.

My guess is that Arc - the Lisp language running HN, created by Paul Graham - was new, and coding and maintaining a database driver was out of the question.

Today, perhaps the way to go would be to use some sort of JSON web service interface to a database written in another language rather than writing a driver.

That would be my guess as well. It's one thing to decide you want a simple forum and have it coded within a couple of days. It's entirely another to spend months creating a stable database library and keep it up to date with all the latest changes.

Or you could, you know, use a more popular language.

I don't know why he did it, but some possible reasons:

If you already understand your OS well, filesystems have simple, known, and reliable performance characteristics. Databases involve a lot more code, and are harder to reason about.

If you're starting something, it pays to start simple. How many 2007 projects made it to today with this much traffic? A very tiny fraction.

If you're keeping a lot of data hot in RAM and working with it directly (which I hazily understand is HN's approach), then databases don't buy you much. Typical database usage is to use a database not just as a persistence engine, but a calculation engine, a locking engine, a cross-machine coordination engine, and other stuff as well. If all you need is persistence, then that isn't very hard to do yourself.

For things you intend to build and maintain yourself, "standard way" may not buy you anything. Graham already had a toolbox he knew perfectly well. He didn't have a lot of incentive to learn somebody else's way.

See also the Viaweb FAQ -- http://www.paulgraham.com/vwfaq.html

pg calls out lisp and flat files as unconventional choices that worked well enough.

There are too many variables to give a straight answer. Writing to a bunch of different files at the same time? Postgres is probably faster since all that disk IO contention gets turned into a write to the WAL. Don't have the resources to give postgres? Files may be faster than watching pg choke on a lack of RAM or CPU time.

I just don't get one thing. Don't you have a dev and staging server, where you could develop, test, and pre-deploy everything without shutting down the website?

Online data migrations are complicated and error prone. We deemed it not worth it.

All the code was tested before going live, but at some point you actually have to move and re-format all 8.5 million files.

Why worry about case-insensitive file systems if you are not using one currently?

Because running out of the box on OS X makes testing way easier. Currently we use DMGs, and performance is terrible.

I know a lot of folks develop on OS X, but this just seems like another argument to develop on Linux or a BSD.

But then, I haven't owned a non-Linux box in 16 years, so I may be an outlier...

It isn't an argument. DMGs are fine for development and testing and they can be made case-insensitive. Or you could just, you know, create a separate partition.

If the DMG speed is too slow for development and running tests(not testing with the full live dataset) you are doing something wrong.

So we shouldn't test using a snapshot of live data? Seems prone to finding errors only on production.

When developing and testing new features/bugfixes? Unless the bug is directly tied to production data I have no idea why you would have to use production data for that. I'm not saying don't do it on a staging server, but you don't develop on the staging server.

Right now I have a vagrant box where the VM image is on a DMG and all data is NFS mounted from the same DMG onto the VM, which is kind of the worst scenario I can think of. The testing database is around 2GB and the source+data files etc are ~200MB, just because I actually do need to fix a bug related to a portion of the production data. What's slow is the CPU, and that is still doing fine. It's not the disc even though I'm abusing it this way. That's on a 2011 macbook, 16GB, 400GB SSD.

HN is a small piece of software which should be easy to write tests for, 2GB db and 200MB source/data-files should be more than you'd ever need to work on something like HN. If you want to stress-test, test fs speeds etc you cannot do that on Mac OS X anyways since you're not running Mac OS X in production.

Changing the specs of a piece of software in order to make development more easy seems totally backwards. You're developing for production, not the other way around.

And at last, why not simply add a case-sensitive partition to your Mac if speed is such a big problem?

Any reason why you're not using Case Sensitive HFS+? I realize most people aren't so this is a nifty change but that seems nicer than using DMGs.

The only problem I ever ran into with it was needing to rename the internal file structure of Adobe apps since they were not particularly careful about matching file name case in code and configuration files. There were scripts to fix their apps. I hope the newer CS editions don't have that problem still...

Steam is also completely broken on case sensitive filesystems. Currently I run a case sensitive partition and store my source and /usr/local directories there. That way all my development is sane, and the rest of the OS can be stupidly insensitive all it wants.

> Steam is also completely broken on case sensitive filesystems.

Except, y'know, Steam for Linux.

We're talking about Macs and, last time I checked, Steam won't even install on a case sensitive filesystem on the Mac (or Windows). They have to on Linux—there's no other choice.

Suppose they want to change platforms in the future or release the codebase for others to use. It was apparently an easy enough change to implement now, so why not?

Premature optimization.

This is the exact opposite of premature optimization. They have a mature product that works. We're using it right now. And now they're future proofing it a bit.

And the question for this answer: https://news.ycombinator.com/item?id=7872121

> and is why the site went down on June 18th.

Do you mean the site will go down on June 18th?

YC's time machine startup will be in stealth mode.

I love how hn doesn't load 150 circus elements. I can reload pages all day long.

I was just thinking.....

  function byId(id) {
    return document.getElementById(id);
  }
  // hide arrows
  byId('up_'   + item).style.visibility = 'hidden';
  byId('down_' + item).style.visibility = 'hidden';
could be:

  function hide(id) {
    document.getElementById(id).style.visibility = 'hidden';
  }
  // hide arrows
  hide('up_'   + item);
  hide('down_' + item);
Then it is 150 instead of 185 chars, 19% smaller or 23% bigger.

And the function name is better of course.

Thanks, I got so much work done today!

And, HN becomes responsive. ✧\ ٩( 'ω' )و //✧

Responsive as in "responsive design"? Hardly. The only thing that uses media queries is the vote arrow.

Before responsive design came around, the word "responsive" actually meant a server would respond quickly to actions.

Words change with time. When you say 'responsive' in a web setting it's nearly always in regards to responsive web design - not 'this page is responsive now, returning results quickly!'.

bshimmin shouldn't have been downvoted.

No, responsive as in "responsive". (reacting quickly).


However, the parent should NOT be downvoted, people.

Despite total "ignorance" ... of a 2010 article and its aftermath. If you can call that ignorance. [1]

[1] http://en.wikipedia.org/wiki/Responsive_web_design

To the parent: the "Responsive Web Design" movement picked a really stupid name, but it's legitimate terminology and I guess the rest of us - and I say this with bitterness - have to find some other word for talking about a server that is responsive, if we want to avoid confusion.

For the moment it's legitimately confusing on the part of everyone who uses the word responsive the way the parent does - basically, they took one of the clearest and most positive adjectives you can say about a site, with no downsides for the user whatsoever, and misappropriated it to refer to something completely orthogonal and unrelated (in that it has nothing to do with render speed or anything else that the word meant prior to 2010.)


I suppose "responsive design" was so called because the page layout, and elements within it, "respond" to changes in the device's screensize, rather than remaining static (sites with simple floated layouts and no fixed widths, were always, to an extent, "responsive", but no one used that term before 2010); "design" obviously because it is a conscious decision on the part of the designer to create different layouts for different breakpoints. I agree that it's a stupid name, though perhaps not orthogonal to its intended meaning.

If a client says to me "I would like the site to be responsive" now, in 2014, I would consider what I know of the client before assuming which they mean of "reacts quickly" and "has media queries" - and I would always seek to clarify regardless.

(I don't care in the slightest about the down-votes, but thank you for your concern!)

Just curious: why are user names case sensitive?

There are 114 accounts that share the same lowercase representation.

Pardon if I'm ignorant here, but is there a blocker to open-sourcing HN? I'm sure the community would love to help.

It's definitely snappier and now looks responsive too! Thanks for the hard work kogir et al!

I bet some before/after load charts would be pretty impressive!

Can we classify this as "archeology"?

That's exactly how i would have changed things to make the website faster!!! -4 jan 1991-
