It's temporary.



Interesting. If I had experienced this, I'd probably pick a random cloud provider, recreate the entire service in their cloud, image and document it, and have that documentation ready if I ever experienced another fire drill like the one yesterday.

Far easier to spin up a few large VMs on AWS for a few hours while you fix an issue than provision identical backup dedicated servers in a colo somewhere. And you can potentially just throw money at the issue while you fix the core service.

¯\_(ツ)_/¯


Maybe you should pin this comment? I almost lost it in the shuffle.


I thought about it, but it's fun to leave rewards for people who read entire threads.


In this kind of thread, I usually do CTRL+F "dang" ...


I upvoted your top-level comment! But what happens if the comments on this topic get long enough for pagination to kick in? Very few people click the new "next" link for the linked list :).


The "next" link is such a godsend, and a nice little surprise when I discovered it. Love it (and not getting an obnoxious "what's new" dialog box)


TIL what the "next" link does. Thanks!


How will you switch back to the new server once it's ready without losing database records?


Based on the “great rebuild” [0] (was it about a decade ago?), my understanding is that the database is text files in the file system hierarchy arranged rather like a btree.

Comment threads and comments each have a unique item number assigned monotonically.

The file system has a directory structure something like:

  |-1000000
  | |-100000
  | |-200000
  | |-…
  | |-900000
  |-2000000
  | |-100000
  | |-200000
  | |-…
  | |-900000
  |-…
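
If that's roughly right, resolving an item id to its file would look something like the toy sketch below (the bucket sizes and the items/ prefix are my guesses, not anything from the actual code):

  # toy sketch only: bucket an item id into million / hundred-thousand directories
  item=31415926                                  # arbitrary example id
  top=$(( item / 1000000 * 1000000 ))            # -> 31000000
  sub=$(( item % 1000000 / 100000 * 100000 ))    # -> 400000
  echo "items/$top/$sub/$item"                   # items/31000000/400000/31415926
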
I imagine that the comment threads (like this one), while stored as text, are actually Arc code (or a dialect of it) that is parsed into a continuation for each user, to handle things like showdead, collapsed threads, and hellbans.

To go further out on a wobbly limb of out-of-my-ass speculation, I suspect all the database credentialing is vanilla Unix user and group permissions, because that is the simplest thing that might work and is at least as robust as any in-database credentialing system running on Unix would be.
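
In that picture, "credentialing" would be nothing fancier than the app running as its own Unix user that owns the data tree (user name and path invented purely for illustration):

  # purely illustrative: the app user owns the item tree, nobody else can read or write it
  chown -R news:news /var/hn/items
  chmod -R u=rwX,go= /var/hn/items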

Though simple direct file system IO is about as robust as reads and writes get, since there are no transaction semantics above the hardware layer, it is also worth considering that lost HN comments and stale reads don't have a significant business impact.

I mean, HN being down didn't result in millions of dollars per hour in lost revenue for YC. If it stayed offline for a month, though, there might be a significant impact to “goodwill”.

Anyway, just WAGing.

[0] Before the great rebuild, I think all the files were just in one big directory; one day there was suddenly an impractical quantity of them and site performance fell over a cliff.


You don't have to speculate - the Arc forum code is available at http://arclanguage.org.


That's long out of date, unfortunately.


I guess the same way we switched to this one?

I wrote more about data loss at https://news.ycombinator.com/item?id=32030407 in case that's of interest.


You switched to the new one while the old one was down... that's not the same as switching between two live systems. Though, perhaps, in this particular case, the procedure might be the same.


Ah, I see. If we have to bring HN down for a few minutes, or (more likely) put it into readonly mode for a bit, we can do that, especially if it makes the process simpler and/or less risky.


Lol missed this comment while writing mine.


Not a problem in a thread about replicas and redundancy :)


So there's this great joke about cache coherency.

I'm sure you'll get it, eventually.


He can choose to do it the easy way and just take an outage to move it, or possibly go into read-only mode while moving.

We tend to over-engineer things as if it's the end of the world to take a 10-minute outage… and end up causing longer ones because of the added complexity.


Last time I checked, nearly every database worth using on a busy site supported some sort of real-time replication and/or live migration.

If that doesn't work, there's always the backup plan: say the magic words "scheduled maintenance", service $database stop, rsync it over, and bring it back up. The sky will not fall if HN goes down for another couple of hours, especially if it's scheduled ahead. :)
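
In practice that's about four lines of shell, something like this (service name, paths, and hostname all invented here):

  # hypothetical fire-drill move: stop writes, copy the data, start on the new box
  service hn-forum stop                        # or flip the app into read-only mode
  rsync -a --delete /var/hn/ newbox:/var/hn/   # ship the data directory across
  ssh newbox service hn-forum start            # bring the site up on the new server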


You could even flush the DB to disk, take a ZFS snapshot, resume writes, and then rsync that snapshot to a remote system (or use zfs send/receive).
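
Roughly like this (pool/dataset names invented):

  # sketch: snapshot while writes are paused, then replicate the snapshot at leisure
  zfs snapshot tank/hn@pre-move
  # writes can resume now; the snapshot stays consistent
  zfs send tank/hn@pre-move | ssh newbox zfs receive backup/hn
  # ...or rsync the read-only copy under /tank/hn/.zfs/snapshot/pre-move instead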


I'd be curious to know if HN was on ZFS.



