
Tell HN: Server Status - kogir
HN went down for nearly all of Monday the 6th. I suspected failing hardware.

I configured a new machine that is nearly identical to the old one, but using
ZFS instead of UFS. This machine can tolerate the loss of up to two disks. I
switched over to it early morning on the 16th, around 1AM PST.

Performance wasn't great. Timeouts were pretty frequent. I looked into it
quickly, couldn't see anything obvious, and decided to sleep on it. I switched
back to the old server, expecting to call it a night.

Then the old server went down. Again. The filesystem was corrupted. Again. So
I switched back to the new server. During this switch some data was lost, but
hopefully no more than an hour.

And here we are. I'm sorry that performance is poor, but we're up. I'll work
to speed things up as soon as I can, and I'll provide a better write-up once
things are over. I'm also really sorry for the data loss, both on the 6th and
today.
======
barrkel
By tolerating the loss of two disks, do you mean raidz2 or do you mean 3-way
mirror?

Raidz2 is not fast. In fact, it is slow. Also, it is less reliable than a
two-way mirror in most configurations, because recovering from a disk loss
requires reading the entirety of every other disk, whereas recovering from
loss in a mirror requires reading the entirety of one disk. The multiplication
of the probabilities doesn't work out particularly well as you scale up in
disk count (even taking into account that raidz2 tolerates a disk failure
mid-recovery). And mirroring is much faster, since it can distribute seeks
across multiple disks, something raidz2 cannot do. Raidz2 essentially
synchronizes the spindles on all disks.

Raidz2 is more or less suitable for archival-style storage where you can't
afford the space loss from mirroring. For example, I have an 11 disk raidz2
array in my home NAS, spread across two separate PCIe x8 8-port 6Gbps SAS/SATA
cards, and don't usually see read or write speeds for files[1] exceeding
200MB/sec. The drives individually are capable of over 100MB/sec - in a non-
raidz2 setup, I'd be potentially seeing over 1GB/sec on reads of large
contiguous files.

Personally I'm going to move to multiple 4-disk raid10 vdevs. I can afford the
space loss, and the performance characteristics are much better.

[1] Scrub speeds are higher, but not really relevant to FS performance.
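
For concreteness, the two layouts look roughly like this as zpool commands (a
sketch only; the pool and device names are made up):

    # 6-disk raidz2: capacity of 4 disks, any two can fail,
    # but random IOPS roughly that of a single disk
    zpool create tank raidz2 da0 da1 da2 da3 da4 da5

    # striped mirrors ("raid10"): capacity of 3 disks, much better
    # seek distribution, survives one failure per mirror pair
    zpool create tank mirror da0 da1 mirror da2 da3 mirror da4 da5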

~~~
shiftpgdn
Why not RAID 60 with btrfs? It'll tolerate the loss of two disks, with
proactive parity protection via btrfs, and it'll be faster and give you more
disk space.

~~~
rbanffy
They are running BSD. I don't think BtrFS is an option and ZFS can do pretty
much every trick BtrFS can.

The data corruption on the first machine seems like a hardware problem.

~~~
KaiserPro
ZFS has built-in RAID and much, much friendlier tools.

The btrfs tools are like mdadm and lilo had a child.

------
makmanalp
The trend I'm noticing is people mentioning that if only HN was moved to
<insert-cloud-provider>, problems would go away.

Instead of doing that, they probably dropped a bit more than a thousand
dollars on a box, and are probably saving thousands in costs per year. This is
money coming out of someone's pocket.

This site is here, and it's a charity, provided to you free of cost. Who
cares if HN is down for a few hours? Seriously? Has anyone been hurt because
of this yet?

~~~
catinsocks
HN is not a charity, it is a marketing platform for YC with some community
aspects.

There is a very strong bias to everything YC.

The HN community has also outgrown the software HN was built on. You can see
this in threads like
[https://news.ycombinator.com/item?id=7051091](https://news.ycombinator.com/item?id=7051091)

but even that thread is an extreme example; many front-page items that gain
traction are hard to go through because of things like the lack of foldable
comments. Another thing that is extremely noticeable is expiring links, which
pg has said he doesn't think are important enough to fix. There are many small
UI issues that won't be fixed for the community.

~~~
Harj
_HN is not a charity, it is a marketing platform for YC with some community
aspects._

Feels more like a community with some YC marketing aspects to me.

~~~
OafTobark
I disagree. It feels very YC driven.

------
cincinnatus
I'm sure it has been asked many times before, but I'd love to hear the latest
thinking... Why in 2013 is HN still running on bespoke hardware and software?
If a startup came to you with this sort of legacy thinking you'd laugh them
out of the room.

~~~
matthewmacleod
_this sort of legacy thinking_

That's the kind of facile statement that makes people riotously mock the
entire startup community; it's like "MongoDB is webscale", but even less
valid.

Cloud services are not a panacea, and there are myriad situations in which
running one's own infrastructure can be a good idea. What matters is that the
issues and benefits are taken into account; if one can show research
demonstrating that a custom infrastructure is cheaper, or more reliable, or
less prone to legal issues, for example, then there's nothing to laugh at.

And remember that PaaS in particular can cost a buttload of money - I'm
certain it's contributed to the downfall of more than one otherwise promising
startup.

~~~
eloff
The advantage of cloud servers is that if one experiences corruption or goes
down, you just kill it and start a new one. If your cloudy EBS equivalent
experiences corruption, you restore from a snapshot and off you go again.
Either way it involves less downtime than HN seems to have. The downside is
that it costs more (usually; it depends on how high your server management
and data center overheads are). I'd like to point out that despite several
high-profile downtime incidents in AWS, I'm not aware of a single case where
you couldn't just restore from your last snapshot to another availability
zone or region.

~~~
shiftpgdn
You realize cloud services are vulnerable to data loss as well?[1] The cloud
isn't some magic machine off in a datacenter somewhere. It's a bunch of
servers and SANs, just like what you or I would roll out if we needed
bare-metal infrastructure. The only difference is the extraordinary markup
you're paying Amazon to use their servers.

[1][http://blogs.computerworld.com/18198/oops_amazon_web_service...](http://blogs.computerworld.com/18198/oops_amazon_web_services_ec2_cloud_lost_data)

~~~
eloff
Yes, obviously. That's why I said you should restore from backups in that
event. If you lose your backups on S3, congratulations, you had better odds
of winning your state lottery. The big difference is not just the price, as
you say, but the flexibility.

------
whalesalad
There's a lot of tuning that can be done on a ZFS setup to improve
performance. I'm not a pro, so others will have more feedback and knowledge,
but some things off the top of my head to get you started:

Add a flash-memory-based (SSD) ZIL or L2ARC, or both, to the box. That'll
help improve read/write performance. I believe the ZIL (ZFS intent log) is
used to cache during writes, and the L2ARC is used during reads.
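
For example, something like this (a sketch only, assuming a pool named tank
and two spare SSD partitions; the device names are made up):

    # dedicated log device (slog) to absorb synchronous writes
    zpool add tank log gpt/ssd-slog

    # L2ARC read cache
    zpool add tank cache gpt/ssd-l2arc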

You might want to look into disabling atime, so that the pool isn't wasting
energy keeping access times on files up to date. Not sure if this is relevant
with the architecture of HN or not. This can be done with

    
    
        zfs set atime=off srv/ycombinator
    

Finally, ZFS needs a LOT of memory to be a happy camper. Like 3-5GB of RAM per
TB of storage.

I actually think you'll probably have a lot of fun with ZFS tuning, if that's
the problem with news.yc. FreeBSD's page is pretty detailed:
[https://wiki.freebsd.org/ZFSTuningGuide](https://wiki.freebsd.org/ZFSTuningGuide)

~~~
stock_toaster
> I believe the ZIL (ZFS intent log) is used to cache during writes, and the
> L2ARC is used during reads.

I think the ZIL (zfs intent log) is an intermediary for synchronous writes
only. My understanding is that it effectively turns the sync write into an
async write (from the standpoint of the zpool) -- this is why it requires a
faster device than the pool it is used with. If it is absent, the pool itself
houses the ZIL.
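
To illustrate that this only matters for synchronous writes: the behaviour
can be inspected and overridden per dataset (a sketch; the dataset name is
made up):

    zfs get sync tank/news            # "standard" by default
    zfs set sync=disabled tank/news   # bypass the ZIL entirely; risks losing
                                      # the last few seconds of writes on a crash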

------
hartator
Not really related but any update on releasing the HN code again?

[the current release is pretty old:
[https://github.com/wting/hackernews](https://github.com/wting/hackernews)]

------
JayNeely
Being the sysadmin on a site frequented by sysadmins has to be frustrating at
times.

Thanks for all you do!

------
erkkie
This reminds me: I'm still looking for a (PKI?-)encrypted ZFS snapshot
backup service, /wink-wink @anyone

Hoping the box has ECC RAM; otherwise ZFS, too, can be unreliable
([http://research.cs.wisc.edu/adsl/Publications/zfs-corruption...](http://research.cs.wisc.edu/adsl/Publications/zfs-corruption-fast10.pdf))

~~~
lucb1e
Tarsnap?

~~~
erkkie
Could possibly work if wrapped around zfs send; something plug-n-play would
be nice though.
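
Something along these lines, perhaps (a rough sketch; the dataset, snapshot,
GPG recipient, and backup host are all made up):

    zfs snapshot tank/news@2014-01-16
    zfs send tank/news@2014-01-16 \
        | gpg --encrypt -r backup@example.com \
        | ssh backup-host 'cat > news-2014-01-16.zfs.gpg'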

------
shawn-butler
Using DTrace to profile ZFS:

[http://dtrace.org/blogs/brendan/files/2011/02/DTrace_Chapter...](http://dtrace.org/blogs/brendan/files/2011/02/DTrace_Chapter_5_File_Systems.pdf)

I'm sure other, more experienced DTrace users can offer tips, but I remember
reading this book and learning a lot. And I believe all the referenced
scripts were open source and available.
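
As a trivial starting point, something like this counts ZFS reads and writes
by process (a sketch; fbt probe names can differ between releases):

    dtrace -n 'fbt::zfs_read:entry,fbt::zfs_write:entry { @[execname, probefunc] = count(); }'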

------
ishener
May I ask where the machines are hosted? Is that on AWS? If not, why don't
you move to more reliable hosting, like AWS?

~~~
ishener
From the downvotes I gather I may not ask...

~~~
oneeyedpigeon
Probably due to the fact that you levelled an unqualified, unsubstantiated
claim that AWS is better than any alternative.

~~~
viraptor
That's not what he said. There's nothing about AWS being better than
alternatives to AWS.

> "a more reliable hosting, like AWS"

------
Goladus
I've been reading this site regularly for almost 7 years. 6-Jan-2014 is the
only downtime I remember, and it was really a very minor inconvenience. Sucks
about the data loss though, always hard to own that when doing system
administration. Thanks for the explanation.

~~~
elwell
It's not the only one I remember. It seems like it's down for several hours
every couple months.

------
conorh
Have you thought about perhaps open-sourcing the server setup scripts for
HN? I'd love to help with the configuration (and I'm sure many others here
would too). Perhaps a GitHub repo with some Chef recipes that people could
work on, given the current servers?

------
nmc
Thanks for the info!

Out of curiosity, do you have an idea about the source of the corruption
problems?

~~~
hartator
The OP is saying "the loss of up to two disks", maybe a hard drive failure?

~~~
Kudos
That's on the new server.

------
rrpadhy
I am curious to know the server configuration, architecture and the number of
hits it is getting.

If someone does offer a new software architecture, and hosting, would people
be open to moving Hacker News there?

------
avifreedman
Assuming the disk footprint is small...

Would recommend a new SSD-based ZFS box (Samsung 840 Pros have been great
even for pretty write-intensive loads), with raidz3 for protection and zfs
send, and/or rsync from hourly/N-minute snapshots for data protection, which
should avoid copying filesystem metadata corruption (not sure whether zfs
send will).
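
A sketch of what the hourly-snapshot-plus-incremental-send approach could
look like (names are made up; an initial full send of @h0 to the backup pool
is assumed to have happened already):

    zfs snapshot tank/news@h1
    zfs send -i tank/news@h0 tank/news@h1 | ssh backup-host zfs recv backup/news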

Happy to provide and/or host such a box or two if helpful.

------
richardw
Thanks for the update. No worries, it's just a news message board and no
businesses are hurt when it's down. I quite enjoy seeing how these things are
solved and I'm sure all will be forgiven if you post a meaty post-mortem.

------
rincebrain
ZFS instead of UFS on what, an Illumos derivative, FBSD, or actual Oracle
Solaris?

~~~
pmarin
FreeBSD.

~~~
lewq
I'd love to show you how HybridCluster, which automates ZFS replication and
failover (FreeBSD + ZFS + jails), might be able to help. Relatedly, we've just
announced free non-commercial licences which would be perfect for HN:
[http://www.hybridcluster.com/blog/containers-distributed-sto...](http://www.hybridcluster.com/blog/containers-distributed-storage-future-now-free-hybridcluster-non-commercial-licenses/)

------
lsc
Are you bottlenecking on high iowait? Or something else?

Just one random bit to try... Obviously I have no insight into your system,
and I'm not saying I know more than you or anything, but I've lately been
seeing more situations where I had massive latency but reasonable throughput,
the disks mostly looked okay wrt. SMART, and I mostly just wanted to write
about it:

    [lsc@mcgrigor ~]$ sudo iostat -x /dev/sda /dev/sdb /dev/sdc /dev/sdd
    Linux 2.6.18-371.3.1.el5xen (mcgrigor.prgmr.com)    01/16/2014

    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
               0.00    0.00    0.05    0.02    0.00   99.93

    Device:  rrqm/s  wrqm/s    r/s    w/s   rsec/s  wsec/s avgrq-sz avgqu-sz  await  svctm  %util
    sda        0.70   75.11  35.66   1.38  4568.62  611.67   139.85     0.36   9.61   0.53   1.95
    sdb        0.46   75.10  35.62   1.39  4566.77  611.67   139.89     0.22   5.89   0.45   1.66
    sdc        0.80   75.14  35.63   1.35  4569.63  611.63   140.10     0.64  17.18   0.57   2.10
    sdd        0.46   75.09  35.62   1.40  4566.60  611.63   139.87     0.13   3.47   0.40   1.49

(This is a new server, built out of older disks, that appears to have the
problem. It's not so bad that I get significant iowait when idle, but if you
try to do anything, you are in a world of hurt.)

Check out the await value. Re-run the same command with a '1' after /dev/sdd
and it will repeat every second. If sdd consistently has a much worse await,
it is what is killing your RAID. Drop the drive from the raid. If performance
is better, replace the drive. If performance is worse (and with raidz2, it
should be worse once you've pulled a drive), the drive was fine.

(Of course you want to do the usual checks with SMART and the like before
this.)
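
On the ZFS side, dropping a suspect drive for that test would look roughly
like this (a sketch; the pool and device names are made up):

    zpool offline tank da3    # take the suspect disk out of service
    # ...re-test performance...
    zpool online tank da3     # put it back if it turns out it was fine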

The interesting part of this failure mode, as I've seen it, is that
/throughput/ isn't that much worse than healthy. You get reasonable speeds on
your dd tests, but latency makes the whole thing unusable.

------
lukasm
How about having the error page show the last static HN page? Most people
just need links.

~~~
codfrantic
I used [http://www.hckrnews.com/](http://www.hckrnews.com/) for that :-) But
I actually usually prefer reading the comments to the links ^_^

------
scurvy
Why on earth are you not using SSDs? The HN footprint can't be _that_ large.
The extra speed and reliability from a pair of SSDs has to far outweigh the
costs.

~~~
ivoras
I'd guesstimate that the READ load is served practically entirely from RAM
(file cache) and the WRITE load is non-critical enough that it's done
"eventually consistent" (e.g. synchronous_commit=off in PostgreSQL, or
fsync=off elsewhere) - or at least that's how I'd run it. YMMV.
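
Purely to illustrate the setting the parent mentions (whether HN uses
PostgreSQL at all is speculation here):

    # postgresql.conf
    synchronous_commit = off    # commits return before the WAL is flushed;
                                # a crash can lose the last few transactions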

------
jffry
Thanks for the writeup.

------
carsonreinke
Maybe you could provide details on the current configuration and
architecture, and some suggestions could be made on how to improve it. Just a
thought.

------
rdl
I still like hardware RAID because it's conceptually simple and nicely
isolated. Sometimes horrible things happen to it too, though.

I didn't realize HN had enough storage needs to require more than one drive.
I guess you could have 1+2 redundancy or something.

------
0xdeadbeefbabe
Don't worry about it. I visited Facebook for the first time in years when HN
went down. Is HN on Linux using ZFS, or on BSD?

------
smalu
The world would be a better place if software could exist without hardware.

~~~
ama729
It already does, just take your pen and some paper and you're set.

Oh, and don't forget the aspirin, you'll need it...

------
superice
Good that you posted this, but it came a little late. After the first series
of timeouts you could've posted an update so everybody knew what was going
on. But hey, thanks for the update; this clears up a lot.

~~~
whalesalad
This passive-aggressive attitude is going to get you nowhere in life.
Back-handed compliments are only cool in pre-teen TV shows.

------
waxzce
Hi, I'm the CEO of [http://www.clever-cloud.com/](http://www.clever-cloud.com/)
and I'll be happy to help you on this; ping me on Twitter: @waxzce

~~~
lucb1e
Can you prove that you are the CEO? I can't decide between this being a bad
prank by someone impersonating you, and you actually placing advertisements
for your own company here.

~~~
Gmo
I know he is; he has posted on HN several times before.

