Rockstor, a Linux and BTRFS Based NAS Solution (rockstor.com)
71 points by schakrava on Sept 26, 2014 | 49 comments

The only data protection options I could find were RAID 1 and RAID 10 (raid 0 is a performance option), and since the chance of data loss when attempting to re-silver a 3TB mirror is 1 in 5, data protection here is not enterprise quality yet.

The UI stuff is great, but the tricky bit about building a storage system is not provisioning it, or getting the access protocols right, it is all about finding all the ways that data can be destroyed (both silently and noisily) and guarding against them. So if you want to stick with the Enterprise target, then you need something like the ZFS On Linux page which describes every way you can get data zapped and how you will prevent that from happening.

If you want to be just an off the shelf "hey here's something that will make your access point into something like a NAS device" product, then you get to lose data when a disk goes bad, or a memory chip goes bad, or a network cable is loose, or the power supply cuts out, or the cat knocks it off the table, etc.

Thank you, I couldn't agree more. It's a tall order and we have our work cut out for us.

I've started to do some DR testing myself, but it will take a little while to publish our findings and recommendations.

Where did you get your hilarious "data loss on attempting to re-silver a 3TB mirror is 1 in 5" statistic from?

The non-recoverable bit error rate spec.

NetApp tracks it with their Nearstore product line which used SATA drives in a NAS box (they have been for a while actually, when I left they had data on about 65 million drive hours) and while Seagate quotes it as 1 in 10^15 bits, it's actually closer to 5 in 10^15 bits. A 3TB drive has 3x10^13 bits of data (closer to 3x10^14 when you account for track markers and error recovery bits).

If you're bored some time try reading every sector from one of these drives. To maximize your chance of success make sure you operate the drive at a slightly warm temperature (keeps the lubricant from sticking) and isolate it from vibration. It's worse if you read it randomly (you will get some arm servo movement just because the drive will have replaced some blocks from spares, but minimizing it also keeps vibrations down.)

Long before it became an issue on single drives, like it is today, it was an issue when trying to reconstruct a RAID4 (or 5) group that was 3.5TB (which at the time was a 7 disk raid group of .5T drives). 14 disk groups (a full shelf) were pretty much guaranteed to see a second error in the shelf during reconstruction. Which was also why RAID6 or dual-parity RAID became a must have enterprise feature back in 2005 or thereabouts.

On an interesting side note, because the chance of hitting an unrecoverable read error is evenly distributed through a drive, 3X replication is still recoverable even with intermittent read failures. There isn't really a RAID number for that but it does work reasonably well and avoids a pesky parity calculation if you embed check data in your blocks as they do in GFS.
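A minimal sketch of that idea, assuming CRC32 as the embedded check data and in-memory byte strings standing in for the three replicas. The names `seal`, `unseal`, and `read_replicated` are hypothetical, and this is not a description of how GFS actually lays out its checksums:

```python
import zlib


def seal(payload: bytes) -> bytes:
    """Prefix the payload with its CRC32 so each block can be validated on its own."""
    return zlib.crc32(payload).to_bytes(4, "big") + payload


def unseal(block):
    """Return the payload if the embedded CRC32 matches, otherwise None."""
    if block is None or len(block) < 4:
        return None  # treat an unreadable or truncated block as a failed read
    stored = int.from_bytes(block[:4], "big")
    payload = block[4:]
    return payload if zlib.crc32(payload) == stored else None


def read_replicated(replicas):
    """Try each replica in turn and return the first payload that validates."""
    for block in replicas:
        payload = unseal(block)
        if payload is not None:
            return payload
    raise IOError("all replicas failed validation")


good = seal(b"hello")

# Simulate one replica with a flipped bit and one that could not be read at all.
damaged = bytearray(good)
damaged[-1] ^= 0x01
replicas = [bytes(damaged), None, good]

print(read_replicated(replicas))
```

Because every block carries its own check value, a reader can tell which copy is good without a parity calculation across disks; it just moves on to the next replica.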

[1] https://www.usenix.org/legacy/publications/library/proceedin... -- Peter Corbett's paper (he is the guy who invented NetApp's dual parity system), and from that paper the following --

"Disks protect against media errors by relocating bad blocks, and by undergoing elaborate retry sequences to try to extract data from a sector that is difficult to read [10]. Despite these precautions, the typical media error rate in disks is specified by the manufacturers as one bit error per 10^14 to 10^15 bits read, which corresponds approximately to one uncorrectable error per 10TBytes to 100TBytes transferred. The actual rate depends on the disk construction. There is both a static and a dynamic aspect to this rate. It represents the rate at which unreadable sectors might be encountered during normal read activity. Sectors degrade over time, from a writable and readable state to an unreadable state."

And experience from the field puts it at about one error per 15TB transferred, so 3TB into 15TB, one in five.

3TB is 310^12 bytes assuming the decimal bytes used in the storage industry. The uncorrectable bit error rate is for the raw block storage. It does not include the low level formatting, which is no more than 20% of the storage on 512-byte sector drives and less than 10% on advanced format drives. The probability of an uncorrectable bit error when copying 3TB (using decimal bytes) is approximately 1.5% under the assumption of a 5 in 10^15 uncorrectable bit error rate:

[1 - (1 - 5 * 10^-15)^(3 * 10^12)] ~ 0.01488...

If your 20% figure is accurate, the actual uncorrectable bit error rate would need to be something like 7 in 10^14. I am not disputing your empirical information, but your numbers do not agree with it. The difference between what your numbers say and what you say is only about 1 order of magnitude. Doing statistical calculations with better records could allow the cause of that to be identified.

And to be clear, it is a bit error rate not a byte error rate. Nominal coding of data in magnetic media is 10 bits per 8 bit byte although a specific drive may use a different encoding on the platter. The Barracuda included 5120 NRZ encoded bits per sector and a 48 bit NRZ encoded checkword giving it a nominal 10.094 bits per byte. You're off by one decimal order of magnitude in the number of bits.
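To compare the two readings of the arithmetic, here is a small sketch that evaluates 1 - (1 - rate)^n for both exponents used in this thread. It assumes errors are independent and occur at a constant per-bit rate, which is a simplification, and the function names are made up for illustration:

```python
import math


def p_at_least_one_error(units_read: float, error_rate: float) -> float:
    """P(at least one error) = 1 - (1 - rate)^n, computed stably for tiny rates."""
    return -math.expm1(units_read * math.log1p(-error_rate))


def rate_for_probability(units_read: float, probability: float) -> float:
    """Invert the formula: the per-unit error rate that gives the stated probability."""
    return -math.expm1(math.log1p(-probability) / units_read)


RATE = 5e-15  # the "5 in 10^15" field figure quoted in the thread

# Exponent of 3 * 10^12, as in the posted calculation.
p_small = p_at_least_one_error(3e12, RATE)

# Exponent of 3 * 10^13, the bit count quoted earlier for a 3TB drive.
p_large = p_at_least_one_error(3e13, RATE)

# Error rate that would be needed to reach 20% with the 3 * 10^12 exponent.
needed = rate_for_probability(3e12, 0.20)

print(f"n = 3e12: {p_small:.4%}")
print(f"n = 3e13: {p_large:.4%}")
print(f"rate needed for 20% at n = 3e12: {needed:.2e}")
```

Counting bits rather than bytes moves the result by roughly the order of magnitude being argued about here.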

Just to be clear, I meant 3 * 10^12, not 310^12. The arithmetic that I posted uses the correct number.

To avoid markdown, either use reverse-slashes to escape your asterisks in paragraphs, or surround them with spaces, or put four spaces to the left of short lines that have "special" characters.

So every time you do a zfs scrub on a large pool (many TB) you should see errors that are detected and corrected.

But you don't...

> raid 0 is a performance option

Once I won a bet that RAID 1 was actually faster than RAID 0 in a given scenario.

My very first questions regarding a potential storage solution revolve around data loss:

    1. Can we enumerate the data loss scenarios?
    2. How is drive failure handled?
    3. How may data be corrupted and such corruption detected?
    4. For every data loss scenario, what is the recovery procedure?
Here is all I could find: http://rockstor.com/docs/faq.html#how-do-i-prevent-data-loss...

Of course, there is a wealth of information on such questions for standard RAID, but I would suggest for marketing purposes that rockstor synthesize available information (from the many relevant layers of data management) in a coherent fashion, specific to their product. It doesn't have to be deep, but it should be at least minimally comprehensive and broad, with pointers to more detailed, layer-specific information.

Also, it's fine if the recovery scenario is "restore from backup" for e.g. the scenario where data is deleted by an authorized user. If so, there should be at least a minimal "backup story".

That is great feedback for us. I've added a documentation issue with your feedback: https://github.com/rockstor/rockstor-doc/issues/37

We have added appliance <-> appliance replication recently, which can play an important role in recovering from bigger disasters.

We'll have all that documented. Please feel free to participate on our github.

the gui looks pretty cool. personally i would not trust btrfs for a nas. i have not had the best experience running various production servers with btrfs. i switched (back) to zfs and never looked back, it's just better in every regard.

i also administer a freenas box for a small business and this stuff is rock solid, i only wish there were an _easy_ way to get the permission stuff right in a multi user setting.

nonetheless, thumbs up for creating this, cool stuff!

I'm in the process of rolling out btrfs on a lot of production servers (no raid, just subvolumes and compression) using Ubuntu 14.04 - what problems did you encounter with btrfs?

I've hit problems like reaching ENOSPC (even though the data extents were only 70% full) on a colocated server, and there isn't enough free space to run a balance operation to get more free space. (The docs literally suggest inserting a USB stick and adding it to your array to help make the balance work..)

Also, the fsck tool is still very immature. It takes many years to get good at detecting and recovering from corruption.

had that same problem and additionally performance problems on ssds with fast writes (postgresql) even when turning off the COW

If you don't already, I strongly recommend lurking on the btrfs mailing list. There are regular fixes to balancing, ENOSPC, send/receive and the btrfs-progs tools; occasional questions and fixes related to the compression code.

Be prepared to update your kernels and tools often and independent of your vendor. Btrfs-progs will likely need to come from the git repo, so building your own packages for distribution around your production nodes will probably be necessary too.

A word of caution: do not run btrfsck without consulting the wiki and mailing list first, and hopefully knowing exactly what you are doing. There are situations you'll encounter which do not require btrfsck to repair (but rather, other tools instead), and it will potentially make a recovery less likely.

FWIW, I have been watching the list for years, and reading regularly for about 6 months trying to get a sense of stability with respect to the features I want.

I would not put btrfs in production yet. Though, likely soon.. I'd guess another year or so.

Oh my god. The debate was between ZFS and btrfs and although I favored ZFS, the extra kernel module and the upcoming support in distros led to the decision for btrfs. However we won't do anything fancy with it. Basically just using the whole disk for a distributed filesystem without snapshots and we use btrfs because of checksumming and scrubbing weekly/monthly to detect corrupt disks and data and maybe compression with lzo and subvolumes. As far as I understood this should be safe?

New kernels should be no problem as Ubuntu will likely provide an HWE stack in the future and btrfs-tools is inside a well maintained ppa...

Damn, I should have pushed ZoL through.

I wouldn't use ZoL either -- I read that mailing list for quite awhile too, and skimmed most of the issues on GitHub. As of about six months ago, lockups were too frequent for my taste. All the implementations are improving though and the OpenZFS movement is promising. A caution here too: if you use ZFS, all implementations are not equal, you'll need to research the specifics for each platform on which you intend to use it; and the compatibility [with other implementations] if you want to move the file system [to a different platform]. If I was rolling out ZFS, I'd only use it on Illumos/OpenIndiana (vs., say ZoL).

I have been waiting and watching for a long time for most of these "new" filesystem features (pools, fs-level RAID, checksums, send/receive), but I am a "filesystem conservative" (especially in production; less so on my own machines) -- I'll keep waiting awhile longer. On production Linux today, I stay with EXT4 or XFS.

Thanks. We believe it's only a matter of time before btrfs will be trusted enough.

Can you elaborate on the permission-stuff issue you have? We'd love to get it right in Rockstor.

basically you need to integrate an ldap server with a decent frontend.

> better in every regard

Can't remove raidz's from zpools, but `btrfs device delete` exists.

But it lacks raid5/6 and even N-way mirrors. ZFS is not perfect, but btrfs is not even close in terms of features.

Btrfs does support raid5/6, I'm using it right now. It is still being refined and has a couple rough edges, but I haven't had any problems in the year or so I've been using it. It is not "production ready" yet for sure, but the support is there.

Everything I've read (status link from the official wiki: http://marc.merlins.org/perso/btrfs/post_2014-03-23_Btrfs-Ra...) says don't touch raid5/6 yet. Maybe the pages are overly pessimistic but lack of recovery features sounds like a no-go for me?

in an enterprise setting you rarely want to remove devices/space.

Rarely is not never.

That's not a bug.

Interesting =).

I'm currently running Freenas with ZFS.

Would be curious to see how this compares.

The one thing missing for me on FreeNAS is some kind of file search/indexing feature.

I wonder if the fact that this is Linux based will make adding something like that easier.

Perhaps. I have some ideas about search features. We can also get some cool stats efficiently from btrfs trees. But I'd love to hear your thoughts. Is it possible for you to give more input on the search/indexing that you wish to see? You can even write to us directly -- support@rockstor.com or file an issue on github: https://github.com/rockstor/rockstor-core

I've just filed a support ticket here:


Hmm, stats - don't know much about this topic, but I'd be keen to hear more about what's possible.

On the file indexing front, I think Recoll and Tracker/MetaTracker are the two most active projects - Recoll being the more active one. Strigi and Beagle are both discontinued.

All three of the server hardware suggestions are discontinued.

Anyone have suggestions for better servers? I wonder if Rockstor would work well with the backblaze case. Maybe some of the OCP cases would work. Anyone played with those?

I wish I knew first hand how Rockstor would work with backblaze. But 45drives can ship them with CentOS which is what Rockstor is based on.

I've had the opportunity to install Rockstor on various hp gen7 and gen8 servers and had no problems.

I witnessed Rockstor install just fine on an old Isilon node and was told that the performance was quite good -- sorry I have no specifics.

Demo page gives me an error message. Sends me to

Yes, it's a simple redirect. that's where the demo is hosted for now.

This looks pretty cool. Easier to use and nice gui.

Good stuff guys!

no afp support?

Since Apple has supported SMB for a long time, and actually made it the default protocol in 10.9, is there much need for AFP?

> Since Apple has supported SMB for a long time, and actually made it the default protocol in 10.9, is there much need for AFP?

Time Machine backups still require afp I believe -- unless you use the "TMShowUnsupportedNetworkVolumes" option.

Performance of SMB on Mac is only about half of AFP/NFS, and NFS is more complex to manage from an authorization/user management point of view in a Mac environment.

I'm running AFP on FreeNAS. I also have SMB setup.

I'm using OSX 10.9.4, and I've seen better performance over AFP than with SMB.

So yes, it'd be nice to have AFP support.

AFP is generally faster than SMB and SMB2, but SMB3 should be faster than AFP. YMMV of course.

No, but does this suggestion from our blog work for you?


Guys, you might want to remove that RSA Private Key. https://github.com/rockstor/rockstor-core/tree/master/certs

Thank you, we appreciate your issue submission on github. We'll fix this right away.

It's still there.

Thanks for your concern, but we don't see a point in just removing it in git because it doesn't really help. The key is in several branches, in our iso file, and in every rockstor rpm in our yum repo, not to mention the many users who have already downloaded rockstor.

We changed the key in our live demo, but for our users we'll roll out the fix in the next update. As part of that fix, we'll also remove the key file from git.

I think that's a reasonable plan. Hope I am not missing something.
