Of course, on the other hand, there are the people touting it as the be-all and end-all, a solution to kill NetApp.
I guess what I'm wondering is: how did so many people get the idea that this article about a specific solution to a specific problem was actually some sort of general-purpose solution attacking all the big-name players? What am I missing?
A huge chunk of this article is about the hard drives and PSUs not being ready for an enterprise load, but I just don't see it. I bet a lot of these boxes sit idle a large chunk of the time. I have 10-year-old desktop hard drives running just fine in a file server, because it has a similar load: mostly idle, most of the time.
Backblaze did it. They did it when they decided to put this graph in the article:
It's an apples to oranges comparison.
That graph is simply comparing their solution to the closest commercial equivalents. It just so happens that those are all way off in price, because hardware manufacturers all want to design their hardware to provide the highest throughput.
(That said, our experience with the Sun X4500s hasn't been great from an I/O point of view.)
What are the labor costs for testing all their hardware (with 10 SATA controllers, no less)? What are the labor costs for assembling all those systems? What are the labor costs of developing the system architecture? What are the labor costs of creating their storage application, which reduces their need for in-box redundancy? And if you want to talk about Amazon, there's also data center rental, power, cooling, and ongoing administration.
In short, they're comparing the costs of their components to the costs of a ready-to-go solution from other vendors. Sounds like apples and oranges to me.
As for the missing costs that you cite, think this through. What do you really think the per-unit costs for assembly and testing are? Even if each unit required a couple of man-days, that would probably only add ~10% or so per unit. The other costs you cite, for the initial hardware and software engineering, are fixed costs: the same whether they are storing 1 PB or 1,000 PB. Also, if they did their job right, much of the per-unit testing cost should be mitigated by the overall systems design. The system management automation can run a test cycle on new nodes before promoting them to production use, and of course failures should be dealt with automatically.
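To put rough numbers on how the fixed costs wash out at scale, here's a back-of-envelope sketch (every dollar figure below is a made-up placeholder, not Backblaze's actual cost):

    # Hypothetical: fixed engineering cost amortized over the number of pods built.
    ENGINEERING_FIXED = 500_000      # one-time hardware/software design cost ($)
    PARTS_PER_POD = 8_000            # components for one pod ($)
    ASSEMBLY_PER_POD = 2 * 8 * 75    # two man-days of assembly/testing at $75/hr ($)

    for pods in (1, 10, 100, 1000):
        per_pod = PARTS_PER_POD + ASSEMBLY_PER_POD + ENGINEERING_FIXED / pods
        print(f"{pods:5d} pods -> ${per_pod:,.0f} each")
    # The engineering line item dominates at 1 pod and all but vanishes by 1,000.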
As far as I'm concerned, you'd have to write the custom software even if you used Sun's boxes. I wouldn't trust data to a single machine of anyone's design. Though that's probably just me.
The S3 comparison is totally invalid though, I agree.
Of course it's not. It's an IBM PC down inside. I bet it can still boot MS-DOS and run GW-BASIC. ;-)
PCs are not a nice architecture for servers: there are starvation points all over the system, from registers (AMD64 solves some of it) to memory to I/O. Sun could have based the Thumper on a more server-ish (SPARC?) design, with plenty of memory and I/O bandwidth, but then it would not run Windows and the ability to run Windows is a defining advantage in the high-volume server market.
If I were to design such a box from scratch, I would couple the disks close to the network interfaces over a dedicated bus so the CPU could just say something like "hey, disk 3, drop blocks 10239 through 10300 on buffer 12 of your network controller while I go on with header parts and prepare to send it off". While I am at it I would skip the PSUs and go with DC power and a small battery, Google-style.
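Purely as a hypothetical sketch of what such a command could look like (none of these names correspond to any real hardware interface):

    # Hypothetical descriptor handed to a disk/NIC pair sharing a dedicated bus,
    # so payload bytes never cross the CPU's memory bus.
    from dataclasses import dataclass

    @dataclass
    class DirectSendCmd:
        disk_id: int      # which spindle holds the data
        first_block: int  # e.g. 10239
        last_block: int   # e.g. 10300
        nic_buffer: int   # target buffer on the network controller
        # The CPU prepares headers in parallel; the disk streams the blocks
        # straight into the NIC buffer and raises an interrupt when done.

    cmd = DirectSendCmd(disk_id=3, first_block=10239, last_block=10300, nic_buffer=12)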
I fear I probably described something Thumper-ish and will be ridiculed by someone with lots of server design experience, but that's the life of a hardware engineer who moved over to software very early. And I welcome such criticism ;-)
    PCs are not a nice architecture for servers: there are starvation points all over the system, from registers (AMD64 solves some of it) to memory to I/O.
- AMD64 is a little better than vanilla 32-bit x86, but it still has few registers compared to the POWER or SPARC architectures. This forces more spills out to memory, which is bad. I suppose there is a point where it's pointless to add registers, and x86s do some convoluted stuff with shadow registers, so the picture is not really clear. Optimizing compilers should alleviate this too, but, as car builders say, there is no substitute for cubic inches.
- Still about processors: the least a multi-threaded CPU can do for you is to keep an execution context in-chip and avoid a context swap through memory. That saves a lot of memory bus time that can then be used by other parts of the system. AMD64s (and their Intel counterparts, AFAIK) max out at 2 threads/core. POWER and SPARC max out at 4 and 8 threads per core respectively (again, numbers off the top of my head).
- On PCs (defined here as "a computer that can run Windows"), there is little distributed intelligence. I've never seen a PC where the CPU, disk controller, and network interfaces could have the chat I described. Contrast that with the typical vintage mainframe design, where there are as many things going on in parallel as the designers could think of. As an example, there is an IBM disk drive in the Computer Museum where you can see two sets of heads/arms on opposing sides of the disk, effectively able to read/write different cylinders simultaneously. While I don't believe such machines are in current use, this illustrates how far a server designer is willing to go to beat a throughput record.
Maybe for the low-level tech, but certainly not for what it's used for. There are lots of online backup services out there. Some of them publicly make it known they use Amazon S3 for storage. I'm sure there are also companies that own their hardware, like Backblaze does, but didn't optimise their cost structure with custom builds.
They had a big splash a couple of years ago about how they were going to offer unlimited backup for all your mp3s in iTunes, all nicely and automatically synced, for about $10/month or something. They were going to do the storage on S3. They offered anyone who mentioned them on their blog a free year's service.
I had about 400G of mp3s at the time I think. I thought, "OK!"
Their service lasted about 2 days, as they had underestimated, seemingly by about an order of magnitude, how much data people actually had.
Then they went back to the drawing board and came up with some lame service where you buy the storage space on S3 yourself and then they sell you the ability to access it or something. Lucky they didn't go out of business.
Anyway, knowing the details of the kind of system Backblaze is using, and the price per gig it breaks down to for them, is evidence they're not overselling themselves, and gives me confidence that they'll still actually be there in a few years, unlike Bandwagon. I'm much more likely to use them because of that. Add in all the discussion they've generated, and it's been a very clever marketing event for them. Kudos!
I'm pretty sure that the Backblaze solution is actually cheaper and that they are cutting some of the right corners for their particular application scenario. But it remains an open question how much cheaper it ends up being.
I think their price comparison is a bit misleading, since it doesn't include replacement costs for the desktop-grade hard disks they use, and it doesn't include any of the costs incurred by the software/labor/maintenance required to make this a reliable backup solution.
I believe the data dispel this marketing myth.
He's concentrating on disk-to-board throughput, but since this is network-attached storage on a LAN using the motherboard's single Ethernet port, the amount of data read from or written to those drives per second will never exceed 1 Gbit/s anyway.
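The arithmetic, assuming ~120 MB/s sequential per drive and the pod's 45 drives:

    # GigE ceiling vs. aggregate disk bandwidth, back of the envelope.
    nic_mb_per_s = 1_000_000_000 / 8 / 1_000_000  # ~125 MB/s payload ceiling
    drive_mb_per_s = 120                          # one drive, sequential
    drives_per_pod = 45

    print(f"NIC ceiling:     {nic_mb_per_s:.0f} MB/s")
    print(f"Aggregate disks: {drive_mb_per_s * drives_per_pod} MB/s")
    # A single GigE port saturates at roughly one drive's worth of throughput.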
I agree with the reliability analysis though; Backblaze seems to be on fairly thin ice there.
Still, it seems to work for Backblaze. I'm taking it as read that they've lived through a bunch of drive failures by now, and they seem to still be in business.
I think that's why he says right at the beginning:
    This device is that cheap because it cuts several corners. That's okay for them. But for general purpose this creates problems. I want to share my concerns just to show you that you can't compare this to an X4540 device.
As to speed, nothing you do is going to get around a disk's latency, so talking about bandwidth is mostly a waste of time. Device bandwidth is rarely the limitation once you start to scale, because you can add devices to your network faster than you can build a better network.
They just got 'cheap'; 'reliable' comes at the expense of using multiples of these units, and 'fast' looks like it is going to be impossible without striping across multiple setups like this.
Interesting for an extremely narrow use case (long tail storage, think backup services, video sites, photo sites).
Outside of that it is maybe usable with some adaptation, but not 'as is'. Besides, it is not as though their machines are up for sale, and the figures they quote mostly leave out the R&D that has already been sunk; if you had to do this as a one-off, just cutting and folding that case would cost a pretty penny.
I'm fairly sure there is a market for a device that is a more robust version of this.
At a 50% or so premium over the one they showed, you could probably get a wider spectrum of uses and somewhat more robust construction.
I would have bought one three months ago; instead I had to settle for 20T or so in one 4U box, for a price a little over half of what they quoted in the article.
If you believed their spin, presumably you would buy from a proper enterprise company like IBM or EMC, not some failed workstation vendor.
That said, it's still really expensive if you're willing to go that extra mile and design your own system for your needs. And it's really not that hard.
How do you know when I'll need my backup?
Replication is what it is all about.
Right up to the point where they are sold to a database company that isn't interested in selling disks.
The nice thing about the 'blaze is that if Seagate decides to double their prices or move you to a new, expensive platform, you just buy Toshiba instead.
I expect Backblaze will learn a lot from building their own storage systems, and future versions may be much better, but this one cuts too many corners for me to feel confident it will have the kind of reliability I would recommend.
What amazes me most is the use of desktop-grade hard drives. Not because they are sloppily built, but because their performance requirements are so different from a server environment.
This makes for some nice reading: http://cacm.acm.org/magazines/2009/6/28493-hard-disk-drives-...
Me, I would add more cache at the same time. SATA tends to be really good for sequential access, but really, really bad for random access. Multiple sequential streams hitting the same disk start to look an awful lot like random access after a while.
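A crude service-time model shows the effect; the 8 ms seek and 120 MB/s transfer rate below are just illustrative numbers:

    # Interleaved sequential streams degrade toward random-access rates because
    # each switch between streams costs a seek before the next chunk transfers.
    SEEK_S = 0.008     # ~8 ms average seek (illustrative)
    RATE_BPS = 120e6   # 120 MB/s sequential transfer rate

    for chunk in (64 * 1024, 1024 * 1024, 16 * 1024 * 1024):
        service = SEEK_S + chunk / RATE_BPS  # seek + transfer time per chunk
        print(f"{chunk >> 10:6d} KiB chunks -> {chunk / service / 1e6:6.1f} MB/s delivered")
    # Bigger per-stream buffers (i.e., more cache) amortize the seeks away.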
No, it probably means that they get their reliability at a different level, such as simply storing the data on a number of machines. Systems like glusterfs make this completely transparent.
Agreed, but I do know a few situations where it is being used; the 'stack' has been stripped down to the minimum and tested to the hilt before deployment.
Still, I would not risk my company on it. I've been a big fan of their architecture from pretty much day 1, but so far they seem to be feature oriented and not stability oriented.
From the numbers he cited I got the impression that the drives they are using would fail 3+ times more often than enterprise drives (lower MTTF, and spec'd for lower temperatures). The cost of replacing the drives (even if they are half the price) would make the system uneconomical.
Also: because of their case design (non-hot-swappable) the cost of human replacement is higher too.
Thus they must have a low usage rate to have a low failure rate for their system to make business sense.
They're primarily write-only; stuff only gets read when a customer retrieves their data.
If they spin down the drives when the volume is not being accessed, the MTTF for a single drive goes up a lot, possibly beyond the point where it matters.
That, and the fact that the company seems to be struggling to get sales.
But you will have to manage the cycles for redundancy, lifetime and power savings. The software to do it must be really clever.
Yes, software on (relatively) unreliable hardware is exactly how Google has been able to grow to such an immense scale.
But a lot of his complaints are about reliability, and that's kind of a moot point since Backblaze (and anyone sane) stores every piece of data on at least two (and ideally 3 or more) separate boxes. I would argue that if you don't need performance (i.e., no huge database) but you do need a lot of space, you're better off just setting up two 'mirrored' boxes (so that when one goes down, the other takes over) than with almost any 'enterprise' solution.
He does have a point about ZFS though - I'm sure at that kind of scale eventually the RAID 5/6 write-hole is going to bite you in the ass.
He mentioned that one disk does 120 MB/s, so five of them is 600 MB/s. He then says that converts to 6 GB/s. Now, I'm not all that familiar with speeds, but I know 1000 MB is 1 GB, so shouldn't 600 MB/s be 0.6 GB/s?
Other than that, he raised a number of valid points. The Backblaze system is NOT general purpose. It's designed to take a bunch of data (backups) and hold on to it for as long as possible. In that situation you don't need high throughput, or even all that reliable hardware. Once the drives are full, their usage will drop, and as long as the RAID keeps the data secure and costs stay down, they're fine to keep replacing hardware.
One thing of note, and something I never saw in the Backblaze article: their service promises to hold your data for 50 years. Assuming one of their pods is filled, what is the estimated cost of keeping that data over the course of 50 years? I would assume they calculated that out, and that the numbers favored the consumer-grade parts, but I would really like to see them.
I hope that clears it up.
Please note that this is the guy's private blog; he definitely does not speak for Sun there, and labeling the article "Sun Engineer responds" makes it seem like the response is an official one when it very clearly isn't. The text is full of other mistakes as well.
bytes vs bits.
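Spelled out, on the assumption that the "6" was the SATA line rate in Gbit/s (SATA's 8b/10b encoding puts 10 bits on the wire per byte of payload):

    mb_per_s = 5 * 120                   # five drives at 120 MB/s each
    gbit_payload = mb_per_s * 8 / 1000   # 4.8 Gbit/s of actual data
    gbit_on_wire = mb_per_s * 10 / 1000  # 6.0 Gbit/s with 8b/10b encoding
    print(gbit_payload, gbit_on_wire)    # so "6" is Gbit/s on the wire, not GB/s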
Given the idea, buy 10 and place them at geographically diverse points; you may as well put them in the back room at branch offices or something. Who cares if one goes offline for a while? Hell, put them in countries with good consumer internet connectivity (Japan, Korea, Sweden, etc.) and just use that.
If you're going ghetto, go all the way!
When I'm buying data center space I look at cost per watt and ignore the space requirements. I've not yet hit a place that was willing to sell me more power than I could use in a rack. If the space is low/medium density, I fill it with 2U or 3U chassis (I'm currently using http://supermicro.com/products/chassis/3U/833/SC833S2-550.cf... because I got a good deal on a lot of them). If it looks like they will sell me more power than I can eat with those, I pop in a few of the Supermicro 2-in-1U boxes: http://www.supermicro.com/Aplus/system/1U/1021/AS-1021TM-T+....
This is part of why you see so many people leaving airspace between 1U servers. (Me, I use the 3U servers for 'air gaps', as they run really cool.)
This is less rackspace per TB, maybe more power.
Even if you need two of these for redundancy the cost savings is significant.
Managing the IT team and a growing data center installation for a medium size business was one of the most enlightening experiences I've had as a developer writing software targeted at "internet-scale".
Just sayin, at those prices, it takes a long time to amortize out a six figure appliance.
As long as that space is available. To guarantee business capacity you'll probably need to buy a large block of rackspace at one time. Of course, capacity availability varies based on the current local business climate, but that's something you have to plan for too.
If I wanted to pull a gig connection to another nearby data center, say because I couldn't get more at SVTIX, we are talking something like $500-$1000 a month.
Upwards of $1000, but note that GigE is a fairly small link for data center cross-connection and will still usually have a long lead time. Not only do you need to be able to guarantee sufficient capacity either at SVTIX or at the other colo, but you need to be prepared to wait for that capacity to be allocated for your use.
My model is such that it's not that big of a deal to just get an unconnected rack at a new data center and buy transit there. When I was blocked on new power at SVTIX, I slapped a few new servers in at rippleweb.com, which is in Sacramento. If I had a big, expensive SAN I couldn't replicate, well, then I'd have a different model. (Rippleweb has awesome prices on high-power 1U server co-location, but it's not as good as getting a full rack elsewhere.)
But yes, the observation that you pay more if you want your stuff provisioned right now is very correct. At prgmr.com, well, I wait for the deals, so I end up paying a lot less than I would otherwise. (but then, sometimes the 'we are out of space, come back later' message is up for longer than I'd like, as well. It's definitely a compromise.)
Obviously we're still comparing apples and oranges, so taking this much further would likely be silly. But they're still pretty impressive numbers.
This is why small companies, startups, and people new to the storage world hate working with storage vendors. I understand the economics, I get why they sell storage the way they do, and I like my 70% discount off list price, but it makes for a horrible user experience.
If I were a cleverer person, I'd apply an every-day low price model to "enterprise" storage.
Seriously, if you say you want six figures for your product, I'm walking. I don't have that kind of scratch, and I'm not going to waste your time or my time lowballing you. If you really mean $30K, well, I'd have to scrimp and save for a few months, but it's doable if the product really does solve more than $30K worth of problems for me.
What I'm more interested in is whether the lower MTBF of the cheap drives and home-brewed chassis ends up costing more per year due to higher failure rates. If a desktop drive costs $100 and fails three times a year, but a server drive costs $200 and fails once every 2 years, the initial cost savings is moot.
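Plugging in those (hypothetical) numbers:

    # Annualized drive replacement cost under the failure rates given above.
    desktop_cost, desktop_fails_per_yr = 100, 3.0
    server_cost, server_fails_per_yr = 200, 0.5  # once every 2 years

    print(f"desktop: ${desktop_cost * desktop_fails_per_yr:.0f}/yr")  # $300/yr
    print(f"server:  ${server_cost * server_fails_per_yr:.0f}/yr")    # $100/yr
    # At these rates the cheaper drive costs 3x more per year to keep running.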
Our experience was that 0906 was utterly unusable if you're using COMSTAR, with performance being mildly degraded for straight ZFS usage.