"...since we were using low-end commodity storage that was by no means enterprise-quality."
If Facebook (and I believe it's similar with other big players like Google and Amazon) is not using "enterprise grade" hardware (because it doesn't make economic sense at scale), who is? And more importantly, why? Why do "enterprise grade" products even exist if the largest and most deep-pocketed corporations don't find them to be good value propositions?
Choosing between "enterprise" vs "consumer" products is a lot like choosing between getting a Dell desktop vs building your own - a choice many of us made during our early IT years. The former gives you a reliable and predictable product, but it might be slightly lagging in tech or come at a significant premium. The latter gives you full control over your specs, but you carry the burden of fixing it and making drivers work together, which requires serious technical skill and carries its own overhead.
Enterprise hardware aims to be predictable: large businesses have to minimise risk, so they value stability and clarity in performance, costs, and technical know-how. Enterprisey products are never cutting edge, best-of-breed, or efficient, but they are predictable, and that gives large businesses a reliable foundation for planning and executing IT projects over 3-5 years with known budgets, risks, support contracts, etc. This predictability matters for businesses that are not based on IT (for example, making frozen foods or assembling tractors).
Facebook/Google/Amazon and other software-centered businesses depend on agility and finely-tuned machines - they need their "custom-spec" equivalent, and they have the skills required to manage it. But many other companies have known needs and are happy with standard enterprise products - they are the equivalent of an office worker that needs a machine to run MS Office, and a standard machine from Dell/IBM/HP is a more appropriate choice for them (rather than tinker with building their own).
The original poster was asking, if I may reword it: Who is using enterprise-grade hard drives instead of consumer-grade ones, if FB doesn't (and why)?
Both consumer- and enterprise-grade hard drives use standard interfaces (SATA, etc.), so all this "drivers" stuff you talked about doesn't apply to this particular hardware. Since even enterprise-grade hard drives will fail one day, one needs a proper backup strategy anyway, which would also cover the risks of using consumer-grade HDDs. So in light of this, why buy enterprise HDDs?
(I'm not saying one shouldn't, I'm just not buying the explanation above :])
It is a lot like the pets vs cattle comparison. If your underlying architecture can tolerate larger amounts of downtime, the extra cost of enterprise equipment isn't worth it. For many applications, though, it is easier to buy more expensive hardware with exceptional uptime than it is to hire people to rewrite everything to take advantage of cloud-based options.
Enterprise hard drives use SAS connectors to allow redundant paths and a full SCSI command set. Enterprise hard drives have a bit error rate an entire order of magnitude better than consumer hard drives. The rated power-on hours and mean time between failures will be better too, but enterprise gear is not more cost-effective unless you take into account the labor and downtime costs of replacing failed hardware.
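To put that order of magnitude in perspective, here's a rough back-of-the-envelope calculation (my numbers, not from the article - assuming the commonly quoted datasheet figures of one unrecoverable error per 10^14 bits for consumer drives and 10^15 for enterprise drives, and independent bit errors):

    # Odds of hitting at least one unrecoverable read error (URE) while
    # reading a 4 TB drive end to end, using typical datasheet BER figures.
    bits_read = 4e12 * 8  # 4 TB in bits

    for label, ber in [("consumer, 1e-14", 1e-14), ("enterprise, 1e-15", 1e-15)]:
        p_clean = (1 - ber) ** bits_read  # assume independent bit errors
        print(f"{label}: P(at least one URE) ~ {1 - p_clean:.1%}")

    # consumer, 1e-14:   ~27%
    # enterprise, 1e-15: ~3%

That difference matters most during rebuilds, when you have to read every remaining drive in full.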
I feel that explanation answers the question perfectly.
It's not about the fact that the hardware looks and interfaces the same, it's that it lasts longer, performs reliably and predictably and comes with support. Even if you design for failure, it's much better to have 1 drive fail than 3 in a system.
The enterprise stuff will just be more durable and usually more performant in every way so it's worth the cost, especially when most companies are not tech giants with all the engineering talent to build something like cold storage from the ground up. FB/Google/Amazon/Microsoft/Apple are very specialized high-tech companies who know how to do this and not representative of most "enterprise" customers.
If you are doing RAID, especially on a hardware RAID card, consumer drives generally don't support TLER (time-limited error recovery). This means a consumer drive will sit and try to read a bad sector again and again for long periods of time. You'll notice this on desktops when a drive is developing bad sectors: it will pause for 30 or 40 seconds at a time. In a RAID setup it's better to fail fast and mark the drive bad for replacement, but consumer drives don't send the hardware/OS the information it needs to do that quickly.
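For what it's worth, many drives expose this as SCT Error Recovery Control, which you can at least inspect with smartmontools. A small sketch of the kind of check you might script across an array (device names are hypothetical; assumes smartctl is installed and the drives support SCT ERC):

    # Report the SCT Error Recovery Control (i.e. TLER) setting per drive,
    # so RAID members that will hang on bad sectors stand out.
    import subprocess

    DRIVES = ["/dev/sda", "/dev/sdb"]  # hypothetical device names

    for dev in DRIVES:
        result = subprocess.run(["smartctl", "-l", "scterc", dev],
                                capture_output=True, text=True)
        print(f"--- {dev} ---")
        print(result.stdout.strip())

    # Drives that support it can typically be capped at 7 seconds with
    # "smartctl -l scterc,70,70 /dev/sdX" (values are tenths of a second);
    # check your drive's documentation before relying on this.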
Most enterprise drives come off the exact same production line, too. Some have a different external interface, but the important parts are all nearly the same; they might test them differently.
Let's say your enterprise division/company wants to store 100 terabytes of data. You ring up NetApp and buy a solution for, say, $100,000 and plug it in. You get support, lots of documentation, and it is usually fairly reliable.
Now if you are Google or Facebook you want 100 petabytes of storage, but instead of paying NetApp $100 million you employ 20 really smart guys to build you something. You put it on cheap hardware and you build exactly what your software will talk to and what it needs.
It costs you $10 million up front, but the hardware cost is just $15 million because you use cheaper drives and build thousands of your own servers.
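Working through those (made-up, illustrative) numbers, the per-terabyte gap looks roughly like this:

    # Illustrative $/TB comparison using the hypothetical figures above.
    enterprise_cost = 100_000       # NetApp-style solution
    enterprise_tb = 100             # 100 TB

    diy_engineering = 10_000_000    # the 20 smart guys
    diy_hardware = 15_000_000       # cheap drives + self-built servers
    diy_tb = 100_000                # 100 PB

    print(f"enterprise: ${enterprise_cost / enterprise_tb:,.0f}/TB")            # $1,000/TB
    print(f"DIY:        ${(diy_engineering + diy_hardware) / diy_tb:,.0f}/TB")  # $250/TB

And the engineering cost amortizes as you add more storage, while the appliance cost scales roughly linearly.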
Of course this doesn't mean that "Enterprise Storage" isn't going to be the next premium product hit by Open Source type solutions.
I guess it depends how close to bare metal you want to get.
Google/Facebook/AWS (and maybe a Backblaze) have architected their data center setup to not require the functionality that NetApp provides. If you buy a NetApp and plug it into your network, it is because you can't architect a datacenter from the ground up...
Facebook was a founding member of the Open Compute Project. They have a cool GitHub repo and put out interesting whitepapers like:
It is not just cheaper but architected better. They would use Cisco and other enterprisey stuff if it were better, but it is a case of "il y a moins bien mais c'est plus cher" ("there is worse, but it costs more"), as the French Linux motto goes...
Which is not to say NetApp or Cisco is bad - just that you can engineer the datacenter to make it work better without enterprise functionality.
As an object lesson in failure: my first buildout of Hadoop back in 2011. It was not so successful, because I bought all the wrong components: RAID cards, 10G networking, HP and Bladenetwork rack switches, dual power supplies.
It was hard to unlearn how I was used to setting up servers.
Wow - still a bit painful to remember my first go-around with this. What I am about to admit is damning enough to keep me unemployed for the foreseeable future, so hopefully no hiring managers are reading this...
I was a sysadmin and did the datacenter buildouts for the company. The task was to take the dev setup in Softlayer (on bare metal servers) and reproduce it in our racks.
The error was to take this somewhat literally. The bare metal servers that we leased from Softlayer were 2U boxes with redundant power supplies, RAID controllers, and 10G networking.
I found a Supermicro configuration that almost identically matched and ordered 30 datanodes; 2 name nodes (this was back when Hadoop used the curiously named primary and secondary name nodes); and a server to launch jobs from.
We already had a Netezza in place, and since that used Bladenetwork switches I decided to use the same as TOR switches, with a beefier one as the aggregation switch.
We decided to use Ubuntu as the OS and the Cloudera packages - but without the support or the console.
Every single one of those choices was a mistake. It is remarkable in hindsight that it worked at all.
The mistakes:
1) Starting with the power supplies. Since they were redundant, it dictated an A/B power setup in the racks. What this means is that you cannot use more than 50% of the power density, because the entire rack is set to fail over. Each PDU has to be able to keep the whole rack up, so it alarms at 40% capacity. By using redundant power supplies I was more than halving the amount of power I had at my disposal. (Rough numbers in the sketch after this list.)
2) RAID. I hadn't yet read the excellent Hadoop Operations O'Reilly book, which tells you why RAID is a bad idea for HDFS. Further buildouts included ripping out the RAID card and JBODing the drives, allowing Hadoop to properly use the raw disks.
3) 10G networking - because more is better, right? Later, after we hired talented networking folks, I learned a bit about how important buffering is in switches - particularly for Hadoop. My expensive monster Bladenetwork switches were falling over because the bursty traffic would saturate them. In addition, I had the networking set up for a much more conventional network: we were doing the switch-aware stuff in the config but had a wide-open /16 as the network. I learned about properly segmenting to avoid unnecessary cross-talk.
4) Ubuntu - never listen to developers.
;)
Actually it was our default OS. CentOS works better for Hadoop I would later find out.
5) Lack of support. Cloudera is expensive but - unless you are already an expert - worth it. We should have gotten help early on.
6) Hardware choices. Supermicro was a bad choice - penny wise, pound foolish. They changed hardware configs and it was hard to get replacements, etc. Can't say enough good things about working with PSSC Labs for hardware, though.
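To make mistake #1 concrete, here is one way to read those power numbers (only the 40% alarm threshold comes from the story above; the feed rating is a made-up round number):

    # Rough A/B power math for mistake #1. Only the 40% alarm figure is
    # from the story; the PDU rating is a made-up round number.
    pdu_kw = 10.0                  # rating of each feed (A and B)
    provisioned_kw = 2 * pdu_kw    # total power brought into the rack

    # Normal operation splits load across A and B, but each PDU alarms at
    # 40% of its rating so that on failover one feed can carry everything.
    usable_kw = 2 * pdu_kw * 0.40

    print(f"provisioned {provisioned_kw:.0f} kW, usable ~{usable_kw:.0f} kW "
          f"({usable_kw / provisioned_kw:.0%})")
    # -> provisioned 20 kW, usable ~8 kW (40%)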
I already mentioned the switches. I learned about Arista switches which are amazing for this application.
By the time I left that company, I had learned an enormous amount from really smart engineers who knew gobs more about Hadoop and networking than I did.
I think the number of datanodes was up over 200, and we had a lean 1U datanode from PSSC Labs with a single power supply, 12 drives in JBOD, and onboard flash for the OS. Bonded 1G networking running up to Arista TORs and multiple Arista agg switches in a leaf/spine topology.
Because Facebook, Google, and Amazon have the engineering talent to build their own.
No "ordinary" enterprise (basically everyone else) can just say "oh, mounting and unmounting is an issue? let's use a raw disk instead!" -- that is just well beyond their capabilities.
Also, only Facebook / Google / Microsoft / Amazon have the scale at which designing your own cold storage makes sense.
This is very true. I work in the public sector and while we often need to store petabytes of data we have a hard time finding and keeping engineering talent to do stuff like this. Instead we buy expensive enterprise stuff. I find it frustrating.
I agree. Public dollars should, as much as possible, be used to implement and advance open software and hardware. It's no longer just about "free" as in no cost, it's free as in no contract, no lock-in; and not "no one owns it" but rather "everyone owns it." There's a certain obligation and clear benefit for everyone to make it better, including easier to implement.
Not really. Skipping the filesystem layer and using the raw disk has been a standard optimization for databases (MS SQL, MySQL, Oracle DB) for quite a while.
If you're in the on-premises storage industry (aka NetApp, EMC), you will be amazed by the number of companies that still buy networked storage arrays. They range from Fortune 500 firms to much smaller ones like dental offices, school districts, hospitals... They all have one thing in common: they don't trust their data with public cloud infrastructure.
If you ask it that way, I'd guess the answer is: for any network storage array there are at least a few on-premises buyers who trust that particular product.
I guess that there are an awful lot of companies that, out of a desire to save a few bucks, will go with home-NAS-type devices.
Ultimately the main difference between the vendors is performance and availability - you pay a lot more to increase either of those. If you don't need either, even a consumer-style Synology NAS can work.
It's actually starting to go that way in some enterprise environments, especially in storage where ditching the overhead of EMC/IBM/NetApp has a potentially high reward.
But more conventionally, there is a different thought process in enterprise IT and what I call industrial IT. Enterprise IT implements systems around minimizing failure events and creates bespoke environments to meet solution requirements. Industrial orgs acknowledge that these events are a fact of life and build process and software to deal with them. They also tend to limit the options available to consume IT.
Also, "at scale" cannot be achieved by an single entity outside of US Federal .gov. I work for a massive organization... 150k+ people. Facebook serves 10000x more users, and have the engineering resources to do amazing things as a result.
It depends on the applications you're going to use them for, and how much you're willing or able to invest in making your applications be able to take advantage of using a larger quantity of cheaper hardware. A lot of big companies run applications that are not nearly as resilient to failure or nearly as linearly scalable as the systems routinely built by the internet giants.
I realized this before with regards to the ridiculously overpriced top-top-of-the-line CPUs. Let's say you're running an existing application at, say, a moderately small scale: 20 servers. Each server is only handling a couple dozen requests per minute, so the load is not very high, but the requests still take too long. You've hired a whole team of engineers to work on optimizing the code to make it run faster, and that costs you, say $10 million a year. You'd be happy to buy another hundred servers if that made it run any faster, but that's just not how it works -- the application is all single-threaded code, and maybe you can make that better, but that's what your $10 million engineering team is working on (maybe). So in the meantime, you spend whatever amount of money you can on getting the absolute fastest hardware you can find to throw in those servers -- fast CPUs, fast hard drives, fast network... and even if those servers cost $50k each, that's still just a fraction of what you're spending on the software engineering work.
No I think that they are using drives with a specific purpose and do not need enterprisey functions.
I think they mean drives of the quality of the WD Black drives; it looks like that in one of the pictures. The type of drive you might build a Hadoop cluster with, knowing that you will be replacing 1-2 drives per 100 every month.
These drives (unless I misread) do not need to be hotswappable nor any other specialized setup. I would bet they are JBOD and directly attached with minimal controllers. If one drive per tray is powered on at a time then there is no RAID running across that tray...
As a point of comparison, an expensive enterprise drive that you might see in a Netezza appliance costs many times more and has specialized firmware to work in a specific application, with certain RAID controllers at specific firmware settings. From experience I do not think that they are much more reliable - they are specific to the appliance or, in some cases like Aerospike, SSDs required to meet a specific software requirement.
This storage design is stripped bare - the HDDs are off for most of their lifetime. I guess the biggest stress on them is spinning up and spinning down. They must focus on precise temperature/humidity control as it might be as much of a problem to have the drives too cold as too hot.
Because when using commodity hardware, strength is in numbers. Very few organizations need the computing power of Google, FB or Amazon, so they can't benefit from the "reliability at scale" that those companies can. Therefore, they need to buy enterprise-grade HW which fails far less often, even if disproportionately more expensive.
Facebook used to use Dell C-series servers and HP servers; tons of Dell Custom Solutions C1100/C2100/C6100 and DL160 G6 servers showed up off-lease years ago (DL160 racks were shown in a Time magazine article about a FB datacenter). I'm fairly certain I've seen pictures of Dell C6220 servers at Facebook as well.
I remember par2 from Usenet - seems like a similar/same concept (http://www.quickpar.org.uk/) and something I've been wondering about for a while for storage; glad to see it being implemented at such a scale! It makes good sense, though I would be curious how corruption/rebuild plays out over time.
Also impressive how they shaved off all those extra power requirements and built in expected unreliability. And really, utility power is pretty darn reliable anyway, especially mixed with non-customer-facing storage. Good move.
They're not even the most arcane. Consider reading up on rapid tornado (Raptor) codes, other fountain codes, Goppa codes… there are whole families of error-correcting codes with different sets of tradeoffs. And the decoding can be enlightening, too: the Viterbi algorithm and its progeny can be surprisingly useful.
Then blow your mind even further by looking up the principles of the McEliece cryptosystem (but don't use it as originally specified - modern developments have weakened and refined it a lot).
I warn you: this is one of those down-the-rabbit-hole subjects that has a deliciously large amount of literature to soak up, you could find yourselves with bookshelves full of it. (An alarming number of patents, too, for the unwary.)
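For anyone who just wants the flavor without the bookshelf, here's a toy single-parity sketch (essentially RAID-4-style XOR over blocks). Real deployments like the one in the article use Reed-Solomon codes, which tolerate multiple simultaneous losses, but the reconstruct-from-whatever-survives idea is the same:

    # Toy erasure coding: one XOR parity block over k data blocks.
    # Tolerates exactly one lost block; Reed-Solomon generalizes this to
    # m parity blocks tolerating m losses.
    from functools import reduce

    def xor_blocks(blocks):
        """XOR equal-sized blocks together."""
        return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

    data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]   # k = 4 equal-sized blocks
    parity = xor_blocks(data)                     # stored on a fifth "drive"

    lost = 2                                      # pretend drive 2 died
    survivors = [blk for i, blk in enumerate(data) if i != lost]
    recovered = xor_blocks(survivors + [parity])  # survivors XOR parity = lost block
    assert recovered == data[lost]
    print("recovered:", recovered)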
QuickPar hasn't been updated in a long time. Check out MultiPar for an actively developed alternative. (And of course there are other programs that handle PAR2 files as well.)
"For example, one of our test production runs hit a complete standstill when we realized that the data center personnel simply could not move the racks. Since these racks were a modification of the OpenVault system, we used the same rack castors that allowed us to easily roll the racks into place. But the inclusion of 480 4 TB drives drove the weight to over 1,100 kg, effectively crushing the rubber wheels. This was the first time we'd ever deployed a rack that heavy, and it wasn't something we'd originally baked into the development plan!"
That's just fun.
In some ways I'm amazed that none of the smart people at FB thought of the weight of this stuff. But then again, I've probably made sillier mistakes myself on a much smaller scale.
My guess is that they were definitely aware of the weight tolerance of the floor and the tiles, but it looks like the weak spot was the caster wheels on the bottom of the rack. I wonder how they eventually got it out. They probably used one of those rack jacks to get a stronger dolly underneath... or removed the HDDs and moved it.
This made me wonder about something else - the power density. They say the layout:
"can support up to one exabyte (1,000 PB) per data hall. Since storage density generally increases as technology advances, this was the baseline from which we started. In other words, there's plenty of room to grow."
I wonder how they account for increased energy consumption as they put in larger power supplies to deal with the increased power and cooling needs of larger HDDs. I imagine in this layout everything is in proportion: CPU / RAM / disk space. It must be hard to accurately target growth within the constraints of a limited power footprint.
AFAIK, as density per disk increases (e.g. going from 1 TB to 4 TB), power efficiency per TB should improve too, as long as they are not also increasing the total number of spindles by adding new storage nodes. One thing I'm not sure of is whether total load for the hall increases, since more density may invite more accesses.
The hard disk power management part is really fascinating in terms of keeping the storage cluster software alive and well to be able to respond to requests while keeping the hard disks off. A low-power SSD cache tier on the front may be part of the solution.
Doubtful. This article describes only 2EB of storage, while Google was estimated to have 15EB of storage two years ago.
Another comparison with Google would be regarding the erasure coding technique described in this blog post. Google offhandedly mentioned using it in a 2010 presentation. Is this really Facebook's first deployment of erasure coding?
Even the companies that have $100M+ allocated for storage won't necessarily be able, nor even want, to hire the kind of developers who could create their own data storage system in-house - they've already invested so much of their IT staff into buying into their SAN infrastructure. Most companies outside tech with $100M+ available for IT will just spend it on better developers for things that offer growth potential, like big data analytics or whatever. IT is a cost center if your revenue does not directly depend upon the reliability of your information systems, and with limited engineering talent available at non-tech companies you'd rather have them work on things that directly contribute to raising the top line. IT is about the bottom line until a business decides IT is not a cost center for them.
My guess is this is what Google and Amazon are also doing for Nearline and Glacier.
When will SSD drives catch up in storage capacity and price to traditional spinning platters? An SSD consumes very little power, so this type of setup will not be as necessary.
Also what about the life cycle of drives that get powered on and off repeatedly? On one hand you have backblaze running consumer off the shelf drives 24/7. On the other you have FB/Google/Amazon powering on and off these drives.
> When will SSD drives catch up in storage capacity and price to traditional spinning platters?
Never. Someone did calculate that if you can get flash that lasts 15 years, it would be cheaper than disk, though (because the disk has to be replaced). I can't find the links for that.
> Also what about the life cycle of drives that get powered on and off repeatedly?
But a flash disk that is healthy after 14 years is probably taking up 4x or more the DC space and power, at a lower speed, than a replacement disk at the end of its lifespan. The power part is probably not that significant, but the space and port availability are. 15 years is an eternity in hardware.
> My guess is this is what Google and Amazon are also doing for Nearline and Glacier.
Pretty likely. One of the post's authors, Kestutis Patiejunas, was an architect working on Glacier a few years ago. (src: https://www.linkedin.com/in/kestutisp )
SSD vs. HDD prices are a moving-goalposts problem. $/GB for HDDs halves every few years. While prices for SSDs are coming down at a dramatic pace, they will probably slow as they get close to HDD $/GB.
That's only an issue for enterprise class storage of course. The speed (latency mostly, but also throughput) advantages of SSD mean that HDD are eventually going to disappear on the consumer side. The real question is when "eventually" is - I keep predicting in "two-three years" - but then two-three years comes along, and HDDs are going strong as ever. I'm guessing that when you can get a 2 TB SSD for < $100, they'll finally become the dominant consumer storage platform.
>The speed (latency mostly, but also throughput) advantages of SSD mean that HDD are eventually going to disappear on the consumer side.
The last few years have been weird - hell, RAM is literally more expensive per gigabyte than it was in 2012. PC software requirements seem to be standing still. A five-year-old PC is still perfectly usable.
This wasn't true at all for most of my life.
I mean, yes, if storage space requirements don't start growing, of course, you are right, because hard drives, while they are big, are slow.
But... hopefully, this is a temporary setback. Someone will figure out how to make use of the surplus transistors in desktop PCs.
I mean, it can't be that hard; from what I saw in the '90s and early aughts, Microsoft seemed to release a new version of Word every few years, and it required a new PC, even though I personally couldn't see how it was better.
But... apparently that isn't happening anymore.
My point here is just that if the need for disk space grows as fast as hard drive and ssd size per dollar grow, SSD might never become completely dominant, assuming that hard drives maintain their space per dollar advantage.
Of course, if things keep going where they have been going for the last few years, where system requirements for desktops don't really increase over time, then of course you are right, there would be no reason to have spinning disk.
You can already see this, sort of, in mobile, where storage expectations are dramatically lower, but spinning disk never had a real toehold there. Do you remember those tiny spinning hard drives that were packaged in CF cards? oh man, so cool! and so fragile.
The early nineties was like that for RAM: it was expensive as hell and didn't get any cheaper. For example, Windows 95 required 4 MB of RAM. That was ridiculously tiny just a few years later in 1998, when I started working (and got 128 MB in my PC, fairly standard then).
Many kinds of big data files used by customers do not require high speed or low latency. Watching video is fine as long as the disk read speed is greater than the video stream bitrate. Music does not require it either. When you download a huge archive, you don't need high write speed; HDD write speed will be more than enough - it's probably higher than your network speed anyway.
Unless SSDs become better than HDDs in every parameter, HDDs are a very suitable part of a disk setup for anyone who needs more than 200 GB of data. And there are millions of people who don't use all those cloud things and prefer to download, store, and watch/listen to their content locally.
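A quick sanity check of that point, with ballpark figures (assumed, not measured):

    # Ballpark: a consumer HDD's sequential read speed vs. typical media
    # bitrates. All figures are rough assumptions.
    hdd_mbit_s = 150 * 8  # ~150 MB/s sequential -> ~1200 Mbit/s

    bitrates = {
        "320 kbps music": 0.32,
        "1080p stream": 8,
        "4K stream": 25,
        "Blu-ray remux": 40,
    }

    for name, mbit_s in bitrates.items():
        print(f"{name}: HDD has ~{hdd_mbit_s / mbit_s:.0f}x the required throughput")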
How are consumers storing local content reliably - home NAS with RAID? If millions of consumers are buying NAS devices, which vendors are reaping that revenue?
Single copy is most common. I haven't seen reliable data on the split between a copy in the cloud vs a copy on an external drive, but those would be 2nd and 3rd most common, with cloud growing to cut out both external drive and single copy.
I have family members who have had, and continue to have, user-induced data loss with a single copy - and even with single copy + cloud, because they insist that passwords are b.s.: they don't write them down, don't remember them (thanks to arcane rules requiring a password they can't remember without writing it down), and then the device has some problem that requires a reset, and then they can't get into their online account because they failed to properly set up (and apparently weren't required to) the recovery account or phrases.
Bulk content (when cloud storage might not be appropriate) is rarely important enough for reliability to be a concern, and home NAS is totally insufficient for reliability when it comes to truly important content as it does nothing for theft, fire, etc.
If my home media library gets wiped out by hardware failure, replacing it would be an annoyance but not a disaster. Anything that would be a disaster is covered by at least two cloud services.
Probably the risks of data loss are underestimated by most people, so they use neither NAS nor RAID (I would be surprised if many people without a solid technical background understand what RAID is). Disks are reliable enough that the majority is not affected, and if the data is really important, it can often be recovered for those who do experience data loss.
Netflix switched from HDD to SSD for serving video, though, as you can serve more streams at once. Consumer laptops will probably all switch to SSD soon, so almost all video streaming will actually be from SSD.
In this environment, the power cycles are going to be less stressful than typical, as you won't see the degree of thermal cycling you do in a hot chassis. I'd love to see what the data says a year out.
Would be interesting to see stats on how often they had to resort to cold storage. Also, how do they go about thoroughly testing such a service? This is a main advantage of having 3 replicas: you can test HA in a straightforward way. How does that work with the Reed-Solomon system? Also, kudos to Facebook for publishing articles like this to the public. A lot of companies keep this stuff secret.
This quote from the article talks about data verification. Although we all know that a backup isn't really a backup until you've tested restoring from it :-)
"""
To tackle this, we built a background “anti-entropy” process that detects data aberrations by periodically scanning all data on all the drives and reporting any detected corruptions. Given the inexpensive drives we would be using, we calculated that we should complete a full scan of all drives every 30 days or so to ensure we would be able to re-create any lost data successfully.
Once an error was found and reported, another process would take over to read enough data to reconstruct the missing pieces and write them to new drives elsewhere. This separates the detection and root-cause analysis of the failure from reconstructing and protecting the data at hand. As a result of doing repairs in this distributed fashion, we were able to reduce reconstruction from hours to minutes.
"""
Yeah, it's pretty unbelievable. Is the idea here that they're keeping these storages as general-purpose as possible to serve many different use cases and survive the test of time and change?
Or do they specialize these data centers for ONE specific feature? I wouldn't want to be in the shoes of the people designing the full integration of the system in the latter scenario - imagine your use case changes... that's a whole new level of refactoring, especially once you go into special-purpose hardware.
There was an interesting presentation about the cold storage system used at Alibaba on openSUSE conf 15. Unfortunately the video isn't online yet, but they have a system that allows them to turn off power to the cold storage servers and only wake the disks up on a monthly schedule. They also use erasure coding and have calculated that they won't have to replace any disks for 5 years if the MTBF numbers from the vendors are correct:
I wonder - looking at the data they have now - how much of the stored data has never been accessed by ordinary users. There's got to be a lot of cruft that could be trimmed. In the past people would keep the letters they felt were important. Now, with the network, it's hard to tell what's important: a user might prune items that other users want to keep, or might prune items that later become more important. Is there really enough value in keeping everything?
I'm a consummate hoarder - I've got two decades of emails saved (but not all emails!) ... and even I think perhaps we've gone over the top.
Perhaps it's good for social history. Or maybe in the future for a Black Mirror like reconstruction of people's personalities in an AI.
Doesn't there need to be limits on what we keep?
[Meta: the parent is a valid remark and one that I feel adds to the complexion of the scenario under consideration; perhaps it could have been made better, fleshed out, but still. I wish HN - as a population - would value those who add to the conversation in this way.]
Those are arguably antiquated models. Even if you don't agree with that, those models are predicated on companies that don't actually know what they're doing when it comes to storage, which is why they defer to others and pay the markup that comes with it. Google, Amazon, Microsoft, Apple, Facebook, many others (and growing), this is their business now, to know how to reliably store things, do it in-house in a way that works for their use case.
I don't use Facebook at all, but I appreciate the many investments they've made in open hardware and open source software. They don't give away everything. But they definitely contribute in significant ways.
You say this, but there's a huge difference between the typical enterprise storage solution and what the internet giants require. Only recently are those vendors realising this...