Usually testing is done at room temperature (25C is typical) with the best possible heat-sinking and an ideal Vgs. Most MOSFETs of this type actually tie the internal thermal components to the drain pins. It's intended that the PCB designer use a large copper pour for the drain pads; this copper area acts as the heatsink for the chip, transferring heat away from the junction.
In other words, the MOSFET can only handle 6A if ambient temperature is kept at room temp (which usually means fans on it, or it's wide open to a room), there is adequate heat-sinking provided by the PCB design, and Vgs is high (higher Vgs reduces Ron). Further, it's important to remember that FETs have a runaway characteristic: as junction temperature rises, so does Ron, which creates more heat.
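That feedback loop is easy to see with a toy calculation. This sketch uses illustrative numbers only: an 80 mOhm Rds(on) at 25C, a ballpark 0.4%/C tempco, and 62 C/W to ambient; none of them come from any particular datasheet.

```python
R25 = 0.080    # ohms at 25 C (assumed)
TC = 0.004     # Rds(on) rises ~0.4% per degree C (ballpark tempco)
THETA = 62.0   # C/W to ambient (assumed)
T_AMB = 25.0   # C

def settle(i_load, t_max=175.0, iters=1000):
    """Iterate T -> Ron(T) -> P -> T; return the settled temperature,
    or None if the loop runs past t_max (thermal runaway)."""
    t = T_AMB
    for _ in range(iters):
        r = R25 * (1 + TC * (t - T_AMB))
        t_new = T_AMB + THETA * i_load ** 2 * r
        if t_new > t_max:
            return None          # runaway: past the device limit
        if abs(t_new - t) < 1e-6:
            return t_new         # converged to a stable operating point
        t = t_new
    return t

print(settle(1.0))  # settles around 30 C: stable
print(settle(5.0))  # None: every pass adds more heat than the last
```

At 1A the extra heating barely nudges Ron, so the loop converges a few degrees above ambient; at 5A each pass produces more heat than the last and the temperature never settles.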
So, there is a lot more to the functional rating of a MOSFET than the 6A on the datasheet. In reality you would never want, or expect, that FET to hold 6A. With just a cursory look at the PCB layout, and previous experience with the ambient temperatures on the backplanes of server chassis loaded with many drives, 1A would be pushing it, so I'm not at all surprised that these things blew.
Perhaps the most WTFery aspect of that PCB is the lack of any suitable capacitance near the SATA connectors. If you aren't using staggered spin-up and many drives spin up at the same time, the rail voltages can dip, which in this setup causes Vgs to drop, increasing Ron at the worst possible time, since mechanical drives draw a lot more current at spin-up. This could trigger thermal runaway.
In this case the MOSFET is rated for a max current of 5A @ 25 degrees C if the gate is driven with a 5v signal. It has an 80 mOhm resistance when fully enhanced. So at 5A that is 2 watts of power dissipation (P = I^2 R), and since the case has a thermal resistance of 62 degrees C/W, the case will sit at 149 degrees C (25 + 62 x 2W), which causes Rds(on) to double and that 'explodes' the MOSFET. If the device was successfully carrying 0.5A loads, then it was only dissipating about 20mW (which looks easily survivable given the package size). So gluing a copper heat sink to the FETs with at least 2 sq inches of surface area would probably keep them alive with a 0.75A load. If you combined that with about 150 CFM of air flow at sea level (225 CFM at 8,000 ft) you'd stay pretty solidly inside the 'not to be exploding' parts of the parameter graphs.
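A quick sketch of that arithmetic, using the same figures quoted above (80 mOhm, 62 C/W, 25C ambient):

```python
R_DS_ON = 0.080      # ohms, fully enhanced (from the comment above)
THETA_CA = 62.0      # C/W, case-to-ambient thermal resistance
T_AMBIENT = 25.0     # C

def case_temp(i_load):
    """Steady-state case temperature for a given load current."""
    p = i_load ** 2 * R_DS_ON           # P = I^2 * R
    return T_AMBIENT + THETA_CA * p     # T = Ta + theta * P

print(case_temp(5.0))   # about 149 C: hot enough to cook the part
print(case_temp(0.5))   # about 26.2 C: 20 mW is easily survivable
```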
What's interesting about this is that a bipolar transistor has the opposite property -- as it heats up, resistance decreases -- and yet this opposite property causes the same thermal runaway problems. (Okay, when talking about bipolar transistors we don't really talk about "resistance", but if you know that then you probably know what I'm going to say anyway...)
When hooked up to a load at a relatively constant voltage, there will also be a relatively constant voltage across the transistor. As the bipolar transistor heats up, its resistance decreases and more current flows through it, and by P = V^2/R a falling R at constant V means more dissipated power. So once a transistor gets hot, it gets hotter, until it blows.
The question is, "what kind of load looks like a voltage source?" There's actually a quite common load -- any amplifier with parallel output transistors will look like it's driving a constant voltage load, from the perspective of one of the parallel transistors. Basically, hooking up bipolar transistors in parallel does NOT multiply the power rating as you would think, because thermal runaway might cause one of the transistors to dissipate all the power.
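A toy model makes the current hogging visible. Here two parallel BJTs share a fixed load current at a fixed collector voltage; a hotter device conducts a larger share (Vbe falls roughly 2mV/C), and that larger share heats it further. All the component values and the share-sensitivity constant are made up for illustration:

```python
import math

V_CE = 10.0      # volts across each device (constant-voltage load)
THETA = 50.0     # C/W junction-to-ambient (made up)
T_AMB = 25.0     # C
GAIN = 0.08      # per-degree current-share sensitivity (toy value)

def share_after(i_total, t_offset=1.0, steps=200):
    """Start one device t_offset C hotter; return the final current shares."""
    t = [T_AMB + t_offset, T_AMB]
    i = [i_total / 2, i_total / 2]
    for _ in range(steps):
        # hotter junction -> lower Vbe -> exponentially larger share
        w = [math.exp(GAIN * (tk - T_AMB)) for tk in t]
        i = [i_total * wk / sum(w) for wk in w]
        # each device heats in proportion to the power it dissipates
        t = [T_AMB + THETA * V_CE * ik for ik in i]
    return i

print(share_after(1.0))  # device 0 ends up hogging nearly all the current
```

Start one device just 1C hotter than its twin and it ends up carrying essentially all of the current, which is why the parallel pair never behaves like a 2x-rated device.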
Is the design solution to use a FET with a lower Rds(on) at 5V Vgs, for smaller heat dissipation?
But more importantly, is there a standard current that most chassis are tested to? I would expect all SATA hot-swap bays to support all SATA drives on the market, since nobody ever publishes power dissipation or consumption figures.
Honestly - it's to not cheap out on the design, and to use a proper hot-swap controller.
Examples (many companies make these): http://www.ti.com/ww/en/analog/power_management/system-prote...
You can rig up something slightly better by designing a current-sense circuit with feedback into the FET gate, if you're adventurous. In either case the goal is to limit and control inrush current during power-on or disk insertion.
Based on what you've said I doubt the steady-state current of the disks is the problem at all. It just sounds like the 3TB drives have a higher inrush current during power-up, and it's either high enough or lasts long enough to blow the FET junction.
The SATA spec defines "pre-charge" voltage pins that are connected after ground, but before any of the other voltage pins are connected. The idea is that by inserting a small (10 ohm) resistor, you can limit inrush current to a tolerable value while capacitors charge and regulators start up, and then when the other pins connect a couple of milliseconds later, the drive gets the proper low-impedance power connection.
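Back-of-the-envelope for why a small resistor works: the peak inrush is just V/R, and the charge-up time is set by the RC constant. The 10 ohm value is from the comment above; the drive-side capacitance here is an assumed figure, not from the spec:

```python
V_RAIL = 12.0    # volts on the 12V rail
R_PRE = 10.0     # ohms, pre-charge resistor (per the comment above)
C_IN = 50e-6     # farads, assumed drive-side bulk capacitance

i_peak = V_RAIL / R_PRE     # worst-case current at first contact
tau = R_PRE * C_IN          # RC time constant of the charge-up
t_99 = 5 * tau              # ~99% charged after five time constants

print(i_peak)       # 1.2 A instead of a near-dead-short inrush
print(t_99 * 1e3)   # about 2.5 ms, on the order of the pre-charge window
```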
Do you know if implementing pre-charge via connector mechanics obviates hot-swap protection circuitry? Supermicro seems to have hot-swap protection on their backplanes as well, but I haven't had a chance to closely inspect it.
EDIT: Those connector pins aren't normally solid gold. They are deposited metal (copper or tin) with electroplating of a few microns of gold on the surface.
Another reason to control insertion spikes is that when the first power pin hits you'll get a little arc (spark). This can damage the pins in the form of small chips in the coating and/or carbon deposits (or oxidation of some metals), which reduces the number of insertion cycles the connectors will survive.
If you have spare boards and a hankering for destruction, you could repeat your test with resistors instead of hard disks, to verify the current drawn by the hard disk is as advertised.
MOSFET on-resistance increases with temperature, so with multiple devices in parallel the hottest one carries the least current. This is an important effect even within one device, which is actually an array of thousands of parallel cells.
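That self-balancing effect is easy to sketch. Here two parallel MOSFETs split a fixed load current in proportion to conductance, and Rds(on) rises with temperature, so the hotter device sheds current; the component values are illustrative only:

```python
R25 = 0.080      # ohms at 25 C (assumed)
TC = 0.004       # per-degree Rds(on) tempco (ballpark)
THETA = 62.0     # C/W to ambient (assumed)
T_AMB = 25.0     # C

def balance(i_total, t_offset=20.0, steps=500):
    """Start one device 20 C hotter; return the final current shares."""
    t = [T_AMB + t_offset, T_AMB]
    for _ in range(steps):
        r = [R25 * (1 + TC * (tk - T_AMB)) for tk in t]
        g = [1.0 / rk for rk in r]                # current divides by conductance
        i = [i_total * gk / sum(g) for gk in g]
        t = [T_AMB + THETA * ik ** 2 * rk for ik, rk in zip(i, r)]
    return i

print(balance(2.0))  # shares drift back toward 1.0 A each
```

Even with one device starting 20C hotter, the positive tempco pushes the pair back toward an even split, the opposite of the bipolar hogging behavior.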
I would be suspicious of this, in conjunction with inadequate heatsinking (i.e. no heavy copper pads under the FETs), especially if the FETs are being driven at 3.3v. Looking at the specs, the FET is rated at 50mOhm given 4.5v gate drive - very acceptable - but if the FET is driven at 3.3v it will have a much higher Rds (it may even be running in the linear region, which would be very bad). Note that the gate threshold voltage is 1.5v typical but 3v max, so driving with a 3.3v logic signal would be marginal in the worst case.
So Vgs is the same as the rail voltage (3.3, 5 and 12v).
They're essentially the OEM for a lot of Intel developer stuff now, too. Over the past 10 years, they went from "yet another commodity motherboard manufacturer" up to being the big white box option. They have basically replaced Dell and HP as the go-to vendor for a lot of big deployments, although Dell and HP do better financing, Dell and HP have a few higher-end products, and Dell/HP have better desktop/workstation/laptop products if you want a full suite from one vendor.
I've actually considered some Norco enclosures in the past for a hobbyist project, but I had concerns over their hot swap backplanes -- mainly, a lot of reliability issues reported. Catching on fire pretty much settles it!
ASUS makes server mobos, but not cases (I'm not sure whose cases they're using for barebones units); Antec makes low-end server cases, but no barebones; and Intel makes pretty horrid server mobos but uses someone else's cases for barebones.
There's Penguin (Linux/HPC), iXsystems (FreeBSD/storage), and Tyan for "server barebones."
Usually I'd just pay these guys to build something: http://www.computerlink.net/
Article states: "... it should work fine in any metal box, right?"
But then says the case had some electronics which failed. Which means it wasn't just a metal case. Anyway TIL that server cases aren't just metal boxes.
And in case somebody wants to respond, how important is vibration dampening to ensure hard disk reliability? What's the best way to damp vibrations in a consumer tower desktop case?
Putting the RAID set in a new machine, it would rebuild fine. But in the original machine, we swapped out the raid controller, CPU's, even the whole motherboard, and the RAID sets still would not rebuild.
Long story short, each of these "haunted" servers had a bad fan that was causing a lot of vibration within the chassis - enough physical vibration happening that the hard drives were essentially rendered inoperable.
The moral of the story is to make sure you have good vibration dampening on your fans, and to use sensor monitoring to alert you if the fans are going bad. (Even this is not perfect, since sometimes the fan gets off-kilter but is still happily spinning at 10K RPM. The first thing we did if we got an alert for a disk failure was to replace the fans and attempt a RAID rebuild before touching the "bad" disk.)
We swapped out the E450s for 440s for Oracle when we moved to InterNAP, and all seemed to be well.
Hearing your story, I wouldn't be surprised if we had just enough/wrong vibration in the case to make it go Tacoma Narrows on us.
It has been a (long) while since I have seen the inside of an e450, but iirc there were a bunch of fans in trays in there. So it is certainly possible that the vibration did bad things. I still carry one of the e450-era keys on my keychain as a memento.
"1% less downtime means 87 hours per year. Do you think Lenovo is 1% better than Acer?" <- what I say when I encounter a client that hasn't learned this lesson, and insists on low end gear.
Buying more expensive hardware (with better per-component availability) is one approach to a full availability picture. However, it's one with significant shortcomings. At the top end, it's very expensive. Across the board, it only addresses some possible threats, and completely doesn't address others (like natural disasters). The alternate approach is redundancy. That approach is not without its shortcomings, including increased system complexity and the problems of failure detection and failover. However, it's generally lower cost and is more robust to multiple types of threats.
This is why RAID is so successful - because it is a reliable set of techniques for building highly-reliable systems out of unreliable components. Again, it only protects against one threat (drive failure) and not others (fire), but it's proven superior to the alternative "buy a single gold-plated drive" approach.
Low-end gear isn't for everybody. Cutting corners in the wrong places is dumb. On the other hand, high-end gear only saves you in some ways, and those ways may not matter to your business. If you want high availability or high durability you need to know two things: how much your data is worth, and what you are protecting it from. Until you know these answers, you're going to be making poor system design decisions.
Redundancy is a great tool but it's not a panacea. You mentioned many of the drawbacks yourself. I see too many people building gigantic RAID arrays with consumer Hitachi drives and questionable controllers. It's exceedingly easy to think you're adding redundancy when you're really just building a house of cards.
Then I checked the drive spec and saw that we are pulling 0.8A on the +5V rail and 1.2A on the +12V rail, which is even more than these 3TB SATA drives pull, and we've run without backplane issues for months.
We did have to upsize the case's PSU from the shipped 500W though, as beyond 15 drives we got voltage warnings from the expander.
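For what it's worth, the steady-state rail math from those measured currents (0.8A @ 5V plus 1.2A @ 12V per drive), ignoring non-drive loads and spin-up surges (which can roughly double the 12V draw):

```python
# Per-drive steady-state power from the measured rail currents above.
P_DRIVE = 0.8 * 5 + 1.2 * 12          # 4 W + 14.4 W = 18.4 W per drive
for n in (15, 20, 24):
    print(n, "drives:", round(n * P_DRIVE, 1), "W steady state")
```

At 15 drives that is already ~276W steady state before the motherboard, fans, and spin-up surges, so a 500W PSU running out of headroom is plausible.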
So I think YMMV.
I never expected that paying more for a slower drive (but better power and less power-related issues) would actually make sense!
(For what it's worth, the datasheet for the MOSFETs used are rated at 6A; it's possible the manufacturer got it mixed up with the 0.6A batch of MOSFETs. I don't have any hard evidence though, so I'd rather not post speculation.)
lold hard. transistors are not resistors. peak current is not just a variable parameter.
I suppose this was due to a bad contact somewhere (socket or HD) and the transistors working in linear mode for longer than they can withstand, overheating and boom...
EDIT: or maybe they used a crappy hot-swap controller (if they actually used one) which couldn't pump enough current into the gates of those MOSFETs.
I also saw a couple instances where the trace to the gate blew up (like, black board, no copper, looks like a burnt fuse), so it's possible that static during manufacturing killed their FETs even before they shipped. (We're pretty good about static, and if it was our fault I think maybe one drive/channel would die, not several)
yes, that would be better. in reality parameters of transistors vary. but when it is off by an order of magnitude these parts (and actually the whole batch) go to trash.
> trace to the gate blew up
if it really was the trace to the _gate_ then it may have blown after the MOSFET itself melted, because even really fat FETs (with high gate capacitance, D-S rated for hundreds of amperes) can't do that to a copper trace.