It turns out that Samsung 8XX SSDs advertise they support queued trim but it's buggy. The old TRIM command works fine.
There are in fact lots of "quirks lists" and "blacklists" in the kernel and virtually all computers require some workarounds in the linux kernel for some buggy hardware they have. Pretty amazing when you think about it.
EDIT: another closely related example is MacBook Pro SSDs and NCQ, aka native command queuing. They claim they support it, but on many it's buggy. It gets better, though; the Linux kernel only started trying to use such functionality by default relatively recently.
These sorts of things are, as you can see, very confusing and frustrating to track down, identify, and find a general fix for.
EDIT2: now that I've actually read the kernel bugzilla entry further, it's more recently come to light that the actual problem with recent MacBook Pro SSDs is MSI (an efficient type of interrupt delivery).
So another way of putting what you said would be "on Linux there's no working driver for that piece of hardware, unlike on Windows where the 'proprietor' went to the trouble of supplying such a driver."
Google up the Intel errata for the i7
The list goes on and on.
One of them contained a line related to having found a CPU bug and having put a workaround in place.
I am not entirely sure, but i think it may have been the F00F bug.
Among the challenges faced by the AMCC 3ware RAID HBAs were faulty motherboards.
"But PCI is a standard!" you quite reasonably protest.
Yes, and the US Constitution guarantees us many inalienable rights.
I assumed that these drives had the same controller chip and the same firmware base as the consumer samsung SSDs, but with higher quality nand and some firmware tweaks. It's very hard to find technical details about these enterprise drives on the internet (compared to the consumer drives).
I guess the smartctl command proves it, these enterprise samsung SSDs do not have queued trim enabled.
It would make sense for enterprise drives to be more conservative and lag on feature set. But it's very surprising that enterprise drives are corrupted by the original un-queued TRIM; they're supposed to get more validation, and that's a very common feature.
The "blacklist" does not appear to have any constant to blacklist old-style trim, only NCQ_TRIM (and other odd stuff, most notably all NCQ usage).
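For reference, the libata quirk table looks roughly like this (a paraphrased fragment from memory, not the verbatim kernel source; see drivers/ata/libata-core.c for the real entries and exact model-string patterns):

```c
/* Sketch of libata's device blacklist. Model strings are glob
 * patterns matched against the drive's reported identity. */
static const struct ata_blacklist_entry ata_device_blacklist[] = {
	/* a drive where NCQ itself is broken */
	{ "WDC WD740ADFD-00",   NULL, ATA_HORKAGE_NONCQ },

	/* drives that advertise queued TRIM but corrupt data with it */
	{ "Micron_M500*",       NULL, ATA_HORKAGE_NO_NCQ_TRIM },
	{ "Crucial_CT*M500*",   NULL, ATA_HORKAGE_NO_NCQ_TRIM },
	{ "Samsung SSD 8*",     NULL, ATA_HORKAGE_NO_NCQ_TRIM },
};
```

As the parent notes, there's a flag to disable NCQ entirely (ATA_HORKAGE_NONCQ) and one for queued TRIM (ATA_HORKAGE_NO_NCQ_TRIM), but none for plain, un-queued TRIM.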
This makes sense, because if some SSD advertised old-style trim but was corrupted by it, then it would be found and fixed sooner by these vendors, because Windows 7 would exhibit the corruption.
Please permit me to violate my NDA:
/* MacWrite needs this */
... in Mac OS System 7.5.2. I honestly don't know whether MacWrite still needed it but that code was there to work around a bug.
Linux 4.0.5 includes a patch that blacklists queued TRIM for the buggy drives. Windows and OS X apparently don't support queued TRIM at all, so they're unaffected.
Our affected drives did not match any pattern so they were implicitly allowed full operation.
See the list:
At a place I used to work we had a reasonably large cluster of Windows boxes on Amazon. Randomly, Windows machines on Amazon would suddenly stop accepting new TCP connections.
This meant that machines would be running fine, and then half your cluster would start dropping offline. At the time this happened to us, there were no other reports we could find of it happening to anyone else.
Turns out, it's some bug in the Xen Virtual NIC driver that wasn't running the offloaded TCP cleanup, and so eventually the system couldn't accept any new connections.
Once we figured out what was happening we could pre-emptively reboot boxes, but that was a problem for us for about 6 months, IIRC.
There's probably dozens of these bugs affecting someone on these cloud platforms at any one time. But because you have no access to the hardware, you don't even have the option of saying "Screw it, lets just get different hardware". You're at the mercy of your cloud provider.
Many use-cases just require the job to be done on your own computers for security and privacy reasons. Yes, Amazon's and Google's services are in some ways less secure than your own computer, because they are hosted by companies subject to a government that doesn't value privacy, not even that of its own citizens. That means said government can, just to give a concrete example, NSL the companies into giving up all they have about you, and you wouldn't even know.
When the government puts national security above fundamental human rights there is something dangerously wrong.
A cloud is just a sysadmin staff with a Sufficiently Large Deployment to have ironed out all the kinks in their hardware.
That's assuming their business model isn't premised on an infinite supply of future customers, so that in the short term, as long as revenue per customer exceeds the cost of sales per customer, we're all good. Support costs that exceed the average cost of sales must be beaten down or ignored; otherwise it's cheaper to let those customers go and have sales "earn" a replacement customer.
Finally, their sysadmins work for them, to meet corporate objectives of various meaningless metrics which have no necessity of aligning in any way with your own corporate objectives.
By that definition, I don't think there are any clouds.
But people can try, and they can get close; and one can say that something is a cloud to the degree that it manages to fulfill the "amorphous shape in your diagram you don't have to worry about" promise. So there are some 80%-clouds, some 95%-clouds, some 99.995%-clouds, and so on.
The point I was trying to make is that the degree to which a cloud achieves that promise is correlated to the size (and longevity, and homogeneity) of the deployment. The more man-years have gone into taking care of a given server type at a given DC, the more institutional knowledge is ready-at-hand to solve a problem on your machine of that type, and so the fewer issues become emergencies that break out of the "cloud" abstraction to require your attention.
And it was a reply to the parent precisely because a security problem is just such an "emergency" that represents a failure of institutional knowledge: I would much sooner trust AWS's KMS to not leak my private keys than I would trust a machine I was running myself to not leak my private keys. I'm a much worse sysadmin than AWS!
For the machine alone it's $1200 a month. Bear in mind it's on shared infrastructure, with noisy neighbours; you'll see about 10-30% CPU steal. In practice you'll see performance about half that of a real machine (from my comparisons).
Then you'll need to factor in disks as well. First things first: EBS is dogshit slow. Yes, ephemeral disks are fast, but then they die, so you're in the same situation. You also need 10-gig networking to get low latency, avoid puncturing the cache, etc.
For EBS, the maximum IOPS you can guarantee to get is 20,000, and you need 1 TB for that.
For the IOPS, that's $1300 a month, plus $125 for the 1 TB of storage.
So per machine it'll be $2625 a month, or $31,500 per machine, per year.
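Those figures add up as follows (a quick sanity-check sketch; the dollar amounts are the ones quoted above, not current AWS list prices):

```python
# Sanity check of the figures quoted above (the commenter's numbers,
# not current AWS list prices).
instance_per_month = 1200   # the machine alone
piops_per_month = 1300      # 20,000 provisioned IOPS
storage_per_month = 125     # 1 TB of EBS storage

monthly = instance_per_month + piops_per_month + storage_per_month
yearly = monthly * 12

print(monthly)  # 2625
print(yearly)   # 31500
```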
Every 6 months, you could buy a new machine, which is faster than the fastest EC2 instance + EBS.
Now, the OP stated that they have more than one machine. Obviously one could use reserved instances. However similarly one could negotiate volume discounts.
There is of course the cost of internet and cooling, you're looking at around $500 a month for half a rack, depending on power consumption. (if you're colo'ing)
From a valuation point of view, having hardware counts towards your value, as it's an asset you actually own. More importantly, you can use it to lower your tax bill and reduce your run rate, in exchange for an up-front cost.
Now, if you have a lot of bursty traffic that doesn't require much DB activity, then AWS is perfect, as the elastic load balancer allows you to spin up machines on demand. However, that's not that helpful for databases. Sure, you can warm-migrate from an EBS snapshot, but you'd best do it quick, otherwise you'll overload an already overloaded DB.
First of all, they tend not to look at monthly prices, and are seduced into thinking their instances are cheap. Secondly, they are seduced into thinking they are spending less ops time, though in my experience it's the reverse. Thirdly, people "forget" about extras like bandwidth costs (which are extortionate at all the big cloud providers), extra storage volumes, etc.
Then when people get the bill, it often gets back-rationalised as being ok because it's cloud so it must be cheap.
The greatest innovation AWS did was finding a way to get people to pay absolutely insane rates for hosting.
They share their test results for both physical and cloud-based storage, I figured this would be of interest:
Samsung knew that only Linux supported queued trim, so releasing it without proper testing is just externalizing the disproportionately increased cost of testing to the Linux community.
With Samsung's finished-forms walling the company already tells Linux users to not expect any support, at all. So, that is consistent with the testbed-theory.
I'm sorry. I'm too dumb to parse this. :( Would you kindly rephrase it?
Thanks much. :)
I'm sorry. I'm too dumb to parse this. :(
Here is another try: Samsung's support walls with prewritten answers that say Linux is open and thus Linux is unsupported, and this action by Samsung is consistent with the testbed-theory.
Anyway, I get your statement now. Maybe instead of saying "walls" you should say "stonewalls" (derived from "stonewalling")?
If you're referring to Hitachi, then they did continue it, yes, but they bought it on a fire-sale, and their name was not attached to the original affair, so they presumably did not see it as particularly risky.
Even if Samsung has some systemic problems, it's more subtle than just schlocky marketing, or targeted benchmarking.
# call fstrim-all to trim all mounted file systems which support it
# This only runs on Intel and Samsung SSDs by default, as some SSDs with faulty
# firmware may encounter data loss problems when running fstrim under high I/O
# load (e. g. https://launchpad.net/bugs/1259829). You can append the
# --no-model-check option here to disable the vendor check and run fstrim on
# all SSD drives.
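For context, that script implements batched TRIM: instead of mounting filesystems with the continuous discard option, fstrim runs periodically. A matching /etc/fstab entry would simply omit discard (the UUID below is a placeholder):

```
# /etc/fstab -- rely on the scheduled fstrim job rather than
# continuous TRIM; note there is no "discard" mount option.
UUID=xxxx-xxxx  /  ext4  defaults,noatime  0  1
```

The trade-off is that deleted blocks stay un-trimmed until the next scheduled run, in exchange for avoiding per-delete TRIM commands under load.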
That's what I have in my home computer, with ArchLinux.
Do you think this problem only is something particular in the servers of the author of that article, or should this be interpreted as:
linux + samsung 850 = you will lose your data?
Sometime around the end of 2013 I started frequently getting lost data and corrupted filesystems upon reboot.
After much searching, and about 4-6 months into the issue, I found out that the culprit was the queued TRIM commands issued by the Linux kernel to my Crucial M500 mSATA disk. The Linux kernel already had a quirks list with many drives, including some of the M500 variants, just not mine.
I added my model, compiled the kernel, and the nightmare ended. I proceeded to submit a bug report and a patch. The patch got accepted (yay!) and the bug report turned out to be very useful for other people with the same problem but a different disk, as I had included the dmesg output specific to the issue. This meant they could now google the errors and get a helpful result.
Such is the nature of free software; you are allowed to fix your computer yourself. :)
We have had great success with both SanDisk Extreme Pro SATA and Intel DC NVMe series drives; we've also recently deployed a number of Crucial 'Micron' M600 1TB SATA drives that are performing very well and so far haven't given us any issues.
There are Crucial SSDs on the list. I'm going to be keeping a closer eye on them now.
The problem appears to have gone away following a firmware update, touch wood.
I'm seriously considering migrating / to ZFS now.
> fsutil.exe behavior query DisableDeleteNotify
DisableDeleteNotify = 0
(That's why serious bugs like this can happen ;)
Also, some Samsung 800 series drives only gained this bug in a recent firmware update (840 EVO specifically).
Linux 4.0.5 ships with the patch linked above, but for a while you had to roll with a kernel built from source.
EDIT: The blatant file corruption issues only manifested after updating to firmware EXT0DB6Q.
However, if you don't update your firmware, you'll suffer from significant performance degradation when reading old files: http://www.anandtech.com/show/9196/samsung-releases-second-8...
Do you think there'll ever be SSDs that don't need it?
I have an old Intel SSD that doesn't even support TRIM, and it still works fine. As do all the other USB flash drives I have...
Eventually, they relented and enabled it on their SSDs. I'm pretty sure the marketing and engineering butted heads over this one stupid bullet point.
Apple didn't do this because of "windows users whining" but because they knew they didn't want an angry mob of customers wondering why their drive is 10x slower than it was on day one.
Arguably, idle GC was "good enough" for some use cases, but probably not for drives that aren't sitting idle all the time and are on many hours a day. Even then, Apple probably didn't want to tell its customers to "let it sit overnight" to regain performance when supporting plain-jane TRIM was a trivial addition.
On-board GC + OS-driven TRIM are considered the optimal solution for SSDs.
For one of the most expensive SSDs available on the market, it was disconcerting to find dmesg -T showing TRIM errors when the drive was mounted with the discard option. Research on mailing lists indicated that the driver devs believe it's a Samsung firmware issue.
Disabling TRIM in fstab stopped the error messages. However, it's difficult to get good information about whether drive performance or longevity may be impacted without TRIM support.
If your drive has a reasonable amount of over-provisioned space, it can simply work around the missing TRIM commands. This is theory, however; I don't know if the firmware actually does this. This is the exact thing that makes some drives better than others when working without TRIM.
I would expect some sites to trade storage speed for more space (HDDs instead of SSDs).
The SSD zeroed out a part of the disk during runtime; as I watched it happen, music was playing from this very drive. It was mounted from Ubuntu MATE 15.04, playing a music library through Audacious. Suddenly the music glitched and I/O errors began appearing. Rebooted to a DISK READ ERROR (the MBR was on the EVO). Ran chkdsk from USB and it showed a ridiculous number of orphaned files for about an hour. Once it finished, the most frequently accessed files had disappeared: the Downloads folder, the Documents folder, some system files. Of course, some of the files could have been recovered had I not run chkdsk off the bat, but nonetheless it's an approximate measure of the failure's impact.
I first became suspicious of the 840 EVO when sorting old files by date became fantastically slow. If you have a feeling this has happened to you recently: buckle up for a shitstorm.
TL;DR Avoid 840 EVO.
Not to mention that this disk has only had 5TB written to it.