A lot of companies using Ceph at scale are facing huge issues (OVH, etc.), so he is not wrong. Why take the risk of going with a solution that is known to cause issues?
I've talked to a lot of large-ish commercial Ceph customers, and they seem to spend a lot of time building kludge-arounds for support. They also tend to live terrified that the whole clumsy edifice will come crashing down at the cost of their jobs.
Also, Ceph is block, object and file. Block is OK up to a point, object is dubious, and file is utterly untrustworthy, at least at any kind of real scale; three servers in a rack aren't "scale".
Why must someone who isn't a Ceph fan (and I fail to see why storage systems are a "fan" activity) live in the evil pockets of EMC? I know people who've smoked for years and don't have any sign of lung cancer either.
OVH isn't exactly a shining example of a quality engineering organization. Simple web searches show how they have misused things and caused large outages.
Ceph is very reliable and durable. We've actually gone out of our way to try and corrupt data, but we failed every time. It always repaired the data correctly and brought things back into a good working state.
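For anyone curious what that kind of testing can look like from the client side, here's a minimal sketch using the python-rados bindings: seed a pool with objects whose contents are derived from their names, then re-verify them after you've pulled disks, killed OSDs, or flipped bits on the backing store. The pool name, object count, and conf path below are illustrative assumptions on my part, not what we actually ran.

    #!/usr/bin/env python3
    # Minimal durability probe (a sketch, not our actual harness).
    # Assumes python-rados is installed, /etc/ceph/ceph.conf is readable,
    # and a pool named "durability-test" already exists -- all illustrative.
    import hashlib
    import sys

    import rados

    POOL = "durability-test"   # hypothetical pool name
    COUNT = 1000               # number of probe objects

    def expected_payload(name: str) -> bytes:
        # Deterministic per-object content, so verification needs no local state.
        return hashlib.sha256(name.encode()).digest()

    def main(phase: str) -> None:
        cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
        cluster.connect()
        try:
            ioctx = cluster.open_ioctx(POOL)
            try:
                if phase == "seed":
                    for i in range(COUNT):
                        name = f"probe-{i}"
                        ioctx.write_full(name, expected_payload(name))
                else:
                    # Run this after pulling disks, killing OSDs, scrubbing, etc.
                    bad = 0
                    for i in range(COUNT):
                        name = f"probe-{i}"
                        try:
                            ok = ioctx.read(name, 64) == expected_payload(name)
                        except rados.ObjectNotFound:
                            ok = False  # lost outright, not just altered
                        if not ok:
                            bad += 1
                    print(f"{bad} of {COUNT} probe objects failed verification")
            finally:
                ioctx.close()
        finally:
            cluster.shutdown()

    if __name__ == "__main__":
        main(sys.argv[1] if len(sys.argv) > 1 else "verify")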
CERN and Yahoo run very large Ceph clusters at scale, too.
You can use Ceph together with OpenStack. They used Ceph for their cloud services but had huge problems. If I'm not mistaken, they have completely thrown Ceph out by now.
Any idea what the underlying issues with Ceph were?
My story is a bit dated, but we went from Gluster to Ceph to MooseFS at one startup. Gluster had odd performance problems (slow metadata operations; scatter/gather RPCs and whatnot, I would guess) and it was hard to tell from the logs what was going on.
Ceph was very very early at this point, but part of it ran as a kernel module and the first time it oops'd, I deleted that with fire. MooseFS ran all in userspace, had good tools for observability into the state of the cluster, and the source code was simple and clean. It didn't have a good story around multi-master at that time, but I think that is improved now.
Ceph is extraordinarily complicated to run correctly. The docs aren't great and commercial support is pretty mediocre.
It's an amazing piece of software, but takes a great deal of engineering to get right. Most folks won't invest that much engineering into their storage.
This is why providers like EMC and NetApp can extract 10x the cost of the raw storage from enterprises.
The Red Hat Ceph docs are great and open to everyone for free.
The Red Hat commercial support has been pretty good for us. We presented them with two bugs, and they addressed both. One took a few weeks, but the other took only a few hours to get a hotfix started.
EMC storage is absolute trash post-Dell merger. A pure, 100% dumpster fire. Their customers know their systems better than they do. It's pathetic.
No clue what the underlying issue was, but when reading:
"We have about 200 harddisk in this cluster... 1 of the disks was broken and we removed it. For some reasons, Ceph stopped to working : 17 objectfs are missed. It should not."
Has nothing to do with EMC.