Ask HN: What distributed storage technology are you using? - gtirloni
======
notmyname
I use (and contribute to) OpenStack Swift.

It's an object storage engine (think S3, but it's open source and you can put
it in your own data center) that's excellent at storing unstructured data.

It's completely deployable and usable without any other OpenStack projects.

There's S3 API compatibility for it. It supports globally distributed
clusters. It supports multiple storage policies that can be either replicated
or use erasure coding. It's designed for very high availability, very high
durability, and high aggregate throughput.
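
To make the replicated vs. erasure-coded trade-off concrete, here is a rough sketch of the raw storage cost of each. The function names and the 3-copy and 10+4 figures are illustrative assumptions, not defaults taken from Swift or from any particular cluster.

```python
def replication_cost(copies):
    """Raw bytes stored per byte of user data when keeping full copies."""
    return float(copies)


def erasure_coding_cost(data_fragments, parity_fragments):
    """Raw bytes stored per byte of user data for a data+parity scheme."""
    return (data_fragments + parity_fragments) / data_fragments


# Illustrative numbers only (assumed, not from the thread):
# 3 full copies survive the loss of any 2 copies.
print(replication_cost(3))
# 10 data + 4 parity fragments survive the loss of any 4 fragments.
print(erasure_coding_cost(10, 4))
```

Under these assumed parameters the erasure-coded policy stores less than half the raw bytes of the replicated one, at the price of extra CPU and network work to encode and rebuild fragments.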

One of my favorite features is being able to create sharable, expiring signed
URLs to any object in the cluster.
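
As a minimal sketch of how such a URL can be built client-side: Swift's TempURL middleware signs the string "METHOD\nEXPIRES\nPATH" with an HMAC and a secret key stored on the account or container. The helper name `make_temp_url`, the host, the account, and the key below are all made up for illustration, and which digests a cluster accepts depends on its configuration (older deployments used SHA-1), so treat the default here as an assumption.

```python
import hmac
from time import time


def make_temp_url(base_url, path, key, ttl_seconds,
                  method="GET", digest="sha256", now=None):
    """Build a signed, expiring URL for one object (hypothetical helper).

    path is the object path as the proxy sees it, e.g.
    /v1/AUTH_account/container/object. key must match the Temp-URL-Key
    set on the account or container.
    """
    now = int(time()) if now is None else int(now)
    expires = now + ttl_seconds
    # The signed body is method, expiry timestamp, and path, newline-separated.
    body = f"{method}\n{expires}\n{path}"
    sig = hmac.new(key.encode("utf-8"), body.encode("utf-8"), digest).hexdigest()
    return f"{base_url}{path}?temp_url_sig={sig}&temp_url_expires={expires}"


# Example values are placeholders, not a real cluster or key.
print(make_temp_url("https://swift.example.com",
                    "/v1/AUTH_demo/photos/cat.jpg",
                    key="not-a-real-key", ttl_seconds=3600))
```

Anyone holding the resulting URL can fetch that one object until the expiry time passes, without needing an auth token.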

Some of the common uses for Swift include storing user-generated content (e.g.
images, videos, game saves), static web assets, movies, scientific data sets,
backups, document sharing, VM and container images, etc.

API docs: \- [https://developer.openstack.org/api-ref/object-storage/](https://developer.openstack.org/api-ref/object-storage/)

Docs: \- [http://swift.openstack.org](http://swift.openstack.org)

Vagrant All-In-One setup: \- [https://github.com/swiftstack/vagrant-swift-all-in-one](https://github.com/swiftstack/vagrant-swift-all-in-one)

Come say hi! \- #openstack-swift on freenode IRC (I'm notmyname)

~~~
copperx
Just a quick question, how good is the S3 compatibility layer? Can I switch,
say, a Rails app to use Swift easily?

~~~
tarikozket
We had been using S3 in the past and needed to move to Swift due to high cost
of S3. Our integration just took a day. Most of the things went smoothly and as
far as I remember, only some of the meta properties were different, that's
all. I can't recommend Swift highly enough.

------
rarrrrrr
I haven't used it extensively, but I've read some of the source code and I'm
excited about where Minio is going -- the erasure coding storage capability in
particular: [https://www.minio.io/](https://www.minio.io/)

------
ianopolous
The question is quite open ended as to whether it means backup, or something
else.

I use IPFS. IPFS is great for sharing multi-gigabyte size files between
machines in a cluster, bit-torrent style. In my case it is a couple of hundred
Amazon spot instances that can come and go very fast and need to get the data
ASAP to start some calculation, the same data for all nodes.

~~~
Something1234
I really want to know what you are working on, in that it can just be in the
mist.

~~~
ianopolous
My background is particle physics and it is quite a common paradigm there.
Many people, or indeed the same person, want to run many calculations over the
same large dataset. The results clearly need to be posted somewhere and are
generally much smaller than the input data, so this isn't a problem.

------
Veratyr
I'm not currently using any but I've tried:

\- Ceph: Very flexible. Supports many different kinds of replication. Has high
overhead compared to local disk (on the order of ~50%) and was (for me) prone
to hard-to-diagnose issues. Can be annoying to set up if you're not doing it on
a supported Linux distro with ceph-deploy. It looks like Bluestore (a new on-
disk format for data) will significantly improve performance but Bluestore is
extremely RAM hungry.

\- GlusterFS: Much faster than Ceph but less flexible. Has odd requirements
about "bricks" being the same size. Much less RAM hungry than Ceph.

\- A bunch of smaller ones I can't recall. Mostly discarded because they
performed badly or lacked replication options (I really wanted erasure
coding).

In the end I'm simply sharding my data manually. It's not as scalable but it's
much more performant.
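
For readers wondering what manual sharding can look like, here is a minimal hash-based sketch. It is an assumed illustration, not the scheme described above: the function name `shard_for` and the mount points are hypothetical.

```python
import hashlib


def shard_for(key, shards):
    """Pick a shard for a key by hashing it (illustrative helper)."""
    if not shards:
        raise ValueError("need at least one shard")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Use the first 8 bytes of the hash as a stable integer.
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]


# Hypothetical mount points standing in for separate disks or hosts.
SHARDS = ["/mnt/disk0", "/mnt/disk1", "/mnt/disk2"]
print(shard_for("datasets/run-42/part-0001.bin", SHARDS))
```

The same key always lands on the same shard, which keeps lookups cheap. The scalability cost is that changing the number of shards remaps most keys, which is the problem consistent hashing is meant to reduce.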

~~~
mattbillenstein
Take a look at MooseFS -- it's been several years since I eval'd all these
options and MooseFS was the best -- it is a very sensible system.

edit - this is distributed block storage -- if you just need object storage,
perhaps something else is in order.

~~~
throw2016
I think MooseFS is file storage, not block, like Gluster.

I actually found it better than Gluster in terms of robustness and
performance. It's got support for multiple masters, failover and a nice
dashboard.

One of those projects that should be much more well known than it is given
there are few open source distributed storage solutions. The MFS devs are good
but maybe lack the marketing savvy or perhaps are just happy where they are.

~~~
mattbillenstein
I also had a lot of issues with Gluster -- when it was broken, it was hard to
know what was wrong. With MooseFS, the dashboard would usually give you a good
idea, and the logs were pretty good at showing how it was replicating data
after a node or disk failed.

------
jsiepkes
We're using LinkedIn's Ambry
([https://github.com/linkedin/ambry](https://github.com/linkedin/ambry)). It's
an easy-to-use distributed object store. It's basically a self-hosted version
of Amazon S3.

------
devonkim
Not very popular compared to the likes of Swift and Ceph, but we've been using
Cleversafe for media storage and it's been super solid and performant. It has
adapters for S3, Swift, and even FTP and NFS (supposedly according to some
press releases I've seen from years ago). Outside of IBM not sure who the heck
is using it honestly. It's a pity, it's got some pretty interesting technology
under the hood for cost-effective replication and availability.

------
stevekemp
Right now I'm using a home-made system, which is a simple object store
replicated across six nodes:

[https://github.com/skx/sos/](https://github.com/skx/sos/)

I think in production I've used NFS, DRBD, GlusterFS and OpenStack. Each has
its pros and cons, and without a precise set of constraints it's hard to
know how to usefully answer any question of the form "Which would you
recommend? Why would you choose this?"

Distributed storage tends to be required either because you want redundancy,
availability, or because your "stuff" is too large for a single box to host.
But with a vague question it could mean "How do you back up boxes?" or
something entirely different. (For example "distributed storage" could end up
mapping to a pair of MySQL hosts, or a replicated PSQL database..)

------
cannonpr
While the question doesn't really specify use case, I am surprised HDFS hasn't
come up yet. I guess this is focusing a lot more on online object stores for
web use versus data processing distributed storage.

------
nakkaya
For backups, I have two git-annex [1] repositories,

\- Personal files, stuff I can not afford to lose (photos, documents, etc.) -
Full archive on S3, full archive on a home server, 4 clients with partial
copies.

\- Big data stuff I can afford to lose (VM images, media files etc.) - Around
6 TB, each file has two copies split between 5 hard drives on home server and
Hubic.

[1] [https://git-annex.branchable.com/](https://git-annex.branchable.com/)

~~~
tomfitz
How do you back up your git-annex repositories?

~~~
nakkaya
All clients have a copy of the git repo plus a server on Digital Ocean.

------
cju
I'm starting to use Sia (sia.tech) to back up some heavy files, starting with
data I can afford to lose (even if I'm quite confident in the tool).

------
nunez
Thoughts on Gluster? I've used it for pet projects; really simple to get going
initially.

------
sushanthiray
I'm currently using Google Cloud Storage for storing and archiving data. Using
regional storage has helped us while running production jobs which ingest this
data. Once the data is processed, we move to coldline storage for archiving.

------
_sy_
At Instamotor we use Elasticsearch, Redis, and S3 for files/photos. In my
previous job (Nest), we looked at a decent number of options and ended up
going with Cassandra.

------
zhynn
LustreFS (intel enterprise lustre) using 10gbps networking.

~~~
fusiongyro
We use this extensively at the NRAO and mostly love it.

~~~
hlieberman
Have you thought about investing in InfiniBand? Even a generation or two old
beats the pants off of 10GigE in price and performance. As long as you're on a
LAN....

~~~
fusiongyro
We are using Infiniband in Socorro. We just established 10Gb ethernet from
Charlottesville to Socorro last week. The sites with instruments typically are
a little ahead of the main office in terms of connectivity.

------
hbogert
Ceph and ElasticSearch. Elastic is even running on Ceph for now. That's a bit
yuk though; they should both live on physical machines.

------
sidcool
Google Cloud Storage is working quite beautifully for me. S3 would be similar
I believe.

------
Rapzid
Currently S3, Hadoop, DynamoDB. Probably Elasticsearch again before too long.

------
imsofuture
We use Pithos (S3-compatible API, Cassandra backend) pretty successfully.

------
cbryan
S3

------
nikentic
We're using Ceph exclusively

