
Building a high performance SSD SAN, part 1 - runarb
http://smcleod.net/building-a-high-performance-ssd-san/
======
StillBored
I wrote a longer version of this, but I'll summarize: not all of the
commercial flash systems are x86 boxes driving SSDs. Disclaimer: I don't work
for IBM, and have no commercial interest in what I'm about to say.

I had the pleasure (a rare statement for me) of recently having access to a
TMS/IBM FlashSystem 840 for a year. Frankly, if the goal is shared,
network-accessible performance, nothing you can build with PCs and SSDs will
come close. The on-the-wire latency and bandwidth that even a single 2U
(60TB) unit provides is fairly shocking, and that performance doesn't degrade
over time, even under extreme load. The official performance numbers on the
IBM site are actually conservative, which is unusual in itself. The units are
all custom (in a way that is a negative, because they aren't cheap) and built
like tanks. From an end-user perspective, though, they are very simple: they
provide RAIDed volumes on the SAN and nothing else. You're on your own for
replication and the like, although I think they have a canned solution for
that now.

Bottom line, they cost a mint, and are built like old-school, over-engineered
IBM hardware. But you won't be kicking yourself a few months down the line as
you baby your custom storage solution. Normally I'm the guy building stuff
like this, but sometimes I would rather just sleep at night than get called
because the garbage collector kicked in on an SSD and it dropped out of the
RAID in the middle of a big load. Plus, being IBM, you can probably get a
unit under a POC agreement, just to see how it compares.

~~~
johngalt
> you won't be kicking yourself a few months down the line as you baby your
> custom storage solution. Normally i'm the guy building stuff like this, but
> sometimes I would rather just sleep at night...

"Even Southwest doesn't build their own airplanes." I have this argument with
myself all the time. If you don't have an IT pro who works on storage every
day, then buy something from someone who does work on storage every day. Their
time is worth the price. Your time is worth it too.

~~~
mrmondo
I've taken this from a reply I made to a comment on my blog:

Support

The 4-hour on-site response offered by vendors is not good enough for us, and
improving on that is expensive. By using more standard components we are able
to be on site at either our primary or secondary datacenter within 30
minutes, with spare parts in hand if required.

To pick one vendor as an example, the support we've received from HP has been
atrocious - they've caused more outages than they've been able to fix. The
engineers they send to smaller organisations are generally relatively
incompetent, and they have certainly shown that they don't care about your
uptime.

Proprietary storage systems offered by HP, Dell and EMC are not only
expensive to purchase and license, but they're also very time-consuming to
manage, as they're essentially a 'snowflake' in your infrastructure. It's
hard to make them integrate with modern automation tools such as Puppet, or
with CI requirements, and they all use their own management tools that are
specific to the vendor or product range - usually this involves having a
Windows VM running Java or some equally frustrating technology to manage the
system. Performing updates on proprietary systems can often be painful, as
hardware vendors are generally not very good at designing software.

It's very hard to outsource quality and it comes at a large cost.

~~~
StillBored
You have hit the nail on the head for why most "enterprise" gear is basically
overpriced junk. That doesn't mean there isn't quality hardware out there,
just that you have to be more selective. In other words, do your own research
and don't be dazzled by the feature lists. Some of those features shouldn't
actually be used.

I personally tend to like the KISS arrays that don't have
dedupe/replication/etc. built in, and are web-manageable. The extra bonus is
that there is a metric boatload of tier-2 array vendors (Imation's Nexsan,
for example) that provide rock-solid hardware for a small fraction of the
prices of EMC et al. It's quite possible to get native capacity for less than
a company like EMC charges for deduped capacity (e.g. 100TB from a tier-two
company can cost less than the 10TB deduped to 100TB from EMC). Raw RAID is
pretty simple and well understood in comparison to deduped solutions, and I
think that has a significant effect on reliability. More features, more
latent bugs...

Finally, get a bunch of demo units, and if the configuration UI is a mess of
esoteric proprietary command-line junk, or it is only configurable with a
32-bit Java app that won't run on 64-bit Windows (yes, I've seen that), then
send the unit back. Be clear about why it won't work for you. The only way to
convince these companies to behave is to show them lost sales.

------
mrmondo
I'm the author (Sam Mcleod) of this post and just saw this was posted on HN.

I'm quite a bit further down the track now and should have a much better
write-up to post in the next week or so, time permitting.

I can say that I'm hitting over 1,000,000 IOPS on 4k random reads and around
850,000 IOPS on 4k random writes per 1U unit, without cutting any corners on
data safety.
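
For a rough sense of scale, a back-of-the-envelope calculation (not a
measured figure from the build) of the raw bandwidth those IOPS figures imply
at a 4 KiB block size:

    # Back-of-the-envelope only: bandwidth implied by the quoted IOPS numbers
    # at a 4 KiB block size; not a measured result.
    read_iops, write_iops, block_bytes = 1_000_000, 850_000, 4 * 1024

    print(f"reads:  {read_iops * block_bytes / 2**30:.1f} GiB/s")   # ~3.8 GiB/s
    print(f"writes: {write_iops * block_bytes / 2**30:.1f} GiB/s")  # ~3.2 GiB/s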

What I haven't finished yet is the cross-unit replication tuning and the load
balancing / multipathing.

So far everything has gone to plan and is progressing well as a serious
replacement for our 'traditional' storage.

Regardless, there's still quite a lot of testing to be done, and once I have
more of that complete I'll post an update with some serious numbers and
observations.

~~~
moe
_cross-unit replication tuning and the load balancing / multipathing._

That's going to be interesting. At those speeds (1.5GB/s, 850k IOPS) you must
bump[1] into all kinds of DRBD bottlenecks?

Also I'm very curious about your approach to redundancy/multipath. Dual-
primary or failover?

[1]
[http://blog.gmane.org/gmane.comp.linux.drbd/page=41](http://blog.gmane.org/gmane.comp.linux.drbd/page=41)

~~~
mrmondo
I have no illusions that I'll be able to maintain such performance with DRBD
replication and iSCSI overheads.

My current feeling on load balancing is to keep it simple: have half the
clients treat node A as their active storage and the other half use node B.
If node A becomes unavailable, failover will occur and node B will serve all
clients.
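
As a sketch of the idea only (not an actual config - the hostnames below are
made up), the policy is just a static preference per client with a fallback
to whichever node is still alive:

    # Sketch of the load-balancing idea: clients are statically split across
    # the two nodes and fall back to a surviving node if theirs is down.
    # Hostnames are hypothetical.
    PREFERRED = {"web01": "san-a", "web02": "san-b",
                 "db01": "san-a", "db02": "san-b"}

    def storage_node_for(client, available):
        node = PREFERRED.get(client, "san-a")
        return node if node in available else next(iter(available))

    print(storage_node_for("db01", {"san-a", "san-b"}))  # normal: san-a
    print(storage_node_for("db01", {"san-b"}))           # node A down: san-b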

As far as multipathing goes - my approach is that I honestly have no idea.
That's something that's going to be learnt along the way, and I'm uncertain
whether it'll make it into the final build or not.

Edit: I can't get that link to load on my connection at present but I've
bookmarked it to look at tomorrow. If you have any experience / advice I'm
always very open to assistance!

~~~
FlyingAvatar
I was quite amused to see this posted, as we built nearly the exact same
thing about 6 months ago (minus the NVMe drives, which weren't available). We
are using ZFSonLinux on top of DRBD instead of mdadm, which has been
surprisingly successful.

The biggest frustration we hit was with LACP, in trying to get a better data
rate out of DRBD. Regardless of the LACP mode, we were not able to get DRBD
to scale to the full bandwidth (20Gbps) and had to settle for using it for
redundancy only.

I imagine with your setup (read: much faster peak write capability), you
might actually be able to bottleneck DRBD pretty well over a single 10GbE
pipe if you were writing at peak capacity. I'll be interested to see if you
happen upon a workaround.

Load balancing as you've suggested is an intriguing compromise. Have you
tried doing a two-way heartbeat setup where both boxes have their own active
IP and can fail over to each other?

~~~
tedchs
I wonder if the lack of help from LACP was because DRBD is using only 1 TCP
session, which would only get hashed to one or the other member link?
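
Roughly what I mean, as a toy model (the hash and field values are made up
for illustration - real switches and bonding drivers differ): a layer3+4
style transmit hash maps each flow to exactly one member link, so a lone DRBD
session can never spread across both ports.

    # Toy model of a layer3+4-style LACP transmit hash. Field values are
    # hypothetical; real implementations use different hash functions.
    import hashlib

    def member_link(src_ip, dst_ip, src_port, dst_port, n_links):
        key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
        return int(hashlib.md5(key).hexdigest(), 16) % n_links

    # A single DRBD TCP session (7789 is a typical DRBD port): the tuple
    # never changes, so every packet hashes to the same member link.
    print(member_link("10.0.0.1", "10.0.0.2", 49152, 7789, 2))

    # Many client flows with different source ports spread across both links.
    print({member_link("10.0.0.1", "10.0.0.2", p, 3260, 2)
           for p in range(49152, 49180)})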

~~~
cgb_
Might be a case where MPTCP ([http://www.multipath-
tcp.org/](http://www.multipath-tcp.org/)) would help with aggregating multiple
links.

~~~
mrmondo
Correct me if I'm wrong but I'm not sure MPTCP would help as LACP is generally
SRC/DST MAC based and shares a uses a single IP?

~~~
cgb_
You are right in that LACP bonding methods cannot increase the throughput of
a single flow beyond that of any single link in the group (the balancing
method can utilise src/dst MAC, IP and sometimes L4 ports).

MPTCP establishes multiple subflows across individual IP paths and can
load-balance or fail over across all subflows. Applications do not need to be
rewritten to take advantage of it. I'm not sure if that includes kernel
modules like DRBD though. I suppose someone needs to find out :)
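
For what it's worth, here's roughly what a userspace MPTCP connection looks
like - a sketch only, assuming a kernel that exposes IPPROTO_MPTCP to
applications (the multipath-tcp.org patchset instead upgrades ordinary TCP
sockets transparently, which is why no application changes are needed there).
The address and port below are made up, and whether DRBD's in-kernel sockets
can use it is exactly the open question:

    # Sketch: an MPTCP client socket. Otherwise identical to plain TCP; the
    # kernel negotiates and schedules subflows across the available paths.
    import socket

    IPPROTO_MPTCP = getattr(socket, "IPPROTO_MPTCP", 262)  # 262 on Linux

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM, IPPROTO_MPTCP)
    s.connect(("192.0.2.10", 7789))  # hypothetical replication peer
    s.sendall(b"replication traffic could fan out over several links here")
    s.close()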

~~~
mrmondo
I could have a play and report back if I get time?

------
jjoe
I understand his decision to skip the HW RAID controller (I like mdadm too).
But a BBU is definitely critical to this op. NVMe has so many more queues that
on a busy box it's going to hold way too much uncommitted data and metadata.
Now the only other reasonable option is a UPS but it adds significant RUs/cost
to the setup.

~~~
runarb
I am no expert, but are you sure that uncommitted data on the SSD is an
issue?

The Intel DC P3600 has a built-in capacitor that should give it enough backup
power to commit the data. The SanDisk Extreme Pro disks don't have a volatile
cache, but instead use SLC NAND flash for caching, which will survive a power
loss.

There is of course still the issue that data sent to the server and stored in
main memory will be lost if you lose power, but a HW RAID controller with
battery backup would not have prevented that (though a UPS of course might).

~~~
otterley
It would definitely prevent an uncommitted data problem if fsync or O_DIRECT
were used (which should always be used for critical writes).

UPSes defend against power outages, but they're only one rung on the data-
integrity ladder. Controller BBUs, on the other hand, protect against both
power outages AND kernel panics.
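
To make the first point concrete, a minimal sketch of a durable write (the
path and payload are made up; O_DIRECT is the other route, but it needs
aligned buffers, so plain write-plus-fsync is the simpler pattern to show):

    # Minimal sketch: the write isn't considered done until fsync returns,
    # so it can't be lost from the page cache alone on a power failure.
    # Path and payload are hypothetical.
    import os

    def durable_write(path, data):
        fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
        try:
            os.write(fd, data)
            os.fsync(fd)  # block until data and metadata reach stable storage
        finally:
            os.close(fd)

    durable_write("/srv/journal/record.bin", b"critical record")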

~~~
mrmondo
Indeed, we have fsync enabled on our database servers (PostgreSQL).

~~~
mrmondo
(And just wanted to point out that that is only one part of the equation)

------
mrmondo
It would be typical that this was posted just before I went to sleep last
night!

I've been flooded with messages, comments and feedback overnight and am slowly
catching up on most of them.

I wanted to say one thing: My blog post that was linked here was never
intended to be much more than a brain dump at the start of a journey. It
contains little-to-no useful information other than the idea itself and it
doesn't have any technical information or nearly enough background as to why
I'm exploring this option or why I think there is a valid use case for it.

Give me a week or two and I'll have an updated post with a proper background
story and experience with traditional solutions, my observations so far, some
real technical content and of course some serious numbers.

Thank you so much to all the people who have contacted me, both from HN and
directly, to offer insight into their experiences.

------
hengheng
What is the reason to go with RAID1 for the Intel SSDs? My guess would be
that the rate of RAID1-catchable failures of these drives is not considerably
higher than the rate of, say, mainboard failures or other failures that
render the whole unit out of operation.

~~~
mrmondo
The only reason is to protect against per-device failures - the technology is
new, and while it's meant to have astounding resiliency and a long lifetime,
I don't want to take that chance. Down the track, when they have proved
themselves to us, I'll certainly reconsider this.

------
KaiserPro
I'm always interested in the outcomes of these types of experiments.

One thing that I find interesting is that there's no mention of what network
card you are using, or the switch setup, or IP config. I seem to remember
from the dim and distant past that letting the storage layer handle the paths
instead of LACP (which on some switches is active-passive) was
better/faster/easier. (That could be bollocks though.)

I used to work in VFX, where performance and cost must be balanced. For the
storage lumps we had a 1U Dell with the fastest 10-core v3 processor in it
and 384 gigs of RAM (RAM is a cheap and fast cache), tied to a Dell 3060 (60
4TB disks in a 4U case, with up to 4 SAS connections out; there are many
other options - it does iSCSI too).

These topped out at about 2 gigabytes/second (or about 1.3 gigabytes/second
with 100 concurrent fios from NFS clients), but there are 30+ of them, so the
aggregate was enough to douse half the render farm in IOPS.

A couple of things that stuck out for me:

o 24/7 support contracts are amazing.
o Auto phone-home disk replacements are also life savers.
o Multipath iSCSI can be painful unless you have decent switching (with OSPF)
  and OS support.
o Over short distances, SAS > FC >> iSCSI for block storage; it's pretty much
  plug and play (multipath is easy too).
o Replication needs a different path, separate from the clients.
o Mellanox or Intel NICs are the only 10gig cards worth paying for. (The only
  thing going for Broadcom is that they are supported out of the box by
  VMware.)
o File systems always fail.
o For random IO, RAM > SSD by a massive factor, especially if your active
  dataset is smaller than the RAM cache.
o DRBD will be as slow as the slowest link in your network.
o Ethernet packet size really matters (and depends on your workload).
o Synthetic benchmarks are the chocolate teapots of the storage world.
o Paying for decent support saved me (and by extension the company(s)) at
  least twice.

Yes, the Dell/HP route is more expensive in capital outlay. However, you can,
if you so wish, go the white-box route: the Dell RAID enclosure is a rebadged
NetApp/LSI/Engenio box, and DotHill do a similar one.

The important thing to note, though, is that the opex and R&D are far, far
cheaper with prebuilt systems. With a click of a button you get replication.
If you are using VMware (I'm sure KVM has a similar system) it can do it as
well. The performance is predictable, and the configuration and tuning has
been done more or less for you (excluding the RAID layout, FS tuning and
network setup - but then you're a sysadmin who knows how to do all that,
right?).

There is nothing worse than having to debug a storage system so unusual or
highly configured that googling won't help. You feel terribly alone. Even
more so if you don't have decent backups.

If you don't have that expertise, it'll cost you either time or budget for a
storage guy. (So just do some testing and buy the fastest prebuilt system for
your budget.) For a previous place, I tested 7 different systems to get the
best fit (that included home-rolled). A lot of the time was spent modelling
the storage workload and devising tests that would proxy that kind of load on
the test systems.

END_OF_RAMBLE

~~~
mrmondo
I actually have to shoot off but wanted to say that I enjoyed your comment and
I'll have a chance to reply tomorrow morning - this post of mine was sort of a
quick brain dump and massively lacking in almost any useful detail other than
the idea itself. Anyway, as I said I'll reply tomorrow and hopefully shed some
light on where I'm coming from and get your thoughts on that.

------
PaulHoule
Enterprise storage systems made more sense back in the spinning disk era when
the need to minimize latency caused by physical motion was acute. They're
definitely struggling now to produce products that really add value to SSDs.

~~~
KaiserPro
For direct-attached storage, possibly.

But it's simply not the case to say they are struggling. IBM, yes.

NetApp and EMC are still growing, just.

Contrary to popular belief, spinning disks are here for the next few years at
least.

Especially when you compare disk performance with EBS "SSD" performance.

128 gigs of "SSD" EBS storage gives you ~380 IOPS - the same performance as
two 15k disks. We are very much still in the era of spinning-disk performance
if Amazon can charge a premium for that kind of storage.

~~~
jbooth
What does the derivative of that growth look like, though? It could be that
they just have enough customers who are either locked in or on the back half
of the adoption curve, allowing them to limp along with some growth for a few
more years before it's over.

Amazon charges a premium for convenience and for the fact that there's zero
startup cost to get going with them. Why would someone ever shell out the
money for an enterprise 15k disk when they can buy an SSD instead?

~~~
baruch
The 15k HDDs will likely be replaced by SSDs; the 7200rpm ones are likely to
stay for a long while yet. Their capacity is as yet unmatched: SSDs are
trying to catch up with higher densities, but currently HDDs are ahead of
them on the capacity curve and definitely ahead on cost per GB.

I just recently switched my laptop to an SSD and the change is amazing, but
I'm not going to replace even the 1TB HDDs in my NAS with equivalently sized
SSDs, both from a needs view and from a cost view.

------
engendered
_if you’ve got lots of money and you don’t care about how you spend it or
translating those savings onto your customers_

This is a baseless, prejudiced claim, contrasting buying with some idealized
scenario in which your in-house crew can only possibly build better solutions
more cheaply - as if there were no other outcomes.

But they might also cost far _more_ in manpower costs than any premium. They
might give you an unreliable solution that costs you your entire business.
Such an effort might distract from the core competencies of the organization.
Operating that storage might end up dwarfing the up-front cost (it's easy to
hire admins knowledgeable of EMC. Quite a different matter when it's your own
home-brew solution).

I understand the draw, but the "all upside" claim undermines the entire piece.
There are enormous downsides, not least the reliability of your data.
Companies like EMC and Nimble -- despite "great" editorial quotes added by
some random person on Wikipedia -- base their entire existence on reliably
serving your data, and things like multipathing and replication are the
absolute minimum cost of entry in the market. Now add automatic tiering, thin
provisioning, disk-deduplication and streaming hardware compression, etc, and
the value starts to become evident.

EDIT: The moderation throughout this thread is an abomination. HN shouldn't
be overly critical, but nor should it pander patronizingly to some tripe just
because the author happened by.

~~~
KaiserPro
EMC _are_ expensive, and Nimble are OK for light VM performance.

However, they provide a service. As the OP says, it's a bit broad to claim
that they are a pointless cost. Having dealt with both, I can tell you that
it's very rare that own-brand storage is cheaper in the long term, unless:

o you run at significant scale, or
o you have enough data to warrant a team of skilled storage admins.

Bear in mind that decent storage admins cost the same as a medium-sized
array, per year, per person. Opex can quickly overtake capex.

~~~
mrmondo
Actually we've found quite the opposite: because proprietary vendor
management tools are generally hard to automate and integrate with existing
frameworks such as Puppet, the time it takes to administer them is far
greater than for any of our more standard systems. I just replied to a
similar observation from
someone else here: [http://smcleod.net/building-a-high-performance-ssd-
san/#comm...](http://smcleod.net/building-a-high-performance-ssd-
san/#comment-1949813997)

