
HSE: Heterogeneous-memory storage engine designed for SSDs
https://github.com/hse-project/hse
======
haneefmubarak
Looks pretty cool when you make it to the GitHub ([https://github.com/hse-
project](https://github.com/hse-project)). Order of magnitude performance
gains! I imagine most of that comes from skipping the filesystem layer and just
hitting the raw block layer directly.

I am curious about the durability and how well tested all of that is though.
On the one hand, filesystems put a lot of work towards ensuring that bytes
written to disk and synced are most likely durable, but OTOH Micron is a
native SSD vendor so they've probably thought of that.

I'm also curious whether RAIDing multiple SSDs together at the block layer and
then running HSE on top of that will be faster or whether running multiple HSE
instances (not the right word, it's a library, but you get what I mean) with
one per drive and then executing redundantly across instances would be faster.
The argument for the former is that with separate instances, each one would have
to redo the management work; the argument for the latter is that there's probably
synchronization overhead within the library, so running more instances in
parallel should allow for concurrency and parallelism gains.

~~~
wtallis
> I am curious about the durability and how well tested all of that is though.
> On the one hand, filesystems put a lot of work towards ensuring that bytes
> written to disk and synced are most likely durable,

All of the SSDs that this software might be deployed to have power loss
protection capacitors to ensure the drive can flush its write caches when
necessary. So this software only needs to make sure that the OS actually sends
data to the drive instead of holding it back in an IO scheduler queue (as you
point out, they're already bypassing the FS layer). Since this software should
be pretty good at structuring its writes in an SSD-friendly pattern, the
operating system's IO scheduler should probably just be disabled.
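
A minimal sketch of what that looks like in practice, assuming an NVMe drive
named nvme0n1 (hypothetical; adjust for your system) and root privileges:

    import pathlib

    # Hypothetical device name; adjust for your system.
    knob = pathlib.Path("/sys/block/nvme0n1/queue/scheduler")

    # The active scheduler is shown in brackets, e.g. "[mq-deadline] kyber none".
    print(knob.read_text().strip())

    # Select "none" (the blk-mq successor to noop) so requests go straight to
    # the driver queue; requires root.
    knob.write_text("none")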

~~~
ignoramous
> _...the operating system's IO scheduler should probably just be disabled._

For NVMe devices, the default on RHEL/Fedora has long been _none_ (the
multi-queue successor to _noop_). I'd be surprised if other distributions
haven't followed suit, given the prevalence of SSDs.

------
aloknnikhil
> [https://github.com/hse-project/hse](https://github.com/hse-project/hse)

Their benchmarks show significant gains compared to RocksDB.

> [https://github.com/spdk/rocksdb](https://github.com/spdk/rocksdb)

But what I'd really like to see is a comparison against RocksDB using SPDK.

>
> [https://dqtibwqq6s6ux.cloudfront.net/download/papers/Hitachi...](https://dqtibwqq6s6ux.cloudfront.net/download/papers/Hitachi_SPDK_NVMe_oF_Performance_Report.pdf)
> Based on these results, SPDK performs significantly better than the kernel
> requiring only 1-2 cores to saturate IOPS on an NVMe SSD (compared to the
> kernel requiring 16)

~~~
ignoramous
SPDK has had its share of detractors here on news.yc [0], especially with
_io_uring_ around the block [1]. It'd be interesting to see the improvements
to these IO-centric applications once they move to _io_uring_ [2], which, like
RocksDB, is in a way sponsored by Facebook [3].

[0]
[https://news.ycombinator.com/item?id=10511960](https://news.ycombinator.com/item?id=10511960)

[1]
[https://news.ycombinator.com/item?id=22266503](https://news.ycombinator.com/item?id=22266503)

[2]
[https://news.ycombinator.com/item?id=19843464](https://news.ycombinator.com/item?id=19843464)

[3] [https://lkml.org/lkml/2014/1/24/252](https://lkml.org/lkml/2014/1/24/252)

~~~
benlwalker
io_uring is a fantastic development for the kernel, and I really can't praise
it enough.

However, there are still lots of reasons to use SPDK. Performance is still
significantly better[0], and you can directly access all of the NVMe features
on the device without going through any abstraction layers.

[0]
[https://spdk.io/news/2019/05/06/nvme/](https://spdk.io/news/2019/05/06/nvme/)

~~~
zerd
Woah, just realizing that 10.39M 4k IOPS is 42GB/s. Doing 40GB/s of sequential
IO was difficult not that many years ago, let alone random IO. That's faster
than my memory bandwidth. [https://www.microway.com/knowledge-center-
articles/performan...](https://www.microway.com/knowledge-center-
articles/performance-characteristics-of-common-transports-buses/)
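
For reference, the back-of-the-envelope math behind that figure, assuming 4 KiB
(4096-byte) transfers:

    # 10.39M random 4 KiB IOPS expressed as a byte rate
    print(10.39e6 * 4096 / 1e9)   # ~= 42.6 GB/s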

------
g14i
Many of these techniques are already used by Aerospike in their KV database,
which also bypasses the OS file system/cache.

I'm a long time Aerospike user with no connection to Aerospike.

Edit: I would love to see a benchmark with Aerospike.

------
jandrewrogers
PR copy aside, the claimed performance differences relative to RocksDB and
WiredTiger are typical of many storage engines; the performance doesn't stand
out. I don't think either RocksDB or WT has made a serious claim to
prioritizing performance in their designs in any case.

Also, I have to wonder how narrowly "open-source storage engine for SSDs" is
being defined here such that it excludes so many earlier storage engines in
claiming the title of "first".

~~~
tzone
All storage engines prioritize performance? Back in the day, way before
WiredTiger was acquired by MongoDB, they also had very similar graphs showing
perf differences with RocksDB and InnoDB:
[https://github.com/wiredtiger/wiredtiger/wiki/Read-
scalabili...](https://github.com/wiredtiger/wiredtiger/wiki/Read-scalability)
[https://github.com/wiredtiger/wiredtiger/wiki/iiBench-
result...](https://github.com/wiredtiger/wiredtiger/wiki/iiBench-results)
[https://github.com/wiredtiger/wiredtiger/wiki/YCSB-
Mapkeeper...](https://github.com/wiredtiger/wiredtiger/wiki/YCSB-Mapkeeper-
benchmark)

And of course RocksDB has had similar graphs showing perf against other
systems.

Every system manages to find a benchmark that fits their narrative :)

The reality is that both RocksDB and WiredTiger are high-performance storage
engines, and they are both optimized for SSDs too. These types of benchmarks
rarely tell the real story.

------
erulabs
"World's first" Open-Source storage engine for SSDs? I believe Aerospike has
advertised itself as that for years, and certainly most MongoDB instances are
backed by SSD these days. Heck, conceptually etcd is a key-value storage
engine built for SSDs.

> HSE optimizes performance and endurance by orchestrating data placement
> across DRAM and multiple classes of SSDs or other solid-state storage.

Orchestrating data placement? Isn't that what all storage engines do?

What am I missing here? Is this a block-level rather than a file-system-level
driver?

~~~
natmaka
Isn't this 'HSE' conceptually an HSM? How does it compare to existing field-
proven 'storage engines', some of them chock-full of features (because they
are filesystems), such as Lustre or ZFS?

[https://en.wikipedia.org/wiki/Hierarchical_storage_managemen...](https://en.wikipedia.org/wiki/Hierarchical_storage_management)

[https://en.wikipedia.org/wiki/ZFS#Caching_mechanisms:_ARC,_L...](https://en.wikipedia.org/wiki/ZFS#Caching_mechanisms:_ARC,_L2ARC,_Transaction_groups,_ZIL,_SLOG,_Special_VDEV)

------
shockinglytrue
Much better link: [https://github.com/hse-project/hse](https://github.com/hse-
project/hse)

The PR is insane hot air referring to another hot-air product (can you even buy
their 3D XPoint devices yet?)

~~~
wtallis
> can you even buy their 3D Xpoint devices yet?

Nope. The only product they've announced so far using 3D XPoint is the Micron
X100 SSD, which they're only selling to a limited number of major customers;
you won't find it for sale on CDW. Intel's Optane products do use 3D XPoint
memory, and at the moment I believe that's all manufactured in a Micron-owned
fab. (Intel used to co-own it, and I don't think Intel will have their own
production line up and running until next year.)

------
organicfigs
Someone needs to write a book about breaking into writing software like
RocksDB, HSE, etc. Years ago I found myself wanting to learn more, but going
from 0 to 1 felt impossible. I graduated from a T3 school in CS, so
understanding the concepts wasn't the issue; I just didn't know how to build a
good foundation in low-latency persistence. Years later I ended up
contributing to low-latency Java, which was really interesting, but what a
missed opportunity.

~~~
sahil-kang
I just finished reading the OSTEP book[1] and it has a nice chapter on
SSDs[2]. The entire last portion of the book is about filesystems/disks so you
might find it interesting.

[1]
[http://pages.cs.wisc.edu/~remzi/OSTEP/](http://pages.cs.wisc.edu/~remzi/OSTEP/)

[2] [http://pages.cs.wisc.edu/~remzi/OSTEP/file-
ssd.pdf](http://pages.cs.wisc.edu/~remzi/OSTEP/file-ssd.pdf)

~~~
wtallis
That chapter on SSDs looks pretty good to me. Their numbers for NAND page and
especially erase block sizes are very outdated; more modern values are 4kB to
16kB for NAND pages and 16-24MB for erase blocks on TLC NAND. Section 44.9 on
mapping table sizes is a little bit odd, because most SSDs really do have 1GB
of RAM per 1TB of flash, and that expense is widely seen as worthwhile even
for multi-TB SSDs. The exceptions are low-end consumer SSDs that cache only
part of the mapping table in a smaller amount of DRAM or SRAM, and a few
enterprise/datacenter models that use 32kB block sizes for their FTL instead
of the typical 4kB and thus reduce the DRAM requirement by a factor of 8 at
the expense of greatly lowered performance and increased write amplification
when writing in units smaller than 32kB.
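
A rough sketch of where the 1GB-of-RAM-per-1TB-of-flash figure comes from,
assuming a flat mapping table with one ~4-byte entry per 4kB logical block (a
common simplification, not any particular vendor's FTL):

    CAPACITY = 1e12       # 1 TB of flash
    MAP_UNIT = 4096       # 4 kB FTL mapping granularity
    ENTRY = 4             # ~4 bytes per physical address

    entries = CAPACITY / MAP_UNIT
    print(entries * ENTRY / 1e9)   # ~= 1.0 GB of DRAM per 1 TB of flash

    # A 32 kB mapping unit shrinks the table by 32/4 = 8x, at the cost of
    # read-modify-write overhead for writes smaller than 32 kB.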

Aside from the two above issues, everything looks correct and relevant, and I
can't think of any missing details that deserve to be added to an introduction
of that length.

------
drenvuk
This is unbelievably cool. It is a key-value store with multi-segment key
prefixes. Can someone just strap Paxos or Raft onto this and call it a day,
please? Pretty please?

~~~
haivri
I wonder how close an API this provides compared to RocksDB... If it's close,
CockroachDB might be a good trial candidate.

------
tiernano
GitHub repo: [https://github.com/hse-project](https://github.com/hse-project)

------
elihu
So, is this open-source firmware that runs directly on Micron SSDs, or is it
an upper-layer thing that runs on the host system?

~~~
buildbot
I thought it was firmware too, but it appears to be more of a key-value store
engine with block-level access, and it improves Mongo performance.

I got really excited thinking it was an open-source NVMe FPGA core.

------
adam0c
HSE not to be confused with HSE:
[https://www.hse.gov.uk/](https://www.hse.gov.uk/)

------
jgaa
Will it perform well on all SSDs, or is it optimized to give top performance
only on Micron devices?

------
ha-shine
What does heterogeneous mean here?

~~~
klodolph
Multiple types of storage at the same time. For example, two different types
of SSD, or combinations of SSD and DRAM.

------
junaru
Why is this press release getting massively upvoted?

~~~
shockinglytrue
It claims to fix MongoDB

------
cryptonector
Sounds kinda like a ZFS.

------
fortran77
Will this be a replacement for something like the overpriced and
underperforming proprietary products from Pure Storage?

~~~
pinewurst
You're comparing apples to oranges, plus that's unduly harsh.

1. This is a KV store optimized (or claimed to be) for a combo of SSD and
PMEM. Not a packaged, supported, appliance storage system.

2. People who pay for enterprise storage do so for reasons beyond being too
bone-headed to appreciate the joys of cobbling together production systems
from the white box low-bidder and open source software.

