
Faster – Fast key-value store from Microsoft Research - JeffCyr
https://github.com/Microsoft/FASTER
======
skyde
I had to opportunity to look at the code. I believe this should be compared
against embedded hash database such as "kyoto cabinet, LevelDB, RocksDB" .
They introduce a novel latch free hashtable that they say is faster than other
in memory data structure.

They also introduce a new disk persistence system called HybridLog that
combines in-place updates (in memory) and log-structured organization (on
disk).

The interesting aspect of the HybridLog is that it act as a bufferpool but
seem to work at the record level instead of working with pages like a B-tree
buffer pool would or compressed block as levelDB block_cache does.

~~~
brongondwana
I was very surprised to see pretty much the entire implementation inside a
file called faster.h rather than in a .cc file. Maybe that's all the rage in
C++ libraries, I don't work with many of them.

~~~
utoku
It is because of the template class code. In C++, a template is not really a
class or a function.

[https://isocpp.org/wiki/faq/templates#templates-defn-vs-
decl](https://isocpp.org/wiki/faq/templates#templates-defn-vs-decl)

------
atombender
For those wondering what this is, it is not a client/server app, from what I
can tell, but an embedded engine. It looks like it's intended to be a library,
and it's been implemented in two languages (C# and C++).

To get something like Redis or Riak you would have to build API, clustering,
etc. on top of it. So it's more analogous to libraries like RocksDB, BoltDB,
BDB etc.

Paper: [https://www.microsoft.com/en-
us/research/uploads/prod/2018/0...](https://www.microsoft.com/en-
us/research/uploads/prod/2018/08/p809-chandramouli.pdf)

~~~
dlahoda
c++ part is one liner intrinsic which may be supported directly with new .net
and several io methods which would be heavy on pinvoke calls like 5 per method
if in c#, but these are very simple. so c++ part could be done in c and easily
ported to unix. i guess c# should allow for pinvoke strategy like in lua to
replace c++ more.

~~~
int_19h
What kind of different strategies did you have in mind? So far as I know,
P/Invoke already tries to be as low overhead as possible - e.g. primitive
types and blittable structs are just passed as is, with no conversions. The
recently added Span<T> is also special-cased for P/Invoke.

~~~
chrisseaton
> So far as I know, P/Invoke already tries to be as low overhead as possible

As low as possible maybe, but there's lots of limiting factors with P/Invoke,
such as:

\- They are not inlined

\- Registers need to be saved pessimistically

\- The state at the point of the call is visible to observers outside of code
the compiler controls so options for scheduling are constrained

\- Objects need to be materialised

\- All parameter values need to be available even if they aren't actually used

\- Objects passed have to be pinned or given an indirect handle

\- The caller has to be in a safepoint so GC can keep running

\- The compiler has to juggle things into place for the C ABI, where it may
normally lay things out differently

\- etc

Going through a P/Invoke call is a whole inconvenient circus compared to an
intrinsic.

------
mooman219
"What differentiates FASTER are its cache-optimized index that achieves very
high performance — up to 160 million operations per second when data fits in
memory;"

I really dislike when papers make performance claims like this in the
introduction. That "160 million" number is so meaningless at face value
because everything from the runtime environment to the hardware is going to
play a huge role in ops. I rather see how it compares against other key-value
stores across a number of configurations and use cases.

~~~
haney
It'd be awesome if they provided a comparison of other tools on the same
hardware like, "on an AWS M4.xlarge instance we were able to achieve
160m/ops/sec when the dataset fit into memory where as redis only did X"

Lacking that I agree it's a pretty meaningless stat.

~~~
ddorian43
Don't do benchmarks on shared-host.

~~~
ebikelaw
If that’s your production environment then that’s where you should run them.

~~~
dymk
That's not why you don't run them on a shared host.

It's because every other tenant on the machine is going to make run of the
same benchmark unpredictable, and it will likely vary greatly through the day.

Even taking multiple runs of each benchmark isn't sufficient, because you
don't know the usage patterns of other tenants.

~~~
pvg
You're making it sound like performance on such hosts in unknowable which
isn't really accurate. 'Multiple runs of each benchmark' is vague enough to be
potentially insufficient in just about any environment to boot.

~~~
eklavya
Variance matters though. Test where you can reasonably be sure about low
variance.

~~~
pvg
Sure. But it can be measured and accounted for. The idea that you can't or
shouldn't benchmark such environments seems weird, given that they're pretty
popular.

------
iamleppert
For key value stores that make performance claims, it would be helpful if they
would publish the time and space complexity graphs for various sample data
workloads. Write/read, and various data homogeneity for at least one of the
standard architectures in different kinds of memory and disk configurations if
what is being touted is the novel storage architecture. And provide the
scripts to generate said sample datasets.

------
Rauchg
Haven't read the papers yet, but it immediately reminds me of Anna:

[https://databeta.wordpress.com/2018/03/09/anna-
kvs/](https://databeta.wordpress.com/2018/03/09/anna-kvs/)

~~~
sometimesijust
That link has comments between anna and faster authors. The best part being:

> There are two high-level design goals that separate Anna and FASTER.

> First of all, for Anna, we set out to explore an execution model that’s
> truly coordination-free; each thread accepts requests, performs computation
> and sends out response without communicating or waiting for other threads.
> We believe having a coordination-free execution model is the key to fully
> exploiting multi-core parallelism within a single machine, and scale out
> smoothly to a distributed setting. We acknowledge that the fundamental
> caveat of having a coordination-free execution model is that strong
> consistencies (linearizability, serializability) are not achievable. Anna
> instead offers a wide-spectrum of coordination-free consistencies
> taxonomized in Bailis’s HAT paper
> ([http://www.vldb.org/pvldb/vol7/p181-bailis.pdf](http://www.vldb.org/pvldb/vol7/p181-bailis.pdf)).

> In addition, in Anna we focus on exploring a unified architecture that works
> at any scale, from a single multi-core machine to NUMA to a geo-distributed
> setting. Under this goal, architectures that rely on shared memory within a
> machine (including FASTER) need to be redesigned as we move to a distributed
> setting. This complicates the software, and can introduce challenges in
> maintaining consistency as the execution model within nodes and across nodes
> are now different.

> Anna currently focuses on workloads that fit in memory. For larger-than-
> memory data, we believe Anna can benefit from the hybrid-logging technique
> in FASTER for efficiently persisting data to stable storage.

------
fizx
If you want concrete benchmarks, they compare to RocksDB and Redis around page
10 of their academic paper. ([https://www.microsoft.com/en-
us/research/uploads/prod/2018/0...](https://www.microsoft.com/en-
us/research/uploads/prod/2018/03/faster-sigmod18.pdf))

TL;DR:

I find their choice of benchmarks to be very convenient.

They tested on in-memory 8 byte payloads and were way faster than RocksDB and
Redis. They then tested against _only different configurations of themselves_
for configurations that hit disk. They also tested an embedded version of
their software vs a Redis that hits loopback (rather than e.g. running against
the raw redis data structure implementation, which would have been a more fair
comparison).

~~~
makmanalp
> They then tested against only different configurations of themselves for
> configurations that hit disk.

In the case of Redis, AFAIK it can't support larger than memory use cases,
right? And in fact they do compare to RocksDB for larger-than-memory, see Fig
10. Granted, it could be more detailed, but I think it makes their point.

~~~
fizx
They sure hid figure 10! Unexpected section, bad header choice, etc.

------
baybal2
I'm sceptical. Modern top tier in mem key value storages like LMDB are just
within 20% 30% away from CPU maximum IO throughput on server class hardware.
There must be some tricks in their metric.

Do they actually fetch data or they just measure how many memory pointers per
second they can dump?

------
frugalmail
Why the hell is Microsoft so bad at naming?!

I'm sure it will be easy to find information about this,

.NET,

abysmally performing FAST search product (which is named too close to this
with too much overlap),

Creators update,

SQL Server like they're the only SQL capable server out there even though it's
based on somebody else's product

etc...

------
hashrate
This is really interesting.

This is usually what people would do when using NoSQL database to reduce
latency by caching elements with an in-memory database.

Basically they are mixing up Redis with RocksDB , which is what devs usually
do to get even higher throughput by storing IDs in Redis to save a call to
RocksDB.

Now what bother me is the look of repository , it looks completely rushed out.

No logo , unclear description of the tech...

Hence , as other mentioned it's just "an engine" , it doesn't actually contain
the network layer, the clustering mechanism etc...

I guess it's probably the tech that is powering their flagship engine :
"CosmosDB".

~~~
omegabravo
> Now what bother me is the look of repository , it looks completely rushed
> out.

> No logo , unclear description of the tech...

it's not a product, it's the code that accompanies a white paper.

[https://www.microsoft.com/en-
us/research/project/FASTER/](https://www.microsoft.com/en-
us/research/project/FASTER/)

------
Nican
I have seen plenty of local-machine fast key-value stores, such as LevelDB (By
Google), or RocksDB (By Facebook), but I have a hard time imagining what they
are for.

What are the use cases for such a library?

~~~
DSingularity
Imagine you want to run a service. This service needs to maintain some
intermediate state. This state might’ve frequently read. There might be little
value in persisting this state. Also, your service is used by many users, so
this state can grow to be pretty big. For example, contents of a shopping
cart.

One solution is to maintain such state in some key-value store. Different
functions of your service can query this state within the data center and
never have to suffer disk delays - which are often much longer than the
internal network of your data center.

~~~
wiradikusuma
I think it's an in-memory embedded library, so why not just use a global
variable map/dictionary?

~~~
seabrookmx
> so this state might grow pretty big

As mentioned by the comment you replied to, it's because FASTER (and similar
libraries) allow persisting this to disk in an efficient manner. If it can fit
in memory and your language/framework has an efficient hashmap implementation,
then you're right.. it's probably not worth using something like this.

------
andreygrehov
What does FASTER stand for? Because otherwise it's a terrible name, imo.

~~~
badloginagain
Microsoft tends to choose bad names: See Visual Studio Code, the most
ungooglable name ever.

~~~
andrewstuart
What are your thoughts on how "Googleable" these product names are: ".NET" and
"Azure Functions"

~~~
mcbits
Google makes it work. But then there are C# and F#, which are Googleable but
fail on probably 99.9% of other sites and software with search functionality,
including some of Microsoft's own products (e.g. NuGet).

------
faragon
"up to 160 million operations per second"

Read operations? Write operations? Per core, or using N cores?

For the case of one core at 3GHz, it would mean 20 CPU cycles per operation,
which is barely enough for storing the data in the RAM for a log write. If
using more than one core, e.g. one for writing the log, and others for
updating the actual data structures, could imply much longer times for
completing one operation, despite doing 160M per second on average: in that
case it would not their "faster" would mean scalable rather than faster.

------
utopcell
The VLDB reviewers of their paper [1] must not have thought much of it because
it's accepted as a short paper, which is not a great signal. No comment on the
quality of the work though, as I only just started reading it.

[1] [https://www.microsoft.com/en-
us/research/uploads/prod/2018/0...](https://www.microsoft.com/en-
us/research/uploads/prod/2018/08/p809-chandramouli.pdf)

~~~
utopcell
So this was a demo paper, backed by their SIGMOD paper [2]. Interesting work.
Their Epoch Protection kind of reminds me of the functionality provided by
RCUs [3]. It is unfortunate that they do not provide comparisons with other
hash maps in [2], opting for end-to-end comparisons with other systems, but
that doesn't take away from their results.

[2] [https://www.microsoft.com/en-
us/research/uploads/prod/2018/0...](https://www.microsoft.com/en-
us/research/uploads/prod/2018/03/faster-sigmod18.pdf) [3]
[https://www.kernel.org/doc/Documentation/RCU/whatisRCU.txt](https://www.kernel.org/doc/Documentation/RCU/whatisRCU.txt)

------
fein
It would be nice to know why I would use this instead of something tried and
tested like Redis, but the sparse description doesn't really help.

~~~
makmanalp
You wouldn't - first and foremost, this is probably most useful as a storage
layer for some more higher-level data store.

Second, think of software artifacts from papers as a proof of concept, or a
prototype. It's intended more to demonstrate some architectural innovation,
and less to serve customer needs. Their ideas revolve around supporting a high
level of concurrent access and removing the locking overhead that would
normally stem from this, and an interesting way to handle spilling over to
disk in larger-than-memory scenarios. This might evolve into a customer-
oriented product eventually. Or perhaps get retrofitted into the guts of
something like MS SQL server.

------
throwaway010317
Could somebody give an example of a use case for this?

------
21
Faster? Really?

Name for fast X: Quicker

Name for fast Y: Speedy

~~~
gaius
Didn’t Google call one of their HTTP extensions SPDY? Or am I thinking of
something else?

~~~
tialaramex
Yes, Google's binary replacement for HTTP when run over TLS was named SPDY and
is ultimately the origin of the HTTP/2 standard

Their encrypted replacement for TCP was named QUIC and is now being worked up
for a standard also to be named QUIC (people working in this space call
Google's GQUIC). The IETF's standard QUIC is firming up and may be finished in
2019 or so, but then I expected TLS 1.3 in 2016 so what do I know.

------
tribesman
So what's the underlying storage? Is it LSM with more tricks and prallezation
+ probabilistic algorithms + SIMD intrisics?

~~~
dboreham
Memory.

------
graycat
Lots of people here are much deeper in key-value stores, Redis, etc. than I
am. So, here I outline what I wrote for a key-value store for a Web server
session state store long ago and am still using and ask for expert comments on
any pros/cons.

So, for my Web site, (A) when I first send a Web page to a user, I send in the
HTML of that page, in an encrypted, hidden field, a key that identifies the
user's _session_. The data I want to keep on the user's session I make a
_value_ and then write the key-value pair to my Web session state store; (B)
when a user does an HTTP POST back to my Web server(s), I get the user's key
and from my session state store get the value that is the user's session
state. Each such value is just a byte array that is the serialized instance of
my session state class, the instance particular to that user.

My session state store is just some simple .NET Framework software running as
a Windows console application. The core of the store is just two instances of
the standard .NET collection class, hopefully as fast as AVL trees (as in
Knuth, _Sorting and Searching_ ) or red-black trees, IIRC, in Sedgwick. One
collection class instance holds the key-value pairs. The other collection
class instance holds for each key the time of the last access to the key-value
pairs and is used to implement session time outs. The communications with the
Web servers is just via simple TCP/IP sockets sending/receiving byte arrays.

The session state store is single threaded; so, there are no issues about, or
efforts for, concurrency.

Then it would be good for the session state store to be sufficiently fast for
the Web site and also to have a FIFO (first in, first out) queue of incoming
requests. TCP/IP provides the desired FIFO queue, and, for the FIFO queue
length, I'm setting that with just the standard option for TCP/IP.

So far, the session state store is all in main memory, but, of course, that
memory might page as in virtual memory.

I'm guessing that there is more computing just for the TCP/IP than for the
core of the session state store itself, i.e., the two collection class
instances. If so, then as long as I'm using just TCP/IP communications, e.g.,
over my server farm LAN, the speed of the session state store code becomes a
secondary issue -- the whole session state store operation as seen by the Web
server(s) might not be much faster even if the session state store ran in 0
time. Maybe.

So far with my server farm software architecture, it should be easy to have as
many executing instances of the session state store as needed for Web site
performance and with no _user affinity_ between a particular user and a
particular Web server. There would be affinity between a particular user and a
particular session state store via whatever Web server instance the user most
recently got assigned to via load leveling.

Q 1. Is there a fundamental and serious flaw in my design?

Q 2. Would Redis actually be much better for me?

Q 3. If my Web site becomes very busy and has me using, say, a full rack of
servers just for session state store, should I look for a still faster way to
do the work?

Q 4. Maybe solid state disks (SSDs) are now so fast that I should just program
my session state store to use a solid state disk instead of main memory --
e.g., can get SSDs of several terabytes, and that much main memory would be
much more expensive. Of course, this assumes that the LAN and software for the
TCP/IP stack are fast enough that the performance bottleneck just the work on
the collection class instances and that, even with the rest of the
assumptions, and SSD would be fast enough -- sounds like a strange situation.

Q 5. Should I plan on using a 10 GBps LAN for communicating with a session
state store? That is, might such a fast LAN significantly reduce the latency
of the communications between the Web server(s) and the session state
server(s)? Have the Web server wait less time for the read/write with the
session state store might speed up the whole server farm. Anyone have any
actual experience about such speeds?

Thanks!!

------
magoon
Faster than the leading brand

------
bufferoverflow
But is it faster than Aerospike or Tarantool?

~~~
dfee
Is it webscale?

