Hacker News new | comments | ask | show | jobs | submit login
Faster – Fast key-value store from Microsoft Research (github.com)
401 points by JeffCyr 6 months ago | hide | past | web | favorite | 76 comments

I had to opportunity to look at the code. I believe this should be compared against embedded hash database such as "kyoto cabinet, LevelDB, RocksDB" . They introduce a novel latch free hashtable that they say is faster than other in memory data structure.

They also introduce a new disk persistence system called HybridLog that combines in-place updates (in memory) and log-structured organization (on disk).

The interesting aspect of the HybridLog is that it act as a bufferpool but seem to work at the record level instead of working with pages like a B-tree buffer pool would or compressed block as levelDB block_cache does.

I was very surprised to see pretty much the entire implementation inside a file called faster.h rather than in a .cc file. Maybe that's all the rage in C++ libraries, I don't work with many of them.

It is because of the template class code. In C++, a template is not really a class or a function.


To compare it to rocksdb or leveldb: it looks like there is no iteration support or the notion of ordered keys to support partial scans...

For those wondering what this is, it is not a client/server app, from what I can tell, but an embedded engine. It looks like it's intended to be a library, and it's been implemented in two languages (C# and C++).

To get something like Redis or Riak you would have to build API, clustering, etc. on top of it. So it's more analogous to libraries like RocksDB, BoltDB, BDB etc.

Paper: https://www.microsoft.com/en-us/research/uploads/prod/2018/0...

c++ part is one liner intrinsic which may be supported directly with new .net and several io methods which would be heavy on pinvoke calls like 5 per method if in c#, but these are very simple. so c++ part could be done in c and easily ported to unix. i guess c# should allow for pinvoke strategy like in lua to replace c++ more.

What kind of different strategies did you have in mind? So far as I know, P/Invoke already tries to be as low overhead as possible - e.g. primitive types and blittable structs are just passed as is, with no conversions. The recently added Span<T> is also special-cased for P/Invoke.

> So far as I know, P/Invoke already tries to be as low overhead as possible

As low as possible maybe, but there's lots of limiting factors with P/Invoke, such as:

- They are not inlined

- Registers need to be saved pessimistically

- The state at the point of the call is visible to observers outside of code the compiler controls so options for scheduling are constrained

- Objects need to be materialised

- All parameter values need to be available even if they aren't actually used

- Objects passed have to be pinned or given an indirect handle

- The caller has to be in a safepoint so GC can keep running

- The compiler has to juggle things into place for the C ABI, where it may normally lay things out differently

- etc

Going through a P/Invoke call is a whole inconvenient circus compared to an intrinsic.

Can anyone work out what platforms it works on? .net core?

Here is the relevant metadata: https://github.com/Microsoft/FASTER/blob/master/cs/src/core/...

Should be usable in NetCoreApp 2.0 and up, or .NET full framework 4.6 and up. In other words, yes, it will work on recent versions of NetCore and classic .Net.

However I also see a .dll file in use https://github.com/Microsoft/FASTER/blob/master/cs/src/core/...

Which suggests that this won't work in NetCoreApp on linux. It's not crossplatform unless it eliminates this or supplies linux .so binaries and so on.

src is in native https://github.com/Microsoft/FASTER/tree/master/cs/src/nativ...

Looks like some file IO functions and __rdtsc

"What differentiates FASTER are its cache-optimized index that achieves very high performance — up to 160 million operations per second when data fits in memory;"

I really dislike when papers make performance claims like this in the introduction. That "160 million" number is so meaningless at face value because everything from the runtime environment to the hardware is going to play a huge role in ops. I rather see how it compares against other key-value stores across a number of configurations and use cases.

It'd be awesome if they provided a comparison of other tools on the same hardware like, "on an AWS M4.xlarge instance we were able to achieve 160m/ops/sec when the dataset fit into memory where as redis only did X"

Lacking that I agree it's a pretty meaningless stat.

Check out the linked paper for a detailed performance comparison.

Don't do benchmarks on shared-host.

If that’s your production environment then that’s where you should run them.

That's not why you don't run them on a shared host.

It's because every other tenant on the machine is going to make run of the same benchmark unpredictable, and it will likely vary greatly through the day.

Even taking multiple runs of each benchmark isn't sufficient, because you don't know the usage patterns of other tenants.

You're making it sound like performance on such hosts in unknowable which isn't really accurate. 'Multiple runs of each benchmark' is vague enough to be potentially insufficient in just about any environment to boot.

Variance matters though. Test where you can reasonably be sure about low variance.

Sure. But it can be measured and accounted for. The idea that you can't or shouldn't benchmark such environments seems weird, given that they're pretty popular.

No benchmark means anything on an EC2 shared instance (or probably any other cloud instance) because you don't know what else is running on the machine.

What about running the benchmark multiple times on instances of the same type? I get that it would be noisy but lots of workloads run on shared instances so it’s a useful measuring stick in that way.

And: "FASTER achieves higher throughput than current systems, by more than two orders of magnitude, and scales better than current pure in-memory data structures, for in-memory working sets."

Looking at 7.2 of the paper, they probably mean "more than 2x", definitely not exponentially faster in most cases. Still nice work though.


From your link: "This was achieved with 32 (out of a maximum 48) data nodes." The project being discussed is benchmarked on a single node, not a cluster. This is why they compare performance to RocksDB.

As well as being owned by Oracle.

Links to 'our website' on the github page follows through to this paper [1]. Haven't read through it, but I did see a graph comparing Intel TBB, RocksDB and Masstree.

[1] https://www.microsoft.com/en-us/research/uploads/prod/2018/0...

The comment about being two orders of magnitude faster than other stores is also pretty annoying. Many systems perform at most 1 disk access for data sets with working sets 100’s of times larger than DRAM, and that’s essentially optimal for those workloads.

It would be nice if they said what workload patterns they’ve optimized for, and which ones they perform poorly on.

Agreed. Hardware and software has had a fair amount of upgrades since then, but reading that I could pseudo-remember that quote by the memcache guy that goes something like "you can't tell if that or that tweak is giving you a few % more requests because you're being i/o limited by your network link anyway".

close enough.

heavily batched on a 48 core I can pull 50 million keys/sec over localhost. if you remove syscalls and use it as a library it should double at least.

writes are another story, but they're slower because nobody asks for them to be faster.

This. Somebody already filled an issue https://github.com/Microsoft/FASTER/issues/2

So 160 million packets per second?

For key value stores that make performance claims, it would be helpful if they would publish the time and space complexity graphs for various sample data workloads. Write/read, and various data homogeneity for at least one of the standard architectures in different kinds of memory and disk configurations if what is being touted is the novel storage architecture. And provide the scripts to generate said sample datasets.

Haven't read the papers yet, but it immediately reminds me of Anna:


That link has comments between anna and faster authors. The best part being:

> There are two high-level design goals that separate Anna and FASTER.

> First of all, for Anna, we set out to explore an execution model that’s truly coordination-free; each thread accepts requests, performs computation and sends out response without communicating or waiting for other threads. We believe having a coordination-free execution model is the key to fully exploiting multi-core parallelism within a single machine, and scale out smoothly to a distributed setting. We acknowledge that the fundamental caveat of having a coordination-free execution model is that strong consistencies (linearizability, serializability) are not achievable. Anna instead offers a wide-spectrum of coordination-free consistencies taxonomized in Bailis’s HAT paper (http://www.vldb.org/pvldb/vol7/p181-bailis.pdf).

> In addition, in Anna we focus on exploring a unified architecture that works at any scale, from a single multi-core machine to NUMA to a geo-distributed setting. Under this goal, architectures that rely on shared memory within a machine (including FASTER) need to be redesigned as we move to a distributed setting. This complicates the software, and can introduce challenges in maintaining consistency as the execution model within nodes and across nodes are now different.

> Anna currently focuses on workloads that fit in memory. For larger-than-memory data, we believe Anna can benefit from the hybrid-logging technique in FASTER for efficiently persisting data to stable storage.

If you want concrete benchmarks, they compare to RocksDB and Redis around page 10 of their academic paper. (https://www.microsoft.com/en-us/research/uploads/prod/2018/0...)


I find their choice of benchmarks to be very convenient.

They tested on in-memory 8 byte payloads and were way faster than RocksDB and Redis. They then tested against only different configurations of themselves for configurations that hit disk. They also tested an embedded version of their software vs a Redis that hits loopback (rather than e.g. running against the raw redis data structure implementation, which would have been a more fair comparison).

> They then tested against only different configurations of themselves for configurations that hit disk.

In the case of Redis, AFAIK it can't support larger than memory use cases, right? And in fact they do compare to RocksDB for larger-than-memory, see Fig 10. Granted, it could be more detailed, but I think it makes their point.

They sure hid figure 10! Unexpected section, bad header choice, etc.

8-byte payloads are a necessary limitatuon of their system because of atomic operations.

I'm sceptical. Modern top tier in mem key value storages like LMDB are just within 20% 30% away from CPU maximum IO throughput on server class hardware. There must be some tricks in their metric.

Do they actually fetch data or they just measure how many memory pointers per second they can dump?

Why the hell is Microsoft so bad at naming?!

I'm sure it will be easy to find information about this,


abysmally performing FAST search product (which is named too close to this with too much overlap),

Creators update,

SQL Server like they're the only SQL capable server out there even though it's based on somebody else's product


This is really interesting.

This is usually what people would do when using NoSQL database to reduce latency by caching elements with an in-memory database.

Basically they are mixing up Redis with RocksDB , which is what devs usually do to get even higher throughput by storing IDs in Redis to save a call to RocksDB.

Now what bother me is the look of repository , it looks completely rushed out.

No logo , unclear description of the tech...

Hence , as other mentioned it's just "an engine" , it doesn't actually contain the network layer, the clustering mechanism etc...

I guess it's probably the tech that is powering their flagship engine : "CosmosDB".

> Now what bother me is the look of repository , it looks completely rushed out.

> No logo , unclear description of the tech...

it's not a product, it's the code that accompanies a white paper.


I have seen plenty of local-machine fast key-value stores, such as LevelDB (By Google), or RocksDB (By Facebook), but I have a hard time imagining what they are for.

What are the use cases for such a library?

Imagine you want to run a service. This service needs to maintain some intermediate state. This state might’ve frequently read. There might be little value in persisting this state. Also, your service is used by many users, so this state can grow to be pretty big. For example, contents of a shopping cart.

One solution is to maintain such state in some key-value store. Different functions of your service can query this state within the data center and never have to suffer disk delays - which are often much longer than the internal network of your data center.

sooo.... what's the difference with your run-of-the-mill hash map / ordered map / whatever ? How does it compare to other maps such as these ones ? https://tessil.github.io/2016/08/29/benchmark-hopscotch-map....

I think it's an in-memory embedded library, so why not just use a global variable map/dictionary?

> so this state might grow pretty big

As mentioned by the comment you replied to, it's because FASTER (and similar libraries) allow persisting this to disk in an efficient manner. If it can fit in memory and your language/framework has an efficient hashmap implementation, then you're right.. it's probably not worth using something like this.

Great summary.

They sit under a service you would actually use. For instance, I think mysql can use rocksdb as a storage backend.

What does FASTER stand for? Because otherwise it's a terrible name, imo.

Yeah, it seems like an acronym, maybe some internal codename. But for some reason they didn't make it public.. Inappropriate? So... Fuck Anything Slower Than Expected Results? ;)

Microsoft tends to choose bad names: See Visual Studio Code, the most ungooglable name ever.

What are your thoughts on how "Googleable" these product names are: ".NET" and "Azure Functions"

Google makes it work. But then there are C# and F#, which are Googleable but fail on probably 99.9% of other sites and software with search functionality, including some of Microsoft's own products (e.g. NuGet).

A very arrogant name from a very arrogant company...

(Someone should make something faster than faster)

"up to 160 million operations per second"

Read operations? Write operations? Per core, or using N cores?

For the case of one core at 3GHz, it would mean 20 CPU cycles per operation, which is barely enough for storing the data in the RAM for a log write. If using more than one core, e.g. one for writing the log, and others for updating the actual data structures, could imply much longer times for completing one operation, despite doing 160M per second on average: in that case it would not their "faster" would mean scalable rather than faster.

The VLDB reviewers of their paper [1] must not have thought much of it because it's accepted as a short paper, which is not a great signal. No comment on the quality of the work though, as I only just started reading it.

[1] https://www.microsoft.com/en-us/research/uploads/prod/2018/0...

So this was a demo paper, backed by their SIGMOD paper [2]. Interesting work. Their Epoch Protection kind of reminds me of the functionality provided by RCUs [3]. It is unfortunate that they do not provide comparisons with other hash maps in [2], opting for end-to-end comparisons with other systems, but that doesn't take away from their results.

[2] https://www.microsoft.com/en-us/research/uploads/prod/2018/0... [3] https://www.kernel.org/doc/Documentation/RCU/whatisRCU.txt

Interestingly, the hash bucket structure presented in their papers does not match the one on their code [0], but that probably means they've made some sort of improvement post-publications.

[0] https://github.com/Microsoft/FASTER/blob/master/cc/src/core/...

It would be nice to know why I would use this instead of something tried and tested like Redis, but the sparse description doesn't really help.

You wouldn't - first and foremost, this is probably most useful as a storage layer for some more higher-level data store.

Second, think of software artifacts from papers as a proof of concept, or a prototype. It's intended more to demonstrate some architectural innovation, and less to serve customer needs. Their ideas revolve around supporting a high level of concurrent access and removing the locking overhead that would normally stem from this, and an interesting way to handle spilling over to disk in larger-than-memory scenarios. This might evolve into a customer-oriented product eventually. Or perhaps get retrofitted into the guts of something like MS SQL server.

Redis is like 0 for 3 from a CAP perspective as far as I understand it. Fine for a cache, but if you want a KV store you can rely on for some properties, it does not seem viable.

So redis is 'tried and true' but may not meet your constraints.


I haven't read the Faster paper yet (planning to) so I don't know that it provides better guarantees. But I personally would like a simple-as-redis KV store that provides better guarantees.

edit: Ah, yeah, Faster doesn't seem to even really be directly comparable to Redis - seems more like rocksdb, and not a distributed system.

On the surface, I'd guess the statically linked vs separate daemon differences apply (both have obvious pros and cons depending upon requirements and deployment scenarios).

Also, MS research is just that, a research wing and many of their libs go stale/unsupported after written.

Could somebody give an example of a use case for this?

Faster? Really?

Name for fast X: Quicker

Name for fast Y: Speedy

Name for fast Z: Gonzales

Didn’t Google call one of their HTTP extensions SPDY? Or am I thinking of something else?

Yes, Google's binary replacement for HTTP when run over TLS was named SPDY and is ultimately the origin of the HTTP/2 standard

Their encrypted replacement for TCP was named QUIC and is now being worked up for a standard also to be named QUIC (people working in this space call Google's GQUIC). The IETF's standard QUIC is firming up and may be finished in 2019 or so, but then I expected TLS 1.3 in 2016 so what do I know.

> Didn’t Google call one of their HTTP extensions SPDY? Or am I thinking of something else?

SPDY was the experiment that became HTTP/2, yes.

SPDY was a good 4-letter acronym.

So what's the underlying storage? Is it LSM with more tricks and prallezation + probabilistic algorithms + SIMD intrisics?


Lots of people here are much deeper in key-value stores, Redis, etc. than I am. So, here I outline what I wrote for a key-value store for a Web server session state store long ago and am still using and ask for expert comments on any pros/cons.

So, for my Web site, (A) when I first send a Web page to a user, I send in the HTML of that page, in an encrypted, hidden field, a key that identifies the user's session. The data I want to keep on the user's session I make a value and then write the key-value pair to my Web session state store; (B) when a user does an HTTP POST back to my Web server(s), I get the user's key and from my session state store get the value that is the user's session state. Each such value is just a byte array that is the serialized instance of my session state class, the instance particular to that user.

My session state store is just some simple .NET Framework software running as a Windows console application. The core of the store is just two instances of the standard .NET collection class, hopefully as fast as AVL trees (as in Knuth, Sorting and Searching) or red-black trees, IIRC, in Sedgwick. One collection class instance holds the key-value pairs. The other collection class instance holds for each key the time of the last access to the key-value pairs and is used to implement session time outs. The communications with the Web servers is just via simple TCP/IP sockets sending/receiving byte arrays.

The session state store is single threaded; so, there are no issues about, or efforts for, concurrency.

Then it would be good for the session state store to be sufficiently fast for the Web site and also to have a FIFO (first in, first out) queue of incoming requests. TCP/IP provides the desired FIFO queue, and, for the FIFO queue length, I'm setting that with just the standard option for TCP/IP.

So far, the session state store is all in main memory, but, of course, that memory might page as in virtual memory.

I'm guessing that there is more computing just for the TCP/IP than for the core of the session state store itself, i.e., the two collection class instances. If so, then as long as I'm using just TCP/IP communications, e.g., over my server farm LAN, the speed of the session state store code becomes a secondary issue -- the whole session state store operation as seen by the Web server(s) might not be much faster even if the session state store ran in 0 time. Maybe.

So far with my server farm software architecture, it should be easy to have as many executing instances of the session state store as needed for Web site performance and with no user affinity between a particular user and a particular Web server. There would be affinity between a particular user and a particular session state store via whatever Web server instance the user most recently got assigned to via load leveling.

Q 1. Is there a fundamental and serious flaw in my design?

Q 2. Would Redis actually be much better for me?

Q 3. If my Web site becomes very busy and has me using, say, a full rack of servers just for session state store, should I look for a still faster way to do the work?

Q 4. Maybe solid state disks (SSDs) are now so fast that I should just program my session state store to use a solid state disk instead of main memory -- e.g., can get SSDs of several terabytes, and that much main memory would be much more expensive. Of course, this assumes that the LAN and software for the TCP/IP stack are fast enough that the performance bottleneck just the work on the collection class instances and that, even with the rest of the assumptions, and SSD would be fast enough -- sounds like a strange situation.

Q 5. Should I plan on using a 10 GBps LAN for communicating with a session state store? That is, might such a fast LAN significantly reduce the latency of the communications between the Web server(s) and the session state server(s)? Have the Web server wait less time for the read/write with the session state store might speed up the whole server farm. Anyone have any actual experience about such speeds?


Faster than the leading brand

But is it faster than Aerospike or Tarantool?

Is it webscale?

Applications are open for YC Summer 2019

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact