Hacker News new | comments | ask | show | jobs | submit login
An FPGA-based In-line Accelerator for Memcached (2013) [pdf] (hotchips.org)
89 points by cleong 8 months ago | hide | past | web | favorite | 32 comments

This seems old (2013ish?). There're newer "key/value on FPGA" papers that're more modern.

If memcached papers have one thing in common, it's an uncanny ability to get the comparison software to run as slowly as possible. 100k ops/sec/core is what you get when using a single client connection with blocking I/O. Using more clients (as per a normal prod setup) or pipelining queries is more like 1m ops+/core, with writes scaling worse than reads. In production it's easy to get some level of pipelining (multigets, clustered keys, etc), since you're rarely just fetching a key and then blocking.

A much better FPGA paper would show scales of at what level the syscall overhead becomes most of the CPU usage, as well as any measured latency improvements. I think some of the other papers address latency at least.

In reality it hardly matters. If you're hitting memcached hard enough with tiny values for it to matter, ensuring keys are clustered and pipelined is a lot less maintenance overhead than deploying FPGA's.

Can confirm this eerie ability of FPGA and h/w folks from another domain (regular expressions - I'm the designer of Hyperscan, a s/w solution).

Every corner case that could be found in s/w was always the topic of a excited benchmark. Also, the old trick of 'hey, let's drop all the matches on the floor in our h/w or FPGA, while getting a huge number of matches in s/w and making the s/w guys look ridiculous'.

Every time I read a paper touting a great new speedup on FPGA (over some crap s/w implementation) I'm reminded of that old joke about the Texan visiting Israel and telling the owner of some small farm that "he can get on a tractor and ride for days without getting the the boundary of his property." The Israeli nods sympathetically and says "Yes, I too used to have a tractor like that".

I don’t understand the joke?

I think the Texan is thinking of a regular tractor on a huge farm, while the Israeli is thinking of a slow/broken tractor on a moderately sized farm.

I found https://www.reddit.com/r/Jokes/comments/2uj2ot/a_texas_ranch... And it seems to be not that good of a joke.

Especially when we have Xeon -D that goes up to 16 Core. AMD EPYC that gives more Core per dollar. And more IPC soon in Zen 2. 7nm and 10nm from AMD and Intel next year on Server. Not to mention it now Support ARMv8

So Excuse my ignorance, apart from AWS or Azure Scales, what would anyone uses Memcached on FPGA?

* I wouldn't mind if the system was simple plug and play and has all the benefits, cost saving without the headache. But very rarely are any technology deployment without any headache or hassle free.

Our startup is working on accelerators using FPGAs on AWS including memcached.

Using a single AWS F1 (FPGA) instance, our Memcached accelerator achieves over 11 million ops/sec at less than 300 microsecond latency. Compared to ElastiCache, the AWS-managed CPU Memcached server, our Memcached accelerator offers 9X better throughput, 9X lower latency, and 10X better throughput/$.

We need to batch multiple requests per Ethernet packet to get around packet per sec rate limiting on AWS. See more details here: https://www.legupcomputing.com/blog/index.php/2018/05/01/dee...

If anyone is interested we would love to hear from you, we will be showing off an online demo later this week.

FPGAs are great for processing data at 10Gbps line rate with low latency. They are also good for compute tasks like compression and encryption.

I am the co-founder of Plunify. We have an ML software solution (InTime) that optimizes FPGA design by tuning the parameters of the compilation, i.e. synthesis, P&R. I don't know much about memcached accelerators but if it is a performance driven application, like HFT, I believe we can make it go even faster, i.e. increase the FMax. I read from your website that you are using Intel PSG devices. We often see designers using seeds to close timing or optimize the timing, but that is leaving performance on the table. For more details: https://support.plunify.com/en/2018/04/17/compare-timing-per...

Happy to talk more at kirvy@plunify.com if you are interested. Congrats on getting a seed round from Intel Capital!

Interesting, what synthesis settings have you found have the most impact? I have also seen FPGA designers trying different seeds when closing timing. In this case, AWS provides an FPGA shell for external interfaces that has a maximum clock frequency of 250MHz. We have been able to meet this timing constraint without many issues. But we will keep you in mind for the Intel FPGA boards we are working with now.

what synthesis settings have you found have the most impact?

The answer is - it depends. Unfortunately, we have not found a "golden" combination of settings yet. If you have a highly congested design, synthesis does help a lot, but not all cases. There are correlations between the settings, so if A is good, B is good, A+B could be bad. Seeds belong to a category of techniques that we classify as random. For example, for Vivado, although Xilinx removed the seeds feature, we created a technique to trigger randomness in the placement using a property of Vivado.

What we do is not new in the sense that settings exploration has always been around. But with cloud compute resources and ML approaches, it really enables timing closure and optimization methods in a cheaper and more disciplined fashion.

We are also very interested in users of OpenCL/HLS/C. The translated RTL is often not as optimized/readable as what an RTL designer will do if he/she does it directly in RTL. Our tool (InTime) can be a good boost to the performance of such RTL.

How is elasticache so slow? what instances does it run on?

edit: r4.4xlarge as per the link. 16vcpu? You should be able to beat on latency but beating on throughput means elasticache is misconfigured, likely. Or you're putting on way too much set traffic (think I saw you set the bench to 1:1 ratio of gets to sets?)

I wouldn't characterize Elasticache as running slow, a single instance in this case is handling 1.3M request/sec. But we can be 9X faster by batching multiple requests per packet and then offloading the TCP network stack and memcached operations to the FPGA. The FPGA allows us to handle the requests at network line-rate, even with small 100-byte requests. On Elasticache, past a certain point these small requests start to overload the CPU.

The interesting part is the FPGA could still do much more computation (for example, compression or encryption) while maintaining the same throughput due to hardware pipelining. We described this concept further in the blog post I linked to.

I characterize it as slow because I know it can saturate the packet rate AWS gives it with software memcached. If the packet rate were much higher then you might win out.

The only reason why you can claim 9x latency is because you've saturated the worker threads. You should still win on latency even if it were properly bottlenecking on the network, but 9x throughput and 9x latency is completely false as a capacity limit in this test.

The other issue is 100 bytes isn't typical. It's common but almost every user has a varied workload. Deploying FPGA's for the larger cache values ends up being a waste. I designed a new storage system based off of offloading larger cold keys to flash, even.

Do you have a source showing elasticache running faster than this? For example, Redis labs was only able to achieve 10M req/sec by using 6 m4.16xlarge instances which are double the price of the CPU instance we used: https://dzone.com/articles/10m-opssec-1msec-latency-with-onl...

100-500 byte values are the majority of requests at companies like Facebook and Lyft for their key value clusters. For large value sizes the network interface becomes the bottleneck so FPGAs won’t be able to help.

I wrote the current version of OSS memcached. I don't know how elasticache is configured, but as I said memcached itself can definitely saturate the network from that instance. Either the version they run is too old or it's misconfigured. If I were to compare a "custom FPGA caching service" vs something memcached like, I would take the same 4xlarge instance and just run memcached on it.

On a large enough machine I've gotten it up past 55 million read ops/sec. It's quite good at read throughput.

I'm also familiar with the cache clusters at major companies.

Our assumption was that elasticache would be highly optimized by Amazon. Remember that these are virtual machines which means limitations such as packet per second throttling. What specific configuration options do you think are missing?

In the latest version of memcached have you added support for batching/pipelining multiple requests per packet? Because this was crucial for achieving high requests/sec in this example.

Were the 55M requests/sec coming from another machine? Even with small 100B values you would need a minimum of a 44 Gbps network link. How many cores were required? In our benchmark we wanted a fair comparison between instances of similar price and RAM size.

They stopped updating it a few years ago; it's probably also not as well tuned as you think. I'd need to see the output of "stats settings" from a running instance to know for sure. I also have no idea if it's a hacked fork or not.

Odds are pretty good it's left at the default of 4 worker threads... so on a 16 vcpu instance that's not going to reach great heights. Since it's a 1.4.x version (years old), it's missing some newer features that both help in average latency and memory efficiency. Or rather, a lot of them are there but disabled by default.

Memcached has allowed pipelining since it was created. For the ASCII protocol, packing multiple responses into single packets is done via a straight multiget. You can send multiple requests in a single packet for any protocol and any command.

My stress utility (https://github.com/memcached/mc-crusher) has options for pipelining requests, and using multigets ascii packed get responses. I test to the limit of lock scaling for each individual subsystem.

The 55M test required running mc-crusher via localhost, there's no network that can go that fast. My point is you're limited by the network throughput, not the CPU. In that particular 55M test, all cores were used, but ~7-8 of them were used by mc-crusher... so the real limit for the machine is even higher. It did have a lot of cores. 48ish?

You can still do apples/apples with instance sizes... but given everything I know about this thing, unless those cores are extremely slow, hitting 11m ops/sec shouldn't be an issue. Or at least, with minimal fiddling it should hit 6-8m, which doesn't give you a crazy 9x figure.

You do need to stop doing 1:1 get/set ratio though. Sets don't scale very well since I've generally never had complaints about the speed. I'd say a highly conservative test would be 5:1 get/set. Production workloads are typically even higher than that. (that said I do intend to speed them up more, it's just lowish priority.. the LRU locks are highly granular, so spreading sets across different slab classes can help mutation perf a lot).

I'm still seeing similar results (~1M req/sec) after compiling your latest version of memcached from github and running with 16 worker threads. I just spun up two r4.4xlarge instances (one for client and one for the memcached server). I'm using memtier_benchmark with pipelining of 16 requests, 100B values, 10:1 get/set ratio. I compiled mc-crusher but you'll have to let me know the command to run because the readme wasn't clear.

One main constraint here is that we are using AWS virtual machine instances on the cloud. My guess is your previous experience is with physical servers. The FPGA performance is also significantly better when you can use the physical board with a direct ethernet connection, pipelining isn't required in this case the FPGA can handle minimum sized ethernet packets at line rate.

Another question, in your experience is compression/encryption used much with memcached? Because this is another area where the FPGA can compute much faster.

mis-threaded my response below (didn't have a reply button?), so see that too.

Just signed up for a personal AWS account and manually started an r4.4xlarge for target and c5.4xlarge for source (same CPU's and networking capability?, but it wasn't allowing me to just start two r4.4xlarge...).

got it up to 15M hits/sec for pure mget test.

results: https://gist.github.com/dormando/910134e85279710b970bd2c8af8...

Thanks for the details on how to use your benchmark script and for taking the time to investigate this. I hadn’t heard of your benchmark before and mc-crusher seems to work a bit differently than memtier_benchmark.

First a few significant differences:

1) Your value size is 10B which completely changes the results. Let’s keep the value size at 100B, which is more realistic.

2) The ratio of gets to sets significantly affects the requests per sec. We were assuming 1:1 ratio when we did our measurements. Increasing the percentage of gets really speeds up req/sec. We didn’t observe this effect on elasticache. Is this a recent improvement in the github version of memcached?

3) Your benchmark is using multiple keys in the same get command. What memtier does is pipeline multiple get commands each with one key. This seems more realistic.

4) We pipelined 16 get commands per packet while your configuration had 50 keys per get command.

I was able to reproduce the same setup as we had with ~1.2M req/sec with your mc-crusher benchmark using the following config. This has 1:1 get to set ratio with pipeline 16 and value size 100B.

send=ascii_set,recv=blind_read,conns=50,key_prefix=foobar,key_prealloc=0,pipelines=16,value_size=100 send=ascii_set,recv=blind_read,conns=50,key_prefix=foobar,key_prealloc=0,pipelines=16,value_size=100,thread=1 send=ascii_set,recv=blind_read,conns=50,key_prefix=foobar,key_prealloc=0,pipelines=16,value_size=100,thread=1 send=ascii_set,recv=blind_read,conns=50,key_prefix=foobar,key_prealloc=0,pipelines=16,value_size=100,thread=1 send=ascii_get,recv=blind_read,conns=50,pipelines=16,key_prefix=foobar,key_prealloc=1 send=ascii_get,recv=blind_read,conns=50,pipelines=16,key_prefix=foobar,key_prealloc=1,thread=1 send=ascii_get,recv=blind_read,conns=50,pipelines=16,key_prefix=foobar,key_prealloc=1,thread=1 send=ascii_get,recv=blind_read,conns=50,pipelines=16,key_prefix=foobar,key_prealloc=1,thread=1 send=ascii_get,recv=blind_read,conns=50,pipelines=16,key_prefix=foobar,key_prealloc=1,thread=1

I used the github memcached on an r4.4xlarge. I ran memcache-top on the server instance to measure the requests per second, showing about 750k gets/sec and 600k sets/sec.

With a ratio of 10:1 gets to sets I’m seeing about 3.5M req/sec which seems better than elasticache.

Please don't try to dial this back to win. I showed you how things work, go ahead and fiddle with them as you want.

1) sure, 100b, but that will just make it easier for the CPU version to hit the packet rate limit. I dialed it down to show just how fast the key rate is. Your entire proposal was that CPU bottlenecked the NIC, and it does not. Also, most people have 100b keys, nevermind the values.

2) 1:1 was never realistic. It's not even remotely realistic; as I said earlier 5:1 would be pessimistic. In reality the instances which have get rates in the millions tend to have 100:1 or better ratios due to the nature of the data they're caching.

Yes, the newer LRU algorithm doesn't grab LRU locks on the read path, so it'll scale with the number of CPU cores. As I said in earlier comments, the sets don't currently scale, especially if you're hammering the same LRU (which is again, unrealistic). If you just do a pure set load you'll land somewhere between 900k and 1.5m ops/sec.

3) I did both single-get-pipelined and packet-pipelined benchmarks; also absolutely not. Clients are designed to use the multiget mode when multiple keys are being fetched from the same server. This benefit is lost with the binary protocol (which will be fixed at some point).

4) Try an mget with 16, it won't be too far off, though you might have to add one more mc-crusher thread.

In your last test, you're simply overloading it with sets. If you want to mislead people with a test like this, go ahead; but I'll point it out.

3.5M/s isn't too bad.

Memcached really isn't a great target for your sort of work. I love the idea of FPGA offload, but trying to advertise your thing as superior by making up your own rules is going to get called out.

1) The popularity of redis is absolutely damning in general. if people are okay with the performance of a single CPU database with all-over-the-map latency profiles, the odds of you finding enough customers with extremely high rate memcached pools to sustain a business are essentially zero. You'd be solely tricking people who think they need it.

2) You are not facebook. Nobody is facebook but facebook. 100b is not representative. It's not even representative of facebook's load.

What's worse, even for a more common case, if 99% of requests are 100b, the average size of an item might be 8k. Which doesn't mean that there are a bunch around 8k, but there could be a few thousand items that are 50k-500k+ in size, getting hit 1% of the time, or even 0.1% of the time.

500x the bandwidth of a 100b request for the same processing overhead. It's almost always something they need: a request might fetch a couple hundred items from memcached, with just a couple of them being large.

This ends up making RAM be the greatest expense in the system. If so few users really need this performance, and the newer versions of memcached have a much higher perf ceiling, the extra features it has to drive down RAM usage are more valuable.

The best cost/power savings most users can do is find a way to get more RAM attached to fewer CPU cores: to be frank a r4.4xlarge would suit better with 8 cores. Or find ways push larger cold values into flash, freeing up RAM for those 100b values to be served quickly.

Nothing was dialed back. My previous post just confirmed that our elasticache was not misconfigured. I used the latest github memcached with 14 worker threads as you suggested and your benchmark script gives the same results that we reported for the r4.4xlarge.

1) Actually your earlier test confirms our point that the CPU does not saturate the 10Gbps network with small value sizes. For example, in your 10B value example you got 15M req/sec with 16 cores. This rate is 1.2Gbps (15M x 10B * 8), well below the network limit of 10Gbps. The FPGA would still be ~8X faster at line rate.

2) Regardless of the get/set ratio the FPGA will hit line rate. However, thanks for pointing out that GET requests are much faster in the latest version of memcached, we didn't know that. Looks like the FPGA is only 3X faster for this 10:1 Get to Set workload.

3) Users would prefer if there was no pipelining at all. We only used pipelining to get around packet/sec limitations on AWS. If the FPGA was connected directly to the network we could hit line rate without any pipelining.

We aren't trying to mislead, we are just showing what's possible with the FPGA on AWS: line-rate processing of incoming requests at close to 10Gbps. The cool part as I mentioned is that the FPGA is still under-utilized so we could add encryption without affecting requests/sec at all because hardware cores execute in parallel. Another idea is to compress the data on the fly.

Agreed that RAM is the expensive part, which is why we picked a CPU instance that had similar RAM to the FPGA instance. Yes we have heard of users caching data to SSD to save cost.

Try 12 workers to start, there are some bg threads.

I've plenty of experience with both hardware and virtual machines, I just don't use AWS myself much. I can get something like 800k read ops/sec from a 4core raspberry pi2, and I hope the AWS instance isn't that terrible.

with mc-crusher:

./mc-crusher conf/someconfigfile ipaddress port

https://github.com/memcached/mc-crusher/blob/master/conf/asc... - this is a decent read test with pipelining (give the test a few seconds to get through its sets). The inbound requests are pipelined, but it'll still send each get response in individual packets. This is what I use to test syscall/interrupt overhead.

https://github.com/memcached/mc-crusher/blob/master/conf/mge... this is the same thing, but with mgets. I'd copy the set line from ascii too:

send=ascii_set,recv=blind_read,conns=10,key_prefix=foobar,key_prealloc=0,pipelines=4,stop_after=200000,usleep=1000,value_size=10 send=ascii_mget,recv=blind_read,conns=50,mget_count=50,key_prefix=foobar,key_prealloc=1

can vary the value_size to and mget_count to see how that changes things. You can also pre-warm with the 'bench-warmer' script that comes with it, or remove stop_after and adjust usleep to adjust get/set ratios.

Watch top on the client host, and if mc-crusher is capping out its CPU cores, add more lines to the test but with the (confusing, sorry) threading enabled:

send=ascii_set,recv=blind_read,conns=10,key_prefix=foobar,key_prealloc=0,pipelines=4,stop_after=200000,usleep=1000,value_size=10 send=ascii_mget,recv=blind_read,conns=50,mget_count=50,key_prefix=foobar,key_prealloc=1 send=ascii_mget,recv=blind_read,conns=50,mget_count=50,key_prefix=foobar,key_prealloc=1,thread=1

That puts the first two tests on the "main" thread, then spawns an extra thread for the third test. you can keep copy/pasting that last line until the client or the server are saturated.

edit: sorry, the enc/compression question:

1) compression is typically done in the client to reduce bandwidth overhead. It's not very useful in the server.

2) encryption is becoming more popular, but doesn't currently exist much. The mainline OSS doesn't even have TLS support yet. Almost all use cases are on internal networks. FPGA's could potentially help there... aes-ni on intel cpu's isn't awful though.

About to congratulate you that you are about to make competition to Alibaba.com.

On our side, maxing the IOPS was the easy part. The hard one was to marry the protocol with converged/deterministic Ethernet with RDMA. We were split in between "one request, one frame/packet burst" vs "all requests are somehow smartly aligned with frames by stateful logic." The first one was surprisingly susceptible to performance artifacts due to varying round trip latency of few microseconds, thus it was possible to get packets in transit being dropped due to receiving NIC (a top tier hardware) being momentarily overloaded.

You have an advantage of being DC provider independent, and can jump the AWS ship whenever you want. Alibaba's solution will be tied to its infrastructure with its very expensive RDMA capable network.

Sounds interesting, are there any papers or public details that you could point me to with more technical information about this Alibaba RDMA-based memcached project? Alibaba also has FPGA instances available and we have been investigating their cloud offering.

Invite only till at least autumn, and they have no English speaking support (it is for their Chinese only DCs.)

Google Alibaba China F1 or F2.

As for the offering, the idea is that end the clients will only have to deal with SDK and libs on instances, not raw RDMA or anything related to internal infrastructure. This is how much I am allowed to say besides the fact of its existence.

From their current experience, and that of other hosting providers, not many people who go with F1, F2, or other FPGA instance actually do reap any benefit, and some drop mid-way. That's why they want to get more people using them pass the "toying with it stage." The "Herokuification" (god, Heroku sounds beyond hilarious in Russian) is there to let people use the common APIs while getting benefit from FPGA performance, without dealing with things outside of average webdev area of expertise.

really wish AWS supported Ethernet on fpga

There aren't Ethernet interfaces (transceivers, PCS/PMA, MAC) on the AWS FPGAs?

Although there are Ethernet transceivers on the AWS FPGAs (these are Xilinx UltraScale+ VU9P FPGAs) they are unused and not connected directly to the AWS network. Instead the FPGAs are connected over PCIe to a host server, which has a standard NIC. This required us to use DPDK to bypass the kernel and pass raw network traffic directly to/from the FPGA over PCIe.

We have been told by cloud providers that the FPGA cannot be directly connected due to network security concerns. Since there is no easy way to control how the arbitrary hardware programmed by users on the FPGA will interact with their network. Microsoft has been using FPGAs directly connected to their network (called a bump-in-the-wire architecture) for the FPGAs used in their datacenter (see Project Catapult for details). But these FPGAs are not programmable by Azure users yet.

I wonder if you could bound the problem with formal verification so that a custom design could only expose a certain activity profile, or an embedded hardware kill switch that would eliminate access after some detectable malicious interaction (at the hardware level).

Would that even be possible, or does it boil down to the halting problem (how can you know a sequence of transactions is malicious)?

I'm completely naive, but even so I'm tentatively confident saying yes, it's a halting problem thing.

My understanding of formal verification is a) that it doesn't eliminate bugs, it just reduces their probability by a percentage, and b) that it's academically very attractive (which is why HN is constantly hearing about it) but practically very expensive in terms of investment/return, in the sense that formal verification requires formal modeling and so forth in order to exist in the first place, and that requires actual design lockdown, and suddenly everything got Complicated™ :)

You could possibly get away with an FPGA-targeting compiler (low-relevance related conversation: https://news.ycombinator.com/item?id=17151024) set up in a stripped-down form, but you'd need to lock it down a bit tighter than https://www.corsix.org/content/malicious-luajit-bytecode (a bit old, but illustrative of a point), and then the question would be whether that would negate the performance benefit of the FPGA in the first place.

The thread I linked the comment from had some interesting info on FPGA vs GPU (eg, https://news.ycombinator.com/item?id=17150985)

These kind of optimisations will become more common as we near the end of Moore's law.

Applications are open for YC Summer 2019

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact