Amazon Elastic File System Update – Sub-Millisecond Read Latency (amazon.com)
135 points by tosh on Feb 15, 2022 | hide | past | favorite | 86 comments

> EFS file systems now provide average latency as low as 600 microseconds for the majority of read operations on data and metadata.

I do not want to detract from how awesome this is at all. It's a great improvement and worthy of celebration.

That said, I spent long enough at Amazon to know that they have the metrics around this, before and after, to a surprising level of detail. When Amazon gives a single piece of data like that, it makes me worry that there is some nuance to the improvement. Show me the p50, p90, p99, and p99.99 as well as the average.

Remember, if reading 1 byte is improved 10x on average, but one in every 10,000 reads is worsened by 10x, then 10 KB file reads might now take ~10x longer (if all reads were in parallel, etc).
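That fan-out effect is easy to check with back-of-envelope math. A quick sketch (all numbers are illustrative, matching the hypothetical above, not actual EFS figures):

```python
# Illustrative only: how a rare slow read dominates a parallel fan-out.
# Assumed numbers: the common case improves to 0.1 ms, but 1 in 10,000
# reads regresses to 10 ms.
p_slow = 1 / 10_000
fast, slow = 0.1, 10.0   # milliseconds

def expected_max_latency(n_parallel_reads: int) -> float:
    """Expected completion time of n reads issued in parallel:
    the batch finishes only when its slowest read does."""
    p_all_fast = (1 - p_slow) ** n_parallel_reads
    return p_all_fast * fast + (1 - p_all_fast) * slow

for n in (1, 100, 10_000):
    print(n, round(expected_max_latency(n), 3))
```

With 10,000 parallel reads, the expected batch latency climbs from ~0.1 ms to over 6 ms, which is exactly why the tail percentiles matter more than the average here.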

(PMT on the EFS team).

Yes, the wordings are carefully formulated as they have to be signed off by the AWS legal team for obvious reasons. With that said, this update was driven by profiling real applications and addressing the most common operations, so the benefits are real. For example, a simple WordPress "hello world" is now about 2x as fast as before.

> For example, a simple WordPress "hello world" is now about 2x as fast as before.

Before seeing this announcement, someone at work today literally said "the wordpress site is feeling very fast today, I wonder if anything has changed?" Great work!

If you don't mind me asking, why do you run Wordpress off of EFS rather than EBS?

EFS allows you to connect to the filesystem from multiple web front-end machines. Doing the same with EBS would require you to manage some sort of clustered filesystem, or run a separate storage tier server.

So you are using it for shared files and user-generated files, not database? Instead of S3?

Wordpress Plugins are stored both in the filesystem and in the database and need to match. Same for the Wordpress Core when an update occurs as WP updates the PHP files locally for the upgrade process. The AWS guidance for a multi-webserver Wordpress setup has included EFS for this reason for years now. You can store uploaded media in S3, but that is a performance optimization that doesn't solve the previous issues.

So you mean the plugins' code, I see, thanks!

Not OP but the obvious use case for Wordpress is the install directory itself. Deploy to one server and have all servers sharing the EFS get the update at the same time.

So PHP rereads the source files on every request? Interesting.

There's a cache. Even if you have to restart some process or purge some cache, this setup makes sense to me for applications that can update their own code (e.g. install/update plugins).

Current SLA is only on uptime: https://aws.amazon.com/efs/sla/

For people thinking about EFS: the backup story is currently not that great. You may see a few days of recovery time for a volume with a few TBs.

If you have a reasonable number of clients connecting (<20) and do not need access from Fargate or Lambda, then FSx (ONTAP, OpenZFS) is a more well-rounded offering.

Sorry to hear that. We don't see this internally, but if you put your email in your profile I'd be happy to reach out to you (not sure how else to start the communication chain, unfortunately). SDM on the team that owns backup.

I’m not sure how come you don’t see this internally. Before we moved off EFS, we were simply never able to back up our drive - it would time out after 7 days for an incremental backup, and we reported this several times in tickets.

I think we had about 300 TB in that drive. Which is a lot, sure, but never being able to take a backup made us pretty sad.

On a project we had to deal with some proprietary software that was highly optimized for its CPU-bound steady-state workload but suffered from terrible startup times due to single-threaded, blocking IO to EFS. Part of the workaround was a wrapper that preopened files on multiple threads, issued readaheads, and then rewrote the file paths to point to /proc/<pid>/fd/<number> instead of the original paths on EFS. That alone cut load times from 6 minutes to under 30 seconds.
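For illustration, a minimal sketch of that kind of wrapper (names and the thread count are made up, error handling is omitted, and it's Linux-only; the real wrapper sat in front of proprietary software):

```python
# Hedged sketch: preopen files on a thread pool so EFS round trips
# overlap, hint the kernel to read ahead, and hand back
# /proc/self/fd/<n> paths so the single-threaded consumer never
# opens the slow EFS paths itself.
import os
from concurrent.futures import ThreadPoolExecutor

def preopen(path: str) -> str:
    fd = os.open(path, os.O_RDONLY)
    # Ask the kernel to start fetching the file contents now.
    os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_WILLNEED)
    return f"/proc/self/fd/{fd}"

def preopen_all(paths, workers=32):
    """Map each original path to a /proc/self/fd path, opening
    many files concurrently."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(paths, pool.map(preopen, paths)))
```

The key point is that dozens of opens are in flight at once, so the per-open EFS latency is paid in parallel rather than serially.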

Lower latency storage helps, but applications not doing things the most naive way possible would be great too. Spawn a few threads if you have thousands of files to open. Issuing IO operations in batches to io_uring would help too, but libraries need to catch up first.

"Spawn a few threads if you have thousands of files to open." You've gone so deep into the trees that you can't see the forest. Creating a workaround wrapper or spawning unnecessary threads is a bad idea. You want async io or io_uring.

“Proprietary software” suggests the commenter did not have the source code or resources to do that. Sometimes you just have to make the best hack you can.

I was under the impression that Linux does not support async file access and so people have to fake it with threads.

This is no longer true since the arrival of io_uring, and that's why the person you replied to referenced it.

Definitely impressive. However, EFS's performance-oriented pricing is confusing. It seems to be quite expensive if we ask for throughput comparable with EBS: 100 MB/s costs $600/month. I’m not sure why the pricing for a distributed file system is so much higher than on the block storage side. For example, this makes running a DB on top of EFS very expensive for non-trivial workloads.

Hi! Unless you need to drive that 100 MB/s continuously, you can use the bursting throughput mode, which means you don't have to separately provision and pay for throughput.

If you do need to drive 100 MB/s continuously, then Provisioned Throughput at the price that you mention would be appropriate. But note that you get 50 MB/s of throughput included with every TB of storage, and you only pay for the provisioned throughput in excess of what's included in your storage.
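Concretely, that billing rule works out like this (a sketch based on the numbers in this thread; the function name is mine, and consult the pricing page for the actual rates):

```python
def billable_provisioned_mbps(storage_tb: float, provisioned_mbps: float) -> float:
    """MB/s you actually pay for under Provisioned Throughput:
    the provisioned rate minus the 50 MB/s included with each TB
    of storage, floored at zero."""
    included = 50.0 * storage_tb
    return max(0.0, provisioned_mbps - included)

# 1 TB stored, 100 MB/s provisioned -> you pay for only 50 MB/s.
print(billable_provisioned_mbps(1, 100))
# 3 TB stored already includes 150 MB/s -> nothing extra to pay.
print(billable_provisioned_mbps(3, 100))
```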

Thanks for the reply. Relying on bursting for a core service like a db would be too much risk - it would make sense for batch compute or similar workloads.

From a systems viewpoint, I can’t understand if there’s an actual technical challenge involved or if this pricing is just a lack of competitive pressure. Is there some corner case about the POSIX guarantees of EFS that makes higher throughput challenging to achieve?

(Otherwise one could just spray file blocks across many servers and maintain a metadata mapping of file to blocks to achieve good throughput)

You should probably not run a database off of EFS.

It's not best tailored for the type of concurrent access that DBs will throw at it.

What kind of database do you want to run on EFS? Have you considered built-for-purpose managed services instead, such as RDS?

I'm thinking of a ClickHouse database which files are on EFS. Clickhouse is usualy much faster than RDS/Redshift for the use cases is designed for, and having EFS will ease a lot replication efforts.


EFS can scale to levels beyond what EBS is capable of.

EFS is replicated across multiple AZs.

EFS supports multiple clients, as opposed to one on EBS.

> EFS can scale to levels beyond what EBS is capable of.

While technically true, the gap isn't large and it has enormous caveats. EFS can scale to a "whopping" 5x the aggregate throughput of a single gp3 EBS volume (5GB/sec EFS vs. 1GB/sec gp3). It requires at least 10 clients to reach that aggregate throughput. It barely beats a single io2 Block Express EBS volume (5GB/sec EFS vs. 4GB/sec io2 BE), and the io2 Block Express volume can do it with a single client. Reaching that 5GB/sec with EFS costs $30,000/month for either 100TB of data or 5GB/sec in provisioned bandwidth. Good luck!
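Sanity-checking the client math (the 500 MB/s per-client cap is inferred from the "at least 10 clients for 5 GB/s" figure above, not an official number):

```python
# Clients needed to saturate an aggregate throughput limit, given an
# assumed per-client cap.
import math

def clients_needed(aggregate_gbps: float, per_client_gbps: float) -> int:
    return math.ceil(aggregate_gbps / per_client_gbps)

print(clients_needed(5.0, 0.5))  # EFS: 5 GB/s at ~0.5 GB/s per client
print(clients_needed(4.0, 4.0))  # io2 Block Express: one client suffices
```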

I wonder what you all are using EFS for? I guess for larger models in ML (like NLP transformer stuff) it makes sense. I have observed myself moving everything I could to a DB, and if I can't, mostly to S3. But for both there will be latency pulling the data.

I think there's very little "proper" usage for EFS. That is to say: if you had infinite time and dev resources, you'd pick something else because other options are faster and cheaper. But it is very easy to set up EFS and start using it since it's just a filesystem. The time it takes to migrate from "my code deals with local files" to "my code deals with networked files" with EFS is about 15 minutes regardless of what you're doing with the files. S3 can't say that. I think EFS's value proposition is more about that convenience than it is about actually being an appropriate tech choice when considered in a vacuum.

I ran a Dovecot IMAP server using EFS for a while, before moving it to a leased dedicated server (i.e. away from AWS) with local NVMe storage. And that move did make it noticeably faster in addition to reducing the cost.

> I wonder what you all are using EFS for?

Quick and easy way to store the user uploads of a CMS like Drupal. Deploy the application itself via either Beanstalk, EKS/ECS or on an EC2 server, point the database to RDS and mount the sites/xxx/files directory from EFS.

One of the most impressive parts of the post is this:

> We "flipped the switch" and enabled this performance boost for all existing EFS General Purpose mode file systems over the course of the last few weeks

Surely it wasn't as simple as flipping "[x] Make filesystem slower than it could be" to unchecked.

It would be cool to read up on how they did this. I'm guessing lots of folks are writing a lot of data to the existing old set up, does anyone have any guesses on how they hot swapped that to the new set up with no downtime or data inconsistencies?

I'm the PMT for this project in the EFS team. The "flip the switch" part was indeed one of the harder parts to get right. Happy to share some limited details. The performance improvement builds on a distributed consistent cache. You can enable such a cache in multiple steps. First you deploy the software across the entire stack that supports the caching protocol, but it's disabled by configuration. Then you turn it on for the multiple components that are involved, in the right order. Another thing that was hard to get right was to ensure that there are no performance regressions due to the consistency protocol.

Shameless plug, only one I promise. If you think this is cool, we are hiring for multiple PMT and SDE positions, which can be fully remote at the Sr. level and above. DM me for details, see [1], or see amazon.jobs and search for EFS.

[1] https://www.amazon.jobs/en/jobs/1935130/senior-software-deve...

EDIT: public link

Thank you for calling out on-call responsibilities in your job listing. Too many job listings today fail to mention that _very significant_ responsibility.

I enjoy working with distributed storage systems, but I don't think I will ever carry a pager for one again. I wish the industry could figure out how to separate designing and building such systems, from giving up your nights and weekends to operate them.

Separating design and build from operate is antithetical to Amazon. It isn’t a “figure out” for a lot of companies including Amazon — it’s very intentional and seemingly unlikely to change. They’ve observed that they create a stronger culture of ownership (which then drives getting things fixed faster and more empathy for the customers) through having the builders also be the operators.

Still needs supportive management: there are teams at Amazon who have time to fix everything which paged them at anti-social hours, and there are teams which don’t prioritize beyond minding the SLA of their COE Action Items, and more silently accrue operational debt and page people more often. Tricky balance to be sure.

Even the ‘SRE’ or ‘PE’ approaches you see at Google and Meta don’t obviate the need for development teams to have on-call rotations. At least in “BigTech” where teams operate services instead of shipping shrink-wrapped software it’s becoming rare to NOT see some on-call responsibility with engineering roles (including management). I suppose it isn’t just on-call, and the other big change in BigTech of the last decade was the somewhat widespread elimination of QA teams and SDET roles, and the merger of those responsibilities into the feature/service teams, and to SDE.

There's different schools of thought around this and I certainly understand your perspective. At AWS, carrying a pager at limited times (in our team, 2-3 weeks per quarter as mentioned in the link) is considered an important part of our culture of operating at-scale services. In our team, we try to minimize oncall burden as much as possible by investing in automation, and only alarm if the system really doesn't know what to do. We have a separate planning bucket for burden reduction every quarter.

Another interesting thing to mention is that as an SDE you're not the only one who has oncall duties. In our team at least, PMTs are also oncall for about the same time. This creates a good dynamic as everyone is incentivized to minimize the oncall burden.

Being on call aligns incentives. If what you design and build is someone else's problem to operate, then it will be operated less well.

Isn't that the idea behind separating out the SRE (site reliability engineer) role from software engineering?

Sort of. Many teams in FAANG put their devs on rotations that aren't full on-call like SRE (and some managers put their devs into full SRE rotations without mentioning there is a bonus). I always check with my future managers that they don't plan to do this.

Haha well aware as a current on call SDE at one of them!

I watched the SNIA presentation from SDC2020 on EFS and it described each extent in the file system as a state machine replicated via multi-paxos.

It seems possible to implement this feature via a time-based leader lease on the extent, where a read request goes to a read-through cache. The cache will store some metadata (e.g. a hash or a version number) of the block it is trying to validate from cache and send that to the leader along with the read request.

If the leader has the same version number for a block as was requested by the cache, the block itself does not need to be transferred and the other replicas don't need to be contacted. If you place the cache in the same AZ as the EC2 instance reading the file system, 600 µs sounds viable.

Am I on the right track? :)
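As a toy model, the validation step I'm imagining looks roughly like this (all names hypothetical, purely my guess, nothing to do with actual EFS internals):

```python
# Toy model of version-validated read-through caching: the AZ-local
# cache sends its cached version to the extent leader; on a match the
# leader confirms validity without transferring the block.
class ExtentLeader:
    def __init__(self):
        self.version = 0
        self.block = b""

    def read(self, cached_version):
        """Return (valid, block); block is None on a version match,
        so no data crosses the wire."""
        if cached_version == self.version:
            return True, None
        return False, self.block

class ReadThroughCache:
    def __init__(self, leader):
        self.leader = leader
        self.version = -1   # nothing cached yet
        self.block = None

    def read(self):
        valid, block = self.leader.read(self.version)
        if not valid:                       # miss or stale: refill
            self.block = block
            self.version = self.leader.version
        return self.block
```

The first read transfers the block; every subsequent read only exchanges a version number with the leader until a write bumps it.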

> Am I on the right track? :)

The caches are local to each AZ so you get the low latency in each AZ, the other details are different. Unfortunately I can't share additional details at this moment, but we are looking to do a technical update on EFS at some point soon, maybe at a similar venue!

Sounds good! The SNIA presentation was very interesting.

You shared an internal link, aside from this, great work!

CDO salutes you.

This sounds a lot more interesting than I was expecting. I assumed you just swapped to better network hardware. Will there be more info at some point?

I thought they were already running top-of-the-line stuff across their whole network; it's the only way to scale big, isn't it?

Are there targets for what percentage of an efs filesystem’s reads can be satisfied by this cache?

NFS workloads are typically metadata heavy and highly correlated in time, so you can achieve very high hit rates. I can't share any specific numbers unfortunately.

This reminds me of a 10x-developer coworker at my previous company. Whenever more performance was needed from the system he would just say "sure, just give me a week to remove the sleep() calls from the code".

See it's one of those things where you can't be sure if it's sarcasm or not; I've read plenty of anecdotes where it was in fact a case of putting in sleep() on purpose (either to remove it later to show off, or because customers were complaining things went too fast), or leaving it in by accident.

Another variant is that older games (think cartridges, some CDs) had a block of unused data, so that when the game approached completion and they had to crunch to make it fit on the cartridge, they had some space left over. Probably the same for RAM.

I've done this.

I had to add sleep calls because the client had a simple resource consumption monitor that would complain if anything used more than 80% of the machine's CPU. Arguably a stupid policy; arguably simple enough to work.

Some time later, the client wanted the script to finish faster...so the sleep calls were deleted, and they changed their resource monitor.
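A minimal version of that duty-cycle trick (a sketch, not the original script):

```python
# Throttle a batch job by sleeping proportionally after each chunk of
# work, so average CPU use stays under a target fraction.
import time

def throttled(work_chunks, cpu_target=0.8):
    for chunk in work_chunks:
        start = time.monotonic()
        chunk()                              # one unit of real work
        busy = time.monotonic() - start
        # busy / (busy + idle) == cpu_target  =>  idle = busy*(1/t - 1)
        time.sleep(busy * (1 / cpu_target - 1))
```

Deleting the `time.sleep` line is the whole "make it faster" change; the resource monitor sees a lower duty cycle while it's in.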

I mean, as an infra person I can kinda agree with this. I wouldn't put it in the app, and would have used cgroups, but I don't want to starve OS services of cycles.

Most of the latency comes from the network layer. My naive guess is they probably switched from a standard Ethernet setup to an InfiniBand setup to achieve 600 µs of total latency.

This is good to see. I didn't realize just how much this latency could affect me until I recently was doing some fine tuning on an NLP model on Sagemaker and noticed my training loss was looking pretty rough when I used a working directory that was in EFS as opposed to on my EC2 instance. Testing it out, performance was dramatically improved when I moved everything to the directly attached storage, so I've just been avoiding EFS since then during training.

For ML training, where you repeatedly open the same files over and over again, the NFS 'close to open' coherency protocol (where every open() and close() is a round trip) is not a good fit. Once we support NFS delegations this will become a lot faster, but a local training data set will very likely continue to be faster (we have ideas on this though!). Given that ML training uses very expensive instance types, a local file system is the way to go for now.

ML inference on the other hand is a use case that we've specifically targeted with this update, and it should work very well. Many of our customers do inference using AWS Lambda, directly connected to their EFS file system.
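For the training case, the usual workaround is to stage the data locally once before the training loop (a common pattern, not an EFS-specific API; paths are made up):

```python
# Copy the training set from an EFS mount to instance-local storage
# once, so per-epoch reads never pay the NFS open/close round trips.
import shutil
from pathlib import Path

def stage_dataset(efs_dir: str, local_dir: str) -> Path:
    """Stage the dataset off EFS; reuse the copy if already staged."""
    dst = Path(local_dir)
    if not dst.exists():
        shutil.copytree(efs_dir, dst)
    return dst

# Usage (hypothetical paths):
#   data_dir = stage_dataset("/mnt/efs/dataset", "/local/nvme/dataset")
```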

Appreciate the insight, that's really good to know! It's one of those things that's been bugging me every once in a while when I think back to it.

Also, does today's update apply to GovCloud regions as well or just Commercial for now?

> Also, does today's update apply to GovCloud regions as well or just Commercial for now?

It also applies to GovCloud.

That's awesome, thank you!

Could you use EFS to host a horizontally scalable relational database by using finely sharded SQLite dbs? Like if you had 1 db per user, for example.

I do something similar, but test out the performance before you commit to it. There is a massive chasm between how EFS is marketed and how it actually performs. EFS is the slowest possible way to store data in AWS, with painfully low per-client and aggregate throughput. We implemented EFS because it was easy and then immediately commenced the project to replace it with S3. The bare EFS solution didn't survive long enough to make it to the S3 project's finish; we had to add a layer of caching instances in EC2 to bring EFS throughput up to an acceptable level. We almost needed two layers of caching instances; i.e. EFS fans out to N caching instances, which then fan out to M caching instances, which then fan out to P workers, P >> M >> N. Because the EFS throughput is so poor, it could barely cope with the fan-out to the one layer of caching instances. Fortunately the S3 migration project finished before we got that far.

This announcement is, of course, the result of adding a caching layer on top of EFS. Naturally. But because they don't mention "throughput" and only mention "latency" I'm betting they have not used the cache layer to increase throughput.

I'll only ever recommend EFS if your data is very small and the throughput requirements are negligible.

We had to move 300 TB of EFS data into S3 in 40 hours, and this really showed how poor EFS performance can be.

The bottleneck was metadata, which caps out at about ~40 MB/s. Raw read speed is great, but any metadata ops cap out quite hard.

We had to hack together a special NFS client to list the contents of the drive using as few metadata operations as possible, then have a separate step to copy the data.
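The core trick was just minimizing metadata round trips. In stdlib Python the closest analogue is os.scandir, which can return entry types (and, over NFS with READDIRPLUS, attributes) from the directory listing itself rather than issuing a separate GETATTR per file (a sketch, not our actual client):

```python
# Walk a tree using scandir so entry type checks come from the
# directory read instead of a stat() per entry.
import os

def list_tree(root):
    stack, files = [root], []
    while stack:
        with os.scandir(stack.pop()) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                else:
                    size = entry.stat(follow_symlinks=False).st_size
                    files.append((entry.path, size))
    return files
```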

> Could you use EFS to host a horizontally scalable relational database by using finely sharded SQLite dbs? Like if you had 1 db per user, for example.

This is actually a more common use case than I had imagined just one year ago. It would work today, and there are a few optimizations we're doing short term on file locking that will make this much better. If you reach out to AWS support or your TAM we can share more information and time lines under NDA.

I actually recently delivered a project which uses almost this exact technique. Python lambdas that operate on SQLite files have the benefit of being much simpler and cheaper than most other scalable database solutions (like Aurora) for very light loads.

That said, accessing SQLite databases is surprisingly disk-IO heavy. I haven't gone too deeply into measuring the effect, but it seems the core issue is that traditional RDBMS wire protocols are better than SQLite's disk accesses wrapped over NFS (or whatever connection the Lambda/EFS join is). Stuff especially starts to break down when you need any sort of concurrent access. The small overhead for locking/unlocking files can quickly become awful when multiplied by the EFS latency, so you really do need extremely fine sharding.
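A sketch of the db-per-user layout (paths and schema are made up): each request touches exactly one small file, so each lock/unlock round trip on EFS contends with only that user's traffic.

```python
# One SQLite file per user on a shared (e.g. EFS) mount.
import sqlite3
from pathlib import Path

def user_db(base_dir: str, user_id: str) -> sqlite3.Connection:
    path = Path(base_dir) / f"{user_id}.db"
    # Generous busy timeout to ride out network-filesystem lock latency.
    conn = sqlite3.connect(path, timeout=30)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS notes (id INTEGER PRIMARY KEY, body TEXT)"
    )
    return conn

# Usage (hypothetical mount point):
#   conn = user_db("/mnt/efs/tenants", "alice")
```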

I've wanted to use "SQLite db per user" for awhile now but haven't had the right problem domain for it yet. https://engineering.backtrace.io/2021-12-02-verneuil-s3-back... looks really interesting.

I guess one could consider using Amazon FSx for Lustre: https://docs.aws.amazon.com/fsx/latest/LustreGuide/what-is.h...

You could also consider using Amazon FSx for NetApp ONTAP.

Yeah, I think you could. They have a Lambda adapter as well, so you’d get nearly infinite scale straight away.

Is it me, or is "sub millisecond" actually quite unimpressive for an 'all-in-the-same-building-on-ssd' filesystem?

When my consumer flash ssd can typically provide data to a read request inside 50 microseconds, I really expect a big hosted filesystem to have a median latency in the 100 microsecond range.

I.e., I expect Amazon to have the majority of filesystem metadata in RAM, such that a single read request from the user translates to a single SSD read at a known block address. That data is then sent back across the 10 Gbit Ethernet network to the machine doing the request, adding 1.3 microseconds per network switch.
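Putting those assumptions into a budget (all numbers are this comment's guesses, not measured EFS internals):

```python
# Back-of-envelope latency budget for a metadata-in-RAM read path.
ssd_read_us = 50        # consumer-SSD read, per the comment
switch_us = 1.3         # per 10GbE switch hop, per the comment
hops = 6                # assumed switch traversals, round trip
nic_and_kernel_us = 20  # assumed host stack overhead, both sides total

total = ssd_read_us + hops * switch_us + nic_and_kernel_us
print(f"~{total:.0f} us")  # comfortably under the announced 600 us
```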

Do a ping test between two machines in your local network. Turns out it’s very hard to get sub millisecond speeds. Especially sub 500 microseconds.

The kernel will buffer and queue packets instead of sending them out ASAP, so there's delay added at the host level and the networking layer. Then you've got the speed of light traversing that distance. All this adds latency.

But something like a filesystem should probably be built on RDMA to get the kernel mostly out-of-the-way. Then you should see performance like this:


And the speed of light is 1000 feet per microsecond, so even the largest data centers won't add more than a handful of microseconds.

> Turns out it’s very hard to get sub millisecond speeds. Especially sub 500 microseconds.

I have to contest this a bit. The company I work for doesn't put much effort into latency from what I can tell, and sees ~200us between servers in the same data center, and ~400us to another data center 12 miles away.

Uh… no. The PCIe bus is about 900 ns on each end, and a switch adds 300 ns. You can get under 3 µs ping-pong; you’re off by orders of magnitude.

Your sense is probably coming from bad OS stacks. Kernel bypass like Amazon uses achieves this easily.

I just put a crossover cable between a Mac mini and a random Linux box and with no tuning the median ping is 180 microseconds.

Now add some routing and switching...

Or is there no routing between the storage and the application servers in a data center?

1.3 microseconds per switch is typical on 10GbE Ethernet. I doubt your message goes through more than 10 switches even in Amazon's data centers.

RTTs are extremely low in EC2, and I doubt EC2 is wasting any fibers on a mere 10GbE link.

It's not all in the same building though

Recently moved a system off of EFS because of the latency. This update might have prevented that.

We had to make extensive use of fscache to get reasonable performance for shared data for web applications.

I wonder if fscache is part of the solution on the backend.

Quick plug for my open source project nfsping for measuring NFS latency:


Hi, thanks for the link! I just tried this on EFS but it doesn't give the expected answer. The program uses NFS NULL calls. I'm guessing we haven't optimized those yet given that they are typically only used at mount time, but we'll have a look to see what's going on.

One of the most common operations on NFS are OPEN and GETATTR. The following example measures the latency of OPEN (GETATTR would be similar):

  $ python3 -m timeit -s 'import os' 'os.open("/efs/test.bin", os.O_CREAT|os.O_RDONLY)'
  500 loops, best of 5: 554 usec per loop
Results will vary a bit depending on network routing and the region.

Yes the NULL RPC does nothing and should give a baseline for measuring the round trip time. I'm surprised that it doesn't give the expected results. Your test case is going through the kernel, not sending raw RPCs across the wire. There are a lot of moving parts in the filesystem layer.

I implemented direct RPC calls for other RPCs like GETATTR, but only for NFS version 3, and EFS uses version 4. Unfortunately the design of version 4 is a disaster from a protocol perspective. The designers wanted to enable multiple operations in a single RPC, but that's not supported by SUNRPC. So to maintain backwards compatibility they only implemented 2 procedures, NULL and COMPOUND. Everything inside a COMPOUND RPC is opaque. It's possible to construct these by hand, but not with the standard RPC toolchain, so I never got around to it.

Is there an updated version of latency.txt? Would be interesting to compare against.

> Is there an updated version of latency.txt? Would be interesting to compare against.

Our updated docs [1] contains a latency table. The official answer is that previously our read latencies were low single-digit millisecond and now they are as low as 600 microseconds.

[1] https://docs.aws.amazon.com/efs/latest/ug/performance.html

Does anyone have any benchmarks with AWS FSX (LustreFS) vs EFS - how do they compare in random/sequential access?

There’s a durability difference.

“Amazon EFS is designed to provide 99.999999999% (11 9’s) of durability over a given year”

Whereas FSx for Lustre's durability and EBS single-volume durability aren't anywhere near as high.

I suspect, but cannot confirm, that Lustre can support higher single-instance and aggregate throughput.
