I do not want to detract from how awesome this is at all. It's a great improvement and worthy of celebration.
That said, I spent long enough at Amazon to know that they have the metrics around this, before and after, to a surprising level of detail. When Amazon gives a single piece of data like that, it makes me worry that there is some nuance to the improvement. Show me the p50, p90, p99, and p99.99 as well as the average.
Remember, if reading 1 byte is improved 10X on average, but one in every 10,000 bytes is worsened by 10X, then 10 KB file reads might now take ~10X longer (if all the reads are issued in parallel, the slowest one dominates).
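As a toy illustration of why the tail matters (the numbers are made up to match the hypothetical above, not measured):

    import random

    def read_one_byte():
        # 10x faster on average, but 1 in 10,000 reads is 10x slower
        return 10.0 if random.random() < 1e-4 else 0.1

    def read_file(n_bytes=10_000):
        # a parallel file read finishes only when its slowest byte arrives
        return max(read_one_byte() for _ in range(n_bytes))

    slow = sum(read_file() == 10.0 for _ in range(100))
    print(f"{slow} of 100 file reads hit the slow path")  # ~63 expected

With 10,000 independent reads at a 1-in-10,000 slow rate, roughly 1 - (1 - 1e-4)^10000 ≈ 63% of whole-file reads land on the slow path.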
Yes, the wording is carefully formulated, as it has to be signed off by the AWS legal team for obvious reasons. With that said, this update was driven by profiling real applications and addressing the most common operations, so the benefits are real. For example, a simple WordPress "hello world" is now about 2x as fast as before.
Before seeing this announcement, someone at work today literally said "the wordpress site is feeling very fast today, I wonder if anything has changed?" Great work!
If you have a reasonable number of clients connecting (<20) and do not need access from Fargate or Lambda, then FSx (ONTAP, OpenZFS) is a more well-rounded offering.
I think we had about 300 TB in that drive. Which is a lot, sure, but never being able to take a backup made us pretty sad.
Lower latency storage helps, but applications not doing things in the most naive way possible would be great too. Spawn a few threads if you have thousands of files to open. Issuing IO operations in batches to io_uring would help too, but libraries need to catch up first.
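A minimal sketch of the thread-pool version (io_uring would be the batched equivalent once libraries support it):

    from concurrent.futures import ThreadPoolExecutor

    def read_file(path):
        # each call is a blocking NFS round trip
        with open(path, "rb") as f:
            return f.read()

    def read_many(paths, workers=64):
        # overlapping the round trips hides the per-file latency
        # instead of paying it serially, one file at a time
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(read_file, paths))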
If you do need to drive 100 MB/s continuously, then Provisioned Throughput at the price that you mention would be appropriate. But note that you get 50 MB/s of throughput included with every TB of storage, and you only pay for the provisioned throughput in excess of what's included in your storage.
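A worked example of how that nets out (numbers illustrative):

    stored_tb = 1.0
    target_mb_s = 100.0

    included_mb_s = 50.0 * stored_tb              # 50 MB/s per TB stored
    billable_mb_s = max(0.0, target_mb_s - included_mb_s)
    print(f"included: {included_mb_s} MB/s, billed: {billable_mb_s} MB/s")
    # included: 50.0 MB/s, billed: 50.0 MB/s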
From a systems viewpoint, I can’t understand if there’s an actual technical challenge involved or if this pricing is just a lack of competitive pressure. Is there some corner case about the POSIX guarantees of EFS that makes higher throughput challenging to achieve?
(Otherwise one could just spray file blocks across many servers and maintain a metadata mapping of file to blocks to achieve good throughput)
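A toy sketch of that striping idea, with made-up names and sizes:

    BLOCK_SIZE = 1 << 20  # 1 MiB blocks
    SERVERS = ["s1", "s2", "s3", "s4"]

    def place_blocks(file_id, file_size):
        """Metadata mapping: block index -> server holding that block."""
        n_blocks = (file_size + BLOCK_SIZE - 1) // BLOCK_SIZE
        return {i: SERVERS[hash((file_id, i)) % len(SERVERS)]
                for i in range(n_blocks)}

    # a 5 MiB file fans out across (up to) four servers, readable in parallel
    print(place_blocks("fileA", 5 * BLOCK_SIZE))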
It isn't well suited to the kind of concurrent access that DBs will throw at it.
EFS is replicated across multiple AZs.
EFS supports multiple clients, as opposed to just one on EBS.
While technically true, the gap isn't large and it has enormous caveats. EFS can scale to a "whopping" 5x the aggregate throughput of a single gp3 EBS volume (5GB/sec EFS vs. 1GB/sec gp3). It requires at least 10 clients to reach that aggregate throughput. It barely beats a single io2 Block Express EBS volume (5GB/sec EFS vs. 4GB/sec io2 BE), and the io2 Block Express volume can do it with a single client. Reaching that 5GB/sec with EFS costs $30,000/month for either 100TB of data or 5GB/sec in provisioned bandwidth. Good luck!
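As a back-of-the-envelope check on those dollar figures, assuming the commonly cited us-east-1 rates (roughly $0.30/GB-month for standard storage and $6 per MB/s-month for provisioned throughput; treat both as illustrative, not a quote):

    storage_cost = 100_000 * 0.30          # 100 TB of data, in GB
    provisioned_cost = 5_000 * 6.00        # 5 GB/s provisioned, in MB/s
    print(storage_cost, provisioned_cost)  # 30000.0 30000.0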
Quick and easy way to store the user uploads of a CMS like Drupal. Deploy the application itself via either Beanstalk, EKS/ECS or on an EC2 server, point the database to RDS and mount the sites/xxx/files directory from EFS.
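For the EFS part, mounting is a plain NFSv4 mount; the filesystem ID, region, and target directory below are placeholders:

    $ sudo mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport \
        fs-0123456789abcdef0.efs.us-east-1.amazonaws.com:/ /var/www/html/sites/default/files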
> We "flipped the switch" and enabled this performance boost for all existing EFS General Purpose mode file systems over the course of the last few weeks
Surely it wasn't as simple as flipping "[x] Make filesystem slower than it could be" to unchecked.
It would be cool to read up on how they did this. I'm guessing lots of folks were writing a lot of data to the existing setup; does anyone have any guesses on how they hot-swapped it to the new one with no downtime or data inconsistencies?
Shameless plug, only one I promise. If you think this is cool, we are hiring for multiple PMT and SDE positions, which can be fully remote at the Sr. level and above. DM me for details, or go to amazon.jobs and search for EFS.
EDIT: public link
I enjoy working with distributed storage systems, but I don't think I will ever carry a pager for one again. I wish the industry could figure out how to separate designing and building such systems, from giving up your nights and weekends to operate them.
Still needs supportive management: there are teams at Amazon who have time to fix everything which paged them at anti-social hours, and there are teams which don’t prioritize beyond minding the SLA of their COE action items, silently accruing operational debt and paging people more often. Tricky balance, to be sure.
Even the ‘SRE’ or ‘PE’ approaches you see at Google and Meta don’t obviate the need for development teams to have on-call rotations. At least in “BigTech”, where teams operate services instead of shipping shrink-wrapped software, it’s becoming rare NOT to see some on-call responsibility attached to engineering roles (including management). And it isn’t just on-call: the other big change in BigTech over the last decade was the fairly widespread elimination of QA teams and SDET roles, with those responsibilities merged into the feature/service teams and the SDE role.
Another interesting thing to mention is that as an SDE you're not the only one with on-call duties. On our team at least, PMTs are also on call for about the same amount of time. This creates a good dynamic, as everyone is incentivized to minimize the on-call burden.
It seems possible to implement this feature via a time-based leader lease on the extent, where a read request goes to a read-through cache. The cache stores some metadata (e.g. a hash or a version number) of the block it is trying to validate and sends that to the leader along with the read request.
If the leader has the same version number for a block as was requested by the cache, the block itself does not need to be transferred and the other replicas don't need to be contacted. If you place the cache in the same AZ as the EC2 instance reading the file system, 600us sounds viable.
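A minimal sketch of what I have in mind (all names are mine, not EFS internals):

    class ExtentLeader:
        """Authoritative copy; assumed to hold a time-based lease so it
        can answer version checks without contacting the other replicas."""
        def __init__(self):
            self.blocks = {}  # block_id -> (version, data)

        def write(self, block_id, data):
            version = self.blocks.get(block_id, (0, b""))[0] + 1
            self.blocks[block_id] = (version, data)

        def read(self, block_id, cached_version):
            version, data = self.blocks[block_id]
            if version == cached_version:
                return version, None  # cache is current: no payload on the wire
            return version, data      # cache is stale: ship the block

    class AZLocalCache:
        """Read-through cache placed in the same AZ as the client."""
        def __init__(self, leader):
            self.leader = leader
            self.entries = {}  # block_id -> (version, data)

        def read(self, block_id):
            cached = self.entries.get(block_id)
            version, data = self.leader.read(block_id,
                                             cached[0] if cached else None)
            if data is None:
                data = cached[1]  # leader confirmed our copy is current
            self.entries[block_id] = (version, data)
            return data

    leader = ExtentLeader()
    leader.write("blk-0", b"hello")
    cache = AZLocalCache(leader)
    cache.read("blk-0")  # first read ships the data
    cache.read("blk-0")  # revalidation only: version matches, no payload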
Am I on the right track? :)
The caches are local to each AZ so you get the low latency in each AZ, the other details are different. Unfortunately I can't share additional details at this moment, but we are looking to do a technical update on EFS at some point soon, maybe at a similar venue!
CDO salutes you.
Another variant is that older games (think cartridges, some CDs) deliberately shipped with a block of unused data, so that when the game approached completion and they had to crunch to make it fit on the cartridge, they had some space left over to reclaim. Probably the same for RAM.
I had to add sleep calls because the client had a simple resource consumption monitor that would complain if anything used more than 80% of the machine's CPU. Arguably a stupid policy; arguably simple enough to work.
Some time later, the client wanted the script to finish faster...so the sleep calls were deleted, and they changed their resource monitor.
ML inference on the other hand is a use case that we've specifically targeted with this update, and it should work very well. Many of our customers do inference using AWS Lambda, directly connected to their EFS file system.
Also, does today's update apply to GovCloud regions as well or just Commercial for now?
It also applies to GovCloud.
This announcement is, of course, the result of adding a caching layer on top of EFS. Naturally. But because they don't mention "throughput" and only mention "latency" I'm betting they have not used the cache layer to increase throughput.
I'll only ever recommend EFS if your data is very small and the throughput requirements are negligible.
The reason was metadata only, which caps out at about ~40 MB/s. Raw read speed is great, but any metadata ops cap out quite hard.
We had to hack together a special NFS client to list the contents of the drive using as few metadata operations as possible, then have a separate step to copy the data.
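Not the client itself, but a rough illustration of the idea in plain Python: os.scandir hands back each entry's cached dirent information from the directory listing, instead of the listdir-plus-stat pattern that issues a metadata round trip per file.

    import os

    def cheap_listing(root):
        with os.scandir(root) as it:
            for entry in it:
                # entry.is_file() uses the dirent data where possible,
                # avoiding an extra per-file metadata RPC
                if entry.is_file(follow_symlinks=False):
                    yield entry.name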
This is actually a more common use case than I had imagined just one year ago. It would work today, and there are a few optimizations we're doing short term on file locking that will make this much better. If you reach out to AWS support or your TAM we can share more information and time lines under NDA.
That said, accessing SQLite databases is surprisingly disk-IO heavy. I haven't gone too deeply into measuring the effect, but it seems the core issue is that traditional RDBMS wire protocols fare better than SQLite's disk accesses wrapped over NFS (or whatever connection joins Lambda and EFS). Things especially start to break down when you need any sort of concurrent access. The small overhead of locking/unlocking files can quickly become awful when multiplied by the EFS latency, so you really do need extremely fine sharding.
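A quick way to see the per-transaction cost (the path is hypothetical; compare the same loop against a local-disk copy of the DB):

    import sqlite3, time

    conn = sqlite3.connect("/efs/bench.db")  # hypothetical EFS-mounted path
    conn.execute("CREATE TABLE IF NOT EXISTS t (x INTEGER)")

    start = time.perf_counter()
    for i in range(100):
        with conn:  # one lock/commit cycle per iteration
            conn.execute("INSERT INTO t VALUES (?)", (i,))
    print(f"{(time.perf_counter() - start) / 100 * 1e3:.2f} ms per write txn")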
When my consumer flash SSD can typically serve a read request inside 50 microseconds, I really expect a big hosted filesystem to have a median latency in the 100 microsecond range.
I.e., I expect Amazon to keep the majority of filesystem metadata in RAM, such that a single read request from the user translates to a single SSD read at a known block address. That data is then sent back across the 10 Gbit Ethernet network to the requesting machine, adding 1.3 microseconds per network switch.
The kernel will buffer and queue packets rather than sending them out ASAP, so there's delay added at the host level and the networking layer. Then you've got the speed of light traversing that distance. All of this adds latency.
And the speed of light is 1000 feet per microsecond, so even the largest data centers won't add more than a handful of microseconds.
I have to contest this a bit. The company I work for doesn't put much effort into latency from what I can tell, and sees ~200us between servers in the same data center, and ~400us to another data center 12 miles away.
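Using the parent's ~1,000 ft/us figure, the 12-mile hop alone accounts for a fair slice of that:

    one_way_ft = 12 * 5280
    rtt_us = 2 * one_way_ft / 1000  # ~127 us in vacuum; fiber is ~30% slower
    print(f"propagation round trip: ~{rtt_us:.0f} us of the ~400 us observed")

So propagation physics explains maybe a third to a half of the 400 us; the rest is stack and switching overhead.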
Your sense is probably coming from bad OS stacks. Kernel bypass like Amazon uses achieves this easily.
Or is there no routing between the storage and the application servers in a data center?
We had to make extensive use of fscache to get reasonable performance for shared data for web applications.
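For anyone curious, client-side fscache on an NFS mount just needs the cachefilesd daemon running and the fsc mount option (the filesystem ID and mount point below are placeholders):

    $ sudo systemctl start cachefilesd
    $ sudo mount -t nfs4 -o fsc,nfsvers=4.1 fs-0123456789abcdef0.efs.us-east-1.amazonaws.com:/ /mnt/efs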
I wonder if fscache is part of the solution on the backend.
Two of the most common operations on NFS are OPEN and GETATTR. The following example measures the latency of OPEN (GETATTR would be similar):
$ python3 -m timeit -s 'import os' 'os.open("/efs/test.bin", os.O_CREAT|os.O_RDONLY)'
500 loops, best of 5: 554 usec per loop
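GETATTR can be approximated the same way with os.stat, though the kernel's attribute cache may serve repeated calls without a network round trip, so take any number it prints with a grain of salt:

    $ python3 -m timeit -s 'import os' 'os.stat("/efs/test.bin")'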
I implemented direct RPC calls for other RPCs like GETATTR, but only for NFS version 3, and EFS uses version 4. Unfortunately the design of version 4 is a disaster from a protocol perspective. The designers wanted to enable multiple operations in a single RPC, but that's not supported by SUNRPC. So to maintain backwards compatibility they implemented only 2 procedures, NULL and COMPOUND. Everything inside a COMPOUND RPC is opaque. It's possible to construct one by hand, just not with the standard RPC toolchain, so I never got around to it.
Our updated docs contain a latency table. The official answer is that previously our read latencies were low single-digit milliseconds, and now they are as low as 600 microseconds.
“Amazon EFS is designed to provide 99.999999999% (11 9’s) of durability over a given year”
Whereas FSx for Lustre's durability and a single EBS volume's durability aren't anywhere near as high.
I suspect, but cannot confirm, that Lustre can support higher single-instance and aggregate throughput.