Investigating Linux phantom disk reads (questdb.io)
302 points by kamaraju on May 2, 2023 | 57 comments



I am going to write this comment with a large preface: I don't think it is ever helpful to be an absolutist. For every best practice/"right way" to do things, there are circumstances when doing it another way makes sense. There can be a ton of reasons for that, be it technical, money/time, etc. The best engineering teams aren't those that just blindly follow what others say is a best practice, but those that understand the options and make an informed choice. None of the following comment is commentary on QuestDB; as they mention in the article, many databases use similar tools.

With that said, after reading the first paragraph I immediately searched the article for "mmap" and had a good sense of where the rest of this was going. Put simply, it is just really hard to reason about what the OS is going to do in all situations when using mmap. Based on my experience, I would guess that a ton of people reading this comment have hit issues that, I would argue, are due to using mmap. (Particularly looking at you, Prometheus.)

All things told, this is a pretty innocuous incident of mmap causing problems, but I would encourage any aspiring DB engineers to read https://db.cs.cmu.edu/mmap-cidr2022 as it gives a great overview of the range of problems that can occur when using mmap.

I think some would argue that mmap is "fine" for append-only workloads (and it is certainly more reasonable there than for a DB with arbitrary updates), but even then, factors like metadata, a growing number of tables, etc. will eventually bring you up against some fundamental problems with mmap.

The interesting opportunity in my mind, especially with improvements in async IO (both at the FS level and in languages like Rust), is to build higher-level abstractions that bring the "simplicity" of mmap, but with more purpose-built semantics ideal for databases.


> All things told, this is a pretty innocuous incident of mmap causing problems, but I would encourage any aspiring DB engineers to read https://db.cs.cmu.edu/mmap-cidr2022 as it gives a great overview of the range of problems that can occur when using mmap

At first glance (doing a few text finds and a really quick read-through of the paper, after clicking through that intro page with the poop emoji at the top and disregarding "Recommended Music for this Paper: Dr. Dre – High Powered (featuring RBX)"), that paper seems too short to adequately explore the topic.

On the other hand these same guys (the CMU Database Group) are an amazing resource, and their YouTube channel offers some great stuff [1] that would allow a curious person to explore the topic in greater depth if they dug into the papers of everyone who gave presentations at CMU.

clickbait: How many of the world's leading software engineers who addressed CMU students and professors about their successful database products rely on mmap? The answer may surprise you.

[1] https://www.youtube.com/@CMUDatabaseGroup


I wrote a response to that paper, because it makes a bad comparison.

https://ravendb.net/articles/re-are-you-sure-you-want-to-use...

Agree on CMU being a great resource.


Thanks. I didn't know about this, but I read the stuff you wrote about lmdb back in the day.


> disregarding Recommended Music for this Paper: Dr. Dre – High Powered (featuring RBX))

Why?

> that paper seems too short to adequately explore the topic.

The paper was published in CIDR (https://www.cidrdb.org). The paper submissions for this conference are meant to be short (typically 6-7 pages) to ensure that people can get their ideas out quickly.


That's really good to know - thanks for clarifying, and for the research you're sharing.


When I read they were using mmap I immediately thought of Andy Pavlo, since he warns against it every chance he gets in the CMU videos. Enough that I thought it was somewhat of a consensus now that mmap should be avoided, especially for databases; guess I was wrong.


> a consensus now that mmap should be avoided, especially for databases

Maybe, but table 1 in Andy Pavlo's paper shows 7 of 10 databases surveyed do still use mmap.

Furthermore, that paper clearly demonstrating the issues with mmap only came out in 2022, and most databases have been around longer than that.

That mmap isn't the future is maybe more certain than that mmap isn't common practice today (because it does sorta seem to be).


My impression is that databases have known of the shortcomings of mmap since, like, the 90s. A critical flaw is the lack of error handling -- in addition to the unpredictable performance characteristics and naive caching. I'm looking forward to reading the paper.


> Enough that I thought it was somewhat of a consensus now that mmap should be avoided, especially for databases

I have to remind myself regularly that the world is larger than it seems.

The other day someone on r/rust said, "Surely everyone knows what the Rust programming language is by now. Can we stop introducing Rust every time we mention it in a paper?" But no, obviously everyone in your circles knows what Rust is. But you don't know most developers. And I wouldn't be surprised if fewer than half of working developers have heard of Rust, even if everyone you know knows about it.

I used to rant against using socket.io at every possible opportunity. The library has (had) crazy bugs in its reconnection code. In the right circumstances the library would violate ordering and delivery guarantees, or it would lie about messages being received when they hadn't been. But no matter how much I ranted about it, and no matter how many hundreds of issues there were on GitHub, far more people used socket.io than the (much more reliable) alternatives, because socket.io had a pretty website, good documentation, and it was taught at coding bootcamps. I think the only reason it's not as popular now is that you don't need it now that WebSockets are available everywhere.

My partner says she imagines asking questions of our families when she tries to imagine what the average person thinks. But our immediate families are still a really weird bubble - every single one of the adults has graduated from college. (And weirdly, over half of that group have also taught at college.) That's still a really biased set of people. Finding an unbiased set is wildly difficult.


>I used to rant against using socket.io at every possible opportunity. The library has (had) crazy bugs in its reconnection code. In the right circumstances the library would violate ordering and delivery guarantees, or it would lie about messages being received when they hadn't been. But no matter how much I ranted about it, and no matter how many hundreds of issues there were on GitHub, far more people used socket.io than the (much more reliable) alternatives, because socket.io had a pretty website, good documentation, and it was taught at coding bootcamps. I think the only reason it's not as popular now is that you don't need it now that WebSockets are available everywhere.

That is true. So nowadays the modern browsers, the latest Chrome/Firefox (both desktop and mobile), support WebSockets seamlessly? I guess socket.io is kinda like the jQuery of WebSockets then? It will take some time to phase out.


> So nowadays the modern browsers, the latest Chrome/Firefox (both desktop and mobile), support WebSockets seamlessly?

Yes. And this has been true for nearly a decade. The jQuery analogy is exactly right. These days socket.io is simply an overcomplicated wrapper around a standard browser feature (WebSockets). Just like jQuery, it will probably hang around as long as we're alive through sheer stupid inertia.


mmap should generally be avoided, not just for databases. It's useful for quick prototyping and for the specific case of demand-paging executables (which is really what it's made for!), but there are so many pitfalls overall. You can't mmap files that are large relative to your memory (you crash into either address space limits or PTE memory usage), you have absolutely no hope of recovering from errors, you can't do large sequential I/O reliably, ordering your writes is a really difficult problem, and so on.

There are corner cases where it's great, like when you have a file that you know is 90% in-core already and you don't care about errors. But overall, read() and write() are simpler, faster, more reliable primitives.
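
As a small illustration of the error-recovery point, here is a minimal sketch (hypothetical file path, 4 KiB buffer) of why read()/write() are easier to make robust: a failed pread() comes back as a return value you can check, while the same I/O error through an mmap'd pointer arrives asynchronously as a SIGBUS, with no good place to handle it.

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("/tmp/example.dat", O_RDONLY);   /* hypothetical file */
        if (fd < 0) { perror("open"); return 1; }

        char buf[4096];
        ssize_t n = pread(fd, buf, sizeof buf, 0);
        if (n < 0) {
            /* An I/O error surfaces here as -1/errno and can be handled;
             * through an mmap'd pointer the same failure would arrive
             * later as a SIGBUS on first touch of the bad page. */
            perror("pread");
            close(fd);
            return 1;
        }
        printf("read %zd bytes\n", n);
        close(fd);
        return 0;
    }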


Here is a good use case for mmap: process A performs data processing and writes results to disk (or tmpfs), and then you need to read them repeatedly in process B (or read them once, but not sequentially). If you use read() you will: 1. double RAM usage - the file will be in the page cache anyway (unless you use O_DIRECT, which would make this pipeline slower), and without mmap() you have to keep a second copy inside process B; 2. add an unnecessary kernel-to-userspace copy while reading data in process B.

But for saving the data from process A I would still use write() with MAXPHYS-sized blocks: I'm not sure mmap would use the optimal write block size, and with write() it is easier to detect errors (like ENOSPC).
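
A minimal sketch of the process B side under those assumptions (hypothetical tmpfs path, no real processing): a read-only MAP_SHARED mapping serves the data straight out of the page cache that process A already populated, with no second userspace copy and no extra kernel-to-userspace copy per read.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("/dev/shm/results.bin", O_RDONLY);   /* hypothetical file written by process A */
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

        /* Read-only shared mapping: pages come straight from the page cache. */
        const char *data = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (data == MAP_FAILED) { perror("mmap"); return 1; }

        /* Non-sequential access is just pointer arithmetic; touch one byte per page. */
        long sum = 0;
        for (off_t i = 0; i < st.st_size; i += 4096)
            sum += data[i];
        printf("touched %lld bytes, checksum %ld\n", (long long)st.st_size, sum);

        munmap((void *)data, st.st_size);
        close(fd);
        return 0;
    }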


That's a good use case for… a quite regular pipe? :-)


A pipe would not allow the receiving process to read the data more than once, or to jump around in the data from location to location (without storing it either in memory or on the FS).

Another good use case for mmap is sharing a read-only (or rarely updated) dataset among multiple processes.


So a DBMS probably has intimate knowledge about its data that probably can't be hinted via madvise.

But I wonder when there are decent reasons to let the OS handle file I/O through the demand paging system, since it's good at it.


> since it's good at it.

This is generous. At least, it's not a great working assumption. If you know anything about your workload, it's often possible to do better by specializing slightly.


Going through mmap for bulk-ingest sucks because the kernel has to fault in the contents to make what's in-core reflect what's on-disk before your write access to the mapped memory occurs. It's basically a read-modify-write pattern even when all you intended to do was write the entire page.

When you just use a write call you provide a unit of arbitrary size, and if you've done your homework that size is a multiple of the page size and the offset is page-aligned. Then there's no need for the kernel to load anything in for the written pages; you're providing everything in the single call. Then you go down the O_DIRECT rabbit hole every fast Linux database has historically gone down.
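
A minimal sketch of that write path (hypothetical file on a disk-backed filesystem, 4 KiB pages assumed): the buffer, length, and offset are all page-aligned, so the kernel can accept the write without faulting in the existing page contents first. Drop the O_DIRECT flag to keep going through the page cache; with it you also bypass the cache, which is the rabbit hole mentioned above.

    #define _GNU_SOURCE            /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define PAGE_SIZE 4096

    int main(void) {
        /* Hypothetical path; tmpfs does not support O_DIRECT, so use a disk-backed FS. */
        int fd = open("/var/tmp/bulk.dat", O_WRONLY | O_CREAT | O_DIRECT, 0644);
        if (fd < 0) { perror("open"); return 1; }

        /* O_DIRECT requires a page-aligned buffer, length, and file offset. */
        void *buf;
        if (posix_memalign(&buf, PAGE_SIZE, 16 * PAGE_SIZE) != 0) return 1;
        memset(buf, 'x', 16 * PAGE_SIZE);

        ssize_t n = pwrite(fd, buf, 16 * PAGE_SIZE, 0);   /* whole pages, aligned offset */
        if (n < 0) perror("pwrite");
        else printf("wrote %zd bytes without a read-modify-write\n", n);

        free(buf);
        close(fd);
        return 0;
    }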


For (sequentially) writing a file this is true. But a database might be more complex, as the underlying I/O is transparent to the user. The database needs a way to indicate that "this ingestion will produce a large bulk of sequential writes".


It is not always read-modify-write. There is no evidence of this pattern on Ubuntu when there is no memory pressure. The merge occurs when a block is partially updated after the kernel has lost the state of the block, which can happen under memory pressure.


Sorry, try again.

On x86, and I think every architecture, when you write to a memory mapping that is not already backed by a writable page, the kernel is notified that user code is trying to write. And the kernel needs to fill in the contents of the page, which requires a read if the page isn’t already loaded.

It has to be this way! The write could be a read-modify-write instruction. Or it could be a plain store, but I’ve never heard of hardware with write-only memory with fine enough granularity to make this work.

The sole exception is if the page in question is all zeros and the kernel can know this without needing to read the file. This might sometimes be the case for an append-only database. I don’t know exactly what QuestDB does.

Also:

> As soon as you mmap a file, the kernel allocates page table entries (PTEs) for the virtual memory to reserve an address range for your file,

Nope. It just makes a record of the existence of the mapping. This is called a VMA in Linux. No PTEs are created unless you set MAP_POPULATE.

> but it doesn't read the file contents at this point. The actual data is read into the page when you access the allocated memory, i.e. start reading (LOAD instruction in x86) or writing (STORE instruction in x86) the memory.

What are these LOAD and STORE instructions in x86? There are architectures reasonably described as load-store architectures, and x86 isn’t one of them.


> On x86, and I think every architecture, when you write to a memory mapping that is not already backed by a writable page, the kernel is notified that user code is trying to write. And the kernel needs to fill in the contents of the page, which requires a read if the page isn’t already loaded.

This is very true. Perhaps there wasn't enough context in the article about what it is describing. The read problem started to occur on a database that is subject to a constant write workload. Data is flowing in all the time at a variable rate. Typically blocks are "hot" and get filled fully within seconds, if not milliseconds.

Zeroing the file is an option to try. QuestDB allocates disk with `posix_fallocate()`, which doesn't have the required flags. We would need to explore `fallocate()`. Thanks.
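
For reference, a minimal sketch of the Linux-specific call being discussed (hypothetical column file name and a 16 MB chunk, as mentioned later in the thread; this is not QuestDB's actual code). FALLOC_FL_ZERO_RANGE is the kind of flag posix_fallocate() doesn't expose:

    #define _GNU_SOURCE            /* fallocate() and FALLOC_FL_* are Linux-specific */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("/tmp/table.col", O_RDWR | O_CREAT, 0644);   /* hypothetical column file */
        if (fd < 0) { perror("open"); return 1; }

        /* Reserve 16 MB as a zeroed range: the extent is marked unwritten,
         * so a later readahead into it does not need to hit the disk. */
        if (fallocate(fd, FALLOC_FL_ZERO_RANGE, 0, 16 * 1024 * 1024) < 0)
            perror("fallocate");   /* e.g. EOPNOTSUPP on filesystems without support */

        close(fd);
        return 0;
    }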


If you are allocating zeroed space and then writing a whole page quickly, then you may well avoid a read. And doing this under extreme memory pressure will indeed cause the page to be written to disk while only partly written by user code, and it will need to be read back in. Reducing memory pressure is always good.

I would expect quite a bit better performance if you actually write entire pages using pwrite(2) or the io_uring equivalent, though.

Messing with fallocate on a per-page basis is asking for trouble. It changes file metadata, and I expect that to hurt performance.


Great, we are on the same page! `fallocate()` (or the POSIX one) is called on large swathes of the file, 16 MB by default. Not often enough to hurt performance. I wonder if zeroing the file with `fallocate()` will result in actual disk writes, or is it ephemeral?


fallocate will cause writes, possibly on a delay


I’m assuming the LOAD and STORE were not actual instructions but euphemisms for the variety of operations that touch memory.


> What are these LOAD and STORE instructions in x86?

The RISC core of every x86 since PPro is a load/store machine.


Those instructions still aren't exposed to the user, are they?


Doesn't matter. The core doesn't see the x86 instructions either. MOVs are decoded into simpler operations to facilitate reordering.


Good luck decoding LOCK CMPXCHG to loads and stores.

In any case, my actual objection was to “LOAD instruction in x86”. It makes no sense.


Am I the only one surprised to read that this database relies on periodic flushing (every 30s by default) with no manual syncs at all? I guess it’s metrics so 30s of data loss is fine? I dunno about that. Data loss is usually due to a power failure, and the metrics collected right before a power failure are important.


I was surprised by this too. I asked the author about it on Twitter [0]. At the very least it seems like fsync is something you can opt into in their configuration, even if it's not the default.

https://twitter.com/eatonphil/status/1653373246929027075


Nice. Glad you asked.

From a developer:

“As long as the OS & the HW doesn't crash, the data is safe thanks to the page cache”

This is so strange to me. A database that is non-durable by default. OK…


A few years ago I was doing some consulting for a medical tech startup run by an ex-doctor. They were thinking of using mongodb until I explained how it had a reputation for losing data. I'll never forget the look of horror, disgust and confusion on his face. He turned to me and said "A database that forgets things!? Why would anyone want that??".

I still don't have an answer for him. It sounds just as strange to me too.


MongoDB is one of the most successful open-source databases of all time. The parent company is a listed company worth $15BN, 3x more than Elastic, to put that in perspective.

This reflection [1] came from the founders of RethinkDB, a competitor of MongoDB at the time:

"It turned out that correctness, simplicity of the interface, and consistency are the wrong metrics of goodness for most users. The majority of users wanted these three trade-offs instead:

- A use case. We set out to build a good database system, but users wanted a good way to do X (e.g. a good way to store JSON documents from hapi, a good way to store and analyze logs, a good way to create reports, etc.).

- Timely arrival. They wanted the product to actually exist when they needed it, not three years later.

- Palpable speed [...]. MongoDB mastered these workloads brilliantly, while we fought the losing battle of educating the market."

MongoDB narrowed things down to a specific use case, and became the best for that use case. This comes with trade-offs. MongoDB was probably not the best database for healthcare back in the day, but that is OK. It did the job very well for other use cases and industries. And over time, they fixed the issues around losing data and became more stable. Essentially, they made developers feel like superheroes, improved their product over time, and eventually grabbed a massive market share.

[1] https://www.defmacro.org/2017/01/18/why-rethinkdb-failed.htm...


> MongoDB is one of the most successful open-source databases of all time.

It used to be open source. It's not anymore

> The parent company is a listed company and worth $15BN, 3x more than Elastic to put some perspective.

That's purely a capitalistic argument and makes no difference to whether the product is any good. For example, there's plenty of "churches" that are richer than MongoDB Inc. and absolutely abhorrent and evil.

> This comes with trade-offs.

The only thing that required the trade-off of data loss was cheating in benchmarks in order to hoodwink naive potential users into using their dangerous product. MongoDB Inc. has always preferred to lie to their users. It is not a database company; it's a marketing company with a product they label as a database. And that's a smart way to make money, sure, because of vendor lock-in, but it's not a smart way to gain trust.


Cold comfort to all the companies who believed mongo’s marketing claims and then lost data because of their shoddy engineering. Or the users who had their data stolen because mongo shipped with insecure defaults. (Not entirely mongo’s fault, but they deserve some of the blame).

As engineers we bear responsibility for how our work impacts society. Mongodb may have made their investors a lot of money, but they did sloppy work and didn’t do right by their customers. That’s not a success in my book.


It's like Snapchat, but for databases!


The issue is that fsync is super expensive. Like, it's not even funny.

There are many cases, and ingest is one of them, where not being durable is fine. If you can either:

* Repeat the whole process on failure (which is assumed to be rare)

* Recover from the failure without data corruption (distinct from data loss, mind)

In those cases, being 10x faster is very compelling.

Note that this is about ingest for bulk loads, while online transactions not being durable is a really bad idea.

For bulk load ingest, you can usually retry the whole operation. Not so for transactions.


At least given the append-only nature, the data loss should be bounded.

For some reason, many databases that overwrite data support disabling journaling and/or fsync. E.g. SQLite has "PRAGMA journal_mode = OFF". You can lose the entire database from an ill-timed crash, if one important page gets written but another doesn't. To their credit, it's not the default, and the documentation is explicit about this:

> If the application crashes in the middle of a transaction when the OFF journaling mode is set, then the database file will very likely go corrupt.

https://www.sqlite.org/pragma.html#pragma_journal_mode
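
(For the curious, a minimal sketch of flipping that pragma from C through the SQLite API, on a hypothetical database file; it is only meant to show where the trade-off is made, not to recommend it.)

    #include <sqlite3.h>
    #include <stdio.h>

    int main(void) {
        sqlite3 *db;
        if (sqlite3_open("example.db", &db) != SQLITE_OK) return 1;   /* hypothetical database */

        /* Disabling the rollback journal trades crash safety for speed:
         * an ill-timed crash can corrupt the whole database file. */
        char *err = NULL;
        if (sqlite3_exec(db, "PRAGMA journal_mode = OFF;", NULL, NULL, &err) != SQLITE_OK) {
            fprintf(stderr, "pragma failed: %s\n", err);
            sqlite3_free(err);
        }

        sqlite3_close(db);
        return 0;
    }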


It supports different commit modes (see [1]): nosync, async, and sync. You can choose the more expensive but safer async or sync modes if you're willing to tolerate higher commit latency.

[1] https://questdb.io/docs/reference/configuration/#cairo-engin...


Yeah it’s just… it’s a database. Weird default.


Because otherwise competitive benchmarks will show not-so-great results against other DBMSes.


That's what ~every "cloud" database does because disks in the cloud are slow.


Seems like using memory-mapped files for a write-only load is a suboptimal choice. Maybe I'm mistaken, but surely using an append-only file handle would be simpler than changing how memory-mapped files are cached, like they did for their solution?


I know sharing ChatGPT/GPT/AI generated text in comments here can be unappealing, but I would like to share this one as I feel that I managed to get ChatGPT-4 to summarize this article using a non-computer analogy pretty well:

"Imagine you work in a library where you store books on shelves. Your primary task is to take new books and put them on the shelves (write-only load). You don't expect to read the books often, so the number of times you need to open and read the books should be minimal.

One day, you notice that several books are being opened and read more often than expected, even though your main task is to put away new books. This is confusing and unexpected, so you start investigating why this is happening.

After some investigation, you find out that the library assistant (the operating system) is trying to be helpful by anticipating which books might be needed next and opening them ahead of time (readahead). This anticipation works well when there is plenty of shelf space (memory) available. However, when the library gets crowded (memory pressure), the assistant starts anticipating the wrong books, causing unnecessary book openings (phantom reads).

To resolve this issue, you tell the library assistant to stop anticipating which books to open (disabling readahead) when you're just putting away new books. This solves the problem and reduces the number of unnecessary book openings. The experience teaches you the importance of understanding how the library assistant works and shows that addressing unexpected issues can lead to improvements in the overall library system."


TLDR; "Ingestion of a high number of column files under memory pressure led to the kernel starting readahead disk read operations, which you wouldn't expect from a write-only load. The rest was as simple as using madvise in our code to disable the readahead in table writers."
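
For anyone wondering what that fix looks like in practice, here is a minimal sketch of the madvise call (hypothetical file name and mapping size; the article doesn't show QuestDB's actual code). MADV_RANDOM tells the kernel not to read ahead for that mapping, so a fault brings in only the touched page:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("/tmp/column.d", O_RDWR);            /* hypothetical column file */
        if (fd < 0) { perror("open"); return 1; }

        size_t len = 16 * 1024 * 1024;                     /* assumed (preallocated) mapping size */
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        /* Disable readahead for this mapping: page faults pull in only
         * the faulting page, not its neighbours. */
        if (madvise(p, len, MADV_RANDOM) < 0)
            perror("madvise");

        /* ... append writes through the mapping would go here ... */

        munmap(p, len);
        close(fd);
        return 0;
    }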


The article kind of dances around it, but AIUI the reason their "write-only load" caused reads (and thus readahead) was that they were writing to a mapped page that had already been evicted - so the kernel was reading/faulting those pages in because it can only write in block/page-sized chunks.

In some sense maybe this could be thought of as readahead in preparation for writing to those pages, which is undesirable in this case.

However, what confused me about this article was: if the data files are append-only, how is there a "next" block to read ahead to? I guess maybe the files are pre-allocated, or the kernel is reading previous pages.


Reading between the lines, it sounds as if they're using mmap. There is no "append" operation on a memory mapping, so the file would need to be preallocated before mapping it.

If the preallocation is done using fallocate or just writing zeros, then by default it's backed by blocks on disk, and readahead must hit the disk since there is data there. On the other hand, preallocating with fallocate using FALLOC_FL_ZERO_RANGE or (often) with ftruncate() will just update the logical file length, and even if readahead is triggered it won't actually hit the disk.


For the case where the file is entirely pre-allocated I understand, but for the file-hole case I'm not sure I understand why you'd get such high disk activity.

If the index block also got evicted from the page cache, then could reading into a file hole still trigger a fault? Or is the "holiness" of a page for a mapping stored in the page table?


I suspect page-sized/aligned file holes could be backed by a read-only zero page via the PTE as an optimization, but they might not be (I'm not as familiar with Linux mmap/filesystems as with FreeBSD).

It is quite possible the filesystem caches, e.g., the file extent tree (including holiness) separately from the backing inode/on-disk sectors for the tree.


Using _ftruncate_ or FALLOC_FL_ZERO_RANGE is a bad idea for a database. The problem is that you may get an out-of-disk-space error mid-operation.

If you are using mmap, that will express itself as a segmentation fault, which you really don't want.

You _need_ to allocate the file ahead of time, so you can behave properly there.


The readahead is a bit of a readaround when I last checked, as in it'll pull in some stuff before the fault as well.

There used to be a sys-wide tunable in /sys to control how large an area readahead would extend to, but I'm not seeing it anymore on this 6.1 laptop. I think there's been some work changing stuff to be more clever in this area in recent years. It used to be interesting to make that value small vs. large and see how things like uncached journalctl (heavy mmap user) were affected in terms of performance vs. IO generated.


The article distinguishes "readaround" from a linear predicted "readahead", but then says the output of blktrace indicates a "potential readahead", which is where I got confused.

Does MADV_RANDOM disable both "readahead" and "readaround"?


There are other methods you can use to increase performance under memory pressure, but you'd end up handling I/O directly and maintaining your own index of memory and disk accesses, page-aligned reads/writes, etc. It would be easier to just require your users to buy more memory, but when there's a hack like this available, that seems preferable to implementing your own VMM and disk I/O subsystem.


> It's also important to note that the above percentages…

Has this article been written using ChatGPT, by any chance?



