M2dir: Treating mails as files without going crazy

ksherlock · 2024-05-23T14:46:38.000000Z

BeOS stored mail as individual files with extended attributes holding the subject, date, sender, etc. The email app (BeMail) was used to view/compose/send email but inbox management was handled by Tracker (BeOS version of Macintosh Finder or Windows Explorer). But the Tracker window was configured to display the extended attributes instead of the file names. The actual filename wasn't even displayed. Nobody went crazy.

Example: https://birdhouse.org/beos/refugee/bemail.jpg via https://birdhouse.org/beos/refugee/ (which has some other images of Tracker organization with extended attributes)

jauntywundrkind · 2024-05-23T15:06:42.000000Z

Pushing concerns up to the OS level is the one thing we could be doing in so many places, but haven't really tried in decades. Should we use a universal format/protocol agnostic way of having data attached to files? Naaaahhhh. /s

ryandrake · 2024-05-23T22:20:54.000000Z

I'm generally a fan of taking advantage of the filesystem, especially when your application is just... storing and viewing files. It irrationally upsets me when an application grafts its own "Library" on top of my perfectly working filesystem, requiring me to import my files into an artificial thing that is just like a filesystem.

On the other hand, extended attributes and other filesystem-specific features could be problematic if you want to share files with other operating systems. If I copy a file to a FAT32 formatted SDCard, I need to worry about what might not copy over.

gryn · 2024-05-23T23:30:57.000000Z

the problem, as you said, is that the common denominator between all filesystems currently in use is to practically forget that metadata exists, because you're only one filesystem transfer away from it disappearing (copy from one hdd to another, send to cloud, transfer with a usb, etc ...)

the other problem is that the filesystem/desktop environment closely entangles 2 concepts that imho should've been more orthogonal:

- data storage + indexing for apps (a common unified KV store would've been a more general abstraction that other kinds of abstractions can be built upon; it would be nice if you could define your own indexing instead of what's done with folders)
- data/information access for users in the "desktop environment"

TeMPOraL · 2024-05-24T15:32:27.000000Z

Isn't Android an example of disentangling those concepts? Apps have their own folders, isolated from each other, to store data in, but then they present a completely different view of that data to the user. The user is discouraged and often prevented from working with files and the filesystem; instead, they're made to work with opaque apps.

I really, really hate this design.

gryn · 2024-05-24T16:30:44.000000Z

yes you're right. but that's one instance of doing so, not the only way. I dislike it too since they put an emphasis solely on the app and not on the data.

what I had in mind was just idle thoughts about providing disentangled primitives that could be used to build other things on.

for example the primary key for accessing a file/object could be (computer_id?, storage_id or partition_id, object_id/inode) + ways to define different kinds of indexes based on your use cases.

instead of just making apps into silos you can have the things built on top of these primitives be typed structured data objects + API/interface. have an Object explorer, and programs can declare that they are able to display or manipulate custom data Type X. you can then have GUIs be composable the same way the pipe operator works in the cli.

you can define a regular filesystem on top of these primitives, a relational database, a tag system, or something new altogether. if you don't want folders you wouldn't have to deal with them.
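A minimal sketch of what such primitives might look like, purely for illustration: objects addressed by a composite key, with user-defined indexes layered on top. Every name here (`ObjectKey`, `ObjectStore`, `add_index`, `lookup`) is hypothetical and not taken from any existing system.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ObjectKey:
    """Hypothetical composite primary key: machine, storage, object."""
    computer_id: str
    storage_id: str
    object_id: int


class ObjectStore:
    """Toy in-memory store; a real one would persist objects somewhere."""

    def __init__(self):
        self._objects = {}      # ObjectKey -> object (a plain dict here)
        self._extractors = {}   # index name -> function(object) -> value
        self._indexes = {}      # index name -> {value: set of ObjectKey}

    def add_index(self, name, extractor):
        """Register a user-defined index and backfill it from stored objects."""
        self._extractors[name] = extractor
        self._indexes[name] = defaultdict(set)
        for key, obj in self._objects.items():
            self._indexes[name][extractor(obj)].add(key)

    def put(self, key, obj):
        """Store an object and keep every registered index up to date."""
        old = self._objects.get(key)
        for name, extractor in self._extractors.items():
            if old is not None:
                self._indexes[name][extractor(old)].discard(key)
            self._indexes[name][extractor(obj)].add(key)
        self._objects[key] = obj

    def get(self, key):
        return self._objects[key]

    def lookup(self, index_name, value):
        """Return the keys of all objects whose indexed value matches."""
        return set(self._indexes[index_name].get(value, ()))
```

A folder hierarchy, a tag system or a "by sender" view would each be just another extractor function registered with `add_index`.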

the work on Fuchsia OS seems to explore something along these lines (BlobFs + MinFs + Components). (https://fuchsia.dev/fuchsia-src/concepts/components/v2/intro...) Pharo/Smalltalk also seems to explore ideas akin to this. (https://pharo.org/)

to be fair the current state of affairs is similar enough with file extensions + mime info if you squint hard enough and pretend that app and systems folders files don't exist but it's held with pinky promises.

rchard2scout · 2024-05-24T07:29:36.000000Z

>common unified KV store

So, something like the Windows Registry?

hobs · 2024-05-24T11:38:09.000000Z

Well, I was going to say "limited key size" but it looks like in current windows versions the size is "available memory" so... yeah.

rakoo · 2024-05-24T09:34:23.000000Z

The windows registry is more a unified configuration file for the whole system, I think what GP talked about is more about a general store for data

bobthecowboy · 2024-05-23T18:01:08.000000Z

My gut reaction to this was "isn't that just sqlite"?

I don't think this is what you were thinking of, but I do kind of love the idea of formalizing sqlite file formats where the "metadata" is standardized and the "file" is stored inside. Like a file format for a recipe, or a picture, or ...

brirec · 2024-05-23T18:05:22.000000Z

Isn’t that just a container format, like what video and audio files have used for decades?

I don’t know of any existing container formats with support for a relational DB as one of the embedded streams, but the whole point of container formats is that you can add arbitrary metadata, which of course can be a whole database.

Of course, the way BeOS does what OP is talking about is by having many DB columns within the filesystem itself! (The filesystem is a queryable database).

bobthecowboy · 2024-05-23T18:44:24.000000Z

Yes, I totally get the distinction (and I was among those amazed by BeOS back in the day - I still show the old demo videos to friends who haven't seen it). I hadn't considered the container formats used by media, but in my head it would be the other way around - each file would be a sqlite file first so that they all share some commonality around access and inspection (I'm assuming in my ignorance that the media container formats are different).

Are there any database filesystems today? I haven't really looked, but the last one I heard of was the one that MS abandoned years ago. Actually I suppose Haiku probably still has one? I can't imagine how difficult it would be to get a DB Filesystem as a mainstream choice on Linux, let alone across OSen.

pjerem · 2024-05-24T07:21:53.000000Z

If you want something more tangible than old demos, try HaikuOS. It works wonderfully from a usb drive.

I’m too young to have known BeOS (well, I was a kid in the nineties so not too young, but afaik BeOS was pretty rare, overall and at home). However, I’m old enough to have known OSes that were built around offline usage, and that’s what I loved about trying Haiku: it reminds me of when your OS was made to use your computer, not to be an internet client.

I feel that having your emails as files is a good example of that: you connect to the internet to get your mails. You disconnect. You want to work on those mails on another computer? No problem, just copy them to a USB d… I mean floppy disk, answer your mails, put the answers on your floppy disk and send them tonight when you’re back home.

It may feel pretty cumbersome when we have today’s tools but that’s the feeling I feel I lost : owning my data not only legally but physically. And not only physically but physically in a useful way.

It reminds me of the time when you just had to understand simple abstractions like files and folders and windows to own the computer (and you were just one programming language away from mastering it).

p_l · 2024-05-24T09:34:47.000000Z

Every filesystem is by definition a database system.

Out of extant systems, the closest to BeFS outside of Haiku is NTFS as implemented in Windows. In fact, you can run pretty much all of the BeOS behaviors on NT since ~1994 or so, it's an issue of programs not using it. Part of that is allegiance of user applications to Classic Windows-compatible APIs.[1] Part of the "WinFS" efforts was to break with the old approaches totally and push more indexed/searchable APIs etc. but in the end all we have is pretty robust internal search engine that is sadly underused (just like the extended attributes support). It really doesn't help that Explorer.exe is in many ways ridiculously outdated, with Windows95/98 peeking out from various corners when you look deeper into how it acts.[2]

Then ZFS but the ZPL/DMU APIs do not include indexing layer IIRC (also on systems that use Irix-style xattr APIs you lose full scope of resource forks).[3]

Both OS/2 (with HPFS) and OS X do some work on integrating metadata into the filesystem, with various levels of usage and end-user accessibility.

And of course there's some level of integration in AmigaOS Workbench and .info files, but that's arguably the most niche by now and never evolved to this level of use.

[1] Know the regular posts about how you can't create a file named "CON:" or "COM1:" etc in Windows? In Windows NT you actually can, but a) the only way to do it "safely" is to use alternate NTFS namespace b) I bet most people have never heard there was more than one namespace c) Win32 applications will usually only see Windows95 LFN-compatible one (in two versions, UCS and ASCII) unless they go out of their way to get access to other namespaces

[2] It's not the most egregious though - at least explorer.exe internally uses paths that work with default APIs of the system. In 2021 I ended up having to dig out an AppleScript for converting MacOS Classic paths to POSIX ones, because it turns out Finder AppleEvents API returned only Classic paths. Or at least neither I nor anyone I could find knew how to get Finder to return a path that wasn't a Classic HFS one.

[3] Irix-style xattr API is limited in capabilities to only add short K/V data to a file. Solaris instead effectively gives you a complete directory attached to a file, while WindowsNT on NTFS treats everything including main content of file as "extended attribute" and opening file as normal is essentially "open the $DATA attribute of the file".

bgro · 2024-05-24T14:49:35.000000Z

I prefer the chaos of devs randomly storing data in appdata, programdata, the program files dir, the x86 program files dir with mostly but not entirely duplicate data, c:/games/game/game and ../, c:/game/game, ~/games, ~/game/game &./game, ~/documents/game, ~/documents/games/game, ~/game saves/games, etc...

TeMPOraL · 2024-05-24T15:25:31.000000Z

Special place in hell for games storing heavy or frequently modified files in user's Documents dir - nowadays, Documents is often a synced folder backed by OneDrive. The amount of wasted processing, bandwidth and IO wear generated by this is tremendous.

jkrejcha · 2024-05-24T17:14:54.000000Z

The one that I've always found odd is everything deciding to dump itself in the user profile directory (this is even something that stuff like VS Code does).

XDG_CONFIG_HOME (|| ~/.config) and friends has been a standard for a long time now on *nix (including macOS) and AppData (née Application Data) has been the standard on Windows for over 20 years at this point.

lmz · 2024-05-24T15:59:47.000000Z

Other than saves, what files do games heavily modify? And if you're complaining about cloud syncing of (auto)saves, I personally think it's a good thing.

TeMPOraL · 2024-05-24T19:11:24.000000Z

One example I experienced recently: Sims 4 uses a subfolder in Documents as a cache for downloaded data and decoded chunks. It creates and deletes files there constantly while the game is running; we're talking dozens of files per minute or more. Few minutes of play, and there's nothing but hundreds of new additions and deletions in the "recent history". Not to mention, all that auto-syncs with any other machine you have online and using the same Microsoft account.

Wrt. Saves, auto-uploading those can be good, but it's unnecessary for games I rent on Steam, which already handles cloud saves on its own.

zbentley · 2024-05-24T14:03:35.000000Z

Files as an interchange format, sure. But as a primary storage system for application structured data they leave a lot to be desired:

- Portability of metadata is lacking (and can be sneakily removed when you least expect it), as other commenters have pointed out.

- Filtering sets of files from out of a (possibly deep) directory hierarchy based on different criteria requires writing a lot of subtly different loops to check metadata. Querying e.g. SQLite handles that part for you once you express what you want, without as much risk of messing up one of those loops.

- Similarly, a schemaful database can prevent your writing incorrectly-shaped (meta)data up front, where filesystems are flexible enough that bad writes may not be noticed until your program tries to read that data back out.

- The accessibility of file-based internal storage systems to human users can sometimes be too high, a la the joke about someone "organizing and renaming things in the win32 folder". Cracking open and messing about with a flat-file all-in-one DB is a higher barrier to screwing around. To be fair, permissions mitigate this risk substantially.

- Intermediate failures with some single-flat-file DBs are much less impactful than with many filesystems. Two parts to this: one is that a more rigid structure in a DB prevents certain invalid writes entirely; the other is transactionality. While plenty of local-flat-file "myapp.library" DBs don't have a good atomicity story underneath (I'm always saddened when I poke at a proprietary data library format and find that it contains a bug-ridden, informally-specified implementation of half of SQLite), and while some file systems make logical atomicity possible to achieve (e.g. via CoW copying data/directories, doing mutations, atomically swapping a source-of-truth link to the new version, and dropping the old), filesystem-as-database systems tend to fail-corrupted often due to unexpected issues (from bugs to "oops, don't have write access on 1/1000 files" to SIGKILL/power loss/drive failure) during data modifications. While I wish more file-based systems were as robust as maildir, I won't hold my breath when SQLite is right there.
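The schema and transaction points above can be sketched with Python's built-in `sqlite3` module. This is an illustrative sketch only: the `messages` table, its columns and the helper names are assumptions made up for the example, not any mail client's real format.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open a toy message store with a schema that rejects bad rows up front."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS messages (
            id      INTEGER PRIMARY KEY,
            sender  TEXT NOT NULL,
            subject TEXT NOT NULL,
            date    TEXT NOT NULL,   -- ISO 8601 date, sorts lexicographically
            flagged INTEGER NOT NULL DEFAULT 0 CHECK (flagged IN (0, 1))
        )
        """
    )
    return conn


def add_messages(conn, rows):
    """Insert a batch atomically: either every row lands or none does."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.executemany(
            "INSERT INTO messages (sender, subject, date, flagged) "
            "VALUES (?, ?, ?, ?)",
            rows,
        )


def flagged_since(conn, iso_date):
    """One declarative query instead of a hand-written directory walk."""
    cur = conn.execute(
        "SELECT sender, subject FROM messages "
        "WHERE flagged = 1 AND date >= ? ORDER BY date",
        (iso_date,),
    )
    return cur.fetchall()
```

A batch containing a row with `flagged = 2` raises `sqlite3.IntegrityError` and leaves the table unchanged, which is the "fail cleanly instead of fail corrupted" behaviour described above.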

pjc50 · 2024-05-24T08:10:52.000000Z

> universal format/protocol agnostic

That's not a "should we", that's a "we can't". Too big a civilization-level project.

tracker1 · 2024-05-23T15:11:37.000000Z

It's a somewhat interesting idea... I've had similar ideas in the past regarding a maildir replacement without resorting to a db file. I like the idea of having directories represent email dirs/folders, though you generally will want some level of aggregation and/or search... I've thought about having a separate eml (header + body) along with a .meta.json file for additional tagging/details (deleted flag, tags, etc).
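A rough sketch of that eml-plus-sidecar layout, using only the Python standard library. The file naming (`<stem>.eml`, `<stem>.meta.json`) and the metadata keys are assumptions for illustration; nothing here is part of an existing spec.

```python
import json
from email import message_from_bytes, policy
from pathlib import Path


def store_message(folder, stem, raw, meta):
    """Write the untouched message as <stem>.eml and tags as <stem>.meta.json."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    (folder / f"{stem}.eml").write_bytes(raw)
    (folder / f"{stem}.meta.json").write_text(
        json.dumps(meta, indent=2), encoding="utf-8"
    )


def load_message(folder, stem):
    """Return (parsed message, metadata dict); a missing sidecar means no tags."""
    folder = Path(folder)
    msg = message_from_bytes(
        (folder / f"{stem}.eml").read_bytes(), policy=policy.default
    )
    meta_path = folder / f"{stem}.meta.json"
    meta = {}
    if meta_path.exists():
        meta = json.loads(meta_path.read_text(encoding="utf-8"))
    return msg, meta
```

Keeping the `.eml` byte-for-byte as received means any other mail tool can still read it, while flags and tags change only the small JSON file.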

Search is a very different story, you wouldn't want to have to do a full directory scan for text based search. So some level of indexing would be useful for a client mail service.

Similarly, I've thought it would be really cool if Cloudflare offered a TCP worker option; you could do a simple mail service backed by R2. The web ui/ux could be pretty awesome and geo distributed.

rakoo · 2024-05-23T16:51:45.000000Z

> Search is a very different story, you wouldn't want to have to do a full directory scan for text based search. So some level of indexing would be useful for a client mail service.

While notmuch and mu exist, I myself use the mblaze suite (https://github.com/leahneukirchen/mblaze) and it's more than enough for me. As a totally unscientific benchmark, it takes 300 ms to find 7 mails out of 24k when searching in headers, 4 seconds when searching in the body.

I myself use a different way: I convert the entire (all 24k of them) list of emails to 1-lines with Sender, Subject, Date, Folder and feed it to fzf which gives me preview as well. The search is then instant; on the given fields only, but I never need more than that. This is my full MUA: https://sr.ht/~rakoo/omail/
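The one-line-per-mail idea can be approximated in a few lines of Python. This is not how omail or mblaze do it; it is a sketch that assumes a hypothetical layout where every subdirectory of `root` is a Maildir, and it leaves encoded (RFC 2047) headers undecoded.

```python
import mailbox
from pathlib import Path


def one_line(value):
    """Collapse folded or multi-space header values onto a single line."""
    return " ".join(str(value or "").split())


def summarize(root):
    """Return 'sender<TAB>subject<TAB>date<TAB>folder' for every message."""
    lines = []
    folders = sorted(p for p in Path(root).iterdir() if (p / "cur").is_dir())
    for folder in folders:
        box = mailbox.Maildir(str(folder), factory=None, create=False)
        for msg in box:
            fields = (msg["From"], msg["Subject"], msg["Date"], folder.name)
            lines.append("\t".join(one_line(f) for f in fields))
    return lines
```

Printing these lines to a file and piping that file into a fuzzy finder gives the "instant search on a few fields" behaviour described above.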

arp242 · 2024-05-23T15:43:55.000000Z

> Search is a very different story, you wouldn't want to have to do a full directory scan for text based search. So some level of indexing would be useful for a client mail service.

I don't know; my ~/code directory has tons of stuff and searching with ripgrep doesn't seem too slow:

  % time rg HelloWorld | wc -l
  4
  rg HelloWorld > /dev/null  0.13s user 0.12s system 99% cpu 0.251 total

  % time rg string | wc -l
  57813
  rg string > /dev/null  0.20s user 0.14s system 99% cpu 0.339 total

Rough estimate of files that rg will search:

  % scc
  ───────────────────────────────────────────────────────────────────────────────
  Language                     Files       Lines     Blanks    Comments      Code
  …
  Total                        11024     1864982     175565      208777   1480640
  ───────────────────────────────────────────────────────────────────────────────

Finding close to 60k matches in 11k files/1.7M lines in about 0.3 seconds isn't too bad.

It should be said I ran a few commands on that directory before the above results, so there's probably some filesystem caching going on, but I can't be bothered to reboot.

For many (not all, obviously) cases I think you may be able to get away without an index. Most people aren't subscribed to tons of email lists and get maybe a few emails a day at the most.

I'd consider anything below ~3 seconds to be fine for search, so this scales to about 100k files/emails. At 10 emails/day on average that's about 27 years (100,000 / 10 = 10,000 days). Most people do not get 10 emails/day on average.

And you can even do some "poor man indexing" by just making a new directory every five or ten years. Most of the time you want just emails from the last year or so.

Arelius · 2024-05-23T16:25:43.000000Z

> Most people do not get 10 emails/day on average.

I'd like to see the stats, but I seem to average around > 40 emails a day, (most are unactionable) but always considered my email load quite light. For people like my wife who do much of their work communication over email, it appears to be much higher.

rakoo · 2024-05-24T09:45:54.000000Z

Ran some stats on my mails of the last 4 years, here are the daily characteristics:

N = 24k Min = 1 Max = 211 Median = 11 Avg = 16.023907 Stddev = 18.312062

A lot of them are actually chat messages through DeltaChat so not representative of usual mail activity. When I remove them I get this:

N = 16k Min = 1 Max = 56 Median = 10 Avg = 11.378933 Stddev = 8.1572529

Arelius · 2024-05-28T00:39:33.000000Z

Sorry, I had meant by "stats" anything about the average among users, since I suspect that you might actually be an outlier on the lower-end among people who work professionally in or with technology.

tracker1 · 2024-05-23T18:47:44.000000Z

I'm also considering a Server/Service that has a web ui component, where it's shared server resources... yeah, running a search on a local ssd/nvme is crazy fast... now do it when there are 100k other users on that filesystem.

aidenn0 · 2024-05-24T16:53:38.000000Z

I get about 32 per day:

  notmuch count date:-100d..-1d
  3248

And I have more than 100k emails:

  notmuch count
  267584

geek_at · 2024-05-23T18:01:57.000000Z

Not 100% related, but I have built OpenTrashmail [1], which gives you the emails in 3 variants: as folders on disk (no DB used), as an RSS feed or as a JSON feed. That satisfied my needs for local management of emails.

[1] https://github.com/HaschekSolutions/opentrashmail

zimpenfish · 2024-05-24T06:44:05.000000Z

> separate

Then you start running into race conditions when updating / reading. The great thing about Maildir is that the updates can be done atomically - saves you a lot of locking and complexity.
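For readers unfamiliar with why Maildir needs no locks, here is a simplified sketch of the delivery step: write the whole message under `tmp/`, then rename it into `new/`. The unique-name scheme below is a stand-in chosen for the example; real implementations follow the Maildir naming rules more carefully and also handle name collisions and directory syncing.

```python
import os
import socket
import time
from pathlib import Path


def deliver(maildir, raw):
    """Write a message so readers never observe a half-written file."""
    maildir = Path(maildir)
    for sub in ("tmp", "new", "cur"):
        (maildir / sub).mkdir(parents=True, exist_ok=True)

    # Simplified unique name (assumption): timestamp, pid and hostname.
    unique = f"{time.time_ns()}.{os.getpid()}.{socket.gethostname()}"
    tmp_path = maildir / "tmp" / unique
    final_path = maildir / "new" / unique

    # "x" mode refuses to overwrite an existing file with the same name.
    with open(tmp_path, "xb") as f:
        f.write(raw)
        f.flush()
        os.fsync(f.fileno())  # make sure the bytes are on disk first

    # On POSIX, a rename within one filesystem is atomic: the message
    # appears in new/ complete, or it does not appear at all.
    os.rename(tmp_path, final_path)
    return final_path
```

With a separate metadata file there are two writes, so a reader can see one updated and the other not; a single rename has no such intermediate state.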

mxuribe · 2024-05-23T17:27:55.000000Z

@tracker1 If i'm not mistaken i think thunderbird and other email clients who support conventional maildir often include a local db (such as sqlite) whose purpose tends to be mostly for helping indexing content to ease some aspects of search. That being said, as others have noted, search mostly tends to be fast enough at the filesystem level. ;-)

tracker1 · 2024-05-24T15:54:16.000000Z

Not in an instance where the file system is remote, and used by thousands of users. Such as via S3/R2 or similar.

mxuribe · 2024-05-24T16:53:55.000000Z

Good point! I was only thinking from the client side/perspective, and not server side. :-)

crtified · 2024-05-23T23:11:04.000000Z

Vaguely related anecdote with no punchline.

Over a decade ago I developed, for our small (small enough to not have any IT dept or IT management) office a bespoke extension for Outlook (yes, bad idea, I know) which translated all incoming emails and attachments into the standard file system, decanted into project folders.

It was triggered upon any opening of an Unread email, and required the user to pick a project from a list, and hit OK. Cancelling was an option (for personal emails).

There was a config tab for the admin to define the filename string, by arranging elements like date/time/to/from/subject/.., and any attachments were also placed as files.

A very imperfect approach, but under the circumstances it was a vast improvement over the prior mess of individual mailboxes bestrewn with all manner of project correspondence and files, which made intricate queries about past doings into frustrating spaghettified detanglements.

And ultimately - perhaps like a good deal of IT - at heart was uninformed management, and the reality of ordinary users with little notion of information management.

inopinatus · 2024-05-24T11:10:25.000000Z

The problem with any mailbox storage standard that is explicitly labelled “do not use this for delivery”, in this case because the author explicitly rejects (and in 2023 openly mocked) the concurrency and crash-resilience demands that Maildir seeks to offer, is that someone will inevitably use it for delivery.

Aloisius · 2024-05-23T18:22:37.000000Z

For MacOS, extracting attachments into files is useful so that Spotlight can index them for search. I believe the same is true for Windows.

Mail.app uses a directory structure that looks similar* to this for, say, Gmail:

    {account-uuid}/[Gmail].mbox/All Mail.mbox/{mailbox-guid}/Data/Messages/{msguid}.partial.emlx

    {account-uuid}/[Gmail].mbox/All Mail.mbox/{mailbox-guid}/Data/Attachments/{msguid}/{mime part #}/{mime subpart #}/filename.ext

The emlx format is a bit different from eml. It contains the number of bytes for the message at the top and an xml plist at the end that has message flags, last viewed time, gmail labels, etc. For partial.emlx files, the base64 content is removed from the email itself and a content length is added.
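Based only on the description above (a byte count on the first line, then the message, then an XML plist), a reader for such a file might look like the sketch below. The exact layout details are assumptions, and the plist keys are returned as found rather than interpreted.

```python
import plistlib
from email import message_from_bytes, policy


def parse_emlx(data):
    """Split emlx-style bytes into (parsed message, plist metadata dict)."""
    # Assumed layout: first line holds the message length in bytes.
    first_line, _, rest = data.partition(b"\n")
    length = int(first_line.decode("ascii").strip())

    raw_message = rest[:length]
    trailer = rest[length:].strip()

    # Whatever follows the message is treated as the XML plist.
    meta = plistlib.loads(trailer) if trailer else {}
    msg = message_from_bytes(raw_message, policy=policy.default)
    return msg, meta
```

For `.partial.emlx` files the attachment bodies have been stripped, so the parsed message would have to be combined with the files under the `Attachments` directory to be complete.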

This format has its drawbacks, of course.

* Not shown is the hierarchy based on message uid used to keep the number of files in the Messages directory down.

graycat · 2024-05-23T16:28:13.000000Z

Been thinking about this subject:

Of course, standard (usual, common) email is just text. Right, for the pictures, to have them just as text, they are encoded as base64. Right, it's MIME (Multipurpose Internet Mail Extensions).

Soooo, okay, my ISP (Internet Service Provider) has an email service. The service is a Web site, and it does offer getting the "Source", that is, the text, all as just one file.

Now, suppose for each email message I send/receive, I keep the text in its own file, with just the text, just as I got it from, say, my ISP. I will handle the file naming, indexing, summarization, etc.

Help!!!! Is there an email program that I can run that, for each of those files, can read it and display it? Sure, it should be able to display the text, as text, that is not one of the MIME extensions but also be able to do the right thing for each of the rest, still images, video clips, audio, whatever. Know of such a program???? Thanks!

Gormo · 2024-05-24T21:08:40.000000Z

Pretty much any standard mail client should be able to do what you are describing. Thunderbird can open raw email files and display them as though they were received via POP3 or IMAP.

graycat · 2024-05-27T03:30:01.000000Z

Thanks. I have a copy of Thunderbird. Maybe tried it, but should try it again. Of course, if the email has the HTML, JS, etc. of a complicated Web page, Thunderbird would have to have full Web browser functionality or pass the HTML part to a Web browser.

Maybe I could just write some simple code to read the actual email, find the Web page part, and send it to Firefox.
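Such "simple code" could be quite short with Python's standard `email` package. The sketch below pulls the HTML alternative out of a raw message and hands it to the system's default browser (which may or may not be Firefox); the function names are made up for the example.

```python
import tempfile
import webbrowser
from email import message_from_bytes, policy
from pathlib import Path


def extract_html(raw):
    """Return the HTML body of a raw message, or None if it has none."""
    msg = message_from_bytes(raw, policy=policy.default)
    part = msg.get_body(preferencelist=("html",))
    return part.get_content() if part is not None else None


def open_in_browser(html):
    """Save the HTML to a temporary file and open it in the default browser."""
    with tempfile.NamedTemporaryFile(
        "w", suffix=".html", delete=False, encoding="utf-8"
    ) as f:
        f.write(html)
    webbrowser.open(Path(f.name).as_uri())
    return f.name
```

Typical use would be `html = extract_html(Path("message.eml").read_bytes())` followed by `open_in_browser(html)` when `html` is not `None`. Note that a browser will load remote images and run scripts that a mail client would normally block.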

makeitdouble · 2024-05-24T01:48:47.000000Z

There seems to be nothing about performance and how to deal with file count within a directory.

Anyone who tried to naively store millions of files in a file system folder realizes at some point that listing files becomes horrible, there's no GUI tool that will handle that gracefully, and even on the CLI this is a very serious roadblock.

It's still fine for accessing files straight by name, and there must be ways to read each file sequentially, but the concept of folder merely becomes an arbitrary namespace and not something to handle a whole group.

The other obvious option is to shard the mail folders to ensure there's no more than X files in each folder, but that becomes pretty complex IMHO.

At the end of the day, a database is needed somewhere down the line.

tomatocracy · 2024-05-25T03:40:34.000000Z

I've been pretty happy with Dovecot's mdbox format. This stores multiple emails per file and multiple files per mailbox which alleviates the problem of ending up with millions of files in one directory on filesystems which aren't tuned for this. Metadata is stored (only) in a separate index file which at first feels a bit "brittle", but this also alleviates a lot of the downsides of "indexed Maildir" type solutions (no more need to check for consistency between index and filesystem etc). Breaking with the idea that the entire email is stored in a single file (whether with other emails as with mbox or on its own as with Maildir etc) then also lets you do things like deduplicating attachments between emails (this was my primary motivation for exploring it).

The main downside for me is that you can only really access the mailbox using dovecot/IMAP as tools like mutt don't support it.

aidenn0 · 2024-05-24T17:03:33.000000Z

> The other obvious option is to shard the mail folders to ensure there's no more than X files in each folder, but that becomes pretty complex IMHO.

I mean it depends on the volume of e-mail you get, but:

  INBOX
  Archive/2004
  Archive/2005
  ...
  Archive/2023

Isn't something I would call complex, and works out to about 10k files per directory for me.

foresto · 2024-05-24T05:23:15.000000Z

What always nagged at me about the one-file-per-message approach is what happens when you accumulate many messages, perhaps by being on high-volume mailing lists, or never throwing anything away, or both. In particular:

How much space is wasted due to partially filled filesystem blocks? This is less important with today's workstation drives than it was 30 years ago, but perhaps still relevant on a single-board computer with limited flash storage, for example.

How does performance suffer from scanning a directory with millions of files, or if they're spread across multiple directories, from traversing the directories? Even if the delivery and user agents handle it well, what about the command line tools that would make one-file-per-message appealing? What if it's a network filesystem?

Filesystems can be chosen and tuned for their expected contents, of course, as usenet admins once did for news spools. But most users won't maintain a special filesystem just for email; they will expect it to work well on the same fs that they use for everything else.

With those considerations in mind, I can understand the appeal of multiple messages per file, whether it's a database or just plain old mbox format with a nearby index.

Neither approach seems strictly better than the other.

zimpenfish · 2024-05-24T06:41:53.000000Z

> as usenet admins once did for news spools

Message-per-file was abandoned pretty quickly once the volume started to go up in favour of things like Diablo's "huge file that's a circular buffer" approach. Then the tuning was more about, IIRC, how big you could make inodes to efficiently handle huge (100s of GB) files (not really a problem for mail messages!)

(although I have to say I am 20 years out of usenet admin and maybe things have swung back towards the INN style - it does make long retention easier and modern filesystems are probably much better. Back then we were experimenting with everything from JFS to XFS to FreeBSD to ...)

aidenn0 · 2024-05-24T15:58:53.000000Z

> How much space is wasted due to partially filled filesystem blocks? This is less important with today's workstation drives than it was 30 years ago, but perhaps still relevant on a single-board computer with limited flash storage, for example.

I store my e-mail in a maildir on ZFS with compression enabled. I have not tuned it in any way. My archive directory is 30% smaller than the total number of bytes according to "du" (i.e. compression outweighs the space overhead)

For a more "normal person" comparison, I made a fresh ext4 file-system and copied it over and used "df" to get the exact number of blocks in use; overhead was about 2%. Seems fine to me.

[edit] Median file size is 5744 bytes, 1/10/90/99 percentile sizes in bytes are 1777/3050/41059/169654
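For anyone who wants a back-of-the-envelope estimate before copying anything, block slack can be computed from file sizes alone. This sketch assumes a fixed block size (4096 bytes is a common default, but check your filesystem) and ignores inodes, directory entries, compression and any inline-data or tail-packing features, so it is not a substitute for measuring with "df".

```python
def allocated_bytes(size, block_size=4096):
    """Bytes consumed on disk if files occupy whole blocks only."""
    blocks = -(-size // block_size)  # ceiling division
    return blocks * block_size


def slack_ratio(sizes, block_size=4096):
    """Wasted space as a fraction of the real data size."""
    total = sum(sizes)
    allocated = sum(allocated_bytes(s, block_size) for s in sizes)
    return (allocated - total) / total
```

Because the ratio is weighted by bytes rather than by file count, a handful of large messages with attachments pulls the overall overhead down even when most files are only a block or two long.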

aidenn0 · 2024-05-24T17:00:55.000000Z

> How does performance suffer from scanning a directory with millions of files, or if they're spread across multiple directories, from traversing the directories? Even if the delivery and user agents handle it well, what about the command line tools that would make one-file-per-message appealing? What if it's a network filesystem?

General purpose file systems can manage several thousand per directory; splitting up into directories is probably a good thing (I archive mine per year). Walking the extra directories adds negligible overhead, since all but the leaf directories will have a very small number of entries. To go from one million to ten thousand files per directory you only have to add a single level of 100 directories.
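When there is no natural key like the year, one common way to get that single extra level is to hash the filename into a fixed number of buckets. A sketch, with the bucket count of 100 taken from the arithmetic above and everything else (function name, hash choice) assumed for illustration:

```python
import hashlib
from pathlib import Path


def shard_path(root, filename, buckets=100):
    """Map a filename to root/<bucket>/<filename>, spreading files evenly."""
    digest = hashlib.sha256(filename.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % buckets
    return Path(root) / f"{bucket:02d}" / filename
```

The mapping depends only on the filename, so a reader can recompute the path without any index; the trade-off compared with per-year directories is that the bucket names carry no meaning for a human browsing the tree.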

Gys · 2024-05-23T15:46:48.000000Z

The mail protocol is plain text so it’s not difficult to save emails as individual files. I had such a setup some years ago for a company. Emails were stored in one folder per week, each email in its own subfolder with attachments extracted and a meta text file. References were in a database.

I also remember working with a windows email server that saved all emails only as files, no db, although the directory structure was more complicated. But that was maybe 20 years ago…

AdieuToLogic · 2024-05-23T21:40:24.000000Z

Whenever I see efforts to treat email as files, I fondly think of my time using nmh[0]. Until the pervasive use of multimedia email, nmh was a really nice way to communicate with email IMHO.

0 - https://www.nongnu.org/nmh/

follower · 2024-05-24T05:11:12.000000Z

Note that it seems some details of the spec have changed since the blog post was written, so the linked blog post is slightly outdated/lagging behind the actual spec document here: https://man.sr.ht/~bitfehler/m2dir/

(Which I discovered when I ran into this project yesterday.)

Via conversation on the mailing list it seems there are currently two WIP Rust crates/libraries being developed to implement the spec: one by the primary spec writer & another by an "email-interested" :) 3rd party, developed independently (AIUI) in part as an exploration of library API design space.

jll29 · 2024-05-23T15:40:16.000000Z

One wonders why email hasn't been kept in a well-thought-out directory structure since the beginnings of UNIX, given that almost everything is a file in UNIX, and especially given the power of UNIX text processing tools.

technofiend · 2024-05-23T15:44:52.000000Z

If you'd like to try it, MH uses directories and files for managing your email: https://www.gnu.org/software/emacs/manual/html_node/mh-e/ind... As you mentioned this is the unix way and MH (according to Wikipedia) dates back to 1979: https://en.m.wikipedia.org/wiki/MH_Message_Handling_System

MassPikeMike · 2024-05-23T20:23:49.000000Z

The parent's pointer is to "MH-E", the emacs package, which is a great interface to MH for folks who use emacs to read their email.

For folks who don't, I wanted to clarify that MH also works great outside of emacs. Its command-line tools are composable, so you can do things like reply to the first message about chess sent this week:

repl `pick -subject chess -after "19 May 24 0000 PST"`

Using them in scripts is especially powerful.

The modern implementation is "nmh", "New Message Handler", https://www.nongnu.org/nmh/. MH was the mail system within MIT's Athena computing environment back in the day, so many MIT folks developed a fondness for it and it retains a following. There's even a very comprehensive O'Reilly book, free online: https://rand-mh.sourceforge.io/book/

PurpleRamen · 2024-05-23T16:10:32.000000Z

There have been multiple formats for storing mail over the years, and many of them use folders. But each format has its own problems and was optimized for certain benefits. And on unix you have to make this workable with multiple programs accessing the mail in parallel, because in the early days there were no servers that had tight control over everything. So, formats were often designed around using or preventing file locks, making efficient use of storage or allowing fast handling and management of mail-flags.

mbreese · 2024-05-23T21:31:51.000000Z

Not even just common formats, but way back, mail was delivered by copying files from one server to another. I (barely) remember using UUCP before SMTP/NNTP to sync mail and news. So, the format that you stored messages in was very important. It’s easy to copy a single message when it is a complete file.

p_l · 2024-05-24T09:44:26.000000Z

SMTP followed on from MTP, which in turn attempted to replace previous schemes involving FTP or similar "manipulate mailbox remotely" approaches.

Along the lines, bridging UUCP and other text-only email systems was also taken into account.

ck45 · 2024-05-23T20:23:29.000000Z

For a quite long time, a very popular format was mbox (the most popular?), which is a single file. With the arrival of qmail, it was slowly replaced by Maildir.
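For anyone who wants to see the two formats side by side, here is a small sketch using Python's standard `mailbox` module to copy a single-file mbox into a Maildir. The function name and the paths passed to it are assumptions for illustration; flags and locking are not handled.

```python
import mailbox


def mbox_to_maildir(mbox_path: str, maildir_path: str) -> int:
    """Copy every message from an mbox file into a Maildir; return the count."""
    source = mailbox.mbox(mbox_path, create=False)
    target = mailbox.Maildir(maildir_path, create=True)
    copied = 0
    try:
        for message in source:
            # Each message becomes its own file under <maildir>/new.
            target.add(message)
            copied += 1
    finally:
        source.close()
        target.close()
    return copied
```

After the copy, every message is a separate file, which is what makes per-message tools and plain file copying practical.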

childintime · 2024-05-24T07:47:11.000000Z

Wow, a spec with a filename with colons in it. Going crazy.

How about something like this instead:

20230904T1347-sender@example.com-GTfrlwJfN5vyR28R

Fewer underscores also mean better readability to me.

stephen_cagle · 2024-05-24T08:04:09.000000Z

Are there other good alternatives for treating email as simple files that I can contrast this with? I'm quite surprised that this does not already exist? Are there hybrid approaches like FUSE filesystem to your email?

WillAdams · 2024-05-24T14:49:10.000000Z

I've been somewhat surprised that there hasn't been an effort to re-work e-mail as a content management system --- incoming e-mails have all attachments stripped off and stored in a hierarchy based on sender/subject/recipient/date (and dupes discarded and replaced w/ a pointer) and replaced w/ links w/ the matching e-mail text stored as an editable wiki or similar marked up text, outgoing e-mails are synched up w/ the appropriate attachment and the wiki/marked up text updated based on the content.

TeMPOraL · 2024-05-24T15:20:46.000000Z

This sounds a lot like how messages are handled in companies that run well-integrated Microsoft systems. I've only seen this from a user perspective, but attachments I send or receive via e-mail (Outlook) or Teams (IM) can be browsed per-conversation, and/or end up on SharePoint, can be updated or accessed from other tools, etc. Best I can tell, this is a kind of content management system, centered around SharePoint, but I don't really understand how it works.

inopinatus · 2024-05-24T23:26:12.000000Z

I've seen something like this in CRM and in closed-circuit corporate/institutional groupware. But when discarding RFC-standard text as the canonical form then we also discard (or must re-implement) every functional, security, and deliverability measure that assumes as much in the design when addressing global email.

The simplest illustration I'd give of this is through a question: "how you gonna preserve my PGP signatures?" and if the answer is "ah, you are not our target segment" then you get how it's not a global solution.

KRAKRISMOTT · 2024-05-24T14:52:58.000000Z

Maybe you should apply for YC, Front app is a multi billion dollar company, they could use some disruption.

WillAdams · 2024-05-24T15:29:44.000000Z

Sounds like I'm just behind the times and Microsoft has some sort of solution using Sharepoint.

8organicbits · 2024-05-24T17:59:44.000000Z

SharePoint is not a reason to stop pursuing a startup idea.

vidarh · 2024-05-24T10:22:42.000000Z

We did exactly this for Nameplanet in '99. Started with plain Maildir, then added more info in the filename as new mails were found, or status changed.

We finally added a cache of some data in a dot-file (that'd just get blown away and recalculated if it failed a format check).

It made a very slightly enhanced POP3 server sufficient for a web frontend with good performance.

But all the changes to the Maildir were optional - any software that didn't support them could still operate on the Maildir and the missing bits would just get recreated.
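A rough sketch of the "dot-file cache that gets blown away and recalculated if it fails a format check" idea. Everything concrete here is an assumption for illustration: the `.index-cache` name, the JSON format, the version field, and caching only file sizes. It does not detect a cache that is valid but stale.

```python
import json
from pathlib import Path

CACHE_VERSION = 1  # hypothetical format version


def load_index(maildir: Path) -> dict:
    """Return {filename: size} for maildir/cur, using a disposable cache."""
    cache_file = maildir / ".index-cache"
    try:
        cached = json.loads(cache_file.read_text())
        # Format check: anything unexpected falls through to a rebuild.
        if cached["version"] == CACHE_VERSION and isinstance(cached["sizes"], dict):
            return cached["sizes"]
    except (OSError, ValueError, KeyError, TypeError):
        pass
    # Cache missing or malformed: recompute from the files themselves.
    sizes = {p.name: p.stat().st_size
             for p in sorted((maildir / "cur").iterdir()) if p.is_file()}
    cache_file.write_text(json.dumps({"version": CACHE_VERSION, "sizes": sizes}))
    return sizes
```

Because the cache is derived entirely from the mail files, software that ignores or deletes it loses nothing.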

QasimK · 2024-05-23T19:57:36.000000Z

I've been thinking about doing this myself, so it's fantastic to see a project.

I find a files-centric (and more broadly filesystem-centric) approach easier to grapple with than one that focuses on apps (and hiding away the data). It makes it much easier to access my own data for other purposes outside of what the app provides. In particular when the files are in plain-text or otherwise human-editable. I can reuse all of the existing tools that I'm familiar with to search, modify or re-purpose the data.

skydhash · 2024-05-23T20:49:16.000000Z

I can do away with files if the app provides scripting capabilities (IPC, plugins,...). I know the average users won't use it, but if you've nailed down your workflow, it's liberating to be able to speed up parts of it.

zokier · 2024-05-23T16:04:24.000000Z

Is there a reason why metadata and the message are stored so separately? I.e. why

   INBOX/2023-09-04_13:47_builds@sr.ht,GTfrlwJfN5vyR28R
   INBOX/.meta/GTfrlwJfN5vyR28R.flags

instead of

   INBOX/2023-09-04_13:47_builds@sr.ht,GTfrlwJfN5vyR28R/message
   INBOX/2023-09-04_13:47_builds@sr.ht,GTfrlwJfN5vyR28R/.flags

The latter structure would allow creating/deleting the message and flags atomically.
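A sketch of what "atomically" could look like with the per-message directory layout: build the directory under a temporary name, then rename it into place. This is not how the m2dir spec works; the function names and the `.tmp-`/`.trash-` prefixes are hypothetical, and it assumes a POSIX filesystem where renaming within one filesystem is atomic. Durability (fsync) is not addressed.

```python
import os
import shutil
import tempfile
from pathlib import Path


def deliver(folder: Path, name: str, raw: bytes, flags: str = "") -> Path:
    """Build the message directory off to the side, then rename it into place."""
    folder.mkdir(parents=True, exist_ok=True)
    staging = Path(tempfile.mkdtemp(prefix=".tmp-", dir=folder))
    (staging / "message").write_bytes(raw)
    (staging / ".flags").write_text(flags)
    final = folder / name
    # Readers see either nothing or the complete message plus flags.
    os.rename(staging, final)
    return final


def remove(message_dir: Path) -> None:
    """Make message and flags vanish together, then clean up."""
    trash = message_dir.with_name(".trash-" + message_dir.name)
    os.rename(message_dir, trash)
    shutil.rmtree(trash)
```

With the flat layout plus a separate `.meta` directory, the message file and its flags file are two independent filesystem operations, so a reader can observe one without the other.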

mathfailure · 2024-05-23T16:24:02.000000Z

That'd require jumping between dirs when traversing multiple messages.

adius · 2024-05-24T06:50:09.000000Z

Can somebody please define a specification on how to store emails in SQLite? Seems to be the only sensible approach if you ask me.

_xnmw · 2024-05-24T09:41:15.000000Z

For searching I just use DevonThink. Works with either mbox or a directory/file structure. Instant full-text search, date-based filtering, and continuous re-indexing as I archive my email there monthly with command line tools for GMail and ProtonMail's import/export tool.

robertlagrant · 2024-05-23T20:54:14.000000Z

I was hoping the mailing list link would be to an FTP site I'd upload my email to.

colinsane · 2024-05-23T21:14:56.000000Z

my chief concern with the spec was actually "do FTP clients generally support `:` in a filename?"

but then i realized i'm not likely to mount a remote M2dir so i'm far less concerned with the answer.

follower · 2024-05-24T05:03:38.000000Z

> so i'm far less concerned with the answer.

In that case, let me definitely take the time to provide you an answer then... :D

The linked blog post outlining the spec is slightly outdated/lagging behind the actual spec document here: https://man.sr.ht/~bitfehler/m2dir/

The updated detail relevant to your query being: the `:` separator was replaced by use of `,` (comma) as the separator instead: https://man.sr.ht/~bitfehler/m2dir/#parsing-filenames

More detail on the reasons for the change is given in the associated commit message [informative commit messages? what a novel concept! :D ]: https://git.sr.ht/~bitfehler/m2dir/commit/100b7683f53899c836... [TL;DR: bash & NTFS/ADS]

amelius · 2024-05-24T15:47:51.000000Z

The HTML in email is already incompatible with standard tools.

For example, finding all emails where I discussed the "DIV" tag with somebody:

grep --ignore-case --word-regexp DIV *.eml

Unfortunately, since most email is in HTML, this would match every email.
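One way around this, sketched with Python's standard `html.parser`: drop the markup first, then search only the text a reader would see. It assumes the HTML body has already been pulled out of the message and decoded (real .eml files need MIME and quoted-printable/base64 handling first), and it does not skip script or style contents. `TextOnly` and `mentions_word` are made-up names.

```python
import re
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect only the text between tags, ignoring the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def mentions_word(html: str, word: str) -> bool:
    """True if `word` appears as a whole word in the visible text."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    pattern = rf"\b{re.escape(word)}\b"
    return re.search(pattern, text, re.IGNORECASE) is not None
```

A message that merely contains `<div>` elements no longer matches; one whose text talks about the DIV tag still does.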

lofenfew · 2024-05-24T16:01:42.000000Z

  [^ <] *div

turns out you can parse html with regex

amelius · 2024-05-24T17:27:02.000000Z

You are making assumptions here.

MisterTea · 2024-05-24T14:54:48.000000Z

Plan 9 has a similar concept called upasfs http://man.postnix.pw/9front/4/upasfs

clircle · 2024-05-24T17:05:49.000000Z

What's the problem they solve? I have never put a modicum of thought into how my messages are organized on my hard disk.

chriscappuccio · 2024-05-23T14:38:43.000000Z

As a CLI fan, I'm interested in where this could go

mathfailure · 2024-05-23T16:26:28.000000Z

It would go only where you'd bring it to.