For anti-entropy in a distributed system, one could use a combination of the following:

1. Get everything created since the last sync. A created index (as opposed to an updated index) would not need to be re-sorted for subsequent updates. This would fetch only new mail.

2. A fixed-size Merkle tree, 8 blocks per level, 7 levels deep, with around 262144 blocks at the base. This would be the best tree configuration for syncing anywhere from a few thousand up to 40 million emails. It would identify all mail updated since the last connection within 4 roundtrips and with a minimum of bandwidth. The tree should be incrementally updated using XOR at both client and server whenever an update is made (see the sketch after this list). A shallower tree would need fewer roundtrips, but would either support fewer emails or require more bandwidth.

3. Push sync. Server would push all creates/updates to client while client is connected.
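
A minimal sketch of the XOR-updated tree from point 2, assuming hypothetical names (XorMerkle, toggle, SIG_BYTES are mine, not from the comment):

    import hashlib

    # 8-way tree, 7 levels deep, 8**6 = 262144 leaf blocks; every node's
    # signature is the XOR of the signatures beneath it, so adding, removing
    # or updating one message only touches 7 nodes.
    FANOUT = 8
    LEVELS = 7                        # level 0 is the root, level 6 holds the leaves
    NUM_LEAVES = FANOUT ** (LEVELS - 1)
    SIG_BYTES = 16

    class XorMerkle:
        def __init__(self):
            # one flat signature array per level, all zeros to start with
            self.levels = [bytearray(SIG_BYTES * FANOUT ** d) for d in range(LEVELS)]

        def _leaf(self, msg_id: str) -> int:
            # leaves chosen by hashing the id here; the replies below argue
            # for bucketing by arrival time instead
            h = hashlib.blake2b(msg_id.encode(), digest_size=8).digest()
            return int.from_bytes(h, "big") % NUM_LEAVES

        def toggle(self, msg_id: str, modseq: int):
            # XOR the (id, modseq) signature into the leaf and every ancestor.
            # XOR is its own inverse, so calling this again with the same
            # arguments removes the entry; an update is toggle(old) + toggle(new).
            sig = hashlib.blake2b(f"{msg_id}:{modseq}".encode(),
                                  digest_size=SIG_BYTES).digest()
            idx = self._leaf(msg_id)
            for level in range(LEVELS - 1, -1, -1):
                node = self.levels[level]
                off = idx * SIG_BYTES
                for i in range(SIG_BYTES):
                    node[off + i] ^= sig[i]
                idx //= FANOUT                   # step up to the parent

        def node(self, level: int, index: int) -> bytes:
            # signature of one node, as exchanged during the roundtrip descent
            return bytes(self.levels[level][index * SIG_BYTES:(index + 1) * SIG_BYTES])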




Specifically, you want a Merkle tree where the final bucket is based on the lowest N bits of the message arrival time, probably at a granularity of several seconds. In the typical sync case all of the unseen messages will then be clustered in a few buckets and the Merkle tree will be very efficient.

For instance, if each bucket covers 8 seconds' worth of arrival time, then 2^17 buckets means they wrap around only every ~12 days. So if you are syncing after a day away, only about 8% of the buckets can be dirty (even if you get an incredible flood of mail!).
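
A quick back-of-the-envelope check of those numbers (illustrative only):

    BUCKET_SECONDS = 8
    NUM_BUCKETS = 2 ** 17
    SECONDS_PER_DAY = 86400

    # buckets wrap around every ~12.1 days
    print(BUCKET_SECONDS * NUM_BUCKETS / SECONDS_PER_DAY)

    # a day away can dirty at most ~8.2% of the buckets
    print(SECONDS_PER_DAY / (BUCKET_SECONDS * NUM_BUCKETS))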

Basing the buckets on time also allows you to avoid some of the roundtrip cost when syncing frequently. In a single request, the client can speculatively send its hashes of the buckets that are likely to have changed since its last sync time, along with its root hash. The server finds which buckets changed and sends those new bucket contents -- it can then see that the top-level hash must now agree on both sides and you're done in one RT.

Of course it's possible that other buckets have changed in rare edge cases; in this case the top hashes will not match and you have to do the full Merkle descent to resync.
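
A self-contained toy of that speculative exchange, with the fallback to a full descent (Store, speculative_sync and the flat bucket map are illustrative stand-ins, not the real protocol):

    import hashlib

    BUCKET_SECONDS = 8
    NUM_BUCKETS = 2 ** 17

    def bucket_of(arrival_time):
        return int(arrival_time) // BUCKET_SECONDS % NUM_BUCKETS

    class Store:
        def __init__(self):
            self.buckets = {}                       # bucket index -> set of message ids

        def add(self, msg_id, arrival_time):
            self.buckets.setdefault(bucket_of(arrival_time), set()).add(msg_id)

        def bucket_hash(self, b):
            ids = ",".join(sorted(self.buckets.get(b, ())))
            return hashlib.sha1(ids.encode()).digest()

        def root_hash(self):
            h = hashlib.sha1()
            for b in sorted(self.buckets):
                h.update(b.to_bytes(4, "big") + self.bucket_hash(b))
            return h.digest()

    def speculative_sync(client, server, last_sync, now):
        # One request: the client guesses that only buckets covering
        # [last_sync, now] changed, and sends their hashes plus its root hash.
        suspects = {bucket_of(t) for t in range(int(last_sync), int(now) + 1, BUCKET_SECONDS)}
        # One response: new contents of the suspect buckets that actually
        # differ, plus the server's root hash.
        changed = {b: set(server.buckets.get(b, set()))
                   for b in suspects if server.bucket_hash(b) != client.bucket_hash(b)}
        for b, ids in changed.items():
            client.buckets[b] = ids
        if client.root_hash() != server.root_hash():
            return "root mismatch: fall back to the full Merkle descent"
        return "in sync after one roundtrip"

    server, client = Store(), Store()
    server.add("old", 1_000_000_000); client.add("old", 1_000_000_000)
    server.add("new-while-away", 1_000_050_000)     # delivered while the client was offline
    print(speculative_sync(client, server, last_sync=1_000_000_100, now=1_000_050_100))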


That's an interesting idea, hashing emails to Merkle buckets by arrival time. It may lead to an uneven distribution of emails across buckets, though, increasing the amount of bandwidth required to sync a bucket in the worst case.

There are other optimizations you can bring in when you sync the tree:

1. If you are working your way down the remote's tree and you notice that your local signature for the node matching the remote's is zero, then you know that your entire local subtree is empty, and you can short-circuit and start downloading entire buckets (see the sketch after this list).

2. Conversely, if you are working your way down the remote's tree and you notice that your local signature is present but the matching remote node's signature is zero, then you know that your entire local subtree has data while the remote's is empty, and you can short-circuit and start uploading entire buckets.

3. As you work your way down sections of the tree, you can start to build an idea of the average email-to-bucket ratio. If several buckets are each likely to contain at most one email, then you can short-circuit again.
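
A rough sketch of the first two short-circuits, assuming a toy representation where each tree is a dict from path to node signature and an all-zero signature marks an empty subtree (in a real sync, the remote's nodes would arrive level by level over the wire rather than being read from a local dict):

    ZERO = b"\x00" * 16
    FANOUT, DEPTH = 8, 6                  # buckets live at depth 6

    def diff(local, remote, path=(), depth=0):
        # Returns (subtrees to download whole, subtrees to upload whole,
        # buckets both sides have data in, to be reconciled item by item).
        down, up, reconcile = [], [], []
        for child in range(FANOUT):
            p = path + (child,)
            l, r = local.get(p, ZERO), remote.get(p, ZERO)
            if l == r:
                continue                  # identical subtrees: skip entirely
            if l == ZERO:
                down.append(p)            # optimization 1: nothing local, pull the whole subtree
            elif r == ZERO:
                up.append(p)              # optimization 2: nothing remote, push the whole subtree
            elif depth + 1 == DEPTH:
                reconcile.append(p)       # both sides populated at bucket level
            else:
                d, u, c = diff(local, remote, p, depth + 1)
                down += d; up += u; reconcile += c
        return down, up, reconcile

    local  = {(0,): b"\x01" + b"\x00" * 15}    # we only have data under child 0
    remote = {(1,): b"\x02" + b"\x00" * 15}    # the remote only has data under child 1
    print(diff(local, remote))                 # -> download (1,), upload (0,)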


I think the low bits of arrival time actually give a really good distribution long-term (remember: we want the distribution to be uneven in the short term, but even in the long term). Maybe you could improve it a little bit by making each bin a prime number of seconds, so automated messages sent at the same time every day can hit each bucket eventually. I.e. instead of 8-second granularity, use 7 or 11.


(Author of the spec here)

A fixed-size Merkle tree is a great idea if we wanted to sync the entire set of messages between distributed MX destinations. In fact, we should probably consider that approach for our server <-> server replication. However, JMAP is aimed at client <-> server sync, and there are a number of different issues to consider there.

The client may have limited or no cache (a webmail app, for instance, normally has to start anew each time, and needs to efficiently fetch just what it needs; if we had to wait for a 60GB mailbox to sync every time we loaded our webmail, well, obviously that's not going to work). A mobile mail client often only wants to download the first few items in the mailbox, given the current sort order, so it can display them to the user.

The JMAP spec allows a client to download, say, just the list of the first 50 messages, newest first, in the Inbox (single round trip). Then, at a later point, it can ask for and receive a space-efficient delta update of what has changed within those first 50 messages in the Inbox, regardless of how big the Inbox really is, or what other messages may have been delivered elsewhere (including lower down the Inbox, but outside the top 50). Again in a single round trip. This is really powerful: you can see that by comparing the refresh time of a folder in the FastMail mobile web app with the iOS or Android native mail app (over IMAP); the mobile web app is much quicker.

The other thing to remember is that the metadata about the message is mutable, and needs to be kept in sync as well. Again, the modseq system described in JMAP allows this to happen efficiently. If you have a distributed server system with multiple masters, you will again use a different system to keep those in sync (as they each assign new modseqs, whereas the client does not). However, this too can be dealt with relatively easily: if the modseq for a message differs between two masters, it must be set on both to max(the higher of the two modseqs, the next global modseq on the server that currently holds the lower modseq). This ensures that both will end up in the same state, and that the client will get to refresh its state for that message if it has to, no matter which master it last synced with.
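
A minimal sketch of that reconciliation rule (Master, reconcile and the field names are mine for illustration, not from the spec):

    from dataclasses import dataclass

    @dataclass
    class Master:
        highest_modseq: int                # highest modseq this master has ever assigned
        message_modseq: dict               # message id -> modseq

        def next_modseq(self):
            self.highest_modseq += 1
            return self.highest_modseq

    def reconcile(a, b, msg_id):
        # If the two masters disagree, bump both to
        # max(higher of the two, next global modseq on the lower side),
        # so a client of either master sees the message as changed if it has to.
        ma, mb = a.message_modseq[msg_id], b.message_modseq[msg_id]
        if ma == mb:
            return ma
        lower_side = a if ma < mb else b
        new = max(max(ma, mb), lower_side.next_modseq())
        a.message_modseq[msg_id] = b.message_modseq[msg_id] = new
        a.highest_modseq = max(a.highest_modseq, new)
        b.highest_modseq = max(b.highest_modseq, new)
        return new

    a = Master(highest_modseq=105, message_modseq={"m1": 100})
    b = Master(highest_modseq=200, message_modseq={"m1": 180})
    print(reconcile(a, b, "m1"))           # 180: already above a's next global modseq (106)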

So essentially, there's a difference in what you want in a protocol for server <-> server synchronisation for distributed mail backends compared to client <-> server sync for mail apps. JMAP is focussed on the latter, but tries to make sure it doesn't include assumptions that severely limit the former. The spec does not require a particular format of message ids: they do not have to be based around IMAP's ascending UIDs (although the sync algorithm for partial mailbox listings may be slightly less efficient if not).


> if we had to wait for a 60GB mailbox to sync every time we loaded our webmail, well, obviously that's not going to work

First, the Merkle tree would just be of the IDs, not the message contents. I.e. the problem it is trying to solve is for the heavyweight client that wants to resync its idea of which message UIDs are contained in a mailbox. Which ones it then requests metadata (or full contents) for is up to it.

It's true that for an ephemeral client (like a webapp) you don't even want to get the whole list of 100K messages in your inbox (although you probably will ask the server for a count of them). However, here having a sequential message number hardly helps either. For example, when I click on "sort by date" in my mail client it shows me the messages by sending date, not by what integer IMAP UID they were given.

The problem of "give me the metadata for the top 100 messages sorted by criteria X" is just something that has to be offered by the server. This is no different than features like searching -- if bandwidth were infinite the client could do it, but it's not so you need to do the work near where the data is stored.

The server will have to maintain data structures to make these kinds of queries efficient. The important point, however, is that these are conceptually just caches of information that is in the mailstore. Therefore there is no need to keep them synced between far-flung servers.

> The other thing to remember is that the metadata about the message is mutable, and needs to be kept in sync as well

Yes, message data and metadata are separate problems. If two servers both try to update the metadata for message X at the same instant, one needs to win (which one is probably not that important). If two servers both try to add new messages to a folder at the same instant, they both need to win, ideally without having to coordinate at all in advance.

For the metadata you can use a modseq, a Lamport timestamp, or probably even just a {wallclock time, server ID} tuple. Assuming good time synchronization is reasonable for a mail cluster. If two clients both try to change message flags within 10ms of each other, it probably isn't important who wins as long as somebody definitively does.
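
A minimal last-writer-wins sketch built on that {wallclock time, server ID} tuple (Version, FlagState and merge are illustrative names, and the sketch assumes reasonably synchronized clocks):

    import time
    from dataclasses import dataclass

    @dataclass(frozen=True, order=True)
    class Version:
        wallclock: float          # seconds; assumes servers keep clocks roughly in sync
        server_id: str            # tie-breaker so simultaneous writes still resolve deterministically

    @dataclass
    class FlagState:
        flags: frozenset
        version: Version

    def merge(local, remote):
        # keep whichever write carries the higher (time, server) version
        return remote if local.version < remote.version else local

    # Two servers change the same message's flags ~10ms apart; both sides
    # converge on the later write no matter which direction they replicate.
    now = time.time()
    a = FlagState(frozenset({"\\Seen"}), Version(now, "server-a"))
    b = FlagState(frozenset({"\\Seen", "\\Flagged"}), Version(now + 0.01, "server-b"))
    assert merge(a, b) == merge(b, a) == b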


I've been waiting a bit to reply to this to see if there's a polite way to answer, but there isn't.

You suffer from a disease, and it's far too bloody common amongst people who design IETF standards.

You're optimising a rare case, where in the worst case you would be re-IDing the bulk of the messages delivered in a time period of a few hours, leading clients to have to re-fetch the metadata again (note: it's only the modseq that would change in our design - the UUID is still globally unique and unchanged).

That's no worse than "FETCH 1:* FLAGS" now.

And in exchange for this slight benefit, you would cause EVERY SINGLE CLIENT EVERYWHERE to implement a significantly more complex protocol which requires 4 roundtrips to resync.

Talk about a dog with a very big, very waggy tail.

I'm sorry, but I can't take this suggestion seriously. I've seen too many protocols which are designed with this sort of focus on the worst cases. You have to make sure they don't totally suck - but the important place to optimise is the common case.

...

It turns out there is a way you can gain almost all of the benefits you're looking for - which is delayed MODSEQ allocation. Newly delivered messages can be created with no MODSEQ. When you sync with either another client or server, you allocate a MODSEQ higher than any which the other end has seen.

Which means you only pay an extra cost if a JMAP peer or client connects at both ends during the link downtime.
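
A hedged sketch of that delayed allocation (Server, deliver and sync are my names for illustration):

    from dataclasses import dataclass, field

    @dataclass
    class Server:
        highest_modseq: int = 0
        pending: set = field(default_factory=set)      # delivered, no MODSEQ yet
        modseqs: dict = field(default_factory=dict)    # message id -> allocated MODSEQ

        def deliver(self, msg_id):
            self.pending.add(msg_id)                   # no MODSEQ assigned at delivery time

        def sync(self, peer_highest_seen):
            # Allocate MODSEQs lazily, strictly above whatever the syncing
            # peer (client or replica) has already seen.
            base = max(self.highest_modseq, peer_highest_seen)
            for msg_id in sorted(self.pending):
                base += 1
                self.modseqs[msg_id] = base
            self.pending.clear()
            self.highest_modseq = base
            return self.modseqs

    s = Server(highest_modseq=500)
    s.deliver("m-new")
    print(s.sync(peer_highest_seen=720))               # {'m-new': 721}: newer than anything the peer saw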



