Gmvault: Backup and restore your Gmail account (github.com/gaubert)
503 points by uyoakaoma on Nov 18, 2016 | hide | past | favorite | 136 comments

I have been using this for years and it's great software. One tip: store each email account in its own "database". My crontab looks like this:

    #Email backup
    0 0,6,12,18 * * * /usr/local/bin/gmvault sync -d /XXXXXXX/email-backup/joshstrange --resume josh@joshstrange.com >> ~/mail-backup-logs/joshstrange.com.log 2>&1
    0 1,7,13,19 * * * /usr/local/bin/gmvault sync -d /XXXXXXX/email-backup/otheremail --resume other@otheremail.com >> ~/mail-backup-logs/otheremail.com.log 2>&1
I sync it all locally to my house then back it up to Dropbox as well. The reason to store them in different databases is that you cannot "filter" them out when restoring, so if they all go to the same DB, a restore pushes ALL your email across all accounts into one new account.

Chiming in to second this comment for anyone who is skeptical of using gmvault. I too have used it for years with great success. Thanks to the author for creating it!

Thank you for your advice, that's handy to know.

What's the database format it uses, sqlite3? (I tried looking on the repo but couldn't find any obvious reference)

I believe it is all flat-file. When I say "Database" I am just referencing what they call it:

    -d DB_DIR, --db-dir DB_DIR
But from what I see the structure looks like this:

           1234554543262346.eml.gz - I assume the meat and potatoes of the email along with attachments, not sure
           1234554543262346.meta - JSON file with msg_id, thread_id, flags, labels, subject, etc
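
For anyone who wants to poke at that layout from Python, here is a minimal sketch. It assumes, as described above, that each message is a gzip-compressed .eml plus a JSON .meta file sharing the same ID. The function name `load_backed_up_message` and the idea of passing the directory that holds the pair are illustrative only, not part of gmvault.

```python
import gzip
import json
from email import message_from_bytes
from pathlib import Path


def load_backed_up_message(msg_dir, msg_id):
    """Return (metadata dict, parsed email) for one backed-up message.

    Assumption: <msg_id>.meta (JSON) and <msg_id>.eml.gz sit side by side
    in msg_dir, as in the listing above.
    """
    base = Path(msg_dir)
    # The .meta file is plain JSON (msg_id, thread_id, flags, labels, ...).
    meta = json.loads((base / f"{msg_id}.meta").read_text(encoding="utf-8"))
    # The .eml.gz file is the raw message, gzip-compressed.
    with gzip.open(base / f"{msg_id}.eml.gz", "rb") as fh:
        message = message_from_bytes(fh.read())
    return meta, message
```

Something like `meta, msg = load_backed_up_message(some_dir, "1234554543262346")` would then let you compare `meta` against `msg["Subject"]`; the exact keys inside the .meta file should be checked against a real backup.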

if you want to try something that downloads Gmail via imap and indexes it into an sqlite3 db (with FTS5 fulltext of from/to/cc/bcc/subject/body fields) and extracts attachments to filesystem, take a look at a recent project of mine:


It also saves the raw .eml files to disk. No support for labels (yet), but it does properly link up threads in the db using `References` from the parsed headers (setting both MPTT and adjacency-list fields)

It's WIP, so contributions welcome :)

FWIW regarding speed, i was able to download, index & extract my entire INBOX + Sent Items (14k emails, 3.5GB total) in < 10min on a fast connection. the limiting factor by far was connection/imap speed.

I'm interested to know what the pros and cons are of this utility vs using the Google takeout functionality? I like the idea of this project but I don't know what it would gain me over Google's native export? Is it the restoration that's missing from Google's service?


My coworker and I both tried this yesterday, and today we both received an email with this error message: "Sorry, we encountered a problem when creating your Google data archive."

I've taken out my Google data a few times, now amounting to 1.4GB, and never had this issue. Let's hope it's only a temporary slip-up.

btw, the data is pretty exhaustive, even for non major Google applications or more recently added features (Maps Location History).

Me too. They must have had a surge of takeouts given the recent lock-outs.

Probably the most useful aspects are that you can run it in batch mode (without supervision), and that you can backup incrementally.

(Note that these aspects may also make the tool a little more dangerous to use.)

I'd like to know, too. On the surface, it would seem that scripting is the main benefit to it, but there might well be other pros/cons.

Very good point. This would make it easier for me to add to a service or cron job to pull periodically.

I just started a takeout session on all my google drive, emails data too.

I wonder how to use them once I have the backup data?

Exactly what you get from Takeout varies from service to service; for email you get an MBOX-format mailbox file (that you can then import into a desktop email client of your choice).

Really? mbox, not maildir? That must be insanely huge.

I pulled mine recently and it was about 3.5GB with ~115k items (chat messages are included). Also pulled my wife's - 217k items @6GB.

Right. So, maildir format then, not mbox.

It is actually mbox. I should have provided more detailed numbers - in my Takeout file, for example, there are 91,360 chat messages and only 23,407 email messages.

So there's 23407 email messages but only one file containing all of them?

Yep! All messages are in a single mbox file and it's 3.2GB.

  cdubz@professor-farnsworth ~/data $ du -h Mail-chris.mbox 
  3.2G	Mail-chris.mbox

Wow. I stand corrected and that's awful. Yet another reason to use gmvault!

Why is it awful? As an archive, it seems decent.

(1) inconsistent escaping rules (dealing with the literal string \nFrom)

(2) easy to corrupt

Worth noting that Google provides some Python sample code for parsing the file which works great.
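
This is not Google's sample code (no link was given here), but Python's standard-library `mailbox` module can read an mbox file directly. A minimal sketch, where `count_senders` is a hypothetical helper:

```python
import mailbox
from collections import Counter


def count_senders(mbox_path, top=5):
    """Return the most common From: headers in an mbox file."""
    # create=False: raise instead of silently creating an empty mailbox
    # if the path is wrong.
    box = mailbox.mbox(mbox_path, create=False)
    try:
        # str() because a header with raw non-ASCII bytes can come back
        # as an email.header.Header object, which is not hashable.
        counts = Counter(str(msg.get("From", "(unknown)")) for msg in box)
    finally:
        box.close()
    return counts.most_common(top)
```

For example, `count_senders("Mail-chris.mbox")` on a Takeout export. Expect the first pass over a multi-gigabyte file to take a while, since the module has to scan the whole file to find message boundaries.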

Interesting - could you point out where to find this? I poked around a bit but didn't come up with anything.

After yesterday's post[1], I've set up a cron job to regularly back up Gmail/Google Drive.

I'm using Gmvault for Gmail (emails and chats), pretty effortless and with enough features for me, especially the export to maildir.

And rclone[2] for Google Drive, in "create a local mirror" mode (sync).

Any other tools to backup your Google life?

[1] - https://news.ycombinator.com/item?id=12972554
[2] - http://rclone.org/

I am actively trying to get off of Dropbox. I really want a similar native application experience that syncs to and from S3. Not a cron job using the s3 cli, not something that only works on osx/windows/etc... So far owncloud enterprise is the only polished-looking solution I've found, but that's a bit overkill...

You are looking for Syncany: https://www.syncany.org/

But really, you may not need S3 at all and just sync between your devices with Syncthing: https://syncthing.net/

Shameless plug: Syncthing is in official Debian repositories! https://tracker.debian.org/pkg/syncthing

"The core team of Syncany is on hiatus for an indefinite amount of time. Feel free to do with the code what the license allows and encourages, but please don't expect any maintenance"

Not feasible to use this for critical infrastructure if there is no maintenance happening.

How have you found syncthing perf? When I tried it a while back the perf was horrible. It ate CPU and was pretty slow, even though I was just testing with 2 machines on my local network.

This was improved recently with 0.14.7:

- https://github.com/syncthing/syncthing/releases/tag/v0.14.7

Also, it will improve further with File System Notifications (from what I understand?):

- https://github.com/syncthing/syncthing/pull/2807

It's still slow and uses lots of CPU.

Thanks, Syncthing looks like exactly what I was looking for! I'm guessing it needs some external server for coordination? It seems a bit unnecessary to restart after each config change; is there some technical reason for it? Nice job otherwise, will try it out.

Yes, it needs an external server.

By default, Syncthing is pre-configured with community-hosted discovery servers:

- https://github.com/syncthing/syncthing/blob/master/lib/confi...

The community also hosts relay servers, so if your two devices can't communicate with each other directly, it will work anyway.

Relay servers take bandwidth. Anyone can run a relay server, and it will automatically join the relay pool and be available to Syncthing users. This is documented here:

- https://docs.syncthing.net/users/strelaysrv.html#strelaysrv

It would also be possible to host your own private relay and discovery pools, if you need that for some reason.

I use amazon cloud drive unlimited $60/yr and arq backup to have client side encrypted backups. I also arq backup to my local NAS.

If you want to use an open source tool you can use borg backup & rsync.net as your external backup site. Borg doesn't have good S3 integration, and using fuse & s3 doesn't work that well either. It works best when the borg daemon is on the receiver box too to help with indexing and such.


Interesting you mention this ...

There's a big item on one of my whiteboards: "put gmvault into the environment" ... the idea being that you could run 'gmvault', over SSH, on rsync.net:

    ssh user@rsync.net gmvault ... blah blah ...

I've been meaning to do this forever ... it would be great if rsync.net customers could not install anything, but just run gmvault as an ssh command.

The only reason it takes time is that we do not have a python interpreter in our environment - we try to keep things as simple and locked down as possible - which means we have to "freeze" gmvault as a binary executable in order to put it into place ...

So ... folks want this ?

Please don't use OwnCloud. It eats your files and cost us loads of time and effort, in addition to sowing FUD among my office coworkers, who thought someone was deleting files from the shared/sync drive. Plus it doesn't support delta sync [0], so if you're syncing large files like True/VeraCrypt volumes, you're going to be pushing a lot of data around. This is especially awful since you're not doing this on a LAN but to S3, which means your raw cost in dollars for operating this software will be much larger than with another solution like SyncThing or Seafile, which does support delta sync.

[0] https://owncloud.org/faq/#partialsyncing

ownCloud, or nowadays NextCloud [1], is still the best solution I have, hosted on a private virtual server. It's not a drop-in replacement for sure -- I'm basically prepared to do a full clean reinstall each time I want to upgrade -- but they have desktop and Android clients, and it's working just fine. Of course, I also have a separate backup of all the files.

[1] https://nextcloud.com

I did an extensive lookup of dropbox alternatives last year, and ended up self-hosting seafile. Multiple family members use it and it works great, plus encryption built-in from day one (me not being able to read their stuff is critical to me). There has been no outage, and upgrades are easy.

You may want to have a look at odrive (https://www.odrive.com). Use it to sync a bunch of different storage accounts via this single app. Works alright for what it does, and they have an S3 option.

How about Tarsnap?

I've been using syncthing and I'm reasonably pleased with it. I don't use it heavily and only over my local network, though, so I'm not sure how well it handles archiving changes and file conflicts.

Won't forklift serve your purpose?

WARNING/fun fact: it doesn't download all emails properly. Last time I tried it, it seemed that when the Gmail server randomly closed a connection (or maybe some other time, but I think it was in these instances), the program would just keep whatever partial results it had and then move on to the next email. Which meant I had a lot of partial emails on my drive (only a small fraction of all the emails, but still), and no way to detect them.

I have been using both GYB (https://github.com/jay0lee/got-your-back) and mbsync (http://isync.sourceforge.net/mbsync.html) for this. How is gmvault better/different?

Why do you need to use both? Or are you saying that they're both equivalent and you'd had success with each?

I've been using gmvault for only one feature: ability to fetch e-mails that match a certain filter.

Reading over the details, gmvault supports two things I don't see in GYB and one thing mbsync doesn't support:

- XOauth

- Sync to another gmail account (incrementally)

Edit: Also supports only backing up emails with a specific gmail tag.

Given how easily and carelessly Google can close your account and ruin your digital life, I guess periodic backups of your cloud accounts will soon be considered a good practice.

> I guess periodic backups of your cloud accounts will soon be considered a good practice.

It has been good practice for a couple of decades. Of course, getting people to follow good practices is another matter....

In this case, it's not as good as a normal backup. To my knowledge, Google Docs/Sheets/etc. is exported in other file formats (MS Word format, PDF format, etc), not their internal file format. So it's not a real backup, just an export.

Good point. And we have no way of knowing what Google's internal format looks like...

I'm afraid this tool might trigger my account to be disabled.

Then use Google Takeout [0], Google's official tool for backing up all your data.

[0] https://takeout.google.com/settings/takeout

It uses the same IMAP protocol that your mail clients do. I have been using it for years and not a single issue with google closing down any of the 5-6 accounts I have archived.

This - and having no unique data anywhere. Keeping your files distributed should be the general approach in this day and age.

Data is only half the problem. The other is that email address has become the "primary key" for everything. Banking websites, random forums, everything. It's sometimes impossible to change it because it's the "primary key" for identity on that website. And email addresses are not portable like phone numbers.

Email addresses are portable to any webmail, mail server or other email infrastructure provider if you have your own domain and then forward to the service of your choice. This way you can use Gmail if you like or maintain your own full email stack, or anything in between, while still addressing your mail to something you control.

but do you really control your domain? if somebody hacks into your registrar or forges your signature or something and transfers it to themselves, they would get your email, and it might be very difficult for you to get it back. Or maybe I am being too paranoid about this and it's less likely that your domain gets taken from you compared to your mail provider deciding to terminate you.

You can also do something like using fastmail, where you don't control your domain, but being a paying customer you do have somebody to call if there are issues

I find the attack itself technically plausible, but where's the motivation? Why would someone go to the effort just to get some random personal domain? Sounds a little far-fetched, and I've never heard it happening to a personal domain of a non-VIP. Meanwhile, we know that Gmail accounts get terminated regularly, and appealing is hard-to-impossible.

Cryptocurrency mostly (based on public disclosures), but sometimes other digital assets like twitter handles or game currency. Sometimes domain names themselves are the asset.

As is now well known, mobile numbers are often also hijacked to subvert weak 2fa.

This is something I have tried to weigh the risks of when it comes to using a Gmail account vs my own domain. What is more likely to happen: Google permanently locks me out of my Gmail account or I forget to renew my domain or have it somehow stolen from me?

> What is more likely to happen: Google permanently locks me out of my Gmail account or I forget to renew my domain or have it somehow stolen from me?

Then use TLDs where that can't happen.

If you forget to renew a .de domain, for example, you get a letter, and have 2 weeks to restore it, or end it - per default, it gets held in TRANSFER mode, for 50€/month.

Also, you can always get the domain back, if you can prove ownership.

There are many TLDs where such safety is standard.

This is true if you're starting from nothing. However, Gmail is over a decade old at this point and is the primary email for millions of users. And something like a custom domain is beyond the technical skills of average users.

> However, Gmail is over a decade old at this point and is the primary email for millions of users.

Switching is annoying, but not impossible. You just forward the old address to the new and progressively replace it on any accounts you might have. I still have access to my Gmail account, it just hasn't received a non-spam email in years.

Plus you may be able to get your new email address to import all your old Gmail. Outlook.com does this, if you want...

Oddly enough, it was really easy to set up a custom domain on Gmail: I did it. However, Google stopped offering that service for free, which will stop most ordinary people from using it...

Since Gmail supports imap and pop3, one could simply use any proven email client to back up the emails. I don't think a special tool that may or may not work is needed for this.

Gmail tags don't map cleanly on to IMAP folders. The abstraction is leaky especially for backup/restore.

Which "proven email client" do you recommend? There's an Import/Export extension for Thunderbird, one of the last standing desktop mail apps, but it's not good at handling huge, multi-thousand message exports.

gmvault is purpose-built and quite simple. I've used it before with success. It backs up your whole mailbox in one command. Why should we fiddle with a desktop mail client?

Thunderbird works fine for me to handle several GB of emails (the mailbox archives that it generates are plain text and standard so there's no need to export them). For the command line there are also tools like `getmail`[0], with the added benefit that they work with any email provider.

[0] http://pyropus.ca/software/getmail/

FWIW I imported a 6GB .mbox file with Thunderbird last week, worked fine -

I stopped using Thunderbird because it was choking under my 20GB of mailboxes.

> Thunderbird, one of the last standing desktop mail apps

How about clawsmail ? It's a pretty good client too, and still updated.

I love claws mail, but had to switch to thunderbird because it doesn't work well on a 4K monitor.

fetchmail, getmail and offlineimap are all "proven email clients" that work great for this.

+1 on offlineimap. Those of us that may be a little strange in the head and prefer using mail clients like mutt will often use it to handle background imap syncing.

I've successfully used getmail to back-up my gmail accounts for over ten years now. It's a classic case of a functional, if a bit crufty, solution blinding one to a potentially better solution, so i appreciate this (gmvault) heads-up. thank you.

To my knowledge you can't run Thunderbird as a background service so you'll have to routinely run it. This can be a job scheduled to run in the background.

The 'restore' feature looks nice. But would it be of any value if Google decides to close your account? That is, can I take my backup emails from Gmail and 'restore' it to myemail@self-hosted-domain.com in order to migrate all those emails over?

From reading the docs it seems possible

Go to the website, click on documentation, then indepth, then restore.

You can also export to other formats ex. mbox

Good question, I wonder if anyone’s tried this before. It would be invaluable if that were the case.

I have opted to have 3 gmail accounts with various names; 1 outlook, 1 yahoo account. Every email sent to my gmail is forwarded > outlook > yahoo. I use 2 other gmail accounts for newsletters and etc. I am sure at least 1 company will remain free to access data! BTW thanks for the thunderbird tip.

I've tried and yes, you can.

My backup of Gmail is to use Mail.app via IMAP + download all attachments.

I then have a backup of my computer -> 1. Time Machine + 2. Arq->S3/Glacier

Given that this keeps mail locally in a constantly readable format (offline, copied in mbox)... is there something missing in my basic solution that this cli utility adds?

Or any IMAP client that saves messages locally.

I guess this utility benefits people who only use gmail's web interface.

Here's one I wrote: https://github.com/abjennings/gmail-backup

Not as full featured (can't restore), but it's just a 77-line Python script. You could audit it yourself to make sure it doesn't upload your creds to another server.

Please don't forget to encrypt your backups in case your local drive gets stolen or your cloud backup service is hacked.

doesn't http://www.google.com/takeout include gmail?

Yep! And much more. But gmvault does also restore emails to an account, interestingly.

I am (slowly) working on a project to pull some statistics from Takeout's mbox file for Mail. Also want to play around with the Location History, Chrome data and Hangouts exports.

That would be very useful. I'd like to use takeout incrementally - and have some tools to use those backups at any time (like you describe).

Takeout results in an MBOX file. Seems like it should be possible to push that to a new account via IMAP.
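
A rough sketch of that idea using the standard library's `mailbox` and `imaplib`. The helper name `upload_mbox`, the target folder, and the host in the usage comment are all assumptions. Gmail labels and original received dates are not preserved here (passing `None` as the date lets the server stamp the upload time), so treat it as a starting point rather than a migration tool.

```python
import mailbox


def upload_mbox(conn, mbox_path, folder="INBOX"):
    """Append every message in an mbox file to an IMAP folder.

    conn is an already logged-in imaplib.IMAP4 / IMAP4_SSL connection.
    Returns the number of messages uploaded.
    """
    box = mailbox.mbox(mbox_path, create=False)
    uploaded = 0
    try:
        for msg in box:
            # flags=None, date_time=None: no flags, server picks the date.
            status, _ = conn.append(folder, None, None, msg.as_bytes())
            if status != "OK":
                raise RuntimeError(f"APPEND failed after {uploaded} messages")
            uploaded += 1
    finally:
        box.close()
    return uploaded


# Hypothetical usage (host and credentials are placeholders):
#   import imaplib
#   conn = imaplib.IMAP4_SSL("imap.example.com")
#   conn.login("user@example.com", "app-password")
#   upload_mbox(conn, "Mail-chris.mbox")
#   conn.logout()
```

Real exports can contain messages that do not serialize cleanly, so a production version would want per-message error handling and some way to resume.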

Yes but I don't believe there's any way to automate Takeout. Also each new takeout is a full backup vs incremental.


Since yesterday I have a few Google Takeout zip files in my backup ( https://takeout.google.com/settings/takeout ). I've used gmvault in the past, but this looks superior from the outside. Haven't delved into the data I'll admit.

I am working on digging around this data right now, actually. Some details on my website (see profile) and I will post on HN about it at some point when I get things further along (also want to look at Location History and Chrome data exports).

Why should I use this over an email client like Thunderbird?

I find it incredibly easier to configure and run periodically in a cron job. Plus I've never had even the slightest stability issue with gmvault whereas Thunderbird (which I use anyway to read emails at work) has bad days from time to time.

Additionally, the output format (gzipped plain email + metadata) looks very convenient for indexing / analysis; something I'm dreaming of for a long time.

I think there is no point, but if someone uses only web interface this might be valuable.

Off-topic. What are some of the reasons you use desktop clients? I have tried using thunderbird several times but the habit never stuck. Anything I might be missing out on (apart from local IMAP backups)?

(1) Desktop clients used to be faster and more efficient than using a web interface, and probably still are in most cases. However, Gmail is impressively snappy so I don't think it applies in that case.

(2) You can do things with a desktop client that you can't do with a web interface. That includes sort by subject and sort by size.

(3) Desktop clients can have real folders, which Gmail doesn't.

(4) Works off-line. This is still useful, though not as useful as it used to be.

(5) Easy to backup. Easy to automate backup.

Works offline, has its own OS-level windows (so I don't have to hunt around for the tab, can close the browser but keep e-mail open or vice versa), multiple e-mail accounts under one application while keeping them entirely separated (I don't want to give one provider access to my mailboxes at another provider, and I don't want to have to use multiple different web UIs). Also a matter of habit.

Multiple accounts and GPG support

Do you use IMAP? Do emails get fully downloaded in IMAP?

You can set it up like that, yes. (I'm not sure what the default is, since I haven't created a new Thunderbird profile in years)

It downloads everything by default.

I have used this in the past to backup email accounts of resigned employees so we continue to stay well within the maximum number of active accounts for our free Google Apps.

It generally works fine and it allows you to restore the emails to a different account name. (I sometimes temporarily restore an account to search for old emails). It seems to have some issues with restoring accounts with a lot of large emails (large or multiple attachments) especially those that have reached the 15GB quota.

> Handle all Gmail IMAP hiccups.

It's sad that after all these years, google still hasn't gotten around to properly implementing IMAP.

Is it considered safe to leave your oauth ID and Secret public in open source projects?

I have been using Spanning backup for several years now. It backs up our Gmail, documents, calendars, contacts, and sites. It is one of those set-it-and-forget-it type services. The company has since been acquired by EMC. The service costs us around $35/year/email account; for us it is a small price for peace of mind. I wrote a review of them a few years back - http://reviewofweb.com/gmail/backupify-vs-spanning/

I have been planning to switch to mutt, this tool will also come in handy.

Are there similar tools for MS Outlook or Protonmail, or is it possible to modify gmvault to work with either one?

I use Thunderbird Portable to backup our Google Apps/GSuite email accounts. Open once a week - Bob's your uncle.

Does it work on Raspberry Pi?

I have an always-on Pi server as a git server. I wouldn't mind putting this on it as one more cron job.

It's written in Python, so it should work.

Seems legit. In the past I have used Imapsync to transfer hundreds of email accounts out of gmail / google apps. https://imapsync.lamiral.info/

Ironically using some of these buggy tools can be an effective and swift way to get your account locked out or at least rate limited by anti-abuse systems. Read the code and make sure they are using official APIs and are coded sanely.

They are. The first thing gmvault does is have you authenticate with oauth. From there it's just using the gmail apis / imap to download the data.


Best timing for posting this.

Even before this I've been looking to get a bit independent from Google, atm I'm trying to get Camlistore to replicate my data over various places. (I wish it had some erasure encoding)

Are there similar tools for google docs documents? drive files? contacts?

A lazy alternative is to forward all incoming email to a yahoo account. I've been doing this for years and it's come in handy on the rare occasions when Gmail is unavailable.

Does this also back up attachments and images as well as text? I can't seem to find any reference to this on the site.

EDIT: Ok, it appears gmvault encodes attachments into the eml by default, cool.

I just searched this thread, the site, and github...Here's the best I've found...apparently it supports saving attachments...

https://github.com/gaubert/gmvault/issues/73 >> Users asked about attachments. Put somewhere in the doc that they are saved.

I was trying to push some people off my Google Apps accounts since google won't allow you to "split" some users off the account. Wish I would have seen this then!

Offtopic: I have Gsuite (free) for my company (<50), is it possible to backup all the emails of my users without knowing their passwords (with XOauth tokens?).

Cool, but I'm going to have to do a source code review before I trust any new tool with my Gmail messages.

No Python 3 support (at least under Anaconda), but interested when it's supported.

Why does this matter? Just set up a virtualenv. I've been using this to backup my gmail account for years.

Excellent idea, never occurred to me. -1 for snark, +2 for coming to my aid.

How does this persist/restore filters?

I use mutt setup according to Steve Losh's advice "The Homely Mutt" to achieve this with great success.

Yes, we need to back up our emails before Google deletes our accounts. Google's operations are starting to remind me of Obama's administration, where nobody can actually say anything off-template.
