I hope we can get around to doing it someday. Of course, as usual in an open-source project, contributors welcome :)
(I'm guessing there must already be functionality to diff a bunch of JSON somewhere in the millions of lines of code).
Though I'm sure this doesn't usually make a dent in an SSD's lifetime. But there are still people running Firefox on low-end Android phones with meager flash, and Raspberry Pis with SD cards.
That is actually a good idea we haven't considered yet. A bit too brute force for my tastes, but relatively easy to implement. We would need to determine how much CPU is needed for a diff between two 300 MB JSON files, though (yes, some users have these).
Of course, we're back to the issue of manpower, but definitely worth trying out.
> Though I'm sure this doesn't usually make a dent in an SSD's lifetime. But there are still people running Firefox on low-end Android phones with meager flash, and Raspberry Pis with SD cards.
The implementation of Session Restore for Android is largely independent, so I'm not sure how it works these days.
Or, maybe to reformulate, which wild scenarios does Firefox want to support now? I can imagine that the user's experience wouldn't match the wishes. Some people that use session restore claimed they "lost everything" from time to time, and I had to fish "just the urls" from their session store files which looked strange ("full of everything"), but automatically restored to nothing.
The goal of session restore is to restore your session -- your open tabs should come back, the same pages should load, scrolled to the same place, and with the right content.
Someday the pain may motivate me to try to learn enough about Firefox's internals to do it. :)
It would be interesting if somebody would actually analyze what takes up most of the mentioned 300 MB. I see a lot of base64-encoded stuff; if these are "favicons," come on. There are so many caches in Firefox already, and JSON files certainly aren't the place for these images.
But yeah, storing favicons in Session Restore would be pretty bad. I didn't remember that it was the case, though.
700 KB of binary images in a 1.7 MB session file, which can be compressed only to 70% of its size.
I also see a lot of things like \\u0440 which spends eight characters for one Unicode character (in another file, not from me). But that file was reduced to 37% of its initial size with LZ4. It seems LZ4 is still worth doing, if the content remains easily accessible with external tools, e.g. lz4cli.
It'll compress to about 30% of the size, it's easy to do, and it shouldn't add more than a tiny CPU overhead over formatting the JSON itself.
It solves half the problem with like 15 minutes of work.
Then an hour of writing good tests.
Then lots of manual and automated testing on four or five platforms, and fixing the weird issues you get on Windows XP SP2, or only on 64-bit Linux running in a VM, or whatever.
Then making sure you don't regress startup performance (which you probably will unless you have a really, really slow disk).
Then implementing rock solid migration so you can land this in Nightly through to Beta.
Then a rollback/backout plan, because you might find a bug, and you don't want users to lose their session between Beta 2 and Beta 3.
Large-scale software engineering is mostly not about writing code.
No, for example, LZ4 is unbelievably fast:
almost 2 GB per second in decompression!
I've just tried compressing some backupXX.session file (the biggest I've managed to find, just around 2 MB) and it compressed to 70% of the original, probably not enough to justify implementing the compression -- and I suspect the reason is that the file contains too many base64-encoded image files, which can't be compressed much?
So the first step towards sane session files could be to stop storing the favicons (and other images(?)) there? I still believe somebody should analyze the big files carefully to get the most relevant facts. For a start, somebody should make a tool that extracts all the pictures from the .session file (should be easy with Python or Perl, for example), just so that we know what's inside.
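To make the base64 suspicion concrete, here is a rough sketch that compares how well image-like base64 data and repetitive JSON text compress. It uses zlib purely as a stand-in, because LZ4 is not in the Python standard library, so the exact percentages will differ from LZ4's; the random bytes and the sample JSON string are made-up test data, not taken from a real session file.

```python
import base64
import os
import zlib


def compression_ratio(data: bytes) -> float:
    """Return compressed size divided by original size (lower is better)."""
    return len(zlib.compress(data, 9)) / len(data)


# Random bytes stand in for already-compressed image data (PNG, JPEG, ICO).
image_like = base64.b64encode(os.urandom(200_000))

# Repetitive JSON-ish text stands in for the rest of a session file.
json_like = b'{"url": "https://example.com/page", "title": "Example"},' * 4000

print(f"base64 image data: {compression_ratio(image_like):.0%}")
print(f"repetitive JSON:   {compression_ratio(json_like):.0%}")
```

Base64 carries 6 bits of information in every 8-bit character, so a general-purpose compressor can at best squeeze already-compressed image data back to roughly three quarters of its encoded size. A file dominated by such data would land near the 70% figure above, while the JSON text around it shrinks far more.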
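A minimal sketch of such a tool, assuming the images are stored as base64 `data:image/...` URIs somewhere in the file's text. It scans the raw text with a regular expression instead of walking the JSON, so it does not depend on where in the structure the images sit. The function names are my own, not anything from Firefox.

```python
import base64
import binascii
import re
from collections import Counter

# Matches base64 data URIs such as data:image/png;base64,iVBORw0...
DATA_URI = re.compile(r"data:(image/[\w.+-]+);base64,([A-Za-z0-9+/]+={0,2})")


def find_images(session_text: str):
    """Yield (mime_type, decoded_bytes) for every base64 image data URI found."""
    for match in DATA_URI.finditer(session_text):
        try:
            yield match.group(1), base64.b64decode(match.group(2), validate=True)
        except binascii.Error:
            continue  # skip truncated or malformed payloads


def summarize(session_text: str) -> Counter:
    """Return the total number of decoded image bytes per MIME type."""
    totals = Counter()
    for mime_type, data in find_images(session_text):
        totals[mime_type] += len(data)
    return totals
```

Calling `summarize(open(path, encoding="utf-8").read())` on a session file would show how many bytes each image type accounts for; writing each `decoded_bytes` value to its own file would extract the pictures themselves.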
For a while now I have been running a cronjob to commit my profile's sessionstore-backups directory to a git repo every 5 minutes.
This is because, occasionally, when Firefox starts, it will load tabs in some kind of limbo state where the page is blank and the title is displayed in the tab--but if I refresh the page, it immediately goes to about:blank, and the original tab is completely lost.
When this happens, I can dig the tabs out of the recovery.js or previous.js files with a Python script that dumps the tabs out of the JSON, but if Firefox overwrites those files first, I can't. So the git repo lets me recover them no matter what.
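For anyone who wants to do the same, a script along these lines is enough. This is a sketch, not my exact script, and it assumes the layout I see in my own files: a top-level "windows" list, each window holding a "tabs" list, each tab holding an "entries" history list plus a 1-based "index" pointing at the current entry. Treat those key names as assumptions and check them against your own file.

```python
import json


def list_tabs(session: dict):
    """Yield (url, title) for the current history entry of every tab."""
    for window in session.get("windows", []):
        for tab in window.get("tabs", []):
            entries = tab.get("entries", [])
            if not entries:
                continue  # a tab with no history has nothing to recover
            # "index" is assumed to be 1-based; default to the last entry.
            index = tab.get("index", len(entries))
            entry = entries[min(max(index, 1), len(entries)) - 1]
            yield entry.get("url", ""), entry.get("title", "")


def load_tabs(path: str):
    """Read a session file from disk and return its tabs as a list."""
    with open(path, encoding="utf-8") as handle:
        return list(list_tabs(json.load(handle)))
```

Running `load_tabs("recovery.js")` and printing the pairs gives a plain list of URLs and titles that can be pasted back into the browser by hand.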
What I have noticed is that the git repo grows huge very quickly. Right now my recovery.js file is 2.6 MB (which seems ridiculous for storing a list of URLs and title strings), but the git repo is 4.3 GB. If I run "git gc --aggressive" and wait a long time, it compresses down very well, to a few hundred MB.
But it's simply absurd how much data is being stored, and storing useless stuff like images in data URIs explains a lot.
Like you, I also observed that exactly the people who depend on the tabs to "remain" after the restart are those who are hit by the bugs in the "restoration", and as I've said, I believe users would much prefer "stable" tabs and URLs over "fully restored sessions in the tabs" when all the tabs disappear fully randomly (for them). Maybe saving just the tabs and URLs separately from "everything session" would be a good step towards robustness (since it would be much less data, with much less chance of getting corrupted), and then maybe, pretty please, an option "don't save session data" could be in about:config too?
Once there's a decision to store just the URLs of the tabs as a separate file, the file can even be organized so that only the URL that changed gets rewritten, making "complete corruption" of the file impossible and also removing the need for Firefox to keep N older versions of the big files (which in the end still don't help users like you).
Yes, that would be very, very useful. I can get by if the tab's scroll position and favicon and DOM-embedded images--and even formdata--are lost. But if the tab itself is lost, and it was a page I needed to do something with, I may never even realize it is gone...
Add the code that's able to load compressed session backups and leave it in for a couple versions.
Once enough versions have passed, enable the code that writes compressed session backups.
It's really not that hard to do unless you want to enable it now.
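Roughly what I mean, as a sketch: the reader accepts both formats from day one, and the writer stays on the old format until a flag is flipped. The `MAGIC` marker and the `WRITE_COMPRESSED` flag are made up for illustration, and zlib stands in for whatever compressor would actually be used.

```python
import json
import zlib

MAGIC = b"SESSZ1\n"       # hypothetical marker identifying the compressed format
WRITE_COMPRESSED = False  # flip only after the reader has shipped for a few versions


def load_session(raw: bytes) -> dict:
    """Read either format: compressed files start with MAGIC, old ones are plain JSON."""
    if raw.startswith(MAGIC):
        raw = zlib.decompress(raw[len(MAGIC):])
    return json.loads(raw.decode("utf-8"))


def dump_session(state: dict, compressed: bool = WRITE_COMPRESSED) -> bytes:
    """Serialize the session, compressing only once the rollout flag is on."""
    data = json.dumps(state).encode("utf-8")
    return MAGIC + zlib.compress(data) if compressed else data
```

Because plain JSON never begins with the marker, a build that only has the new reader still loads old files, and a rollback to that build can still load files written after the flag was flipped.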
Also, what do these add-ons do? The only use case I can think of is figuring out whether the user has a tab open to a given site, and that's going around the browser's security model, so breaking that would be a good thing.
It's something that sounds easy until you actually try to get it coded up and shipping.
I explained below that this thing isn't a factor for Android because the program gets suspended.
It doesn't talk about the difficulties in getting data safely to disk. There's just a worry that taking a hash of the entire session state is expensive. I'm skeptical that a fast hash would take long compared to the time spent serializing to JSON in the first place, let alone time spent diffing.
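To show what I have in mind, a small sketch: hash the serialized bytes and skip the disk write when the digest matches the previous one. The `SessionWriter` class and its `write` callback are hypothetical names, and BLAKE2 is just one fast hash that ships with Python.

```python
import hashlib
import json


class SessionWriter:
    """Skips the disk write when the serialized state has not changed."""

    def __init__(self):
        self._last_digest = None

    def save(self, state: dict, write) -> bool:
        """Serialize state, call write(data) only if it differs from the last save."""
        data = json.dumps(state, sort_keys=True).encode("utf-8")
        digest = hashlib.blake2b(data, digest_size=16).digest()
        if digest == self._last_digest:
            return False  # identical to what is already on disk
        write(data)
        self._last_digest = digest  # remember only after a successful write
        return True
```

The hash is computed over bytes that had to be produced for the write anyway, so its cost sits on top of the JSON serialization, not in place of it.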
1. Thanks for your work.
2. How do I really contribute? Could you link to what I need to do to start working on this now? I'm affected by this problem and want to figure out if I can fix it.
That advice was a lot simpler than I thought it would be.
For example, let's say I have 10 tabs open but I have been using only the latest tab during the last 5 minutes. In this scenario I don't care about the cookies (eg due to ongoing ajax) nor the state of the other 9 tabs. If the browser crashes I'm OK with those 9 tabs being restored with a 5 minutes old snapshot.
So, for instance, FF should be smart enough to save the state of new/recent tabs frequently, and should slow down progressively on old/inactive tabs.
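One way that could look, as a sketch: each tab gets its own save interval that starts at 15 seconds and doubles for every minute the tab has been idle, capped at 5 minutes. The doubling rule and both function names are my own illustration, not how Firefox schedules its saves.

```python
def save_interval(idle_seconds: float, base: float = 15.0, cap: float = 300.0) -> float:
    """Return how often (in seconds) a tab idle for idle_seconds should be saved."""
    # One doubling per full minute of inactivity, bounded so the power stays small.
    doublings = min(int(max(idle_seconds, 0) // 60), 10)
    return min(cap, base * 2 ** doublings)


def tabs_due(tabs: dict, now: float) -> list:
    """Return ids of tabs whose last save is older than their current interval.

    tabs maps tab id -> (last_active, last_saved), both timestamps in seconds.
    """
    return [
        tab_id
        for tab_id, (last_active, last_saved) in tabs.items()
        if now - last_saved >= save_interval(now - last_active)
    ]
```

With 10 tabs open and only one in use, the active tab is saved every 15 seconds while the other nine settle at one save every 5 minutes.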
On the back-end, I believe that Session Restore should be backed by a database, with each tab updated independently, rather than a big bunch of JSON data. The rationale being that:
- it's performance-critical;
- it's safety-critical;
- we have users with 300+ MB of data in their Session Restore and JSON isn't meant for this scale of data;
- we wouldn't need to rewrite x MB of data every 15 seconds, just a per-tab update;
- if we're using a relational database, it would be easier to trust the code to not screw up with the data;
- we wouldn't need to load the entire Session Restore upon startup.
(On the minus side, this might make backups a bit more complicated.)
On the front-end, we would need a high-level API for Session Restore, which would let us do things such as accessing per-tab data, (de)hibernating tabs, etc. Oh, and it would need to be accessible by WebExtensions.
In the middle, we would need to re-engineer Session Restore to make sure that we don't need to maintain this huge object representing the entire state of the session. We would also need improvements e.g. to cookie management, to avoid having to re-collect cookies so often.
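To give an idea of the back-end part, here is a very rough sketch using SQLite through Python's sqlite3 module. The table layout, the column names, and the function names are invented for illustration; nothing like this exists in Session Restore today.

```python
import json
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a per-tab session store."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS tabs ("
        "tab_id INTEGER PRIMARY KEY, url TEXT NOT NULL, title TEXT, state TEXT)"
    )
    return db


def save_tab(db: sqlite3.Connection, tab_id: int, url: str, title: str, state: dict) -> None:
    """Update a single tab in its own transaction, leaving other rows untouched."""
    with db:  # commits on success, rolls back if the statement raises
        db.execute(
            "INSERT OR REPLACE INTO tabs (tab_id, url, title, state) VALUES (?, ?, ?, ?)",
            (tab_id, url, title, json.dumps(state)),
        )


def list_urls(db: sqlite3.Connection) -> list:
    """Return just the URLs, without loading any per-tab state."""
    return [row[0] for row in db.execute("SELECT url FROM tabs ORDER BY tab_id")]
```

A change in one tab becomes a single-row write, and startup can call something like `list_urls` to show the tab strip before any of the heavier per-tab state is read.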
Why not just use the filesystem? Imagine a layout like this:
| |--+ tab1
| | |--- cookies
| | |--- formdata
| | |--- title
| | |--- url
| ---+ tab2
| |--- cookies
| |--- formdata
| |--- title
| |--- url
| |--- cookies
| |--- formdata
| |--- title
| |--- url
No serializing and writing entire sessions to disk at once. Only small writes to small files. All plain-text, easy to read outside of the browser. Easy to copy, backup, modify, troubleshoot. No complex JSON, serializing, or database code. Use the write-to-temp-file-then-rename-over-existing-files paradigm to get atomic updates.
Need to know if part of a session on disk is stale? Check the mtime for that file, see if it's older than the last time that data was changed in memory.
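Something like this sketch covers both points: each field is written to a temporary file in the same directory and then renamed over the real one, and staleness is a comparison against the file's mtime. The function names are made up, and a production version would probably also fsync the directory itself.

```python
import os
import tempfile


def write_field(tab_dir: str, name: str, value: str) -> None:
    """Atomically replace tab_dir/name with value via write-to-temp-then-rename."""
    os.makedirs(tab_dir, exist_ok=True)
    # The temp file lives in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=tab_dir, prefix=name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(value)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the data to disk before the rename
        os.replace(tmp_path, os.path.join(tab_dir, name))
    except BaseException:
        os.unlink(tmp_path)  # do not leave a half-written temp file behind
        raise


def is_stale(tab_dir: str, name: str, changed_at: float) -> bool:
    """Return True if the file on disk is older than the in-memory change time."""
    try:
        return os.path.getmtime(os.path.join(tab_dir, name)) < changed_at
    except FileNotFoundError:
        return True  # never written, so it certainly needs saving
```

A reader either sees the old contents or the new ones, never a partial file, because the rename is the only step that touches the real path.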
Simple stuff. Straightforward. No overengineering. No bloat.
Why not do this?
It's a non-issue anyway; I have yet to see anyone show any actual evidence of an SSD dying prematurely or suffering degraded performance from this behaviour.
Also, people who keep Session Restore tabs open (you know, the tab that lets you restore your session, when you have crashed) and continue browsing – Firefox needs to store several nested Session Restore JSON files.
To mitigate this I have created a daemon to copy the file to a separate directory every few hours. I then delete those old versions manually every year or so.
The git repo grows in size rapidly, but can be compressed way down. (Today it was 4.3 GB, but "git gc" compressed it to 230 MB). Every now and then you can blow away the repo or filter-branch to get rid of outdated sessions.
Maybe you could give users the option? (until it can be fixed fully)
about:config shows a big warning message "This may void your warranty".