
SingleFileZ, a web extension for saving pages as HTML/ZIP hybrid files - ivank
https://github.com/gildas-lormeau/SingleFileZ
======
wongarsu
MHTML seems to have the same purpose, is supported by all browsers (except
Firefox, but including IE5) and has been standardized since 1999.

It's a strange format, but I think it falls in the "good enough" category.
Still, I never see it in the wild.

[https://en.m.wikipedia.org/wiki/MHTML](https://en.m.wikipedia.org/wiki/MHTML)

~~~
capableweb
The more ubiquitous format in modern times is Web ARChive (WARC[0]) format
which is supported by tools like wget, Apache Nutch and organizations like
Internet Archive and most national libraries[1]

WARC is also standardized by ISO and has a nice spec that's pretty easy to
understand[2]

@gildas I saw there was a comparison table[3] but it seems to be missing WARC.
Could you shed some light on why?

\- [0]
[https://en.wikipedia.org/wiki/Web_ARChive](https://en.wikipedia.org/wiki/Web_ARChive)

\- [1]
[http://digitalia.sbn.it/article/view/1473](http://digitalia.sbn.it/article/view/1473)

\- [2] [https://iipc.github.io/warc-specifications/specifications/wa...](https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/)

\- [3] [https://github.com/gildas-lormeau/SingleFile#file-format-com...](https://github.com/gildas-lormeau/SingleFile#file-format-comparison)

~~~
gildas
I included in the table only formats that are at least supported natively in
one modern browser. That's why the WARC format is missing. I'm not against
adding it in the table though.

~~~
capableweb
I see. Thanks for your reply! Yes, to open a WARC file you would need
something like [https://github.com/webrecorder/webrecorder-player](https://github.com/webrecorder/webrecorder-player) or another
viewer[0], but the benefit is that you can contribute to the web's archiving
efforts and just upload the result directly to the Internet Archive!

With that said, I do understand your motivation.

\- [0]
[https://www.archiveteam.org/index.php?title=The_WARC_Ecosyst...](https://www.archiveteam.org/index.php?title=The_WARC_Ecosystem#WARC_viewer)

------
znpy
This is completely unrelated, but I want to ask this question because people
that know about this are probably following the thread:

Assuming I download webpages via SSL/TLS, would there be a way to also
save their cryptographic signature so that the resulting file, along with the
website certificate, could be verifiable, possibly in court?

I've seen a number of situations where malicious public clerks do not update an
official public website with information about upcoming events, and then
update it once it's too late.

I'm wondering whether I could use HTTPS features to bring such actors to
court.

~~~
jstanley
[https://tlsnotary.org/](https://tlsnotary.org/) might do what you want.

~~~
znpy
Thanks!

------
jccalhoun
This is interesting. I've been using
[https://github.com/danny0838/webscrapbook](https://github.com/danny0838/webscrapbook)
since Firefox changed their addons. It also saves web pages in a zip file but
names them with a .htz extension.

I saved this page with both webscrapbook and singlefilez. Both archives looked
the same. Webscrapbook's was 22.4kb while singlefilez was 65.9kb. I unzipped
singlefilez and rezipped it with higher compression and got it to 20kb but it
wouldn't open in the browser.

While the size doesn't really matter, what I don't like is that singlefilez
renamed the images to sequentially numbered files (1.gif, 2.gif, etc.) and
css to stylesheet_0.css while webscrapbook kept the original names of the
files. I would much rather it kept the original file names.

~~~
gildas
Thank you for your feedback.

The additional 40KB corresponds to the part which self-extracts the zip file
(in order to view the page without installing any extension). Note that the
original URLs can be found in the comments of each entry in the zip file.
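
As a rough illustration of reading those per-entry comments back, here is a minimal sketch using Python's standard `zipfile` module. It is not SingleFileZ's own code: the helper name `list_original_urls`, the entry name, and the URL are made up for the example, and the archive is built in memory as a stand-in for a saved page.

```python
import io
import zipfile


def list_original_urls(archive):
    """Map each zip entry name to the text stored in its per-entry comment."""
    with zipfile.ZipFile(archive) as zf:
        return {
            info.filename: info.comment.decode("utf-8", "replace")
            for info in zf.infolist()
        }


# Build a tiny stand-in archive in memory (hypothetical entry name and URL).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    entry = zipfile.ZipInfo("images/1.gif")
    entry.comment = b"https://example.com/assets/logo.gif"
    zf.writestr(entry, b"GIF89a")

buf.seek(0)
urls = list_original_urls(buf)
print(urls)
```

The same helper could be pointed at a file path instead of the in-memory buffer, assuming the saved file is readable as a zip archive by the tool at hand.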

------
burtonator
I wrote a similar feature within Polar:

[https://getpolarized.io](https://getpolarized.io)

It can take full HTML files, internalize the CSS + HTML and compile them
into a .zip file.

The major difference is that it's an Electron app and not a web extension,
though we're about 80% of the way done porting all of it to a web extension.

In retrospect, I would have done this as an EPUB.

Our users have asked for other features like taking multiple pages and
building them as one 'book' and this feature is supported in EPUB.

Also, it would mean Polar would support EPUB natively anyway which is another
big feature we need as we only support PDFs right now.

A lot of people here mention MHTML and WebArchive formats.

I think my main criticism of these is that EPUB has more universal 'reader'
support and EPUB 3.0 is basically just HTML in an enclosure anyway.

~~~
gildas
Actually, what is supposed to be interesting and innovative in SingleFileZ is
the fact that it produces _self-extracting_ valid zip files in the form of
HTML files. Thus, they can be read natively by any modern browser that
supports JavaScript.
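
To illustrate the general idea of a file that is both HTML and a zip (and only the idea: this is a sketch under assumptions, not SingleFileZ's actual file layout or bootstrap script), the snippet below prepends a hypothetical HTML prefix to a zip archive and shows that a zip reader still finds the entries, because zip readers locate the central directory from the end of the file. `HTML_PREFIX` and `make_hybrid` are invented names.

```python
import io
import zipfile

# Hypothetical HTML prefix; the real SingleFileZ bootstrap page is not reproduced here.
HTML_PREFIX = (
    b"<!DOCTYPE html><html><body>"
    b"<script>/* an extraction script would go here */</script><!--\n"
)


def make_hybrid(files):
    """Return bytes made of an HTML prefix followed by a zip archive."""
    zip_buf = io.BytesIO()
    with zipfile.ZipFile(zip_buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    return HTML_PREFIX + zip_buf.getvalue()


hybrid = make_hybrid({"index.html": b"<h1>Saved page</h1>"})

# The leading HTML bytes do not prevent the archive from being read.
with zipfile.ZipFile(io.BytesIO(hybrid)) as zf:
    names = zf.namelist()
    page = zf.read("index.html")

print(names, page)
```

A browser would treat the same bytes as an HTML document, which is where a script embedded in the page would take over and unpack the archive.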

~~~
burtonator
Ah. Interesting. That's a good innovation. How do you determine the
domain/URL? I guess it's just the URL of the source?

What would you do if your HTML extraction script had a bug in the pre-compiled
form? I guess you're just stuck?

I guess that's not the end of the world.

The EPUB form in a future version of Polar would at least require an EPUB
reader which makes it a bit heavier.

~~~
gildas
SingleFileZ creates its own paths to reference all the resources of the page
in the zip. This prevents any invalid path issues. The URL of the resource is
stored as a comment for each entry in the zip though (I'm not sure I'm really
answering your question). If there is a bug in the extraction script, there
is still a good chance you can unzip the file.

------
mikaelmorvan
Having a component that can save a web page in a very compact format is great!
It's far better than MHTML.

In addition, the backup is not the usual naive copy of the files but a real
backup of the page as interpreted by the browser.

For me it is by far the best component for saving web pages.

------
jplayer01
Can somebody explain the benefit of this over just SingleFile? I’ve seen this
fork before, but never quite understood how it differs and how it might
benefit me more.

~~~
gildas
The main benefit is the size of the saved page. The file will be smaller
because binary resources (e.g. images) are not encoded in base 64 [1].
Moreover, the page and these resources are also compressed. The other benefit
is that you can unzip the saved page and edit it more easily than a page saved
with SingleFile because the saved page won't contain data URIs [2].

[1]
[https://en.wikipedia.org/wiki/Base64](https://en.wikipedia.org/wiki/Base64)

[2]
[https://en.wikipedia.org/wiki/Data_URI_scheme](https://en.wikipedia.org/wiki/Data_URI_scheme)
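
The size argument can be checked with a few lines of Python. This is only a sketch: the random bytes stand in for an already-compressed image, and the repeated HTML fragment stands in for page markup; neither comes from a real saved page.

```python
import base64
import os
import zlib

# Stand-in for an already-compressed binary resource such as an image.
raw = os.urandom(30_000)

# Base64 turns every 3 bytes into 4 ASCII characters, so a data URI
# carries roughly one third more bytes than the resource itself.
encoded = base64.b64encode(raw)
overhead = len(encoded) / len(raw)

# Text resources such as HTML, on the other hand, usually compress well.
html = b"<p>Hello, archive!</p>\n" * 1000
compressed_html = zlib.compress(html)

print(f"binary: {len(raw)} bytes raw, {len(encoded)} bytes as base64")
print(f"html:   {len(html)} bytes raw, {len(compressed_html)} bytes deflated")
```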

------
teddyh
I use the Firefox “ScrapbookQ” addon:

[https://addons.mozilla.org/en-US/firefox/addon/scrapbookq/](https://addons.mozilla.org/en-US/firefox/addon/scrapbookq/)

Mostly because I used to use the older “Scrapbook” add-on (before it stopped
working in Firefox 60) and I still have a number of pages saved in that format
– ScrapbookQ is, with some effort, compatible with those saved pages.

------
donatj
Why not just use the existing WebArchive format?

~~~
gildas
I'm the author of the extension. Browsers don't offer a way to register a
(browser) extension with a given mime-type or filename extension. That's why
SingleFileZ uses the HTML format to wrap the zip content.

~~~
dhruvdh
Quick question, can I also use this to serve a webpage?

As in take a website I am developing and have my server serve SingleFileZ
instead of what I would usually.

~~~
gildas
Yes, you can. See this page for example: [https://gildas-lormeau.github.io/](https://gildas-lormeau.github.io/) (check the source code
of the page if you're curious). It was saved with SingleFileZ and is served
via an HTTP server. Note that you don't need any extension to view it in modern
browsers.

On paper, you could even store an entire website in a SingleFileZ file. It
just needs to be implemented...

~~~
NKosmatos
That would be great! I know of a few scenarios where HTTrack and SingleFileZ
would save a lot of space and effort.

------
majkinetor
Very nice alternative to MHT.

FYI, when I saved a GitHub README page, the spot where the animated GIF should be was blank.

Thanks.

~~~
gildas
Thank you. I could not reproduce your issue on the latest versions of Chrome
and Firefox. I'll try to do more tests to see what went wrong on your end.

------
BiteCode_dev
What's the benefit over SingleFile, which this was forked from?

I use it regularly, and it creates a single pure HTML file with inline media.
Easy to read anywhere.

So apart from the compression gain, why the need for SingleFileZ?

------
falcolas
Completely personal opinion here: I love PDF for saving web pages. They
preserve the formatting, don't include javascript, and can be viewed across
most OSes and browsers.

~~~
Tagbert
But PDF does a terrible job of preserving formatting. In most cases it breaks
a site up into pages based on paper size without regard to content. Many
elements do not render correctly, or the site uses a print media stylesheet
that drops the formatting. Finally, like most PDFs, the result is frozen into
a print-based size, which makes subsequent viewing clumsy.

~~~
jvzr
Some tools allow for a full-page PDF (the whole page as a long, continuous
PDF)

------
watersb
I wish EPUB were a viable format for web page archiving. Too much overhead for
a single page?

