
What If OpenDocument Used SQLite? - ejstronge
https://sqlite.org/affcase1.html
======
falcolas
I'd like to take a moment and look at this "what if" from the point of view of
data archiving.

Zip files use a well-documented, easily reproducible, and widely implemented
deflate algorithm. It's reasonable to assume that someone could, 100 years
from now, unzip an OpenDocument file. The contents are XML files - basically
text - which can be read with any simple text editor and parsed. The XML is
also at least somewhat self-documenting regarding the contained document.

SQLite databases have a binary format which is known to change periodically.
An attempt to read a SQLite file in 100 years would have to know the exact
version of the file, and be able to reach back through history to find that
version of SQLite (or a compatible version), and find a way to compile it for
whatever CPU architecture exists in 100 years.

This could be mitigated by including the appropriately versioned documentation
of the binary format with the database itself - however to the best of my
knowledge, full documentation of the binary format (outside the code) does not
exist. The end product - a relational database - would also be less self-
documenting than the XML version.

IMO, it would be much better to keep the inefficient zipped XML files.

[EDIT]: The binary file format:
[https://www.sqlite.org/fileformat.html](https://www.sqlite.org/fileformat.html)

Learn something new every day.

~~~
silotis
> SQLite databases have a binary format which is known to change periodically.
> An attempt to read a SQLite file in 100 years would have to know the exact
> version of the file, and be able to reach back through history to find that
> version of SQLite (or a compatible version), and find a way to compile it
> for whatever CPU architecture exists in 100 years.

I contend this is not as big of a problem as you make it sound. First of all,
the sqlite file format states its version in a string at the beginning of the
file, so determining the version of a file is trivial even without any special
tools. The sqlite source code is extremely widely distributed. Between linux
distros and the many projects which, for better or worse, embed sqlite in
their source trees, there are easily over a million copies of sqlite3.c
floating around. The chances of all of these copies being lost, even in a
catastrophic event, are negligible. Even if we did somehow lose every copy of
sqlite3.c, reverse engineering a binary format is far from impossible. In fact
the old binary MS Office file formats were reverse engineered by multiple
groups.
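
For illustration, checking that marker takes nothing more than reading the
first 16 bytes of the file; a minimal Python sketch (the file name is made up):

```python
import sqlite3

# Make a database so there is a file to inspect.
con = sqlite3.connect("example.db")
con.execute("CREATE TABLE t (x)")
con.commit()
con.close()

# Every SQLite 3 database begins with the same 16-byte magic string,
# so identifying the format needs nothing but a header read.
with open("example.db", "rb") as f:
    header = f.read(16)

print(header)  # b'SQLite format 3\x00'
```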

The chances of us losing the ability to compile C code are even more
negligible. If human civilization has fallen that far, we will probably be too
busy killing each other with sharp sticks over drinkable water to care about
reading some hundred year old documents.

~~~
falcolas
Not that I disagree that sqlite is fairly ubiquitous _today_, but remember
that not that long ago, dBase II, dbm, and BerkeleyDB were fairly ubiquitous.

I give you "the relevant XKCD":

[https://xkcd.com/1909/](https://xkcd.com/1909/)

Somewhat more practically, a relevant D-Lib Magazine [0] quote: "not all file
formats are suitable for long term preservation, even if they have an open
specification. Some lossy and compressed file formats pose a higher risk of
total loss if even a single bit is lost."

And from the Library of Congress [1], in the context of format preferences for
text with structural markup: "XML or SGML using standard or well-known DTD or
schema appropriate to a particular textual genre."

[0]
[http://www.dlib.org/dlib/july16/houghton/07houghton.html](http://www.dlib.org/dlib/july16/houghton/07houghton.html)
[1]
[http://www.digitalpreservation.gov/series/challenge/](http://www.digitalpreservation.gov/series/challenge/)

~~~
comex
> "not all file formats are suitable for long term preservation, even if they
> have an open specification. Some lossy and compressed file formats pose a
> higher risk of total loss if even a single bit is lost."

Wouldn't this issue apply _more_ to OpenDocument, which is compressed (into a
ZIP archive), than SQLite, which (at least by default) is not?

But then, I question the appropriateness of the advice. If you're serious
about archiving, you should be using error-correcting codes in some form, so
that the archived data will remain recoverable bit-for-bit even with a large
number of bit errors in the underlying medium. To be honest, I'm not that
familiar with long-term archiving practices, but if you have some kind of RAID
setup, that should give you both error correction (for bit errors) and drive
redundancy (for loss of entire drives). Alternatively, you could use dedicated
ECC tools like par2.
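
As a toy illustration of "error-correcting codes in some form" (this is
single-erasure XOR parity, the RAID-4/5 idea, not the Reed-Solomon scheme par2
actually uses): one parity block lets you rebuild any one lost data block.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

data = [b"hello wo", b"rld, arc", b"hive me!"]  # three 8-byte data blocks
parity = xor_blocks(data)                       # one extra parity block

# Lose block 1: XOR the two survivors with the parity to rebuild it.
recovered = xor_blocks([data[0], data[2], parity])
print(recovered)  # b'rld, arc'
```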

True, most data that gets preserved will probably be preserved by chance, by
people who are _not_ serious about archiving, and may not take sufficient
steps to prevent errors. But they're also not going to choose a format for
optimal archiving either, so you're kind of stuck with the fact that many
modern file formats have built-in compression and/or checksums, and thus don't
hold up well when corrupted. We could keep the issue in mind when designing
new formats, but is resilience to corruption really worth the additional
storage cost of leaving data uncompressed? Or perhaps we could design formats
to have built-in ECC instead of just checksums, but that would also waste
space...

------
mjw1007
For me, the biggest reason not to do this sort of thing is that SQLite still
often refuses to work over network filesystems.

I appreciate that this is because the authors are being careful about possible
corruption when locking isn't working properly, but if they really mean the «
SQLite competes with fopen() » line they need to find a way to get round this,
because fopen() doesn't do that.

~~~
PaulHoule
Amazingly, the Jet engine in Microsoft Access (a file-based database a lot
like SQLite) works very well over Microsoft's network filesystem, but it helps
that file locks in Windows are much more solid than file locks in Linux. (The
tradeoff is that Windows machines go out to lunch every so often)

This document

[http://www.adobe.com/content/dam/Adobe/en/devnet/xmp/pdfs/XM...](http://www.adobe.com/content/dam/Adobe/en/devnet/xmp/pdfs/XMPSpecificationPart3.pdf)

is a great overview of a number of desktop and media file formats
(particularly in the context of how to find holes in them in which you can
stick metadata.) Read between the lines and you can understand some of the
tradeoffs; the database-based format shows up once or twice and, from the
viewpoint of XMP, it is the one in which it is hardest for a foreign program
to find an embedded media packet. On the other hand, SQLite is open source so
you can say "just open it with SQLite".

~~~
WorldMaker
Modulo the fact that friends don't let friends use Jet in 2017. ;)

Microsoft hasn't supported Jet in forever (a decade or more), and doesn't
recommend it. The number of corrupted MDB files (and stranger, corrupted LDB
lock files) I've had to work with in the last couple of years (in trying to
migrate Jet-using VB6 apps to something 2017 has to offer) has taught me
that Jet/MDB is a database format that should be properly thanked for its
service over the years, then gracefully old yellered.

------
peterkelly
That would be coming full circle, and going back to the bad old days of binary
.doc and .xls files, which used an efficient data store for fast incremental
updates:

[https://www.joelonsoftware.com/2008/02/19/why-are-the-microsoft-office-file-formats-so-complicated-and-some-workarounds/](https://www.joelonsoftware.com/2008/02/19/why-are-the-microsoft-office-file-formats-so-complicated-and-some-workarounds/)

OpenDocument and its competitor (OOXML) both use zip/XML for portability
reasons. Yes, they're less efficient. But they're also completely independent
of any specific software implementation [1].

[1] Mostly - OOXML has some parts that depend on MS Office's implementation.
But ODF is completely described by the spec.

~~~
abtinf
I don't see how your comment is responsive to the excellent article.

doc and xls were proprietary binary formats.

sqlite files couldn't be more open - the software to interact with them is in
the public domain and ported everywhere.

And the article itself doesn't even suggest using sqlite as a binary store.
Rather, it suggests using the capabilities of sqlite to augment text-based
data.

~~~
mseebach
> sqlite files couldn't be more open

Yes, actually they could. They could be readable without any software at all.
Like XML.

All of the improvements suggested in the article could be implemented by
extending the XML format to allow including by other XML files by reference.
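
The W3C XInclude mechanism already does roughly this; a sketch using Python's
stdlib support (file names are made up):

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

# A fragment stored as its own XML file.
with open("chapter.xml", "w") as f:
    f.write("<chapter><p>Included text</p></chapter>")

# The main document pulls the fragment in by reference via XInclude.
with open("main.xml", "w") as f:
    f.write('<doc xmlns:xi="http://www.w3.org/2001/XInclude">'
            '<xi:include href="chapter.xml"/></doc>')

root = ET.parse("main.xml").getroot()
ElementInclude.include(root)        # resolve xi:include elements in place
print(root.find("chapter/p").text)  # Included text
```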

~~~
jwr
This is a common recurring theme that I find strange. Why do people believe
that if something can be browsed with less/Notepad, it is somehow more
"readable" than a binary representation? I've seen and worked with tons of
XML, and most of it was utterly useless without accompanying documentation,
just as a binary format would be. The fact that I could "see" the XML on my
screen did not make the job easier.

The problem with formats "readable without any software" is that you usually
end up with everything represented as strings, with no knowledge of actual
types and constraints. So, you still need that documentation, even though the
format is "readable without any software".

~~~
enord
It's not about human readability, it's about separating implementation from
representation, so "readable without any software" is a good test for whether
it's "readable with any software".

If your document editor represents documents as a stack of edit-events (for
undo/redo or whatever) and you document every nook and cranny of this
"document format", standardize it as "binary event-sourced document format,
.esd" then your binary spec will be very close to a full description of your
document editor. Everyone who wants to use your format now has to reimplement
your editor.

Anyone can leak implementation details into any format, and it is to some
extent unavoidable, but if you can make it readable (or semi-readable), some
thought has at least been put into the representation.

~~~
_pmf_
> so "readable without any software" is a good test for whether it's "readable
> with any software"

What an extraordinarily stupid notion. Your filesystem is not readable without
a filesystem driver for the specific binary format used to represent it, so
you cannot even read your text file without having something in your stack
that understands a more structured data organization scheme.

~~~
enord
Yes, but that comes with the operating system that comes with the computer I
bought and use in the house, which comes with electricity from the electrical
grid in the city that lies in an aerated valley (luckily!) on the third planet
from the sun. I should have added a disclaimer, I guess.

~~~
justinclift
Just to point out... SQLite comes built in to pretty much every OS these days
too. :)

------
btrask
I'm a big fan of SQLite, but this kind of marketing puts me off a bit. ZIP and
XML are popular standards with dozens of implementations. The only thing that
can read an SQLite DB is SQLite, AFAIK.

To be sure, SQLite is very well supported, runs on tons of platforms and is
going to be around for a long time. But still, it is less archival quality,
and more inherently complex (even though that complexity is very well
encapsulated).

~~~
abtinf
Why do you view a "standard" or "dozens of implementations" as a good thing -
what is the context of your evaluation?

Why is it not comparable, if not significantly better, to have a mostly
uniform public domain code base that has been deployed to billions of devices?

~~~
lifthrasiir
The diversity of mutually compatible implementations keeps a critical issue in
a dominant implementation from being catastrophic, and keeps the standard
moving in a direction no single party can totally influence. SQLite _does_
have a list of semi-standardized requirements [1] which should technically
allow for alternative implementations---not having them today makes it less
viable as a true standard.

[1]
[https://www.sqlite.org/requirements.html](https://www.sqlite.org/requirements.html)

~~~
SQLite
A better starting point might be the database file format description.

[https://www.sqlite.org/fileformat.html](https://www.sqlite.org/fileformat.html)

------
nathell
In my app (a search tool for Polish corpora [1]), I'm using ZIP with zero
compression (stored files) as a container file format. The upside is that I
can easily read the archive contents, and then mmap parts of the file as
needed [2].

[1]: [https://github.com/nathell/smyrna](https://github.com/nathell/smyrna)

[2]:
[https://github.com/nathell/smyrna/blob/master/src/clj/smyrna...](https://github.com/nathell/smyrna/blob/master/src/clj/smyrna/corpus.clj#L86)
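
The trick is only that you have to skip each entry's local header to find its
payload; a rough Python sketch of the same idea (file and entry names are made
up):

```python
import mmap
import struct
import zipfile

# Build a "stored" (uncompressed) archive.
with zipfile.ZipFile("corpus.zip", "w", zipfile.ZIP_STORED) as z:
    z.writestr("tokens.bin", b"\x00\x01\x02\x03" * 4)

with zipfile.ZipFile("corpus.zip") as z:
    info = z.getinfo("tokens.bin")

# A local file header is 30 fixed bytes plus a variable-length file
# name and extra field; the stored payload follows immediately after.
with open("corpus.zip", "rb") as f:
    f.seek(info.header_offset + 26)
    name_len, extra_len = struct.unpack("<HH", f.read(4))
    start = info.header_offset + 30 + name_len + extra_len
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    payload = bytes(mm[start:start + info.file_size])
    mm.close()

print(payload == b"\x00\x01\x02\x03" * 4)  # True
```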

~~~
nerdponx
Why ZIP versus something like TAR?

~~~
mkj
You can seek to files within a zip file, tar doesn't have an index.

~~~
pyre
Being the (T)ape (AR)chive format, it was designed for tape backup, which
doesn't really have random access. IIRC you would basically cat the output of
tar to the tape.

------
ChuckMcM
This seems like an excellent working format for documents for all of the
reasons listed. One could ask why aren't all structured documents built in
just this way. In particular I believe that CAD files built like this would be
much more robust than they currently are in the likes of TurboCAD.

~~~
randyrand
I don't think SQLite would be good for 3d data exchange between
companies/programs, to name one example.

GLTF is the newest format for this, and it's much more suited to the task.

~~~
ChuckMcM
Thanks for the pointer, although it seems focused on '3D' as in models and
assets for games and renderers, not '3D' as in the dimensions and mechanical
properties of concrete walls. Still, I downloaded the spec from the GitHub
repository for a closer read.

The two important clients for 3D cad data are analysis software (CFD,
Structural, Etc) and CAM software for manufacturing. I didn't get a sense that
GLTF was looking at that space at all when it was being developed.

------
badsectoracula
FWIW ZIP files also allow for partial saves as each entry is stored in a
separate stream so all you need to do is update the directory, which is placed
at the end of the file (in theory you could update archives in an append-only
manner when saving and have a pack command to rebuild the ZIP file in order to
remove the unnecessary data). It is just that (AFAIK) no program that uses ZIP
as a generic container bothers with that (and besides, implementing the
necessary logic for the partial saves might actually be the harder bit,
regardless of the storage format).
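
For what it's worth, Python's `zipfile` can already be used this way: append
mode adds the replacement entry plus a fresh central directory at the end, and
the newest copy of a duplicated name wins. A sketch (file names are made up):

```python
import zipfile

# Initial save.
with zipfile.ZipFile("doc.zip", "w") as z:
    z.writestr("content.xml", "<doc>version 1</doc>")

# "Incremental" save: append a replacement entry under the same name.
# zipfile warns about the duplicate, but the central directory written
# at the end points readers at the newest copy; the old bytes simply
# linger in the file until some repack step rewrites the archive.
with zipfile.ZipFile("doc.zip", "a") as z:
    z.writestr("content.xml", "<doc>version 2</doc>")

with zipfile.ZipFile("doc.zip") as z:
    latest = z.read("content.xml").decode()

print(latest)  # <doc>version 2</doc>
```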

~~~
alkonaut
Handling a power outage during a save typically means writing to a new file
and then handling it later. With a big zip I just copy the entire file (hoping
that a file copy is reasonably fast) and then do in-place updates in the copy.
If all the in-place updates in the copied zip work, then replace the original
with the copy.

I haven't found a reasonable way to do this that doesn't require making a copy
of the whole original document. If sqlite did this elegantly, I'd switch in a
heartbeat.

------
kbenson
_The first line of the file is just:

<?xml version="1.0" encoding="UTF-8"?>

The second line of the file contains 211792 characters of impenetrable XML._

So, does no XML formatting tool exist to consume XML and pretty-print the
output? I can tell you, as someone that consumes gigabytes of compressed JSON
each day, having 500,000 plus characters of JSON on a _single_ line causes me
no problems. The equivalent in XML should also cause any developer or tool
wishing to consume it no problem, as this is a _program_ interchange format,
not a format meant to be read by a person (or why compress it anyway?). If
someone is really espousing readability in document formats, maybe we should
just replace opendocument with some variation of markdown?

Also, if your complaint is that XML is too obtuse because it's not formatted
well, I'm not sure how replacing it with a schema implemented in an RDBMS
format is much better, _especially_ when there are relatively few
implementations of code that can read that format. Sure, everyone has bindings
to SQLite, but those are _bindings to the same C library_. Almost every
non-toy programming language has had the DEFLATE algorithm implemented in it,
and implementing a _naive_ XML parser isn't hard (and I'm not sure anything
beyond a naive XML parser is required for the OpenDocument format).
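
Such tools certainly exist; even Python's stdlib will pretty-print a one-line
document (a sketch; like any reformatter it inserts whitespace, with the usual
caveats for whitespace-sensitive content):

```python
from xml.dom import minidom

one_line = "<doc><title>Hi</title><body><p>text</p></body></doc>"
# toprettyxml re-serializes with one element per line and indentation.
pretty = minidom.parseString(one_line).toprettyxml(indent="  ")
print(pretty)
```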

~~~
fusiongyro
The tool you're describing is `xmllint --format`.

~~~
oever
Not quite. `xmllint --format` can introduce unwanted spaces. Consider this
example.

<p><i>hello</i><b>!</b></p>

`xmllint --format` adds a space between the 'o' and the '!'.

~~~
fusiongyro
Is there a way to reformat XML without introducing space?

~~~
kccqzy
I think so. Technically you can put linebreaks and whitespace within XML tags.
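
A sketch of that idea: the line breaks live inside the tags (before the
closing `>`), so the character data between elements stays untouched.

```python
import xml.etree.ElementTree as ET

compact = "<p><i>hello</i><b>!</b></p>"
# Legal XML: whitespace may appear inside a start or end tag before '>'.
wrapped = "<p\n><i\n>hello</i\n><b\n>!</b\n></p\n>"

a = "".join(ET.fromstring(compact).itertext())
b = "".join(ET.fromstring(wrapped).itertext())
print(a == b == "hello!")  # True
```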

------
Const-me
Please don’t.

When people are sending or sharing a document, they don’t expect to leak edit
history of that document. And SQLite inherently leaks that edit history, in
database freed pages and unallocated space. You can VACUUM the database after
each save, but that’ll negate the performance benefits. Also VACUUM was
designed to save space, not sure it reliably cleans up unwanted data.

OTOH, these zipped XMLs are free from unwanted data, they (hopefully) only
contain the data user wants to save, i.e. the content of the last version of
the document.
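
For what it's worth, SQLite has a pragma aimed at exactly this concern:
`PRAGMA secure_delete` zero-fills content as it is freed, without a full
VACUUM. A sketch (whether it covers every side channel, e.g. WAL or journal
files, would need checking; the file name is made up):

```python
import sqlite3

con = sqlite3.connect("draft.db")
con.execute("PRAGMA secure_delete = ON")  # zero-fill content as it is freed
con.execute("CREATE TABLE revisions (body TEXT)")
con.execute("INSERT INTO revisions VALUES ('embarrassing draft text')")
con.commit()
con.execute("DELETE FROM revisions")
con.commit()
con.close()

# The deleted row's text is no longer recoverable from the raw file.
with open("draft.db", "rb") as f:
    raw = f.read()
print(b"embarrassing draft text" in raw)  # False
```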

~~~
chj
People should publish PDF version of a document. But if they send the working
copy around, then the history may be helpful for other collaborators. Think of
git. They can even merge the changes easily.

~~~
gumby
> People should publish PDF version of a document.

Oh please no! PDF is quite user-unfriendly (e.g. can't reflow, typically can't
cut/paste properly), designed to emulate an ancient medium that has benefits
and drawbacks different from screens.

And PDF, while executable, is essentially compiler output, with all the
destruction of semantics that implies.

~~~
pjmlp
I only have issues with ebooks, outside them it is just perfect.

Cut/paste only fails on me when whoever created the PDF had the brilliant
idea of using images instead of actual text, e.g. scans. That's something
using another format won't help with.

Maybe you should use another PDF reader?

------
lifthrasiir
[https://sqlite.org/whentouse.html](https://sqlite.org/whentouse.html)

 _SQLite does not compete with client/server databases. SQLite competes with
fopen()._

I personally do not agree with this argument, because some operations are
intrinsically better suited to text-based formats---for example with text-
based formats one can at least try to diff and merge multiple files (even
though the actual suitability depends more on the data structure and less on
the underlying container format), but I can see where the argument is coming
from.

------
SQLite
Related articles:

[https://www.sqlite.org/appfileformat.html](https://www.sqlite.org/appfileformat.html)

[https://www.sqlite.org/aff_short.html](https://www.sqlite.org/aff_short.html)

------
bambax
> _Third Improvement: Versioning_

Regardless of SQLite, why don't modern productivity tools include at least an
option to automatically version documents, ie store documents as a series of
diffs that one could browse up to the start?

Of course in many cases, edit history is better kept hidden, so this should
not be the default. But why doesn't it exist at all?

~~~
hoov
Difficult, but possible. We did this with Buzzword at Adobe. The hardest part
was the UX, honestly. Most knowledge workers need some hand holding, to say
nothing of teachers and students (we had success selling to high school
teachers).

------
zubairq
This article about storing files in Sqlite is spot on and it works. I did
exactly this and I have tested it with GBs of data and 1000s of files and it
works well. The table structure I use is:

CREATE TABLE IF NOT EXISTS files (id TEXT, name TEXT, contents BLOB);

You can see the source code for the JavaScript/Node.js web application
here:

[https://github.com/zubairq/gosharedata/blob/master/src/index...](https://github.com/zubairq/gosharedata/blob/master/src/index.js)
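
Reading and writing a table like that needs nothing beyond the stdlib driver;
a minimal Python sketch of the same schema (the database file name is made
up):

```python
import sqlite3

con = sqlite3.connect("files.db")
con.execute(
    "CREATE TABLE IF NOT EXISTS files (id TEXT, name TEXT, contents BLOB)")

# Store a file's bytes as a BLOB...
con.execute("INSERT INTO files VALUES (?, ?, ?)",
            ("1", "hello.txt", b"hello, world\n"))
con.commit()

# ...and fetch it back by name.
(blob,) = con.execute("SELECT contents FROM files WHERE name = ?",
                      ("hello.txt",)).fetchone()
print(blob)  # b'hello, world\n'
con.close()
```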

------
alkonaut
I'm maintaining an application that uses a similar (OPC) file format. It has
served us well and it beats the legacy way of having just a directory of files
(e.g. when you want to send a file to someone else it's ONE file) but it
doesn't scale very well when adding lots of data, and incremental update is
painful compared to just using a directory.

How does sqlite handle incremental update? Does it use separate file storage
for large table blobs? If the blobs are in line, how does it handle changing
one blob from 1Mb to 2Mb say? Presumably it just appends the new blob, updates
the pointer, and marks the original 1Mb area as unused and possible to clean
up? That leaves a 3Mb file, which can be vacuumed to 2Mb then?
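
Roughly, yes: blobs live in-line on database pages (spilling to chained
overflow pages), an update frees the old pages to an internal freelist for
reuse, and only VACUUM (or auto_vacuum) actually shrinks the file. A sketch of
the delete-then-vacuum behaviour (file and table names are made up):

```python
import os
import sqlite3

con = sqlite3.connect("blobs.db")
con.execute("CREATE TABLE assets (id INTEGER PRIMARY KEY, data BLOB)")
con.execute("INSERT INTO assets VALUES (1, ?)", (b"x" * 1_000_000,))
con.commit()

con.execute("DELETE FROM assets")
con.commit()
before = os.path.getsize("blobs.db")  # freed pages stay in the file

con.execute("VACUUM")                 # rebuild and repack the database
con.close()
after = os.path.getsize("blobs.db")
print(after < before)  # True
```

Replacing a 1 MB blob with a 2 MB one behaves the same way: the old pages go
on the freelist for reuse, and only a VACUUM gives the space back to the
filesystem.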

------
pagnol
> The use of a ZIP archive to encapsulate XML files plus resources is an
> elegant approach to an application file format. It is clearly superior to a
> custom binary file format.

Why would it be superior? The biggest advantage I can think of is code reuse,
but what else?

~~~
BinaryIdiot
Interoperability? If it's in a zip and in a human readable format even without
the necessary application you can still dig into data. But it also helps when
writing a new application as it only has to understand zip and xml, something
virtually everything can handle.

As much as I hate xml this seems like a decent use for it.

------
mwexler
Look, I love SQLite as much as the next person, but as a database, it's got a
lot of problems. From the lack of a formal time-date storage class to the
"dynamic typing", if we are going to go binary and complex, is SQLite the best
place to plan for future stability and ease of use?

The first question is "Q1: Does an RDBMS help here?" and then the 2nd, "Q2:
Given Q1=Y, which might we use, what features are required, and how are they
implemented?". Parent jumped to SQLite as the answer (given the source) but I
am unconvinced on Q2, or even Q1.

------
stuaxo
I don't see why the zip file can't be read incrementally; that sounds like an
issue in their zip loader. Zip stores each file it contains separately.

------
geordee
I find the idea awesome. Once the spec for such a document format is created,
there would be a tool or two implementing it.

------
hokkos
Interesting, but where should the database normalization end? Should every
text element with a single style be a separate entry with an associated style
using a foreign key? Should all the XML disappear and be coded around database
concepts?

------
cjsuk
If it used SQLite it’d be like serialised COM again (xls,doc). Kill me.

Zipped XML actually allows us to parse and manipulate it easily with a
strongly defined schema. SQLite can barely enforce any consistency.

As for XML it’s actually pretty marvellous. Most people don’t understand it
which is where the problem is.

~~~
simonh
I'm sorry, where in the article did you read the suggestion that the format
should stop using XML? I can't find any such suggestion. Looking at the other
comments here it looks like the author must have recommended ditching XML
completely, but the actual article doesn't seem to bear that out. Has it been
radically altered since being posted to HN? The disconnect between the pro-XML
arguments here and the actual article's assumption of continuing to use XML is
really quite jarring.

The nearest I can find is where it suggests breaking up the XML into smaller
XML documents, but that's a whole heck of a long way away from being the same
thing.

~~~
cjsuk
I'm not even talking about the data model. The manifest would be replaced, all
resources wouldn't have usable URIs, you couldn't deserialise a stream using
SAX easily, and DTD and schema references that ship with it would be invalidated.

Those are all XML too. It’s not just markup.

It turns a bundle of files with strict relations into an unworkable stateful
blob of loose links. Welcome to the world of corruption at the application
layer.

~~~
simonh
I would hardly call a system such as the one proposed corruption.

Ok, so maybe this proposal might make the format a little harder to access
programmatically, but it's a pretty minor effect. You just need to extract the
data from a SQLite DB instead of a zip file. You could even store a manifest
in the DB as well.

I think the thing I object to about all the criticism here is that it's all
about making things easy and simple and pure from a development point of view.
The article proposes concrete end user benefits from a scheme like this, which
to me seem pretty worthwhile. Surely the focus should be on the user
experience first and then look at how to implement that?

~~~
cjsuk
XML is pure on its own. Data, validation, transformation, schema, even
programming.

From a development perspective you’re adding more steps.

------
mark-r
I thought there was an application that did something like this, but can't
remember it anymore. It's nice to see a writeup that steps you incrementally
through the benefits of this approach.

~~~
dchest
Acorn (awesome image editor for Mac) uses SQLite as .acorn image format.

Schema:

    
    
        sqlite> .schema
        CREATE TABLE image_attributes ( name text, value blob);
        CREATE TABLE layers (id text, parent_id text, sequence integer,
        uti text, name text, data blob);
        CREATE TABLE layer_attributes ( id text, name text, value blob);
    
    

My app, Mémoires (diary for Mac), and some other Mac apps, use SQLite via
Apple's Core Data as document format.

------
gungoman
Penetration testers will have a field day with this if it happens.

~~~
jnwatson
Why? SQLite has been put in safety critical systems and has the best test plan
I've ever seen for an open source project.

~~~
rkangel
It has the best test plan I've seen for _any_ pure software product, and I'm
including both medical and safety critical stuff.

------
thoastbrot
The typo in the very first sentence (where/were) puts me off :(

------
jbb67
Yes please. Anything is better than XML

------
upofadown
... and then sqlite changes the format of the files on the disk and you are
entirely boned...

~~~
TheAceOfHearts
If you read through their docs [0] you'd see they're committed to maintaining
backwards compatibility with all SQLite 3 releases. There haven't been
breaking changes since 2004.

In addition, the library is incredibly easy to embed with any project as it's
largely a self-contained system [1]. If SQLite 4 were released with breaking
changes, you could happily continue using SQLite 3.

[0] [https://sqlite.org/onefile.html](https://sqlite.org/onefile.html)

[1]
[https://sqlite.org/selfcontained.html](https://sqlite.org/selfcontained.html)

~~~
upofadown
This is a document format we are talking about here that should be good for
the ages, not just 13 years. In practice you would have to define your
document format in terms of a particular version of SQLite. So you would have
to unambiguously document that version in such detail that you cover
everything that anyone would ever want to be able to do. Once that version of
SQLite becomes unmaintained, things are broken.

Someone would have to maintain that version of what would now be the ODF SQL
disk format forever. Much of what would have to be maintained would be
pointless complexity not required for the representation of documents.

~~~
SQLite
You seem to be operating under the assumption that the SQLite database file
format is an ad-hoc, dodgy affair that the maintainers might change on a whim.
I'd like to try to dispel that notion.

The SQLite database file format has been unchanged (except by adding
extensions through well-defined mechanisms) since 2004. And we developers have
pledged to continue that 100% compatibility through at least the year 2050.
The file format is cross-platform and byte-order independent. We have an
extensive test suite to help us guarantee that each new version of SQLite is
completely file compatible with all those that came before. The file format is
documented
([https://sqlite.org/fileformat.html](https://sqlite.org/fileformat.html)) to
sufficient detail that the spec alone can and has been used to create a
compatible reader/writer. (I know this because the authors of the compatible
reader/writer software pointed out deficiencies in the spec which I
subsequently corrected.) The file format is relatively simple. A reader for
SQLite could be written from the spec with no more effort than it would take
to implement a ZIP archive reader with its decompressor from spec.

There are approximately 1 trillion (10^12) SQLite database files in active use
today. Even without the promises from developers, the huge volume of legacy
database files suggests that the SQLite database file format will be around
for a long time yet.

The latest version of SQLite will happily read and write any SQLite3 database
written since the file format was designed in 2004. Ancient legacy versions of
the SQLite library will read and write databases created yesterday, as long as
those databases do not use any of the extensions that were added later.

------
Jizzle
The article notes that "this is only a thought experiment". And I think it
should stay that way. It's very interesting to think about, yet SQLite offers
nothing but problems for developers and content creators. While maybe a bit
clunky, a zip file manages to be a close best choice. When considering that a
document may be hundreds of megabytes with thousands of assets OR a single
page of text, zip offers enough simplicity and robustness, while being highly
accessible.

~~~
majewsky
And SQLite doesn't... why again?

