What If OpenDocument Used SQLite? (sqlite.org)
305 points by ejstronge 10 months ago | 190 comments



I'd like to take a moment and look at this "what if" from the point of view of data archiving.

Zip files use a well-documented, easily reproducible and widely implemented deflate algorithm. It's reasonable to assume that someone could, 100 years from now, unzip an OpenDocument file. The contents are XML files - basically text - which can be read with any simple text editor and parsed. The XML is also at least somewhat self-documenting regarding the contained document.

SQLite databases have a binary format which is known to change periodically. An attempt to read a SQLite file in 100 years would have to know the exact version of the file, and be able to reach back through history to find that version of SQLite (or a compatible version), and find a way to compile it for whatever CPU architecture exists in 100 years.

This could be mitigated by including the appropriately versioned documentation of the binary format with the database itself - however to the best of my knowledge, full documentation of the binary format (outside the code) does not exist. The end product - a relational database - would also be less self-documenting than the XML version.

IMO, it would be much better to keep the inefficient zipped XML files.

[EDIT]: The binary file format: https://www.sqlite.org/fileformat.html

Learn something new every day.


> SQLite databases have a binary format which is known to change periodically. An attempt to read a SQLite file in 100 years would have to know the exact version of the file, and be able to reach back through history to find that version of SQLite (or a compatible version), and find a way to compile it for whatever CPU architecture exists in 100 years.

I contend this is not as big of a problem as you make it sound. First of all, the sqlite file format states its version in a string at the beginning of the file, so determining the version of a file is trivial even without any special tools. The sqlite source code is extremely widely distributed. Between linux distros and the many projects which, for better or worse, embed sqlite in their source trees, there are easily over a million copies of sqlite3.c floating around. The chances of all of these copies being lost, even in a catastrophic event, are negligible. Even if we did somehow lose every copy of sqlite3.c, reverse engineering a binary format is far from impossible. In fact the old binary MS Office file formats were reverse engineered by multiple groups.

The chances of us losing the ability to compile C code are even more negligible. If human civilization has fallen that far, we will probably be too busy killing each other with sharp sticks over drinkable water to care about reading some hundred year old documents.


Not that I disagree that sqlite is fairly ubiquitous today, but remember that not that long ago, dBase II, dbm, and Berkeley DB were fairly ubiquitous.

I give you "the relevant XKCD":

https://xkcd.com/1909/

Somewhat more practically, a relevant D-Lib Magazine [0] quote: "not all file formats are suitable for long term preservation, even if they have an open specification. Some lossy and compressed file formats pose a higher risk of total loss if even a single bit is lost."

And from the Library of Congress [1], in the context of format preferences for text with structural markup: "XML or SGML using standard or well-known DTD or schema appropriate to a particular textual genre."

[0] http://www.dlib.org/dlib/july16/houghton/07houghton.html [1] http://www.digitalpreservation.gov/series/challenge/


> "not all file formats are suitable for long term preservation, even if they have an open specification. Some lossy and compressed file formats pose a higher risk of total loss if even a single bit is lost."

Wouldn't this issue apply more to OpenDocument, which is compressed (into a ZIP archive), than SQLite, which (at least by default) is not?

But then, I question the appropriateness of the advice. If you're serious about archiving, you should be using error-correcting codes in some form, so that the archived data will remain recoverable bit-for-bit even with a large number of bit errors in the underlying medium. To be honest, I'm not that familiar with long-term archiving practices, but if you have some kind of RAID setup, that should give you both error correction (for bit errors) and drive redundancy (for loss of entire drives). Alternatively, you could use dedicated ECC tools like par2.

True, most data that gets preserved will probably be preserved by chance, by people who are not serious about archiving, and may not take sufficient steps to prevent errors. But they're also not going to choose a format for optimal archiving either, so you're kind of stuck with the fact that many modern file formats have built-in compression and/or checksums, and thus don't hold up well when corrupted. We could keep the issue in mind when designing new formats, but is resilience to corruption really worth the additional storage cost of leaving data uncompressed? Or perhaps we could design formats to have built-in ECC instead of just checksums, but that would also waste space...


For those who are worried about the length of SQLite support, SQLite is being used in the Airbus A350

https://twitter.com/copiousfreetime/status/67583454330408140...

This means that it has long term support for the life of the airframe, which for passenger airliners can easily be 50+ years (see https://en.wikipedia.org/wiki/Boeing_737 which is still flying after 50 years).


Companies like that often just buy up a stock of the hardware they need and stuff it in a warehouse, at least when it's in danger of being discontinued. They don't need to maintain the source code at all. They just need to be able to keep installing the binary.


Another solution to this problem would just be to bundle a txt description of the format with each file. So e.g. have the first n bytes be a plain text description of the sqlite format, and the rest be the DB in sqlite format.


May as well just bundle SQLite while you're at it.


Hmm? I mean sure, but the point is that adding a few kb of text to the file is a more lightweight solution. Less than the overhead of using xml, probably.


Lower overhead today, but in the original context, it's still a greater overhead if/when SQLite is no longer active. At that point, you have to write a binary parser just to get back the RDBMS, at which point you have to reverse-engineer how the tables were built in order to re-create the document.

An interesting read on the topic and challenges of digital preservation: http://www.dl.slis.tsukuba.ac.jp/DLjournal/No_20/1-hedstrom/...


The SQLite binary format is simple and well-documented, so writing a binary parser would be a pretty straightforward exercise.
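
For what it's worth, even pulling the basics out of the database header by hand is only a few lines of Python. A minimal sketch (offsets taken from the published file-format page; the file name is just a placeholder):

  import struct

  # Read a few fields straight out of the 100-byte database header.
  # Offsets per https://www.sqlite.org/fileformat.html
  with open("document.db", "rb") as f:
      header = f.read(100)
  magic = header[:16]                               # b"SQLite format 3\x00"
  page_size, = struct.unpack(">H", header[16:18])   # big-endian; 1 means 65536
  n_pages,   = struct.unpack(">I", header[28:32])   # database size in pages
  print(magic, page_size, n_pages)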

>at which point you have to reverse-engineer how the tables were built in order to re-create the document.

I don't understand what you mean by this. All you have to do is examine the contents of the various tables. It's no different to examining the contents of an XML file. In either case, how easy it is to reconstruct the document will depend on how well designed the format is.


> It's reasonable to assume that someone could, 100 years from now, unzip an OpenDocument file.

LOL said the thousands of people with old dusty Zip disks sitting in boxes in their attics. That was only 25 or so years ago.


For me, the biggest reason not to do this sort of thing is that SQLite still often refuses to work over network filesystems.

I appreciate that this is because the authors are being careful about possible corruption when locking isn't working properly, but if they really mean the « SQLite competes with fopen() » line they need to find a way to get round this, because fopen() doesn't do that.


Amazingly, the Jet engine in Microsoft Access (a file-based database a lot like SQLite) works very well over Microsoft's network filesystem, but it helps that file locks in Windows are much more solid than file locks in Linux. (The tradeoff is that Windows machines go out to lunch every so often)

This document

http://www.adobe.com/content/dam/Adobe/en/devnet/xmp/pdfs/XM...

is a great overview of a number of desktop and media file formats (particularly in the context of how to find holes in them in which you can stick metadata.) Read between the lines and you can understand some of the tradeoffs; the database-based format shows up once or twice and, from the viewpoint of XMP, it is the one in which it is hardest for a foreign program to find an embedded media packet. On the other hand, SQLite is open source so you can say "just open it with SQLite".


Modulo the fact that friends don't let friends use Jet in 2017. ;)

Microsoft hasn't supported Jet in forever (a decade or more), and doesn't recommend it. The number of corrupted MDB files (and stranger, corrupted LDB lock files) I've had to work with in the last couple of years (in trying to migrate Jet-using VB6 apps to something 2017 has to offer) has shown me that Jet/MDB is a database format that should be properly thanked for its service over the years, then gracefully old yellered.



Thanks, that's good to know.

So a word-processor-like application might try implementing its own simple global lock (eg with a lock file) and then tell SQLite to use nolock=1.
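
A minimal sketch of that arrangement, assuming the application serializes access entirely by itself (the lock-file convention here is invented for illustration; nolock=1 is a real SQLite URI parameter):

  import os, sqlite3

  # Crude application-level global lock plus SQLite's nolock=1 URI parameter.
  lock = "document.odp.lock"
  fd = os.open(lock, os.O_CREAT | os.O_EXCL)    # fails if another writer holds it
  try:
      con = sqlite3.connect("file:document.odp?nolock=1", uri=True)
      # ... read and write the document ...
      con.close()
  finally:
      os.close(fd)
      os.remove(lock)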

I'm not sure I'd feel very comfortable doing that but it seems OK in principle.

I wish Subversion would offer an option to do something of the sort for its working copies, now it has an SQLite database in there.


Wouldn't fopen and friends also have issues when locking isn't working properly?


They might well lead to file corruption if two people try to save the same file at once.

But that tends not to matter in practice for applications like word processors (which often want to implement their own system to stop two people opening the same file in editable mode at the same time, not just stop them saving at the same time).

Or the fopen-using application can just do things the simple way and write out a new temporary file to replace the old one with.
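
E.g. a sketch of that simple write-temp-then-replace dance in Python (names are placeholders):

  import os, tempfile

  # Write the whole document to a temp file in the same directory, then swap it
  # into place; a crash before os.replace() leaves the old file untouched.
  def save(path, data):
      fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
      with os.fdopen(fd, "wb") as f:
          f.write(data)
          f.flush()
          os.fsync(f.fileno())
      os.replace(tmp, path)                     # atomic on POSIX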


The same way an sqlite-using app could write a new database file and replace the old one. What does this argument support in this subthread?


Then you lose many of the advantages being touted in the « What If OpenDocument Used SQLite? » page that we're discussing. Consider the following paragraph:

« Hence, "File/Save" takes longer than it ought, especially on older hardware. Newer machines are faster, but it is still bothersome that changing a single character in a 50 megabyte presentation causes one to burn through 50 megabytes of the finite write life on the SSD. »


Indeed. But I’m questioning an argument, not a method.


Oh.

As it happens, I wasn't making an argument. I was answering nikbackm's question.


That would be coming full circle, and going back to the bad old days of binary .doc and .xls files, which used an efficient data store for fast incremental updates:

https://www.joelonsoftware.com/2008/02/19/why-are-the-micros...

OpenDocument and its competitor (OOXML) both use zip/XML for portability reasons. Yes, they're less efficient. But they're also completely independent of any specific software implementation [1].

[1] Mostly - OOXML has some parts that depend on MS Office's implementation. But ODF is completely described by the spec.


I don't see how your comment is responsive to the excellent article.

doc and xls were proprietary binary formats.

sqlite files couldn't be more open - the software to interact with them is in the public domain and ported everywhere.

And the article itself doesn't even suggest using sqlite as a binary store. Rather, it suggests using the capabilities of sqlite to augment text-based data.


Let's say you want to extract an image file from an .odt file and you haven't got Writer installed. You can do that by extracting the file from the .odt using unzip or similar.

This article champions the idea that you should store the files as compressed binary BLOBs in SQLite. What does that mean you need to do? Connect to the file with SQLite, fetch the BLOB, hexdump it since it will return as a string, decompress it, and then save it to disk. Or save it to disk and then decompress it. You'd have to do something like this: https://stackoverflow.com/questions/15448373/how-to-dump-a-f...
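
In Python, that dance looks roughly like this (assuming a hypothetical files(name, data) table holding zlib-compressed BLOBs; none of these names come from the article):

  import sqlite3, zlib

  # Hypothetical schema: files(name, data) with zlib-compressed BLOBs.
  con = sqlite3.connect("presentation.odp")
  (blob,) = con.execute(
      "SELECT data FROM files WHERE name = ?", ("Pictures/logo.png",)
  ).fetchone()
  with open("logo.png", "wb") as out:
      out.write(zlib.decompress(blob))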

That is what Open Document is supposed to end.

This is not an excellent article. It's complaining that the Open Document Formats are bad because they don't use an optimized format. They're not supposed to be optimized. Documents aren't large enough to cause the types of problems that databases are intended to fix. They're supposed to be portable and easy to work with. They make it possible to access your data using the most basic computing tools: a ZIP archive tool and a text editor. That's the whole point.


If you have an image stored in an SQLite database file, you can extract it using the "sqlite3" command-line tool as described at https://www.sqlite.org/cli.html#file_i_o_functions

Or, you could do it several dozen other ways. For example, maybe you have an archiver tool for SQLite that works like "zip" - https://www.sqlite.org/sqlar/doc/trunk/README.md

About this much we agree: The objective is to make the content of the document more accessible and easier to use. The article intended to show that using an SQLite database rather than a pile-of-files in a ZIP archive achieves that goal. If my presentation were stored in an SQLite database, I might (depending on how the schema was structured of course) be able to ask interesting questions of the presentation, such as:

  *  How many pages are in this presentation?

  *  Which pages contain the word "discombobulated"?

  *  How many images are in the presentation?

  *  Does this presentation use a prepackaged background?

  *  Does the presentation begin with my company's standard "start" slide?

Can you write (say) a Python program that will answer any of the above about an OpenOffice .odp file? All I'm trying to say is that if OpenOffice .odp files were SQLite databases with a sensible schema, then probably you could answer any of the questions above with perhaps 5 lines of code or less. Or just a shell script that uses the "sqlite3" command-line tool. As it stands now, if you want low-level information about an OpenOffice presentation, you have to manually open the document on your desktop and click around - it cannot be (easily) automated. An ".odp" file is essentially an opaque BLOB that is not useful to me. Only the OpenOffice program (or its various forks) can access it.

I have 104 historical presentations in a folder and I want to find some slide I wrote years ago. Right now, I have to laboriously open each presentation and manually search for the slide I want. If the format were SQLite, I could perhaps do the same with a query, which if run from a shell script, would quickly search all 104 presentations for me.
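
For example, a sketch of that bulk search in Python, assuming a hypothetical slides(page_num, body) table (the schema is invented purely for illustration):

  import glob, sqlite3

  # Search every presentation in the folder for a word, assuming each .odp were
  # an SQLite database with a slides(page_num, body) table.
  for path in glob.glob("presentations/*.odp"):
      con = sqlite3.connect(path)
      hits = con.execute(
          "SELECT page_num FROM slides WHERE body LIKE ?", ("%discombobulated%",)
      ).fetchall()
      if hits:
          print(path, [n for (n,) in hits])
      con.close()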

So, Yes, the whole point is to make the content more easily accessible and usable. Relative to a pile-of-files in a ZIP archive, SQLite does exactly that.


> Can you write (say) a Python program that will answer any of the above about an OpenOffice .odp file?

Using the libraries or packages available from the OpenDocument home page? (http://opendocumentformat.org/developers/) Probably, yes. Four of your five questions are just: "Is there an API?" The answer is: Yes, yes there is. Did you try searching for an Open Document API? For the image file one, however, all you really need is ZipFile.infolist() because ZIP files already have an API.

> All I'm trying to say is that if OpenOffice .odp files were SQLite databases with a sensible schema, then probably you could answer any of the questions above with perhaps 5 lines line of code or less.

Most of the questions involve an XQuery against content.xml. You could iterate through the files, extract content.xml, and apply the XQuery. No, it's not easy to write the correct XQuery because XQuery sucks and the schema is complex, but it can definitely be done.
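
For instance, a rough sketch of that iterate-and-extract approach in Python, using ElementTree instead of XQuery (the namespace is ODF's text namespace; paths and the search term are placeholders, and text nested in spans is ignored for brevity):

  import glob, zipfile
  import xml.etree.ElementTree as ET

  TEXT_NS = "{urn:oasis:names:tc:opendocument:xmlns:text:1.0}"

  def contains_word(path, word):
      # Pull content.xml out of the package and scan its paragraph text nodes.
      with zipfile.ZipFile(path) as zf:
          root = ET.fromstring(zf.read("content.xml"))
      return any(p.text and word in p.text for p in root.iter(TEXT_NS + "p"))

  matches = [p for p in glob.glob("*.odp") if contains_word(p, "discombobulated")]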

I don't know why you think it's going to be magically simplified by changing the storage method. You'd either have compressed XML BLOBs like the article's first suggestion -- which puts you right back where you already are -- or you'll have dozens of tables with a ton of metadata and document structure to dig through. Either way you will need a schema reference to understand what's going on. An RDBMS isn't going to make the content simpler. The schemas are complex because the documents are potentially very complex.


> I don't know why you think it's going to be magically simplified by changing the storage method.

Because ZIP is a key/value store and SQLite is relational.

(Later:) Here is a slide comparing SQLite to ZIP from a talk I'm scheduled to give on Monday: https://sqlite.org/tmp/sqlite-v-zip.jpg

Both SQLite and ZIP will store files and both are well-established open formats with a trillion instances in the wild.

SQLite does everything ZIP does but also gives you transactions, a rich query language, a schema, and the ability to store small (1-8 byte) objects and to translate objects into the appropriate byte order for the reader.


> It's complaining that the Open Document Formats are bad because they don't use an optimized format.

That's the opposite of what the article states. From the introduction of the article:

> Note that this is only a thought experiment. We are not suggesting that OpenDocument be changed. Nor is this article a criticism of the current OpenDocument design.


Yeah, and if you finish the quote, the next line is:

> The point of this essay is to suggest ways to improve future file format designs.

That makes it sound like, "No offense, but < extremely rude comment >." Just because they say they're not trying to be critical doesn't mean they're not actually being critical.


> What does that mean you need to do? Connect to the file with SQLite, fetch the BLOB, hexdump it since it will return as a string, decompress it, and then save it to disk.

Or you could just use a SQLite GUI (eg sqlitebrowser.org). Select the field/BLOB you're interested in, then Export (saves it to a file). Done. :)


The current Sqlite (post-3.0, 2004) file format:

https://www.sqlite.org/fileformat.html

File Format Changes in SQLite:

https://sqlite.org/formatchng.html

"Since 2004, there have been enhancements to SQLite such that newer database files are unreadable by older versions of the SQLite library. But the most recent versions of the SQLite library should be able to read and write any older SQLite database file without any problems."

Me, I'm just happy all my old doccies are in TeX.


> sqlite files couldn't be more open

Yes, actually they could. They could be readable without any software at all. Like XML.

All of the improvements suggested in the article could be implemented by extending the XML format to allow including other XML files by reference.


This is a common recurring theme that I find strange. Why do people believe that if something can be browsed with less/Notepad, it is somehow more "readable" than a binary representation? I've seen and worked with tons of XML, and most of it was utterly useless without accompanying documentation, just as a binary format would be. The fact that I could "see" the XML on my screen did not make the job easier.

The problem with formats "readable without any software" is that you usually end up with everything represented as strings, with no knowledge of actual types and constraints. So, you still need that documentation, even though the format is "readable without any software".


It's not about human readability, it's about separating implementation from representation, so "readable without any software" is a good test for whether it's "readable with any software".

If your document editor represents documents as a stack of edit-events (for undo/redo or whatever) and you document every nook and cranny of this "document format", standardize it as "binary event-sourced document format, .esd" then your binary spec will be very close to a full description of your document editor. Everyone who wants to use your format now has to reimplement your editor.

Anyone can leak implementation details into any format, and it is to some extent unavoidable, but if you can make it readable (or semi-readable), some thought has at least been put into the representation.


>It's not about human readability, it's about separating implementation from representation

No, it's not. An sqlite-based document can have as many representations as an XML-based document -- nothing in its implementation prevents this.

>so "readable without any software" is a good test for wether it's "readable with any sfotware"

The OS, filesystem, shell, editor etc needed to read the XML file are still software.

As is the XML parser needed to do anything useful with it in the data realm.

>If your document editor represents documents as a stack of edit-events (for undo/redo or whatever) and you document every nook and cranny of this "document format", standardize it as "binary event-sourced document format, .esd" then your binary spec will be very close to a full description of your document editor. Everyone who wants to use your format now has to reimplement your editor.

You do understand that the XML file can also just be including a stack of edit-events, right?

This is totally orthogonal to the underlying storage (XML or sqlite or whatever).


> so "readable without any software" is a good test for wether it's "readable with any sfotware"

What an extraordinarily stupid notion. Your filesystem is not readable without a filesystem driver for the specific binary format used to represent it, so you cannot even read your text file without having something in your stack that understands a more structured data organization scheme.


> Your filesystem is not readable without a filesystem driver for the specific binary format used to represent it, so you cannot even read your text file without having something in your stack that understands a more structured data organization scheme.

This argument doesn't hold water, since file content doesn't depend on which filesystem it's on.

An archiver may choose to store data in some very particular way, like some specific tape driver. They're free to do that without corrupting the content of any files, since the files don't depend on the filesystem.

On the other hand, if an archiver chose to store, say, unzipped versions of their opendocument files, then they've corrupted the data: opendocument files are zips, unzipped data is not opendocument. The format does depend on the file content.


Yes, but that comes with the operating system that comes with the computer I bought and use in the house, which comes with electricity from the electrical grid in the city that lies in an aerated valley (luckily!) on the third planet from the sun. I should have added a disclaimer I guess.


Just to point out... SQLite comes built in to pretty much every OS these days too. :)


> Why do people believe that if something can be browsed with less/Notepad, it is somehow more "readable" than a binary representation? I've seen and worked with tons of XML, and most of it was utterly useless without accompanying documentation, just as a binary format would be.

Binary formats start out mostly unreadable by humans [1]. XML and other textual formats at least have the possibility of being made to where they can be read by laypersons.

[1] I'll note that after enough immersion, I've seen people read binary core dumps, etc. but that takes much more time and practice than with XML.


Some formats don't need to be read as plaintext by the average person.

Would the average person edit a .svg by hand? No, he'd use Adobe Illustrator or anything else.

Would he edit a .docx file by treating it as a zip archive and edit content.xml? No, he'd use MS Word/LibreOffice Writer...

Suppose Sqlite used XML/JSON instead of binary files, would you modify them with notepad? No, you'd probably use a SQLite browser. (Or an application that is more tailored to the domain.)


> Some formats don't need to be read as plaintext by the average person.

Fair point, and in fact, I'd go further - most of the time, most people do not need to directly edit or view most files in most formats. Even if you took 'most' to mean 99% or higher, I'd be comfortable with that statement.

Where I differ is that I think the essence of the argument is really whether or not a binary file format offers enough value to be worth entirely eliminating the direct edit/view possibility for everybody all the time. Even if it's not a common case, it can be game-changingly useful when you need it.

Just to illustrate, you give three examples, and I have counter examples for each:

> Would the average person edit a .svg by hand? No, he'd use Adobe Illustrator or anything else.

I've modified bounding boxes by directly editing SVG, as well as read SVG directly to analyze some plots a library was generating for me.

> Would he edit a .docx file by treating it as a zip archive and edit content.xml? No, he'd use MS Word/LibreOffice Writer...

I've done this recently to extract out embedded documents on OSX (where the native versions of Office do not directly support this.)

> Suppose Sqlite used XML/JSON instead of binary files, would you modify them with notepad?

HSQLDB can use text based SQL scripts to store data, and I've modified and edited them directly for several reasons.


Well, yes, this applies to reasonably well-designed, expressive XML (which OpenDocument is). You can definitely write XML that is perfectly incomprehensible without extensive references.


XML is potentially readable, but it's always possible to make the semantics so twisted it's as bad as a binary encoding. A binary encoding is obfuscated by nature.

I despise xml quite a bit but I have to admit, it's still potentially better.


It's not readable without any software anyway, you need a text editor. If my editor has a binary file viewer, does that count?

What if my text editor has an sqlite file viewer?


> They could be readable without any software at all.

They aren't, though. Or can you read ZIP archives without an unpacker?


To be fair, you can't read any files without software, you need a text editor even for plaintext.

However, all software is not equal. Any OS comes with a text editor and a ZIP utility by default now. The same is not true for SQLite's format.


But it could easily become true if SQLite was a common document format. Back in the 90's and early 2000's, OSes didn't come with a ZIP utility either.


Which OS doesn't come with SQLite?

• Win 10 apparently does (I don't use it, so can't personally verify). Win7/8 don't.

• Every version of OSX for many years does

• Fairly sure the main Linux distros do

• FreeBSD does, not sure about NetBSD/DragonFlyBSD though.

Checking a local OpenBSD 6.x box just now, sqlite/sqlite3 aren't in the default user path, so maybe that's one that doesn't.


I don't think that's true. Does Ubuntu come with zip by default now?

Normally tar balls are used and zip has to be installed afterwards:

apt install zip

I read the article as advocating for more open binary formats. That it is useful to have binary formats is logical. That someone would like to have open ones is I think a nice thought.


"Any OS comes with a text editor and a ZIP utility by default now."

The same thing is becoming true of SQLite (and pretty much is true if you exclude older versions of Windows). We're talking about one of the two (estimated) most-widely-deployed software projects in the world today (http://sqlite.org/mostdeployed.html). It's FOSS, ubiquitous, easily embeddable, works remarkably well, and has bindings for pretty much every programming language under the sun. As far as a general-purpose file format goes, a SQLite database is thus pretty close to ideal.


> The same is not true for SQLite's format.

That's not entirely true. Anything that comes with Python installed has SQLite bundled/compiled in. It's just the library, not the cli, but still.


Arguably, any OS should come with SQLite too. It's pretty close to being as standard as text (which can be complicated too sometimes).


Actually, iirc the OpenDocument format also includes an uncompressed, human readable, pure XML format.


> Actually, iirc the OpenDocument format also includes an uncompressed, human readable, pure XML format.

For some values of "readable" and "human".


It's the 'flat' versions (fods etc.). Diff can be used on them.


Writing a zip unpacker — including deflate decompressor — can be wrapped up in half a day by an average developer.


It sounds like a good nuclear-war-proof solution. Maybe it is better to encode documents in Morse, because an average radio operator could transmit them via key in half a week.


> It sounds like good nuclear war proof solution.

It fits the requirements of reading the file without (external) software since you can trivially reimplement it, and is also a much lower threshold than "you need to reimplement sqlite", which is why pretty much every system out there comes bundled with a zipfile unpacker.


First of all, notepad is software. And second, xml is totally useless without software.

But you do have a point that in a rare case, you could get your hands on the raw data which makes text easier.

But everything is a tradeoff, and the question really is if the tradeoff in this case is worth it, since sqlite is so open and portable.


But you could also open the sqlite file format using an sqlite client.

The only difference is that xml is also human "readable". But most xml isn't really readable without a tool anyway.


Yes, and that only difference is important, and it makes XML more open than SQLite. Perhaps not enough to matter (depending on your needs), but it is more open.

You may not want to look at XML without a tool, but you could. Or, most likely, you could build a tool to look at the XML. This is a crucial design consideration of XML - even if all XML tooling, open or not, is lost forever, you can still read the XML. Not so with SQLite.

Especially for a document format, this point seems important. That said, OpenDocument is zipped, which, while a simpler and even more ubiquitous format, is still a binary format that requires a library to read, so that detracts somewhat from the point. It's difficult to imagine a future where zip libraries survive but SQLite libraries don't.


Strictly speaking you can’t read XML without software either. Even a simple text editor is software of course, but you’ll also find XML files in arbitrary character encodings, and you absolutely do need non-trivial software to parse and display “plain” text in UTF-8.


You don't need a text editor, you just need to be able to read a sequence of bytes. XML anchors at almost the lowest possible point: you have the stream of characters. It's a pointless exercise to consider a format except on the premise that you have access to data in some way, and a sequence of bytes is the simplest possible abstraction of data, a sequence of characters the second simplest.

ASCII encoded text is simpler than UTF-8, but there is content that can't be represented in ASCII. Zip comes a bit up the food chain from there, and SQLite comes quite a bit further.

XML is about being meaningful without tooling at the lowest possible level. Any added required tooling or documentation by definition makes the data less 'open'.


In the event of a catastrophic loss of human knowledge on a scale that makes it difficult or impossible to recreate a ZIP or SQLite driver, I would presume that maintaining or replacing the hardware on which XML/text files are stored would be an even bigger problem. Catastrophic collapse of civilization IMO would pretty much doom all our digital knowledge. Only books can survive such an event.


You don't need a catastrophic loss of human knowledge for this to be a problem. Running a piece of software written in the 80s, even with the source code available, can be quite a challenge. Support for some features of the current SQLite format might be deprecated, then removed, over the next many versions. Or perhaps a subtle bug is introduced that renders certain ancient versions unreadable (https://news.ycombinator.com/item?id=15531489) and sticks around for long enough that you end up in a situation where you need to invest very significantly in software archaeology.


> Support for some features of the current SQLite format might be deprecated, then removed, over the next many versions. Or perhaps a subtle bug is introduced that renders certain ancient versions unreadable (https://news.ycombinator.com/item?id=15531489) and sticks around for long enough that you end up in a situation where you need to invest very significantly in software archaeology.

At this point, you've side-tracked into an argument that anyone could make against any software.


Bingo. Which is why it's a quality not to have dependencies on any specific software when designing a format for longevity. XML has an absolute minimum of such dependencies, Open Document has only zip added to the mix, which is a format with tons of different and independent implementations, plenty of documentation for how to implement it - on top of being a fairly simple format to begin with.

SQLite, for all its significant qualities, is none of that, and thus you shouldn't use it if designing for file format longevity.


>Yes, actually they could. They could be readable without any software at all. Like XML.

You can't read XML without software.

Perhaps you take editors and cat and less as a given.

In which case, nothing prevents you from taking sqlite the binary (or 1000 serializers that can be produced very easily) as a given too.


> extending the XML format

XML is only trivially "readable without any software at all". The same is true of any binary format - the issue is that it can be read, but not easily processed - which is also the case for XML (you need to integrate an XML parser for that).

And if you are including XML libs, why not SQlite libs?


The same is not true for any binary file. A hexdump of a plain text file is not as readable as opening the same plain text file with any basic text editor such as any notepad clone. That's the whole point: no one needs to install any software, let alone a specialized client, to read any human-readable format.

More to the point, no one needs to install a full blown database engine to read a plain text document.

You may argue that specialized tools make the job easier, but you also need to acknowledge that requiring specialized tools just to open a text file is silly.


A printout of XML contains the data in a structured and understandable format.

A printout of a binary dump does not provide any clues as to how the data was structured.


This is useful if you only need to read the data. If you need to process it in large amounts, it's not so useful. And what if you need to traverse graph-like data in an XML format?

> a binary dump does not provide any clues as to how the data was structured

True, but if that binary format is SQlite, the structure is fully documented. You could work it out mentally just as you could work out any given XML schema.


XML is UTF-8

you already need a decoder

http://www.utf8-chartable.de/


doc and xls were even worse than proprietary binary formats: they were raw memory dumps of unserialized internal data structures. This meant implementing a doc opener was basically the tantalizing task of implementing Word-the-engine itself.


> I don't see how your comment is responsive to the excellent article. doc and xls were proprietary binary formats.

Indeed. He probably meant docx and xlsx.


No, the point was that the doc and xls file formats were in themselves mini-filesystems, much like an SQLite file, and could support many of the same operations as described in the article. It however made them insanely complex to interpret.

When the docx and xlsx formats were developed, they lost this capability.


Using SQLite as a container is different from using the whole of SQLite the embedded sql database. You can use the container by reimplementing a reader for the file format. Maybe it would make the parent feel more comfortable if the sqlite db was inside a nice cozy zip (or jar file).


I'm a big fan of SQLite, but this kind of marketing puts me off a bit. ZIP and XML are popular standards with dozens of implementations. The only thing that can read an SQLite DB is SQLite, AFAIK.

To be sure, SQLite is very well supported, runs on tons of platforms and is going to be around for a long time. But still, it is less archival quality, and more inherently complex (even though that complexity is very well encapsulated).


It would put me off too if they were advocating for replacing OpenDocument, but from the article:

> Finally, let us reiterate that this essay is a thought experiment. The OpenDocument format is well-established and already well-designed. Nobody really believes that OpenDocument should be changed to use SQLite as its container instead of ZIP. Nor is this article a criticism of OpenDocument for not choosing SQLite as its container since OpenDocument predates SQLite. Rather, the point of this article is to use OpenDocument as a concrete example of how SQLite can be used to build better application file formats for future projects.


Hardly a thought experiment, when it's on the official site.

This is purely a marketing exercise. They're hanging the thought experiment tag on to try and avoid undue criticism.


It's no more a marketing exercise than any tutorial that's posted on a project's site. If they'd submitted this as a proposal to whichever standards body is responsible for OpenDocument then I would think you have a point, but they're just using it as a well known example.


While I agree that using the SQlite format does introduce a heavy dependency on the SQLite library, I want to point out that there is at least one independent implementation, a pure java library named SqlJet [0].

It didn't work too well when I tried it a few years back, but it proves that it is possible to reimplement support for the file format itself.

Doing it as well as the original SQLite library, well, that's different.

[0] https://sqljet.com


Why do you view a "standard" or "dozens of implementations" as a good thing - what is the context of your evaluation?

Why is it not comparable, if not significantly better, to have a mostly uniform public domain code base that has been deployed to billions of devices?


The diversity of mutually compatible implementations avoids a dominant implementation's critical issue being catastrophic and keeps the standard moving in a direction no single party can totally influence. SQLite does have a list of semi-standardized requirements [1] which should technically allow for alternative implementations---not having them today makes it less viable as a true standard.

[1] https://www.sqlite.org/requirements.html


A better starting point might be the database file format description.

https://www.sqlite.org/fileformat.html


lifthrasiir also sort of touched on this (and I endorse their comment), but as a thought experiment, imagine what word processing formats would look like if MSFT hadn't owned .doc, but instead, WordPerfect and maybe two other competitors also shared the format.

When a popular format can't be unilaterally controlled by one actor, both the users and the market as a whole tend to be better off.


> But still, it is less archival quality, and more inherently complex (even though that complexity is very well encapsulated).

More complex yes, but why is it less suitable for archival? It's very well documented, and the code to read it is fully open source and will run on practically all platforms in existence.


If for some reason sqlite is a problem, why not export to a text format or do a dump of the data? I even recall seeing sqlite to xml conversion. You can do your archiving in xml and do your work with sqlite.


Except that SQLite is a library that can be embedded into any program.

Yes, there's also a binary command line interface to interface with a db file, but that's not what SQLite mainly is.


SQLite has a public domain license, which is problematic in some jurisdictions. If it used Apache or something this would be less of a concern.


Under what circumstances is the public domain license actually a problem?

It's also worth noting that you also have the option of purchasing a license for $6000. I'd imagine it's mostly companies that are likely to have concerns over licensing issues, and for most companies that price should be completely affordable. Especially when you consider the project's superb quality level and extensive test suite.


The $6000 "license" is actually a "Warranty of Title". It is a legal document that certifies that we developers have the right to release SQLite into the public domain (in other words, we didn't copy any part of the software from other sources) and a promise to defend the public-domain status of SQLite against legal challenges.


In my app (a search tool for Polish corpora [1]), I'm using ZIP with zero compression (stored files) as a container file format. The upside is that I can easily read the archive contents, and then mmap parts of the file as needed [2].

[1]: https://github.com/nathell/smyrna

[2]: https://github.com/nathell/smyrna/blob/master/src/clj/smyrna...
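
For the curious, a rough sketch of that locate-and-mmap trick in Python for a stored (uncompressed) entry; the helper and its error handling are invented here:

  import mmap, struct, zipfile

  def member_view(path, member):
      # Find where the stored member's bytes live inside the archive, then map
      # the whole file and hand back a zero-copy view of just that region.
      with zipfile.ZipFile(path) as zf:
          zi = zf.getinfo(member)
          assert zi.compress_type == zipfile.ZIP_STORED, "only works for stored entries"
      f = open(path, "rb")
      f.seek(zi.header_offset)
      fixed = f.read(30)                            # fixed part of the local header
      name_len, extra_len = struct.unpack("<HH", fixed[26:30])
      start = zi.header_offset + 30 + name_len + extra_len
      mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
      return memoryview(mm)[start:start + zi.file_size]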


Why ZIP versus something like TAR?


You can seek to files within a zip file, tar doesn't have an index.


Being the (T)ape (AR)chive format, it was designed for tape backup, which doesn't really have random access. IIRC you would basically cat the output of tar to the tape.


Doesn't ZIP with no compression amount to the same thing as TAR? What are you claiming the difference is?


Despite the fact that they can be used for similar purposes, the underlying formats are radically different.

TAR is a 'Tape Archive'... the metadata for each file is adjacent to the file contents itself. (Which is what you'd expect for a format designed for a linear access media like tape.)

For ZIP, there's a single unified directory structure that can be used as an index into the file.


That got me thinking, are there any efforts to replace tar? A single unified directory structure does seem like a desirable feature, I'm sure it'd be possible to find other improvements to make in tar. I wonder if anybody is working on a neotar?


The difference between a 'neotar' effort and a 'neovim' effort is that for 'neotar', you'd have to be using it also for me to send you a neotar file. It may seem like a small thing, but this is a major impediment to the adoption of new formats. (Corollary is that neotar would have to offer major improvements to get people to buy in.)

> A single unified directory structure does seem like a desirable feature,

It can make it more difficult to incrementally stream out an archive, so it's not a complete win.


It looks like several have tried, including the Library of Congress: https://en.m.wikipedia.org/wiki/Comparison_of_archive_format...


Would that be different than ZIP then?


But then you can't replace "tar.bz2" or "tar.gz" with "<something else>.gz" since ZIP does both container and compression.


Sure you can. Nothing prevents you from zipping something with compression turned off and then running bz2 on the result.
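
A quick sketch of that in Python (file names are placeholders):

  import bz2, zipfile

  # Build an uncompressed ("stored") zip as the container, then compress the
  # whole container with bz2 - the same split between archiving and compression
  # that tar.bz2 gives you.
  with zipfile.ZipFile("bundle.zip", "w", compression=zipfile.ZIP_STORED) as zf:
      zf.write("content.xml")
      zf.write("styles.xml")

  with open("bundle.zip", "rb") as src, bz2.open("bundle.zip.bz2", "wb") as dst:
      dst.write(src.read())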


This seems like an excellent working format for documents for all of the reasons listed. One could ask why all structured documents aren't built in just this way. In particular I believe that CAD files built like this would be much more robust than they currently are in the likes of TurboCAD.


Many internal data stores for OSX are sqlite files, as are the guts of Chrome. As for a portable interchange format ... history. But yes, using an embedded relational store would be vastly preferable to what we have now.


I don't think SQLite would be good for 3d data exchange between companies/programs, to name one example.

glTF is the newest format for this, and it's much more suited for the task.


Thanks for the pointer, although it looks focused on '3D' as in models and assets for games and renderers, not '3D' as in the dimensions and mechanical properties of concrete walls. Still, I downloaded the spec from the Github repository for a closer read.

The two important clients for 3D CAD data are analysis software (CFD, structural, etc.) and CAM software for manufacturing. I didn't get a sense that glTF was looking at that space at all when it was being developed.


This is actually a very timely article. I think I might start using sqlite as a datastore for one of my personal projects.


FWIW, ZIP files also allow for partial saves, as each entry is stored in a separate stream, so all you need to do is update the directory, which is placed at the end of the file (in theory you could update archives in an append-only manner when saving and have a pack command to rebuild the ZIP file in order to remove the unnecessary data). It is just that (AFAIK) no program that uses ZIP as a generic container bothers with that (and besides, implementing the necessary logic for the partial saves might actually be the harder bit, regardless of the storage format).
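
A rough sketch of that append-then-pack scheme with Python's zipfile (names and content are placeholders; note that zipfile keeps the superseded copy listed in the rewritten directory, but name lookups resolve to the newest entry):

  import zipfile

  # Append a replacement entry; a fresh central directory is written at the end
  # of the file, and the stale bytes linger until a separate "pack" pass.
  new_content_xml = b"<office:document-content>...</office:document-content>"
  with zipfile.ZipFile("document.odt", "a") as zf:
      zf.writestr("content.xml", new_content_xml)

  def pack(src_path, dst_path):
      # Copy only the entries the directory resolves names to, dropping the
      # dead data left behind by earlier appends.
      with zipfile.ZipFile(src_path) as src, \
           zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
          for name in dict.fromkeys(src.namelist()):
              dst.writestr(name, src.read(name))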


Handling a power outage during a save typically means writing to a new file and then handling it later. With a big zip I just copy the entire file (hoping that a file copy is reasonably fast) and then do in-place updates in the copy. If all the in-place updates in the copied zip work, then replace the original with the copy.

I haven't found a reasonable way to do this that doesn't require making a copy of the whole original document. If sqlite did this elegantly, I'd switch in a heartbeat.


> in theory you could update archives in an append-only manner when saving and have a pack command to rebuild the ZIP file in order to remove the unnecessary data

In fact I'd expect a crash-safe "update" to do that; the in-place update only works if the new revision of a file is no bigger than the old one.

If you use appending methods to update the file, you write the new version of the existing file, you write a new central directory, and you remove the old directory entry. Only at that point should the update become visible.


The first line of the file is just:

<?xml version="1.0" encoding="UTF-8"?>

The second line of the file contains 211792 characters of impenetrable XML.

So, does no XML formatting tool exist to consume XML and pretty-print the output? I can tell you, as someone that consumes gigabytes of compressed JSON each day, having 500,000-plus characters of JSON on a single line causes me no problems. The equivalent in XML should also cause any developer or tool wishing to consume it no problem, as this is a program interchange format, not a format meant to be read by a person (or why compress it anyway?). If someone is really espousing readability in document formats, maybe we should just replace opendocument with some variation of markdown?

Also, if your complaint is that XML is too obtuse because it's not formatted well, I'm not sure how replacing it with a schema implemented in a RDBMS format is much better, especially when there are relatively few implementations of code that can read that format. Sure, everyone has bindings to SQLite, but those are bindings to the same C library. Almost every non-toy programming language has had the DEFLATE algorithm implemented in it, and implementing a naive XML parser isn't hard (and I'm not sure anything beyond a naive XML parser is required for the OpenDocument format).


The tool you're describing is `xmllint --format`.


Not quite. `xmllint --format` can introduce unwanted spaces. Consider this example.

<p><i>hello</i><b>!</b></p>

`xmllint --format` adds a space between the 'o' and the '!'.


Is there a way to reformat XML without introducing space?


I think so. Technically you can put linebreaks and whitespace within XML tags.


AFAIK this is only a problem with html


Notepad++ also has a lot of plugins.


This, I use the XML and JSON pretty print plugins daily.


> this is a program interchange format, not a format meant to be read by a person

So why is it xml? I mean this defeats the whole point. Not to mention how it adds a lot of useless noise.

If it's a program interchange format, it could at least be in binary.


There exists binary XML.

What OpenDocument uses is a plain text encoding of XML Infoset[1] (ordered tree of attributes and nodes with optional namespaces and external schema) data structure, but it can be encoded in many ways. You could encode XML infoset as JSON, which for example Google used to sort of do for their old “GData” APIs.

(Actually that’s not true, because there’s no such thing as “plain text” so what OpenDocument uses really is a Unicode serialization of said data structure encoded as a (usually) UTF-8 binary stream with an ASCII variable length header directive specifying said encoding.)

So why use XML Infoset? Unfortunately for certain qualities you might want in an open document interchange format there’s just nothing better.

At any point in an XML Infoset document you can add additional namespaced attributes, nodes and even embed entire documents. For example an ODG document will include SVG elements.

Any XML consumer should be able to parse and display a document produced by a completely independent program, and simply skip the parts it doesn’t support.

If you were to design a clean slate data structure with similar qualities you would probably get something very similar to XML Infoset.

Of course there is a lot of low hanging fruit and XML is by no means perfect.

[1] https://en.wikipedia.org/wiki/XML_Information_Set


I find this hard to buy.

Let's start with just one downside (which imo is the killer misfeature): startup time.

According to the OP article you have to read the entire xml before you can do anything with it.

Surely even if you weren't designing SQLite, it should be possible to design a binary format such that you can read relevant information quickly without loading and parsing the entire binary file.


That’s not an inherent limitation of XML; after all, XHTML has no such limitations and browsers support incremental rendering in XML mode (at least Firefox).


Introducing whitespace in most xml document formats changes the meaning.


No it doesn’t.


You're both sort of correct. 1 and 2 below are equivalent (afaik), but 1 and 3 most certainly are not.

  1: <x>an </x>
  2: <x> an</x>
  3: <x>a n</x>


It depends on your space settings in the DTD/schema/element. You can be explicit but very few people realise that.


Indeed. To indent properly you need to know the schema. The schema tells you in which elements you may add or remove white-space.


That’s why I said most and not all. I did XML long enough to know both schemas and that their support is bad.


Please don’t.

When people are sending or sharing a document, they don’t expect to leak the edit history of that document. And SQLite inherently leaks that edit history, in freed database pages and unallocated space. You can VACUUM the database after each save, but that’ll negate the performance benefits. Also, VACUUM was designed to save space; I’m not sure it reliably cleans up unwanted data.

OTOH, these zipped XMLs are free from unwanted data, they (hopefully) only contain the data user wants to save, i.e. the content of the last version of the document.


The "PRAGMA secure_delete=ON;" command will fix that for you. Or you can make secure-delete the default at compile-time using the -DSQLITE_SECURE_DELETE option. There is a (small) performance hit, since extra writes have to occur to overwrite the obsolete data with zeros, but it does work.

See https://www.sqlite.org/pragma.html#pragma_secure_delete for further information.
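
With the Python sqlite3 module, for example, that's a one-liner per connection (the file name is a placeholder):

  import sqlite3

  con = sqlite3.connect("document.odp")
  con.execute("PRAGMA secure_delete=ON")   # overwrite freed pages with zeros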


Not enough. See my next comment: https://news.ycombinator.com/item?id=15607650


I agree. People are vastly underestimating the information already exposed by people not understanding what they are sharing when they hand a modern document format out.

New features need to be examined for unintended consequences, and the fact that this feature might entail data leakage through fragmented included data should be a red flag.


People should publish PDF version of a document. But if they send the working copy around, then the history may be helpful for other collaborators. Think of git. They can even merge the changes easily.


> People should publish PDF version of a document.

Oh please no! PDF is quite user-unfriendly (e.g. can't reflow, typically can't cut/paste properly) designed to emulate an ancient medium that has benefits and drawbacks different from screens.

And PDF, while executable, is essentially compiler output, with all the destruction of semantics that implies.


I only have issues with ebooks; outside them it is just perfect.

Cut/paste only fails on me when whoever created the PDF had the brilliant idea of using images instead of actual text, e.g. scans - something that using another format won't help with.

Maybe you should use another PDF reader?


It depends on what you are trying to do.

Short of exporting to an image format, PDF is the best format for preserving and sharing a final rendered output. All of the things you list as drawbacks are virtues if your purpose is static presentation.


For a resume, PDF is ace. I don't want the reader to see version history or copy/paste (even though it's peanuts with a half decent PDF reader). I want them to print it.


Unfortunately, companies don't do that. They send the doc/docx. They also send the xls/xlsx instead of the csv. Protecting users from accidentally doing a very wrong thing is better than being slightly more helpful as a default. Security should be the default. IIRC there have been cases where confidential information has been found by looking through the history in government released documents.


You know you can pop calc with a csv file opened in Excel?

from http://georgemauer.net/2017/10/07/csv-injection.html :

  UserId,BillToDate,ProjectName,Description,DurationMinutes
  1,2017-07-25,Test Project,Flipped the jibbet,60
  2,2017-07-25,Important Client,"Bop, dop, and giglip", 240
  2,2017-07-25,Important Client,"=2+5+cmd|' /C calc'!A0", 240
Yeah - that "=2+5+cmd|' /C calc'!A0" gets Windows to execute code...


docx/xlsx are better; they are zipped XMLs too, just like ODF.

The older ones, doc & xls, are closer to sqlite, for the same reason: performance. I remember finding confidential information in these files received from third parties that they probably didn’t want to disclose, e.g. an offer price from before the last edit.


Agreed. History should be considered a feature.


And it is. That's fine, as long as the user actively turns that feature on, and by default no data is leaked.


P.S. Even fragmentation can leak data. E.g. search for the most fragmented portion of the document in the DB, and you'll likely find out the most heavily edited paragraph of the document.


I believe that you raise excellent concerns. I also believe the solutions are perhaps just a bit of thoughtful UI design away.

The options could be something like:

    ----Performance & Privacy-----
    
    Document Save Mode
    
    [x] Balanced. Document history is not shared. (Default)
    [ ] Fastest. Not recommended for sensitive data 
    [ ] Maximum privacy. Slowest, but most secure

    Click for more information on what these choices mean
The first option would zero out previous edits. This would be the default. This would still be faster than re-zipping the entire archive -- you'd only be zeroing out a single file. Metadata could still be inferred via fragmentation as you say, but I do not believe this is a concern for most use cases.

The second option would optimize purely for speed.

The third option would optimize purely for privacy. Perhaps the entire document database would be rewritten from scratch on every save. This probably wouldn't be too much slower than the current XML dance.

I understand the value of this kind of "paranoid" mode but I think it's overkill for most situations. Personally, the situations where I'd want this kind of behavior are extremely rare. Clearly some people would need/want this and that's fine; it would be only a click away. And then all future documents could inherit this choice until the app is told otherwise.


This UI solves a problem for the computer, not the user. Whenever you find yourself making a UI like this, you should go back and change your program so the computer doesn't have that problem any more. In this case, there's surely a way to write files quickly enough while still maintaining user privacy -- just do that instead of bothering me with a wall of text every time I save a file!


     just do that instead of bothering me with a wall of text every time I save a file!
I certainly wouldn't want to be hit with this wall of text every time I save a file. I was thinking that this sort of option ought to be a global setting that could perhaps be altered on a per-document basis if (and only if) the user cares to muck about with it.

It could, perhaps, be a choice that is presented to the user upon new document creation.

     This UI solves a problem for the computer, not the user. [...]
     In this case, there's surely a way to write files quickly 
     enough while still maintaining user privacy
Life is full of these choices. There's a gas pedal in my car. I can mash it (to go fast but burn a lot of fuel) or be gentle with it (and save gas, but I won't go as fast). Ideally the car should be crazy fast and crazy safe and crazy fuel efficient no matter how you drive the sucker, but that's a tough engineering challenge, and ultimately we (today) leave that choice to the driver.

That said, you are 100% correct. Maybe the privacy/paranoia steps are something that could be done on a background thread. The file could be saved quickly, and then the database could be privacy-optimized in the background immediately thereafter.


That's a terrible UI. First of all, it's an issue that users wouldn't even know exists, so they'll probably never look for the options anyway. And the proposed default leaks data, a terrible choice! Secondly, why are there three options? Two of them leak data, one doesn't. Who in their right mind would pick the "leak some data" option?

Leaking sensitive data that users would expect to be deleted is a bug. People don't want their software to be full of configuration options that amount to "fix this security hole, but run really slowly" or "leave this security hole open, and run faster".


    And the proposed default leaks data
The suggested default ("don't share edit history, but don't go 'full paranoid' and vacuum/defragment the database on every save") wouldn't leak data. It would, at times, leak information from which some metadata could be gleaned. Most edits would presumably not even cause fragmentation unless they spanned more than a single data page in SQLite's on-disk storage format.

If you counter that "metadata is data; therefore it does leak data" I suppose you'd be correct in a very strict sense. But, regardless of semantics, I think "sometimes hinting at which parts of the file have been edited" is quite a bit different from "leaking a document's edit history, verbatim" -- in much the same sense that a common cold is different from HIV despite the fact that they're both caused by viruses. (Though I suppose both can kill you if the conditions are right.)

I can certainly think of situations where a user wouldn't want to leak such vague metadata, but that seems to me like an edge case and we as developers should strive for sensible defaults. I can't think of too many instances when I would have cared if somebody could perhaps infer which part of a document had been edited most often.

That said, yeah, offering choices like this is clunky. Maybe the privacy stuff could just happen automatically in the background following a save, or at export time, etc.


> wouldn't leak data

They will leak data, e.g. deleted information. Search “SQLite forensics” for technical details.

> Most edits would presumably not even cause fragmentation

You replace “a” with “the” in the middle of a large table, and generally the table becomes fragmented. The way it’s fragmented indicates not only where it was edited, but also what kind of edits were made; insertions and deletions fragment the database differently.
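If you want to see this for yourself, builds of SQLite compiled with SQLITE_ENABLE_DBSTAT_VTAB expose a dbstat virtual table that shows where each page of each table lives in the file. A rough sketch:

    -- Pages of a single table whose file offsets are non-contiguous
    -- are a hint at where (and roughly how) the table was edited.
    SELECT name, pageno, pgoffset
      FROM dbstat
     ORDER BY name, pageno;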


https://sqlite.org/whentouse.html

SQLite does not compete with client/server databases. SQLite competes with fopen().

I personally do not agree with this argument, because some operations are intrinsically better suited to text-based formats. For example, with text-based formats one can at least try to diff and merge multiple files (even though the actual suitability depends more on the data structure and less on the underlying container format). But I can see where the argument is coming from.



> Third Improvement: Versioning

Regardless of SQLite, why don't modern productivity tools include at least an option to automatically version documents, i.e. store documents as a series of diffs that one could browse back to the start?

Of course in many cases, edit history is better kept hidden, so this should not be the default. But why doesn't it exist at all?
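For what it's worth, in a SQLite-backed format this could be as little as one extra table. A hedged sketch, with purely illustrative table and column names:

    -- Every save appends one row; "delta" holds a diff against the parent
    -- version, so the full history can be replayed back to the start.
    CREATE TABLE versions (
        id        INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES versions(id),
        saved_at  TEXT,   -- e.g. ISO-8601 timestamp
        author    TEXT,
        delta     BLOB    -- output of whatever diff algorithm the app uses
    );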


Difficult, but possible. We did this with Buzzword at Adobe. The hardest part was the UX, honestly. Most knowledge workers need some hand holding, to say nothing of teachers and students (we had success selling to high school teachers).


The typical excuse is file space. The next excuse is privacy/security concerns. The next excuse is usually that other tools do it better (SharePoint, svn, git) for the types of files that really matter.

MS Word has a Track Changes feature that kind of does this, and it's a pain because you have to remember to remove the revision history before you publish the document because it can expose details about the users that edited the document.


I think Apple introduced this in 10.7: https://support.apple.com/en-us/HT202255

IMO this is better handled by the OS than by each individual app.


This article about storing files in SQLite is spot on, and it works. I did exactly this and I have tested it with GBs of data and thousands of files, and it works well. The table structure I use is:

CREATE TABLE IF NOT EXISTS files (id TEXT, name TEXT, contents BLOB);

You can see the source code for the JavaScript/Node.js web application here:

https://github.com/zubairq/gosharedata/blob/master/src/index...
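For anyone curious, reads and writes against that table are just ordinary SQL; a minimal sketch with made-up ids and file names:

    -- Store a file (the blob literal here is the bytes of "<body />")
    INSERT INTO files (id, name, contents)
    VALUES ('doc-001', 'chapter1.xml', x'3C626F6479202F3E');

    -- Read it back
    SELECT contents FROM files WHERE name = 'chapter1.xml';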


I'm maintaining an application that uses a similar (OPC) file format. It has served us well and it beats the legacy way of having just a directory of files (e.g. when you want to send a file to someone else it's ONE file) but it doesn't scale very well when adding lots of data, and incremental update is painful compared to just using a directory.

How does SQLite handle incremental updates? Does it use separate file storage for large table blobs? If the blobs are inline, how does it handle changing one blob from 1 MB to 2 MB, say? Presumably it just appends the new blob, updates the pointer, and marks the original 1 MB area as unused and available for cleanup? That leaves a 3 MB file, which can then be vacuumed down to 2 MB?
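My understanding (hedged, since the details vary with journal mode and build options) is that the answer to all of the above is basically yes: large blobs live on chains of overflow pages inside the same file, an update writes new pages and puts the old ones on an internal free list, and VACUUM rewrites the file to reclaim them. You can watch it happen with a couple of pragmas, reusing the illustrative files table from the earlier comment:

    -- Replace a ~1 MB blob with a ~2 MB blob (zeroblob() just makes filler)
    UPDATE files SET contents = zeroblob(2*1024*1024) WHERE id = 'doc-001';

    PRAGMA freelist_count;   -- pages that held the old blob, now unused
    PRAGMA page_count;       -- total pages in the (temporarily larger) file

    VACUUM;                  -- rebuild the file without the free pages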


> The use of a ZIP archive to encapsulate XML files plus resources is an elegant approach to an application file format. It is clearly superior to a custom binary file format.

Why would it be superior? The biggest advantage I can think of is code reuse, but what else?


Interoperability? If it's in a zip and in a human-readable format, you can still dig into the data even without the necessary application. It also helps when writing a new application, as it only has to understand zip and XML, something virtually everything can handle.

As much as I hate XML, this seems like a decent use for it.


Because the author ends up duplicating features of the zip format. Think of all the stuff that goes along with a ppt slide: fonts, images, videos, sounds. You want to embed the original files anyway, because storing the content in your own custom format is pointless extra work. Think how complicated a video file format is. It's a good fit for a file archive.


Look, I love SQLite as much as the next person, but as a database, it's got a lot of problems. From the lack of a formal time-date storage class to the "dynamic typing", if we are going to go binary and complex, is SQLite the best place to plan for future stability and ease of use?

The first question is "Q1: Does an RDBMS help here?" and the second is "Q2: Given Q1=Y, which might we use, what features are required, and how are they implemented?" The parent jumped to SQLite as the answer (given the source), but I am unconvinced on Q2, or even Q1.


I don't see why the zip file can't be read incrementally; that sounds like an issue in their zip loader. Zip stores each file it contains separately.


I find the idea awesome. Once the spec for such a document format is created, there would be a tool or two implementing it.


Interesting, but where should the database normalization end? Should every text element with a single style be a separate row with an associated style referenced via a foreign key? Should all the XML disappear and be re-coded around database concepts?


If it used SQLite it’d be like serialised COM again (xls,doc). Kill me.

Zipped XML actually allows us to parse and manipulate it easily with a strongly defined schema. SQLite can barely enforce any consistency.

As for XML, it’s actually pretty marvellous. Most people don’t understand it, which is where the problem lies.


I'm sorry, where in the article did you read the suggestion that the format should stop using XML? I can't find any such suggestion. Looking at the other comments here, it looks like the author must have recommended ditching XML completely, but the actual article doesn't seem to bear that out. Has it been radically altered since being posted to HN? The disconnect between the pro-XML arguments here and the actual article's assumption of continuing to use XML is really quite jarring.

The nearest I can find is where it suggests breaking up the XML into smaller XML documents, but that's a whole heck of a long way away from being the same thing.


I'm not even talking about the data model. The manifest would be replaced, resources would no longer have usable URIs, you couldn't easily deserialise a stream using SAX, and the DTD and schema references that ship with it would be invalidated.

Those are all XML too. It’s not just markup.

It turns a bundle of files with strict relations into an unworkable stateful blob of loose links. Welcome to the world of corruption at the application layer.


I would hardly call a system such as the one proposed corruption.

OK, so maybe this proposal might make the format a little harder to access programmatically, but it's a pretty minor effect. You just need to extract the data from a SQLite DB instead of a zip file. You could even store a manifest in the DB as well.

I think the thing I object to about all the criticism here is that it's all about making things easy and simple and pure from a development point of view. The article proposes concrete end user benefits from a scheme like this, which to me seem pretty worthwhile. Surely the focus should be on the user experience first and then look at how to implement that?


XML is pure on its own. Data, validation, transformation, schema, even programming.

From a development perspective you’re adding more steps.


I thought there was an application that did something like this, but can't remember it anymore. It's nice to see a writeup that steps you incrementally through the benefits of this approach.


Acorn (an awesome image editor for Mac) uses SQLite as its .acorn image format.

Schema:

    sqlite> .schema
    CREATE TABLE image_attributes ( name text, value blob);
    CREATE TABLE layers (id text, parent_id text, sequence integer,
        uti text, name text, data blob);
    CREATE TABLE layer_attributes ( id text, name text, value blob);

My app, Mémoires (diary for Mac), and some other Mac apps, use SQLite via Apple's Core Data as document format.


Was it Fossil SCM by chance? Because it happens to have the same author. :) http://fossil-scm.org/


Penetration testers will have a field day with this if it happens.


Pen-testers have been hammering hard on SQLite for years. Attacks against SQLite are not as easy as you imagine.

Pen-testers do still occasionally find minor problems. See https://www.sqlite.org/src/info/04925dee41a21ffc for the latest example. But generally speaking, it is safe to open an SQLite database received from an untrusted source. If you are extra paranoid, activate the "PRAGMA cell_size_check=ON" feature and/or run "PRAGMA integrity_check" to verify the database before use.
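For readers unfamiliar with those pragmas, the whole defensive step before touching an untrusted file fits in two lines (a sketch only; whether you need it depends on your threat model):

    PRAGMA cell_size_check = ON;   -- extra sanity checks on b-tree cell sizes as pages are read
    PRAGMA integrity_check;        -- walk the whole file and report structural problems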


Why? SQLite has been put in safety critical systems and has the best test plan I've ever seen for an open source project.


It has the best test plan I've seen for any pure software product, and I'm including both medical and safety critical stuff.


The typo in the very first sentence (where/were) puts me off :(


Yes please. Anything is better than XML


... and then SQLite changes the format of the files on the disk and you are entirely boned...


If you read through their docs [0] you'd see they're committed to maintaining backwards compatibility with all SQLite 3 releases. There haven't been breaking changes since 2004.

In addition, the library is incredibly easy to embed with any project as it's largely a self-contained system [1]. If SQLite 4 were released with breaking changes, you could happily continue using SQLite 3.

[0] https://sqlite.org/onefile.html

[1] https://sqlite.org/selfcontained.html


This is a document format we are talking about here, one that should be good for the ages, not just 13 years. In practice you would have to define your document format in terms of a particular version of SQLite, so you would have to unambiguously document that version in enough detail to cover everything anyone would ever want or be able to do. Once that version of SQLite becomes unmaintained, things are broken.

Someone would have to maintain that version of what would now be the ODF SQL disk format forever. Much of what would have to be maintained would be pointless complexity not required for the representation of documents.


You seem to be operating under the assumption that the SQLite database file format is an ad-hoc, dodgy affair that the maintainers might change on a whim. I'd like to try to dispel that notion.

The SQLite database file format has been unchanged (except by adding extensions through well-defined mechanisms) since 2004. And we developers have pledged to continue that 100% compatibility through at least the year 2050. The file format is cross-platform and byte-order independent. We have an extensive test suite to help us guarantee that each new version of SQLite is completely file compatible with all those that came before. The file format is documented (https://sqlite.org/fileformat.html) to sufficient detail that the spec alone can and has been used to create a compatible reader/writer. (I know this because the authors of the compatible reader/writer software pointed out deficiencies in the spec which I subsequently corrected.) The file format is relatively simple. A reader for SQLite could be written from the spec with no more effort than it would take to implement a ZIP archive reader with its decompressor from spec.

There are approximately 1 trillion (10^12) SQLite database files in active use today. Even without the promises from developers, the huge volume of legacy database files suggests that the SQLite database file format will be around for a long time yet.

The latest version of SQLite will happily read and write any SQLite3 database written since the file format was designed in 2004. Ancient legacy versions of the SQLite library will read and write databases created yesterday, as long as those databases do not use any of the extensions that were added later.


As long as SQLite4 includes built in SQLite3 migration/loading, breaking changes are not as bad.


The article notes that this is only a thought experiment, and I think it should stay that way. It's very interesting to think about, yet SQLite offers nothing but problems for developers and content creators. While maybe a bit clunky, a zip file manages to be close to the best choice. Considering that a document may be hundreds of megabytes with thousands of assets, or a single page of text, zip offers enough simplicity and robustness while being highly accessible.


And SQLite doesn't... why again?



