Zip files use a well-documented, easily reproducible, and widely implemented DEFLATE algorithm. It's reasonable to assume that someone could, 100 years from now, unzip an OpenDocument file. The contents are XML files - basically text - which can be read with any simple text editor and parsed. The XML is also at least somewhat self-documenting regarding the contained document.
SQLite databases have a binary format which is known to change periodically. An attempt to read a SQLite file in 100 years would have to know the exact version of the file, and be able to reach back through history to find that version of SQLite (or a compatible version), and find a way to compile it for whatever CPU architecture exists in 100 years.
This could be mitigated by including the appropriately versioned documentation of the binary format with the database itself - however to the best of my knowledge, full documentation of the binary format (outside the code) does not exist. The end product - a relational database - would also be less self-documenting than the XML version.
IMO, it would be much better to keep the inefficient zipped XML files.
[EDIT]: The binary file format: https://www.sqlite.org/fileformat.html
Learn something new every day.
I contend this is not as big a problem as you make it sound. First of all, the SQLite file format states its version in a string at the beginning of the file, so determining the version of a file is trivial even without any special tools. The SQLite source code is extremely widely distributed. Between Linux distros and the many projects which, for better or worse, embed SQLite in their source trees, there are easily over a million copies of sqlite3.c floating around. The chances of all of these copies being lost, even in a catastrophic event, are negligible. Even if we did somehow lose every copy of sqlite3.c, reverse engineering a binary format is far from impossible. In fact the old binary MS Office file formats were reverse engineered by multiple groups.
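For reference, checking that header string takes nothing more than reading the first 16 bytes; a quick sketch in Python (the file name is just an example):

    # Every SQLite 3 database begins with the 16-byte magic string "SQLite format 3\x00".
    with open("example.db", "rb") as f:   # hypothetical path
        header = f.read(16)
    print(header)   # b'SQLite format 3\x00' for any SQLite 3 database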
The chances of us losing the ability to compile C code are even more negligible. If human civilization has fallen that far, we will probably be too busy killing each other with sharp sticks over drinkable water to care about reading some hundred year old documents.
I give you "the relevant XKCD":
Somewhat more practically, a relevant D-Lib Magazine quote: "not all file formats are suitable for long term preservation, even if they have an open specification. Some lossy and compressed file formats pose a higher risk of total loss if even a single bit is lost."
And from the Library of Congress, in the context of format preferences for text with structural markup: "XML or SGML using standard or well-known DTD or schema appropriate to a particular textual genre."
Wouldn't this issue apply more to OpenDocument, which is compressed (into a ZIP archive), than SQLite, which (at least by default) is not?
But then, I question the appropriateness of the advice. If you're serious about archiving, you should be using error-correcting codes in some form, so that the archived data will remain recoverable bit-for-bit even with a large number of bit errors in the underlying medium. To be honest, I'm not that familiar with long-term archiving practices, but if you have some kind of RAID setup, that should give you both error correction (for bit errors) and drive redundancy (for loss of entire drives). Alternatively, you could use dedicated ECC tools like par2.
True, most data that gets preserved will probably be preserved by chance, by people who are not serious about archiving, and may not take sufficient steps to prevent errors. But they're also not going to choose a format for optimal archiving either, so you're kind of stuck with the fact that many modern file formats have built-in compression and/or checksums, and thus don't hold up well when corrupted. We could keep the issue in mind when designing new formats, but is resilience to corruption really worth the additional storage cost of leaving data uncompressed? Or perhaps we could design formats to have built-in ECC instead of just checksums, but that would also waste space...
This means that it has long-term support for the life of the airframe, which for passenger airliners can easily be 50+ years (see https://en.wikipedia.org/wiki/Boeing_737 which is still flying after 50 years).
An interesting read on the topic and challenges of digital preservation: http://www.dl.slis.tsukuba.ac.jp/DLjournal/No_20/1-hedstrom/...
>at which point you have to reverse-engineer how the tables were built in order to re-create the document.
I don't understand what you mean by this. All you have to do is examine the contents of the various tables. It's no different to examining the contents of an XML file. In either case, how easy it is to reconstruct the document will depend on how well designed the format is.
LOL said the thousands of people with old dusty Zip disks sitting in boxes in their attics. That was only 25 or so years ago.
I appreciate that this is because the authors are being careful about possible corruption when locking isn't working properly, but if they really mean the « SQLite competes with fopen() » line they need to find a way to get round this, because fopen() doesn't do that.
is a great overview of a number of desktop and media file formats (particularly in the context of how to find holes in them in which you can stick metadata.) Read between the lines and you can understand some of the tradeoffs; the database-based format shows up once or twice and, from the viewpoint of XMP, it is the one in which it is hardest for a foreign program to find an embedded media packet. On the other hand, SQLite is open source so you can say "just open it with SQLite".
Microsoft hasn't supported Jet in forever (a decade or more), and doesn't recommend it. The number of corrupted MDB files (and stranger, corrupted LDB lock files) I've had to work with in the last couple of years (in trying to migrate Jet-using VB6 apps to something 2017 has to offer) has shown me that Jet/MDB is a database format that should be properly thanked for its service over the years, then gracefully old yellered.
So a word-processor-like application might try implementing its own simple global lock (eg with a lock file) and then tell SQLite to use nolock=1.
I'm not sure I'd feel very comfortable doing that but it seems OK in principle.
I wish Subversion would offer an option to do something of the sort for its working copies, now it has an SQLite database in there.
But that tends not to matter in practice for applications like word processors (which often want to implement their own system to stop two people opening the same document in editable mode at the same time, not just stop them saving at the same time).
Or the fopen-using application can just do things the simple way and write out a new temporary file to replace the old one with.
« Hence, "File/Save" takes longer than it ought, especially on older hardware. Newer machines are faster, but it is still bothersome that changing a single character in a 50 megabyte presentation causes one to burn through 50 megabytes of the finite write life on the SSD. »
As it happens, I wasn't making an argument. I was answering nikbackm's question.
OpenDocument and its competitor (OOXML) both use zip/XML for portability reasons. Yes, they're less efficient. But they're also completely independent of any specific software implementation.
 Mostly - OOXML has some parts that depend on MS Office's implementation. But ODF is completely described by the spec.
doc and xls were proprietary binary formats.
sqlite files couldn't be more open - the software to interact with them is in the public domain and ported everywhere.
And the article itself doesn't even suggest using sqlite as a binary store. Rather, it suggests using the capabilities of sqlite to augment text-based data.
This article champions the idea that you should store the files as compressed binary BLOBs in SQLite. What does that mean you need to do? Connect to the file with SQLite, fetch the BLOB, hexdump it since it will return as a string, decompress it, and then save it to disk. Or save it to disk and then decompress it. You'd have to do something like this: https://stackoverflow.com/questions/15448373/how-to-dump-a-f...
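Concretely, those steps amount to a few lines of Python with the built-in sqlite3 module; a sketch, where the files(name, contents) table is a hypothetical stand-in and the zlib step only applies if the BLOB really was stored zlib-compressed:

    import sqlite3, zlib

    conn = sqlite3.connect("document.db")                      # hypothetical document file
    row = conn.execute("SELECT contents FROM files WHERE name = ?",
                       ("content.xml",)).fetchone()            # hypothetical schema
    data = zlib.decompress(row[0])                             # skip if the BLOB is stored uncompressed
    with open("content.xml", "wb") as out:
        out.write(data)
    conn.close()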
That is what Open Document is supposed to end.
This is not an excellent article. It's complaining that the Open Document Formats are bad because they don't use an optimized format. They're not supposed to be optimized. Documents aren't large enough to cause the types of problems that databases are intended to fix. They're supposed to be portable and easy to work with. They make it possible to access your data using the most basic computing tools: a ZIP archive tool and a text editor. That's the whole point.
Or, you could do it several dozen other ways. For example, maybe you have an archiver tool for SQLite that works like "zip" - https://www.sqlite.org/sqlar/doc/trunk/README.md
About this much we agree: The objective is to make the content of the document more accessible and easier to use. The article intended to show that using an SQLite database rather than a pile-of-files in a ZIP archive achieves that goal. If my presentation were stored in an SQLite database, I might (depending on how the schema was structured of course) be able to ask interesting questions of the presentation, such as:
* How many pages are in this presentation?
* Which pages contain the word "discombobulated"?
* How many images are in the presentation?
* Does this presentation use a prepackaged background?
* Does the presentation begin with my company's standard "start" slide?
I have 104 historical presentations in a folder and I want to find some slide I wrote years ago. Right now, I have to laboriously open each presentation and manually search for the slide I want. If the format were SQLite, I could perhaps do the same with a query, which if run from a shell script, would quickly search all 104 presentations for me.
So, Yes, the whole point is to make the content more easily accessible and usable. Relative to a pile-of-files in a ZIP archive, SQLite does exactly that.
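To make that shell-script search concrete, here is a sketch with Python's built-in sqlite3; the slides table and its columns are purely hypothetical stand-ins for whatever schema such a format would define:

    import glob, sqlite3

    for path in glob.glob("presentations/*.odp"):              # the 104 presentations, hypothetically SQLite files
        conn = sqlite3.connect(path)
        hits = conn.execute("SELECT page_number FROM slides WHERE body_text LIKE ?",
                            ("%discombobulated%",)).fetchall() # hypothetical slides(page_number, body_text)
        if hits:
            print(path, [page for (page,) in hits])
        conn.close()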
Using the libraries or packages available from the OpenDocument home page? (http://opendocumentformat.org/developers/) Probably, yes. Four of your five questions are just: "Is there an API?" The answer is: Yes, yes there is. Did you try searching for an Open Document API? For the image file one, however, all you really need is ZipFile.infolist() because ZIP files already have an API.
> All I'm trying to say is that if OpenOffice .odp files were SQLite databases with a sensible schema, then probably you could answer any of the questions above with perhaps 5 lines of code or less.
Most of the questions involve an XQuery against content.xml. You could iterate through the files, extract content.xml, and apply the XQuery. No, it's not easy to write the correct XQuery because XQuery sucks and the schema is complex, but it can definitely be done.
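For the word-search question at least, plain zipfile plus ElementTree (swapped in here for XQuery) gets you a crude but workable version; a sketch that simply flattens all text nodes and does a substring match:

    import glob, zipfile
    import xml.etree.ElementTree as ET

    for path in glob.glob("presentations/*.odp"):
        with zipfile.ZipFile(path) as z:
            root = ET.fromstring(z.read("content.xml"))
        # itertext() concatenates every text node, ignoring the markup structure entirely
        if "discombobulated" in "".join(root.itertext()):
            print(path)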
I don't know why you think it's going to be magically simplified by changing the storage method. You'd either have compressed XML BLOBs like the article's first suggestion -- which puts you right back where you already are -- or you'll have dozens of tables with a ton of metadata and document structure to dig through. Either way you will need a schema reference to understand what's going on. An RDBMS isn't going to make the content simpler. The schemas are complex because the documents are potentially very complex.
Because ZIP is a key/value store and SQLite is relational.
(Later:) Here is a slide comparing SQLite to ZIP from a talk I'm scheduled to give on Monday: https://sqlite.org/tmp/sqlite-v-zip.jpg
Both SQLite and ZIP will store files and both are well-established open formats with a trillion instances in the wild.
SQLite does everything ZIP does but also gives you transactions, a rich query language, a schema, and the ability to store small (1-8 byte) objects and to translate objects into the appropriate byte order for the reader.
That's the opposite of what the article states. From the introduction of the article:
> Note that this is only a thought experiment. We are not suggesting that OpenDocument be changed. Nor is this article a criticism of the current OpenDocument design.
> The point of this essay is to suggest ways to improve future file format designs.
That makes it sound like, "No offense, but < extremely rude comment >." Just because they say they're not trying to be critical doesn't mean they're not actually being critical.
Or you could just use a SQLite GUI (eg sqlitebrowser.org). Select the field/BLOB you're interested in, then Export (saves it to a file). Done. :)
File Format Changes in SQLite:
"Since 2004, there have been enhancements to SQLite such that newer database files are unreadable by older versions of the SQLite library. But the most recent versions of the SQLite library should be able to read and write any older SQLite database file without any problems."
Me, I'm just happy all my old doccies are in TeX.
Yes, actually they could. They could be readable without any software at all. Like XML.
All of the improvements suggested in the article could be implemented by extending the XML format to allow including other XML files by reference.
The problem with formats "readable without any software" is that you usually end up with everything represented as strings, with no knowledge of actual types and constraints. So, you still need that documentation, even though the format is "readable without any software".
If your document editor represents documents as a stack of edit-events (for undo/redo or whatever) and you document every nook and cranny of this "document format", standardize it as "binary event-sourced document format, .esd" then your binary spec will be very close to a full description of your document editor. Everyone who wants to use your format now has to reimplement your editor.
Anyone can leak implementation details into any format, and it is to some extent unavoidable, but if you can make it readable (or semi-readable), some thought has at least been put into the representation.
No, it's not. An sqlite-based document can have as many representations as an XML-based document -- nothing in its implementation prevents this.
>so "readable without any software" is a good test for wether it's "readable with any sfotware"
The OS, filesystem, shell, editor etc needed to read the XML file are still software.
As is the XML parser needed to do anything useful with it in the data realm.
>If your document editor represents documents as a stack of edit-events (for undo/redo or whatever) and you document every nook and cranny of this "document format", standardize it as "binary event-sourced document format, .esd" then your binary spec will be very close to a full description of your document editor. Everyone who wants to use your format now has to reimplement your editor.
You do understand that the XML file can also just contain a stack of edit-events, right?
This is totally orthogonal to the underlying storage (XML or sqlite or whatever).
What an extraordinarily stupid notion. Your filesystem is not readable without a filesystem driver for the specific binary format used to represent it, so you cannot even read your text file without having something in your stack that understands a more structured data organization scheme.
This argument doesn't hold water, since file content doesn't depend on which filesystem it's on.
An archiver may choose to store data in some very particular way, like some specific tape driver. They're free to do that without corrupting the content of any files, since the files don't depend on the filesystem.
On the other hand, if an archiver chose to store, say, unzipped versions of their opendocument files, then they've corrupted the data: opendocument files are zips, unzipped data is not opendocument. The format does depend on the file content.
Binary formats start out mostly unreadable by humans[1]. XML and other textual formats at least have the possibility of being made readable by laypersons.
[1] I'll note that after enough immersion, I've seen people read binary core dumps, etc., but that takes much more time and practice than with XML.
Would the average person edit a .svg by hand? No, he'd use Adobe Illustrator or anything else.
Would he edit a .docx file by treating it as a zip archive and edit content.xml? No, he'd use MS Word/LibreOffice Writer...
Suppose Sqlite used XML/JSON instead of binary files, would you modify them with notepad? No, you'd probably use a SQLite browser (or an application that is more tailored to the domain).
Fair point, and in fact I'd go further - most of the time, most people do not need to directly edit or view most files in most formats. Even if you took 'most' to mean 99% or higher, I'd be comfortable with that statement.
Where I differ is that I think the essence of the argument is really whether or not a binary file format offers enough value to be worth entirely eliminating the direct edit/view possibility for everybody all the time. Even if it's not a common case, it can be game-changingly useful when you need it.
Just to illustrate, you give three examples, and I have counter examples for each:
> Would the average person edit a .svg by hand? No, he'd use Adobe Illustrator or anything else.
I've modified bounding boxes by directly editing SVG, as well as read SVG directly to analyze some plots a library was generating for me.
> Would he edit a .docx file by treating it as a zip archive and edit content.xml? No, he'd use MS Word/LibreOffice Writer...
I've done this recently to extract out embedded documents on OSX (where the native versions of Office do not directly support this.)
> Suppose Sqlite used XML/JSON instead of binary files, would you modify them with notepad?
HSQLDB can use text based SQL scripts to store data, and I've modified and edited them directly for several reasons.
I despise xml quite a bit but I have to admit, it's still potentially better.
What if my text editor has an sqlite file viewer?
They aren't, though. Or can you read ZIP archives without an unpacker?
However, all software is not equal. Any OS comes with a text editor and a ZIP utility by default now. The same is not true for SQLite's format.
• Win 10 apparently does (I don't use it, so can't personally verify). Win7/8 don't.
• Every version of OSX for many years does
• Fairly sure the main Linux distros do
• FreeBSD does, not sure about NetBSD/DragonFlyBSD though.
Checking a local OpenBSD 6.x box just now, sqlite/sqlite3 aren't in the default user path, so maybe that's one that doesn't.
Normally tarballs are used and zip has to be installed afterwards:
apt install zip
I read the article as advocating for more open binary formats. That it is useful to have binary formats is logical. That someone would like to have open ones is, I think, a nice thought.
The same thing is becoming true of SQLite (and pretty much is true if you exclude older versions of Windows). We're talking about one of the two (estimated) most-widely-deployed software projects in the world today (http://sqlite.org/mostdeployed.html). It's FOSS, ubiquitous, easily embeddable, works remarkably well, and has bindings for pretty much every programming language under the sun. As far as a general-purpose file format goes, a SQLite database is thus pretty close to ideal.
That's not entirely true. Anything that comes with Python installed has SQLite bundled/compiled in. It's just the library, not the cli, but still.
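Easy to verify on any machine with Python on it:

    import sqlite3
    # sqlite3 ships in the standard library; this prints the version of the SQLite it is linked against.
    print(sqlite3.sqlite_version)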
For some values of "readable" and "human".
It fits the requirements of reading the file without (external) software since you can trivially reimplement it, and is also a much lower threshold than "you need to reimplement sqlite", which is why pretty much every system out there comes bundled with a zipfile unpacker.
But you do have a point that in a rare case, you could get your hands on the raw data which makes text easier.
But everything is a tradeoff, and the question really is if the tradeoff in this case is worth it, since sqlite is so open and portable.
The only difference is that xml is also human "readable". But most xml isn't really readable without a tool anyway.
You may not want to look at XML without a tool, but you could. Or, most likely, you could build a tool to look at the XML. This is a crucial design consideration of XML - even if all XML tooling, open or not, is lost forever, you can still read the XML. Not so with SQLite.
Especially for a document format, this point seems important. That said, OpenDocument is zipped, which, while a simpler and even more ubiquitous format, is still a binary format that requires a library to read, so that detracts somewhat from the point. It's difficult to imagine a future where zip libraries survive but SQLite libraries don't.
ASCII encoded text is simpler than UTF-8, but there is content that can't be represented in ASCII. Zip comes a bit up the food chain from there, and SQLite comes quite a bit further.
XML is about being meaningful without tooling at the lowest possible level. Any added required tooling or documentation by definition makes the data less 'open'.
At this point, you've side-tracked into an argument that anyone could make against any software.
SQLite, for all its significant qualities, is none of that, and thus you shouldn't use it if designing for file format longevity.
You can't read XML without software.
Perhaps you take editors and cat and less as a given.
In which case, nothing prevents you from taking sqlite the binary (or 1000 serializers that can be produced very easily) as a given too.
XML is only trivially "readable without any software at all". The same is true of any binary format - the issue is that it can be read, but not easily processed - which is also the case for XML (you need to integrate an XML parser for that).
And if you are including XML libs, why not SQLite libs?
More to the point, no one needs to install a full blown database engine to read a plain text document.
You may argue that specialized tools make the job easier, but you also need to acknowledge that requiring specialized tools just to open a text file is silly.
A printout of a binary dump does not provide any clues as to how the data was structured.
> a binary dump does not provide any clues as to how the data was structured
True, but if that binary format is SQLite, the structure is fully documented. You could work it out mentally just as you could work out any given XML schema.
you already need a decoder
Indeed. He probably meant docx and xlsx.
When the docx and xlsx formats were developed, they lost this capability.
To be sure, SQLite is very well supported, runs on tons of platforms and is going to be around for a long time. But still, it is less archival quality, and more inherently complex (even though that complexity is very well encapsulated).
> Finally, let us reiterate that this essay is a thought experiment. The OpenDocument format is well-established and already well-designed. Nobody really believes that OpenDocument should be changed to use SQLite as its container instead of ZIP. Nor is this article a criticism of OpenDocument for not choosing SQLite as its container since OpenDocument predates SQLite. Rather, the point of this article is to use OpenDocument as a concrete example of how SQLite can be used to build better application file formats for future projects.
This is purely a marketing exercise. They're hanging the thought experiment tag on to try and avoid undue criticism.
It didn't work too well when I tried it a few years back, but it proves that it is possible to reimplement support for the file format itself.
Doing it as well as the original SQLite library, well, that's different.
Why is it not comparable, if not significantly better, to have a mostly uniform public domain code base that has been deployed to billions of devices?
When a popular format can't be unilaterally controlled by one actor, both the users and the market as a whole tend to be better off.
More complex yes, but why is it less suitable for archival? It's very well documented, and the code to read it is fully open source and will run on a practically all platforms in existence.
Yes, there's also a binary command line interface to interface with a db file, but that's not what SQLite mainly is.
It's also worth noting that you also have the option of purchasing a license for $6000. I'd imagine it's mostly companies that are likely to have concerns over licensing issues, and for most companies that price should be completely affordable. Especially when you consider the project's superb quality level and extensive test suite.
TAR is a 'Tape Archive'... the metadata for each file is adjacent to the file contents itself. (Which is what you'd expect for a format designed for a linear access media like tape.)
For ZIP, there's a single unified directory structure that can be used as an index into the file.
> A single unified directory structure does seem like a desirable feature,
It can make it more difficult to incrementally stream out an archive, so it's not a complete win.
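That central directory is also what makes listing a ZIP cheap: a reader seeks to the directory at the end of the file and never touches member data. A small illustration with Python's zipfile (the archive name is arbitrary):

    import zipfile

    # infolist() is answered entirely from the central directory; no member data is read or decompressed.
    with zipfile.ZipFile("document.odp") as z:
        for info in z.infolist():
            print(info.filename, info.file_size, info.compress_size)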
GLTF is the newest format for this, and it's much more suited to the task.
The two important clients for 3D cad data are analysis software (CFD, Structural, Etc) and CAM software for manufacturing. I didn't get a sense that GLTF was looking at that space at all when it was being developed.
I haven't found a reasonable way to do this that doesn't require making a copy of the whole original document. If sqlite did this elegantly, I'd switch in a heartbeat.
In fact I'd expect a crash-safe "update" to do that; the in-line update only works if the new revision of a file is no bigger than the old one.
If you use appending methods to update the file, you write the new version of the existing file, you write a new central directory, and you remove the old directory entry. Only at that point should the update become visible.
<?xml version="1.0" encoding="UTF-8"?>
The second line of the file contains 211792 characters of impenetrable XML.
So, does no XML formatting tool exist to consume XML and pretty-print the output? I can tell you, as someone that consumes gigabytes of compressed JSON each day, having 500,000-plus characters of JSON on a single line causes me no problems. The equivalent in XML should also cause any developer or tool wishing to consume it no problem, as this is a program interchange format, not a format meant to be read by a person (or why compress it anyway?). If someone is really espousing readability in document formats, maybe we should just replace OpenDocument with some variation of markdown?
Also, if your complaint is that XML is too obtuse because it's not formatted well, I'm not sure how replacing it with a schema implemented in an RDBMS format is much better, especially when there are relatively few implementations of code that can read that format. Sure, everyone has bindings to SQLite, but those are bindings to the same C library. Almost every non-toy programming language has had the DEFLATE algorithm implemented in it, and implementing a naive XML parser isn't hard (and I'm not sure anything beyond a naive XML parser is required for the OpenDocument format).
`xmllint --format` adds a space between the 'o' and the '!'.
So why is it xml? I mean this defeats the whole point. Not to mention how it adds a lot of useless noise.
If it's a program interchange format, it could at least be in binary.
What OpenDocument uses is a plain text encoding of XML Infoset (ordered tree of attributes and nodes with optional namespaces and external schema) data structure, but it can be encoded in many ways. You could encode XML infoset as JSON, which for example Google used to sort of do for their old “GData” APIs.
(Actually that’s not true, because there’s no such thing as “plain text” so what OpenDocument uses really is a Unicode serialization of said data structure encoded as a (usually) UTF-8 binary stream with an ASCII variable length header directive specifying said encoding.)
So why use XML Infoset? Unfortunately for certain qualities you might want in an open document interchange format there’s just nothing better.
At any point in an XML Infoset document you can add additional namespaced attributes, nodes and even embed entire documents. For example an ODG document will include SVG elements.
Any XML consumer should be able to parse and display a document produced by a completely independent program, and simply skip the parts it doesn’t support.
If you were to design a clean slate data structure with similar qualities you would probably get something very similar to XML Infoset.
Of course there is a lot of low hanging fruit and XML is by no means perfect.
Let's start with just one downside (which imo is the killer misfeature): startup time.
According to the OP article you have to read the entire xml before you can do anything with it.
Surely even if you weren't designing SQLite, it should be possible to design a binary format such that you can read relevant information quickly without loading and parsing the entire binary file.
1: <x>an </x>
2: <x> an</x>
3: <x>a n</x>
When people are sending or sharing a document, they don't expect to leak the edit history of that document. And SQLite inherently leaks that edit history, in freed database pages and unallocated space. You can VACUUM the database after each save, but that'll negate the performance benefits. Also, VACUUM was designed to save space; I'm not sure it reliably cleans up unwanted data.
OTOH, these zipped XMLs are free from unwanted data, they (hopefully) only contain the data user wants to save, i.e. the content of the last version of the document.
See https://www.sqlite.org/pragma.html#pragma_secure_delete for further information.
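For what it's worth, opting in is a one-line pragma; a minimal sketch (the document file name is hypothetical), with the trade-off being extra write I/O since deleted content gets overwritten with zeros:

    import sqlite3

    conn = sqlite3.connect("document.db")         # hypothetical document file
    conn.execute("PRAGMA secure_delete = ON")     # overwrite deleted content with zeros
    # ... normal edits and saves happen here (and are committed) ...
    conn.execute("VACUUM")                        # optionally rebuild the file so free pages disappear entirely
    conn.close()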
New features need to be examined for unintended consequences, and the fact that this feature might entail data leakage through fragmented included data should be a red flag.
Oh please no! PDF is quite user-unfriendly (e.g. can't reflow, typically can't cut/paste properly) designed to emulate an ancient medium that has benefits and drawbacks different from screens.
And PDF, while executable, is essentially compiler output, with all the destruction of semantics that implies.
Cut/paste only fails on me when whoever created the PDF had the brilliant idea of using images instead of actual text, e.g. scans. That's something another format won't help with.
Maybe you should use another PDF reader?
Short of exporting to an image format, PDF is the best format for preserving and sharing a final rendered output. All of the things you list as drawbacks are virtues if your purpose is static presentation.
from http://georgemauer.net/2017/10/07/csv-injection.html :
1,2017-07-25,Test Project,Flipped the jibbet,60
2,2017-07-25,Important Client,"Bop, dop, and giglip", 240
2,2017-07-25,Important Client,"=2+5+cmd|' /C calc'!A0", 240
The older ones, doc & xls, are closer to SQLite, for the same reason: performance. I remember finding confidential information in these files received from third parties that they probably didn't want to disclose, e.g. an offer price from before the last edit.
The options could be something like:
----Performance & Privacy-----
Document Save Mode
[x] Balanced. Document history is not shared. (Default)
[ ] Fastest. Not recommended for sensitive data
[ ] Maximum privacy. Slowest, but most secure
Click for more information on what these choices mean
The second option would optimize purely for speed.
The third option would optimize purely for privacy. Perhaps the entire document database would be rewritten from scratch on every save. This probably wouldn't be too much slower than the current XML dance.
I understand the value of this kind of "paranoid" mode but I think it's overkill for most situations. Personally, the situations where I'd want this kind of behavior are extremely rare. Clearly some people would need/want this and that's fine; it would be only a click away. And then all future documents could inherit this choice until the app is told otherwise.
just do that instead of bothering me with a wall of text every time I save a file!
It could, perhaps, be a choice that is presented to the user upon new document creation.
This UI solves a problem for the computer, not the user. [...]
In this case, there's surely a way to write files quickly
enough while still maintaining user privacy
That said, you are 100% correct. Maybe the privacy/paranoia steps are something that could be done on a background thread. The file could be saved quickly, and then the database could be privacy-optimized in the background immediately thereafter.
Leaking sensitive data that users would expect to be deleted is a bug. People don't want their software to be full of configuration options that amount to "fix this security hole, but run really slowly" or "leave this security hole open, and run faster".
And the proposed default leaks data
If you counter that "metadata is data; therefore it does leak data" I suppose you'd be correct in a very strict sense. But, regardless of semantics, I think "sometimes hinting at which parts of the file have been edited" is quite a bit different from "leaking a document's edit history, verbatim" -- in much the sense that a common cold is different from HIV despite the fact that they're both viruses. (Though, I suppose they can both kill you if the conditions are right)
I can certainly think of situations where a user wouldn't want to leak such vague metadata, but that seems to me like an edge case and we as developers should strive for sensible defaults. I can't think of too many instances when I would have cared if somebody could perhaps infer which part of a document had been edited most often.
That said, yeah, offering choices like this is clunky. Maybe the privacy stuff could just happen automatically in the background following a save, or at export time, etc.
They will leak data, e.g. deleted information. Search “SQLite forensics” for technical details.
> Most edits would presumably not even cause fragmentation
You replace “a” with “the” in the middle of a large table, and generally the table becomes fragmented. The way it’s fragmented indicates not only where it was edited, but also what kind of edits were made, insertions and deletions fragment database differently.
SQLite does not compete with client/server databases. SQLite competes with fopen().
I personally do not agree to this argument, because some operations are intrinsically better suited to text-based formats---for example with text-based formats one can at least try to diff and merge multiple files (even though the actual suitability depends more on the data structure and less on the underlying container format), but I can see where the argument is coming from.
Regardless of SQLite, why don't modern productivity tools include at least an option to automatically version documents, ie store documents as a series of diffs that one could browse up to the start?
Of course in many cases, edit history is better kept hidden, so this should not be the default. But why doesn't it exist at all?
MS Word has a Track Changes feature that kind of does this, and it's a pain because you have to remember to remove the revision history before you publish the document because it can expose details about the users that edited the document.
IMO this is better handled by the OS than by each individual app.
CREATE TABLE IF NOT EXISTS files (id TEXT, name TEXT, contents BLOB);
How does sqlite handle incremental update? Does it use separate file storage for large table blobs? If the blobs are in line, how does it handle changing one blob from 1Mb to 2Mb say? Presumably it just appends the new blob, updates the pointer, and marks the original 1Mb area as unused and possible to clean up? That leaves a 3Mb file, which can be vacuumed to 2Mb then?
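One way to answer that empirically is to try it against the schema above and watch the file size; a rough sketch (exact sizes depend on page size, journaling mode, and how aggressively freed pages get reused):

    import os, sqlite3

    db = "blobtest.db"
    conn = sqlite3.connect(db)
    conn.execute("CREATE TABLE IF NOT EXISTS files (id TEXT, name TEXT, contents BLOB)")
    conn.execute("INSERT INTO files VALUES ('1', 'big', ?)", (b"x" * 1_000_000,))
    conn.commit()
    print("after 1 MB insert:", os.path.getsize(db))

    conn.execute("UPDATE files SET contents = ? WHERE id = '1'", (b"y" * 2_000_000,))
    conn.commit()   # the old blob's pages are freed; they may be reused or left on the freelist
    print("after 2 MB update:", os.path.getsize(db))

    conn.execute("VACUUM")   # rebuilds the database without any free pages
    print("after VACUUM:", os.path.getsize(db))
    conn.close()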
Why would it be superior? The biggest advantage I can think of is code reuse, but what else?
As much as I hate xml this seems like a decent use for it.
The first question is "Q1: Does a RDBMS help here?" and then 2nd "Q2: Given Q1=Y, Which might we use, what features are reqd and how are they implemented?". Parent jumped to SQLite as the answer (given the source) but I am unconvinced on Q2, or even Q1.
Zipped XML actually allows us to parse and manipulate it easily with a strongly defined schema. SQLite can barely enforce any consistency.
As for XML it’s actually pretty marvellous. Most people don’t understand it which is where the problem is.
The nearest I can find is where it suggests breaking up the XML into smaller XML documents, but that's a whole heck of a long way away from being the same thing.
Those are all XML too. It’s not just markup.
It turns a bundle of files with strict relations into an unworkable stateful blob of loose links. Welcome to the world of corruption at the application layer.
Ok, so maybe this proposal might make the format a little harder to access programmatically, but it's a pretty minor effect. You just need to extract the data from a SQLite DB instead of a zip file. You could even store a manifest in the DB as well.
I think the thing I object to about all the criticism here is that it's all about making things easy and simple and pure from a development point of view. The article proposes concrete end user benefits from a scheme like this, which to me seem pretty worthwhile. Surely the focus should be on the user experience first and then look at how to implement that?
From a development perspective you’re adding more steps.
CREATE TABLE image_attributes ( name text, value blob);
CREATE TABLE layers (id text, parent_id text, sequence integer, uti text, name text, data blob);
CREATE TABLE layer_attributes ( id text, name text, value blob);
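Pulling the layers back out of a file with that schema is then a few lines; a sketch, reusing the column names above with a hypothetical file name:

    import sqlite3

    conn = sqlite3.connect("image.db")                          # hypothetical image document
    for layer_id, name, data in conn.execute(
            "SELECT id, name, data FROM layers ORDER BY sequence"):
        with open(f"{layer_id}-{name}", "wb") as out:           # one output file per layer's raw data
            out.write(data)
    conn.close()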
Pen-testers do still occasionally find minor problems. See https://www.sqlite.org/src/info/04925dee41a21ffc for the latest example. But generally speaking, it is safe to open an SQLite database received from an untrusted source. If you are extra paranoid, activate the "PRAGMA cell_size_check=ON" feature and/or run "PRAGMA integrity_check" to verify the database before use.
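In code, those two checks are just a couple of statements before touching an untrusted file; a sketch with Python's sqlite3 (the file name is arbitrary):

    import sqlite3

    conn = sqlite3.connect("untrusted.db")
    conn.execute("PRAGMA cell_size_check = ON")                   # extra per-page sanity checking
    result = conn.execute("PRAGMA integrity_check").fetchone()[0]
    if result != "ok":
        raise ValueError("integrity check failed: " + result)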
In addition, the library is incredibly easy to embed with any project as it's largely a self-contained system. If SQLite 4 were released with breaking changes, you could happily continue using SQLite 3.
Someone would have to maintain that version of what would now be the ODF SQL disk format forever. Much of what would have to be maintained would be pointless complexity not required for the representation of documents.
The SQLite database file format has been unchanged (except by adding extensions through well-defined mechanisms) since 2004. And we developers have pledged to continue that 100% compatibility through at least the year 2050. The file format is cross-platform and byte-order independent. We have an extensive test suite to help us guarantee that each new version of SQLite is completely file compatible with all those that came before. The file format is documented (https://sqlite.org/fileformat.html) to sufficient detail that the spec alone can and has been used to create a compatible reader/writer. (I know this because the authors of the compatible reader/writer software pointed out deficiencies in the spec which I subsequently corrected.) The file format is relatively simple. A reader for SQLite could be written from the spec with no more effort than it would take to implement a ZIP archive reader with its decompressor from spec.
There are approximately 1 trillion (10^12) SQLite database files in active use today. Even without the promises from developers, the huge volume of legacy database files suggests that the SQLite database file format will be around for a long time yet.
The latest version of SQLite will happily read and write any SQLite3 database written since the file format was designed in 2004. Ancient legacy versions of the SQLite library will read and write databases created yesterday, as long as those databases do not use any of the extensions that were added later.