
SQLite as an Application File Format (2014) - afiori
https://sqlite.org/appfileformat.html
======
pvg
Previously:
[https://news.ycombinator.com/item?id=18754634](https://news.ycombinator.com/item?id=18754634)

------
TheRealPomax
The great thing about applications that use sqlite as their application file
format is that you can write secondary apps and utilities to supplement the
applications themselves.

For example, Adobe's Lightroom uses sqlite as their application file format,
which means that it's almost trivial to write an application to help it do
things it's either really bad/exceedingly slow at (like removing images from a
catalogue that no longer exist on disk) or literally can't do (like generating
a playlist of files-on-disk for a specific collection or tag), without being
locked into whatever janky scripting solution exists in-app. If there even is
one to begin with.

All you need is a programming language with a sqlite connector, and you're in
the driving seat. And sure, you'll need to figure out the table schemas you
need to care about, but sqlite _comes with sqldiff_ so it's really easy to
figure out which operations hit which tables with only a few minutes of work.

Good luck reverse engineering proprietary file formats in the same way!
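
For instance, poking at a catalogue with Python's built-in sqlite3 module
looks something like this (a sketch; the table names in any given app's
schema are whatever sqldiff or sqlite_master tells you they are):

```python
import sqlite3

def explore(db_path):
    """List the tables in an unknown SQLite file with their row counts."""
    conn = sqlite3.connect(db_path)
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    counts = {}
    for t in tables:
        # Quote the identifier: table names in someone else's schema
        # may contain spaces or reserved words.
        counts[t] = conn.execute(f'SELECT count(*) FROM "{t}"').fetchone()[0]
    conn.close()
    return counts
```

From there it's ordinary SQL against whichever tables turn out to matter.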

~~~
lbutler
I have an application that uses SQLite as its file format, but they've
cleared the header so you can't open it directly.

I'd love to be able to make secondary applications like you've described but
being enterprise software they don't want to make it too easy.

They obviously want to keep people locked in with their $40k per seat
application!

I guess the first step is figuring out the page size and the other bits of
metadata set in the header [1].

I know I just have to sit down and understand the format better and I will
eventually figure it out...

[1]
[https://www.sqlite.org/fileformat.html](https://www.sqlite.org/fileformat.html)
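
If they only blanked the 16-byte magic string and left the rest of the header
intact, one speculative first step (no guarantees; they may well have mangled
more than that) is to copy the file and write the standard magic back:

```python
import shutil

SQLITE_MAGIC = b"SQLite format 3\x00"  # first 16 bytes of a standard db file

def restore_magic(src, dst):
    """Copy src to dst and rewrite the standard SQLite magic string.

    Only helps if the vendor merely blanked the magic; if the page size
    or other header fields were altered too, more of the header has to
    be reconstructed per the file-format doc in [1].
    """
    shutil.copyfile(src, dst)
    with open(dst, "r+b") as f:
        f.write(SQLITE_MAGIC)  # overwrite bytes 0-15 in place
    return dst
```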

~~~
TheRealPomax
That's mean, but you'll just have to live with that. If someone takes active
steps towards preventing people from making secondary apps/utils, then they're
(a) jerks, and (b) allowed to do that.

Might be worth actually contacting them to ask why, if you can make the case
that secondary applications will increase the value of their app, not decrease
it.

~~~
traverseda
>That's mean, but you'll just have to live with that

I mean I know we're all on board with the idea of intellectual property
actually being a thing now, but surely there are limits? I've seen people take
the hard-line stance that if something is your property you should be able to
dictate _exactly_ under what situation it can be used, but there have to be
limits to IP holders rights on some level, and I feel like reverse engineering
a file format is a pretty reasonable place to draw that line.

~~~
matheusmoreira
> we're all on board with the idea of intellectual property actually being a
> thing now

We most certainly are not. I personally believe that intellectual property as
a whole doesn't make sense in the 21st century and should be abolished.

> there have to be limits to IP holders rights on some level

There are. The laws generally recognize fair use and reverse engineering for
interoperability.

> I feel like reverse engineering a file format is a pretty reasonable place
> to draw that line

Absolutely. Unfortunately, in the US it seems corporations can force people to
give up their rights by making them agree to it. Therefore, "you must not
reverse engineer our software" is a standard clause in every contract and it's
not negotiable.

~~~
kstrauser
I'd be more on board with it if it were taxed like other property. That
patent's worth $1B, you say? Better let Accounting know. Conversely: you're
suing for $1B for a patent violation but you only paid taxes on it being worth
$300? $300 max it is, then.

~~~
kgwgk
> taxed like other property.

Most property owned by companies is not taxed.

~~~
patmorgan23
Actually, physical property is taxed; at least in my state, both physical
plant and inventory are taxed.

~~~
kgwgk
Interesting, I was not aware of those taxes.

[https://taxfoundation.org/does-your-state-tax-business-inventory/](https://taxfoundation.org/does-your-state-tax-business-inventory/)

~~~
mcny
Nice introduction. Thank you.

I think taxes are very difficult. Personally, I think when Google or Facebook
places an ad on its own (public) platform, they should have to pay a tax
similar to a sales tax as they have sold themselves. Probably shouldn’t apply
to the ads on google.com homepage as they don’t sell those as far as I
understand.

The article mentions neutrality in taxes which is either silly or disingenuous
though. Our tax code is not neutral. We openly use tax code as a way to
motivate public behavior.

Ad space as inventory would be an interesting concept. It is tricky though.
What is the value of something before it is sold? What is a fair tax rate? How
often does this inventory adjust as you have more or fewer users? Messy.

------
corysama
“Atomic transactions” is a feature that needs _formal_ support in random file
formats way more often than people realize. Simply writing to a file at all in
a guaranteed-atomic way is much harder than it looks. That guarantee becomes
important when your app gets widely distributed. If you have a million users
of your free mobile app, 1 in a million events happen every day. For example:
random hardware shutdown midway through a file write operation. How is your
app going to react when it reads back that garbled file?

I’ve used SQLite on mobile apps to mitigate this problem. I’ve used LMDB on a
cloud app where the server was recording a lot of data, but also rebooting
unexpectedly. Would recommend. I’ve also gone through the process of crafting
an atomic file write routine in C
([https://danluu.com/file-consistency/](https://danluu.com/file-consistency/)).
It was “fun” if your idea of fun is responding to the error code of fclose(),
but I would not recommend it...

~~~
hedora
POSIX (and I think Windows) guarantees that this sequence atomically and
durably overwrites a file:

tempfile = mkstemp(filename-XXXX)

write(tempfile)

fsync(tempfile)

close(tempfile)

rename(tempfile, filename)

sync()

Assume the entire write failed if any of the above return an error.

In some systems (NFS and ext3 come to mind), you can skip the fsync and/or
sync, but don’t do that. It doesn’t make things significantly faster on the
systems where it’s safe, but it definitely will lose data on other systems.

The only loophole I know of is that the final sync can fail, then return
anyway. If that happens, the file system is probably hosed anyway.
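
The same recipe translates almost directly to Python's standard library
(os.replace is the atomic rename; the directory fsync at the end is the
POSIX-specific part), as a sketch:

```python
import os
import tempfile

def atomic_write(path, data):
    """Atomically and durably replace `path` with `data` (bytes)."""
    dirpath = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem as the target,
    # or the rename at the end won't be atomic.
    fd, tmp = tempfile.mkstemp(dir=dirpath, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())   # flush file contents to disk
        os.replace(tmp, path)      # atomic rename over the target
        # fsync the directory so the rename itself survives a crash
        dfd = os.open(dirpath, os.O_RDONLY)
        try:
            os.fsync(dfd)
        finally:
            os.close(dfd)
    except BaseException:
        try:
            os.unlink(tmp)         # clean up the temp file on any failure
        except OSError:
            pass
        raise
```

As above: assume the whole write failed if any step raises.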

~~~
networkimprov
You need a recovery step on startup to retry the rename if tempfile is
complete, or delete it if it isn't.

That means you need a way to verify that tempfile is complete. I do that by
removing filename after completing tempfile. And that requires a placeholder
for filename if it didn't already exist (e.g. a symlink to nowhere).

On crash, rename may leave both files in place.

This technique doesn't work if you have hardlinks to filename which should
refer to the new file.

~~~
jonas21
Regardless of whether the tempfile is complete, you can just ignore (or
delete) it on startup. From the caller's perspective, the save operation
doesn't succeed until the rename is done and written to disk.

------
nathan_f77
I've been reading through many of the other pages on the SQLite website. I
didn't realize that SQLite is developed by a company (Hwaci) and funded by
selling support, licenses, and testing services. I had always assumed it was
an open source project backed by a non-profit, similar to the Apache Software
Foundation.

This is another very interesting example of an open-source business. I would
be interested to learn more about how the Hwaci company operates (revenue,
number of employees, etc.). I find this very interesting:

> We are a 100% engineering company. There is no sales staff. Our goal is to
> provide outstanding service and honest advice without spin or sales-talk.

They list some "$8K-50K/year" and "$85k" price tags directly on their "Pro
Support" webpage [1]. These would usually be behind a "Schedule a Call" or "Get
Quote" button. I've been thinking about doing something similar with my own
on-premise licenses and support contracts. I'm not very good at sales and I
don't really want to hire a sales team, so I'd be interested to know how this
worked out for them.

I also liked this sentence, which is very similar to Basecamp's philosophy
(and both companies were started around the same time - 1999 vs 2000):

> Hwaci intends to continue operating in its current form, and at roughly its
> current size until at least the year 2050.

It's interesting to think that SQLite could have raised money and grown into a
billion-dollar public company with thousands of employees.

I'm going to listen to this Changelog interview with Richard Hipp now [2], and
also this talk on YouTube [3].

[1] [https://sqlite.org/prosupport.html](https://sqlite.org/prosupport.html)

[2] [https://changelog.com/201/](https://changelog.com/201/)

[3]
[https://www.youtube.com/watch?v=Jib2AmRb_rk](https://www.youtube.com/watch?v=Jib2AmRb_rk)

~~~
TheRealPomax
Also important to note is that enterprise pricing is way higher than you'd
think because when an enterprise needs a solution, something like $1000 is
literally too cheap to justify the cost of procurement (which involves not
just getting a license, but also internal documentation, which in turn needs
to be entered into whatever process certification programme is in place, etc.)

~~~
ViViDboarder
That’s why many companies have a policy where licenses below a certain value
can just be expensed.

------
dragly
From experience, I can also recommend using SQLite as an application file
format. I landed on SQLite after looking for solutions for a file format for
an educational app we made for simulating biological neural networks. The app
is cross-platform, written in Qt, and the simulations needed to be stored as
JSON describing the network, plus a thumbnail and some metadata. It was also
intended to be extensible with more features and backwards compatible if new
versions were released. I considered creating our own simple format, using ZIP
files, HDF5, Qt resource files or SQLite.

I landed on SQLite for many of the reasons outlined in this article and in
particular because of how easy it was to implement and maintain. SQLite is
supported natively in QtSql, which made it extremely easy to write the save
and load functions, and later extend these with more data fields. In addition,
we did not have to worry about cross-platform support since this was covered
by SQLite and Qt already.

~~~
radarsat1
Interesting, I used HDF5 in a similar situation because we needed to save a
lot of same-sized rows of data (simulation time steps), so a matrix-oriented
format seemed to make sense but it wasn't entirely without some need for
cross-referencing between tables, so it does make me wonder now if sqlite
would have been a comparable or better choice. Any reason for rejecting HDF5
in your case?

~~~
st1ck
Is it cheap to insert/update/delete data in HDF5? If not, that should be the
answer. I'm also curious if Parquet would fit your requirements.

------
mstade
SQLite is amazing – trivial to embed in almost every context I've tried, rock
solid, incredibly flexible, performs well enough for most use cases and for
the others it probably still performs better than the solution I had in mind.
I wouldn't dream of using anything else for application data, unless you
_need_ something text based for human readers as well, and can't deal with
having people use a SQLite client. And even then, really consider whether you
_need_ this.

With SQLite embeddable in websites thanks to wasm, and the ability to create
object URLs, it's also pretty trivial to make full-blown (read-only) SPAs
delivered as a single HTML file. This last bit might seem crazy – and it kind
of is – but if your clients are all on a LAN in a corporate network so
bandwidth and latency aren't really an issue it makes a bit more sense.

I love SQLite, hands down one of my favorite tools.

------
polyrand
I think SQLite is underrated.

With WAL mode + an increased page cache you can get some excellent
concurrency, even when doing reads and writes at the same time.

With rqlite it's easy to make it a server database and have a cluster of
SQLite databases
([https://github.com/rqlite/rqlite](https://github.com/rqlite/rqlite)).

I wouldn't try to create the new Instagram with it, but I think it'd be
capable enough for many apps that are built on top of more complex DBs.
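
For reference, those two settings are just PRAGMAs on the connection (sketch
in Python; the 64 MiB cache figure is an arbitrary example):

```python
import sqlite3

def open_tuned(path):
    """Open a SQLite database with WAL mode and a larger page cache."""
    conn = sqlite3.connect(path)
    # WAL lets readers proceed concurrently with a single writer.
    conn.execute("PRAGMA journal_mode=WAL")
    # A negative cache_size is interpreted as KiB: this asks for ~64 MiB.
    conn.execute("PRAGMA cache_size=-65536")
    return conn
```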

~~~
edwinyzh
You are right! In our Delphi/Pascal community we have
[https://github.com/synopse/mORMot](https://github.com/synopse/mORMot), which
is a full client/server DB, remote ORM and RPC system based on top of SQLite.

------
trevor-e
While we are on the topic of file formats, I stumbled across a cool project
the other day that aims to make parsing well-known binary formats much easier:
[http://kaitai.io/](http://kaitai.io/)

In high school I was obsessed with the game Starcraft. It came with its own
map editor that would save to its own proprietary format. Some smart people
came along and reverse engineered that format and allowed us to do all kinds
of neat things we weren't supposed to. I see modding communities for games are
more popular than ever, and finding this project brought back lots of great
memories.

~~~
mdaniel
I'm super-on-board with declarative binary parsing schemes (i.e. not lua or
python) but my own attempt to use kaitai was met with the same frustration
that presumably caused PDF to be missing from the format gallery list
([http://formats.kaitai.io/](http://formats.kaitai.io/)): jump or offset-based
structures

At the time, I tried using a stunt like defining one huge blob that "eats" the
main file, then reaching back into it as we learned more, but it looks like
somewhere along the way they acquired a "substream" behavior
([https://github.com/kaitai-io/kaitai_struct_doc/blob/c53060f7...](https://github.com/kaitai-io/kaitai_struct_doc/blob/c53060f773acd80d9a27f7a199d03256d9763713/user_guide.adoc#choosing-a-substream))
so maybe it's worth another look

~~~
thechao
Can you explain more? I deal with a shocking number of poorly supported binary
formats, and this tool looks awesome. Buuuuttt... a lot of these formats are
“headers with offsets to structured data”. They’re intermixed logs, so
multiple incompatible binary formats are freely mixed in the same file.

~~~
mdaniel
I can most easily speak to the PDF example, which is harder than the target I
was going after due to it being a mixture of text and binary, with _a lot_ of
optional behavior (EOL is CR or LF or CRLF seemingly at random, for one
painful example)

I took a few minutes just to kick the tires on the startxref of PDF, to get a
feel for how the substream business plays out, and then stopped when I got to
the part about how the offset position is written in a dynamically sized ASCII
string but represents an offset in the file

    
    
        meta:
          id: pdf
          file-extension: pdf
          endian: le
        seq:
        - id: magic_bytes
          contents: '%PDF-'
        instances:
          startxref_hack:
            type: startxref_hack0
            pos: _root._io.size - 24
            # pick a reasonable guess to wind backward
            size: 24 - 5
          eof_marker:
            # contents: '%%EOF'
            type: str
            encoding: ASCII
            size: 5
            # this isn't strictly accurate,
            # due to any optional CRLF trailing bytes
            pos: _root._io.size - 5
    
        types:
          eat_until_lf:
            seq:
            - id: dummy
              type: u1
              repeat: until
              repeat-until: _ == 0xA
    
          startxref_hack0:
            seq:
            # this isn't accurate, since we may have jumped
            # into "endobj\n" or worse
            - id: junk
              type: eat_until_lf
            - id: startxref_kw
              contents: 'startxref'
            - id: startxref_crlf
              type: eat_until_lf
            - id: startxref_offset
              type: str
              encoding: ASCII
              terminator: 0xA
    

As best I can tell, the actual definition would involve a hypothetical
`repeat: until-backward` where it starts at EOF (-5 in our case, due to the
known EOF constant), reads _backward_ until it hits LF (and/or CR!), captures
that as the startxref offset, reads backward eating CR/LF, skips backward
`strlen("startxref")` bytes, and then the tomfoolery starts with
reading the "xref" stanza, which, again, is an ASCII description of more binary
offsets, using zero-prefix padded numbers because of course it does

Don't get me wrong -- it's entirely possible that kaitai is targeting
_strictly binary_ formats written by sane engineering teams, but the file
format I was going after had a boatload of that jumping-around, repeating
structs-of-offsets, too, so my holding up PDF as a worst-case example isn't
ludicrous, either

------
ris
Amongst all the praise we're seeing for this approach here, may we please just
have a minute's silence for the various exploits sqlite has seen when loading
arbitrary database files over the years?

My recent favourite being
[https://media.ccc.de/v/36c3-10701-select_code_execution_from...](https://media.ccc.de/v/36c3-10701-select_code_execution_from_using_sqlite)

~~~
empthought
This is a security flaw of the system or application using SQLite, though. It
is trusting data it should not trust and executing code provided with that
data.

~~~
roelschroeven
If I use libpng or libjpeg to decode an image and that causes a buffer overrun
or another security vulnerability, that is very much a bug in libpng or
libjpeg.

The same is true when I open a database in SQLite: if that causes a security
problem, it's a bug in that library. I don't even see how you could validate a
database file before you hand it over to SQLite.

~~~
justinclift
The exploit mentioned above relies on third-party functions being loaded into
the database, e.g. application-specific functions.

Those functions have to be _explicitly_ loaded by the application after
loading the database; i.e. they're not stored in the database and loaded with
it.

You'd validate the database (past the standard integrity checking), by loading
it _and not_ adding any third party functions. Then check it's not doing
anything dodgy with views, or outright disable views. Then load the third-
party extensions (if needed).

If you wanted to go even further, you could also add an authorizer callback
function:

[https://www.sqlite.org/c3ref/set_authorizer.html](https://www.sqlite.org/c3ref/set_authorizer.html)

That's called (multiple times) any time a SQL statement is prepared/executed,
and catches things like functions being run, tables being accessed, etc. You
can use that to only allow a whitelisted set of functions to run, which would
help in some scenarios. eg:

[https://github.com/sqlitebrowser/dbhub.io/blob/e1c5042f857e9...](https://github.com/sqlitebrowser/dbhub.io/blob/e1c5042f857e9fadc13eb0ff193f813103b2eef6/common/sqlite.go#L196-L226)

The "Defense Against Dark Arts" page on the SQLite website has further good
info about this:

[https://www.sqlite.org/security.html](https://www.sqlite.org/security.html)

As a data point, I implemented the above "Defense against Dark Arts" stuff
recently for an online SQLite data publishing platform to let people run free
form queries on databases (dbhub.io). It wasn't all that hard to implement;
much less so than I'd been expecting. :)
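
As a sketch of what the authorizer approach looks like from Python's sqlite3
binding (the whitelist here is made up; a real one would cover whatever your
application needs):

```python
import sqlite3

ALLOWED_FUNCTIONS = {"abs", "length", "count", "sum", "min", "max"}

def authorizer(action, arg1, arg2, db_name, trigger):
    # For SQLITE_FUNCTION, the function name arrives in the second argument.
    if action == sqlite3.SQLITE_FUNCTION:
        name = (arg2 or "").lower()
        if name not in ALLOWED_FUNCTIONS:
            return sqlite3.SQLITE_DENY   # statement fails to prepare
    return sqlite3.SQLITE_OK             # allow everything else (sketch only)

def open_guarded(path):
    conn = sqlite3.connect(path)
    conn.set_authorizer(authorizer)
    return conn
```

With that in place, a whitelisted call like `SELECT abs(-1)` runs normally,
while a non-whitelisted function fails at prepare time with a "not
authorized" error.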

------
boudewijnrempt
I was surprised, but also intrigued when I discovered that Manga Studio (Clip
Studio these days) used sqlite for the native file format, including the pixel
planes.

I wish my predecessor at Krita had made that choice, instead of choosing to
re-use the KOffice xml-based file format that's basically a zip file with
stuff in it. It would have made adding the animation feature so much easier.

~~~
im3w1l
What's wrong with xml + zip?

~~~
derefr
XML-based formats are export formats, not state-keeping formats. To use an
XML-based format for storage, you need to have a separate, canonical
_in-memory_ representation of the data, which you then snapshot and serialize into XML
upon request. You may or may not be able to get away with serializing less
than your full in-memory object graph upon save, using techniques similar to
DOM reconciliation. Either way, you'll still likely need your entire
document/project _represented_ in memory.

If you're working with something analogous to a text document, this snapshot-
and-serialize approach to saving works fine. If you're working with other
types of data, though, this approach only works for trivial projects; once
your document exceeds ~100MB, the overhead of snapshotting+serializing your
object graph becomes bad enough that people stop saving very often
(dangerous!), and it also makes the saving process itself more fragile (since
the longer a save takes, the more likely it becomes that the process might be
killed by some natural event like a power cut during it†.)

And, once your project size exceeds the average computer's memory capacity, an
in-memory canonical representation quickly becomes untenable. You start to
have to resort to hacks like forcing the user to "partition" their project,
only allowing the user to work with one piece at a time.

With an application state-keeping format, you have none of these concerns; the
store is itself the canonical data location. You don't have a canonical in-
memory representation of the data; the in-memory representation is simply a
write-through or write-back caching layer for the object graph on disk, and
the cache can be flushed at any time. Or you may not have a cache at all; many
systems that use SQLite as a file-format just do SQL queries directly whenever
they want to know something, never instantiating any intermediate in-memory
representation of the data itself, only retrieving "reports" built from it.

† You can fix fragile saving with a WAL log, but now the WAL log is your true
application state-keeping format, with the XML format just being a convenient
rollup representation of it.

~~~
geokon
I've never had my problem scale to the size that required a database/SQL, but
I don't quite get the advantage of your solution. Having all your interactions
with data go to disk through a cache muddles things, b/c it makes it much
harder to reason about performance (b/c when do you have a cache miss? and how
do you configure a cache properly?). You introduce a lot more black-magic
variables to reason about.

If you're editing images I'd think it'd just make more sense to have all of
your stuff in RAM and do the saving-to-disk on a separate thread. I don't
quite get why the users would stop saving in this example.

I'm not saying you're wrong - but more asking for some more details b/c I've
never imagined using a DB on data that can fit in RAM

~~~
HelloNurse
It's primarily a problem of inflexibility handicapping performance, not of
"cache misses" and clever algorithms.

For example, imagine a word processing program opening a document and showing
you the first page: you could load 50MB of kitchen sink XML and 250 embedded
images from a zip file and then start doing something with the resulting
canonical representation, or you could load the bare minimum of metadata (e.g.
page size) from the appropriate tables and the content that goes in the first
page from carefully indexed tables of objects. Which variant is likely to load
faster? Which one is guaranteed to load useless data? Which one can save the
document more quickly and efficiently (one paragraph instead of a whole
document or a messy update log) when you edit text?
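
As a sketch (with a hypothetical schema), the second variant boils down to a
couple of indexed queries, touching nothing else in the file:

```python
import sqlite3

# Hypothetical document schema: metadata plus paragraphs indexed by page.
DOC_SCHEMA = """
CREATE TABLE meta(key TEXT PRIMARY KEY, value TEXT);
CREATE TABLE paragraphs(
    id   INTEGER PRIMARY KEY,
    page INTEGER,
    body TEXT
);
CREATE INDEX para_page ON paragraphs(page);
"""

def load_first_page(conn):
    """Fetch only the metadata and the first page's content."""
    page_size = conn.execute(
        "SELECT value FROM meta WHERE key = 'page_size'").fetchone()[0]
    paras = [row[0] for row in conn.execute(
        "SELECT body FROM paragraphs WHERE page = 1 ORDER BY id")]
    return page_size, paras
```

And saving an edit is a single-row UPDATE on the paragraph that changed,
rather than a rewrite of the whole document.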

~~~
geokon
ah okay, incremental loading seems essential and I hadn't considered it.
Thanks for explaining :)

------
Nihilartikel
Recently I've been using SQLite to store state for a data-syncing CLI utility
in Python where I was considering using CSV or JSON at first. The Python
'Dataset' library,
[https://dataset.readthedocs.io/en/latest/](https://dataset.readthedocs.io/en/latest/)
has taken almost all of the friction out of using SQLite for trivial state
persistence.

I don't have to manually manage schema or create tables - they're all lazily
created from Dict keys on insertion, but indexes and SQL can still be used as
desired.

This lends itself really well to Jupyter notebook assisted development, where
functions can be quickly and interactively iterated without having to muck
around with existing tables whenever data changes shape.

It's been a real productivity boost, and I've been looking around for
something similar to use in Clojure.

~~~
vdm
Thanks for posting this, it's exactly what I was looking through this thread
for.

------
chrismorgan
How would SQLite go for something like Audacity projects? Audacity projects
are a pile of files: an XML file that declares the project metadata, tracks,
channels, envelopes (an example of non-destructive editing), and references to
the zillions of audio files that it drops in a directory beside the XML file
(typically less than 1MB each).

I presume it splits all the audio up into small files so that most types of
edits only need to touch a small area.

If you directly ported that to SQLite, would it work fairly well, or would you
want to restructure it somehow? Things like additions or deletions, would it
need to write lots of extra data to the disk (would it be doing something like
defragmenting, or would it grow larger than it should, or are there other
tricks that I don’t know about to delete a chunk from the middle of a file
without needing to rewrite all of the file beyond that point)?

~~~
catalogia
Adobe Lightroom seems comparable (non-destructive image editing.) It uses a
sqlite database that points to image files on the filesystem. I think
precisely the same approach could be taken with an audio editing application.

You could also consider putting the audio files into the sqlite db, which
_might_ work alright. I've heard of image thumbnails being stored in sqlite
dbs (maybe by Lightroom, iirc?), though I'm guessing those are probably
smaller than your audio clips.

------
teleforce
Shouldn't this have (2014) in the title? The story has also appeared several
times on HN before.

Very interesting concept but now I think perhaps application file format using
TileDB will be much better since it can support sparse data as well [1].

[1] [https://github.com/TileDB-Inc/TileDB](https://github.com/TileDB-Inc/TileDB)

~~~
HelloNurse
TileDB uses a "pile of files", as the article calls it; it doesn't have SQL,
and it has a rather embarrassing point of view on race conditions
([https://docs.tiledb.com/main/basic-concepts/consistency](https://docs.tiledb.com/main/basic-concepts/consistency)).
It could be a good choice for applications that rely on large and simple
multidimensional arrays, hardly the same "market" as application file formats.

Moreover, database tables are a _very_ good fit for sparse multidimensional
arrays.

~~~
teleforce
Perhaps SQLite, or any SQL engine, is not a good idea for an application file
format [1], [2], or is it just bad for web applications?

[1][https://news.ycombinator.com/item?id=23517663](https://news.ycombinator.com/item?id=23517663)

[2][https://support.google.com/chrome/thread/17537877?hl=en](https://support.google.com/chrome/thread/17537877?hl=en)

------
fatboy
We transitioned the file format for the images in our climbing-topo app from a
pile of files to SQLite (the images are tiled).

Going from having high hundreds of tiny files on disk per topo photo to just
one was an incredible boon to productivity for things like data backup and
transferring files onto the device for testing.

~~~
wetmore
Tell me more about your app

~~~
fatboy
It acts as a digital window into our print guidebook catalogue. The books are
mostly for the UK, with the rest spread across Europe.

I take the finished desktop-publishing documents and extract and package the
data to put into the app, including tiling the images so we can have very high
resolution photos on low-end devices.

There's a semi-technical article about the topo-view implementation here:

[https://www.ukclimbing.com/articles/features/rockfax_app_dee...](https://www.ukclimbing.com/articles/features/rockfax_app_deep_dives_part_0_topos-12679)

------
juskrey
Years pass, now decades, yet SQLite is still the most pleasant DB to work
with.

~~~
TedDoesntTalk
Agree, but it also side-steps some complex topics like sharding, replication,
high-availability, etc. I’m not suggesting those topics should be addressed by
SQLite; that simplicity is an asset.

Has anyone used SQLite remotely over a network?

~~~
mjw1007
When Subversion changed their working copy format to rely on an SQLite
database, I had a number of problems using working copies on network file
systems.

No doubt this was an indication that the network file system was incorrectly
configured, but I think the fact that it worked in practice with the file-
based format and didn't with the SQLite format is a strike against the idea of
using SQLite for what the user sees as saving a file.

~~~
m_sahaf
SQLite recommends against sharing it on NFS for multiple access because
locking is broken on NFS for all kinds of files, not just SQLite.

> But use caution: this locking mechanism might not work correctly if the
> database file is kept on an NFS filesystem. This is because fcntl() file
> locking is broken on many NFS implementations. You should avoid putting
> SQLite database files on NFS if multiple processes might try to access the
> file at the same time.

Source:
[https://www.sqlite.org/faq.html#q5](https://www.sqlite.org/faq.html#q5)

~~~
mjw1007
Fair enough, but that means if someone's following the advice in the article
("It deserves your consideration as the standard file format on your next
application design") they need to consider "will our users want to use our
files on network filesystems?".

~~~
yellowapple
You could always have a lock file alongside the main file to signal to other
instances of the app that the file's in use. A bit janky, but workable.

~~~
mjw1007
That's no worse than what applications often already do when they use a
simpler file format.

But you also have to worry about SQLite (reasonably) refusing to operate
because it tries to make a locking call and gets an error response. I think
there are options to turn that off, though.

------
overcast
What happened to storing blobs in a database being a giant rookie mistake? I'm
currently dealing with this at the office, where a SQL Server database has
grown to an unmanageable size from helpdesk software attachments.

~~~
ojnabieoot
Did you read the article? This isn’t about storing blobs as data in SQLite,
but rather using SQLite-formatted databases as the file format for an
application.

As a simple example, Word documents are just zipped XML text files (try
unzipping a .docx and looking inside). Instead of using this, you could use a
SQLite .db file (probably with a different extension), translating the XML
files into tables and folders into databases. The OpenOffice case study has
more details:
[https://sqlite.org/affcase1.html](https://sqlite.org/affcase1.html)

~~~
overcast
Apparently I'm misreading the following paragraph.

Any application state that can be recorded in a pile-of-files can also be
recorded in an SQLite database with a simple key/value schema like this:

    
    
        CREATE TABLE files(filename TEXT PRIMARY KEY, content BLOB);

~~~
ojnabieoot
I agree that is unclear for an introduction, and that the article could have
been written better. The “pile-of-files” they are referencing isn’t a
filesystem of arbitrary and unstructured binary data stored for humans to look
at later (like your Helpdesk attachments), but rather structured and specific
data for an application to read in a well-defined manner.

Your Helpdesk example used SQLServer problematically because a SQL database
shouldn’t be used as an arbitrary file store. But if you know what the file
structure is and have a reasonable grasp of how it might scale (that each
binary blob is small, that each user only adds one row to the database, etc),
there are huge advantages to “a SQLite table with lots of binary and text
columns” versus “a folder with lots of binary and text files.” And if those
text files are just small key-value pairs then maybe they should also go in
SQLite.
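
To make that concrete, the article's key/value schema in use might look like
this (the file names here are just placeholders):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # an on-disk path in a real application
conn.execute("CREATE TABLE files(filename TEXT PRIMARY KEY, content BLOB)")

# Store a "pile of files" as rows instead of loose files in a directory
conn.execute("INSERT INTO files VALUES (?, ?)", ("meta/title.txt", b"My Document"))
conn.execute("INSERT INTO files VALUES (?, ?)", ("images/cover.png", b"\x89PNG..."))
conn.commit()

# Reading one "file" back is a single indexed lookup
content = conn.execute(
    "SELECT content FROM files WHERE filename = ?", ("meta/title.txt",)
).fetchone()[0]
```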

------
6gvONxR4sf7o
I used to advocate for this, then tried opening a sqlite file I'd been using
on another computer and it said it was corrupted. I took it back to the
original computer and it opened just fine. So there was something going on
that made it more complex and less portable than something simple like json,
which is super frustrating because otherwise it's a really useful format for
datasets, and now I can't really trust it.

~~~
curiousmindz
If you use certain new features of SQLite, the database file gets "upgraded".
At that time, you must use a version of SQLite that understands these new
features.

So, the other computer you used was probably running an older version of
SQLite. Just update it to make it work.

~~~
bollockitis
That’s a good catch. Is there any way to tell which version of SQLite created
the file? Aside from trial and error, I mean?

~~~
lbutler
The version is stored as a 4-byte big-endian integer at offset 96 of the
header of the sqlite file[1] (strictly, it's the version that most recently
modified the file, rather than the one that created it).

I haven't looked but there will be some sqlite command to query it and I'm
sure some viewer tools will display it as well.

[1]
[https://www.sqlite.org/fileformat.html](https://www.sqlite.org/fileformat.html)

~~~
pjscott
You can print the version with this magical incantation:

    
    
        $ python -c 'import sys, struct; print(struct.unpack(">I", open(sys.argv[1], "rb").read()[96:100])[0])' foo.db

------
twarge
Quickbooks for Mac uses sqlite for its file format and this allowed users to
query the books with any SQL browser!

Alas, they encrypted the file starting two years ago.

------
praveen9920
The article has great points.

The main reason so many proprietary file formats exist is that enterprises
don't want just anyone accessing and modifying their files: that would greatly
reduce the usage of their proprietary software, and hence their revenues.

I remember the days when OpenOffice would try to render .doc files but the
formatting used to suck big time.

Open source software should leverage this kind of file format for
interoperability.

~~~
catalogia
> _Main reason for existence of lot of file formats is that enterprises don't
> want just about everyone access their files and modifying them. It greatly
> reduces the usage of their proprietary software hence their revenues._

I'm sure this does happen, but it seems more like MBA-paranoia than a
legitimate concern. For instance I sincerely doubt Adobe has lost any revenue
by using SQLite with lightroom, despite various open source tools being able
to interact with their lrcat (sqlite) files.

------
networkimprov
SQLite is a fine single-file application file format, but if you want average
users to be able to inspect and perhaps fix application data when things go
wrong, a tree of JSON-format files is more friendly.

True, this requires care to ensure that such files are updated reliably, but
that's not quite rocket science.
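
For example, the standard write-to-temp-then-rename trick covers most of it
(a sketch; how aggressively to fsync is a platform-dependent judgment call):

```python
import json, os, tempfile

def save_json_atomically(path, data):
    """Write JSON to a temp file in the same directory, then rename into place.

    os.replace() is atomic on POSIX (and for same-volume paths on Windows),
    so readers never observe a half-written file."""
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(data, f, indent=2)
            f.flush()
            os.fsync(f.fileno())  # make sure the bytes hit the disk first
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)
        raise
```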

------
bstar77
Is anyone doing this in Javascript (via electron) and encrypting the file in
such a way that the data is protected and not an obvious sqlite db?

I have some ideas on how to do this, but I'm curious if there's a "preferred"
way to do it.

~~~
pilif
SQLite has encryption plugin support, or you could encrypt the file yourself
at rest. However, no matter what you do, the application needs to be able to
decrypt the file, so you have to ship the key with it, and the protection can
always be circumvented.

It’s really not worth the effort IMHO

~~~
bstar77
That's a good point... I don't want to have the game rely on "phoning home" to
decrypt the db.

~~~
yellowapple
On top of that, even if you did have the game "phone home", that doesn't do
much against someone inspecting the memory of a running game; one could just
wait until the game does have the decryption key.

------
chrisbennet
I used it, with a different extension, to save application data for a product
I wrote.

Any classes that needed to be saved had serialize() and deserialize()
functions: serialize before writing to SQLite, and deserialize after reading
back from the DB.
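
Roughly like this, in Python terms (the class and schema here are invented
for illustration, not the actual product code):

```python
import json, sqlite3

class Widget:
    def __init__(self, name, size):
        self.name, self.size = name, size

    def serialize(self):
        # Any stable encoding works here: JSON, protobuf, pickle, etc.
        return json.dumps({"name": self.name, "size": self.size})

    @classmethod
    def deserialize(cls, blob):
        d = json.loads(blob)
        return cls(d["name"], d["size"])

conn = sqlite3.connect(":memory:")  # a .myapp file on disk in practice
conn.execute("CREATE TABLE widgets(id INTEGER PRIMARY KEY, data TEXT)")
conn.execute("INSERT INTO widgets(data) VALUES (?)",
             (Widget("knob", 3).serialize(),))
conn.commit()

restored = Widget.deserialize(
    conn.execute("SELECT data FROM widgets").fetchone()[0]
)
```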

------
tonyedgecombe
Maybe. I've used XML as a file format for all my recent projects. It's well
supported in the .Net framework, I don't need any 3rd party assemblies. Using
SQL to write out an object graph would involve more work.

~~~
gwbas1c
It depends on how large and complicated your XML file is.

Once your XML file hits a certain size, minor updates incur a gigantic
performance penalty because you have to write out the entire XML file every
time.

Random writes in the middle of an XML file are impossible. For example, if you
were to change an attribute so that it's one character longer, you still have
to rewrite the remainder of the file in order to shift everything over by one
character.

That's the main reason why SQLite is so popular for applications.

(I know that's from personal experience. I had to support an old version of an
application that constantly wrote out XML while the new version that we hadn't
shipped used SQLite. A customer that made heavy use of the old version
basically hit the limits of XML but the version that used SQLite wasn't ready
for them yet.)

But, I'm going to be honest here: We had some tables that only had a few rows,
so I moved those to XML files. QE really liked it because it was easy to
diagnose issues.

~~~
tonyedgecombe
I think it depends on whether you need a file format or a database. All the
benefits of a database are wiped out if you are implementing File->Save for a
document based application.

~~~
gwbas1c
File->Save is an extremely outdated workflow leftover from when disk drives
took multiple seconds and you couldn't edit the document while you were
writing to the disk.

Now your application should continuously save without any manual steps.

------
GnarfGnarf
I just converted a 90's-vintage xBase app to SQLite. I just love it. Should
have done this years ago.

The only thing I miss is the ability to jump X% into an index. Not a deal-
breaker, though.

------
em500
A lot of well-known apps are doing exactly this, using a sqlite db as the
internal file format: Whatsapp, WeChat, Google Keep, Apple iMessage, Notes and
Calendar...

Probably tons more if you care to dig around.

------
gjsman-1000
I worked with an SQLite Database on a personal C#/WinForms app. It was a bit
complicated to get started, but afterwards, what a handy trick in comparison
to using something like the discontinued SQL Server Compact.

I did, however, "enhance" it a bit (or "proprietarize" it) by encrypting it
with a short password and an AES algorithm with some uncommon settings. I
never shipped it that way, but the output file looked like any other
proprietary app's file - a mess of random symbols.

------
garganzol
Everything old is new again. I can recall a similar movement at the end of 90s
regarding databases as application file formats. That time it was Microsoft
Jet Database engine.

------
tracker1
Concurrent access to SQLite is the main reason I haven't really looked into
Deno. For that matter, I really think that WebAssembly file system
implementations should have enough flexibility to be able to work with a
SQLite database file in the host OS.

The needs of concurrent SQLite access pretty much cover the requirements for
a robust file system API.

------
rb808
I keep switching between csv, parquet and sqlite. I dont need another article
to encourage me to switch again! :)

------
AtlasBarfed
Does SQLite have a more text-friendly storage format? This would enable text
editors to do edits/modifications, which is critical for debugging, testing,
and learning. Save files don't need the same ongoing access optimizations, or
size optimizations for wire protocols, so a less efficient but text-editor-
friendly format would help this cause.

Also, does SQLite have native libraries for the major languages to read and
write it? XML (ugh), JSON, and YAML have managed to get decent implementations
in almost all major languages.

~~~
HelloNurse
Using a text editor to mess with a SQLite database doesn't make any sense. Do
you use text editors to edit, say, MPEG media?

SQLite databases can be edited with generic GUI and command-line tools, both
SQL-based and tabular editors, which are safer and more convenient than a text
editor could ever be.

------
chocolatkey
When I was reversing the XMDF E-Book format, I was surprised to find an
embedded SQLite database... in an E-Book?? After reading this article, it
makes more sense why someone would make that decision.

------
gwbas1c
One minor point that needs to be raised: Using SQLite as your application file
format locks you into the SQLite source code.

(And in most cases that's not a bad thing, it's free and open source.)

But, if you truly want an open file format, someone needs to be able to
independently write a program that can read your file without relying on third
party dependencies. This is why the browser vendors decided not to put SQLite
into the HTML and JavaScript standards.

It is an awesome product! I've worked with it for over a decade and I'm a fan!

~~~
rswail
> it's free and open source

It's more than "open source" it's in the _public domain_.

The file format is fully documented and if someone like ISO or ANSI wanted to,
they could make a standard out of it. It's also forward compatible since
inception and versioned.

The browser vendors decided not to put SQLite into their browsers because
"key/value good, SQL bad" and "not invented here".

IndexedDB is a clumsy reinvention with minimal ACID properties and isn't far
advanced from ISAM. They could have used the SQLite file format and
implemented IndexedDB as an API over the top. They could have allowed both
standards to be implemented and then let reality take its course to choose
which one was successful.

------
wodenokoto
To my very limited knowledge it is very common for iOS and android apps to
store data in SQLite.

Is it possible to extract or change these databases?

I have a few (offline) apps on my phone that I’d love to append data to

~~~
kccqzy
For unjailbroken iOS the only way to extract these databases is to make an
unencrypted device backup in iTunes (Finder in Catalina) and then look at the
backup file. The backed up files won't be conveniently named, but you can
still find them.

~~~
wodenokoto
Can I push my edits back onto my iPhone or is that a pipe dream?

~~~
kccqzy
I suppose it's possible, but you need to study the way backups are organized
and verified.

------
rlpb
Would such an app hold the sqlite database open while the user has the
document "open", and live-write user changes back to the database immediately?
Or would it follow the traditional model of the user choosing to "Save"?

I worked on an app that did the former many years ago (to an Access database,
not sqlite), and it did not go well because this broke user expectations on
the usual "open/save/save as" model.

~~~
catalogia
I think the former is preferable and most modern users will prefer it.
However, if your users are upset about it, you could do the latter, or you
could do the former and give them a dummy/placebo "save" option that does
nothing, or maybe VACUUMs the DB.

~~~
mark-r
The problem is not Save, the problem is when you close the app without using
Save - the user will expect that to work like Cancel and discard all session
changes.

------
me551ah
Where can you use SQLite?

Embedded Systems: Yes

Raspberry Pi : Yes

Mobile Apps : Yes

Desktop Apps : Yes

Browsers : No

Servers : Yes

Supercomputers : Yes

~~~
gwbas1c
You used to be able to use it in a browser with Google Gears.

The reason it was never adopted is that the browser makers wanted to be able
to independently implement their own databases instead of everyone having to
use the same source code.

------
michaelbuckbee
IIRC one of the attempts to deliver on "Github but for Design files" used
SQLite as an app file format.

------
blackrock
Is SQLite able to handle multiple processes that want to read or write to the
file database at the same time? Or nearly at the same time.

They say that SQLite is more in competition with fopen() than with a true
RDBMS.

------
zubairq
Yeah you can use SQLite as a file format. We use it to store our project data
at [https://github.com/zubairq/pilot](https://github.com/zubairq/pilot)

------
zzo38computer
They say that Microsoft Word, PowerPoint, etc are "fully custom formats", but
actually those formats are "wrapped pile-of-files formats" (and 7-Zip can open
them), although the contents of the files inside are custom formats (and they
seem to have failed to consider that category).

The application ID number in the SQLite header can be used to identify
application file formats. The application ID number is a 32-bit number, and
there have been a few different ways to handle it; I have seen the use of
hexadecimal and of ASCII; I used base thirty-six, and I have then later seen
the suggestion to use RADIX-50. Additionally, there is a document about
"defense against dark arts" in case you need to load untrusted files.

TeXnicard uses a SQLite database file (with application ID 1778603844) for the
card database file. The version control file (optional, and not fully
implemented yet) uses a custom format (which is fully documented), and it does
support atomic transactions. It consists of a header followed by a sequence of
frames, which are key frames and delta frames. The header of the version
control file contains two pointers: one to the beginning of the most recently
committed key frame, and the other one to the end of the most recently
committed frame (whether a key frame or a delta frame; if all frames are fully
committed, this will be equal to the length of the file). These pointers are
written only after the rest of the file is written; if it gets interrupted,
reads will ignore the partially written data, and further writes will
overwrite the partially written data.

ZZ Zero uses a Hamster archive of custom (but documented) formats as its world
file format. (A Hamster archive is zero or more "lumps" concatenated together.
A lump consists of a null-terminated ASCII filename, 32-bit PDP-endian data
size (measured in bytes), and then the data of that lump. The preceding text
in these parentheses is the full definition of the Hamster archive format; you
can use this to implement your own.)
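
That definition really is complete enough to turn straight into a parser;
here is a quick Python sketch (note that the PDP-endian size decodes as two
little-endian 16-bit words, high word first):

```python
import struct

def read_hamster(data):
    """Parse a Hamster archive: zero or more lumps, each a
    NUL-terminated ASCII name, a 32-bit PDP-endian size, then the data."""
    lumps, pos = {}, 0
    while pos < len(data):
        end = data.index(b"\x00", pos)
        name = data[pos:end].decode("ascii")
        # PDP-endian: 0x0A0B0C0D is stored as the bytes 0B 0A 0D 0C,
        # i.e. two little-endian words with the high word first.
        hi, lo = struct.unpack_from("<HH", data, end + 1)
        size = (hi << 16) | lo
        body_start = end + 5
        lumps[name] = data[body_start:body_start + size]
        pos = body_start + size
    return lumps
```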

Free Hero Mesh uses a "pile-of-wrapped-pile-of-files" format. A puzzle set
consists of four files: .class (which stores class definitions), .xclass
(which stores pictures and sounds to be used by the class definitions), .level
(which stores levels), and .solution (which stores solutions). The .class file
is a plain text file; the other three are Hamster archives. These are four
logically distinct parts of a puzzle set; this allows you to split them apart,
to create symlinks to share class definitions with puzzle sets, to substitute
your own graphics, to work with multiple solution sets (e.g. per user), etc.
If you need to do more than that, then you can of course extract the lumps if
needed. For class definitions, you can just copy and paste the text.

MegaZeux used a fully custom format before, but now it uses a ZIP archive with
the stuff inside being custom formats (one of which is the "MegaZeux Property
List" format, which I have documented in Just Solve The File Format wiki; the
authors of MegaZeux did not seem to document this format themselves anywhere, so
I figured it out and did it by myself).

For some cases, SQLite database is a good application file format; other
times, I think other formats (such as text formats) may be better. It depends
on the application. XML is too often used for stuff that isn't text markup
stuff, and XML is especially bad for stuff that isn't text markup stuff, I
think.

If you use SQLite though, you will get more than just the database access. It
also gives you the string builder functions, the sqlite3_mprintf function, a
page cache implementation, memory usage statistics, and a SQL interpreter; the
SQL interpreter can be used as one way to allow user customization and user
queries (including batch operations), without having to make an entirely new
scripting language to embed.

They mention that interfaces to SQLite are available for many other
programming languages, although at least one that doesn't seem to have an
interface to SQLite is PostScript (although you can use %pipe%, it doesn't
work so well, especially since it is only a one-way pipe), and I am not sure
if awk has one either.

~~~
chrismorgan
.doc, .ppt, &c. were fully custom. .docx, .pptx, &c. are ZIP files. Remember
that this was written in 2014, and many people were still using the old
formats then, because many people didn’t have Office 2007 or newer or software
that could cope with this new format. (Now I think very few people still use
the old Office formats.)

~~~
zzo38computer
No, .doc, .ppt, etc are wrapped pile-of-files formats, although not using a ZIP
container format. I also believe Visual Pinball uses the same container format
as Microsoft Word. 7-Zip just calls this format "Compound", although there are
other names for it, such as "DocFile" and "POIFS". I just tried now such a
file, and I got the following listing from 7-Zip:

    
    
         Date      Time    Attr         Size   Compressed  Name
      ------------------- ----- ------------ ------------  ------------------------
                          .....        35842        36352  WordDocument
                          .....          106          128  [1]CompObj
                          .....         4096         4096  1Table
                          .....         4096         4096  Data
      2007-12-25 22:33:00 D....                            ObjectPool
                          .....         4096         4096  [5]DocumentSummaryInformation
                          .....         4096         4096  [5]SummaryInformation
      ------------------- ----- ------------ ------------  ------------------------
                                       52332        52864  6 files, 1 folders

------
renewiltord
GeoPKG is an open format that is just a SQLite schema. It's cool!

------
catalogia
Is there any good way to use SQLite as an application file format while also
using it in WAL mode? They seem sort of mutually exclusive; near as I can
figure the way to do this would be to execute `PRAGMA journal_mode=DELETE`
when the user requests a save file, then copy the DB to the location the user
specifies. I think this negates a few of the advantages of using sqlite as an
application file format though.

~~~
Redoubts
I think the backup api or VACUUM INTO are the preferred ways, though you're
gonna probably want to turn off wal on the exported copy still.
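
Something like this (VACUUM INTO needs SQLite 3.27+; the pragma at the end is
the "turn off wal on the copy" step, and is harmless if the copy didn't
inherit WAL):

```python
import sqlite3

def export_copy(live_path, save_path):
    """Export a clean single-file snapshot of a WAL-mode database.

    VACUUM INTO writes a compacted copy without disturbing the
    live database's own journal mode."""
    live = sqlite3.connect(live_path)
    live.execute("VACUUM INTO ?", (save_path,))
    live.close()

    # Make sure the exported file uses a rollback journal, so it is
    # truly self-contained (no -wal/-shm sidecar files expected).
    copy = sqlite3.connect(save_path)
    copy.execute("PRAGMA journal_mode=DELETE")
    copy.close()
```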

~~~
HelloNurse
Shrinking the database is the equivalent of saving the file when the
application shuts down or saving an extra copy to make a backup.

------
hexmiles
Are there any other "file format oriented" data storage systems? I see
software tends to use zip files, sqlite, or good old OLE structured storage.

In my job we used a FAT filesystem as a storage system and recently switched
to sqlite, and while I love it I didn't really find any alternatives.

------
mythz
Where's a good place to find some useful public SQLite data sets?

------
barking
Are we talking about good old fashioned CRUD applications here or is this
something different?

------
geff82
Haha!! I thought it was only me using sqlite as my go-to file format in apps.

------
waynesonfire
What is an application file format?

~~~
detaro
The first section of the article explains it.

------
booleandilemma
SQLite is a database that tries so hard to not be a database :)

------
ori_b
Please no. This is an incredibly complicated file format that is not
particularly well suited for file saving. It's also very slow ((edit:
correction) ~10 transactions per second) if you aim for integrity, or unsafe
if you tune for performance (~50k transactions per second, but if your program
or computer dies half way, your file is hosed). So, keeping files up to date
in place makes your UI janky.

You can work around it by working in memory and writing out a whole new
database from in memory structures on save and then do the atomic rename. But
if you do that, you are probably better off with json, protobuf, or similar.
The libraries around these formats are similarly battle tested, but they fit
the needs better, supporting working in ram fully and then saving cleanly and
easily.
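
For what it's worth, the build-in-memory-then-atomic-rename workflow isn't
much code with SQLite either; a sketch using Python's backup API (names are
placeholders):

```python
import os, sqlite3

def save(mem_conn, path):
    """Snapshot an in-memory database to disk, then atomically rename it
    into place, so a crash mid-save never corrupts the existing file."""
    tmp = path + ".tmp"
    if os.path.exists(tmp):
        os.remove(tmp)
    disk = sqlite3.connect(tmp)
    mem_conn.backup(disk)  # copies the whole database page by page
    disk.close()
    os.replace(tmp, path)  # atomic on POSIX
```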

~~~
tzs
> Please no. This is an incredibly complicated file format that is not
> particularly well suited for file saving. It's also very slow (~100
> transactions per second) if you aim for integrity, or unsafe if you don't.

The kind of application files they are talking about (things like word
processor documents, spreadsheets, drawings, source code control system data)
would only be writing sporadically. During one of those sporadic writes they
might need to update thousands of rows but those could all be done in one
transaction.
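
E.g., even thousands of inserts cost a single fsync if they share one
transaction (a sketch; the schema is invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # or the document file on disk
conn.execute("CREATE TABLE cells(row INTEGER, col INTEGER, value TEXT)")

# One transaction = one fsync, regardless of how many rows change.
with conn:  # commits on success, rolls back on an exception
    conn.executemany(
        "INSERT INTO cells VALUES (?, ?, ?)",
        ((r, c, f"cell {r},{c}") for r in range(100) for c in range(10)),
    )
```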

~~~
ori_b
I made a mistake -- it's an order of magnitude slower. That means a single
transaction is within the threshold for human perception. If you block the UI
thread on a transaction, you're now dropping 6 frames at 60fps. If you block
other operations on the transaction completing, you end up with lag.

Reference: [https://www.sqlite.org/faq.html](https://www.sqlite.org/faq.html),
question 19. (I've seen similar when testing on SSDs locally).

~~~
lostmyoldone
If you have an FPS or interactivity target that needs to be guaranteed to be
better than about 0.1-1 second - depending on platform - you can't have any
disk IO at all on the UI thread, or any thread that's supposed to react at
interactive rates at all times.

It has generally become much better in the last decade or two, but one should
still expect most OS's to _sometimes_ pause for excessive amounts of time on
disk IO unless the API is specifically guaranteed to never pause. Even then
one would be wise to measure/log deviations if it's critical for the
application. OS guarantees might also be contingent on driver/subsystem
guarantees, and bad drivers might sometimes affect what seems completely
unrelated upstream systems.

~~~
ori_b
> _If you have an FPS or interactivity target that needs to be guaranteed to
> be better than about 0.1-1 second - depending on platform - you can't have
> any disk IO at all on the UI thread, or any thread that's supposed to react
> at interactive rates at all times._

Yes, and other file formats encourage doing things in memory, so you don't
have any disk i/o in the common path.

Using sqlite as a file format strongly discourages the simple, jank-free until
you press save workflow of slurping your content, operating on it in memory,
and then outputting it all in one operation as a response to an explicit user
action. Instead, your whole application gets small but perceptible delays
across all operations and interactions.

~~~
lxdesk
That conflates the use of the schema with the use of the format features. You
can, in fact, use your custom in-memory structures and then "slurp it out" to
BLOB when you're ready to save.

~~~
ori_b
Yes. But if you go that way, sqlite is just a slow and clunky format
compared to, eg, json or protobuf. I doubt that any program out there uses
sqlite this way, other than those which realize they painted themselves into a
performance corner and hacked around it for compatibility.

------
adamnemecek
I have been thinking about this as well lately. For my app
([http://ngrid.io](http://ngrid.io)) I'm using a custom built Entity-
Component-System
([https://en.wikipedia.org/wiki/Entity_component_system](https://en.wikipedia.org/wiki/Entity_component_system))
based UI framework in Rust.

In the ECS paradigm, everything is stored as struct-of-arrays as opposed to
array-of-structs
([https://en.wikipedia.org/wiki/AoS_and_SoA](https://en.wikipedia.org/wiki/AoS_and_SoA))
and as a result, your serialization becomes trivial. You are not chasing
pointers to other objects when serializing. Apple's Core Data takes the object
graph approach and it's a real clusterfuck.

ECS is very similar to databases in that your data is normalized and things
are referenced by offsets rather than pointers
([https://floooh.github.io/2018/06/17/handles-vs-pointers.html](https://floooh.github.io/2018/06/17/handles-vs-pointers.html)).

Why roll your own thing as opposed to use say SQLite? If you use SQLite, you
will probably have some OOP on top of it which introduces a serious impedance
mismatch. With ECS, you cut that whole layer out.

I'm guessing that I'm losing on some atomicity but for my use case that
doesn't matter that much.

~~~
lxdesk
ECS is useful for real-time, but it means that you are rolling your own query
logic. This is fine for the runtime of game engines because they generally
aren't dealing with that many data types (for the most complex AAA games,
perhaps a few hundred components) and the queries used at runtime are simple
and tend to demand optimization at the data structure level.

If your purpose is an editing tool you may want to reconsider. Editing changes
the goal in a very substantial way and the complexity of your queries goes way
up, which is where SQL syntax absolutely shines.

~~~
TeMPOraL
FWIW, a game I'm writing uses in-memory SQLite as its ECS, mostly because I
wanted to see if I could. SQLite is surprisingly performant. My game is not
realtime, but I tested it, and I could actually redraw the screen at 60+ FPS
while querying for render data every frame, and I still had lots of frame time
to spare.
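
A toy version of the idea, with component tables joined on entity id (the
schema here is invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE position(entity INTEGER PRIMARY KEY, x REAL, y REAL);
    CREATE TABLE sprite(entity INTEGER PRIMARY KEY, name TEXT);
""")
conn.execute("INSERT INTO position VALUES (1, 10.0, 20.0)")
conn.execute("INSERT INTO position VALUES (2, 5.0, 5.0)")
conn.execute("INSERT INTO sprite VALUES (1, 'hero')")

# "All entities with both a position and a sprite" is just a join --
# the ECS equivalent of iterating entities that have both components.
renderable = conn.execute("""
    SELECT p.entity, p.x, p.y, s.name
    FROM position p JOIN sprite s USING (entity)
""").fetchall()
```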

