SQLite as an Application File Format (2014) (sqlite.org)
567 points by afiori 30 days ago | 301 comments




The great thing about applications that use sqlite as their application file format is that you can write secondary apps and utilities to supplement the applications themselves.

For example, Adobe's Lightroom uses sqlite as their application file format, which means that it's almost trivial to write an application to help it do things it's either really bad/exceedingly slow at (like removing images from a catalogue that no longer exist on disk) or literally can't do (like generating a playlist of files-on-disk for a specific collection or tag), without being locked into whatever janky scripting solution exists in-app. If there even is one to begin with.

All you need is a programming language with a sqlite connector, and you're in the driving seat. And sure, you'll need to figure out the table schemas you need to care about, but sqlite comes with sqldiff so it's really easy to figure out which operations hit which tables with only a few minutes of work.
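To give a feel for how little is needed, here is a minimal Python sketch (the catalogue filename is hypothetical) that opens a catalogue read-only and dumps its schema, which is usually enough to start mapping operations to tables:

    import sqlite3

    # Hypothetical catalogue name; any SQLite-backed application file works the same way.
    con = sqlite3.connect("file:MyPhotos.lrcat?mode=ro", uri=True)

    # Dump every table's CREATE statement to get oriented in the schema.
    for name, sql in con.execute(
        "SELECT name, sql FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ):
        print(name)
        print(sql, end="\n\n")

    con.close()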

Good luck reverse engineering proprietary file formats in the same way!


I have an application that uses sqlite as its file format, but they've cleared the header so you can't open it directly.

I'd love to be able to make secondary applications like you've described, but since it's enterprise software, they don't want to make it too easy.

They obviously want to keep people locked in with their $40k per seat application!

I guess the first step is figuring out the page size and the other bits of metadata set in the header [1].

I know I just have to sit down and understand the format better and I will eventually figure it out...

[1] https://www.sqlite.org/fileformat.html
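For what it's worth, here is a small Python sketch (filename hypothetical; run it on a copy) for inspecting the documented 100-byte header from [1] to see exactly what was cleared. If only the 16-byte magic string was zeroed, writing it back onto a copy may be all that's needed; if more was cleared, the page size etc. would have to be inferred as you describe.

    import struct

    MAGIC = b"SQLite format 3\x00"   # bytes 0-15 of a standard database file

    # Hypothetical filename; work on a copy, not the live file.
    with open("copy-of-app-file.db", "rb") as f:
        header = f.read(100)         # the documented header is the first 100 bytes

    page_size = struct.unpack(">H", header[16:18])[0]
    if page_size == 1:
        page_size = 65536            # per the format docs, a stored 1 means 64 KiB

    print("magic intact:", header[:16] == MAGIC)
    print("page size   :", page_size)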


That's mean, but you'll just have to live with that. If someone takes active steps towards preventing people from making secondary apps/utils, then they're (a) jerks, and (b) allowed to do that.

Might be worth actually contacting them to ask why, if you can make the case that secondary applications will increase the value of their app, not decrease it.


I'm surprised that nobody's pointed out that there are actually valid reasons other than greed to obscure your file format. It's an implementation detail, not a contract. If customers begin relying on the implementation details, you end up with angry customers when you change the implementation details. A SQLite db without the header is basically a statement saying, "we are using the obvious file format here for our convenience, not for general purpose access. Screw around in here at your own risk."

If you modified their app's internal state db and screwed it up because they have designed their software with certain assumptions that aren't clear from just reading their db schema, that would be a nightmare for them to support. The easiest thing for them to do is just to try to discourage tampering with their internal state.

This is especially true if there's a chance that a market for secondary apps/utils will spring up. If that's to happen and be viable, they absolutely would want to put thought into what their supported interfaces are for those apps/utils, otherwise they will end up painted into a corner and unable to change their architecture without destroying a marketplace.


I really don't get this line of reasoning. If I do something with a product that is (maybe even explicitly) unsupported by the manufacturer, I don't have a reasonable leg to stand on. We recently had this with a customer who used internal calls of our app to automate some of their stuff (we had no public API for that action at that point because no customer had needed one so far, or to be precise: no customer was prepared to shell out the cash for us to develop it), and after we changed something they suddenly weren't able to script the creation of customer entries in their installation.

We had told them at least two years ago that the way they interacted with our system was not supported (we only noticed them automating stuff back then because it was throwing exceptions in our backend), and while we were nice enough to fix it this time because the fix was trivial, we recommended that they switch to our supported API. Two weeks ago their stuff broke again and we told them to use the API or fuck off.


So you see the problem and you've wasted time on it, and now you've come to the point where you're ok with being put in a situation where you have to tell your users to fuck off.

Just because they don't have a leg to stand on doesn't mean this isn't potentially a huge pain in your ass that you'd probably rather avoid. Imagine if you had a LOT of customers contacting you with this type of problem to the point where you felt like you were painted into a corner and had to support some ad-hoc APIs that weren't designed to be customer-facing and which you might have been planning to remove altogether because they're part of a design that's changing.

This is exactly the situation the obfuscation is attempting to avoid. They're just doing it with a technical solution rather than a human telling another human "don't do that."


I've never understood the "We can ship this but it's not contract" mentality. To my mind if you ship something it's contract. I think developers keeping that in mind would make the world full of more resilient code. You're making a contract w/ your future self, if nothing else.

To the point re: modifying internal state and screwing-up the application - If you're writing anything out to persistent storage you should assume that it's untrusted data when you read it back in. If for no other reason than physics itself is a malicious (or, at best, ambivalent) actor.


If your app ships with a dynamically loaded library and I dig through the exposed functions and find something undocumented that looks useful and I figure out how to use it, that's a contract to you? That's utterly insane.

Re: the concept of untrusted data, this is an off-in-the-weeds argument for argument's sake IMO. Do reasonable validations of the state data, sure, but picking nits about the nature of trust and internal application state is an infinite hole I'm not jumping into with you.


Sometimes a programmer may like to create such a contract, if only with their future self, but have reason to fear their boss will force them to break that contract in some not unlikely future.


+1. I recently made a variation of this argument on my FOSS app. [1] If it were commercial software with support I'd feel even more strongly.

Philosophically, people should be able to do what they want on their machines. But expecting support (eg figuring out how their third-party software has corrupted my database) is another matter, so I can see why people would install a speed-bump or two...

[1] https://github.com/scottlamb/moonfire-nvr/issues/44#issuecom...


>That's mean, but you'll just have to live with that

I mean I know we're all on board with the idea of intellectual property actually being a thing now, but surely there are limits? I've seen people take the hard-line stance that if something is your property you should be able to dictate exactly under what situation it can be used, but there have to be limits to IP holders rights on some level, and I feel like reverse engineering a file format is a pretty reasonable place to draw that line.


> we're all on board with the idea of intellectual property actually being a thing now

We most certainly are not. I personally believe that intellectual property as a whole doesn't make sense in the 21st century and should be abolished.

> there have to be limits to IP holders rights on some level

There are. The laws generally recognize fair use and reverse engineering for interoperability.

> I feel like reverse engineering a file format is a pretty reasonable place to draw that line

Absolutely. Unfortunately, in the US it seems corporations can force people to give up their rights by making them agree to it. Therefore, "you must not reverse engineer our software" is a standard clause in every contract and it's not negotiable.


I'd be more on board with it if it were taxed like other property. That patent's worth $1B, you say? Better let Accounting know. Conversely: you're suing for $1B for a patent violation but you only paid taxes on it being worth $300? $300 max it is, then.


> taxed like other property.

Most property owned by companies is not taxed.


Actually, physical property is taxed; at least in my state, both physical plant and inventory are taxed.


Interesting, I was not aware of those taxes.

https://taxfoundation.org/does-your-state-tax-business-inven...


Nice introduction. Thank you.

I think taxes are very difficult. Personally, I think when Google or Facebook places an ad on its own (public) platform, they should have to pay a tax similar to a sales tax, as they have effectively sold ad space to themselves. It probably shouldn't apply to the ads on the google.com homepage as they don't sell those, as far as I understand.

The article mentions neutrality in taxes which is either silly or disingenuous though. Our tax code is not neutral. We openly use tax code as a way to motivate public behavior.

Ad space as inventory would be an interesting concept. It is tricky though. What is the value of something before it is sold? What is a fair tax rate? How often does this inventory adjust as you have more or fewer users? Messy.


Yep, it's a great idea - treat intellectual property like any other property. However, corporate lobbyists would shoot this sort of legislation down, so I don't have any faith that it can ever come to fruition.


> we're all on board with the idea of intellectual property actually being a thing now

Admittedly I was being a bit facetious about that.


> I personally believe that intellectual property as a whole doesn't make sense in the 21st century and should be abolished.

Do you believe that models should have no right to be compensated if some corporation takes a picture of them off the internet and uses it in their own ad campaign?

Do you believe that McDonalds should be free to make and sell happy-meal toys of whatever kids’ movie is hot lately, even using the logo of the movie in their advertising, with no obligation to compensate the people who made the movie?

Do you believe that if an inventor comes up with a new system for drug delivery and tries to sell it to some pharma company—but the pharma company turns around and does industrial espionage to get access to the technique themselves—then the pharma company should be able to just walk away with the new technology, with the inventor left with no legal recourse?

Those are all “intellectual property”, too. IP isn’t just software patents and overextended copyright terms. A world truly without any IP law wouldn’t be a utopia for innovation; it’d be a dystopia of every middleman having the legal right to produce and sell their own fakes of everything, to the point that brands cannot exist.

It’d be a world where every store, even the brick-and-mortar ones, even the ones selling things like drugs, work like shopping on Wish/AliExpress.

It’d be a world where e.g. the Coca-Cola company makes fake Pepsi products that look exactly like the ones PepsiCo makes but which taste much worse, and puts them in stores beside the real PepsiCo products, to get Pepsi drinkers to buy those, taste them, think “Pepsi isn’t the same any more”, and switch.

It’d certainly be a world where every drug is just a generic, because you couldn’t maintain a drug brand in the face of identical dups—but it’d also be a world where even the generic store brands of drugs could be switched out at every step of the supply chain for cheaper/nastier alternatives, with no legal consequence (as long as the resulting drugs still met FDA standards.) Without IP law, there’d be no legal recourse to the suppliers who did that. It’d be, at best, a contractual dispute; and so would effectively always come down to the relative depth-of-pockets of the buyer vs. the elements of their supply chain.

Is that a world you want to live in?

Certainly, we need IP reform. But not even the most hardcore libertarian really wants to live in a world where every kind of IP right is unilaterally abolished. Having global capitalism entirely unfettered by IP rights, is like having a car entirely unfettered by brakes. You don’t get a faster car; you get a car that crashes into trees a lot.


More competition can't lead to nastier alternatives; more competition leads to both higher quality and lower prices. And the power of copyright, which protects pretty much just the wealth of rich corporations, doesn't simply disappear. Taking it away creates new power for everyone else, equally distributed, leading to more competition, more independent creators and small companies, not more rich corporations, as those won't be able to concentrate wealth without it.

The world you are describing is impossible, and part of it already exists precisely because of the power of copyright, not the other way around (models are already screwed and have to give up rights to corporations; they certainly aren't powerful enough to monitor where their images are used. Inventors too have to give up rights to corporations and do get their stuff stolen; remember how Google did that? And that was just one public occurrence).


> More competition can't lead to nastier alternatives

Pre-FDA "patent medicines" beg to differ. Austria's wine industry (https://en.wikipedia.org/wiki/1985_diethylene_glycol_wine_sc...) begs to differ. Those were competitive markets! And they certainly increasingly optimized for something as competition increased. What they increasingly optimized for, though, was the naive experience that made people buy the product (e.g. taste; "feeling good"), at the expense of health/welfare outcomes. It turns out that some poisons taste good, and make your product more popular!

Over the long term, statistically, yes, companies that do bad things in the name of profitability probably get boycotted and die.

In the medium term, though, the people that bought the products die as well. That's a necessary step in that long-term equilibrium. The "state of nature" of a capitalist market is one where companies cut corners until the corners kill people, and then people get mad and get together to kill those particular companies.

This is a dynamic equilibrium though. It's not that you end up with corporations afraid to cut corners. You end up with a constant stream of new corporations forming, going well for a while, cutting corners, killing people, and then being taken apart. It's the reverse of P.T. Barnum's "a sucker is born every minute": an unethical entrepreneur is born every minute, to take advantage of those suckers. On average, the set of corporations in existence at any given moment would be half "one step away from killing people", and half "already killing people but nobody's noticed yet."

And indeed, this is how things were for much of the Victorian era (with cyanide-based paints, nitrocelluloid plastics, and other such already-well-known hazards continuing to be sold on the open market) up through to the 1950s.

Being humans rather than animals, we have the unique capability (if not often the motivation) to learn an object lesson without having to personally suffer its negative consequences even once. We can see someone else burn their hands on a stove, and then make a rule about not touching stoves, such that nobody ever has to actually burn their hand on a stove to personally re-derive that rule again.


I apologize but I cannot continue this thread. I've been warned before about ideological discussion on HN so I shall limit myself to expressing my opinion without elaborating.


I didn't take "you'll just have to live with that" as a claim that the user couldn't reverse engineer it. It simply means that they've made it harder for the user to do so, and so the user has to live with it being harder.


this.


> I know we're all on board with the idea of intellectual property actually being a thing now

I don't know that that's true at all! I'd say the pendulum is swinging in the opposite direction.


>we're all on board with the idea of intellectual property actually being a thing now

No. Intellectual property is not genuine property. It is a state granted monopoly and is antithetical to free market principles.


"Intellectual property" isn't a thing, "Intellectual Property Rights" are. They're not "antithetical to free market principles" if you accept that the "free market" requires regulation and that there are benefits to society by the cost of temporarily restricting free use with the subsequent benefit of ideas and expressions not being lost on the death/loss/bankruptcy etc of the IPR holder.

IPRs are (in general) things that are protected by law and open to licensing and civil suits for damages if used outside those laws and licenses:

* Trade Secrets and NDAs

* Copyrights

* Patents

* Trademarks


NDAs, etc. are contract law. Contracts need to be entered into without duress as they would be felonies otherwise. The others are societal contracts which operate much more akin to taxes from a philosophical point of view. But calling today's copyright law a temporary restriction is cynical at best.


Every government protection starts as a temporary relief. Then the corporations that benefit from those regulations use the extra revenue to lobby for more exclusively beneficial regulations that harm their competitors. The result is copyrights that last for infinity minus one years, infantile industries requiring temporary protections yet never leaving infancy, and regulatory boards staffed by CEOs from the industries they regulate. Copyright law, much like communism and eating ice cream for dinner every night, sounds good only in isolation without a historical perspective.


They aren't saying you have to live with the legal restriction, but rather the technical restriction of having to reverse engineer it without the headers


I think the limit is when they are using their IP to prevent you from owning your own data.


Even when a person is on board with intellectual property rights, there is still a distinction between the vendor created program and the user created data. Encoding the data in an undocumented or obfuscated file format may not exert legal rights over that data, but it effectively does so.

I can only see three reasons why someone would support intellectual property rights with respect to file formats: they believe the manipulation of data done by software implies a transfer of ownership of the data, at least in its modified form; they are making a cynical grab for control over the data; or they are incredibly naive.

(There are border cases, such as novel compression schemes, where how the data is stored is the product. That does not really matter when someone is using a file format as a simple container for data. If a file format is truly a border case, there should also be ample forewarning to the end user.)


I can understand why they do it: it's enterprise software, so the more open you make it, the fewer licences they can sell.

They do have a module you can purchase to run API calls and access their files/software but as you probably guessed that's another $40k license!

Most of the apps I build use this API, but for me to provide them to other companies, those companies also need to buy the API extension.

I'd love to cut out the middle man and I'll do it eventually when I reverse engineer the header!


Although circumventing security measures would be illegal in many countries (like removing the copy protection from a game).


At $40K/seat, that must be some special software. Simple economics invites competition. Depending on the software, it may not take too many seats to incent the funding of an alternative.

Maybe this is the kind of software that requires huge development costs. But maybe it would be worth 20 seats' worth of customers joining forces to fund a team of 5 people to build you a competing app tailored to your specific needs/wants and completely under your control.

Granted, that could bump your costs from $800K/year to $1.6M+/year. But only short-term. Once your software is production-ready, you drop the costs of your current software. So think of it more like going from $8M/10 years to $6-10M/10 years but having complete control to add the features you want. And perhaps having the opportunity to recoup $millions/year by licensing to others. Or, open source it and give others the same kind of control while benefiting from the features they add. Spread your development costs across more seats to further lower your $/seat.

Or, look at the 100 employees your vendor currently has and lose heart, then hope somebody with deep pockets funds a competitor.


This particular software is for water utilities to model and simulate their water and waste water networks.

It's mostly used so utilities can forecast growth in their areas for the next 25+ years and see the impact on their networks and feed into their capital work projects.

A decently sized utility may spend up to $200M/yr on capital works so $40k isn't even a line item!

There is competition in the market, but consultants are forced to use what their clients pick and most utilities aren't that price sensitive.

There are also open source alternatives by the EPA[1][2], and most commercial operators are just wrappers around this public domain software.

I'm trying to create FOSS to help view and run these models.

[1] https://en.m.wikipedia.org/wiki/EPANET

[2] https://en.m.wikipedia.org/wiki/Storm_Water_Management_Model


Thanks for sharing! Your FOSS projects [1] look awesome, wish you the best in building them out. I’m fascinated by examples of niche but critical software applications. It reminds me of Patrick MacKenzie’s thesis that there are more B2B software opportunities out there than you’d expect. [2]

[1]: https://github.com/modelcreate/

[2]: https://twitter.com/patio11/status/1213188681843892224


Thanks! If you're interested I wrote an article listing all 7 FOSS apps I created this year on LinkedIn [1], I also have a site for my main library [2].

I'll eventually convert part of one of these into a B2B app to keep what I do sustainable, but I want to keep as much as possible free and open source.

[1] https://www.linkedin.com/pulse/seven-water-modelling-apps-on...

[2] https://epanetjs.com


It sounds like they are using just a basic Linear Regression to forecast growth.

And to the uninitiated, this sounds like a very fancy word, that makes the software seem smart.

But really, it’s just drawing a straight line. And as you add more counts to your x axis, the Linear Regression “forecasts” what the value is on the y axis. This is the magical number, given by the computer, and is used to determine future load or capacity needs.


shrug I used to own a company (since sold) that sold software for $50k/seat. Most people who bought it also bought the automatic failover option, which was another $50k/seat.

The movie business doesn't even blink at that sort of cost if there's even a small chance to prevent having to set up the remote shot again. The logistics, time, hiring, transport, accommodation, equipment, wages, etc. etc. etc. all make $50k a drop in the ocean.

We spent 2 years writing the software, developing the add-on hardware that helped, and touting it around various Post-production houses. It was used on Star Wars I, The Matrix, etc. Post houses started to take it on board as well. Then we were bought, and the product discontinued. C'est la vie.


I’ve been thinking that the next realm for movie studios is to use the random AI face generated people, that look real, but don’t exist.

These faces can be planted on other fake AI bodies, that move like real people.

Then, what you have is a background full of fake AI people. They look like real people. They move like real people. No more need to hire extras.

You can just have your primary actors act in a green screen. And virtually change the world all around them.

The first company that can commercialize this, is going to make a ton of money. And might even be able to gain first-mover advantage, as they lock in all the studios.


I might be wrong but I thought Dropbox used to do that as well. All the sync state is stored in a sqlite file with the header changed or removed.


Completely agree - I wrote about some fun I had with Apple Photos last month: https://simonwillison.net/2020/May/21/dogsheep-photos/


Enjoyed this post a lot and had no idea Apple uses SQLite under the hood for a lot of stuff - thanks for sharing!


Most Mac apps are backed by a SQLite database.


It opens up so many possibilities. Nice write-up!


Heck, I wrote https://github.com/Pomax/lightroom-catalog-helper exactly because lrcat is an open format that takes barely any time to figure out, which means the only limit to what you can do with "lightroom" is what your programming skills allow for.


The only thing stopping SQLite from becoming the primary application file format is browsers. Every single other client supports SQLite. Embedded, Mobile, Lightroom, Desktop Apps and Cross platform mobile frameworks have really good support for SQLite.


On the one hand, browsers dislike access to files (regardless of format) for security reasons. On the other hand, browsers use SQLite a lot. For example Firefox uses SQLite at least for IndexedDB (various files) and for history and bookmarks (places.sqlite): https://developer.mozilla.org/en-US/docs/Mozilla/Firefox_Ope...


Wasn't there an effort a few years back to try to create a W3C standard for accessing data that stalled because all of the browsers used SQLite and no one wanted to create a different implementation? IIRC, the standard required two different implementations and SQLite was used by everyone?


You're thinking of WebSQL. My take on it dying was a combination of "key-value stores can replace relational semantics and key-value stores are cool", "only oldsters use SQL", and "wheel-reinventing is fun". The whole "we need multiple independent implementations" was a convenient excuse to kill it in my view.



Primary application file format for... what kind of files? Because it's certainly not going to (nor should it ever) replace hypertext documents.

Being able to load a read-only .sqlite database might seem cool, but I can't think of a single instance in which that's smaller and/or more efficient than calling data endpoints that use gzip/brotli for transport.

Or are you thinking "as replacement for IndexedDB"? In which case, hard agree but then it's an actual database, not used as file data container.


> sqldiff

You just blew my mind. Thank you!

I can't believe the things I find out from random HN comments.


I tried to do that with a Firefox SQLite database file but it had an exclusive lock.


This is the case for any application that constantly writes to their dbs, so you usually run your own application/utility on them when the "parent" application isn't running.


You can usually work around that by creating a copy - I run "cp /path/to/locked.db /tmp" all the time for this.


Make sure to also copy locked.db-wal if it's present. (Is there a better way to do all this? I feel like there really ought to be a better way.)
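One option that handles both the lock and the -wal file is SQLite's online backup API, e.g. via Python's sqlite3 (paths are examples; it still won't help if the application holds a truly exclusive lock):

    import sqlite3

    # Snapshot a live database (including pending -wal content) with the
    # online backup API instead of a plain cp.
    src = sqlite3.connect("/path/to/locked.db")
    dst = sqlite3.connect("/tmp/snapshot.db")
    src.backup(dst)
    dst.close()
    src.close()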


Is it any different from JSON or XML?


that's like asking whether .zip is any different from .txt - yes, it's quite a lot different. A database-that-is-a-file-on-disk, used as an application file format for containing arbitrary -and arbitrarily complex- data, is almost nothing like a plain text markup document.


Databases, XML and JSON are structured formats. They contain embedded metadata (tag names/keys/table&column names) so reverse-engineering is much easier.

I don't really see any difference or advantage of using SQLite over XML, unless you need full RDBMS engine power for your configuration (highly unlikely).


Then it kind of feels like you're willfully ignoring the differences just so you can claim "they're structured formats".

1. SQLite database files are self descriptive, which JSON and XML are not.

2. SQLite is a compact binary format. JSON and XML are plain text.

3. building on that, SQLite files were designed to contain arbitrary data, and you can store whatever you want using the BLOB datatype. JSON and XML can't, they have no notion of types, everything has to be syntax-compliant strings.

4. SQLite files can be encrypted. JSON and XML can't.

5. SQLite files were designed to be queried. JSON and XML are not.

And I already hear you try to object, so let's expand:

1. JSON and XML cannot describe their own structure, their syntax is prescribed, but any schema has to come from either external files, like XML's DTD, or from literally nothing because JSON has no official schema language. Yes, https://json-schema.org/ exists, but it's certainly not an official spec - maybe "yet", maybe just "full stop".

2. Yes, you can compress JSON/XML using zip/etc but now it's no longer JSON/XML. Now it's an archive file like any other archive file, and it's "nothing" you can work with until you unpack it, incurring delays, and then repack it once you're done, incurring even more delays (because packing is far more work than unpacking), and worse: the bigger the file gets, the longer the delay becomes.

3. Sure, you can convert binary to a text format and then put that in JSON/XML too, but there are no types: you have no way of universally indicating which field is plain text and which field is binary-as-text, nor a way to universally indicate which encoding you've used.

4. Same as (2): sure, you can encrypt the data yourself, but there is no universal spec for indicating what encryption you used for which fields, or parts of the file, or the entire file, in JSON or XML.

5. JSON and XML are data serialization formats: they are perfect for transport, but they are not data repositories and even at the spec level make zero affordances towards efficient data retrieval, storage, and representation. Which for an application file format is critical.

Can you use JSON/XML as application file formats? No, not really. They're terrible for that purpose. They might work for configs (as you point out), and they are fantastic for moving small quantities of data from one system to another (for large quantities, they make no sense: even something as dumb as CSV becomes more efficient when dealing with lots of data) but they are absolutely unsuitable for storing arbitrary application data, because they are horrendously inefficient for almost everything an application needs out of a good file format, without rolling your own standards, for which there is no enforcement tooling unless you write that yourself, too. And then get others to adopt it as well. And now you're started down the path of replicating what SQLite already does, and doing so poorly.
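To make points 3 and 5 a bit more concrete, here is a small Python sketch (file and table names invented) that stores a raw BLOB next to typed columns and pulls back a single record without parsing the whole file; with JSON/XML you'd be base64-encoding the bytes and re-parsing the entire document:

    import sqlite3

    con = sqlite3.connect("project.appdata")   # hypothetical application file
    con.execute("CREATE TABLE IF NOT EXISTS assets"
                " (id INTEGER PRIMARY KEY, name TEXT, data BLOB)")
    con.execute("INSERT INTO assets (name, data) VALUES (?, ?)",
                ("thumbnail", b"\x89PNG\r\n\x1a\n...raw bytes..."))
    con.commit()

    # Pull back one record without reading or parsing the rest of the file.
    blob = con.execute("SELECT data FROM assets WHERE name = ?",
                       ("thumbnail",)).fetchone()[0]
    print(type(blob), len(blob))
    con.close()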


With XML, the schemas don't have to be external - if you're designing the format, you can accommodate for embedded XSD (i.e. the schema is inline, and references itself).


It's very annoying when applications use sqlite with encryption. The encryption extension is not open source, so regular language bindings cannot be used.


I'm not sure how sqlite as your application file format has anything to do with open source, honestly. It's certainly super easy for open source projects to use, but a closed source proprietary application using an encrypted sqlite file sounds perfectly sensible to me?

We all win when folks decide to leave it accessible, but I'm not going to hold "encrypting a file format so that people can't easily reverse engineer it" against folks who are trying to sell software.


It looks like the actual library to read/write an encrypted SQLite DB is licensed separately from the open source SQLite[1], so even if you have the encryption key, you can't access the encrypted format:

1: https://www.sqlite.org/see/doc/release/www/readme.wiki


> I'm not going to hold "encrypting a file format so that people can't easily reverse engineer it" against folks who are trying to sell software.

I think I'd mostly disagree. Selling an application is one thing, but the data itself is usually customers' and holding their data hostage is not a proper thing to do.


That's a tenuous argument at best, heavily relying on the application in question not offering any way to export your data, which many applications do come with. Just because the file format is locked, doesn't mean the data in it is locked. You just need the application to unlock it.


This is self contradictory.

Listen to what you're saying: the data isn't locked, you just need the... application, to... unlock it.

That's beyond tenuous, it's invalid.


Almost as if different clauses can impart different context to the same word. The data is not locked, provided you have the application that manages that data. If you want to call it "locked" then "the application itself turns it into unlocked data for you".

That argument that a proprietary file format "locks" data ignores the fact that it's the pair {application, file} that determines whether your data is locked away or not. For instance: Microsoft Word data? Not locked. You can trivially export it in quite a lot of ways. Can you easily get it from a traditional .doc file? Hell no, but that doesn't mean the data itself is inaccessible. Instead of using sqlite, or xslt, or Perl, or whatever, you use Word.


You can say it's invalid, but it's the fact of the matter, the reality. Since the data in question is only in the app because the owner of the data has access to the app, the data is not locked away from the owner: they have access to the export function.


I guess he/she meant that the extension that encrypts the database is proprietary. I believe that was the only option for a while but now there is an open source alternative.


Is the open-source alternative bidirectionally interoperable with the closed-source version (i.e. it can decrypt the proprietary encryption and the proprietary version can decrypt the FOSS version)? If so, then that's handy.


The "officially blessed" (my wording) encryption for SQLite is SEE (proprietary):

https://www.sqlite.org/see/doc/release/www/index.wiki

There are third party encryption approaches for SQLite. One of the most popular is SQLCipher:

https://github.com/sqlcipher/sqlcipher/

Those two don't have compatible file formats, i.e. no interoperability.

Not sure if any others do, but yeah it would be handy if there are.


“Atomic transactions” is a feature that needs formal support in random file formats way more often than people realize. Simply writing to a file at all in a guaranteed-atomic way is much harder than it looks. That guarantee becomes important when your app gets widely distributed. If you have a million users of your free mobile app, 1-in-a-million events happen every day. For example: random hardware shutdown midway through a file write operation. How is your app going to react when it reads back that garbled file?

I’ve used SQLite on mobile apps to mitigate this problem. I’ve used LMDB on a cloud app where the server was recording a lot of data, but also rebooting unexpectedly. Would recommend. I’ve also gone through the process of crafting an atomic file write routine in C. https://danluu.com/file-consistency/ It was “fun” if your idea of fun is responding to the error code of fclose(), but I would not recommend...


POSIX (and I think Windows) guarantees that this atomically and durably overwrites a file:

    tempfile = mkstemp(filename-XXXXXX)
    write(tempfile)
    fsync(tempfile)
    close(tempfile)
    rename(tempfile, filename)
    sync()

Assume the entire write failed if any of the above return an error.

In some systems (nfs and ext3 come to mind), you can skip the fsync and/or sync, but don’t do that. It doesn’t make things significantly faster on the systems where it’s safe, but it definitely will lose data on other systems.

The only loophole I know of is that the final sync can fail, then return anyway. If that happens, the file system is probably hosed anyway.
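For reference, here is roughly the same recipe as a self-contained Python sketch; it fsyncs the containing directory in place of the final global sync(), which is the commonly recommended variant, and the durability caveats above still apply:

    import os
    import tempfile

    def atomic_write(path: str, data: bytes) -> None:
        """Write data to a temp file, then atomically rename it over path."""
        dirpath = os.path.dirname(os.path.abspath(path))
        fd, tmppath = tempfile.mkstemp(dir=dirpath,
                                       prefix=os.path.basename(path) + "-")
        try:
            os.write(fd, data)
            os.fsync(fd)                    # flush the new file's contents
        finally:
            os.close(fd)
        os.replace(tmppath, path)           # atomic rename over the target
        dirfd = os.open(dirpath, os.O_RDONLY)
        try:
            os.fsync(dirfd)                 # persist the rename itself
        finally:
            os.close(dirfd)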


Here's the article I couldn't find before on the process https://lwn.net/Articles/457667/

Which links full source for the process you described https://lwn.net/Articles/457672/


You need a recovery step on startup to retry the rename if tempfile is complete, or delete it if it isn't.

That means you need a way to verify that tempfile is complete. I do that by removing filename after completing tempfile. And that requires a placeholder for filename if it didn't already exist (e.g. a symlink to nowhere).

On crash, rename may leave both files in place.

This technique doesn't work if you have hardlinks to filename which should refer to the new file.


Regardless of whether the tempfile is complete, you can just ignore (or delete) it on startup. From the caller's perspective, the save operation doesn't succeed until the rename is done and written to disk.


I second the recommendation of LMDB. With one important caveat: under heavy write load it is a perfect demonstration of the brokenness of the semaphore implementation on FreeBSD and macOS.


In what way is LMDB better than eg SQLite or Redis? For what kinds of use cases would you recommend it?


It's very simple. Single C file implementation. Binary blob keys : binary blob values. The end. If all you need is a bunch of blobs written and read back reliably while in a statistically unreliable situation, LMDB is great.
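A minimal taste of that, using the third-party py-lmdb binding (path and keys made up):

    import lmdb   # third-party py-lmdb binding

    env = lmdb.open("/tmp/example-lmdb", map_size=1 << 30)   # 1 GiB map

    # Writes are transactional: either the whole put is visible or none of it.
    with env.begin(write=True) as txn:
        txn.put(b"frame:0001", b"\x00\x01\x02 ...arbitrary binary blob...")

    with env.begin() as txn:
        print(txn.get(b"frame:0001"))

    env.close()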


Well, sqlite is also just a single .c with a single .h.



You're both right... kinda.

SQLite is dozens of files, if you're browsing or modifying it. Which you can do, as long as you're OK with it being impossible to contribute your changes back, since it's open-source but closed-contribution.

If you're compiling it, it is, in fact, one .c and one .h:

https://www.sqlite.org/amalgamation.html

If you're choosing between LMDB and SQLite, the latter sense is the relevant one, so they're identical along this particular axis.


LMDB and SQLite are not directly comparable. LMDB is a transactional B+tree-based key/value store. SQLite is an implementation of a transactional SQL data model on top of a B+tree-based key/value store, so it is logically at least one abstraction layer higher than LMDB. (Key/value stores underlie pretty much all of the other data models you'll ever use.)

That aside - LMDB is not just smaller, faster, and more reliable than SQLite, it is also smaller/faster/more reliable than SQLite's own B+tree implementation, and SQLite can be patched to use LMDB instead of its own B+tree code, resulting in a smaller/faster footprint for SQLite itself.

Proof of concept was done here https://github.com/LMDB/sqlightning

A new team has picked this up and carried it forward https://github.com/LumoSQL/LumoSQL

Generally, unless your application has fairly simple data storage needs, it's better to use some other data model built on top of LMDB than to use it (or any K/V store) directly. (But if building data storage servers and implementing higher level data models is your thing, then you'd most likely be building directly on top of LMDB.)


For my applications: latency. The fact that writers do not block readers is just a nice bonus.


Do you reckon LMDB would be reasonably performant compared to raw mmapped files for zero-copy passing of large images between processes? I want something like a ring buffer but to mitigate the risk of use-after-free if the consumer lags. It seems like LMDB automatically re-uses memory that's long been freed but is smart enough to detect that an incoming write needs more space.

Python/c++.

I've looked into mmap and flock but it's messy and not highly portable.


LMDB is in fact not much more than a large mmapped file plus a clever synchronization mechanism on top of it, which also serves as the dictionary implementation and transaction mechanism. My uses of LMDB are essentially replacements for a mmapped file, and LMDB gives me the ability to sanely do partial updates.


Sounds exactly like what I'm looking for, definitely will be digging into it more. Thanks!


> For example: random hardware shutdown midway through a file write operation. How is your app going to react when it reads back that garbled file?

Don't filesystem journals ensure that you can't get a garbled file from sudden shutdowns?


They ensure you don't get a garbled filesystem.

They also expose an API that allows you, if you're very careful and really know what you're doing (like danluu or the SQLite author), to write performant code that won't garble files on random shutdowns. But most programmers at most times would rather just let the OS make smart decisions about performance at the risk of garbling the file, or if they really need Durability, just use a library that provides a higher level API that takes care of it, like LMDB or an RDBMS like SQLite.

To not get your file garbled, you need to use an API that knows about legal vs. illegal states of the file. So either the API gets a complete memory image of the file content at a legal point in time and rewrites it, or it has to know more about the file format than "it's a stream of bytes you can read or write with random access".

Popular APIs to write files are either cursor based (usually with buffering at the programming language standard library level, I think, which takes control of Durability away from the programmer) or memory mapped (which realllly takes control of Durability from the programmer).

SQLite uses the cursor API and is very careful about buffer flushing, enabling it to promise Durability. Also, to not need to rewrite the whole file for each change, it does its own journaling inside the file* - like most RDBMSs do.

* Well, it has a mode where it uses a more advanced technique instead to achieve the same guarantees with better performance


> They ensure you don't get a garbled filesystem.

Well, they do that, but they also protect data to reasonable degrees. For example, ext3/4's default journaling mode "ordered" protects against corruption when appending data or creating new files. It admittedly doesn't protect when doing direct overwrites (journaling mode "journal" does, however), but I'm pretty sure people generally avoid doing direct overwrites anyway, and instead write to a new file and rename over the old one.

I'm not sure if it would protect files that are clobbered with O_TRUNC on opening (like when using > in the shell). I would imagine that using O_TRUNC causes new blocks to be used and so the old data isn't overwritten and it isn't discarded because the old file metadata which would identify the old blocks corresponding to the file would be backed up in the journal.

> They also expose an API that allows you, if you're very careful and really know what you're doing (like danluu or the SQLite author), to write performant code that won't garble files on random shutdowns.

As far as I see for the general case, being "very careful and really knowing what you're doing" consists of just avoiding direct overwrites. Of course, a single file that persists data by the needs of software similar to a web server (small updates to a big file in a long-running process) is going to want the performance benefits of direct overwrites. I can totally see SQLite needing special care. However, I don't think those needs apply to all applications.


When I said "allows you, if you're very careful and really know what you're doing, to write performant code that won't garble files", by "performant" I was alluding to direct overwrites. If you don't need direct overwrites (perhaps because your savefiles are tiny), then no problem. If you do, you should use SQLite or LMDB or something, unless you work at Oracle or somewhere else where your job is to compete with them.

The example I had in mind was Word, which gave up on direct overwrites and managing essentially their own filesystem-in-a-file in favor of zipped XML, which is really good enough when writing a three-page letter, but terrible when writing a book like my mother is. Had they used SQLite as a file format, we would've gotten orders-of-magnitude faster saves on software billions of people use every day.


> As far as I see for the general case, being "very careful and really knowing what you're doing" consists of just avoiding direct overwrites.

That's a dangerous thing to say. There are many ways to mess up your data, without directly overwriting old data.

If you write a new file, close, then rename, on a typical linux filesystem, mounted with reasonable options, on compliant hardware, I think you should have either the old or new version of the file on power loss, even if you don't sync the proper things in proper order, but that's only because of special handling of the common pattern. See e.g. xfs 0 size file debacle.

Not an expert.


> Don't filesystem journals ensure that you can't get a garbled file from sudden shutdowns?

For once, I have a blog post that goes in detail into why that is not true!

(Or rather, not true unless you implement it in a very sophisticated manner rarely used in practice.)

https://espadrine.github.io/blog/posts/file-system-object-st...


Not really. If you have a file format that requires, e.g., changes to be done in two places then it's reasonable to write to one place, have the system shut down never having written to the second place, and now have a corrupt file.

The journal ensures (helps ensure?) that individual file operations either happen or don't and can improve write performance, but it can't possibly know that you need to write, e.g., two separate 20TB streams to have a non-corrupt file.


For a single file, I thought that write operations were committed when e.g. closing the file or doing fsync, but now I'm not sure. I wonder if the system is free to commit immediately after a write() ends.

Based on your scenario, if an application-level "change" involves updating 2 files, interpreting the update of only one file and not the other as a corruption, you're right that filesystem journaling wouldn't suffice. However, in that case it wouldn't be that a single file was corrupted.

Still, I wonder about the other case, about when the filesystem decides to commit.


The system is free to commit immediately after a write() -- fsync or closing the file simply guarantees commits.

>Based on your scenario, if an application-level "change" involves updating 2 files

It could be two parts of the same file too. E.g. if you're using something like recutils with a single file to implement double-entry accounting and only commit one entry. You'll at least be able to detect the corruption in that case (not that you can in general), but you won't be able to fix it using only the contents of the file.


Windows had a feature (Transactional NTFS) that provided snapshot-based transactions across the entire file system (NTFS-only, though). Unfortunately, it has been deprecated... it's such a shame that we can't seem to move on from the 40-year-old notion of a filesystem, apparently.


I've been reading through many of the other pages on the SQLite website. I didn't realize that SQLite is developed by a company (Hwaci) and funded by selling support, licenses, and testing services. I had always assumed it was an open source project backed by a non-profit, similar to the Apache Software Foundation.

This is another very interesting example of an open-source business. I would be interested to learn more about how the Hwaci company operates (revenue, number of employees, etc.). I find this very interesting:

> We are a 100% engineering company. There is no sales staff. Our goal is to provide outstanding service and honest advice without spin or sales-talk.

They list some "$8K-50K/year" and "$85k" price tags directly on their "Pro Support" webpage. These would usually be behind a "Schedule a Call" or "Get Quote" button. I've been thinking about doing something similar with my own on-premise licenses and support contracts. I'm not very good at sales and I don't really want to hire a sales team, so I'd be interested to know how this worked out for them.

I also liked this sentence, which is very similar to Basecamp's philosophy (and both companies were started around the same time - 1999 vs 2000):

> Hwaci intends to continue operating in its current form, and at roughly its current size until at least the year 2050.

It's interesting to think that SQLite could have raised money and grown into a billion-dollar public company with thousands of employees.

I'm going to listen to this Changelog interview with Richard Hipp now [2], and also this talk on YouTube [3].

[1] https://sqlite.org/prosupport.html

[2] https://changelog.com/201/

[3] https://www.youtube.com/watch?v=Jib2AmRb_rk


Also important to note is that enterprise pricing is way higher than you'd think because when an enterprise needs a solution, something like $1000 is literally too cheap to justify the cost of procurement (which involves not just getting a license, but also internal documentation, which in turn needs to be entered into whatever process certification programme is in place, etc.)


That’s why many companies have a policy where licenses below a certain value can just be expensed.


As someone who has had to evaluate and recommend enterprise software for purchasing, a straightforward price tag is going to make me view your product in a better light.

Plus, I have to wonder how much extra profit the "call us" route actually takes in, after you've subtracted costs for the marketing staff it requires (especially if you try to renegotiate the cost after the subscription/license expires).


Thank you for writing up about the company behind SQLite. I never knew!

I knew SQLite is well-respected by many, considered one of the best examples of software engineering. I just assumed that it was created by an individual or small team of brilliant minds - and developed/maintained by a user community - as such well-designed software often is.

The company sounds great. Their approach to business is refreshing, and reminds of a few other exemplary companies with principles, daring to tread their own path to success.

As an addendum: Hipp, Wyrick & Company, Inc. (Hwaci) is based in North Carolina, USA.

http://hwaci.com/


Yes, it's a very smart business model. SQLite is extremely useful at all levels; I get it for free (not even merely open source, they took pains to dedicate it into the public domain, which is part of why they don't take patches), and enterprises pay to have their procurement boxes ticked, legal documents signed, and so forth.

The hypothetical Hwaci which tried to be a trendy billion dollar company would be so much worse, and you probably wouldn't even be able to buy support. No one would benefit but institutional money.


From experience, I can also recommend using SQLite as an application file format. I landed on SQLite after looking for solutions for a file format for an educational app we made for simulating biological neural networks. The app is cross-platform, written in Qt and the simulations needed to be stored as a JSON describing the network, a thumbnail and some metadata. It was also intended to be extensible with more features and backwards compatible if new versions were released. I considered creating our own simple format, using ZIP files, HDF5, Qt resource files or SQLite.

I landed on SQLite for many of the reasons outlined in this article and in particular because of how easy it was to implement and maintain. SQLite is supported natively in QtSql, which made it extremely easy to write the save and load functions, and later extend these with more data fields. In addition, we did not have to worry about cross-platform support since this was covered by SQLite and Qt already.


It sounds like future schema changes may be a potential concern for your application. One thing you can look into using is the SQLite user_version pragma. We use this right now to roll our own migrators and it's light years better than how migrators work for Entity Framework, et. al.

https://www.sqlite.org/pragma.html#pragma_user_version
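For anyone curious, here is a minimal sketch of that pattern in Python; the migration statements and file name are placeholders:

    import sqlite3

    # Ordered migrations; position i produces schema version i + 1.
    MIGRATIONS = [
        "CREATE TABLE networks (id INTEGER PRIMARY KEY, json TEXT, thumbnail BLOB)",
        "ALTER TABLE networks ADD COLUMN created_at TEXT",
    ]

    def migrate(con: sqlite3.Connection) -> None:
        version = con.execute("PRAGMA user_version").fetchone()[0]
        for i, stmt in enumerate(MIGRATIONS[version:], start=version + 1):
            con.execute(stmt)
            con.execute(f"PRAGMA user_version = {i}")   # pragmas don't take ? params
            con.commit()

    con = sqlite3.connect("simulation.nnapp")   # hypothetical file name
    migrate(con)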


Interesting, I used HDF5 in a similar situation because we needed to save a lot of same-sized rows of data (simulation time steps), so a matrix-oriented format seemed to make sense but it wasn't entirely without some need for cross-referencing between tables, so it does make me wonder now if sqlite would have been a comparable or better choice. Any reason for rejecting HDF5 in your case?


Is it cheap to insert/update/delete data in HDF5? If not, that should be the answer. I'm also curious if Parquet would fit your requirements.


SQLite is amazing – trivial to embed in almost every context I've tried, rock solid, incredibly flexible, performs well enough for most use cases and for the others it probably still performs better than the solution I had in mind. I wouldn't dream of using anything else for application data, unless you need something text based for human readers as well, and can't deal with having people use a SQLite client. And even then, really consider whether you need this.

With SQLite embeddable in websites thanks to wasm, and the ability to create object URLs it's also pretty trivial to make full blown (read-only) SPAs delivered as a single HTML file. This last bit might seem crazy – and it kind of is – but if your clients are all on a LAN in a corporate network so bandwidth and latency aren't really an issue it makes a bit more sense.

I love SQLite, hands down one of my favorite tools.


I think SQLite is underrated.

With WAL mode + increasing the page cache you can get some excellent concurrency, even if doing reads and writes at the same time.
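For reference, those settings are just pragmas, e.g. from Python (values are illustrative):

    import sqlite3

    con = sqlite3.connect("app.db")              # example database
    con.execute("PRAGMA journal_mode=WAL")       # readers no longer block the writer
    con.execute("PRAGMA cache_size=-64000")      # negative = KiB, so roughly 64 MB
    con.execute("PRAGMA synchronous=NORMAL")     # common pairing with WAL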

With rqlite it's easy to make it a server database and have a cluster of SQLite databases (https://github.com/rqlite/rqlite).

I wouldn't try to create the new Instagram with it, but I think it'd be capable enough for many apps that are built on top of more complex DBs.


You are right! In our Delphi/Pascal community we have https://github.com/synopse/mORMot, which is a full client/server DB, remote ORM and RPC system based on top of SQLite.


While we are on the topic of file formats, I stumbled across a cool project the other day that aims to make parsing well-known binary formats much easier: http://kaitai.io/

In high school I was obsessed with the game Starcraft. It came with its own map editor that would save to its own proprietary format. Some smart people came along and reverse engineered that format and allowed us to do all kinds of neat things we weren't supposed to. I see modding communities for games are more popular than ever, and finding this project brought back lots of great memories.


I'm super-on-board with declarative binary parsing schemes (i.e. not lua or python) but my own attempt to use kaitai was met with the same frustration that presumably caused PDF to be missing from the format gallery list (http://formats.kaitai.io/): jump or offset-based structures

At the time, I tried using a stunt like defining one huge blob that "eats" the main file, then reaching back into it as we learned more, but it looks like somewhere along the way they acquired a "substream" behavior (https://github.com/kaitai-io/kaitai_struct_doc/blob/c53060f7...) so maybe it's worth another look


Can you explain more? I deal with a shocking number of poorly supported binary formats, and this tool looks awesome. Buuuuttt... a lot of these formats are “headers with offsets to structured data”. They're intermixed logs, so multiple incompatible binary formats are freely mixed in the same file.


I can most easily speak to the PDF example, which is harder than the target I was going after due to it being a mixture of text and binary, with a lot of optional behavior (EOL is CR or LF or CRLF seemingly at random, for one painful example)

I took a few minutes just to kick the tires on the startxref of PDF, to get a feel for how the substream business plays out, and then stopped when I got to the part about how the offset position is written in a dynamically sized ascii string but represents an offset in the file

    meta:
      id: pdf
      file-extension: pdf
      endian: le
    seq:
    - id: magic_bytes
      contents: '%PDF-'
    instances:
      startxref_hack:
        type: startxref_hack0
        pos: _root._io.size - 24
        # pick a reasonable guess to wind backward
        size: 24 - 5
      eof_marker:
        # contents: '%%EOF'
        type: str
        encoding: ASCII
        size: 5
        # this isn't strictly accurate,
        # due to any optional CRLF trailing bytes
        pos: _root._io.size - 5

    types:
      eat_until_lf:
        seq:
        - id: dummy
          type: u1
          repeat: until
          repeat-until: _ == 0xA

      startxref_hack0:
        seq:
        # this isn't accurate, since we may have jumped
        # into "endobj\n" or worse
        - id: junk
          type: eat_until_lf
        - id: startxref_kw
          contents: 'startxref'
        - id: startxref_crlf
          type: eat_until_lf
        - id: startxref_offset
          type: str
          encoding: ASCII
          terminator: 0xA

As best I can tell, the actual definition would involve a hypothetical `repeat: until-backward` where it starts at EOF (-5 in our case, due to the known EOF constant), reads backward until it hits LF (and/or CR!), captures that as the startxref offset, reads backward eating CR/LF, skips backward `strlen("startxref")` bytes, and that's when the tomfoolery starts about reading the "xref" stanza, which, again, is an ASCII description of more binary offsets, using zero-prefix padded numbers because of course it does.

Don't get me wrong -- it's entirely possible that kaitai is targeting _strictly binary_ formats written by sane engineering teams, but the file format I was going after had a boatload of that jumping-around, repeating structs-of-offsets, too, so my holding up PDF as a worst-case example isn't ludicrous, either


There's also "poke" which is actively developed by a hacker on GDB and GCC:

http://www.jemarch.net/poke

https://kernel-recipes.org/en/2019/talks/gnu-poke-an-extensi...


Amongst all the praise we're seeing for this approach here, may we please just have a minute's silence for the various exploits sqlite has seen when loading arbitrary database files over the years?

My recent favourite being https://media.ccc.de/v/36c3-10701-select_code_execution_from...


This concern was just raised on the SQLite Forum (probably after showing up here). See my reply at https://sqlite.org/forum/forumpost/8beceed68e for additional insights into the problem and recent SQLite enhancements to address it.


Absolutely, there are many things that can be done to mitigate this kind of thing, I'm merely advising caution over people immediately reaching for a solution where they're executing thousands upon thousands of lines of code over untrusted files when they can get away with something much simpler.


Interesting. That one only affects systems which add their own application-specific SQLite functions to the database after loading it.

That's probably a niche case which doesn't affect most systems. But it'd be good to know about and watch out for in the systems that do.


Is there any way to mitigate code execution issues (besides just fixing bugs)? Because allowing code execution might be a pretty big deal breaker in certain contexts.


This is a security flaw of the system or application using SQLite, though. It is trusting data it should not trust and executing code provided with that data.


If I use libpng or libjpeg to decode an image and that causes a buffer overrun or another security vulnerability, that is very much a bug in libpng or libjpeg.

The same is true when I open a database in SQLite: if that causes a security problem, it's a bug in that library. I don't even see how you could validate a database file before you hand it over to SQLite.


The exploit mentioned above relies on third-party functions being loaded into the database, e.g. application-specific functions.

Those functions have to be explicitly loaded by the application after loading the database; they're not stored in the database and loaded with it.

You'd validate the database (beyond the standard integrity checking) by loading it and not adding any third-party functions. Then check it's not doing anything dodgy with views, or outright disable views. Then load the third-party extensions (if needed).

If you wanted to go even further, you could also add an authorizer callback function:

https://www.sqlite.org/c3ref/set_authorizer.html

That's called (multiple times) any time a SQL statement is prepared/executed, and catches things like functions being run, tables being accessed, etc. You can use that to only allow a whitelisted set of functions to run, which would help in some scenarios. eg:

https://github.com/sqlitebrowser/dbhub.io/blob/e1c5042f857e9...
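
In Python's built-in sqlite3 module, a rough sketch of that kind of whitelist might look like this (the function names are just placeholders, and it assumes your Python version exposes the SQLITE_FUNCTION action code):

    import sqlite3

    ALLOWED_FUNCTIONS = {"count", "sum", "length"}   # placeholder whitelist

    def authorizer(action, arg1, arg2, db_name, source):
        if action == sqlite3.SQLITE_FUNCTION:
            # For SQLITE_FUNCTION callbacks, arg2 carries the function name
            return sqlite3.SQLITE_OK if arg2 in ALLOWED_FUNCTIONS else sqlite3.SQLITE_DENY
        return sqlite3.SQLITE_OK

    conn = sqlite3.connect("untrusted.db")           # hypothetical untrusted file
    conn.set_authorizer(authorizer)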

The "Defense Against Dark Arts" page on the SQLite website has further good info about this:

https://www.sqlite.org/security.html

As a data point, I implemented the above "Defense against Dark Arts" stuff recently for an online SQLite data publishing platform to let people run free form queries on databases (dbhub.io). It wasn't all that hard to implement. Much less so than I'd been expecting. :)


Huh? If you're using SQLite as an application file format, isn't opening untrusted files a key use case? What else, never share files with people? That must be a pretty boring application.


Sure but just like a web browser, it’s the application’s responsibility to sandbox or verify the code, not the http library’s.


That translates to "don't use SQLite", right?

How would an application that uses SQLite as a file format be able to scan for malicious database files that trigger buffer overflows in the SQLite engine? I'm really not sure what you're suggesting.


From the description (I didn't watch the video), loading the database and querying it is enough to run the exploit. To compare with your example, it would be like having a remote exploit if your application queries a specific http endpoint.

For sure, application developers could sandbox the http library, sqlite, or stop using libraries developed in such unsafe programming languages, but it's a bit too early for that.


I was surprised, but also intrigued when I discovered that Manga Studio (Clip Studio these days) used sqlite for the native file format, including the pixel planes.

I wish my predecessor at Krita had made that choice, instead of choosing to re-use the KOffice xml-based file format that's basically a zip file with stuff in it. It would have made adding the animation feature so much easier.


So isn't it possible for Krita to create an updating process and switch? Btw. it's really fantastic Krita is on Android now!


What's wrong with xml + zip?


XML-based formats are export formats, not state-keeping formats. To use an XML-based format for storage, you need to have a separate, canonical in-memory representation of the data, which you then snapshot and serialize into XML upon request. You may or may not be able to get away with serializing less than your full in-memory object graph upon save, using techniques similar to DOM reconciliation. Either way, you'll still likely need your entire document/project represented in memory.

If you're working with something analogous to a text document, this snapshot-and-serialize approach to saving works fine. If you're working with other types of data, though, this approach only works for trivial projects; once your document exceeds ~100MB, the overhead of snapshotting+serializing your object graph becomes bad enough that people stop saving very often (dangerous!), and it also makes the saving process itself more fragile (since the longer a save takes, the more likely it becomes that the process might be killed by some natural event like a power cut during it†.)

And, once your project size exceeds the average computer's memory capacity, an in-memory canonical representation quickly becomes untenable. You start to have to resort to hacks like forcing the user to "partition" their project, only allowing the user to work with one piece at a time.

With an application state-keeping format, you have none of these concerns; the store is itself the canonical data location. You don't have a canonical in-memory representation of the data; the in-memory representation is simply a write-through or write-back caching layer for the object graph on disk, and the cache can be flushed at any time. Or you may not have a cache at all; many systems that use SQLite as a file format just do SQL queries directly whenever they want to know something, never instantiating any intermediate in-memory representation of the data itself, only retrieving "reports" built from it.

† You can fix fragile saving with a WAL log, but now the WAL log is your true application state-keeping format, with the XML format just being a convenient rollup representation of it.


> it also makes the saving process itself more fragile (since the longer a save takes, the more likely it becomes that the process might be killed by some natural event like a power cut during it†.)

This is one I take very seriously, after I got bit by it. I was saving state by writing s-expressions to a text file; it seemed a reasonable enough thing to do even with tens of megabytes of it, until my laptop turned off in the middle of a write. After recovering from a backup and losing several hours of work in the process, I switched to SQLite that evening.


I've never had my problem scale to the size that required a database/SQL, but I don't quite get the advantage of your solution. Having all your interactions with data have to go to disk through a cache muddles things b/c it makes it much harder to reason about performance (b/c when do you have a cache miss? and how do you configure a cache properly?). You introduce a lot more black-magic variables to reason about.

If you're editing images I'd think it'd just makes more sense to have all of your stuff in RAM and then a saving-to-disk is done on a separate thread. I don't quite get why the users would stop saving in this example.

I'm not saying you're wrong - but more asking for some more details b/c I've never imagined using a DB on data that can fit in RAM


It's primarily a problem of inflexibility handicapping performance, not of "cache misses" and clever algorithms.

For example, imagine a word processing program opening a document and showing you the first page: you could load 50MB of kitchen sink XML and 250 embedded images from a zip file and then start doing something with the resulting canonical representation, or you could load the bare minimum of metadata (e.g. page size) from the appropriate tables and the content that goes in the first page from carefully indexed tables of objects. Which variant is likely to load faster? Which one is guaranteed to load useless data? Which one can save the document more quickly and efficiently (one paragraph instead of a whole document or a messy update log) when you edit text?
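
To make that concrete, a hypothetical schema and first-page query might look something like this (table and column names invented purely for illustration):

    import sqlite3

    conn = sqlite3.connect("document.db")   # hypothetical single-file document

    # Invented schema:
    #   CREATE TABLE metadata(key TEXT PRIMARY KEY, value TEXT);
    #   CREATE TABLE paragraphs(page INTEGER, ordinal INTEGER, body TEXT);
    #   CREATE INDEX paragraphs_by_page ON paragraphs(page, ordinal);

    page_size = conn.execute(
        "SELECT value FROM metadata WHERE key = 'page_size'").fetchone()
    first_page = conn.execute(
        "SELECT body FROM paragraphs WHERE page = 1 ORDER BY ordinal").fetchall()

Only the rows needed to render page one ever get read; the 250 embedded images stay untouched on disk until they're actually needed.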


ah okay, incremental loading seems essential and I hadn't considered it. Thanks for explaining :)


One of the famous SQLite refrains is:

> SQLite does not compete with client/server databases. SQLite competes with fopen().

This undersells SQLite somewhat. Like Berkeley DB, SQLite was created as an alternative to dbm [1] and one of the main use cases is safe multi-process access to a single data file, typically CLI apps (written in C, TCL, etc.).

Client-Server databases tackle multi-user concurrency while embedded databases often tackle multi-process concurrency.

This article has long been part of the SQLite documentation found under the "Advocacy" section. There is also a short version. [2]

[1] https://en.wikipedia.org/wiki/DBM_(computing)

[2] https://sqlite.org/aff_short.html


I don't personally know anything wrong with it, but SQL queries are powerful powerful powerful, so although XML might not have any huge issues, in my personal opinion it is still sub-optimal compared to what is possible with SQL (at least for my uses, which could be very different to yours, so take this with a grain of salt). In particular, SQLite is a tiny executable with a HUGE amount of tests and is used very widely, so it's reliable for a lot of tasks as long as the data size stays relatively small and you cast your types properly.

SQLite also has a lot of tutorials and books and language support. I've used it with Python, C#, Perl, TCL, and Powershell with no issues. You can access it via the command-line or you can hook into it with a fully graphical SQL IDE like DB-Visualizer (I really recommend using an IDE for interactive SQL use). If your language doesn't have built-in or library support, even a novice programmer like myself can roll a few functions together to build the tables, update, delete, and run queries to analyze the data if you can run some system commands. It's a wonderful little technology that I feel comfortable reaching to when I need it.

One thing I've shied away from over the years is technologies that require running complex installers, as it makes things more confusing and makes it harder for me to share with colleagues who aren't as interested in programming. Both SQLite and DB-Visualizer require zero installation. I just put each in a folder and then run the executable. This is really easy for me to use and easy to get others started with too. Note that this is not commercial software, but internal business apps to help people do complex tasks more easily. So you have a script that does some data processing and pushes that data to SQLite, and then the user can bring up DB-Visualizer, point it to the little SQLite .db file I created, and get to work. We have a lot of little apps like this, and since most of our engineers are really good with SQL, they can do whatever they need efficiently.


Seems very hard to seek in XML? And how would you update it in place? Also seems like it would be massive?


Well, we're not storing pixels in xml, but now we've got two xml files, one with author data, one describing the image, and a folder tree containing various kinds of binary data, all in a zip file. And yes, updating in place is a bother.


It's really hard to update in place, we basically have to rewrite the whole file after a change, and, of course, we've got to do a lot of things ourselves, like inventing an xml format for describing the image.


Do people really pick XML as a go-to format these days? Probably easier to list what's not wrong with it


Yes, unfortunately, if you're in an environment where people have been using XML forever and are more comfortable with it than any other alternative. When you find yourself needing to write an XPath to update a simple app configuration parameter, you know you're in hell.


Well, these days 22 years have passed since the original file format was designed -- and not by me, though I probably would have done the same thing, 22 years ago.

Also, it's zip + xml files + binary files, not all xml.


Well, I was responding to the "what's wrong with xml + zip" question, not your remark on your predecessor


Recently I've been using SQLite to store state for a data-syncing CLI utility in Python where I was considering using CSV or JSON at first. The Python 'Dataset' library, https://dataset.readthedocs.io/en/latest/ has taken almost all of the friction out of using SQLite for trivial state persistence.

I don't have to manually manage schema or create tables - they're all lazily created from Dict keys on insertion, but indexes and SQL can still be used as desired.

This lends itself really well to Jupyter notebook assisted development, where functions can be quickly and interactively iterated without having to muck around with existing tables whenever data changes shape.

It's been a real productivity boost, and I've been looking around for something similar to use in Clojure.
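
If it helps anyone considering it, the basic pattern looks roughly like this (the file, table, and key names are just examples):

    import dataset

    db = dataset.connect("sqlite:///sync_state.db")   # example file name
    table = db["synced_items"]                         # table is created lazily

    # Columns are inferred from the dict keys on first insert
    table.insert(dict(item_id="abc123", last_synced="2020-06-01", status="ok"))

    # Queries still work as usual; raw SQL is available via db.query(...)
    row = table.find_one(item_id="abc123")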


Thanks for posting this, it's exactly what I was looking through this thread for.


How would SQLite go for something like Audacity projects? Audacity projects are a pile of files: an XML file that declares the project metadata, tracks, channels, envelopes (an example of non-destructive editing), and references to the zillions of audio files that it drops in a directory beside the XML file (typically less than 1MB each).

I presume it splits all the audio up into small files so that most types of edits can only need to touch a small area.

If you directly ported that to SQLite, would it work fairly well, or would you want to restructure it somehow? Things like additions or deletions, would it need to write lots of extra data to the disk (would it be doing something like defragmenting, or would it grow larger than it should, or are there other tricks that I don’t know about to delete a chunk from the middle of a file without needing to rewrite all of the file beyond that point)?


Adobe Lightroom seems comparable (non-destructive image editing.) It uses a sqlite database that points to image files on the filesystem. I think precisely the same approach could be taken with an audio editing application.

You could also consider putting the audio files into the sqlite db, which might work alright. I've heard of image thumbnails being stored in sqlite dbs (maybe by Lightroom iirc?) though those are probably smaller than your audio clips I'm guessing.


Shouldn't this have (2014) in the title? The story has also appeared several times on HN before.

Very interesting concept, but now I think an application file format using TileDB might be even better, since it can support sparse data as well [1].

[1] https://github.com/TileDB-Inc/TileDB


TileDB uses a "pile of files", as the article calls it, it doesn't have SQL, and it has a rather embarrassing point of view on race conditions (https://docs.tiledb.com/main/basic-concepts/consistency). It could be a good choice for applications that rely on large and simple multidimensional arrays, hardly the same "market" as application file formats.

Moreover, database tables are a very good fit for sparse multidimensional arrays.


Perhaps SQLite, or any SQL engine, is not a good idea for an application file format [1][2], or is it just bad for web applications?

[1]https://news.ycombinator.com/item?id=23517663

[2]https://support.google.com/chrome/thread/17537877?hl=en


That's not at all similar; it doesn't support SQL.


We transitioned the file format for the images in our climbing-topo app from a pile of files to SQLite (the images are tiled).

Going from having high hundreds of tiny files on disk per topo photo to just one was an incredible boon to productivity for things like data backup and transferring files onto the device for testing.


Tell me more about your app


It acts as a digital window into our print guidebook catalogue. The books are mostly for the UK, with the rest spread across Europe.

I take the finished desktop-publishing documents and extract and package the data to put into the app, including tiling the images so we can have very high resolution photos on low-end devices.

There's a semi-technical article about the topo-view implementation here:

https://www.ukclimbing.com/articles/features/rockfax_app_dee...


Years pass, now decades, yet SQLite is still the most pleasant DB to work with.


Agree, but it also side-steps some complex topics like sharding, replication, high-availability, etc. I’m not suggesting those topics should be addressed by SQLite; that simplicity is an asset.

Has anyone used SQLite remotely over a network?


It lives in your process. It has no network features. It's not a DBMS.

People have built layers over it though


I think the main reason SQLite doesn't have them is because it's not a goal for the project. SQLite would be monumentally more complex if it supported networking, sharding, etc. If you need those use cases then perhaps it's a good idea to switch to a more scalable DBMS. My favorite is PostgreSQL. If I'm building something which will have user-generated content then I always pick Postgres. Otherwise, SQLite.


I am working on a flash card app and specifically did not want to have to create a SaaS to support cross-device data sync, so I've opted instead to create a journal of the data mutations on each device and have those (i.e. it's a CRDT) uploaded to a simple file store (like Dropbox) for each device to share. Each device stores its own local data using SQLite, but does a merge on all the journaled commands to stay up to date.

So, basically I rolled my own sync solution that just happens to use SQLite. It's worth noting that I started this project by storing the data in my own format (JSON, at first) and quickly realized it was growing too large and was taking too long to serialize/deserialize. I'm very glad to have ended up using SQLite instead because it is super easy to use and has been reliable.


You can use rqlite for that:

https://github.com/rqlite/rqlite

You can have a single over-the-network database or a cluster of replica dbs (through Raft consensus).


When Subversion changed their working copy format to rely on an SQLite database, I had a number of problems using working copies on network file systems.

No doubt this was an indication that the network file system was incorrectly configured, but I think the fact that it worked in practice with the file-based format and didn't with the SQLite format is a strike against the idea of using SQLite for what the user sees as saving a file.


SQLite recommends against sharing it on NFS for multiple access because locking is broken on NFS for all kinds of files, not just SQLite.

> But use caution: this locking mechanism might not work correctly if the database file is kept on an NFS filesystem. This is because fcntl() file locking is broken on many NFS implementations. You should avoid putting SQLite database files on NFS if multiple processes might try to access the file at the same time.

Source: https://www.sqlite.org/faq.html#q5


Fair enough, but that means if someone's following the advice in the article ("It deserves your consideration as the standard file format on your next application design") they need to consider "will our users want to use our files on network filesystems?".


You could always have a lock file alongside the main database file to signal to other instances of the app that the file's in use. A bit janky, but workable.
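
A minimal sketch of that kind of advisory lock file (purely illustrative, with no stale-lock handling; note that even O_EXCL has historically been unreliable on some NFS implementations):

    import os

    def acquire_lock(db_path):
        lock_path = db_path + ".lock"
        try:
            # O_EXCL makes creation fail if another instance already holds the lock
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.write(fd, str(os.getpid()).encode())
            os.close(fd)
            return True
        except FileExistsError:
            return False

    def release_lock(db_path):
        os.remove(db_path + ".lock")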


That's no worse than what applications often already do when they use a simpler file format.

But you also have to worry about SQLite (reasonably) refusing to operate because it tries to make a locking call and gets an error response. I think there are options to turn that off, though.


Also, WAL mode isn't going to work on network filesystems.


Working copies on network file systems are a horrible idea. You detected problems because you were using SQLite, but are you sure you didn't have silent corruption with older Subversion versions?


If you need sharding and replication, just use a different database. SQLite is not intended to solve those problems.

As for high-availability, isn't a single file on your disk the most available thing there is? :)


Dqlite exists - it's SQLite with some of these feature added. https://dqlite.io/


What happened to storing blobs in a database being a giant rookie mistake? I'm currently dealing with this at the office, where a SQL Server database has now grown to unmanageable size from helpdesk software attachments.


That's the thing: this isn't "a database" anymore when it's used as an application file format, it's now "an application file format" that is conveniently also a sqlite db file so everything in it is trivially accessible.

Storing huge blobs then becomes a matter of "is this my data, or is this general data that I'm merely also making use of". Examples abound of both:

- Adobe Lightroom (a non-destructive photo editor) goes with "these are not my images" and leaves them out of its sqlite "lrcat" files, instead maintaining a record of where files can be found on-disk.

- Clip Studio (a graphics program tailored for digital artists) on the other hand stores everything, because that's what you want: even if you imported 20 images as layers, that data is now strongly tied to a single project, and should absolutely all be in a single file.

So the key point here is that, yes: sqlite files are databases, but because they're also single files on disk, their _use_ extends far beyond what a normal database allows. As an application file format, the rules for application files apply, not the rules for standard databases.


Storing your blobs outside of the database when they are an inherent part of some larger transactional model is the giant rookie mistake IMO. We did this for a little while for the reasons of "oh my gosh look how big the .db is getting" and it ended up a complete shitshow. Simple tasks like cleaning up old entries can quickly turn into a nightmare if you have to go around disk and make sure you are not leaving garbage behind. There are also other concerns outside the box like VM snapshots. This could leave you in a very inconsistent state regarding what blob you have on disk vs what the database is expecting.

Is having everything in one gigantic .db file an actual downside? What makes a database's size "unmanageable"? Presumably, you'd have to store that business information somewhere anyways, right? I don't understand how 1 unified file is unmanageable, but splitting that concern up into 2 different domains magically makes it better.


It can trash db performance. A relatively simple alternative is to store references to immutable blobs in the db so that everything remains transactionally consistent (and easily restorable to any point-in-time). This is what many large companies do. I happen to work at a company that runs the largest + most "shall not be named" databases and has struggled for years to evict all the blobs people shoved in em.


If you index on the file_name in a table where you have file_name and a blob then why would it trash performance? SQLite can quickly locate where to start reading the blob.


Did you read the article? This isn’t about storing blobs as data in SQLite, but rather using SQLite-formatted databases as the file format for an application.

As a simple example, Word documents are just zipped XML text files (try unzipping a .docx and looking inside). Instead of using this, you could use a SQLite .db file (probably with a different extension), translating the XML files into tables, and folders into databases. The OpenOffice case study has more details: https://sqlite.org/affcase1.html


Apparently I'm misreading the following paragraph.

Any application state that can be recorded in a pile-of-files can also be recorded in an SQLite database with a simple key/value schema like this:

    CREATE TABLE files(filename TEXT PRIMARY KEY, content BLOB);


I agree that is unclear for an introduction, and that the article could have been written better. The “pile-of-files” they are referencing isn’t a filesystem of arbitrary and unstructured binary data stored for humans to look at later (like your Helpdesk attachments), but rather structured and specific data for an application to read in a well-defined manner.

Your Helpdesk example used SQLServer problematically because a SQL database shouldn’t be used as an arbitrary file store. But if you know what the file structure is and have a reasonable grasp for how it might scale (that each binary blob is small, that each user only adds one row to the database, etc), there are huge advantages to “a SQLite table with lots of binary and text columns” versus “a folder with lots of binary and text files.” And if those text files are just small key-value pairs then maybe they should also go in SQLite.


It is reasonable to adopt that style if it is data you don't control. Storing blobs is probably a mistake if it is data you have control over.

Sometimes all that is known about data is it exists - in that case, into the database as a blob it shall go. If it can be decomposed it probably should be.


The OpenOffice case study is a fascinating thought experiment.

We don't really create too many new file formats these days, and if we did they're highly performance specialized (parquet, arrow).

Just wondering aloud, what recent file format would have benefited from being based on sqlite?


Storing blobs is good if the blobs are what you actually want to store; the rookie mistake is storing blobs instead of the well-structured, meaningful, indexable and much more compact columns and additional tables that you should be storing but are too much of a rookie to design.


I don't think there is a consensus that storing blobs in the database is always the wrong decision. Yes, there are drawbacks, but it really depends on the circumstances, and it might be the right choice.

Suppose, for example, that you were going to make your own implementation of Git. The normal implementation has a bunch of blobs under one directory. There's not much use in manipulating this directory separately from everything else under .git. The blobs and most of the non-blob stuff are only useful together anyway, so you already manage them as a single collection.

It could create problems for certain backup tools, so that's a disadvantage. But it also simplifies application development. So it's not a slam dunk one way or the other.


Relatedly, this is (from what I understand) exactly how Fossil works for version control; it's literally a single SQLite file storing blobs representing changes to the repository.


The suggestion here is to use it as a file format, not as a file system.


It's less of a rookie mistake than it seems these days: https://danluu.com/file-consistency/.


I've never faced this problem so perhaps I'm missing something obvious, but what's the big difference storing binary blobs on the DB vs, say, storing URLs on the DB and hosting the blobs elsewhere? Is the size of the database the main concern?


Both approaches have their advantages.

Storing blobs on the DB makes the data more consistent (the database is always in sync with the file content), and allows you to use advanced features of the DB on these files (transactions, SQL queries, data rollback ...). You also only have to backup the DB.

Storing links to the objects is usually more scalable, as your DB will not grow as fast. DBs are usually harder to scale, and also more expensive (at least 10x per GB).

It really depends on the project, but I'm favoring more and more storing the data in BLOBs, as it makes backups easier and as I can use SQL queries directly on the data. Databases as a service also make it easy to scale DBs up to +/- 2TB. But the cost might still be an issue.


>and also more expensive (at least 10x per GB).

Expensive in what way? Memory/Compute? It can't be licensing money since SQLite is public domain.


raphaelj is probably talking about classic database servers like PostgreSQL and MariaDB.

In my view, storing blobs in sqlite doesn't have a huge disadvantage tied to it. Sqlite grows pretty linearly, and as it stores the whole db in a single file, the hosting provider doesn't even have to know you're using it.


I'm just comparing the per stored GB cost of on demand database services vs on demand storage services.


I think that refrain usually applies to when the database is used to store relational data, rather than when it's being requisitioned for use as a file format.


attaching oversized unreadable scanned pictures of digital text files I guess


I used to advocate for this, then tried opening a sqlite file I'd been using on another computer and it said it was corrupted. I took it back to the original computer and it opened just fine. So there was something going on that made it more complex and less portable than something simple like json, which is super frustrating because otherwise it's a really useful format for datasets, and now I can't really trust it.


If you use certain new features of SQLite, the database file gets "upgraded". At that time, you must use a version of SQLite that understands these new features.

So, the other computer you used was probably running an older version of SQLite. Just update it to make it work.


I think the computer it worked on was one with an older version of SQLite rather than a newer one, but I’m not certain. It was a while ago. I thought I’d checked for the upgrade issue, but my memory is shit.


That’s a good catch. I’d there any way to tell which version of SQLite created the file? Aside from trial and error, I mean?


The version is stored as a 4-byte big-endian integer at offset 96 of the header of the sqlite file [1].

I haven't looked but there will be some sqlite command to query it and I'm sure some viewer tools will display it as well.

[1] https://www.sqlite.org/fileformat.html


You can print the version with this magical incantation:

    $ python -c 'import sys, struct; print(struct.unpack(">I", open(sys.argv[1], "rb").read()[96:100])[0])' foo.db


Recentish versions of file can show you the version:

    $ file example.db
    example.db: SQLite 3.x database, last written using SQLite version 3007017


Quickbooks for Mac uses sqlite for its file format and this allowed users to query the books with any SQL browser!

Alas, they encrypted the file starting two years ago.


The article has great points.

The main reason for the existence of a lot of file formats is that enterprises don't want just anyone accessing and modifying their files. It greatly reduces the usage of their proprietary software, and hence their revenues.

I remember the days when OpenOffice would try to render a .doc file and the formatting would suck big time.

Open source software should leverage this kind of file format for interoperability.


> The main reason for the existence of a lot of file formats is that enterprises don't want just anyone accessing and modifying their files. It greatly reduces the usage of their proprietary software, and hence their revenues.

I'm sure this does happen, but it seems more like MBA paranoia than a legitimate concern. For instance, I sincerely doubt Adobe has lost any revenue by using SQLite with Lightroom, despite various open source tools being able to interact with their lrcat (sqlite) files.


It's the old business paradox: if everything was totally lean and solved, how do people earn a living? You need a strong force to overcome this hurdle and show how to reorganize the system in the large.


SQLite is a fine single-file application file format, but if you want average users to be able to inspect and perhaps fix application data when things go wrong, a tree of JSON-format files is more friendly.

True, this requires care to ensure that such files are updated reliably, but that's not quite rocket science.


Is anyone doing this in Javascript (via electron) and encrypting the file in such a way that the data is protected and not an obvious sqlite db?

I have some ideas on how to do this, but I'm curious if there's a "preferred" way to do it.


SQLite has encryption plugin support, or you could encrypt the file yourself at rest. However, no matter what you are doing, as the application needs to be able to decrypt the file, you have to ship the key and thus the protection will be able to be circumvented.

It’s really not worth the effort IMHO


That's a good point... I don't want to have the game rely on "phoning home" to decrypt the db.


On top of that, even if you did have the game "phone home", that doesn't do much against someone inspecting the memory of a running game; one could just wait until the game does have the decryption key.


Why would you want to "protect" the data in this way? It's user-hostile, and anyone determined enough will be able to get at it no matter what you do.

If it's popular and valuable enough, instructions and/or code to break it automatically will then be published, regardless of how much money or time you invested into the protection. (For proof, look no further than the game industry's DRM over the past 30 or so years)


It's not meant to be user hostile, it's for a game. I allow data exports, but I don't want to expose the inner workings of the game via direct access to the DB.


Data mining is an old gamedev worry. Nobody who's been doing it for a while particularly cares. If the game is remotely popular, data mining will happen. Disassembly to find the decrypt function is a talent possessed by many, many people who have the reverse-engineering bug. If you then obfuscate the code, you have made things interesting and then increasingly talented people will try to have a go at it. If you combine a changing obfuscating technique with updates, you can slow down community possession of the game with respect to modding etc., and Minecraft worked that way in its beta phases. But it's really all a question of your purpose at that point.

If you haven't already, I suggest cruising through TCRF: https://tcrf.net/The_Cutting_Room_Floor


You do realize, exposing the "inner workings of the game" seems to be at least somewhat responsible for massively increasing the appeal of the game in other cases? see Minecraft, Rimworld, KSP as a few examples...


... see Minecraft, Rimworld, KSP as a few examples...

I mean, I agree that games should be moddable, but not from the premise of game profitability and popularity.

The highest-grossing video game franchise list is overrun by products that are, pretty much, famously unmoddable. [0]

[0]: https://en.wikipedia.org/wiki/List_of_highest-grossing_video...


That list has some old and long-running franchises (e.g. Mario), some semi-interactive movie experiences (e.g. CoD), and some games that have in fact been modded to ridiculous extent (Warcraft, StarCraft, The Sims). I wouldn't read too much from it.


The unmoddable ones are mostly there because they are big names or old names. I mean, heck, LoL was literally inspired by how moddable Warcraft 3 was. If you aren't using a massively well known source material, and you don't have a AAA game studio, your main chance to make a big game according to that list seems like either moddability or cheesy Facebook game tactics.


I used it with a different extension to save application data for a product I wrote.

Any classes that need to be saved had serialize() and deserialize() functions: serialize before saving to SQLite, and deserialize after reading back from the DB.
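
A rough illustration of that pattern (class, table, and file names are invented, and any serialization scheme would do):

    import json
    import sqlite3

    class Widget:
        def __init__(self, name, size):
            self.name, self.size = name, size

        def serialize(self):
            return json.dumps({"name": self.name, "size": self.size})

        @classmethod
        def deserialize(cls, blob):
            data = json.loads(blob)
            return cls(data["name"], data["size"])

    conn = sqlite3.connect("project.save")   # a SQLite file with a custom extension
    conn.execute("CREATE TABLE IF NOT EXISTS widgets(id INTEGER PRIMARY KEY, data TEXT)")
    conn.execute("INSERT INTO widgets(data) VALUES (?)", (Widget("a", 3).serialize(),))
    conn.commit()

    widgets = [Widget.deserialize(row[0]) for row in conn.execute("SELECT data FROM widgets")]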


Maybe. I've used XML as a file format for all my recent projects. It's well supported in the .NET framework, so I don't need any 3rd-party assemblies. Using SQL to write out an object graph would involve more work.


A tree I could understand since after all XML uses a tree-like structure, but for a graph you'll need to store references anyway so flattening and denormalizing it completely doesn't sound like a lot more work to me.

Also XML makes me weep and I'll never willingly opt to use it if there's an alternative, but that's my own prejudice.


It depends on how large and complicated your XML file is.

Once your XML file hits a certain size, minor updates incur a gigantic performance penalty because you have to write out the entire XML file every time.

Random writes in the middle of an XML file are impossible. For example, if you were to change an attribute so that it's one character longer, you still have to rewrite the remainder of the file in order to shift everything over by one character.

That's the main reason why SQLite is so popular for applications.

(I know that from personal experience. I had to support an old version of an application that constantly wrote out XML, while the new version that we hadn't shipped used SQLite. A customer that made heavy use of the old version basically hit the limits of XML, but the version that used SQLite wasn't ready for them yet.)

But, I'm going to be honest here: We had some tables that only had a few rows, so I moved those to XML files. QE really liked it because it was easy to diagnose issues.


I think it depends on whether you need a file format or a database. All the benefits of a database are wiped out if you are implementing File->Save for a document based application.


File->Save is an extremely outdated workflow leftover from when disk drives took multiple seconds and you couldn't edit the document while you were writing to the disk.

Now your application should continuously save without any manual steps.


Depends on the size or structure of the data.

If you're having to do things like updates over a large dataset, SQLite can be nice because of the performance boosts from indexing.


I just converted a 90's-vintage xBase app to SQLite. I just love it. Should have done this years ago.

The only thing I miss is the ability to jump X% into an index. Not a deal-breaker, though.


A lot of well known apps are exactly doing this, using a sqlite db as the internal file format: Whatsapp, WeChat, Google Keep, Apple iMessage, Notes and Calendar...

Probably tons more if you care to dig around.


I worked with an SQLite Database on a personal C#/WinForms app. It was a bit complicated to get started, but afterwards, what a handy trick in comparison to using something like the discontinued SQL Server Compact.

I did, however, "enhance" it a bit (or "proprietarize" it) by encrypting it with a short password and an AES algorithm with some uncommon settings. I never shipped it that way, but the output file looked like any other proprietary app's file - a mess of random symbols.


Everything old is new again. I can recall a similar movement at the end of 90s regarding databases as application file formats. That time it was Microsoft Jet Database engine.


Concurrent access to SQLite is the main reason I haven't really looked into Deno. For that matter, I really think that WebAssembly file system implementations should have enough flexibility to be able to work with a SQLite database file in the host OS.

The needs of concurrent SQLite pretty much cover the need for a robust file system.


I keep switching between csv, parquet and sqlite. I dont need another article to encourage me to switch again! :)


Does SQLite have a more text-friendly storage format? This would enable text editors to do edits/modifications, which is critical to debugging, testing, and learning. Save files don't need the same ongoing access optimizations, or size optimizations for wire protocols, so a less efficient format that is text-editor friendly would help this cause.

Also, does SQLite have native libraries for the major languages to read and write to them? XML (ugh), JSON, and YAML have managed to get decent implementations in almost all major languages.


Using a text editor to mess with a SQLite database doesn't make any sense. Do you use text editors to edit, say, MPEG media?

SQLite databases can be edited with generic GUI and command line tools, both SQL-based and tabular editors, which are safer and more convenient than a text editor could ever be.


When I was reversing the XMDF E-Book format, I was surprised to find an embedded SQLite database... in an E-Book?? After reading this article, it makes more sense why someone would make that decision


One minor point that needs to be raised: Using SQLite as your application file format locks you into the SQLite source code.

(And in most cases that's not a bad thing, it's free and open source.)

But, if you truly want an open file format, someone needs to be able to independently write a program that can read your file without relying on third party dependencies. This is why the browser vendors decided not to put SQLite into the HTML and JavaScript standards.

It is an awesome product! I've worked with it for over a decade and I'm a fan!


> it's free and open source

It's more than "open source"; it's in the public domain.

The file format is fully documented and if someone like ISO or ANSI wanted to, they could make a standard out of it. It's also forward compatible since inception and versioned.

The browser vendors decided not to put SQLite into their browsers because "key/value good, SQL bad" and "not invented here".

IndexedDB is a clumsy reinvention with minimal ACID properties and isn't far advanced from ISAM. They could have used the SQLite file format and implemented IndexedDB as an API over the top. They could have allowed both standards to be implemented and then let reality take its course to choose which one was successful.


To my very limited knowledge it is very common for iOS and Android apps to store data in SQLite.

Is it possible to extract or change these databases?

I have a few (offline) apps on my phone that I’d love to append data to


As a sibling comment says, if you are on Android and have root access, it is very easy.

Just have a look at /data/data/appname/.

For example, this is what I copy to make backups of my contacts on my Android phone:

    # pwd
    /data/data/com.android.providers.contacts/databases

    # ls -1
    calllog.db
    contacts2.db
    profile.db

This is a simple way for moving data around, restoring applications, performing backups, editing your data (if you know your way inside the app).

Beware of the SELinux labels if you are moving files across different devices, as recent Android versions now run with SELinux in enforcing mode.

Just be sure to adjust them (useful commands if you need a reference: ls -lZ, chcon, semanage).


For unjailbroken iOS the only way to extract these databases is to make an unencrypted device backup in iTunes (Finder in Catalina) and then look at the backup file. The backed up files won't be conveniently named but you can still find it.


Can I push my edits back onto my iPhone or is that a pipe dream?


I suppose it's possible, but you need to study the way backups are organized and verified.


It's possible on Android if you have root access, I think. Android is very serious about apps not accessing each other's data.


It's also super common on MacOS. According to lsof(1), my laptop has 95 sqlite .db files currently open; deduping by path shows 56 unique database files open at the moment.


Would such an app hold the sqlite database open while the user has the document "open", and live-write user changes back to the database immediately? Or would it follow the traditional model of the user choosing to "Save"?

I worked on an app that did the former many years ago (to an Access database, not sqlite), and it did not go well because this broke user expectations on the usual "open/save/save as" model.


The save model at the UX level is completely orthogonal to the application file format. You could implement either model with an SQLite DB file.


One quick-and-dirty way to maintain the traditional model would be to copy the on-disk DB into an in-memory DB, then make any changes in memory; when the user clicks Save, the application would then move the old on-disk DB, open a new one with the old's original filename, copy everything from the in-memory DB into the new on-disk DB, and delete the old on-disk DB.

Another option would be to keep both the in-memory and on-disk copies open, and then update the on-disk version in-place with the in-memory version's data when the user clicks Save. SQLite has built-in support for connecting multiple DBs (such that you can query both in the same statement) to make this straightforward.
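
A sketch of the first approach using Python's sqlite3 backup API (names invented, error handling omitted; just one way it could be done):

    import os
    import sqlite3

    def open_document(path):
        # Copy the on-disk DB into a private in-memory working copy
        disk = sqlite3.connect(path)
        mem = sqlite3.connect(":memory:")
        disk.backup(mem)
        disk.close()
        return mem

    def save_document(mem, path):
        # Write the working copy to a temp file, then swap it into place
        tmp_path = path + ".tmp"
        disk = sqlite3.connect(tmp_path)
        mem.backup(disk)
        disk.close()
        os.replace(tmp_path, path)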


I think the former is preferable and most modern users will prefer it. However, if your users are upset about it, you could do the latter, or you could do the former and give them a dummy/placebo "save" option that does nothing, or maybe VACUUMs the DB.


The problem is not Save, the problem is when you close the app without using Save - the user will expect that to work like Cancel and discard all session changes.


I guess you could send changes to the DB on-the-fly inside a transaction, so when the user clicks "save" it's just a matter of running COMMIT? Not sure what effect that would have on performance as the transaction grows, though.
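
Something along those lines could be sketched like this (illustrative only, with an invented table name; note that holding a write transaction open also blocks other writers for its whole lifetime):

    import sqlite3

    conn = sqlite3.connect("document.db", isolation_level=None)  # manage transactions manually
    conn.execute("BEGIN")

    # ...every edit goes straight to the DB inside the open transaction...
    conn.execute("UPDATE paragraphs SET body = ? WHERE id = ?", ("new text", 1))

    # "Save" is then just:
    conn.commit()
    # and "discard changes on close" would be conn.rollback() instead.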


Or discard the "save" metaphor and move to infinite undo.


I think it would depend on the app (e.g. how document-like it is), but nowadays people will be much more used to the instant-save model due to things like Google Docs, OneNote, and phone apps.

