
We’re happy with SQLite and not urgently interested in a fancier DBMS (2016) - nailer
http://beets.io/blog/sqlite-performance.html
======
wenc
This is just a thought: SQLite is a really good file format. Why aren't we
replacing CSV with it, especially for big data applications?

CSV can be difficult and ambiguous to parse correctly (because there's no real
standard) and isn't particularly performant. The only thing it has going for
it is its universality.

SQLite is lightweight, structured, supports indexing for performance and is
extremely easy to use.
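
To make that concrete, here's a minimal sketch of the idea with Python's built-in sqlite3 module (the file, table, and column names are hypothetical): typed columns, a real index, one portable file.

    import sqlite3

    # One self-describing, typed, indexable file instead of a CSV.
    con = sqlite3.connect("readings.sqlite")
    con.execute("CREATE TABLE readings (ts TEXT, sensor TEXT, value REAL)")
    con.execute("CREATE INDEX readings_sensor ON readings (sensor)")
    con.executemany("INSERT INTO readings VALUES (?, ?, ?)",
                    [("2016-01-01T00:00:00", "a", 1.5),
                     ("2016-01-01T00:01:00", "b", 2.5)])
    con.commit()
    print(con.execute("SELECT value FROM readings WHERE sensor = ?", ("a",)).fetchall())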

~~~
Someone1234
CSV's biggest competitor is XML.

But both XML and SQLite have the same issue: They give you just enough rope to
hang yourself with. While SQLite is a fantastic micro-database engine and a
file format, it isn't a very good universal B2B format because there are too
many features you'd have to support to interoperate.

I'd argue CSV's biggest strength is that it is easy to parse correctly,
because the format is so simple and the standard is largely a superset of the
dialects in use. The biggest problem for CSV is that the most popular desktop
application for viewing CSV (Microsoft Excel) sucks at it, and causes data
corruption on save (and has for at least twenty years).

If people wanted a relational database as a file, I can think of nothing
better than SQLite, but in B2B you want automated tooling that can parse the
format without human involvement. If you need some structure, there are
simpler XML-based formats which accommodate that but still require you to
support far fewer features than SQLite does.

JSON might become popular in this space, but I've yet to see it.

~~~
wenc
> I'd argue CSV's biggest strength is that it is easy to parse correctly

Not sure that resonates with me. I work with large CSVs of varying provenance
every day (I work in big data) and there's always some CSV edge case that
stymies my analysis pipeline.

Timestamp parsing is extremely hard if it's not ISO-8601, and then there's the
handling of Unicode encodings, missing data, type inference, the European use
of "," as a decimal separator, hidden characters, etc. One of the costs of
almost complete freedom in input is the "interesting" possibilities people
come up with to stymie your code.

The pandas read_csv function has tons of switches to deal with all kinds of
CSV dialects precisely because there is no standard. Fortunately this
covers 80% of the use-cases. [https://pandas.pydata.org/pandas-
docs/stable/generated/panda...](https://pandas.pydata.org/pandas-
docs/stable/generated/pandas.read_csv.html)
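
For a flavor of those switches, here is a sketch of the dialect wrangling a single European-style export can require (the file name and column names are hypothetical):

    import pandas as pd

    df = pd.read_csv(
        "export.csv",
        sep=";",                  # ';' as the field separator
        decimal=",",              # ',' as the decimal point
        encoding="latin-1",       # not everything is UTF-8
        dtype={"serial": str},    # stop type inference from eating leading zeros
        parse_dates=["created"],  # timestamp parsing
        na_values=["", "NULL", "N/A"],  # the many spellings of missing data
    )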

Excel's CSV parsing isn't the greatest, but does a surprisingly decent job
considering how ill-defined CSV is.

XML isn't really on anyone's radar in the big data world. It's an interchange
format, yes, but it's hugely inefficient for dataframe-type data.

~~~
Someone1234
> Excel's CSV parsing isn't the greatest, but does a surprisingly decent job
> considering how ill-defined CSV is.

No, it really doesn't. Save this as a CSV, open it in Excel, hit save, and
then review the raw CSV:

    "1000000000012345", "Hello",
    "2000000000067890", "World",

Here's what Excel (O365) does to it for me:

    1E+15, Hello,
    2E+15, World,

That data is now permanently lost. And this corruption occurs for almost all
EAN/UPCs. There are many ways to get around this in Excel, but the default
experience is data corruption and has been for most of my lifetime. It is a
terrible application for CSV. If they followed the standard every column would
be what it is: Text.
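
For what it's worth, a parser that follows that "everything is text" reading does keep the data intact. A quick sketch with Python's csv module:

    import csv, io

    data = '"1000000000012345","Hello"\n"2000000000067890","World"\n'
    for row in csv.reader(io.StringIO(data)):
        print(row)  # ['1000000000012345', 'Hello'] -- no type inference, the long IDs survive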

> XML isn't really on anyone's radar in the big data world.

But it is a central theme in the business-to-business back-end systems world.
I'm talking about ordering, invoicing, remittances, hospital records, hospital
billing, and so on. No clue what "big data" is; we deal with billions of
transactions a year, but that's likely not what you're referring to.

~~~
wenc
1\. I think in the first instance, Excel performs type inference, coerces the
field into a numeric type, and writes a truncated-precision version of it back
to the CSV, which is an abominable sin. There are workarounds for this [1],
but I agree this is terrible behavior stemming from CSV's lack of types.

Arguably, there should have been an explicit Excel switch which forces all CSV
data to be parsed as raw strings.

I've seen problems with Excel CSVs that are worse than that: I have serial
numbers with leading 0's that carry semantic meaning, like 000002324122323.
Most CSV parsers cannot tell that this isn't a numeric type and so handle it
wrongly. So I resort to [1].

2\. Big data is of course somewhat ambiguous nomenclature as well, but is
nowadays typically understood to mean the Hadoop ecosystem or similar. Data is
typically ingested into a distributed file system (HDFS, S3, etc.) in formats
such as Parquet, Avro, JSON and often CSV. A schema-on-read database like Hive
sits on top of this layer and presents a SQL interface to the user. Tools like
Apache Spark provide programmatic transformations that operate on the data at
scale.

CSV is often promoted as a format for storing structured data due to ease of
ingestion and inspection (all you need is a text editor for troubleshooting).
However, you pay a performance penalty every time an analytic query is run
because CSV doesn't support indexes, predicate pushdowns, compression, etc.
and records can sometimes be uninterpretable under certain schemas (in which
case they are simply excluded or dropped).

Sometimes having one too many commas in a row can completely mess things up
(all the columns get shifted in a record) -- I found this out the hard way
when Spark exported malformed CSVs with un-escaped commas.
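
For anyone generating CSV by hand: the shift happens whenever a field's embedded comma isn't quoted. A small sketch of the failure and the fix, with hypothetical data:

    import csv, io

    row = ["42", "Doe, Jane", "2016-01-01"]

    # Naive join: the embedded comma yields four fields instead of three,
    # shifting every later column.
    print(next(csv.reader(io.StringIO(",".join(row)))))
    # -> ['42', 'Doe', ' Jane', '2016-01-01']

    # csv.writer quotes the field, so the row round-trips intact.
    buf = io.StringIO()
    csv.writer(buf).writerow(row)
    print(next(csv.reader(io.StringIO(buf.getvalue()))))
    # -> ['42', 'Doe, Jane', '2016-01-01']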

[1]
[http://support.pitneybowes.com/SearchArticles/VFP06_Knowledg...](http://support.pitneybowes.com/SearchArticles/VFP06_KnowledgeWithSidebarTroubleshoot?id=kA280000000TTCfCAO&popup=false&lang=en_US)

------
batmansmk
We use SQLite in an app with about 1,000 active users. It took us:

\- 0h to manage backups ("cp"),

\- 0h to manage seeds and tests fixtures ("cp"),

\- 0h to configure and secure (void),

\- 0h to write the deployment scripts (void),

\- 0h monitoring/watchdog jobs (void),

\- 1h to rsync for failover ("rsync")

My last projects always spent at least a good 100h to do all of this the right
way. If this setup turns out not to be good enough, we'll move to RDS or
equivalent.

~~~
jstimpfle
> 0h to manage backups ("cp")

Are you aware that's unsafe? To make a safe backup use sqlite3's ".dump"
command (or filesystem snapshotting, but I've had bad experiences with that,
at least on btrfs).
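
If you're scripting this rather than shelling out to the CLI, Python's sqlite3 module exposes SQLite's online backup API, which takes a consistent snapshot even while other connections are writing. A sketch (paths hypothetical; requires Python 3.7+):

    import sqlite3

    src = sqlite3.connect("app.db")
    dst = sqlite3.connect("backup.db")
    with dst:
        src.backup(dst)  # consistent snapshot even under concurrent writes
    dst.close()
    src.close()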

~~~
ufo
I'm not very familiar with the details here. What are the problems with cp?

~~~
jstimpfle
cp does not make atomic snapshots. It copies by reading (usually sequentially)
chunk by chunk from the source file, and writing these chunks to the
destination file. This takes time. If the database is being written to while
the backup runs, the backup might be invalid (it contains some old parts and
some new parts).

(Unless you use e.g. the --reflink option of GNU cp, in which case it makes
atomic snapshots on filesystems that support it).

~~~
candiodari
Isn't one of the more important points of POSIX that reading sequentially
results in an atomic copy? As long as you keep the file handle open and the
fs supports it.

~~~
jstimpfle
I don't know which point you mean, but that would imply that any writer is
blocked indefinitely by any other reader. That amounts to a read lock, and you
don't get a read lock just by opening a file for reading. You can test that
with a simple shell script:

    {
        echo line1
        echo line2
    } > test.txt
    {
        read line
        echo got "$line"
        sleep 2
        read line
        echo got "$line"
    } < test.txt &
    # concurrent write
    sleep 1
    {
        echo line2
        echo line1
    } > test.txt
    wait

This script does a concurrent write while the reader is in the "sleep 2"
phase. The output should be:

    got line1
    got line1

(given that the sleeps do the expected thing).

POSIX might contain something that requires aligned blocks of 512 bytes or so
to be read or written atomically. But only if you do that in a single system
call, of course.

------
tomc1985
As a PostgreSQL nut, I'd say the people suggesting those other platforms
(incl. pgsql) are idiots. Postgres in particular would be a bad idea unless he
wanted to incorporate VACUUM into his loading screen.

------
simonw
SQLite is a fantastic database for desktop applications.
[https://www.sqlite.org/appfileformat.html](https://www.sqlite.org/appfileformat.html)
is a great essay on its benefits for this kind of use-case.

~~~
redwolf2
Not only desktop, but most mobile platforms as well. Being coded in a single
C file makes it portable as hell. Their documentation is also a prime example
of how it should be done, with great insight.

~~~
fnord123
Not only desktop and mobile, but it can also be used as the storage format for
larger-scale distributed systems. For example, it's the on-disk format for
FoundationDB.

>Being coded in a single C file makes it portable as hell.

It's not coded in a single C file...

~~~
simonw
It kind of is... the SQLite "amalgamation" process takes all of SQLite's
source code (across 100+ files) and concatenates it together into a single
sqlite3.c file:
[https://www.sqlite.org/amalgamation.html](https://www.sqlite.org/amalgamation.html)

~~~
fnord123
That's horrifying and interesting. Thanks for linking it!

I wonder if the alleged 5-10% performance gains are measured against a build
that uses LTO.

------
dividuum
Can confirm. SQLite is great. I wrote an e-commerce website for a customer a
few years ago. Most of the data was static and only updated a couple of times
per day in bulk. It was easy to do the main import/data update completely
offline, then copy the db file over to multiple servers and run mostly
read-only queries on the DB. SQLite was more than fast enough, and not having
to manage another service was totally worth it. It's good having SQLite in
your toolbox.

------
makmanalp
We use sqlite in a data-heavy visualization website
([http://atlas.cid.harvard.edu/](http://atlas.cid.harvard.edu/)): we currently
have an approximately 22GB sqlite file with the two largest tables going up to
150 million rows each. It still works just fine. The limiting factor seems to
be the I/O throughput of the instance and disk of the server (EBS volumes).

Read-only workloads, most of which eventually get cached by the webserver, but
still. Each visualization ends up pulling a large amount of data (sometimes in
the tens of thousands of rows), so the network / serde overhead and the
administration costs of an external database server add up, though things
have changed recently so perhaps a revisit is in order.

Where it hurts is: the query optimizer sometimes stumbles and does silly
things (e.g. it's not always very smart about column / index selectivity
statistics). The data import takes a long time (compared to e.g. postgres'
COPY), so that's another pain point.
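
On the import pain, the usual mitigations on the SQLite side are batching the whole load into one transaction and relaxing durability while the import runs offline. A sketch, with a hypothetical table standing in for the real data (still slower than postgres' COPY, but far faster than per-row autocommit):

    import sqlite3

    rows = [(2015, "USA", 1.0), (2015, "DEU", 2.0)]  # stand-in for the real bulk data

    con = sqlite3.connect("atlas.db")
    con.execute("CREATE TABLE IF NOT EXISTS trade (year INTEGER, country TEXT, value REAL)")
    con.execute("PRAGMA synchronous=OFF")  # skip per-statement fsyncs during the offline load
    with con:                              # one big transaction instead of per-row autocommit
        con.executemany("INSERT INTO trade VALUES (?, ?, ?)", rows)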

~~~
infogulch
You can help when the optimizer stumbles by running ANALYZE [1] with
representative data present, and turning off auto-analyze. This was
intentionally designed as a feature so you could set up the statistics for a
db, empty it, deploy it out to the end user, and know that the optimizer isn't
gonna go off and do something funky.

[1]:
[https://www.sqlite.org/lang_analyze.html](https://www.sqlite.org/lang_analyze.html)
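
A sketch of that workflow with a hypothetical schema: run ANALYZE while representative data is loaded, and the statistics land in sqlite_stat1, where they remain even after the rows themselves are deleted:

    import sqlite3

    con = sqlite3.connect("template.db")
    con.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, category TEXT)")
    con.execute("CREATE INDEX items_category ON items (category)")
    con.executemany("INSERT INTO items (category) VALUES (?)",
                    [("cat%d" % (i % 10),) for i in range(10000)])
    con.execute("ANALYZE")            # writes stats into sqlite_stat1
    con.execute("DELETE FROM items")  # empty the db for deployment; stats remain
    con.commit()
    print(con.execute("SELECT * FROM sqlite_stat1").fetchall())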

------
captain_perl
FYI: SQLite's ALTER TABLE is extremely limited: it can only rename a table or
add a column.

~~~
coldacid
And this pisses me off every time a schema change needs to be made in mobile
apps I develop.
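
The usual workaround for everything ALTER TABLE can't do is the create-copy-drop-rename dance. A sketch for dropping a column from a hypothetical users table:

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, obsolete TEXT)")

    # Dropping the "obsolete" column: rebuild the table without it.
    con.executescript("""
        BEGIN;
        CREATE TABLE users_new (id INTEGER PRIMARY KEY, name TEXT);
        INSERT INTO users_new (id, name) SELECT id, name FROM users;
        DROP TABLE users;
        ALTER TABLE users_new RENAME TO users;
        COMMIT;
    """)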

------
michaelmior
Could (2016) be added to the title?

------
laumars
_> The idea is that a more complicated DBMS should be faster, especially for
huge music libraries._

How "huge" are they talking about? I built a tool that imports Apache logs
into sqlite for quick analysis and that easily handles several million
records.
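
For the curious, the core of such an import is small. A rough sketch, with a simplified common-log-format regex and hypothetical file names (Python 3.8+ for the walrus operator):

    import re, sqlite3

    LINE = re.compile(r'(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+)')

    con = sqlite3.connect("logs.db")
    con.execute("CREATE TABLE hits (host TEXT, ts TEXT, request TEXT, status INTEGER, size TEXT)")
    with open("access.log") as f:
        con.executemany("INSERT INTO hits VALUES (?, ?, ?, ?, ?)",
                        (m.groups() for line in f if (m := LINE.match(line))))
    con.commit()
    print(con.execute("SELECT status, COUNT(*) FROM hits GROUP BY status").fetchall())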

~~~
loeg
Just a ballpark: my beets DB is 15 MB for ~1000 albums (well, folders), i.e.
about 15 KB per album. This post[0] claims 10M albums ever released, and
10M x 15 KB works out to 150 GB. So _maybe_ a 150 GB upper bound? Assuming my
db has been vacuumed recently (~~probably not true~~ Edit: nope, still 15 MB
after vacuum) and ignoring that indexes will scale slightly non-linearly.

[0]: [https://www.quora.com/How-many-music-albums-are-available-
in...](https://www.quora.com/How-many-music-albums-are-available-in-the-world)

~~~
laumars
I think 10m is far too low to be the total number of albums. But it would be
impossible to actually calculate that figure, given the number of producers
that have released content outside the scope of any particular authority
(e.g. NIN has released stuff to download only from their site, Aphex Twin
uploaded a load of stuff to SoundCloud, etc.). But we digress.

My music collection is massive. It's got 30 years of singles and albums in
there, and several years of DJing too, plus a massive amount of DJ sets
(several hundred gigs of DJ sets alone). Yet XBMC / Kodi handles it fine
"despite" being sqlite backed. So does Subsonic, and that just uses some Java
equivalent. And as I've said already, I can throw several million records of
Apache logs into sqlite without any issue.

The performance of sqlite is actually really good. It's super fast, which is
why the author can make those boasts on its website. Plus the compactness of
the db (ie one file produced) and the ease of creating a db make it a perfect
choice for desktop and mobile applications.

I think some people read that WordPress runs on MySQL, or spend 10 minutes
building their first CMS in PostgreSQL, and then think the world and their
mother should be using this miraculous new RDBMS they've just discovered.
(I'm probably being too harsh there.)

~~~
loeg
Yeah. To be clear, I think even a 150GB sqlite database would perform just
fine for beets' purposes.

~~~
snuxoll
Even as a heavy PostgreSQL fan and promoter, I don't see why you would choose
anything BUT sqlite for single-user applications, unless you need something
that only other databases provide (richer procedural language support, or
other extensions beyond the fairly basic but still adequate featureset SQLite
offers). The size of the dataset is only one factor to consider when choosing
the storage layer for your application, and there are a _lot_ of valid
solutions for handling 1TB or less of data easily.

~~~
ptman
Real typing? [https://firebirdsql.org/](https://firebirdsql.org/)

------
kimi
Related note: in the last year I started using SQLite with my Clojure scripts
instead of using a remote database, and I'm storing everything in git. Works
like a charm for single-user tasks.

~~~
kozhevnikov
Do you store the binary .sqlite in Git or source .sql and build the .sqlite?

~~~
bachmeier
Wouldn't tracking the binary with Git make a new copy on every commit?

~~~
aphx
The .db file can change even if the data/records don't. One approach is to
dump the SQL to text and store that in git (e.g.
[https://stackoverflow.com/a/846665](https://stackoverflow.com/a/846665)).
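
In Python that dump is one loop over the connection's iterdump() (file names hypothetical); the resulting .sql diffs line by line, where the binary file changes wholesale on every commit:

    import sqlite3

    con = sqlite3.connect("notes.db")
    with open("notes.sql", "w") as f:
        for line in con.iterdump():
            f.write(line + "\n")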

------
ufmace
I'm of the opinion that using SQLite for a webapp, even one where SQLite can
handle the traffic just fine, isn't nearly as good an idea as some would have
you think. Nevertheless, the idea of switching a local end-user application
off of SQLite in favor of a client-server RDBMS is bonkers IMHO. You would be
making everything about the application dramatically more complex for no real
benefit. This blog post is considerably nicer than the idea deserved.

------
kodablah
Abstract the persistence layer, document the contract, build a conformance
suite, and write your one impl that uses sqlite. I know sometimes there seems
to be an impedance mismatch between callers and databases, but often it's not
that bad. And no, don't use an ORM or existing DB abstraction to do this. Sure
it takes work, but pluggable storage is nice on some of these types of things.
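
A rough Python rendering of that shape, with the documented contract up top and the single SQLite impl behind it (the interface and its methods are hypothetical, not from any real project):

    from abc import ABC, abstractmethod
    from typing import Optional
    import sqlite3

    class ItemStore(ABC):
        """The documented contract; every impl must pass the conformance suite."""
        @abstractmethod
        def put(self, key: str, value: str) -> None: ...
        @abstractmethod
        def get(self, key: str) -> Optional[str]: ...

    class SqliteItemStore(ItemStore):
        def __init__(self, path: str) -> None:
            self.con = sqlite3.connect(path)
            self.con.execute(
                "CREATE TABLE IF NOT EXISTS items (key TEXT PRIMARY KEY, value TEXT)")

        def put(self, key: str, value: str) -> None:
            with self.con:  # transaction per write
                self.con.execute(
                    "INSERT OR REPLACE INTO items VALUES (?, ?)", (key, value))

        def get(self, key: str) -> Optional[str]:
            row = self.con.execute(
                "SELECT value FROM items WHERE key = ?", (key,)).fetchone()
            return row[0] if row else None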

~~~
sametmax
Congratulations, you just rewrote the basis for an ORM.

~~~
kodablah
Sorry, I put in my no-ORM edit as you may have been writing this. No, I mean
higher level than that. ORMs suck at solving your problems specifically. Think
more like an API than an ORM. I just started a side project and here's the
beginning of an example of the iface I mean [0] and here's the beginning of an
example of the sqlite impl I mean [1] (all untested and really early stage of
course, but the idea is there).

0 -
[https://github.com/cretz/yukup/blob/eecb3b24b33f40e6a1eefaa5...](https://github.com/cretz/yukup/blob/eecb3b24b33f40e6a1eefaa5a3a6ebcd1d2cb084/yukup/storage/storage.go)
1 -
[https://github.com/cretz/yukup/tree/eecb3b24b33f40e6a1eefaa5...](https://github.com/cretz/yukup/tree/eecb3b24b33f40e6a1eefaa5a3a6ebcd1d2cb084/yukup/storage/sqlite)

~~~
sametmax
But they abstract the underlying db very well and have a uniform API that you
can rely on. It's also tested, documented and battle-tested.

It has drawbacks, but if supporting several DBs is an important goal, you
don't want to handcraft this unless you can measure unacceptable performance.

~~~
kodablah
> they abstract the underlying db very well

I am admittedly unfamiliar with this specific case, but even a cursory glance
showed me some sqlite-specific code embedded in the main library portion [0].

0 -
[https://github.com/beetbox/beets/blob/4f7c1c9beda6c2a5b0705e...](https://github.com/beetbox/beets/blob/4f7c1c9beda6c2a5b0705e868fb61e9b0af0e16d/beets/library.py#L107)

------
octorian
I'm actually working on a project that's in the process of migrating from
SQLite to MySQL as the preferred DB. I'll admit that the reasons we're doing
it are probably not applicable to a desktop application like this.

The real wins for us are multithreaded and multiprocess DB I/O. This is
something MySQL can handle, but SQLite really isn't designed to do well or
efficiently.

------
mamcx
The mistake here is thinking that a "fancier DBMS" MUST be off-process.

I totally wish for a fancier DBMS that can be embedded (Firebird fits here!)
and work on mobile (but it's not there yet).

And not only "fancier" in the limited sense that most believe. I truly mean A
LOT FANCIER.

~~~
dexen
MySQL Embedded[1] is a thing; it's used by some KDE apps. IIRC, it insists on
using multiple files rather than a single file like SQLite does, but otherwise
it's pretty much a turn-key component. No such luck with PostgreSQL so far.

[1]
[https://www.oracle.com/mysql/embedded.html](https://www.oracle.com/mysql/embedded.html)

------
z92
Unless, for example, you run a background process which periodically imports
data from an external source and inserts it into the sqlite database, while
your web application continues to render pages for visitors based on the
existing old data.

Your web page will fail, as it will time out on a database lock because of
that background process.

~~~
jasonjayr
You'll want to turn on write-ahead logging for that scenario[1]

[1]: [https://www.sqlite.org/wal.html](https://www.sqlite.org/wal.html)
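
Concretely that's one pragma at setup time; a sketch (path hypothetical). WAL mode is stored in the database file itself, so it only needs to be set once, and readers then see the last committed snapshot instead of blocking on the writer:

    import sqlite3

    con = sqlite3.connect("app.db")
    con.execute("PRAGMA journal_mode=WAL")   # persistent: recorded in the db file itself
    con.execute("PRAGMA busy_timeout=5000")  # optionally wait up to 5s on a lock instead of erroring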

------
jahvo
Kudos to this guy. For the maintainer of an open source project, the hardest
job is telling people "your idea is stupid and I won't do it".

~~~
glenda
That seems pretty easy. It's much harder to try to understand where someone is
coming from and engage with them in a positive way, even if you're rejecting
their idea in the end.

------
nebulous1
Somebody clue me in on "asymptotically better database behavior"

------
joeschmoe3
If you use Rust.
[https://crates.io/crates/rusqlite](https://crates.io/crates/rusqlite)

------
marknadal
I agree wholeheartedly with this article, despite being somebody who has
written a "fancy" popular database system (
[https://github.com/amark/gun](https://github.com/amark/gun) ).

SQLite is good enough for pretty much everything, just use it!!!

~~~
dang
You've posted countless comments like this to promote your database. Often you
include a lot of low-substance, barely relevant verbiage as packing, if not
camouflage, for the promotion. That doesn't help.

I don't say it lightly, but you're basically spamming HN, as users have been
complaining for a long time (e.g. [1], [2]). Even when you don't appear to be
doing it, you're doing it [3]. It's past time for this to stop; please stop.

1\.
[https://news.ycombinator.com/item?id=16893429](https://news.ycombinator.com/item?id=16893429)

2\.
[https://news.ycombinator.com/item?id=16677493](https://news.ycombinator.com/item?id=16677493)

3\.
[https://news.ycombinator.com/item?id=16749503](https://news.ycombinator.com/item?id=16749503)

